Web Scraping Study Notes: The Kugou Music TOP500 Chart

April 12, 2023, 8:17 PM • Web Scraping


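I. Background

The web version of Kugou Music's popular chart, Kugou TOP500, is at:

# Link
https://www.kugou.com/yy/rank/home/1-8888.html?from=rank
# The web version has no "next page" button, so the other pages can only be reached by constructing their URLs by hand
# It turns out that replacing 1-8888 with 2-8888, 3-8888, and so on is enough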
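The URL pattern noted above can be sketched as a small helper. The name `build_page_urls` and its defaults are illustrative assumptions; only the URL template comes from the page itself, and the default of 23 pages simply mirrors the `range(1, 24)` loop used later in these notes.

```python
# URL template taken from the chart page; {page} is the part that changes
BASE_URL = 'https://www.kugou.com/yy/rank/home/{page}-8888.html?from=rank'


def build_page_urls(first_page=1, last_page=23):
    """Return the chart page URLs from first_page to last_page, inclusive."""
    return [BASE_URL.format(page=page) for page in range(first_page, last_page + 1)]


urls = build_page_urls()
print(urls[0])    # https://www.kugou.com/yy/rank/home/1-8888.html?from=rank
print(len(urls))  # 23
```

Keeping the template in one constant means a change in the site's URL scheme only has to be fixed in one place.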

II. Hands-on

1. Loading the modules

import pandas as pd
import numpy as np
import time
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

2. Testing a single-page fetch

# Page to scrape
url = r'https://www.kugou.com/yy/rank/home/1-8888.html?from=rank'
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
    'Connection': 'close'
}
# Send the request
r = requests.get(url, headers=headers)
r.status_code # 200 means the request succeeded
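Checking the status code by eye works for one page, but a loop over many pages may benefit from a timeout and a few retries. The following is a minimal sketch, not part of the original notes: `fetch_html` is a hypothetical helper, and it takes the request function as a parameter (`requests.get` in practice) so that it can be tried out without network access.

```python
import time


def fetch_html(url, get, headers=None, retries=3, timeout=10, delay=2.0):
    """Fetch url with `get` (e.g. requests.get) and return the body text.

    Retries when the call raises or the status code is not 200, and
    raises RuntimeError once all attempts have failed.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url, headers=headers, timeout=timeout)
            if response.status_code == 200:
                return response.text
            last_error = f'HTTP {response.status_code}'
        except Exception as exc:  # connection errors, timeouts, ...
            last_error = repr(exc)
        if attempt < retries:
            time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(f'failed to fetch {url}: {last_error}')
```

With the variables defined above, a call would look like `html = fetch_html(url, requests.get, headers=headers)`.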

3. Parsing

# Parse with bs4
soup = BeautifulSoup(r.text, 'lxml')
# Each .pc_temp_songname element carries both the title text and the href
songs = soup.select('.pc_temp_songname')
durations = soup.select('.pc_temp_time')

# Collect one dict per song
data_all = []
for song, duration in zip(songs, durations):
    # The text has the form "song name - singer"; drop line breaks and tabs,
    # then split on the first hyphen (singer is '' when there is no hyphen)
    text = song.get_text().replace('\n', '').replace('\t', '').replace('\r', '')
    name, _, singer = text.partition('-')
    data = {
        '歌名': name.strip(),    # song title
        '歌手': singer.strip(),  # singer
        '时长': duration.get_text().replace('\n', '').replace('\t', '').replace('\r', '').strip(),  # duration
        '链接': song.get('href')  # link
        }
    print(data)
    data_all.append(data)

df = pd.DataFrame(data_all)
'''
      歌名               歌手    时长                                           链接
0    孤勇者              陈奕迅  4:16  https://www.kugou.com/mixsong/5rcb3re6.html
1   一路生花              温奕心  4:16  https://www.kugou.com/mixsong/592l9gb7.html
2      叹  黄龄、Tăng Duy Tân  4:11  https://www.kugou.com/mixsong/5w42mq78.html
3  好想抱住你          程jiajia  3:42  https://www.kugou.com/mixsong/5uhaec79.html
4     下潜      川青、Morerare  3:37  https://www.kugou.com/mixsong/5sewos85.html
'''
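The parsing code above splits the link text on a hyphen, which can misassign the parts when a song title itself contains a hyphen. A slightly more defensive sketch is shown here. The helper name `split_title` is hypothetical, and it assumes the text looks like "song name - singer" with spaces around the separating hyphen, falling back to a bare hyphen otherwise.

```python
def split_title(raw):
    """Split 'song name - singer' text into a (song, singer) tuple."""
    # Collapse newlines, tabs, and repeated spaces into single spaces
    text = ' '.join(raw.split())
    # Prefer the spaced separator so hyphens inside a title survive
    for sep in (' - ', '-'):
        if sep in text:
            song, _, singer = text.partition(sep)
            return song.strip(), singer.strip()
    # No hyphen at all: treat the whole text as the song name
    return text, ''


print(split_title('\n\t孤勇者 - 陈奕迅\r\n'))  # ('孤勇者', '陈奕迅')
print(split_title('Re-Start - Someone'))      # ('Re-Start', 'Someone')
print(split_title('NoSinger'))                # ('NoSinger', '')
```

Whether the spaced separator is always present on the live page has not been verified here, so the fallback branch is kept.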

III. Wrapping it in a function

def get_data():
    dic = {}       # singer -> number of songs on the chart
    data_all = []  # one dict per song
    # Request headers (the same for every page)
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
        'Connection': 'close'
    }
    # Pages 1 to 23
    for i in range(1, 24):
        url = f'https://www.kugou.com/yy/rank/home/{i}-8888.html?from=rank'
        # Equivalent with %-formatting:
        # url = 'https://www.kugou.com/yy/rank/home/%d-8888.html?from=rank' % i
        # Send the request
        r = requests.get(url, headers=headers)
        # Parse with bs4
        soup = BeautifulSoup(r.text, 'lxml')
        songs = soup.select('.pc_temp_songname')
        durations = soup.select('.pc_temp_time')
        # Collect one dict per song
        for song, duration in zip(songs, durations):
            # "song name - singer": drop line breaks and tabs, split on the first hyphen
            text = song.get_text().replace('\n', '').replace('\t', '').replace('\r', '')
            name, _, singer = text.partition('-')
            data = {
                '歌名': name.strip(),    # song title
                '歌手': singer.strip(),  # singer
                '时长': duration.get_text().replace('\n', '').replace('\t', '').replace('\r', '').strip(),  # duration
                '链接': song.get('href')  # link
                }
            print(data)
            data_all.append(data)
            # Tally the number of songs per singer
            dic[data['歌手']] = dic.get(data['歌手'], 0) + 1
        # Pause between pages so the server is not hit too quickly
        time.sleep(2)
    return data_all, dic

# Call the function
data_all, dic = get_data()
df = pd.DataFrame(data_all)
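The per-singer tally that `get_data` keeps in `dic` could also be derived afterwards from `data_all` with the standard library's `collections.Counter`. The sketch below is an alternative, not the original approach; the helper names and the sample rows are made up for illustration.

```python
from collections import Counter


def count_singers(rows):
    """Count how many rows each singer ('歌手') appears in."""
    return Counter(row['歌手'] for row in rows)


def repeated_singers(counts, min_songs=2):
    """Keep singers with at least min_songs songs, most frequent first."""
    return {singer: n for singer, n in counts.most_common() if n >= min_songs}


# Made-up sample rows in the same shape as data_all
sample = [
    {'歌名': 'song A', '歌手': 'singer X'},
    {'歌名': 'song B', '歌手': 'singer Y'},
    {'歌名': 'song C', '歌手': 'singer X'},
]
counts = count_singers(sample)
print(counts['singer X'])        # 2
print(repeated_singers(counts))  # {'singer X': 2}
```

Separating the counting from the scraping loop makes it possible to recount from a saved `data_all` without fetching the pages again.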

IV. Full version

import pandas as pd
import numpy as np
import time
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

def cnt_songer(songer, dic):
    if songer not in dic:
        dic[songer] = 1
    else:
        dic[songer] += 1

def get_data():
    dic = {}       # singer -> number of songs on the chart
    data_all = []  # one dict per song
    # Request headers (the same for every page)
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
        'Connection': 'close'
    }
    # Pages 1 to 23
    for i in range(1, 24):
        url = f'https://www.kugou.com/yy/rank/home/{i}-8888.html?from=rank'
        # Equivalent with %-formatting:
        # url = 'https://www.kugou.com/yy/rank/home/%d-8888.html?from=rank' % i
        # Send the request
        r = requests.get(url, headers=headers)
        # Parse with bs4
        soup = BeautifulSoup(r.text, 'lxml')
        songs = soup.select('.pc_temp_songname')
        durations = soup.select('.pc_temp_time')
        # Collect one dict per song
        for song, duration in zip(songs, durations):
            # "song name - singer": drop line breaks and tabs, split on the first hyphen
            text = song.get_text().replace('\n', '').replace('\t', '').replace('\r', '')
            name, _, singer = text.partition('-')
            data = {
                '歌名': name.strip(),    # song title
                '歌手': singer.strip(),  # singer
                '时长': duration.get_text().replace('\n', '').replace('\t', '').replace('\r', '').strip(),  # duration
                '链接': song.get('href')  # link
                }
            print(data)
            data_all.append(data)
            # Tally the number of songs per singer
            cnt_songer(data['歌手'], dic)
        # Pause between pages so the server is not hit too quickly
        time.sleep(2)
    return data_all, dic

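def process_data(dic):
    # Sort singers by song count, most songs first
    items = dict(sorted(dic.items(), key=lambda x: x[1], reverse=True))
    # Keep only singers with more than one song on the chart
    items = {key: value for key, value in items.items() if value > 1}
    # print(items)
    return items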

def main():
    data_all, dic = get_data()
    df = pd.DataFrame(data_all)
    items = process_data(dic)
    print(len(items))
    return df, items

if __name__ == '__main__':
    data, dic_result = main()

V. Word cloud

This part is still to be studied!
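As a possible starting point, the singer counts returned by `main()` could be fed to `WordCloud.generate_from_frequencies`. This is only a sketch under assumptions, not the author's finished solution: the helper names, the image size, the output file name, and the font path are all illustrative. A font that contains Chinese glyphs must be supplied through `font_path`, otherwise singer names may render as empty boxes.

```python
def top_frequencies(dic, limit=50):
    """Return the `limit` most frequent entries of dic as a new dict."""
    ranked = sorted(dic.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:limit])


def draw_wordcloud(freqs, font_path, out_file='singers.png'):
    """Render freqs (name -> count) as a word cloud, save it, and show it."""
    # Imported here so top_frequencies stays usable without these packages
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    wc = WordCloud(font_path=font_path, width=800, height=600,
                   background_color='white')
    wc.generate_from_frequencies(freqs)  # word size follows the song count
    wc.to_file(out_file)

    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()


print(top_frequencies({'a': 1, 'b': 3, 'c': 2}, limit=2))  # {'b': 3, 'c': 2}
```

A call might look like `draw_wordcloud(top_frequencies(dic_result), font_path='path/to/a/chinese/font.ttf')`, where the font path is a placeholder to be replaced with a real file.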

To be continued...

Reference: 华语乐坛到底姓什么？------酷狗篇 (roughly, "What Surname Does Chinese Pop Music Really Go By? The Kugou Edition")
