In this article, I'll walk through a complete, step-by-step case study of collecting Tripadvisor data with a Python crawler.
I. Preparation
Before writing the crawler, we need to take care of the following:
1. Install Python
This case study is written in Python, so you need a Python 3.7 or later interpreter installed on your machine.
2. Install the required libraries
We also need a few Python libraries: requests, beautifulsoup4, and lxml (the parser BeautifulSoup uses in the code below). They can be installed with:
pip install requests
pip install beautifulsoup4
pip install lxml
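Before moving on, a quick sanity check that the interpreter version and the installed libraries are in place can save debugging later. This is a minimal sketch; it only tests that the modules import:

```python
# Sanity check: confirm the Python version and that the crawler's
# dependencies import cleanly before writing any scraping code.
import sys

assert sys.version_info >= (3, 7), "Python 3.7+ is required"

for module in ("requests", "bs4", "lxml"):
    try:
        __import__(module)
        print(module, "OK")
    except ImportError:
        print(module, "MISSING - run the pip commands above")
```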
3. Analyze the page structure
Before extracting any data, inspect the target page's HTML (for example with the browser's developer tools) to work out where the data lives and how the crawler should locate it.
II. Target website
For this case study we use the travel site Tripadvisor as the target, at https://www.tripadvisor.com.
III. Crawler workflow
The crawler in this case study proceeds through the following steps:
1. Choose the target URL
First, decide which URL to crawl. Here we pick the attractions listing page for one city on Tripadvisor, for example:
https://www.tripadvisor.com/Attractions-g293917-Activities-Ho_Chi_Minh_City.html
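The listing URL encodes a geo id for the city (g293917 above) and, on later result pages, an offset. The helper below is a hypothetical convenience of ours, and the oa&lt;offset&gt; path segment (30 results per page) is only an observed pattern on the site, not a documented interface, so verify it in the browser before relying on it:

```python
# Hypothetical helper: build attraction-listing URLs for a city.
# The geo id (e.g. 'g293917') and the 'oa<offset>' pagination segment are
# observed URL patterns on Tripadvisor, not a stable, documented API.
BASE = 'https://www.tripadvisor.com'

def attractions_url(geo_id, city_slug, page=0, per_page=30):
    offset = page * per_page
    if offset == 0:
        return f'{BASE}/Attractions-{geo_id}-Activities-{city_slug}.html'
    return f'{BASE}/Attractions-{geo_id}-Activities-oa{offset}-{city_slug}.html'

print(attractions_url('g293917', 'Ho_Chi_Minh_City'))
print(attractions_url('g293917', 'Ho_Chi_Minh_City', page=2))
```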
2. Fetch the page source
With the target URL chosen, we fetch its HTML with the requests library for later processing:
import requests
url = 'https://www.tripadvisor.com/Attractions-g293917-Activities-Ho_Chi_Minh_City.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    html = response.text
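The snippet above silently does nothing when the status code is not 200. A slightly more defensive fetch adds a timeout, automatic retries for transient failures, and raise_for_status(). This is a sketch; the timeout and retry numbers are arbitrary choices of ours:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_html(url, headers, timeout=10):
    # Retry transient failures (rate limiting, server errors) a few times
    # with exponential backoff before giving up.
    session = requests.Session()
    retry = Retry(total=3, backoff_factor=1.0,
                  status_forcelist=[429, 500, 502, 503, 504])
    session.mount('https://', HTTPAdapter(max_retries=retry))
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of continuing silently
    return response.text
```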
3. Parse the page source
Next we parse the HTML with BeautifulSoup to pull out the fields we need:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# NOTE: these hashed class names come from Tripadvisor's build pipeline and
# change whenever the site is redeployed; re-check them in the browser's
# developer tools before running.
names = soup.select('.attractions-attraction-overview-pois-PoiTitle__poiTitle--2FLHC')
ranks = soup.select('.attractions-attraction-overview-pois-PoiInfo__isAligned--3Qoqq .attractions-attraction-overview-pois-PartOfPrime__part-of-prime-ribbon--2n2M3')
reviews = soup.select('.attractions-attraction-overview-pois-PoiInfo__isAligned--3Qoqq .attractions-attraction-overview-pois-ReviewCount__reviewCount--2lT0I')
for name, rank, review in zip(names, ranks, reviews):
    data = {
        'name': name.get_text(),
        'rank': rank.get_text(),
        'review': review.get_text()
    }
    print(data)
4. Store the data
Finally, we save the extracted data locally (to a file or a database) for later use. Here we write it to a CSV file:
import csv
with open('tripadvisor.csv', 'w', encoding='utf-8-sig', newline='') as csvfile:
    fieldnames = ['name', 'rank', 'review']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for name, rank, review in zip(names, ranks, reviews):
        writer.writerow({'name': name.get_text(), 'rank': rank.get_text(), 'review': review.get_text()})
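CSV is fine for a one-off export; if you'd rather query the results later, the same three fields can go into SQLite from the standard library instead. This is a sketch with placeholder data; the table and file names are our own choice:

```python
import sqlite3

# Placeholder rows standing in for the (name, rank, review) values
# scraped above.
rows = [('Sample Attraction', '#1 of 100 things to do', '1,234 reviews')]

conn = sqlite3.connect('tripadvisor.db')
conn.execute('CREATE TABLE IF NOT EXISTS attractions'
             ' (name TEXT, rank TEXT, review TEXT)')
conn.executemany('INSERT INTO attractions VALUES (?, ?, ?)', rows)
conn.commit()
print(conn.execute('SELECT COUNT(*) FROM attractions').fetchone()[0])
conn.close()
```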
That completes the full workflow of this crawler example.
IV. Examples
Below are two complete crawler examples:
Example 1: attraction rankings and review counts for a city
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.tripadvisor.com/Attractions-g293917-Activities-Ho_Chi_Minh_City.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    names = soup.select('.attractions-attraction-overview-pois-PoiTitle__poiTitle--2FLHC')
    ranks = soup.select('.attractions-attraction-overview-pois-PoiInfo__isAligned--3Qoqq .attractions-attraction-overview-pois-PartOfPrime__part-of-prime-ribbon--2n2M3')
    reviews = soup.select('.attractions-attraction-overview-pois-PoiInfo__isAligned--3Qoqq .attractions-attraction-overview-pois-ReviewCount__reviewCount--2lT0I')
    with open('tripadvisor.csv', 'w', encoding='utf-8-sig', newline='') as csvfile:
        fieldnames = ['name', 'rank', 'review']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for name, rank, review in zip(names, ranks, reviews):
            writer.writerow({'name': name.get_text(), 'rank': rank.get_text(), 'review': review.get_text()})
Example 2: detailed attraction information for a city
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.tripadvisor.com/Attractions-g293917-Activities-Ho_Chi_Minh_City.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    info_links = soup.select('.attractions-attraction-overview-pois-PoiTitle__poiTitle--2FLHC a')
    with open('tripadvisor_info.csv', 'w', encoding='utf-8-sig', newline='') as csvfile:
        fieldnames = ['name', 'type', 'review', 'address', 'phone', 'website']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for link in info_links:
            if link.get('href').startswith('/Attraction_Review'):
                url = 'https://www.tripadvisor.com' + link.get('href')
                response = requests.get(url, headers=headers)
                if response.status_code == 200:
                    html = response.text
                    soup = BeautifulSoup(html, 'lxml')
                    name = soup.select_one('.heading_title').get_text().strip()
                    review = soup.select_one('.header_rating span').get_text().strip()
                    address = soup.select_one('.address').get_text().strip()
                    phone = soup.select_one('.phone').get_text().strip()
                    if not phone.startswith('+'):
                        phone = None
                    website = soup.select_one('.website-link a')
                    if website:
                        website = website.get('href')
                    else:
                        website = None
                    types = [t.get_text().strip() for t in soup.select('.attractions-attraction-detail-about-card-SingleItemDetail__detail--1obvs .attractions-attraction-detail-about-card-SingleItemDetail__category--1Tkrc a')]
                    type_str = ', '.join(types)
                    writer.writerow({'name': name, 'type': type_str, 'review': review, 'address': address, 'phone': phone, 'website': website})
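Example 2 fires one request per attraction in a tight loop, which is both impolite and likely to get the crawler blocked. A simple mitigation is to pause between requests; the 2-second base delay below is an arbitrary, conservative choice of ours:

```python
import random
import time

def polite_pause(base=2.0, jitter=1.0):
    """Sleep for base seconds plus random jitter; return the actual delay."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Calling polite_pause() at the top of the for-link loop, before each requests.get, spaces the detail-page requests out; the jitter makes the traffic look less mechanical.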
That's the whole guide; I hope it helps you collect Tripadvisor data with your own Python crawlers.
Unless otherwise stated, articles on this site are original; when reposting, please credit the source: Python爬虫采集Tripadvisor数据案例实现 - Python技术站