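Scraping BOSS Zhipin Static Pages with Python and bs4

This guide shows how to scrape a static BOSS Zhipin page with Python's BeautifulSoup library. It walks through two examples: extracting job information and extracting company information.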

Step 1: Fetch the page content

Before parsing anything, we need the HTML of the target page. We use Python's requests library to download it.

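import requests

# Search results URL used throughout this guide
url = 'https://www.zhipin.com/c101280100/?query=python&page=1'
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise an exception on an HTTP error status
html = response.text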

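In the code above, we first define a variable named url that holds the URL of the target page, then a dictionary named headers that holds the request headers. Finally, we call the get() method of requests to send the HTTP request and read the response body as text.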
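The URL used above carries what appear to be the search keyword and the page number as query parameters. As a minimal sketch, a small helper can rebuild that URL for other keywords or pages. The function name build_search_url and the constant BASE_URL are hypothetical, introduced only for illustration, and the parameter meanings are inferred from their names rather than confirmed by the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.zhipin.com"  # host taken from the URL used in this guide


def build_search_url(city_path, query, page=1):
    """Build a URL shaped like the one in this guide (hypothetical helper).

    city_path is the path segment from the original URL, e.g. "c101280100".
    """
    # urlencode escapes the values and joins them as key=value pairs
    params = urlencode({"query": query, "page": page})
    return f"{BASE_URL}/{city_path}/?{params}"


url = build_search_url("c101280100", "python", page=1)
print(url)  # https://www.zhipin.com/c101280100/?query=python&page=1
```

The resulting string can be passed to requests.get() exactly like the hard-coded url above.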

Step 2: Extract job information with BeautifulSoup

To extract job information with BeautifulSoup, follow these steps:

Import the BeautifulSoup library.

from bs4 import BeautifulSoup

Create a BeautifulSoup object.

soup = BeautifulSoup(html, 'html.parser')

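In the code above, we call the BeautifulSoup constructor to create an object named soup, passing in the HTML text of the target page and the name of the parser to use ('html.parser', the parser built into Python).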
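To see what the soup object offers without making a network request, the constructor can be tried on a small inline document. The markup below is hypothetical and exists only for this sketch; it is not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to show how parsing works offline
sample_html = """
<html><head><title>Demo</title></head>
<body><div class="job-title"> Python Developer </div></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# soup.title is the first <title> element; .string is its text
page_title = soup.title.string
# find() returns the first matching element; get_text(strip=True) trims whitespace
first_job = soup.find("div", class_="job-title").get_text(strip=True)

print(page_title)
print(first_job)
```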

Extract the job information with find_all().

# Select every job card (this guide assumes <div class="job-primary">)
job_list = soup.find_all('div', {'class': 'job-primary'})
for job in job_list:
    # Job title, company name (text of the link), and salary
    job_name = job.find('div', {'class': 'job-title'}).text.strip()
    company_name = job.find('div', {'class': 'company-text'}).a.text.strip()
    salary = job.find('span', {'class': 'red'}).text.strip()
    print(job_name, company_name, salary)

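In the code above, we call find_all() on the BeautifulSoup object to collect every job entry on the page and loop over the results. For each entry, we use find() to locate the job title, company name, and salary, read the text through the text attribute, and strip the surrounding whitespace. Finally, we print the three values.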
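One caveat with the loop above is that find() returns None when nothing matches, so reading .text on the result raises an AttributeError. The following is a minimal sketch of a more defensive version. The helper names text_or_none and parse_jobs are hypothetical, and the sample markup only reuses the class names from this guide; it is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second card deliberately has no salary element
sample_html = """
<div class="job-primary">
  <div class="job-title">Python Developer</div>
  <div class="company-text"><a href="#">Example Co</a></div>
  <span class="red">15-25K</span>
</div>
<div class="job-primary">
  <div class="job-title">Data Engineer</div>
  <div class="company-text"><a href="#">Sample Ltd</a></div>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_jobs(html):
    """Collect one dict per job card, tolerating missing elements."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all("div", {"class": "job-primary"}):
        company_box = card.find("div", {"class": "company-text"})
        company_link = company_box.a if company_box is not None else None
        jobs.append({
            "job_name": text_or_none(card.find("div", {"class": "job-title"})),
            "company_name": text_or_none(company_link),
            "salary": text_or_none(card.find("span", {"class": "red"})),
        })
    return jobs


jobs = parse_jobs(sample_html)
for job in jobs:
    print(job)
```

Returning None for a missing field keeps the loop running and leaves it to the caller to decide how to handle incomplete cards.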

The following complete example shows how to extract job information from BOSS Zhipin with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

# Fetch the search results page
url = 'https://www.zhipin.com/c101280100/?query=python&page=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise an exception on an HTTP error status
html = response.text

# Parse the HTML and print the title, company, and salary of each job card
soup = BeautifulSoup(html, 'html.parser')
job_list = soup.find_all('div', {'class': 'job-primary'})
for job in job_list:
    job_name = job.find('div', {'class': 'job-title'}).text.strip()
    company_name = job.find('div', {'class': 'company-text'}).a.text.strip()
    salary = job.find('span', {'class': 'red'}).text.strip()
    print(job_name, company_name, salary)

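In the code above, we first fetch the HTML text of the target page with requests. We then pass that text to the BeautifulSoup constructor to create an object named soup. Finally, we extract the job entries with find_all() and print the job title, company name, and salary of each one.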
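The same extraction can also be written with CSS selectors through select() and select_one(), which some readers find easier to read than nested find() calls. This is an alternative sketch, not the method used in this guide. The function name extract_rows and the sample markup are hypothetical. It also shows a check for an empty result, since find_all() and select() return an empty list without raising an error when the page does not contain the expected elements.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class names from the example above
sample_html = """
<div class="job-primary">
  <div class="job-title">Backend Engineer</div>
  <div class="company-text"><a href="#">Example Co</a></div>
  <span class="red">20-30K</span>
</div>
"""


def extract_rows(html):
    """Return (job_name, company_name, salary) tuples found via CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.job-primary"):
        name = card.select_one("div.job-title")
        company = card.select_one("div.company-text a")
        salary = card.select_one("span.red")
        # Skip cards where any of the three elements is missing
        if all(tag is not None for tag in (name, company, salary)):
            rows.append(tuple(tag.get_text(strip=True) for tag in (name, company, salary)))
    return rows


rows = extract_rows(sample_html)
if not rows:
    print("No job cards matched; the page structure may differ from the selectors.")
for row in rows:
    print(*row)
```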

Step 3: Extract company information with BeautifulSoup

To extract company information with BeautifulSoup, follow these steps:

Import the BeautifulSoup library.

from bs4 import BeautifulSoup

Create a BeautifulSoup object.

soup = BeautifulSoup(html, 'html.parser')

As in step 2, we create a BeautifulSoup object named soup from the HTML text of the target page.

Extract the company information with find_all().

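# Select every company block (this guide assumes <div class="company-text">)
company_list = soup.find_all('div', {'class': 'company-text'})
for company in company_list:
    company_name = company.a.text.strip()  # text of the first <a> in the block
    company_info = company.p.text.strip()  # text of the first <p> in the block
    print(company_name, company_info)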

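In the code above, we call find_all() on the BeautifulSoup object to collect every company block on the page and loop over the results. For each block, we read the company name from its first a element and the company details from its first p element (company.a and company.p are shorthand for company.find('a') and company.find('p')), then take the text through the text attribute. Finally, we print the company name and the company details.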
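The attribute shortcuts company.a and company.p return None when the block has no such element, just as find() does. As a minimal sketch under that assumption, the loop below guards both lookups. The sample markup and its contents are hypothetical and only reuse the class name from this guide.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second block has no <p> element
sample_html = """
<div class="company-text"><a href="#">Example Co</a><p>Internet, 100-499 staff</p></div>
<div class="company-text"><a href="#">Sample Ltd</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
companies = []
for box in soup.find_all("div", {"class": "company-text"}):
    link = box.a  # first <a> inside the block, or None
    info = box.p  # first <p> inside the block, or None
    companies.append((
        link.get_text(strip=True) if link is not None else None,
        info.get_text(strip=True) if info is not None else None,
    ))

print(companies)
```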

The following complete example shows how to extract company information from BOSS Zhipin with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

# Fetch the search results page
url = 'https://www.zhipin.com/c101280100/?query=python&page=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise an exception on an HTTP error status
html = response.text

# Parse the HTML and print the name and details of each company block
soup = BeautifulSoup(html, 'html.parser')
company_list = soup.find_all('div', {'class': 'company-text'})
for company in company_list:
    company_name = company.a.text.strip()
    company_info = company.p.text.strip()
    print(company_name, company_info)

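In the code above, we first fetch the HTML text of the target page with requests. We then pass that text to the BeautifulSoup constructor to create an object named soup. Finally, we extract the company blocks with find_all() and print the name and details of each company.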

Unless otherwise noted, articles on this site are original. If you repost this article, please credit the source: python使用bs4爬取boss直聘静态页面 - Python技术站
