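Python Web Scraping Basics: Ranking HDU OJ Problems by Accuracy

HDU OJ (the Hangzhou Dianzi University Online Judge) is a well-known online judge that hosts a large collection of algorithm problems. This guide shows how to use a Python scraper to collect accuracy statistics for HDU OJ problems and sort the problems by accuracy.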

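Scraping problem information

The requests and BeautifulSoup libraries are enough to scrape the problem list from HDU OJ. The following example collects the ID, title, and URL of each problem in volume 1: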

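import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = 'http://acm.hdu.edu.cn/'
url = urljoin(BASE_URL, 'listproblem.php?vol=1')

# Fetch volume 1 of the problem list; fail fast on network or HTTP errors.
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the problem table and skip its header row.
table = soup.find('table', {'class': 'table_text'})
if table is None:
    raise RuntimeError('problem table not found; the page layout may have changed')
rows = table.find_all('tr')[1:]

problems = []
for row in rows:
    cols = row.find_all('td')
    link = cols[2].find('a') if len(cols) > 2 else None
    if link is None:
        continue  # skip rows that do not match the expected layout
    problem_id = cols[0].text.strip()
    problem_title = cols[2].text.strip()
    problem_url = urljoin(BASE_URL, link['href'])
    problems.append({'id': problem_id, 'title': problem_title, 'url': problem_url})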

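The code sends an HTTP request with requests and parses the HTML response with BeautifulSoup. The find method locates the problem table, and find_all returns its rows; the first row is the header, so it is skipped. For each remaining row, find_all returns the cells: the first cell holds the problem ID, and the third holds the title and the link to the problem page. Each problem is stored as a dictionary in the problems list. If the page builds its rows with JavaScript rather than static HTML, the table will be missing or empty and a different extraction approach is needed.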
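To see the row-parsing logic in isolation, the following minimal sketch runs the same find and find_all steps against a small inline HTML snippet instead of the live site. The snippet, its two rows, and the helper name parse_problem_rows are made up for illustration; the markup of the real page may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://acm.hdu.edu.cn/'

# Hypothetical markup that mimics the layout the scraper expects.
SAMPLE_HTML = """
<table class="table_text">
  <tr><th>ID</th><th>Solved</th><th>Title</th></tr>
  <tr><td>1000</td><td></td><td><a href="showproblem.php?pid=1000">A + B Problem</a></td></tr>
  <tr><td>1001</td><td></td><td><a href="showproblem.php?pid=1001">Sum Problem</a></td></tr>
</table>
"""


def parse_problem_rows(html):
    """Extract id, title, and url from each data row of the problem table."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'table_text'})
    if table is None:
        return []
    problems = []
    for row in table.find_all('tr')[1:]:  # skip the header row
        cols = row.find_all('td')
        link = cols[2].find('a') if len(cols) > 2 else None
        if link is None:
            continue  # row does not match the expected layout
        problems.append({
            'id': cols[0].text.strip(),
            'title': cols[2].text.strip(),
            'url': urljoin(BASE_URL, link['href']),
        })
    return problems


problems = parse_problem_rows(SAMPLE_HTML)
for problem in problems:
    print(problem)
```

Keeping the parsing in a function that takes an HTML string makes it possible to check the column indices against a saved copy of the page before sending any requests.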

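Scraping submission records

The same two libraries can scrape a user's submission records from HDU OJ. The following example collects the problem IDs of the user's accepted submissions: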

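import requests
from bs4 import BeautifulSoup

# Replace your_user_id with a real account name. The query parameters are kept
# as in the original example; verify them against the live status page.
url = 'http://acm.hdu.edu.cn/status.php?user_id=your_user_id&status=0'
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find('table', {'class': 'table_text'})
if table is None:
    raise RuntimeError('status table not found; the page layout may have changed')
rows = table.find_all('tr')[1:]  # skip the header row

submissions = []  # problem IDs of accepted submissions
for row in rows:
    cols = row.find_all('td')
    if len(cols) < 4:
        continue  # skip rows that do not match the expected layout
    # Column positions are assumed from the original example; check them
    # against the live table before relying on the result.
    problem_id = cols[2].text.strip()
    result = cols[3].text.strip()
    if result == 'Accepted':
        submissions.append(problem_id)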

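As before, the code sends an HTTP request with requests and parses the response with BeautifulSoup. The find method locates the submission table, and find_all returns its rows and then the cells of each row. The problem ID and the verdict are read from the cells, and the problem ID is appended to submissions only when the verdict is Accepted. The list can therefore contain the same problem ID more than once, and it covers only the rows on the page that was fetched.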
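The loop above keeps only accepted submissions. A pass rate also needs the number of attempts, so one option is to record every verdict and tally both counts per problem. The sketch below works on already-extracted (problem_id, verdict) pairs; the sample records and the name tally_verdicts are hypothetical and not taken from HDU OJ.

```python
from collections import Counter


def tally_verdicts(records):
    """Return (attempts, accepted) Counters keyed by problem ID."""
    attempts = Counter()
    accepted = Counter()
    for problem_id, verdict in records:
        attempts[problem_id] += 1
        if verdict == 'Accepted':
            accepted[problem_id] += 1
    return attempts, accepted


# Hypothetical records, as they might come out of the status table.
records = [
    ('1000', 'Accepted'),
    ('1000', 'Wrong Answer'),
    ('1001', 'Wrong Answer'),
    ('1001', 'Accepted'),
    ('1001', 'Accepted'),
    ('1002', 'Time Limit Exceeded'),
]

attempts, accepted = tally_verdicts(records)
for problem_id in sorted(attempts):
    print(problem_id, accepted[problem_id], '/', attempts[problem_id])
```

Because Counter returns 0 for a missing key, a problem with attempts but no accepted submission needs no special handling when the counts are read back.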

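Computing accuracy and sorting

Counter from the collections module counts the accepted submissions per problem, and the built-in sorted function orders the problems by the resulting ratio. The following example computes the ratio for each problem and prints the problems from highest to lowest: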

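from collections import Counter

# Assumes problems and submissions were built by the two scraping steps above.
problem_ids = [problem['id'] for problem in problems]
submission_counts = Counter(submissions)  # accepted submissions per problem
total_accepted = len(submissions)

# Share of the accepted submissions that belongs to each problem; 0.0 when
# nothing was accepted, which avoids a ZeroDivisionError.
accuracy = {
    problem_id: submission_counts[problem_id] / total_accepted if total_accepted else 0.0
    for problem_id in problem_ids
}
sorted_problems = sorted(problems, key=lambda problem: accuracy[problem['id']], reverse=True)

for problem in sorted_problems:
    print(f"{problem['id']} {problem['title']} {accuracy[problem['id']]:.2%}")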

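A list comprehension collects the problem IDs, and Counter counts how many accepted submissions each problem has. A dictionary comprehension then divides each count by the total number of accepted submissions, and sorted orders the problems by that value in descending order. Finally, the loop prints the ID, title, and ratio of each problem. Note that this ratio is each problem's share of all accepted submissions, not accepted submissions divided by attempts, because submissions contains only accepted records. A true per-problem accuracy would also need the number of attempts on each problem.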
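For comparison, a per-problem accuracy in the usual sense divides accepted submissions by total attempts on that problem. The sketch below ranks problems by that ratio, treating problems with no attempts as 0% and breaking ties by problem ID. The counts and the name rank_by_accuracy are illustrative assumptions, not data from HDU OJ.

```python
def rank_by_accuracy(attempts, accepted):
    """Return [(problem_id, ratio)] sorted by ratio (high to low), then by ID."""
    ratios = {}
    for problem_id, total in attempts.items():
        # Guard against division by zero for problems nobody has tried.
        ratios[problem_id] = accepted.get(problem_id, 0) / total if total else 0.0
    return sorted(ratios.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-problem counts.
attempts = {'1000': 4, '1001': 10, '1002': 0, '1003': 2}
accepted = {'1000': 3, '1001': 5, '1003': 2}

ranking = rank_by_accuracy(attempts, accepted)
for problem_id, ratio in ranking:
    print(f'{problem_id} {ratio:.2%}')
```

Sorting on the tuple (-ratio, problem_id) keeps the order deterministic when several problems share the same ratio, which reverse=True alone does not express.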

Example 1: Scraping HDU OJ problem information

The following complete example scrapes the HDU OJ problem list and prints it:

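import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = 'http://acm.hdu.edu.cn/'
url = urljoin(BASE_URL, 'listproblem.php?vol=1')

# Fetch volume 1 of the problem list; fail fast on network or HTTP errors.
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the problem table and skip its header row.
table = soup.find('table', {'class': 'table_text'})
if table is None:
    raise RuntimeError('problem table not found; the page layout may have changed')
rows = table.find_all('tr')[1:]

problems = []
for row in rows:
    cols = row.find_all('td')
    link = cols[2].find('a') if len(cols) > 2 else None
    if link is None:
        continue  # skip rows that do not match the expected layout
    problem_id = cols[0].text.strip()
    problem_title = cols[2].text.strip()
    problem_url = urljoin(BASE_URL, link['href'])
    problems.append({'id': problem_id, 'title': problem_title, 'url': problem_url})

# Print one problem per line: ID, title, URL.
for problem in problems:
    print(f"{problem['id']} {problem['title']} {problem['url']}")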

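The code uses find to locate the problem table and find_all to get its rows, then calls find_all on each row to get its cells. The text of the cells and the link in the title cell are combined into one record per problem, and each record is printed to the console.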
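The example treats the link in each row as a relative href that has to be combined with the site root. String concatenation works when the href has no leading slash; urljoin from the standard library's urllib.parse module also copes with a leading slash or an absolute URL. A small sketch, with sample href values chosen for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'http://acm.hdu.edu.cn/'

# Three shapes an href might take: relative, root-relative, absolute.
hrefs = [
    'showproblem.php?pid=1000',
    '/showproblem.php?pid=1001',
    'http://acm.hdu.edu.cn/showproblem.php?pid=1002',
]
resolved = [urljoin(BASE_URL, href) for href in hrefs]
for url in resolved:
    print(url)
```

With plain concatenation the second href would produce a double slash and the third would produce a malformed URL, so urljoin is the safer default when the markup is not under your control.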

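Example 2: Computing HDU OJ problem accuracy and sorting

The following complete example scrapes both pages, computes the ratio for each problem, and prints the problems in sorted order: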

import requests
from bs4 import BeautifulSoup
from collections import Counter
from urllib.parse import urljoin

BASE_URL = 'http://acm.hdu.edu.cn/'


def fetch_rows(url):
    """Return the data rows of the table_text table at url, without the header row."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'class': 'table_text'})
    if table is None:
        raise RuntimeError(f'table not found at {url}; the page layout may have changed')
    return table.find_all('tr')[1:]


# Step 1: collect the problem IDs of the user's accepted submissions.
# Replace your_user_id with a real account name. The query parameters and
# column positions are kept from the original example; verify them on the live page.
submissions = []
for row in fetch_rows(urljoin(BASE_URL, 'status.php?user_id=your_user_id&status=0')):
    cols = row.find_all('td')
    if len(cols) < 4:
        continue  # skip rows that do not match the expected layout
    problem_id = cols[2].text.strip()
    result = cols[3].text.strip()
    if result == 'Accepted':
        submissions.append(problem_id)

# Step 2: collect ID, title, and URL for every problem in volume 1.
problems = []
for row in fetch_rows(urljoin(BASE_URL, 'listproblem.php?vol=1')):
    cols = row.find_all('td')
    link = cols[2].find('a') if len(cols) > 2 else None
    if link is None:
        continue  # skip rows that do not match the expected layout
    problems.append({
        'id': cols[0].text.strip(),
        'title': cols[2].text.strip(),
        'url': urljoin(BASE_URL, link['href']),
    })

# Step 3: compute each problem's share of the accepted submissions and sort.
problem_ids = [problem['id'] for problem in problems]
submission_counts = Counter(submissions)
total_accepted = len(submissions)
accuracy = {
    problem_id: submission_counts[problem_id] / total_accepted if total_accepted else 0.0
    for problem_id in problem_ids
}
sorted_problems = sorted(problems, key=lambda problem: accuracy[problem['id']], reverse=True)

for problem in sorted_problems:
    print(f"{problem['id']} {problem['title']} {accuracy[problem['id']]:.2%}")

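The code first uses requests and BeautifulSoup to scrape the submission records and the problem list from HDU OJ. A list comprehension collects the problem IDs, and Counter counts the accepted submissions for each problem. A dictionary comprehension computes the ratio for each problem, and sorted orders the problems by that ratio in descending order. Finally, the loop prints the ID, title, and ratio of each problem. As in the previous section, the ratio is the problem's share of all accepted submissions rather than accepted submissions divided by attempts.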
