Logging in to GitHub with a Python Scraper Script and Viewing Account Information

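This guide to logging in to GitHub from a Python scraper script and viewing account information is divided into the following steps:

Log in to a GitHub account and obtain the Cookie
Use the Cookie to request a page that requires login and retrieve the profile information
Combine both into one script that logs in automatically and retrieves the profile information

Each step is described in detail below.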

Log in to a GitHub account and obtain the Cookie

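Log in to GitHub in Chrome and press F12 to open the developer tools. In the Network panel, find the login request and look at its request headers and those of the requests that follow it. They contain a cookie field, and its value is the Cookie we need.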
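A Cookie copied by hand from the developer tools arrives as one semicolon-separated string, while the cookies argument of requests accepts a dict. The following is a minimal sketch of that conversion, assuming plain name=value pairs. The function name cookie_header_to_dict and the sample values are illustrative placeholders, not part of any library.

```python
def cookie_header_to_dict(header_value):
    """Turn a raw Cookie header string into a {name: value} dict."""
    cookies = {}
    for pair in header_value.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, since a cookie value may itself contain '='
        name, sep, value = pair.partition('=')
        if sep:
            cookies[name.strip()] = value.strip()
    return cookies


# Placeholder values, not real credentials
copied = '_ga=GA1.2.1001127999.1523271992; _gid=GA1.2.650638013.1523271992; user_session=xxxxx'
cookies = cookie_header_to_dict(copied)
print(cookies)
```

The resulting dict can be passed as cookies=cookies to requests.get.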

Example 1 code:

import requests

headers = {
    'Host': 'github.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://github.com/login',
    'Connection': 'keep-alive',
}

data = {
  'commit': 'Sign in',
  'utf8': '✓',
  'authenticity_token': 'XXXXXXXXXXXXXXXXXX', # Fill in the authenticity_token field from the login page
  'login': 'github_username',
  'password': 'github_password'
}

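# Assumption: the login response is a redirect that carries the Set-Cookie headers.
# allow_redirects=False keeps that response instead of the page it redirects to.
response = requests.post('https://github.com/session', headers=headers, data=data, allow_redirects=False)
cookies = response.cookies.get_dict()
print(cookies) # Inspect the cookies that were returned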

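In Example 1, authenticity_token must be replaced with a valid value, which can be obtained by capturing the login page traffic or by other means, such as parsing the login page. The token is normally tied to the session cookie issued together with the login page, so it only validates when it is sent along with that cookie. Example 1 sends the request with the requests library; once the response arrives, response.cookies.get_dict() returns the cookies of that response as a dictionary.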
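One of the "other means" of getting the token is to read it from the hidden form fields of the login page. The sketch below uses only the standard library's html.parser and collects every hidden input, so that any extra hidden fields in the form are picked up as well. The names HiddenInputParser and extract_hidden_fields and the sample HTML are made up for illustration; the real login page markup may differ.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attr = dict(attrs)
        if attr.get('type') == 'hidden' and attr.get('name'):
            # A valueless attribute is reported as None, so fall back to ''
            self.fields[attr['name']] = attr.get('value') or ''


def extract_hidden_fields(html):
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Hypothetical stand-in for the HTML of the login page
sample_html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
'''
fields = extract_hidden_fields(sample_html)
print(fields['authenticity_token'])
```

In a real run, the HTML would come from the response to a GET request for the login page, and the returned dict could be merged into the form data before adding login and password.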

Use the Cookie to request a page that requires login and retrieve the profile information

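With the Cookie in hand, we can request the GitHub profile page as a logged-in user and extract the information we need from it. The main task in this step is to find the request URL and the request headers to send. Both can be read from the developer tools (F12) in Chrome while logged in to GitHub.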

Example 2 code:

import requests

cookies = {'_ga': 'GA1.2.1001127999.1523271992', '_gid': 'GA1.2.650638013.1523271992', 'user_session': 'xxxxx'} # Fill in the Cookie obtained in the previous step

headers = {
    'Host': 'github.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
    'Referer': 'https://github.com',
    'Connection': 'keep-alive',
}

response = requests.get('https://github.com/github_username', headers=headers, cookies=cookies)
print(response.text) # Print the profile page; extract information from it as needed

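Example 2 sends the request with the requests library. Because the request must carry the Cookie to identify the user, we pass in the Cookie obtained earlier. The headers dict holds the request headers a browser would send; entries can be added or removed as needed.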
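Printing response.text returns the whole page, so some parsing is needed to pull out a single value. The sketch below shows one way to do that with the standard library alone: it returns the text of the first element with a given tag and CSS class. The helper names and the sample HTML are hypothetical, and the sample is not the real profile page markup.

```python
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Collect the text of the first element with a given tag and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside the matching element
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested elements of the same tag so the right end tag closes the match
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get('class') or '').split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def first_text_by_class(html, tag, css_class):
    parser = ClassTextParser(tag, css_class)
    parser.feed(html)
    text = ''.join(parser.chunks).strip()
    return text or None


# Hypothetical fragment standing in for response.text
sample_html = '<h1><span class="p-name">  Example User </span><span class="p-nickname">example</span></h1>'
print(first_text_by_class(sample_html, 'span', 'p-name'))
```

Returning None when nothing matches lets the caller tell a missing element apart from a crash, which is useful when the page returned is the login form rather than the profile.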

Combine both into one script that logs in automatically and retrieves the profile information

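Examples 1 and 2 obtain, respectively, the Cookie that results from logging in and the profile page. Combining them into one script lets a Python script log in to GitHub automatically and retrieve the profile information.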

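The steps are as follows:

Get the authenticity_token field from the login page
Send a POST request with the authenticity_token and the account credentials to obtain the Cookie, and keep it in a variable
Use the Cookie to send an HTTP request for the profile page as a logged-in user and extract the information needed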

Example 3 code:

import requests
from bs4 import BeautifulSoup

# A Session keeps the cookies issued with the login page; the token must be sent with them
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Accept-Language': 'en-US,en;q=0.5',
})

# Step 1: fetch the login page and extract authenticity_token
login_res = session.get('https://github.com/login', headers={'Referer': 'https://github.com/'})
soup = BeautifulSoup(login_res.text, 'lxml')
token = soup.find('input', attrs={'name': 'authenticity_token'}).get('value')
print(token)

# Step 2: submit the login form; the session stores the cookies the server returns
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': token,
    'login': 'your_github_login_id',
    'password': 'your_github_password'
}
login_res = session.post('https://github.com/session', headers={'Referer': 'https://github.com/login'}, data=data)
cookies = session.cookies.get_dict()
# user_session is the cookie shown in Example 2; its presence is used as a rough success check
if 'user_session' in cookies:
    print('Login Successfully!')
else:
    print('Login failed, status code:', login_res.status_code)

# Step 3: request the profile page with the session cookies and extract the name
personal_url = 'https://github.com/<your_github_username>'
response = session.get(personal_url, headers={'Referer': 'https://github.com/'})
soup = BeautifulSoup(response.text, 'lxml')
name_tag = soup.select_one('span.p-name')
print("Your Name is : " + (name_tag.text.strip() if name_tag else '(not found)'))

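Example 3 first requests the login page with the requests library and parses the authenticity_token value out of it with BeautifulSoup. It then submits the token together with the credentials and reuses the resulting cookies and headers for the next request. Once that request succeeds, we have the profile page (response.text) and extract the user's name from it with soup.select_one('span.p-name').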

I hope these examples are helpful.

Unless otherwise stated, articles on this site are original. When reposting, please credit the source: Python爬虫使用脚本登录Github并查看信息 - Python技术站
