Python Font Anti-Scraping: A Practical Case Study

Below is a detailed walkthrough of "Python Font Anti-Scraping: A Practical Case Study".


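Preface

Web scrapers often run into font-based anti-scraping: the page ships a custom font and writes the protected text as character codes that only display correctly with that font, so the raw HTML contains meaningless characters. In Python, the fontTools library can parse such a font and help recover the real text.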
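To make the problem concrete, here is a minimal sketch of what an obfuscated page looks like to a scraper. It assumes the site writes the protected digits as private-use code points, which is common but not universal; the sample markup and the `is_private_use` helper are hypothetical and not taken from any real site.

```python
import html
import unicodedata

# Hypothetical obfuscated markup: the price digits are entities that only
# the site's custom font knows how to draw.
sample = '<span class="price">&#xe123;&#xe456;</span>'


def is_private_use(ch):
    """Return True if ch is a private-use code point (Unicode category 'Co')."""
    return unicodedata.category(ch) == 'Co'


# html.unescape turns the entities into characters, which carry no meaning
# without the font.
decoded = html.unescape(sample)
suspicious = [f'U+{ord(ch):04X}' for ch in decoded if is_private_use(ch)]
print(suspicious)
```

A non-empty list is a hint that the page relies on a custom font, and that the font file has to be inspected before the text can be read.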

Steps

The detailed steps for this case are as follows:

First, use the requests library to fetch the page content. Sample code:

```python
import requests

url = 'https://www.example.com'

headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

response = requests.get(url, headers=headers)
```

Next, use the bs4 library to parse the page content. Sample code:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'lxml')
```

Get the font file URL, download the font file, and read the font information. Sample code:

```python
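from fontTools.ttLib import TTFont

# Pull the font URL out of the first <style> block; this split is specific
# to the target page's CSS and will need adjusting for other sites
font_url = soup.select('style')[0].text.split('///')[1].split('\'')[0]

response_font = requests.get(font_url, headers=headers)

with open('font.ttf', 'wb') as f:
    f.write(response_font.content)

font = TTFont('font.ttf')
# Skip the first two glyphs (typically placeholders such as '.notdef')
uni_list = font.getGlyphOrder()[2:]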
```
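Splitting the CSS text on fixed substrings breaks as soon as the stylesheet changes. As an alternative, here is a minimal sketch that finds the first `url(...)` inside an `@font-face` rule with a regular expression and resolves it against the page URL. The `find_font_url` helper, the regular expression, and the sample CSS are assumptions for illustration; real stylesheets may need a different pattern.

```python
import re
from urllib.parse import urljoin

# Matches the first url(...) inside an @font-face rule, with or without quotes.
FONT_URL_RE = re.compile(
    r"@font-face\s*\{[^}]*?url\(\s*['\"]?([^'\")]+)['\"]?\s*\)", re.S
)


def find_font_url(css_text, page_url):
    """Return the absolute URL of the first @font-face source, or None."""
    match = FONT_URL_RE.search(css_text)
    if match is None:
        return None
    # urljoin resolves relative and protocol-relative ('//host/...') URLs
    return urljoin(page_url, match.group(1))


# Hypothetical stylesheet content
css = ("@font-face { font-family: 'obf'; "
       "src: url('//static.example.com/fonts/obf.ttf') format('truetype'); }")
print(find_font_url(css, 'https://www.example.com/list'))
```

Returning `None` when nothing matches lets the caller decide whether to skip the page or fail loudly, instead of raising an `IndexError` from a chained `split`.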

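Convert the obfuscated character codes in the page into the corresponding text. The replacement runs on the raw HTML (`response.text`), because BeautifulSoup has already decoded the `&#x...;` entities inside `soup`. Note that `getBestCmap()` returns glyph names, so the result is readable only when the font's glyph names identify the real characters. Sample code:

```python
html = response.text  # raw HTML still contains the &#x...; entities
cmap = font.getBestCmap()  # code point -> glyph name
for uni in uni_list:
    # Glyph names are assumed to look like 'uniE123'; uni[3:] is the hex code
    html = html.replace('&#x' + uni[3:].lower() + ';', cmap[int(uni[3:], 16)])
```

Finally, save the processed page as an HTML file. Sample code:

```python
with open('index.html', 'w', encoding='utf-8') as f:
    f.write(html)
```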
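The replacement step can also be done in a single pass over the HTML. The sketch below assumes the font follows the `uniXXXX` glyph-naming convention, where the name itself encodes the real character; many obfuscated fonts do not, and then the mapping has to come from somewhere else (for example OCR). The helper names and the sample `cmap` dict are hypothetical; the dict is only shaped like the result of `font.getBestCmap()`.

```python
import re

ENTITY_RE = re.compile(r'&#x([0-9a-fA-F]+);')
UNI_NAME_RE = re.compile(r'^uni([0-9A-Fa-f]{4})$')


def glyph_name_to_char(name):
    """Turn a glyph name such as 'uni0034' into '4'; return None otherwise."""
    match = UNI_NAME_RE.match(name)
    return chr(int(match.group(1), 16)) if match else None


def build_mapping(cmap):
    """Map code points to real characters where the glyph name reveals them."""
    mapping = {}
    for code_point, name in cmap.items():
        char = glyph_name_to_char(name)
        if char is not None:
            mapping[code_point] = char
    return mapping


def replace_entities(html_text, mapping):
    """Replace known &#x...; entities in one pass; leave unknown ones as-is."""
    def substitute(match):
        code_point = int(match.group(1), 16)
        return mapping.get(code_point, match.group(0))
    return ENTITY_RE.sub(substitute, html_text)


# Hypothetical cmap, shaped like the dict returned by font.getBestCmap()
cmap = {0xE123: 'uni0034', 0xE456: 'uni0032', 0xE789: 'glyph00007'}
mapping = build_mapping(cmap)
print(replace_entities('<b>&#xe123;&#xE456;&#xe789;</b>', mapping))
```

Entities whose glyph name does not follow the convention are left untouched, which makes it easy to spot what still needs a manual or OCR-based mapping.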

Example

Here is an example that uses the FontTools library to handle the font-based anti-scraping on "ganji.com".

Code example

```python
import re
from io import BytesIO

import pytesseract
import requests
from bs4 import BeautifulSoup
from PIL import Image, ImageDraw, ImageFont
from fontTools.ttLib import TTFont


def translateFont(soup, glyph_map_url):
    font_re = re.compile(r"fonts-path: url\('(.*?)'\)")
    # Fetch the CSS file that points to the font file
    css_text = requests.get(glyph_map_url).text
    font_url = "https:" + font_re.findall(css_text)[0]
    # Download the font and parse it from memory
    font_resp = requests.get(font_url)
    font = TTFont(BytesIO(font_resp.content))
    cmap = font.getBestCmap()  # code point -> glyph name

    # Keep a local copy of the font
    with open("downloaded.ttf", "wb") as f:
        f.write(font_resp.content)

    # fontTools has no glyph-rendering method, so Pillow draws each glyph instead
    pil_font = ImageFont.truetype("downloaded.ttf", 150)

    def recognize(ch):
        # Draw the character with the site's font, then OCR the image
        im = Image.new("RGB", (200, 200), "white")
        ImageDraw.Draw(im).text((20, 10), ch, font=pil_font, fill="black")
        # --psm 10 tells Tesseract to treat the image as a single character
        return pytesseract.image_to_string(im, config="--psm 10").strip()

    # BeautifulSoup has already decoded the &#x...; entities, so look for
    # non-ASCII characters that the downloaded font defines
    for node in soup.find_all(string=True):
        text = str(node)
        new_text = "".join(
            recognize(ch) if ord(ch) > 0x7F and ord(ch) in cmap else ch
            for ch in text
        )
        if new_text != text:
            node.replace_with(new_text)
    return soup
```

Usage

```python
if __name__ == "__main__":
    glyph_map_url = "https://static.ganji.cn/public/file/site/bing0907akgn/f5c859ecbf466fcf3c3a11bc1127aa8e.css"
    headers = {
        "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36",
    }
    url = "https://bj.ganji.com/shuma/440_1/"

    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    soup = translateFont(soup, glyph_map_url)
    with open("ganji.html", "wb") as f:
        f.write(soup.encode())
```

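If you only fetch the page, the protected text in the result is still fully obfuscated. Next, run python download.py to download the site's font file. Finally, run python ganji.py to perform the font de-obfuscation and save the processed page.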
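OCR is slow, and the same obfuscated character usually appears many times on a page, so it can help to recognize each code point only once. The following is a minimal sketch of that idea using `functools.lru_cache`; `slow_recognize` is a hypothetical stand-in for the render-and-OCR step and simply returns a made-up digit.

```python
from functools import lru_cache

calls = []  # records how often the expensive step actually runs


def slow_recognize(code_point):
    """Stand-in for the render-and-OCR step (hypothetical, returns a fake digit)."""
    calls.append(code_point)
    return str(code_point % 10)


@lru_cache(maxsize=None)
def recognize_cached(code_point):
    """Recognize each code point once and reuse the result afterwards."""
    return slow_recognize(code_point)


code_points = [0xE123, 0xE456, 0xE123, 0xE123]
decoded = ''.join(recognize_cached(cp) for cp in code_points)
print(decoded, len(calls))
```

The cached results are valid for one font file only. If the site serves a different font on a later request, call `recognize_cached.cache_clear()` before processing the new page.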

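Unless otherwise stated, articles on this site are original. If you repost this article, please credit the source: Python Font Anti-Scraping: A Practical Case Study - Python技术站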
