Example Usage of the Python pytesseract Library

pytesseract is a Python wrapper for the Tesseract OCR (Optical Character Recognition) engine. It extracts the text in an image and returns it as an editable string. This article explains how to use it.

Installing pytesseract

Run the following command in a terminal to install the pytesseract library:

pip install pytesseract

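pytesseract only wraps the engine, so the tesseract-ocr engine itself must be installed separately. Windows users need to download the installer from the official tesseract-ocr site; Linux users can install it from a terminal. To recognize Simplified Chinese, the chi_sim language data is also required (on Ubuntu/Debian it is packaged as tesseract-ocr-chi-sim).

Ubuntu/Debian:

sudo apt-get install tesseract-ocr

CentOS/RHEL (the package is named tesseract and comes from the EPEL repository):

sudo yum install tesseract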
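
Before running OCR, it can help to confirm that the engine is reachable from Python. The sketch below is one possible check, assuming the executable is named tesseract and is on PATH; find_tesseract and report_version are hypothetical helper names, not part of pytesseract. On Windows, where the engine is often not on PATH, the full path to tesseract.exe can be passed to report_version instead.

```python
import shutil


def find_tesseract(name="tesseract"):
    """Return the full path of the Tesseract executable on PATH, or None."""
    return shutil.which(name)


def report_version(tesseract_path):
    """Point pytesseract at the given executable and return its version."""
    import pytesseract  # imported lazily so find_tesseract() works without it

    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    return pytesseract.get_tesseract_version()


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; install it or set tesseract_cmd")
    else:
        try:
            print(path, report_version(path))
        except ImportError:
            print("pytesseract is not installed")
```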

Basic usage

The following example performs basic OCR with pytesseract:

import pytesseract
from PIL import Image

# Open the image to recognize
image = Image.open('example.png')

# Run OCR on the image with pytesseract
text = pytesseract.image_to_string(image, lang='chi_sim')

print(text)

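Here, Image.open() opens the image to be recognized and pytesseract.image_to_string() returns the text found in the image as a string. The lang parameter selects the recognition language. It defaults to English ('eng'), so recognizing Simplified Chinese requires lang='chi_sim', with the chi_sim language data installed.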
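
If the requested language data is missing, Tesseract fails at recognition time. The sketch below checks for it up front using pytesseract.get_languages(); missing_languages and ocr_with_check are hypothetical helper names introduced only for illustration, and example.png is the same placeholder file as above.

```python
def missing_languages(lang, installed):
    """Return the codes in a lang string that are absent from `installed`.

    Tesseract accepts several codes joined by '+', so the string is split first.
    """
    return [code for code in lang.split("+") if code not in installed]


def ocr_with_check(image_path, lang="chi_sim"):
    """Run OCR only if every requested language is installed."""
    import pytesseract
    from PIL import Image

    installed = pytesseract.get_languages(config="")
    missing = missing_languages(lang, installed)
    if missing:
        raise RuntimeError(f"missing Tesseract language data: {missing}")
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Calling ocr_with_check('example.png') then either returns the recognized text or raises an error that names the missing language.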

Advanced usage

1. Recognizing text in different languages

import pytesseract
from PIL import Image

# Open the image to recognize
image = Image.open('example.png')

# Run OCR once per language
text_eng = pytesseract.image_to_string(image, lang='eng')  # recognize English
text_ch = pytesseract.image_to_string(image, lang='chi_sim')  # recognize Simplified Chinese

print(text_eng)
print(text_ch)
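
The example above runs the engine once per language. When a single image mixes Chinese and English, Tesseract also accepts several language codes joined with + in one call. A minimal sketch, where combine_langs and ocr_mixed are hypothetical helper names:

```python
def combine_langs(*codes):
    """Join language codes into the 'a+b' form accepted by the lang parameter."""
    return "+".join(code.strip() for code in codes if code.strip())


def ocr_mixed(image_path, *codes):
    """Recognize an image that contains text in more than one language."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=combine_langs(*codes))
```

For instance, ocr_mixed('example.png', 'chi_sim', 'eng') passes lang='chi_sim+eng'. The order of the codes may affect the result, so it is worth listing the dominant language first.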

2. Recognizing digits in an image

import pytesseract
from PIL import Image
import re

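# Open the image to recognize
image = Image.open('example.png')

# Run OCR, restricting recognition to numeric characters
text = pytesseract.image_to_string(image, config='--psm 6 digits')

# Extract each run of digits with a regular expression
nums = re.findall(r'\d+', text)

print(nums)

In the code above, --psm 6 tells Tesseract to treat the image as a single uniform block of text, and digits loads Tesseract's bundled digits config file, which restricts recognition to digits plus characters such as '.' and '-' that appear in numbers. The config is passed by name only, because pytesseract supplies the input and output file arguments itself. The regular expression then pulls each run of digits out of the recognized text.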
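
Another common way to restrict output to digits is to set Tesseract's tessedit_char_whitelist variable through the -c option. The sketch below builds such a config string and post-processes the result; digits_config, extract_numbers and ocr_digits are hypothetical helper names, and whether the whitelist is honoured may depend on the Tesseract version and engine mode.

```python
import re


def digits_config(psm=6, whitelist="0123456789"):
    """Build a config string that limits recognition to `whitelist` characters."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def extract_numbers(text):
    """Return every run of consecutive digits found in the OCR output."""
    return re.findall(r"\d+", text)


def ocr_digits(image_path, psm=6):
    """Recognize an image and return the digit runs it contains."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image, config=digits_config(psm))
    return extract_numbers(text)
```

Keeping the regular expression as a second step is deliberate: even with a whitelist, the raw output can contain line breaks and spaces between groups of digits.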

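That concludes this guide to pytesseract, which covered two examples: recognizing text in different languages and recognizing digits in an image. With the Tesseract engine underneath and a friendly Python wrapper on top, OCR becomes straightforward, and the recognition quality is often pleasantly good on clean images. It can help with many tasks, such as recognizing CAPTCHAs or batch-extracting text from desensitized PDF pages once they have been converted to images.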
