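A Tutorial on Whoosh, a Lightweight Search Library for Python

What Is Whoosh?

Whoosh is a lightweight full-text search library written in pure Python. Its API is simple, which makes it easy to add full-text search to a Python application. Whoosh indexes Unicode text, so content from HTML, XML, PDF, and similar documents can be indexed once it has been extracted as plain text.

Installing Whoosh

Install Whoosh with pip: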

pip install whoosh

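Creating an Index

Before you can search with Whoosh, you need to create an index. An index is a data structure that holds the documents to be searched and speeds up search and sorting.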
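To make the idea of an index more concrete, here is a toy inverted index in plain Python. It is only an illustration of the general technique (mapping each word to the documents that contain it) and does not reflect how Whoosh stores its data on disk; the sample documents and the `lookup` helper are made up for this sketch.

```python
import re
from collections import defaultdict

documents = {
    "001": "A full-text search engine library written in Python",
    "002": "Notes on building a search index",
}

# Map each lowercase word to the set of document ids that contain it
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in re.findall(r"\w+", text.lower()):
        inverted_index[word].add(doc_id)


def lookup(word):
    """Return the ids of the documents containing word, in sorted order."""
    return sorted(inverted_index.get(word.lower(), ()))


print(lookup("search"))
print(lookup("python"))
```

Looking up a word is a single dictionary access, no matter how many documents there are, which is why an index makes searching fast.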

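The basic steps for creating an index are:

1. Define the schema

The schema describes the documents in the index: which fields they have and what type each field is. Once the schema is defined, you can use it to create the index.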

from whoosh.fields import Schema, ID, TEXT

# Define the schema
schema = Schema(id=ID(stored=True),
                title=TEXT(stored=True),
                content=TEXT(stored=True))

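2. Create the index

Use the schema to create a new index in an index directory, the folder where the index files are stored. Note that create_in always creates a fresh index: if the directory already holds an index, its contents are cleared. To open an existing index, use open_dir instead.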

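import os

from whoosh.index import create_in

# Create the index directory if it does not exist yet
index_dir = 'indexdir'
if not os.path.exists(index_dir):
    os.mkdir(index_dir)

# Create a new index (this replaces any index already in the directory)
ix = create_in(index_dir, schema)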

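3. Write documents

Once the index exists, you can write documents to it. Here a document is a dictionary whose keys are the field names defined in the schema; it is passed to add_document as keyword arguments.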

from whoosh.index import open_dir

index_dir = 'indexdir'

# Open the existing index
ix = open_dir(index_dir)

# Get a writer
writer = ix.writer()

# Add a document
doc = {'id': u'001', 'title': u'Python search engine', 'content': u'A full-text search engine library written in Python'}
writer.add_document(**doc)

# Commit the changes
writer.commit()

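Searching

With the index in place, you can search it with Whoosh. The basic steps are:

1. Create a query parser

The parser converts a query string entered by the user into a query object, which can then be used to search the index.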

from whoosh.qparser import QueryParser

# Create the query parser
qp = QueryParser("content", schema=schema)

2. Parse the query string

Use the parser to convert the search string into a query object.

# Parse the query string
q = qp.parse(u"search engine library")

3. Run the query

Use the parsed query object to run the search, that is, to find the documents in the index that match it.

from whoosh.index import open_dir

# Open the index
ix = open_dir(index_dir)

# Get a searcher; the with block closes it when the search is done
with ix.searcher() as searcher:
    # Run the query
    results = searcher.search(q)

    # Print the results
    for hit in results:
        print(hit['title'], hit['content'])

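Examples

Below are two examples of searching with Whoosh.

Example 1

In this example we write a simple Python script that searches our documents. The schema and the document are modified slightly: a field of type DATETIME is added to store the document's date.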

import os
from datetime import datetime

from whoosh.fields import Schema, ID, TEXT, DATETIME
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# Define the schema
schema = Schema(id=ID(stored=True),
                title=TEXT(stored=True),
                content=TEXT(stored=True),
                date=DATETIME(stored=True))

# Create the index
index_dir = 'indexdir'
if not os.path.exists(index_dir):
    os.mkdir(index_dir)

ix = create_in(index_dir, schema)

# Add a document
with ix.writer() as writer:
    writer.add_document(id=u"001",
                        title=u"Python search engine",
                        content=u"A full-text search engine library written in Python",
                        date=datetime(2022, 3, 1))

# Create the query parser
qp = QueryParser("content", schema=schema)

# Parse the query string
q = qp.parse(u"search engine library")

# Run the query
with ix.searcher() as searcher:
    results = searcher.search(q)

    # Print the results
    for hit in results:
        print("%s, %s, %s" % (hit['title'], hit['content'], hit['date'].strftime("%Y-%m-%d")))

The script prints:

Python search engine, A full-text search engine library written in Python, 2022-03-01

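Example 2

In this example we use Whoosh to search PDF files on the local disk. A field is added to the schema to store the file path, and the information for each PDF file is written to the index.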

import os
import uuid

from whoosh.fields import Schema, ID, TEXT, STORED
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# Define the schema
schema = Schema(id=ID(stored=True),
                title=TEXT(stored=True),
                path=STORED,
                content=TEXT(stored=True))

# Create the index
index_dir = 'indexdir'
if not os.path.exists(index_dir):
    os.mkdir(index_dir)

ix = create_in(index_dir, schema)

def extract_text_from_pdf(filepath):
    # Placeholder: this helper is not part of Whoosh. Implement it with a
    # PDF library of your choice so that it returns the file's text as a string.
    raise NotImplementedError("provide a PDF text extractor")

def index_pdf_files(pdf_dir):
    # Get a writer
    writer = ix.writer()

    # Walk through all PDF files in the directory
    for filename in os.listdir(pdf_dir):
        if not filename.endswith(".pdf"):
            continue

        filepath = os.path.join(pdf_dir, filename)

        # Convert the PDF to text
        text = extract_text_from_pdf(filepath)

        # Add the file to the index
        writer.add_document(id=uuid.uuid1().hex,
                            title=filename,
                            path=filepath,
                            content=text)

    # Commit the changes
    writer.commit()

def search_index(query_str):
    # Create the query parser
    qp = QueryParser("content", schema=schema)

    # Parse the query string
    q = qp.parse(query_str)

    # Run the query
    with ix.searcher() as searcher:
        results = searcher.search(q)

        # Print the results
        for hit in results:
            print(hit['title'], hit['path'])


pdf_dir = "pdfdir"
index_pdf_files(pdf_dir)
search_index("search_text")

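The code above indexes the PDF files in the pdfdir folder and then searches them for the keyword "search_text". For every document whose content contains "search_text", it prints the file's name and path.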
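The example relies on a helper named extract_text_from_pdf that Whoosh does not provide. One possible way to write it, assuming the third-party pypdf package is installed (an assumption of this sketch, not something the example requires), is shown below. The `join_page_texts` helper is a hypothetical name introduced here to keep the page handling separate from the PDF library.

```python
def join_page_texts(page_texts):
    """Join per-page text, skipping pages for which nothing was extracted."""
    return "\n".join(text for text in page_texts if text)


def extract_text_from_pdf(filepath):
    """Return the text of a PDF file as one string (sketch using pypdf)."""
    # Imported here so the rest of the script still loads without pypdf
    from pypdf import PdfReader

    reader = PdfReader(filepath)
    return join_page_texts(page.extract_text() for page in reader.pages)


# The joining step can be tried without any PDF file at hand
print(join_page_texts(["Page one text", "", "Page three text"]))
```

Scanned PDFs that contain only images yield no text this way; they would need OCR before they could be indexed.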

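Summary

This article gave a brief introduction to Whoosh, a search library for Python. It covered the basic concepts, installation, index creation, and searching, and included two simple examples to show how Whoosh is used in practice.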

Unless otherwise stated, articles on this site are original. If you republish this article, please credit the source: Python轻量级搜索工具Whoosh的使用教程 - Python技术站
