A Guide to Example Code for a File Search System in Java

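Overview

This article walks through example code for a file search system written in Java. The system quickly searches a given directory for files whose content contains the given text and displays the results.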

Development Environment

JDK 1.8
Apache Maven 3.6.0
IntelliJ IDEA 2021.1

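Implementation

Adding Dependencies

Create a Java project with Maven and add the Apache Commons IO and Apache Lucene dependencies to the pom file. Besides lucene-core, the code below uses QueryParser (from the lucene-queryparser module) and the highlighter classes (from the lucene-highlighter module), so those two modules are required as well.

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.7.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>8.7.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>8.7.0</version>
</dependency>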

Creating the File Search Class

Create a Searcher class with a search method that finds the indexed files whose content contains the given text.

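import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class Searcher {

    private final String indexDir;

    public Searcher(String indexDir) {
        this.indexDir = indexDir;
    }

    public List<String> search(String text) throws IOException, ParseException, InvalidTokenOffsetsException {
        // try-with-resources closes the directory and reader even if the search throws
        try (Directory dir = FSDirectory.open(Paths.get(indexDir));
             IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Parse the query text against the "content" field
            Analyzer analyzer = new StandardAnalyzer();
            QueryParser parser = new QueryParser("content", analyzer);
            Query query = parser.parse(text);

            // Fetch at most the 100 best-scoring hits
            TopDocs results = searcher.search(query, 100);
            ScoreDoc[] hits = results.scoreDocs;

            // The default SimpleHTMLFormatter wraps each match in <B>...</B>
            SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
            Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
            Fragmenter fragmenter = new SimpleFragmenter(200);
            highlighter.setTextFragmenter(fragmenter);

            List<String> list = new ArrayList<>();
            for (ScoreDoc hit : hits) {
                Document doc = searcher.doc(hit.doc);
                String path = doc.get("path");
                String title = doc.get("title");
                String content = doc.get("content");
                // Re-analyze the stored text; the index stores no term vectors
                TokenStream tokenStream = analyzer.tokenStream("content", content);
                TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, content, false, 3);
                StringBuilder sb = new StringBuilder();
                for (TextFragment textFragment : frag) {
                    if ((textFragment != null) && (textFragment.getScore() > 0)) {
                        sb.append(textFragment.toString());
                    }
                }
                // "路径" is the Chinese label for "path"
                String result = "<p><strong>" + title + "</strong><br/>" + sb + "<br/>" + "路径：" + path + "</p>";
                list.add(result);
            }
            return list;
        }
    }
}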

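The method uses the org.apache.lucene library to perform the search. The steps are:
* Open the index directory
* Create an IndexSearcher over it
* Create the analyzer and the query parser
* Build the query with the parser
* Run the query and iterate over the hits (at most the top 100 are returned)
* Highlight the matching fragments as HTML with the Highlighter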
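
To make the last step more concrete, here is a minimal plain-JDK sketch of what highlighting amounts to: wrapping each occurrence of the query term in a tag pair. It is not Lucene's Highlighter, which works on analyzed tokens and scores fragments. HighlightSketch and highlight are hypothetical names, and the matching is assumed to be a simple case-insensitive whole-word regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified stand-in for the highlighting step; not Lucene's Highlighter. */
final class HighlightSketch {

    /** Wraps every case-insensitive whole-word occurrence of term in <B>...</B>. */
    static String highlight(String text, String term) {
        // Pattern.quote keeps regex metacharacters in the term literal.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        // "$0" is the matched text itself, so the original casing is preserved.
        return matcher.replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("Hello world, hello Lucene", "hello"));
    }
}
```

The <B> tag pair mirrors the default of SimpleHTMLFormatter used in the Searcher class, which is why highlighted fragments in the search results contain those tags.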

Create an Indexer class with an indexDirectory method that builds the index for a directory of files.

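import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Indexer {

    private final IndexWriter writer;

    public Indexer(String indexDirectoryPath) throws IOException {
        FSDirectory dir = FSDirectory.open(Paths.get(indexDirectoryPath));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // Rebuild the index on every run so that re-running does not add duplicate documents
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        writer = new IndexWriter(dir, config);
    }

    public void indexDirectory(String dataDirPath) throws IOException {
        try {
            File[] files = new File(dataDirPath).listFiles();
            if (files != null) {
                for (File file : files) {
                    // Skip subdirectories; only regular files can be read as text
                    if (file.isFile()) {
                        indexFile(file);
                    }
                }
            }
        } finally {
            // Closing commits the index; this Indexer cannot be reused afterwards
            writer.close();
        }
    }

    private void indexFile(File file) throws IOException {
        System.out.println("Indexing " + file.getCanonicalPath());
        Document document = new Document();
        // StringField values are stored and indexed as a single, unanalyzed token
        Field pathField = new StringField("path", file.getCanonicalPath(), Field.Store.YES);
        document.add(pathField);
        Field titleField = new StringField("title", file.getName(), Field.Store.YES);
        document.add(titleField);
        // TextField is analyzed for full-text search; the files are assumed to be UTF-8 text
        String content = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        Field contentField = new TextField("content", content, Field.Store.YES);
        document.add(contentField);
        writer.addDocument(document);
    }
}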

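This class uses the org.apache.lucene library to build the index. The indexing steps are:
* Open the index directory
* Create an IndexWriter
* Iterate over the files in the given directory
* Build a Document object for each file and add Field objects to it
* Write the documents to the index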
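
For intuition about why searching an index is faster than scanning every file, here is a toy inverted index that uses only the JDK. It is a heavily simplified illustration of the idea, not how Lucene stores its index. InvertedIndexSketch and its methods are hypothetical names, and tokenization is assumed to be a plain split on non-alphanumeric characters rather than what StandardAnalyzer does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: term -> names of the documents containing it. */
final class InvertedIndexSketch {

    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits the content into lowercase terms and records the document under each term. */
    void add(String docName, String content) {
        for (String term : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Returns the documents containing the term: one map lookup, no file scan. */
    Set<String> search(String term) {
        Set<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(docs);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("file1.txt", "hello world");
        index.add("file2.txt", "this is a test");
        index.add("file3.txt", "test");
        System.out.println(index.search("test"));
    }
}
```

The work of reading and tokenizing the files happens once, at indexing time; each later query only consults the term map. Lucene adds relevance scoring, on-disk storage, and much more on top of this basic idea.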

Creating the Launcher Class

Create a Main class whose main method serves as the program entry point.

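import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;

import java.io.IOException;
import java.util.List;

public class Main {

    public static void main(String[] args) throws IOException, ParseException, InvalidTokenOffsetsException {
        String indexDir = "index";
        String dataDir = "data";

        // Build the index first, then query it
        Indexer indexer = new Indexer(indexDir);
        indexer.indexDirectory(dataDir);

        Searcher searcher = new Searcher(indexDir);
        List<String> list = searcher.search("lucene");

        System.out.println(list);
    }
}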

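Here, data is the directory to be searched: put the files you want to search into it. index is the directory that holds the index files built from the data directory.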
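
Note that indexDirectory only looks at the direct children of the data directory. If the files to search are nested in subdirectories, they could be collected first with something like the following sketch. FileCollectorSketch and collectFiles are hypothetical names that are not part of the code above; the sketch uses only java.nio.file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that gathers every regular file below a root directory. */
final class FileCollectorSketch {

    /** Returns all regular files under root, including nested ones, sorted by path. */
    static List<Path> collectFiles(Path root) throws IOException {
        // Files.walk keeps directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo on a throwaway directory so the sketch runs anywhere.
        Path root = Files.createTempDirectory("data");
        Path sub = Files.createDirectories(root.resolve("sub"));
        Files.write(root.resolve("file1.txt"), "hello world".getBytes(StandardCharsets.UTF_8));
        Files.write(sub.resolve("file2.txt"), "this is a test".getBytes(StandardCharsets.UTF_8));
        collectFiles(root).forEach(System.out::println);
    }
}
```

Each resulting Path could then be converted with toFile() and handed to a method like indexFile.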

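Examples

Below are two examples of using the Searcher class to find files that contain the given text. Both assume that the index has already been built and that indexDir points to it.

Example 1: Search for files containing "hello"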

Searcher searcher = new Searcher(indexDir);
List<String> list = searcher.search("hello");
System.out.println(list);

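Output, assuming file1.txt contains "hello world" and is the only file with a match (paths are abbreviated here; the program prints the canonical absolute path, and the highlighter wraps each match in <B> tags):

[<p><strong>file1.txt</strong><br/><B>hello</B> world<br/>路径：data/file1.txt</p>]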

Example 2: Search for files containing "test"

Searcher searcher = new Searcher(indexDir);
List<String> list = searcher.search("test");
System.out.println(list);

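Output, assuming file2.txt contains "this is a test" and file3.txt contains "test" (paths are abbreviated here; the program prints the canonical absolute path, the highlighter wraps each match in <B> tags, and hits are ordered by relevance score, which favors the shorter file3.txt):

[<p><strong>file3.txt</strong><br/><B>test</B><br/>路径：data/file3.txt</p>, <p><strong>file2.txt</strong><br/>this is a <B>test</B><br/>路径：data/file2.txt</p>]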

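Conclusion

This article showed how to implement a file search system in Java and provided example code for quickly finding the files in a directory that contain the given text. Lucene handles both indexing and searching, so queries are answered from the index instead of by scanning every file on each search.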
