Lucene 3.0.1: A Full-Text Search Engine Framework for Indexing Files and Databases and Querying with Highlighting

Date: 2014-05-16

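Lucene originated as a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete full-text search engine, but the framework for one. It provides a complete query engine and indexing engine, plus a partial text-analysis engine (for two Western languages, English and German). Lucene aims to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.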
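
To make the split between an "indexing engine" and a "query engine" concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's implementation or API: the class `TinyInvertedIndex`, its methods, and the naive tokenizer are hypothetical names made up for illustration, and the sketch ignores scoring, persistence, and proper language analysis.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): maps each term to the ids of documents containing it. */
final class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    /** "Indexing engine": tokenizes the text, records term -> document id, and returns the new id. */
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    /** "Query engine": returns the ids of documents containing the term, in ascending order. */
    List<Integer> search(String term) {
        Set<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(hits);
    }

    /** Naive analyzer (assumption): lower-cases and splits on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene is a search toolkit");
        index.add("A toolkit is not a complete engine");
        System.out.println(index.search("toolkit")); // ids of both documents
        System.out.println(index.search("lucene"));  // id of the first document only
    }
}
```

A real engine stores the postings on disk and ranks the hits; the point here is only that indexing and querying are separate steps that share one data structure.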
Search results for the keyword “唐山” (Tangshan): [screenshot]

Indexing and searching files
Creating the index with Lucene 3.0 (step 1)

package com.gjw.lecence;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.util.Date;

import jxl.Sheet;
import jxl.Workbook;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.textmining.text.extraction.WordExtractor;

/**
 * Creates the index with Lucene 3.0 (step 1).
 * 
 * @author RenWeigang
 * 
 * @version 2010.12.13
 * 
 */
public class Indexer {
    
    // Directory where the index files are stored
    private static String INDEX_DIR = "E:\\index";
    // Directory containing the files to be indexed
    private static String DATA_DIR = "E:\\rr";
    
    public static void main(String[] args) throws Exception {
        long start = new Date().getTime();
        int numIndexed = index(new File(INDEX_DIR), new File(DATA_DIR));
        long end = new Date().getTime();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }

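    /**
     * Indexes the files under dataDir, stores the index under indexDir,
     * and returns the number of files indexed.
     * 
     * @param indexDir directory that receives the index
     * @param dataDir directory whose files are indexed
     * @return the number of documents in the index
     * @throws IOException if dataDir does not exist or is not a directory
     */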
    public static int index(File indexDir, File dataDir) throws IOException {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir
                    + " does not exist or is not a directory");
        }
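        /*
         * Create the IndexWriter.
         * Argument 1: the Directory holding the index; here a SimpleFSDirectory
         *             over indexDir, so this writer targets the file system.
         * Argument 2: the analyzer used to tokenize field text.
         * Argument 3 (create): true builds a new index, overwriting any existing one;
         *             false appends to an existing index and throws an exception
         *             if none exists.
         * Argument 4: the maximum number of terms indexed per field.
         *             new MaxFieldLength(2) would index only the first two terms
         *             of each field; IndexWriter.MaxFieldLength.LIMITED (10,000 terms)
         *             is the usual choice.
         */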

        Directory dir = new SimpleFSDirectory(indexDir);

        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30), true,
                IndexWriter.MaxFieldLength.LIMITED);
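        try {
            indexDirectory(writer, dataDir);
            // Number of documents currently in the index
            int numIndexed = writer.numDocs();
            writer.optimize(); // merge segments to speed up searching
            writer.commit();   // flush pending changes to the index
            return numIndexed;
        } finally {
            writer.close();    // release the write lock and other resources
        }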
    }

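    /**
     * Recursively walks all files under dir and indexes them.
     * 
     * @param writer the IndexWriter that receives the documents
     * @param dir the directory to walk
     * @throws IOException
     */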
    private static void indexDirectory(IndexWriter writer, File dir)
            throws IOException {

        File[] files = dir.listFiles();
        if (files == null) { // listFiles() returns null on an I/O error
            return;
        }

        for (int i = 0; i < files.length; i++) {
            if (files[i].isDirectory()) {
                // Recurse into subdirectories
                indexDirectory(writer,files[i]);
            } else if(files[i].getName().endsWith(".txt")) {
                indexTxtFile(writer,files[i]);
            }else if(files[i].getName().endsWith(".doc")) {
                indexWordFile(writer,files[i]);
            }else if(files[i].getName().endsWith(".xls")) {
                indexExcelFile(writer,files[i]);
            }
        }
    }

    /**
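     * Indexes Excel 2003 files.
     * There are two ways to read Word files: the jacob package, which can also modify and generate Word documents,
     * or, if only the text content of a Word file is needed, POI.
     * First go to http://www.ibiblio.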
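
The recursive traversal in indexDirectory above can be tried on its own, without Lucene on the classpath. The following is a rough sketch under stated assumptions: `IndexableFileWalker`, `collect`, and `isIndexable` are hypothetical names, and the extension check is case-insensitive, unlike the article's code, on the assumption that a file such as `A.TXT` should also be picked up.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: recursively collects the files the indexer would dispatch to a handler. */
final class IndexableFileWalker {
    // Extensions handled by the article's indexDirectory method.
    private static final String[] EXTENSIONS = {".txt", ".doc", ".xls"};

    /** Returns every file under dir (at any depth) whose name has a handled extension. */
    static List<File> collect(File dir) {
        List<File> result = new ArrayList<>();
        collectInto(dir, result);
        return result;
    }

    private static void collectInto(File dir, List<File> out) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // dir is not a directory, or an I/O error occurred
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collectInto(entry, out); // recurse into subdirectories
            } else if (isIndexable(entry.getName())) {
                out.add(entry);
            }
        }
    }

    /** Case-insensitive extension check (an assumption; the article matches lower-case only). */
    static boolean isIndexable(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String ext : EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File f : collect(root)) {
            System.out.println(f.getPath());
        }
    }
}
```

Collecting the files first and indexing them in a second pass makes the file selection easy to check before any index is written.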
