Date: 2014-05-17

Problem building a Lucene index for a PDF file and searching it
The indexing program is as follows:
Java code
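import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class testIndexer {
    private IndexWriter writer;

    // Creates a new index in indexDir (create = true overwrites any existing one).
    public testIndexer(String indexDir) throws CorruptIndexException, LockObtainFailedException, IOException
    {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,new IKAnalyzer(),true,IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // Closes the writer; in Lucene this also commits pending changes.
    public void close() throws CorruptIndexException, IOException
    {
        writer.close();
    }

    // Extracts the text of one PDF and adds it as a single stored, analyzed field.
    public void indexPDFile(String filename) throws Exception
    {
        File file = new File(filename);
        // PdfExtractor is not shown in the post; it is assumed to return the PDF's text.
        String content = PdfExtractor.getText(file);
        Document doc = new Document();
        doc.add(new Field("content",content,Field.Store.YES,Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    public static void main(String args[])
    {
        String path="k:/aaaa";
        String pdfile="k:/kaks.pdf";
        try {
            testIndexer indx = new testIndexer(path);
            indx.indexPDFile(pdfile);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}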

The search program is as follows:
Java code

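import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class testSearch {
    public static void main(String args[]) throws IOException, ParseException
    {
        String indexDir = "K:/aaaa";
        String q = "分子";
        search(indexDir,q);
    }

    // Parses q against the "content" field and prints the stored text of the top 10 hits.
    public static void search(String indexDir,String q) throws IOException,ParseException
    {
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher is = new IndexSearcher(dir,true);  // true = open read-only
        QueryParser parser = new QueryParser(Version.LUCENE_35,"content",new IKAnalyzer());
        Query query = parser.parse(q);

        TopDocs hits = is.search(query, 10);
        for(ScoreDoc scoreDoc:hits.scoreDocs)
        {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("content"));
        }
        is.close();
    }
}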


After indexing, the aaaa folder contains only three files: _0.fdt, _0.fdx, and write.lock.

Running the search produces the following error:
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.SimpleFSDirectory@K:\aaaa lockFactory=org.apache.lucene.store.NativeFSLockFactory@ec16a4: files: [write.lock, _0.fdt, _0.fdx]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:712)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:462)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:322)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:110)
at com.index_search.testSearch.search(testSearch.java:24)
at com.index_search.testSearch.main(testSearch.java:19)
How can I build an index for a PDF and then search for keywords in it?

------Solution--------------------
You can parse the PDF with Tika.
------Solution--------------------
You need to use the Tika parser to extract the text.
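------Solution--------------------
What you probably need is a table that associates keywords with document locations, and then you index by keyword. If you index the file directly, you have to read the PDF's content, and that fails when the PDF consists entirely of images. So in general an index is built from a dedicated mapping table rather than by reading the files directly.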
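The mapping table described in this answer could look roughly like the sketch below. It is only an illustration: `KeywordTable`, `add`, and `lookup` are hypothetical names, the keyword "molecule" stands in for the poster's query term, and the keywords are assumed to be supplied by hand rather than extracted from the file.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical keyword-to-path mapping table; this is not a Lucene API.
class KeywordTable {
    private final Map<String, Set<String>> pathsByKeyword = new HashMap<String, Set<String>>();

    // Associates one keyword with the path of a document it describes.
    void add(String keyword, String path) {
        Set<String> paths = pathsByKeyword.get(keyword);
        if (paths == null) {
            paths = new TreeSet<String>();
            pathsByKeyword.put(keyword, paths);
        }
        paths.add(path);
    }

    // Returns the paths registered for a keyword, or an empty set if there are none.
    Set<String> lookup(String keyword) {
        Set<String> paths = pathsByKeyword.get(keyword);
        if (paths == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(paths);
    }

    public static void main(String[] args) {
        KeywordTable table = new KeywordTable();
        // Keywords are entered manually, so this also covers image-only PDFs.
        table.add("molecule", "k:/kaks.pdf");
        System.out.println(table.lookup("molecule")); // prints [k:/kaks.pdf]
        System.out.println(table.lookup("atom"));     // prints []
    }
}
```

The keywords from such a table could then be indexed in Lucene as a field of their own, with the path kept as a stored field, instead of indexing the extracted PDF text.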
------Solution--------------------
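The exception and the directory listing both point to an index that was never committed. `main` in `testIndexer` never calls `indx.close()`, so the `IndexWriter` most likely exited while still holding `write.lock` and without writing a `segments_N` file. Calling `indx.close()` in a `finally` block after `indx.indexPDFile(pdfile)` should fix it. As a quick check before searching, the minimal sketch below tests whether a directory listing contains a `segments*` file. `IndexDirCheck` and `hasSegmentsFile` are hypothetical helper names, not part of Lucene, and the "committed" listing is an assumed example.

```java
import java.io.File;

// Hypothetical helper, not part of Lucene: checks a directory listing for
// the segments* file that the exception above says is missing.
class IndexDirCheck {

    // Pure check on file names, so it can be tried without a real index.
    static boolean hasSegmentsFile(String[] fileNames) {
        if (fileNames == null) {
            return false; // File.list() returns null when the path is not a directory
        }
        for (String name : fileNames) {
            if (name.startsWith("segments")) {
                return true;
            }
        }
        return false;
    }

    // Convenience overload that lists a directory on disk.
    static boolean hasSegmentsFile(File indexDir) {
        return hasSegmentsFile(indexDir.list());
    }

    public static void main(String[] args) {
        // The listing reported in the exception: no commit was ever written.
        String[] reported = {"write.lock", "_0.fdt", "_0.fdx"};
        System.out.println(hasSegmentsFile(reported));  // prints false

        // An assumed listing of a committed index, for comparison.
        String[] committed = {"segments_1", "_0.fdt", "_0.fdx"};
        System.out.println(hasSegmentsFile(committed)); // prints true
    }
}
```

If the check still returns false after `close()` has been added, the next thing to look at is whether `indexPDFile` threw an exception before `addDocument` was reached.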