Extracting text with PDFBox: approaches to a solution


Date: 2014-05-17  Views: 20587

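Extracting text content with PDFBox
I can extract all of the text in a document, and extracting it page by page also works, but what I want now is the text between two bookmarks.
The official site says this is possible, but I cannot get it to work in actual code. Has anyone done this? Working code would be ideal.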

C# code



  private void Form1_Load(object sender, EventArgs e)
        {
            string path = Application.StartupPath + "\\大话设计模式.pdf";
            FileInfo file = new FileInfo(path);
            getStringFormPDF(file);
            //getBook(file);
        }


        public void getBook(FileInfo PDF)
        {
            Hashtable ht = new Hashtable();
            

            PDDocument doc = PDDocument.load(PDF.FullName);
            PDDocumentOutline root = doc.getDocumentCatalog().getDocumentOutline();
            PDOutlineItem item = root.getFirstChild();
            while (item != null)
            {
                MessageBox.Show("Item:" + item.getTitle());
                PDOutlineItem child = item.getFirstChild();
                while (child != null)
                {
                    MessageBox.Show("    Child:" + child.getTitle());
                    child = child.getNextSibling();
                }

                //ht.Add(item,1);
                item = item.getNextSibling();
            }

        }

        public string getStringFormPDF(FileInfo val_PDFInfo)
        {
     
            PDDocument doc = PDDocument.load(val_PDFInfo.FullName);
            PDDocumentOutline root = doc.getDocumentCatalog().getDocumentOutline();
            PDOutlineItem item = root.getFirstChild();
            PDFTextStripper pdfStripper = new PDFTextStripper();
            while (item != null)
            {
                //MessageBox.Show("Item:" + item.getTitle());

                if (item.getTitle().Trim()=="目录")
                {
                    item = pdfStripper.getStartBookmark();// start bookmark
                }

// The problem is here: after the first "目录" branch is entered, an error is raised at this point.

                if (item.getTitle().Trim() == "目录2")// the error is thrown here
                {
                    item = pdfStripper.getEndBookmark();// end bookmark
                }

                //PDOutlineItem child = item.getFirstChild();
                //while (child != null)
                //{
                //    MessageBox.Show("    Child:" + child.getTitle());
                //    child = child.getNextSibling();
                //}

                item = item.getNextSibling();

             
            }
            string text = pdfStripper.getText(doc);


            return text;

        }

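Any help from the experts would be much appreciated. My last two 100-point threads both ended without a satisfactory answer, so I am hoping this one gets solved.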

------Solution--------------------
Thanks, OP.
------Solution--------------------
Call these two setters on the PDFTextStripper:
setStartBookmark
setEndBookmark
and then retrieve the text.
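
The suggestion above can be sketched as follows. The likely cause of the error in the question is that getStartBookmark() and getEndBookmark() are getters: on a freshly created PDFTextStripper they return null, so assigning their result to item makes the next item.getTitle() call fail with a null reference. The sketch instead remembers the matching outline items and passes them to the setters. It assumes PDFBox 1.x compiled for .NET through IKVM (which is what the question's PDDocument.load(string) call suggests); the class and method names BookmarkText and GetTextBetweenBookmarks are hypothetical, and only top-level bookmarks are searched. As far as I know, PDFBox resolves the bookmarks to page numbers, so the result covers whole pages rather than the exact positions of the bookmarks.

```csharp
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline;
using org.apache.pdfbox.util; // PDFBox 1.x; in 2.x PDFTextStripper lives in org.apache.pdfbox.text

public static class BookmarkText
{
    // Hypothetical helper: returns the text between two top-level bookmarks,
    // matched by their trimmed titles.
    public static string GetTextBetweenBookmarks(string pdfPath, string startTitle, string endTitle)
    {
        PDDocument doc = PDDocument.load(pdfPath);
        try
        {
            PDDocumentOutline root = doc.getDocumentCatalog().getDocumentOutline();
            if (root == null)
            {
                throw new InvalidOperationException("The PDF has no outline (bookmarks).");
            }

            PDOutlineItem start = null;
            PDOutlineItem end = null;

            // Walk the top-level bookmarks without ever overwriting the loop variable.
            for (PDOutlineItem item = root.getFirstChild(); item != null; item = item.getNextSibling())
            {
                string title = item.getTitle();
                if (title == null)
                {
                    continue;
                }
                title = title.Trim();

                if (start == null && title == startTitle)
                {
                    start = item;
                }
                else if (start != null && title == endTitle)
                {
                    end = item;
                    break;
                }
            }

            if (start == null || end == null)
            {
                throw new ArgumentException("Start or end bookmark was not found.");
            }

            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartBookmark(start); // setters, not getters
            stripper.setEndBookmark(end);
            return stripper.getText(doc);
        }
        finally
        {
            doc.close(); // release the underlying file handle
        }
    }
}
```

With the titles from the question, a call would look like BookmarkText.GetTextBetweenBookmarks(path, "目录", "目录2"). If the bookmarks of interest are nested, the loop would need to descend through getFirstChild() as the getBook method in the question does.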
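------Solution--------------------
I don't know this one.
I used to use DynaPDF from VB for this kind of PDF work, and it was very convenient,
but it is a German product and you have to pay for it.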
