How to parse PDF files (repost)
2010-04-16 22:50
While extending the indexing solution for an intranet built using the DotLucene fulltext search library, I decided to add support for PDF files. But DotLucene can only handle plain text, so the PDF files had to be converted.
After hours of Googling I found a reasonable solution that uses "pure" .NET; at least, there are no dependencies other than a few IKVM.NET assemblies. Before we get to the solution, let's take a look at the other approaches I tried.
Using Adobe PDF IFilter
Using Adobe PDF IFilter requires:
- Using unreliable COM interop to handle the IFilter interface (the combination of IFilter COM and Adobe PDF IFilter is especially troublesome).
- A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.
Read more about using IFilter in Microsoft Office Documents Parsing.
Using iTextSharp
iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating PDFs rather than reading them, but some classes, PdfReader in particular, allow you to read a PDF. Extracting the text from the hierarchy of objects is not an easy task, however: PDF is not a simple format, and the PDF Reference is itself a 7 MB compressed PDF file. I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.
Finally: PDFBox
PDFBox is another Java PDF library. It is also ready to use with the original Java Lucene (see LucenePDFDocument).
Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package; it's in the bin directory).
Using PDFBox in .NET requires adding references to:
- PDFBox-0.7.2.dll
- IKVM.GNU.Classpath.dll
and copying IKVM.Runtime.dll to the bin directory.
Using PDFBox to parse PDFs is fairly easy:
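```csharp
using org.pdfbox.pdmodel;   // PDDocument
using org.pdfbox.util;      // PDFTextStripper

// Loads the PDF and returns all of its text as a single string.
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    PDFTextStripper stripper = new PDFTextStripper();
    string text = stripper.getText(doc);
    doc.close(); // release the underlying file
    return text;
}
```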
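If only part of a document needs to be indexed, the same approach can be narrowed to a page range. The following is a minimal sketch, not part of the original article: the class name `PdfTextExtractor` and the method `ExtractPages` are hypothetical, and it assumes the PDFBox 0.7.x build in use exposes `setStartPage` and `setEndPage` on `PDFTextStripper` (1-based, inclusive) and `close` on `PDDocument`. The `try`/`finally` makes sure the document is closed even if extraction throws.

```csharp
using org.pdfbox.pdmodel;   // PDDocument (namespace assumed from PDFBox 0.7.x)
using org.pdfbox.util;      // PDFTextStripper

public static class PdfTextExtractor
{
    // Hypothetical helper: returns the text of pages firstPage..lastPage
    // (1-based, inclusive) of the given PDF file.
    public static string ExtractPages(string filename, int firstPage, int lastPage)
    {
        PDDocument doc = PDDocument.load(filename);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(firstPage); // assumed available in this PDFBox version
            stripper.setEndPage(lastPage);
            return stripper.getText(doc);
        }
        finally
        {
            doc.close(); // always release the file, even on failure
        }
    }
}
```

For example, `PdfTextExtractor.ExtractPages("manual.pdf", 1, 10)` would return the text of the first ten pages, where `manual.pdf` is a placeholder file name.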
The size of the required assemblies adds up to almost 16 MB:
- IKVM.GNU.Classpath.dll (7 MB)
- IKVM.Runtime.dll (360 kB)
- PDFBox-0.7.2.dll (8 MB)
The speed is not so bad: parsing the U.S. Copyright Act PDF (1.4 MB) took about 7 seconds.
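To get a comparable figure for your own documents, the extraction call can be wrapped in a timer. This is a rough sketch rather than the measurement method used for the number above: the delegate `TextExtractor` and the class `ExtractionTimer` are hypothetical names, and only `System.Diagnostics.Stopwatch` from the .NET Framework is used. The first call in a process likely includes one-time startup cost of the IKVM runtime, so timing a second call as well may give a fairer picture.

```csharp
using System;
using System.Diagnostics;

// Hypothetical delegate type: any method that turns a file name into text.
public delegate string TextExtractor(string filename);

public static class ExtractionTimer
{
    // Runs the extractor once, prints the elapsed wall-clock time,
    // and returns the extracted text.
    public static string Measure(TextExtractor extract, string filename)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string text = extract(filename);
        watch.Stop();
        Console.WriteLine("{0}: {1} characters in {2:F1} s",
            filename, text.Length, watch.Elapsed.TotalSeconds);
        return text;
    }
}
```

Called from the class that defines the extraction method, this could look like `ExtractionTimer.Measure(parseUsingPDFBox, "copyright.pdf")`, where the file name is a placeholder.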
Related information
See this article (with future updates) on DotLucene: PDF Documents Parsing.
License
This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt, please contact the author via the discussion board of the original article.
Reposted from: http://www.codeproject.com/KB/string/pdf2text.aspx