Converting PDF to Text in C#
2012-06-26 13:43
Original article:
http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C
Warning
June 20, 2012: This may not be the best way to parse PDF files (at least not the most efficient one). PDFBox is a great Java library but the IKVM.NET bridge makes it a little slow.
Translator's note: PDFBox is a Java library for converting PDF to text; IKVM.NET wraps it to expose a C# interface, which is why the conversion speed suffers.
However, since a lot of people are still coming here for a PDF parsing solution (and it's been almost 7 years since this article was originally published), I have updated the article and the Visual Studio project so that it works with the latest PDFBox version. It's also possible to download the project with all dependencies, something many people were struggling with.
How to parse PDF files
While extending the indexing solution for an intranet built using the Lucene.NET library, I decided to add support for PDF files. But DotLucene can only handle plain text, so the PDF files had to be converted.
After hours of Googling I found a reasonable solution that uses "pure" .NET: at least there are no dependencies other than a few IKVM.NET assemblies. Before we start with the solution, let's take a look at the other ways I tried.
Using Adobe PDF IFilter
Using Adobe PDF IFilter requires:
Unreliable COM interop to handle the IFilter interface (the combination of IFilter COM and Adobe PDF IFilter is especially troublesome), and
A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.
Using iTextSharp
iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating PDFs rather than reading them, but there are some classes that allow you to read PDF, especially PdfReader. But extracting the text from the hierarchy of objects is not an easy task (PDF is not a simple format; the PDF Reference is a 7 MB compressed PDF file). I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.
Finally: PDFBox
PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument). Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package).
Using PDFBox in .NET requires adding references to:
IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.SwingAWT.dll
pdfbox-1.7.0.dll
and copying the following files to the bin directory:
commons-logging.dll
fontbox-1.7.0.dll
IKVM.OpenJDK.Util.dll
IKVM.Runtime.dll
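A missing assembly usually only surfaces at run time, so it can help to check the output directory up front. The following is a minimal sketch and not part of the original project: the class name PdfBoxDependencyCheck and the method FindMissing are made up for illustration, and the file names are simply the seven listed above.

```csharp
using System.Collections.Generic;
using System.IO;

public static class PdfBoxDependencyCheck
{
    // File names taken from the two lists above (PDFBox 1.7.0 package).
    private static readonly string[] RequiredAssemblies =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.7.0.dll",
        "commons-logging.dll",
        "fontbox-1.7.0.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of the required assemblies that are not present in binDirectory.
    public static List<string> FindMissing(string binDirectory)
    {
        var missing = new List<string>();
        foreach (string name in RequiredAssemblies)
        {
            if (!File.Exists(Path.Combine(binDirectory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Calling PdfBoxDependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) at startup returns an empty list when all seven files are in place.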
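Using PDFBox to parse PDFs is fairly easy:
    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;

    private static string parseUsingPDFBox(string filename)
    {
        PDDocument doc = PDDocument.load(filename);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            doc.close(); // release the file handle
        }
    }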
The size of the required assemblies adds up to almost 18 MB:
IKVM.OpenJDK.Core.dll (4 MB)
IKVM.OpenJDK.SwingAWT.dll (6 MB)
pdfbox-1.7.0.dll (4 MB)
commons-logging.dll (82 kB)
fontbox-1.7.0.dll (180 kB)
IKVM.OpenJDK.Util.dll (2 MB)
IKVM.Runtime.dll (1 MB)
The speed is not so bad: parsing the U.S. Copyright Act PDF (1.4 MB) took about 13 seconds.
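A figure like this is easy to reproduce with System.Diagnostics.Stopwatch. The sketch below is an illustration under assumptions, not the measurement code used for the article: ExtractionTimer and Measure are hypothetical names, and the extractor is passed in as a delegate so that parseUsingPDFBox (or any other parser) can be plugged in.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs extractor(filename), returns the extracted text,
    // and reports the elapsed wall-clock time through 'elapsed'.
    public static string Measure(Func<string, string> extractor, string filename, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string text = extractor(filename);
        watch.Stop();
        elapsed = watch.Elapsed;
        return text;
    }
}
```

The first call in a process usually also pays for assembly loading and JIT compilation, so timing a second call separately tends to give a fairer picture of steady-state speed.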
Related information
See this article (with future updates) at SquarePDF.NET.