
Converting PDF to Text in C#

2012-06-26 13:43
Original article:

http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C

Warning


June 20, 2012: This may not be the best way to parse PDF files (at least not the most efficient one). PDFBox is a great Java library but the IKVM.NET bridge makes it a little slow.

Translator's note: PDFBox is a Java library for converting PDF to text. IKVM.NET wraps that library to expose a C# interface, which is why the conversion speed suffers.

However, since a lot of people are still coming here for a PDF parsing solution (and it's been almost 7 years since this article was originally published), I have updated the article and the Visual Studio project so that it works with the latest PDFBox version. It's also possible to download the project with all dependencies, something many people were struggling with.

How to parse PDF files


While extending the indexing solution for an intranet built using the Lucene.NET library, I decided to add support for PDF files. But DotLucene can only handle plain text, so the PDF files had to be converted.

After hours of Googling I found a reasonable solution that uses "pure" .NET - at least there are no dependencies other than a few assemblies of IKVM.NET. Before we start with the solution, let's take a look at the other ways I tried.

Using Adobe PDF IFilter

Using Adobe PDF IFilter requires:

Unreliable COM interop to handle the IFilter interface (the combination of IFilter COM and Adobe PDF IFilter is especially troublesome), and
A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.

Using iTextSharp

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating, not reading, PDFs, but there are some classes that allow you to read PDF files, especially PdfReader. Extracting the text from the hierarchy of objects is not an easy task, though (PDF is not a simple format: the PDF Reference is a 7 MB, compressed, PDF file). I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.

Finally: PDFBox

PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package).

Using PDFBox in .NET requires adding references to:

IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.SwingAWT.dll
pdfbox-1.7.0.dll

and copying the following files to the bin directory:

commons-logging.dll
fontbox-1.7.0.dll
IKVM.OpenJDK.Util.dll
IKVM.Runtime.dll
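
Since missing dependencies were a common stumbling block, a quick check that all seven files are present in the output directory may save some debugging. The following is only a minimal sketch: the class name PdfBoxDependencyCheck and the method FindMissing are hypothetical helpers, not part of PDFBox or IKVM.NET, and the file names are the ones listed above for PDFBox 1.7.0 (other versions will differ).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PdfBoxDependencyCheck
{
    // File names as listed in the article for PDFBox 1.7.0 (an assumption for any other version).
    private static readonly string[] RequiredAssemblies =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.7.0.dll",
        "commons-logging.dll",
        "fontbox-1.7.0.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of the required assemblies that are not present in binDirectory.
    public static List<string> FindMissing(string binDirectory)
    {
        var missing = new List<string>();
        foreach (string name in RequiredAssemblies)
        {
            if (!File.Exists(Path.Combine(binDirectory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Calling PdfBoxDependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) at startup and logging the result is one way to use it; an empty list means every listed file was found.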

Using the PDFBox to parse PDFs is fairly easy:

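using org.apache.pdfbox.pdmodel;   // PDDocument
using org.apache.pdfbox.util;      // PDFTextStripper (PDFBox 1.x package)
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    try
    {
        // Extract the text of all pages as a single string.
        return new PDFTextStripper().getText(doc);
    }
    finally { doc.close(); } // always release the document
}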
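
For an indexing scenario it can be handy to convert a whole folder at once. The sketch below is an illustration under assumptions, not part of the original project: PdfBatchConverter and ConvertDirectory are hypothetical names, and the extraction function is passed in as a delegate so the helper itself has no PDFBox dependency. The parseUsingPDFBox method above would be a natural argument.

```csharp
using System;
using System.IO;

public static class PdfBatchConverter
{
    // Converts every *.pdf file in 'directory' to a *.txt file with the same base name.
    // 'extractText' is any function that maps a PDF path to its text content.
    // Returns the number of files converted.
    public static int ConvertDirectory(string directory, Func<string, string> extractText)
    {
        int converted = 0;
        foreach (string pdfPath in Directory.GetFiles(directory, "*.pdf"))
        {
            string text = extractText(pdfPath);
            File.WriteAllText(Path.ChangeExtension(pdfPath, ".txt"), text);
            converted++;
        }
        return converted;
    }
}
```

The resulting plain-text files can then be fed to the indexer. Exceptions from a damaged PDF are not caught here, so one bad file stops the batch; whether to skip or fail is a design choice left open.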


The size of the required assemblies adds up to almost 18 MB:

IKVM.OpenJDK.Core.dll (4 MB)
IKVM.OpenJDK.SwingAWT.dll (6 MB)
pdfbox-1.7.0.dll (4 MB)

commons-logging.dll (82 kB)
fontbox-1.7.0.dll (180 kB)
IKVM.OpenJDK.Util.dll (2 MB)
IKVM.Runtime.dll (1 MB)
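
The sizes above are rounded, so the total for a particular PDFBox package is best measured on disk. A minimal sketch, assuming a hypothetical helper named AssemblySizeReport.TotalMegabytes and taking 1 MB as 1024 * 1024 bytes:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class AssemblySizeReport
{
    // Sums the on-disk size of the given files and returns it in megabytes
    // (1 MB = 1024 * 1024 bytes). Throws FileNotFoundException if a file is missing.
    public static double TotalMegabytes(IEnumerable<string> paths)
    {
        long totalBytes = 0;
        foreach (string path in paths)
        {
            totalBytes += new FileInfo(path).Length;
        }
        return totalBytes / (1024.0 * 1024.0);
    }
}
```

Passing the full paths of the seven assemblies listed above gives the deployment overhead for the version actually in use.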

The speed is not so bad: parsing the U.S. Copyright Act PDF (1.4 MB) took about 13 seconds.
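
To reproduce a measurement like this on your own files, the extraction call can be wrapped in a System.Diagnostics.Stopwatch. This is a sketch only: ExtractionTimer and Measure are hypothetical names, and the extraction function is again passed in as a delegate (for example parseUsingPDFBox from above).

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs 'extractText' on 'filename', returns the elapsed time,
    // and hands the extracted text back through the 'text' out parameter.
    public static TimeSpan Measure(Func<string, string> extractText, string filename, out string text)
    {
        Stopwatch watch = Stopwatch.StartNew();
        text = extractText(filename);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

The first call in a process likely includes one-time startup cost for the runtime and the library, so timing a second call separately may give a more representative figure for batch work.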

Related information

See this article (with future updates) at SquarePDF.NET.