
Reading text and images from Word docx files with POI

2017-08-26 14:29

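I recently took over an exam-system project. One of its features is bulk import of exam questions. The customer supplies the questions as docx documents, and both the questions and the answers may contain images, so I have written up an approach for reading the question content with POI. The jars POI needs can be found and downloaded online; the individual packages are described below: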

The jar list below is for POI 3.10-FINAL. That release contains the following jars:

Maven artifactId	Prerequisites	JAR
poi	commons-logging, commons-codec, log4j	poi-version-yyyymmdd.jar
poi-scratchpad	poi	poi-scratchpad-version-yyyymmdd.jar
poi-ooxml	poi, poi-ooxml-schemas	poi-ooxml-version-yyyymmdd.jar
poi-ooxml-schemas	xmlbeans	poi-ooxml-schemas-version-yyyymmdd.jar
poi-examples	poi, poi-scratchpad, poi-ooxml	poi-examples-version-yyyymmdd.jar
ooxml-schemas	xmlbeans	ooxml-schemas-1.1.jar

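Many people are unsure which of POI's many jars they should import.

In practice POI is often used only to work with Excel, sometimes only with the xls format.

In that case there is no need to import everything. To decide which jar you need, refer to the following: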

Component Map

The Apache POI distribution consists of support for many document file formats. This support is provided in several Jar files. Not all of the Jars are needed for every format. The following tables show the relationships between POI components, Maven repository tags, and the project's Jar files.

Component	Application type	Maven artifactId	Notes
POIFS	OLE2 Filesystem	poi	Required to work with OLE2 / POIFS based files
HPSF	OLE2 Property Sets	poi
HSSF	Excel XLS	poi	For HSSF only, if common SS is needed see below
HSLF	PowerPoint PPT	poi-scratchpad
HWPF	Word DOC	poi-scratchpad
HDGF	Visio VSD	poi-scratchpad
HPBF	Publisher PUB	poi-scratchpad
HSMF	Outlook MSG	poi-scratchpad
OpenXML4J	OOXML	poi-ooxml plus one of poi-ooxml-schemas, ooxml-schemas	Only one schemas jar is needed, see below for differences
XSSF	Excel XLSX	poi-ooxml
XSLF	PowerPoint PPTX	poi-ooxml
XWPF	Word DOCX	poi-ooxml
Common SS	Excel XLS and XLSX	poi-ooxml	WorkbookFactory and friends all require poi-ooxml, not just core poi
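
As a quick way to read the table, the sketch below maps a file extension to the Maven artifactId listed for it above. `PoiArtifacts` and `artifactFor` are hypothetical names made up for this illustration and are not part of POI; the mapping only restates the table rows (the OOXML formats additionally need one of the schemas jars, as noted in the table).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of POI): restates the Component Map table
// as a lookup from file extension to Maven artifactId.
final class PoiArtifacts {
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put("xls", "poi");             // HSSF
        BY_EXTENSION.put("ppt", "poi-scratchpad");  // HSLF
        BY_EXTENSION.put("doc", "poi-scratchpad");  // HWPF
        BY_EXTENSION.put("vsd", "poi-scratchpad");  // HDGF
        BY_EXTENSION.put("pub", "poi-scratchpad");  // HPBF
        BY_EXTENSION.put("msg", "poi-scratchpad");  // HSMF
        BY_EXTENSION.put("xlsx", "poi-ooxml");      // XSSF
        BY_EXTENSION.put("pptx", "poi-ooxml");      // XSLF
        BY_EXTENSION.put("docx", "poi-ooxml");      // XWPF
    }

    // Returns the artifactId for a file name or bare extension.
    static String artifactFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = (dot < 0 ? fileName : fileName.substring(dot + 1)).toLowerCase(Locale.ROOT);
        String artifact = BY_EXTENSION.get(ext);
        if (artifact == null) {
            throw new IllegalArgumentException("No POI component listed for: " + fileName);
        }
        return artifact;
    }

    public static void main(String[] args) {
        System.out.println(artifactFor("123.docx"));   // prints "poi-ooxml"
        System.out.println(artifactFor("legacy.XLS")); // prints "poi"
    }
}
```

For the docx import described in this post, the relevant row is XWPF, so `poi-ooxml` (plus its prerequisites) is the jar to add.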

Enough preamble. Here is the code:

// Imports needed by the method below (place them at the top of the class file).
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public static void doxc() throws InvalidFormatException {
    String importPath = "E://123.docx";  // document to import
    String absolutePath = "E://qwe//";   // folder that receives the extracted images
    // try-with-resources closes the input stream even if parsing fails
    try (FileInputStream inputStream = new FileInputStream(importPath)) {
        XWPFDocument xDocument = new XWPFDocument(inputStream);
        List<XWPFParagraph> paragraphs = xDocument.getParagraphs();
        List<XWPFPictureData> pictures = xDocument.getAllPictures();

        // Create the output folder once, before saving any picture
        File folder = new File(absolutePath);
        if (!folder.exists()) {
            folder.mkdirs();
        }

        // Step 1: save every picture and remember relationship id -> saved path
        Map<String, String> map = new HashMap<String, String>();
        for (XWPFPictureData picture : pictures) {
            String id = picture.getParent().getRelationId(picture);
            String rawName = picture.getFileName();
            int dot = rawName.lastIndexOf(".");
            // Guard against names without an extension (substring(-1) would throw)
            String fileExt = dot < 0 ? "" : rawName.substring(dot);
            String newName = System.currentTimeMillis() + UUID.randomUUID().toString() + fileExt;
            File saveFile = new File(folder, newName);
            // Close the output stream after writing; the original leaked it
            try (FileOutputStream fos = new FileOutputStream(saveFile)) {
                fos.write(picture.getData());
            }
            System.out.println(id + " -> " + saveFile.getAbsolutePath());
            map.put(id, saveFile.getAbsolutePath());
        }

        // Step 2: walk the runs in document order, emitting text or an <img> tag
        StringBuilder text = new StringBuilder();
        for (XWPFParagraph paragraph : paragraphs) {
            List<XWPFRun> runs = paragraph.getRuns();
            for (XWPFRun run : runs) {
                String runXmlText = run.getCTR().xmlText();
                if (runXmlText.indexOf("<w:drawing>") != -1) {
                    // Slice out r:embed="rIdN" and take the value between the quotes
                    int rIdIndex = runXmlText.indexOf("r:embed");
                    int rIdEndIndex = runXmlText.indexOf("/>", rIdIndex);
                    String rIdText = runXmlText.substring(rIdIndex, rIdEndIndex);
                    String id = rIdText.split("\"")[1];
                    text.append("<img src = '").append(map.get(id)).append("'/>");
                } else {
                    text.append(run); // appends run.toString(), i.e. the run's text
                }
            }
        }
        System.out.println(text);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

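If the image is stored in pict form (a w:pict element instead of w:drawing), replace the condition and the attribute lookup with the following: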

if(run.getCTR().xmlText().indexOf("<w:pict>")!=-1){
String runXmlText = run.getCTR().xmlText();
int rIdIndex = runXmlText.indexOf("r:id");
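
Both variants locate the relationship id by slicing the run XML with `indexOf` and `split`, which depends on the exact attribute layout. The sketch below is one possible alternative that uses regular expressions and handles the drawing and pict forms in a single method. `RunImageIds` and `relationId` are hypothetical names, not POI API, and the sample XML strings are simplified stand-ins for what `run.getCTR().xmlText()` returns; it also assumes attributes are written with double quotes, as the original code does.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of POI): pulls the image relationship id
// out of a run's XML text for both the w:drawing and the w:pict form.
final class RunImageIds {
    // DrawingML images reference the picture part through r:embed="rIdN"
    private static final Pattern EMBED = Pattern.compile("\\br:embed=\"([^\"]+)\"");
    // VML (w:pict) images reference it through r:id="rIdN"
    private static final Pattern VML_ID = Pattern.compile("\\br:id=\"([^\"]+)\"");

    static Optional<String> relationId(String runXml) {
        Pattern pattern;
        if (runXml.contains("<w:drawing")) {
            pattern = EMBED;
        } else if (runXml.contains("<w:pict")) {
            pattern = VML_ID;
        } else {
            return Optional.empty(); // plain text run, no image
        }
        Matcher m = pattern.matcher(runXml);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String drawing = "<w:r><w:drawing><a:blip r:embed=\"rId5\"/></w:drawing></w:r>";
        String pict = "<w:r><w:pict><v:imagedata r:id=\"rId8\" o:title=\"\"/></w:pict></w:r>";
        System.out.println(relationId(drawing).orElse("none")); // prints "rId5"
        System.out.println(relationId(pict).orElse("none"));    // prints "rId8"
        System.out.println(relationId("<w:r><w:t>text</w:t></w:r>").orElse("none")); // prints "none"
    }
}
```

Returning an `Optional` lets the caller fall back to appending the run's text when no image id is found, instead of relying on an index that may be -1.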

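This retrieves the full text and images of the docx (both the image data and where each image sits in the text). The result can then be stored in the database, with each image linked by its file path.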
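
One detail worth checking before storing the string: the code concatenates run text and `<img>` tags without escaping, and `map.get(id)` yields the literal text `null` when an id is missing from the map. The sketch below shows one way to handle both, assuming the stored text is later rendered as HTML (the post does not say how it is displayed, so that is an assumption). `QuestionHtml`, `escape`, and `imgTag` are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of POI): builds the stored question text so
// that document text cannot be mistaken for markup.
final class QuestionHtml {
    // Escapes the characters that are significant in HTML text and attributes.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Returns an <img> tag for a known relationship id, or "" for an unknown one.
    static String imgTag(Map<String, String> pathsById, String relationId) {
        String path = pathsById.get(relationId);
        if (path == null) {
            return ""; // skip instead of emitting src='null'
        }
        return "<img src='" + escape(path) + "'/>";
    }

    public static void main(String[] args) {
        Map<String, String> paths = new LinkedHashMap<>();
        paths.put("rId5", "E:/qwe/a.png"); // sample path, made up for the demo
        String text = escape("If a < b, which figure applies? ") + imgTag(paths, "rId5");
        System.out.println(text);
        // prints: If a &lt; b, which figure applies? <img src='E:/qwe/a.png'/>
    }
}
```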
