
Reading text and images from Word docx files with POI

2017-08-26 14:29
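I recently took over an exam-system project. One of its features is bulk import of exam questions: the customer supplies the questions as docx documents, and both the questions and the answers may contain images. This post summarizes the approach I used to read the question data with POI. The jars POI needs can be found and downloaded online; they are described below: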

At the time of writing, the latest POI release is 3.10-FINAL. That release contains the following jars:

Maven artifactId | Prerequisites | JAR
poi | commons-logging, commons-codec, log4j | poi-version-yyyymmdd.jar
poi-scratchpad | poi | poi-scratchpad-version-yyyymmdd.jar
poi-ooxml | poi, poi-ooxml-schemas | poi-ooxml-version-yyyymmdd.jar
poi-ooxml-schemas | xmlbeans | poi-ooxml-schemas-version-yyyymmdd.jar
poi-examples | poi, poi-scratchpad, poi-ooxml | poi-examples-version-yyyymmdd.jar
ooxml-schemas | xmlbeans | ooxml-schemas-1.1.jar

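Many people are unsure which of POI's many jars they actually need to import.

In practice POI is often used only to work with Excel, sometimes only with the xls format.

In that case there is no need to import everything. To decide which jars to use, refer to the following: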


Component Map

The Apache POI distribution consists of support for many document file formats. This support is provided in several Jar files. Not all of the Jars are needed for every format. The following tables show the relationships between POI components, Maven repository tags, and the project's Jar files.

Component | Application type | Maven artifactId | Notes
POIFS | OLE2 Filesystem | poi | Required to work with OLE2 / POIFS based files
HPSF | OLE2 Property Sets | poi |
HSSF | Excel XLS | poi | For HSSF only, if common SS is needed see below
HSLF | PowerPoint PPT | poi-scratchpad |
HWPF | Word DOC | poi-scratchpad |
HDGF | Visio VSD | poi-scratchpad |
HPBF | Publisher PUB | poi-scratchpad |
HSMF | Outlook MSG | poi-scratchpad |
OpenXML4J | OOXML | poi-ooxml plus one of poi-ooxml-schemas, ooxml-schemas | Only one schemas jar is needed, see below for differences
XSSF | Excel XLSX | poi-ooxml |
XSLF | PowerPoint PPTX | poi-ooxml |
XWPF | Word DOCX | poi-ooxml |
Common SS | Excel XLS and XLSX | poi-ooxml | WorkbookFactory and friends all require poi-ooxml, not just core poi

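
The component map above can be read as a lookup from file extension to the Maven artifact that has to be on the classpath. The following is only a sketch of that reading: the class and method names (PoiArtifactLookup, artifactFor) are made up for illustration and are not part of POI, and the mapping covers just the formats listed in the table.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of POI): maps a file extension to the
// Maven artifactId that the component map lists for that format.
final class PoiArtifactLookup {
    private static final Map<String, String> ARTIFACTS = new HashMap<String, String>();
    static {
        ARTIFACTS.put("xls", "poi");             // HSSF only
        ARTIFACTS.put("ppt", "poi-scratchpad");  // HSLF
        ARTIFACTS.put("doc", "poi-scratchpad");  // HWPF
        ARTIFACTS.put("vsd", "poi-scratchpad");  // HDGF
        ARTIFACTS.put("pub", "poi-scratchpad");  // HPBF
        ARTIFACTS.put("msg", "poi-scratchpad");  // HSMF
        // The OOXML formats additionally need one schemas jar
        // (poi-ooxml-schemas or ooxml-schemas), per the table.
        ARTIFACTS.put("xlsx", "poi-ooxml");      // XSSF
        ARTIFACTS.put("pptx", "poi-ooxml");      // XSLF
        ARTIFACTS.put("docx", "poi-ooxml");      // XWPF
    }

    // Returns the artifactId for the file's extension, or null if the
    // file has no extension or the extension is not in the table.
    static String artifactFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ARTIFACTS.get(ext);
    }
}
```

For the exam-question import described here, PoiArtifactLookup.artifactFor("E://123.docx") yields poi-ooxml, which is the artifact the XWPF classes come from.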
Enough talk. Here is the code:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// DocxReader is only a wrapper class so that the method compiles on its own.
public class DocxReader {

    public static void doxc() throws InvalidFormatException {
        String importPath = "E://123.docx";  // document to read
        String absolutePath = "E://qwe//";   // folder the extracted images are saved to
        try (InputStream inputStream = new FileInputStream(importPath)) {
            XWPFDocument xDocument = new XWPFDocument(inputStream);
            List<XWPFParagraph> paragraphs = xDocument.getParagraphs();
            List<XWPFPictureData> pictures = xDocument.getAllPictures();

            // Create the output folder once, before any picture is saved.
            File folder = new File(absolutePath);
            if (!folder.exists()) {
                folder.mkdirs();
            }

            // Save every picture and remember: relationship id -> saved file path.
            Map<String, String> map = new HashMap<String, String>();
            for (XWPFPictureData picture : pictures) {
                String id = picture.getParent().getRelationId(picture);
                String rawName = picture.getFileName();
                String fileExt = rawName.substring(rawName.lastIndexOf("."));
                String newName = System.currentTimeMillis() + UUID.randomUUID().toString() + fileExt;
                File saveFile = new File(folder, newName);
                // try-with-resources closes the stream even if the write fails.
                try (OutputStream fos = new FileOutputStream(saveFile)) {
                    fos.write(picture.getData());
                }
                map.put(id, saveFile.getAbsolutePath());
            }

            // Walk the runs in document order so each image keeps its position in the text.
            StringBuilder text = new StringBuilder();
            for (XWPFParagraph paragraph : paragraphs) {
                List<XWPFRun> runs = paragraph.getRuns();
                for (XWPFRun run : runs) {
                    if (run.getCTR().xmlText().indexOf("<w:drawing>") != -1) {
                        // The run holds an image: cut the relationship id out of r:embed="rIdN".
                        String runXmlText = run.getCTR().xmlText();
                        int rIdIndex = runXmlText.indexOf("r:embed");
                        int rIdEndIndex = runXmlText.indexOf("/>", rIdIndex);
                        String rIdText = runXmlText.substring(rIdIndex, rIdEndIndex);
                        String id = rIdText.split("\"")[1];
                        text.append("<img src = '").append(map.get(id)).append("'/>");
                    } else {
                        // A plain text run: append its text.
                        text.append(run.toString());
                    }
                }
            }
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
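
The file-naming step in the listing above takes the extension with substring(lastIndexOf(".")), which assumes the picture name always contains a dot; without one, substring would be called with -1 and throw. The following is a minimal sketch of the same naming scheme that tolerates a missing extension. PictureNames and its two methods are hypothetical names used only for this illustration.

```java
import java.util.UUID;

// Hypothetical helper: builds the unique file name used when saving a picture.
final class PictureNames {

    // Returns the extension including the leading dot, or "" if the name has none.
    static String extensionOf(String rawName) {
        int dot = rawName.lastIndexOf('.');
        return dot < 0 ? "" : rawName.substring(dot);
    }

    // Same scheme as the listing: timestamp + random UUID + original extension.
    static String uniqueName(String rawName) {
        return System.currentTimeMillis() + UUID.randomUUID().toString() + extensionOf(rawName);
    }
}
```

In the listing, the two lines that compute fileExt and newName could then be replaced by a single call such as PictureNames.uniqueName(picture.getFileName()).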

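If the image is stored as a w:pict element instead, change the check and the attribute name as follows:
if (run.getCTR().xmlText().indexOf("<w:pict>") != -1) {
    String runXmlText = run.getCTR().xmlText();
    int rIdIndex = runXmlText.indexOf("r:id");
    // ... the rest of the branch stays the same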
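
Both variants pull the relationship id out of the run's XML by string position. A regular expression can cover the r:embed and r:id cases in one place. This is only a sketch: RunXml and relationId are hypothetical names, and it assumes, as the code in this post does, that the XML text uses the r: prefix and double-quoted attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the image relationship id from a run's XML text.
final class RunXml {
    // Matches r:embed="rIdN" (used inside w:drawing) or r:id="rIdN" (used inside w:pict).
    private static final Pattern REL_ID = Pattern.compile("r:(?:embed|id)=\"([^\"]+)\"");

    // Returns the first relationship id found, or null if the run has none.
    static String relationId(String runXml) {
        Matcher m = REL_ID.matcher(runXml);
        return m.find() ? m.group(1) : null;
    }
}
```

The returned value (for example rId5) is the same key that the picture map is filled with, so it can be passed straight to map.get(...).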

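This retrieves both the text and the images of the docx file (the image data as well as each image's position in the text). The result can then be stored in the database, with the images linked by their file paths.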