Reading text and images from Word docx files with POI
2017-08-26 14:29
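I recently took over an exam-system project. One of its features is bulk import of exam questions. The customer supplies the questions as docx documents, and both the questions and the answers may contain images, so I wrote up the approach I used to read the question data with POI. The jars POI needs can be found and downloaded online; the individual jars are described below.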
Many people are unsure which of POI's many jars they actually need to import.
The list below is based on the POI 3.10-FINAL release, which ships the following jars:
Maven artifactId | Prerequisites | JAR |
---|---|---|
poi | commons-logging, commons-codec, log4j | poi-version-yyyymmdd.jar |
poi-scratchpad | poi | poi-scratchpad-version-yyyymmdd.jar |
poi-ooxml | poi, poi-ooxml-schemas | poi-ooxml-version-yyyymmdd.jar |
poi-ooxml-schemas | xmlbeans | poi-ooxml-schemas-version-yyyymmdd.jar |
poi-examples | poi, poi-scratchpad, poi-ooxml | poi-examples-version-yyyymmdd.jar |
ooxml-schemas | xmlbeans | ooxml-schemas-1.1.jar |
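In practice we often use POI only to work with Excel, sometimes only with the xls format, so there is no need to import every jar. Which jar to use for which format is described below: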
Component Map
The Apache POI distribution consists of support for many document file formats. This support is provided in several Jar files. Not all of the Jars are needed for every format. The following tables show the relationships between POI components, Maven repository tags, and the project's Jar files.
Component | Application type | Maven artifactId | Notes |
---|---|---|---|
POIFS | OLE2 Filesystem | poi | Required to work with OLE2 / POIFS based files |
HPSF | OLE2 Property Sets | poi | |
HSSF | Excel XLS | poi | For HSSF only, if common SS is needed see below |
HSLF | PowerPoint PPT | poi-scratchpad | |
HWPF | Word DOC | poi-scratchpad | |
HDGF | Visio VSD | poi-scratchpad | |
HPBF | Publisher PUB | poi-scratchpad | |
HSMF | Outlook MSG | poi-scratchpad | |
OpenXML4J | OOXML | poi-ooxml plus one of poi-ooxml-schemas, ooxml-schemas | Only one schemas jar is needed, see below for differences |
XSSF | Excel XLSX | poi-ooxml | |
XSLF | PowerPoint PPTX | poi-ooxml | |
XWPF | Word DOCX | poi-ooxml | |
Common SS | Excel XLS and XLSX | poi-ooxml | WorkbookFactory and friends all require poi-ooxml, not just core poi |
The code is as follows:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public static void doxc() throws InvalidFormatException {
    String importPath = "E://123.docx";  // docx file to read
    String absolutePath = "E://qwe//";   // folder the extracted images are saved to
    // try-with-resources closes the input stream even if reading fails
    try (FileInputStream inputStream = new FileInputStream(importPath)) {
        XWPFDocument xDocument = new XWPFDocument(inputStream);
        List<XWPFParagraph> paragraphs = xDocument.getParagraphs();
        List<XWPFPictureData> pictures = xDocument.getAllPictures();
        File folder = new File(absolutePath);
        if (!folder.exists()) {
            folder.mkdirs();
        }
        // Save every image to disk and remember: relationship id -> saved file path
        Map<String, String> map = new HashMap<String, String>();
        for (XWPFPictureData picture : pictures) {
            String id = picture.getParent().getRelationId(picture);
            String rawName = picture.getFileName();
            int dot = rawName.lastIndexOf(".");
            String fileExt = dot >= 0 ? rawName.substring(dot) : "";
            // Timestamp plus UUID keeps the saved file names unique
            String newName = System.currentTimeMillis() + UUID.randomUUID().toString() + fileExt;
            File saveFile = new File(folder, newName);
            // Close the output stream after writing so the file handle is not leaked
            try (FileOutputStream fos = new FileOutputStream(saveFile)) {
                fos.write(picture.getData());
            }
            System.out.println(id);
            System.out.println(saveFile.getAbsolutePath());
            map.put(id, saveFile.getAbsolutePath());
        }
        // Walk the runs in document order so each image keeps its position in the text
        StringBuilder text = new StringBuilder();
        for (XWPFParagraph paragraph : paragraphs) {
            for (XWPFRun run : paragraph.getRuns()) {
                String runXmlText = run.getCTR().xmlText();
                if (runXmlText.indexOf("<w:drawing>") != -1) {
                    // The image reference has the form r:embed="rIdN" ... />
                    int rIdIndex = runXmlText.indexOf("r:embed");
                    int rIdEndIndex = runXmlText.indexOf("/>", rIdIndex);
                    String rIdText = runXmlText.substring(rIdIndex, rIdEndIndex);
                    String id = rIdText.split("\"")[1];
                    System.out.println(map.get(id));
                    text.append("<img src = '").append(map.get(id)).append("'/>");
                } else {
                    // Appending the run uses its toString(), i.e. the run's text
                    text.append(run);
                }
            }
        }
        System.out.println(text);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
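If the image is stored in the pict format, replace the start of the image branch with the following; the rest of the branch stays the same:
if (run.getCTR().xmlText().indexOf("<w:pict>") != -1) {
    String runXmlText = run.getCTR().xmlText();
    int rIdIndex = runXmlText.indexOf("r:id");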
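The string searches above depend on the exact layout of the run XML. As a possible alternative, here is a minimal sketch that pulls the relationship id out with a regular expression and covers both the `r:embed` (drawing) and `r:id` (pict) forms. The class name `RelationIdExtractor` and the method `extractRelationId` are hypothetical names introduced for this illustration, not part of POI, and the sketch assumes the relationships namespace uses the prefix `r`, as the code above does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelationIdExtractor {

    // Matches r:embed="rId5" (w:drawing) or r:id="rId7" (w:pict) and captures the id.
    // The leading \b stops the pattern from matching inside a longer prefix such as "other:id".
    private static final Pattern REL_ID =
            Pattern.compile("\\br:(?:embed|id)\\s*=\\s*\"([^\"]+)\"");

    /** Returns the first relationship id found in the run XML, or null if there is none. */
    public static String extractRelationId(String runXml) {
        if (runXml == null) {
            return null;
        }
        Matcher m = REL_ID.matcher(runXml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hand-written XML fragments standing in for run.getCTR().xmlText()
        String drawing = "<w:drawing><a:blip r:embed=\"rId5\"/></w:drawing>";
        String pict = "<w:pict><v:imagedata r:id=\"rId7\" o:title=\"\"/></w:pict>";
        System.out.println(extractRelationId(drawing));                 // rId5
        System.out.println(extractRelationId(pict));                    // rId7
        System.out.println(extractRelationId("<w:t>plain text</w:t>")); // null
    }
}
```

In the loop above, the returned id could be looked up in `map` in the same way as before. A `null` result means the run carries no image reference and can be treated as plain text. Note that the sketch returns the first match only, so a run whose XML contains another `r:id` attribute before the image reference (a hyperlink on the image, for example) would need extra handling.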
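This retrieves the complete text and image information (both the image data and its position in the text) from a docx file, which can then be stored in a database; the images are linked to the text by their saved file paths.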