Getting Started with Lucene: Parsing Word Documents
2009-12-23 18:39
Download (tm-extractors 0.4):
http://mirrors.ibiblio.org/pub/mirrors/maven2/org/textmining/tm-extractors/0.4/
The Java code is as follows:
package extract;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import org.textmining.text.extraction.WordExtractor;

public class ExtractorWord {

    /**
     * Extracts the plain text of a Word (.doc) file.
     *
     * @param file path to the Word document
     * @return the extracted text, or an empty string if extraction fails
     */
    public static String getText(String file) {
        String s = "";
        // try-with-resources (Java 7+) closes the stream even if extraction fails
        try (FileInputStream in = new FileInputStream(new File(file))) {
            WordExtractor extractor = new WordExtractor();
            s = extractor.extractText(in);
        } catch (Exception e) {
            // Covers both I/O errors and failures inside the extractor
            e.printStackTrace();
        }
        return s;
    }

    /**
     * Extracts the text of a Word document and writes it to a text file.
     *
     * @param doc      path to the Word document
     * @param filename path of the text file to create
     */
    public static void toTextFile(String doc, String filename) throws Exception {
        String s = getText(doc);
        // The writer is closed automatically, which also flushes it
        try (PrintWriter pw = new PrintWriter(new FileWriter(new File(filename)))) {
            pw.write(s);
            pw.flush();
            System.out.print("File written successfully!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Prints the text of a sample document, then saves it as a .txt file.
     *
     * @param args unused
     */
    public static void main(String[] args) {
        try {
            String sc = getText("D:/workspace/testsearch2/htmls/ddd.doc");
            System.out.print(sc);
            toTextFile("D:/workspace/testsearch2/htmls/ddd.doc", "D:/workspace/testsearch2/htmls/ddd.txt");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The result is as follows:
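A side note on the toTextFile method above: FileWriter encodes text with the platform default charset, so text extracted from a Word document (Chinese text in particular) may come out garbled if the output file is later read with a different encoding. The sketch below shows one possible way to write and read the text with an explicit UTF-8 charset using only the standard library. The class name Utf8TextWriter and its two helper methods are hypothetical names chosen for illustration and are not part of the original code or of tm-extractors.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original post or of tm-extractors.
public class Utf8TextWriter {

    // Writes the text to the target file, always encoding it as UTF-8.
    public static void writeUtf8(String text, Path target) throws IOException {
        // try-with-resources closes (and flushes) the writer automatically
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Reads the whole file back, decoding it as UTF-8.
    public static String readUtf8(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the ddd.txt output of the original example
        Path tmp = Files.createTempFile("extracted", ".txt");
        writeUtf8("sample text 示例文本", tmp);
        System.out.println(readUtf8(tmp));
        Files.delete(tmp);
    }
}
```

In the original class, the string returned by getText could be passed to writeUtf8 in place of the PrintWriter/FileWriter pair. Unlike PrintWriter, which suppresses I/O errors, this version lets an IOException propagate to the caller.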