
Apache Jackrabbit Source Code Study (Part 1)

2013-04-06 18:09
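A few years ago an expert wrote a series titled "深入浅出 jackrabbit" (roughly, "Jackrabbit Explained Simply"), available at http://ahuaxuan.iteye.com/category/65829

I benefited greatly from reading it (without its help, it would probably have taken me much longer to understand Jackrabbit). That series is old, though: it covered Jackrabbit 1.7, which differs somewhat from the latest release. Shallow as my own understanding is, I could not resist the urge to write down my own reading of the Apache Jackrabbit source, in the hope of deepening my understanding of programming and perhaps helping those who come after.

(Note: this article may still be under revision; if you repost it, you do so at your own and your readers' risk.)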

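In the current version, Jackrabbit extracts text from rich documents through Apache Tika, which differs from earlier versions.

The feature is implemented mainly by the LazyTextExtractorField class, which extends Lucene's abstract class AbstractField.

The source of LazyTextExtractorField is as follows: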

/**
 * <code>LazyTextExtractorField</code> implements a Lucene field with a String
 * value that is lazily initialized from a given {@link Reader}. In addition
 * this class provides a method to find out whether the purpose of the reader
 * is to extract text and whether the extraction process is already finished.
 *
 * @see #isExtractorFinished()
 */
public class LazyTextExtractorField extends AbstractField {

    /**
     * The logger instance for this class.
     */
    private static final Logger log =
        LoggerFactory.getLogger(LazyTextExtractorField.class);

    /**
     * The exception used to forcibly terminate the extraction process
     * when the maximum field length is reached.
     */
    private static final SAXException STOP =
        new SAXException("max field length reached");

    /**
     * The extracted text content of the given binary value.
     * Set to non-null when the text extraction task finishes.
     */
    private volatile String extract = null;

    /**
     * Creates a new <code>LazyTextExtractorField</code> with the given
     * <code>name</code>.
     *
     * @param name the name of the field.
     * @param reader the reader where to obtain the string from.
     * @param highlighting set to <code>true</code> to
     *                     enable result highlighting support
     */
    public LazyTextExtractorField(
            Parser parser, InternalValue value, Metadata metadata,
            Executor executor, boolean highlighting, int maxFieldLength) {
        super(FieldNames.FULLTEXT,
                highlighting ? Store.YES : Store.NO,
                Field.Index.ANALYZED,
                highlighting ? TermVector.WITH_OFFSETS : TermVector.NO);
        executor.execute(
                new ParsingTask(parser, value, metadata, maxFieldLength));
    }

    /**
     * Returns the extracted text. This method blocks until the text
     * extraction task has been completed.
     *
     * @return the string value of this field
     */
    public synchronized String stringValue() {
        try {
            while (!isExtractorFinished()) {
                wait();
            }
            return extract;
        } catch (InterruptedException e) {
            log.error("Text extraction thread was interrupted", e);
            return "";
        }
    }

    /**
     * @return always <code>null</code>
     */
    public Reader readerValue() {
        return null;
    }

    /**
     * @return always <code>null</code>
     */
    public byte[] binaryValue() {
        return null;
    }

    /**
     * @return always <code>null</code>
     */
    public TokenStream tokenStreamValue() {
        return null;
    }

    /**
     * Checks whether the text extraction task has finished.
     *
     * @return <code>true</code> if the extracted text is available
     */
    public boolean isExtractorFinished() {
        return extract != null;
    }

    private synchronized void setExtractedText(String value) {
        extract = value;
        notify();
    }

    /**
     * Releases all resources associated with this field.
     */
    public void dispose() {
        // TODO: Cause the ContentHandler below to throw an exception
    }

    /**
     * The background task for extracting text from a binary value.
     */
    private class ParsingTask extends DefaultHandler implements Runnable {

        private final Parser parser;

        private final InternalValue value;

        private final Metadata metadata;

        private final int maxFieldLength;

        private final StringBuilder builder = new StringBuilder();

        public ParsingTask(
                Parser parser, InternalValue value, Metadata metadata,
                int maxFieldLength) {
            this.parser = parser;
            this.value = value;
            this.metadata = metadata;
            this.maxFieldLength = maxFieldLength;
        }

        public void run() {
            try {
                InputStream stream = value.getStream();
                try {
                    parser.parse(stream, this, metadata, new ParseContext());
                } finally {
                    stream.close();
                }
            } catch (Throwable t) {
                if (t != STOP) {
                    log.warn("Failed to extract text from a binary property", t);
                }
            } finally {
                value.discard();
            }
            setExtractedText(builder.toString());
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            builder.append(
                    ch, start,
                    Math.min(length, maxFieldLength - builder.length()));
            if (builder.length() >= maxFieldLength) {
                throw STOP;
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException {
            characters(ch, start, length);
        }

    }

}


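From the code we can see that rich-document text extraction is done in the ParsingTask class, a Runnable that the constructor hands to an Executor, so the extraction runs asynchronously.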
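The asynchronous hand-off can be illustrated with a minimal sketch that needs neither Lucene nor Tika. The class name LazyText and the Supplier-based extractor are hypothetical stand-ins, not part of Jackrabbit; only the shape (start the work in the constructor, block in stringValue() until a result is published) follows the listing above.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical stand-in for LazyTextExtractorField: the value is computed on another thread. */
class LazyText {

    // Stays null until the background task has published a result.
    private volatile String extract = null;

    LazyText(Supplier<String> extractor, Executor executor) {
        // Hand the slow work to the executor; the constructor returns immediately.
        executor.execute(() -> {
            String text;
            try {
                text = extractor.get();
            } catch (RuntimeException e) {
                text = null; // treated as "no text" below
            }
            // Always publish a non-null value so waiting readers are released.
            setExtractedText(text == null ? "" : text);
        });
    }

    boolean isFinished() {
        return extract != null;
    }

    /** Blocks until the background task has finished, then returns its result. */
    synchronized String stringValue() {
        try {
            while (!isFinished()) {
                wait();
            }
            return extract;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return "";
        }
    }

    private synchronized void setExtractedText(String value) {
        extract = value;
        notifyAll(); // wake every thread blocked in stringValue()
    }
}

class LazyTextDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            LazyText field = new LazyText(() -> "extracted text", pool);
            System.out.println(field.stringValue());
        } finally {
            pool.shutdown();
        }
    }
}
```

The sketch uses notifyAll() where the original uses notify(); that is a choice made here so that several waiting readers would all be released, not a claim about what Jackrabbit should do.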

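The task class also extends DefaultHandler, which implements the four interfaces EntityResolver, DTDHandler, ContentHandler and ErrorHandler with do-nothing methods. This is the default adapter pattern: a subclass overrides only the callbacks it cares about, which makes implementing the target interface much more convenient.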
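As a small illustration of the default adapter idea, the sketch below extends DefaultHandler and overrides only characters(); every other SAX callback keeps its inherited empty implementation. The class name TextCollector and the helper extract() are made up for this example, and the JDK's built-in SAX parser is used in place of Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects character data and ignores all other SAX events. */
class TextCollector extends DefaultHandler {

    private final StringBuilder builder = new StringBuilder();

    // The only callback we override; startElement, endElement, etc. stay as no-ops.
    @Override
    public void characters(char[] ch, int start, int length) {
        builder.append(ch, start, length);
    }

    String text() {
        return builder.toString();
    }

    /** Parses the given XML string and returns the concatenated character data. */
    static String extract(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(extract("<doc><p>Hello</p><p>World</p></doc>"));
    }
}
```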

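Under the JAXP specification, SAX parsing of XML is event-driven. The most important of the interfaces above is ContentHandler, which ParsingTask implements indirectly: each characters() callback appends the text it receives to the object declared as private final StringBuilder builder = new StringBuilder(), and once maxFieldLength is reached the handler throws the shared STOP exception to end parsing early.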
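The accumulate-then-abort behaviour of characters() can be tried in isolation. The following is a sketch under assumptions: BoundedTextHandler and its extract() helper are hypothetical names, the JDK's SAX parser replaces Tika, and the helper walks the cause chain when looking for the sentinel in case a parser wraps it (the original listing compares by identity only).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that keeps at most maxLength characters of text. */
class BoundedTextHandler extends DefaultHandler {

    // Shared sentinel: thrown on purpose to stop parsing, not to signal an error.
    private static final SAXException STOP =
            new SAXException("max field length reached");

    private final int maxLength;
    private final StringBuilder builder = new StringBuilder();

    BoundedTextHandler(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Append only as much as still fits, then abort once the cap is hit.
        builder.append(ch, start, Math.min(length, maxLength - builder.length()));
        if (builder.length() >= maxLength) {
            throw STOP;
        }
    }

    /** True if the sentinel is the given throwable or appears in its cause chain. */
    private static boolean isStop(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c == STOP) {
                return true;
            }
        }
        return false;
    }

    /** Returns at most maxLength characters of text from the given XML string. */
    static String extract(String xml, int maxLength) {
        BoundedTextHandler handler = new BoundedTextHandler(maxLength);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            if (!isStop(e)) {
                throw new IllegalStateException("XML parsing failed", e);
            }
            // Sentinel seen: the text collected so far is the truncated result.
        }
        return handler.builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<doc><p>Hello</p><p>World</p></doc>", 7));
    }
}
```

Reusing one preallocated exception avoids building a stack trace for what is really a control-flow signal; the price is that the caller must recognise the sentinel and not log it as a failure, which is what the t != STOP check in run() does.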

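At the end of run(), the task submits the collected text by calling setExtractedText(builder.toString()), which also wakes any thread waiting in stringValue().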

Note that the parser object used here is not one of Apache Tika's own classes; Jackrabbit wraps Tika in its own JackrabbitParser class.

The source of JackrabbitParser is as follows:

/**
 * Jackrabbit wrapper for Tika parsers. Uses a Tika {@link AutoDetectParser}
 * for all parsing requests, but sets it up with Jackrabbit-specific
 * configuration and implements backwards compatibility support for old
 * <code>textExtractorClasses</code> configurations.
 *
 * @since Apache Jackrabbit 2.0
 */
class JackrabbitParser implements Parser {

    /**
     * Logger instance.
     */
    private static final Logger logger =
        LoggerFactory.getLogger(JackrabbitParser.class);

    /**
     * Flag for blocking all text extraction. Used by the Jackrabbit test suite.
     */
    private static volatile boolean blocked = false;

    /**
     * The configured Tika parser.
     */
    private final AutoDetectParser parser;

    /**
     * Creates a parser using the default Jackrabbit-specific configuration
     * settings.
     */
    public JackrabbitParser() {
        InputStream stream =
            JackrabbitParser.class.getResourceAsStream("tika-config.xml");
        try {
            if (stream != null) {
                try {
                    parser = new AutoDetectParser(new TikaConfig(stream));
                } finally {
                    stream.close();
                }
            } else {
                parser = new AutoDetectParser();
            }
        } catch (Exception e) {
            // Should never happen
            throw new RuntimeException(
                    "Unable to load embedded Tika configuration", e);
        }
    }

    /**
     * Backwards compatibility method to support old Jackrabbit 1.x
     * <code>textExtractorClasses</code> configurations. Implements a best
     * effort mapping from the old-style text extractor classes to
     * corresponding Tika parsers.
     *
     * @param classes configured list of text extractor classes
     */
    public void setTextFilterClasses(String classes) {
        Map<MediaType, Parser> parsers = new HashMap<MediaType, Parser>();

        StringTokenizer tokenizer = new StringTokenizer(classes, ", \t\n\r\f");
        while (tokenizer.hasMoreTokens()) {
            String name = tokenizer.nextToken();
            if (name.equals(
                    "org.apache.jackrabbit.extractor.HTMLTextExtractor")) {
                parsers.put(MediaType.text("html"), new HtmlParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.MsExcelTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-excel"), parser);
                parsers.put(MediaType.application("msexcel"), parser);
                parsers.put(MediaType.application("excel"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsOutlookTextExtractor")) {
                parsers.put(MediaType.application("vnd.ms-outlook"), new OfficeParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.MsPowerPointExtractor")
                    || name.equals("org.apache.jackrabbit.extractor.MsPowerPointTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                parsers.put(MediaType.application("mspowerpoint"), parser);
                parsers.put(MediaType.application("powerpoint"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsWordTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-word"), parser);
                parsers.put(MediaType.application("msword"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.MsTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-word"), parser);
                parsers.put(MediaType.application("msword"), parser);
                parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                parsers.put(MediaType.application("mspowerpoint"), parser);
                parsers.put(MediaType.application("vnd.ms-excel"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), parser);
                parsers.put(MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.OpenOfficeTextExtractor")) {
                Parser parser = new OpenDocumentParser();
                parsers.put(MediaType.application("vnd.oasis.opendocument.database"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.formula"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.graphics"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.presentation"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.spreadsheet"), parser);
                parsers.put(MediaType.application("vnd.oasis.opendocument.text"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.calc"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.draw"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.impress"), parser);
                parsers.put(MediaType.application("vnd.sun.xml.writer"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.PdfTextExtractor")) {
                parsers.put(MediaType.application("pdf"), new PDFParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.PlainTextExtractor")) {
                parsers.put(MediaType.TEXT_PLAIN, new TXTParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.PngTextExtractor")) {
                Parser parser = new ImageParser();
                parsers.put(MediaType.image("png"), parser);
                parsers.put(MediaType.image("apng"), parser);
                parsers.put(MediaType.image("mng"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.RTFTextExtractor")) {
                Parser parser = new RTFParser();
                parsers.put(MediaType.application("rtf"), parser);
                parsers.put(MediaType.text("rtf"), parser);
            } else if (name.equals("org.apache.jackrabbit.extractor.XMLTextExtractor")) {
                Parser parser = new XMLParser();
                parsers.put(MediaType.APPLICATION_XML, parser);
                parsers.put(MediaType.text("xml"), parser);
            } else {
                logger.warn("Ignoring unknown text extractor class: {}", name);
            }
        }

        parser.setParsers(parsers);
    }

    /**
     * Delegates the call to the configured {@link AutoDetectParser}.
     */
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return parser.getSupportedTypes(context);
    }

    /**
     * Delegates the call to the configured {@link AutoDetectParser}.
     */
    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        waitIfBlocked();
        parser.parse(stream, handler, metadata, context);
    }

    public void parse(
            InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException {
        parse(stream, handler, metadata, new ParseContext());
    }

    /**
     * Waits until text extraction is no longer blocked. The block is only
     * ever activated in the Jackrabbit test suite when testing delayed
     * text extraction.
     *
     * @throws TikaException if the block was interrupted
     */
    private synchronized static void waitIfBlocked() throws TikaException {
        try {
            while (blocked) {
                JackrabbitParser.class.wait();
            }
        } catch (InterruptedException e) {
            throw new TikaException("Text extraction block interrupted", e);
        }
    }

    /**
     * Blocks all text extraction tasks.
     */
    static synchronized void block() {
        blocked = true;
    }

    /**
     * Unblocks all text extraction tasks.
     */
    static synchronized void unblock() {
        blocked = false;
        JackrabbitParser.class.notifyAll();
    }

}


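The actual parsing is delegated to the AutoDetectParser class. If you have read my earlier Apache Tika source code study, you will know that AutoDetectParser extends CompositeParser, and that CompositeParser does its work by calling into the collection of Parser instances it aggregates. What is implemented here is the Composite pattern (the top-down, "safe" variant of it).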
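The delegation chain can be sketched without Tika on the classpath. Everything below is hypothetical and simplified: TextParser stands in for Tika's Parser, media types are plain strings, and parsing takes and returns a String. It shows a wrapper delegating to a composite, the composite dispatching to a leaf by media type, and the "safe" variant in which the child-management method setParsers() exists only on the composite, not on the shared interface.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical component interface, standing in for Tika's Parser. */
interface TextParser {
    Set<String> supportedTypes();
    String parse(String mediaType, String content);
}

/** Leaf: returns plain text unchanged. */
class PlainTextParser implements TextParser {
    public Set<String> supportedTypes() {
        return Collections.singleton("text/plain");
    }
    public String parse(String mediaType, String content) {
        return content;
    }
}

/** Leaf: crude tag stripper, for illustration only. */
class MarkupParser implements TextParser {
    public Set<String> supportedTypes() {
        return Collections.singleton("text/html");
    }
    public String parse(String mediaType, String content) {
        return content.replaceAll("<[^>]*>", "");
    }
}

/** Composite: owns the children and dispatches to one of them by media type. */
class CompositeTextParser implements TextParser {
    private final Map<String, TextParser> parsers = new HashMap<>();

    // "Safe" composite: child management lives here, not in TextParser.
    void setParsers(Map<String, TextParser> newParsers) {
        parsers.clear();
        parsers.putAll(newParsers);
    }

    public Set<String> supportedTypes() {
        return new TreeSet<>(parsers.keySet());
    }

    public String parse(String mediaType, String content) {
        TextParser child = parsers.get(mediaType);
        // In this sketch an unknown type simply yields no text.
        return child == null ? "" : child.parse(mediaType, content);
    }
}

/** Wrapper in the spirit of JackrabbitParser: configures a composite and delegates to it. */
class WrapperParser implements TextParser {
    private final CompositeTextParser delegate = new CompositeTextParser();

    WrapperParser() {
        Map<String, TextParser> defaults = new HashMap<>();
        defaults.put("text/plain", new PlainTextParser());
        defaults.put("text/html", new MarkupParser());
        delegate.setParsers(defaults);
    }

    public Set<String> supportedTypes() {
        return delegate.supportedTypes();
    }

    public String parse(String mediaType, String content) {
        return delegate.parse(mediaType, content);
    }
}

class CompositeDemo {
    public static void main(String[] args) {
        TextParser parser = new WrapperParser();
        System.out.println(parser.parse("text/html", "<p>Hi <b>there</b></p>"));
    }
}
```

Callers only ever see TextParser, so a leaf, the composite and the wrapper are interchangeable from their point of view.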

---------------------------------------------------------------------------

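This Apache Jackrabbit source code study series is my own original work.

When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/03/2997156.html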