Introduction to Lucene.Net Development, Part 2: Tokenization (2)
2008-10-23 12:53
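1.2 The tokenization process

1.2.1 How an analyzer works

The built-in analyzers all give poor results, so what can we do? Write our own. Before doing that, of course, we should see how the built-in analyzers are implemented. From the comparison of tokenization results in 1.1, KeywordAnalyzer looks like the laziest analyzer of all: it does essentially nothing. It is not that it cannot do anything; we simply have not found out how it is meant to be used. It is like holding a box without knowing what is inside: you cannot tell what it is for or what use it has. Opening the box means reading the source code, listed below as Code 1.2.1.1.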
```csharp
using System;

namespace Lucene.Net.Analysis
{

    /// <summary> "Tokenizes" the entire stream as a single token. This is useful
    /// for data like zip codes, ids, and some product names.
    /// </summary>
    public class KeywordAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
        {
            return new KeywordTokenizer(reader);
        }

        public override TokenStream ReusableTokenStream(System.String fieldName, System.IO.TextReader reader)
        {
            Tokenizer tokenizer = (Tokenizer)GetPreviousTokenStream();
            if (tokenizer == null)
            {
                tokenizer = new KeywordTokenizer(reader);
                SetPreviousTokenStream(tokenizer);
            }
            else
                tokenizer.Reset(reader);
            return tokenizer;
        }
    }
}
```

Code 1.2.1.1 is the much-talked-about source code. Start with the comment: it says that the class "tokenizes" the entire stream as a single token, and that this is useful for data such as zip codes, IDs, and some product names. "Tokenize" is not a word I could find in the dictionary; it evidently means splitting text into tokens. So KeywordAnalyzer is not failing to split anything: returning the whole input as one token is exactly what it is for. The code is fairly simple, with only two methods, and the second one, ReusableTokenStream, is the one we used earlier when analyzing the results (see section 1.1). The key point is that both methods hand the work to the KeywordTokenizer class, listed below as Code 1.2.1.2.
```csharp
using System;

namespace Lucene.Net.Analysis
{

    /// <summary> Emits the entire input as a single token.</summary>
    public class KeywordTokenizer : Tokenizer
    {

        private const int DEFAULT_BUFFER_SIZE = 256;

        private bool done;

        public KeywordTokenizer(System.IO.TextReader input) : this(input, DEFAULT_BUFFER_SIZE)
        {
        }

        public KeywordTokenizer(System.IO.TextReader input, int bufferSize) : base(input)
        {
            this.done = false;
        }

        public override Token Next(Token result)
        {
            if (!done)
            {
                done = true;
                int upto = 0;
                result.Clear();
                char[] buffer = result.TermBuffer();
                while (true)
                {
                    int length = input.Read(buffer, upto, buffer.Length - upto);
                    if (length <= 0)
                        break;
                    upto += length;
                    if (upto == buffer.Length)
                        buffer = result.ResizeTermBuffer(1 + buffer.Length);
                }
                result.termLength = upto;
                return result;
            }
            return null;
        }

        public override void Reset(System.IO.TextReader input)
        {
            base.Reset(input);
            this.done = false;
        }
    }
}
```

Code 1.2.1.2 is the source of KeywordTokenizer. It is short, yet it does not do all of the work itself: part of it is left to the parent class. Anyone who follows Lucene will know that the tokenization API changed in the newer version, which now has an additional overload of Next. Why that overload was added is not discussed here, since this article is mainly about application. Because tokens are fetched through Next, Next is the only method we need to look at. The parent class of KeywordTokenizer is Tokenizer, but Tokenizer does not contain the relationship we are looking for. Tokenizer in turn inherits from TokenStream, listed below as Code 1.2.1.3.
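The loop inside Next(Token) in Code 1.2.1.2 is easier to follow when it is separated from the Lucene.Net types. The sketch below is not Lucene.Net code: `SingleTokenSource`, `SingleTokenDemo`, and `CountTokens` are hypothetical names, a plain `char[]` stands in for the Token's term buffer, and `Array.Resize` stands in for `ResizeTermBuffer`. It only illustrates the pattern: read until the reader is exhausted, grow the buffer whenever it fills up, return everything once, and return null on every later call.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for KeywordTokenizer; not part of Lucene.Net.
public class SingleTokenSource
{
    private readonly TextReader input;
    private bool done;

    public SingleTokenSource(TextReader input)
    {
        this.input = input;
    }

    // Returns the whole input as one string on the first call, null afterwards.
    public string Next()
    {
        if (done)
            return null;
        done = true;

        char[] buffer = new char[4]; // deliberately tiny so the resize path runs
        int upto = 0;
        while (true)
        {
            int length = input.Read(buffer, upto, buffer.Length - upto);
            if (length <= 0)
                break; // end of input
            upto += length;
            if (upto == buffer.Length)
                Array.Resize(ref buffer, buffer.Length * 2); // grow, keeping the contents
        }
        return new string(buffer, 0, upto);
    }
}

public static class SingleTokenDemo
{
    // Drains a source the way a consumer would: call Next() until it returns null.
    public static int CountTokens(string text)
    {
        var source = new SingleTokenSource(new StringReader(text));
        int count = 0;
        while (source.Next() != null)
            count++;
        return count;
    }
}
```

Whatever the input contains, spaces included, `SingleTokenDemo.CountTokens` reports exactly one token, which matches the "lazy" behaviour observed for KeywordAnalyzer in 1.1.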
```csharp
using System;

using Payload = Lucene.Net.Index.Payload;

namespace Lucene.Net.Analysis
{

    /// <summary>A TokenStream enumerates the sequence of tokens, either from
    /// fields of a document or from query text.
    /// <p>
    /// This is an abstract class. Concrete subclasses are:
    /// <ul>
    /// <li>{@link Tokenizer}, a TokenStream
    /// whose input is a Reader; and
    /// <li>{@link TokenFilter}, a TokenStream
    /// whose input is another TokenStream.
    /// </ul>
    /// NOTE: subclasses must override at least one of {@link
    /// #Next()} or {@link #Next(Token)}.
    /// </summary>

    public abstract class TokenStream
    {

        /// <summary>Returns the next token in the stream, or null at EOS.
        /// The returned Token is a "full private copy" (not
        /// re-used across calls to next()) but will be slower
        /// than calling {@link #Next(Token)} instead.
        /// </summary>
        public virtual Token Next()
        {
            Token result = Next(new Token());

            if (result != null)
            {
                Payload p = result.GetPayload();
                if (p != null)
                {
                    result.SetPayload((Payload) p.Clone());
                }
            }

            return result;
        }

        /// <summary>Returns the next token in the stream, or null at EOS.
        /// When possible, the input Token should be used as the
        /// returned Token (this gives fastest tokenization
        /// performance), but this is not required and a new Token
        /// may be returned. Callers may re-use a single Token
        /// instance for successive calls to this method.
        /// <p>
        /// This implicitly defines a "contract" between
        /// consumers (callers of this method) and
        /// producers (implementations of this method
        /// that are the source for tokens):
        /// <ul>
        /// <li>A consumer must fully consume the previously
        /// returned Token before calling this method again.</li>
        /// <li>A producer must call {@link Token#Clear()}
        /// before setting the fields in it &amp; returning it</li>
        /// </ul>
        /// Note that a {@link TokenFilter} is considered a consumer.
        /// </summary>
        /// <param name="result">a Token that may or may not be used to return
        /// </param>
        /// <returns> next token in the stream or null if end-of-stream was hit
        /// </returns>
        public virtual Token Next(Token result)
        {
            return Next();
        }

        /// <summary>Resets this stream to the beginning. This is an
        /// optional operation, so subclasses may or may not
        /// implement this method. Reset() is not needed for
        /// the standard indexing process. However, if the Tokens
        /// of a TokenStream are intended to be consumed more than
        /// once, it is necessary to implement reset().
        /// </summary>
        public virtual void Reset()
        {
        }

        /// <summary>Releases resources associated with this stream. </summary>
        public virtual void Close()
        {
        }
    }
}
```

Code 1.2.1.3 is the source of the TokenStream class. Next(Token) and Next() call each other: the default Next() calls Next(new Token()), and the default Next(Token) calls Next(). That is why the class comment requires subclasses to override at least one of them; otherwise the two defaults would keep calling each other. Since KeywordTokenizer overrides Next(Token), the Next(Token) in TokenStream can be ignored here.

The code above shows that calling Next() really just passes a fresh Token instance to Next(Token). Even if you call Next(Token) directly and pass in a Token that already holds data, that data is cleared first (the result.Clear() call in Code 1.2.1.2). Inside the loop, the stream that was passed to the constructor is read into the term buffer of the Token. ResizeTermBuffer grows that buffer automatically, much as some .NET Framework classes such as List<T>, Hashtable, and StringBuilder grow on demand. No actual splitting shows up anywhere in this process, but it does give a rough picture of how an analyzer works.

1.2.2 How to make an analyzer split text

We now know how an analyzer works, but not yet how the text actually gets split. Going back to section 1.1.2, WhitespaceAnalyzer looks like a good one to learn from, because it splits only when it meets whitespace.

Following the experience from 1.2.1, we go straight to the WhitespaceTokenizer class, listed below as Code 1.2.2.1.
```csharp
using System;

namespace Lucene.Net.Analysis
{

    /// <summary>A WhitespaceTokenizer is a tokenizer that divides text at whitespace.
    /// Adjacent sequences of non-Whitespace characters form tokens.
    /// </summary>

    public class WhitespaceTokenizer : CharTokenizer
    {
        /// <summary>Construct a new WhitespaceTokenizer. </summary>
        public WhitespaceTokenizer(System.IO.TextReader in_Renamed) : base(in_Renamed)
        {
        }

        /// <summary>Collects only characters which do not satisfy
        /// {@link Character#isWhitespace(char)}.
        /// </summary>
        protected internal override bool IsTokenChar(char c)
        {
            return !System.Char.IsWhiteSpace(c);
        }
    }
}
```

Good, this listing is short, but it does not show what we are looking for. On to the parent class, CharTokenizer, listed below as Code 1.2.2.2.
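The single override in Code 1.2.2.1 rests entirely on `System.Char.IsWhiteSpace`, so it helps to check which characters that test treats as boundaries. The following is a small sketch for experimenting, not Lucene.Net code: the class `WhitespaceCheck` and its `Boundaries` helper are hypothetical, and only the body of `IsTokenChar` is copied from the listing.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for experimenting; not part of Lucene.Net.
public static class WhitespaceCheck
{
    // Same test as WhitespaceTokenizer.IsTokenChar in Code 1.2.2.1.
    public static bool IsTokenChar(char c)
    {
        return !Char.IsWhiteSpace(c);
    }

    // Returns the characters of text that would act as token boundaries.
    public static List<char> Boundaries(string text)
    {
        var result = new List<char>();
        foreach (char c in text)
        {
            if (!IsTokenChar(c))
                result.Add(c);
        }
        return result;
    }
}
```

For `"a b\tc\nd"` the helper reports three boundary characters (space, tab, newline). A sentence written without spaces, such as `"中文分词"`, contains none, so a whitespace-based test alone gives it no place to split.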
```csharp
using System;

namespace Lucene.Net.Analysis
{

    /// <summary>An abstract base class for simple, character-oriented tokenizers.</summary>
    public abstract class CharTokenizer : Tokenizer
    {
        public CharTokenizer(System.IO.TextReader input) : base(input)
        {
        }

        private int offset = 0, bufferIndex = 0, dataLen = 0;
        private const int MAX_WORD_LEN = 255;
        private const int IO_BUFFER_SIZE = 1024;
        private char[] ioBuffer = new char[IO_BUFFER_SIZE];

        /// <summary>Returns true iff a character should be included in a token. This
        /// tokenizer generates as tokens adjacent sequences of characters which
        /// satisfy this predicate. Characters for which this is false are used to
        /// define token boundaries and are not included in tokens.
        /// </summary>
        protected internal abstract bool IsTokenChar(char c);

        /// <summary>Called on each token character to normalize it before it is added to the
        /// token. The default implementation does nothing. Subclasses may use this
        /// to, e.g., lowercase tokens.
        /// </summary>
        protected internal virtual char Normalize(char c)
        {
            return c;
        }

        public override Token Next(Token token)
        {
            token.Clear();
            int length = 0;
            int start = bufferIndex;
            char[] buffer = token.TermBuffer();
            while (true)
            {

                if (bufferIndex >= dataLen)
                {
                    offset += dataLen;
                    dataLen = input is Lucene.Net.Index.DocumentsWriter.ReusableStringReader
                        ? ((Lucene.Net.Index.DocumentsWriter.ReusableStringReader) input).Read(ioBuffer)
                        : input.Read((System.Char[]) ioBuffer, 0, ioBuffer.Length);
                    if (dataLen <= 0)
                    {
                        if (length > 0)
                            break;
                        else
                            return null;
                    }
                    bufferIndex = 0;
                }

                char c = ioBuffer[bufferIndex++];

                if (IsTokenChar(c))
                {
                    // if it's a token char

                    if (length == 0)
                        // start of token
                        start = offset + bufferIndex - 1;
                    else if (length == buffer.Length)
                        buffer = token.ResizeTermBuffer(1 + length);

                    buffer[length++] = Normalize(c); // buffer it, normalized

                    if (length == MAX_WORD_LEN)
                        // buffer overflow!
                        break;
                }
                else if (length > 0)
                    // at non-Letter w/ chars
                    break; // return 'em
            }

            token.termLength = length;
            token.startOffset = start;
            token.endOffset = start + length;
            return token;
        }

        public override void Reset(System.IO.TextReader input)
        {
            base.Reset(input);
            bufferIndex = 0;
            offset = 0;
            dataLen = 0;
        }
    }
}
```

Just our luck: right after a short listing comes a long one. Still, why the extra level of inheritance? The reason must be that other tokenizers use CharTokenizer as well. WhitespaceTokenizer does not override Next; it overrides only IsTokenChar, so IsTokenChar is almost certainly the key. The name says it all, and the comment confirms it: the method returns true for characters that belong in a token, and any character for which it returns false marks a token boundary. This is much like the Split method of the string class. The part of Next that calls IsTokenChar shows that this is indeed how the splitting is done. In the end, it is just splitting a string.
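The comparison with string splitting can be made concrete. Below is a minimal sketch of the same idea outside Lucene.Net, assuming the whole text is already in memory. `PredicateSplitter` is a hypothetical name, and a `Func<char, bool>` delegate takes the place of the abstract IsTokenChar method. Unlike Code 1.2.2.2, it does not read from a TextReader in blocks, record start and end offsets, call Normalize, or cap tokens at MAX_WORD_LEN.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration of predicate-based splitting; not part of Lucene.Net.
public static class PredicateSplitter
{
    // Splits text into runs of characters for which isTokenChar returns true.
    // Characters for which it returns false end the current run and are dropped.
    public static List<string> Split(string text, Func<char, bool> isTokenChar)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (isTokenChar(c))
            {
                current.Append(c); // inside a token: buffer the character
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString()); // boundary after a token: emit it
                current.Clear();
            }
        }
        if (current.Length > 0)
            tokens.Add(current.ToString()); // input ended inside a token
        return tokens;
    }
}
```

Calling `PredicateSplitter.Split("Hello  Lucene.Net world", c => !Char.IsWhiteSpace(c))` yields `Hello`, `Lucene.Net`, and `world`. One difference from `string.Split` is worth noting: consecutive boundary characters never produce an empty token here, whereas `"a  b".Split(' ')` returns an empty string between the two spaces unless `StringSplitOptions.RemoveEmptyEntries` is passed.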
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
1
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
using System;
2
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
3
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
namespace Lucene.Net.Analysis
4
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
{
5
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
6
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary> Emits the entire input as a single token.</summary>
7
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public class KeywordTokenizer : Tokenizer
8
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
9
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
10
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
private const int DEFAULT_BUFFER_SIZE = 256;
11
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
12
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
private bool done;
13
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
14
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public KeywordTokenizer(System.IO.TextReader input) : this(input, DEFAULT_BUFFER_SIZE)
15
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
16
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
17
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
18
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public KeywordTokenizer(System.IO.TextReader input, int bufferSize) : base(input)
19
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
20
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
this.done = false;
21
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
22
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
23
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public override Token Next(Token result)
24
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
25
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (!done)
26
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
27
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
done = true;
28
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
int upto = 0;
29
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
result.Clear();
30
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
char[] buffer = result.TermBuffer();
31
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
while (true)
32
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
33
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
int length = input.Read(buffer, upto, buffer.Length - upto);
34
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (length <= 0)
35
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
break;
36
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
upto += length;
37
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (upto == buffer.Length)
38
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
buffer = result.ResizeTermBuffer(1 + buffer.Length);
39
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
40
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
result.termLength = upto;
41
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
return result;
42
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
43
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
return null;
44
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
45
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
46
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public override void Reset(System.IO.TextReader input)
47
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
48
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
base.Reset(input);
49
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
this.done = false;
50
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
51
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
52
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockEnd.gif)
} 代码 1.2.1.2 就是KeywordTokenizer的源码。代码量很小,却没有完成全部工作,而是将部分工作交给了父类。关注Lucene的人都可以知道,新版本中,分词这里换掉了,现在多了一个重载的Next方法。这里不讨论为什么要加这个重载,这篇文章主要是讲应用的。因为取词是用Next方法走的,那么只需要关注Next方法就可以了。KeywordTokenizer的父类是Tokenizer,但是在Tokenizer里找不到我们想要的关系,但是Tokenizer又继承自TokenStream。查看TokenStream类。 代码 1.2.1.3
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
1
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
2
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
using System;
3
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
4
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
using Payload = Lucene.Net.Index.Payload;
5
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
6
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
namespace Lucene.Net.Analysis
7
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
{
8
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
9
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>A TokenStream enumerates the sequence of tokens, either from
10
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// fields of a document or from query text.
11
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <p>
12
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// This is an abstract class. Concrete subclasses are:
13
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <ul>
14
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <li>{@link Tokenizer}, a TokenStream
15
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// whose input is a Reader; and
16
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <li>{@link TokenFilter}, a TokenStream
17
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// whose input is another TokenStream.
18
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// </ul>
19
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// NOTE: subclasses must override at least one of {@link
20
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// #Next()} or {@link #Next(Token)}.
21
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
/// </summary>
22
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
23
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public abstract class TokenStream
24
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
25
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
26
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>Returns the next token in the stream, or null at EOS.
27
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// The returned Token is a "full private copy" (not
28
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// re-used across calls to next()) but will be slower
29
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// than calling {@link #Next(Token)} instead..
30
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
/// </summary>
31
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public virtual Token Next()
32
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
33
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
Token result = Next(new Token());
34
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
35
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (result != null)
36
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
37
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
Payload p = result.GetPayload();
38
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (p != null)
39
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
40
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
result.SetPayload((Payload) p.Clone());
41
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
42
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
43
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
44
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
return result;
45
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
46
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
47
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>Returns the next token in the stream, or null at EOS.
48
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// When possible, the input Token should be used as the
49
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// returned Token (this gives fastest tokenization
50
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// performance), but this is not required and a new Token
51
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// may be returned. Callers may re-use a single Token
52
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// instance for successive calls to this method.
53
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <p>
54
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// This implicitly defines a "contract" between
55
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// consumers (callers of this method) and
56
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// producers (implementations of this method
57
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// that are the source for tokens):
58
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <ul>
59
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <li>A consumer must fully consume the previously
60
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// returned Token before calling this method again.</li>
61
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <li>A producer must call {@link Token#Clear()}
62
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// before setting the fields in it & returning it</li>
63
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// </ul>
64
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// Note that a {@link TokenFilter} is considered a consumer.
65
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// </summary>
66
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <param name="result">a Token that may or may not be used to return
67
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// </param>
68
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// <returns> next token in the stream or null if end-of-stream was hit
69
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
/// </returns>
70
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public virtual Token Next(Token result)
71
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
72
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
return Next();
73
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
74
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
75
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>Resets this stream to the beginning. This is an
76
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// optional operation, so subclasses may or may not
77
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// implement this method. Reset() is not needed for
78
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// the standard indexing process. However, if the Tokens
79
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// of a TokenStream are intended to be consumed more than
80
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// once, it is necessary to implement reset().
81
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
/// </summary>
82
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public virtual void Reset()
83
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
84
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
85
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
86
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>Releases resources associated with this stream. </summary>
87
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public virtual void Close()
88
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
89
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
90
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
91
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockEnd.gif)
} 代码 1.2.1.3 就是TokenStream类的源码。Next(Token)方法和Next()是相互调用的关系。但是因为Next(Token)方法在KeywordTokenizer里被重写掉了,因此,这里就可以忽略TokenStream的Next(Token)方法了。 从上面代码可以看出,调用Next()方法,实际上是传递给Next(Token)方法一个新Token实例。即使直接调用Next(Token),传递一个带有数据的Token,也会先被清除。在循环中,会把构造函数传入的流缓冲进Token类的缓冲区。ResizeTermBuffer方法是自动扩容用的,就像.Net Framework里的一些类能够自然扩容一样。比如List<T>,Hashtable或StringBuilder等。这个过程看不到分词的过程。不过这样就大致明白了分词器工作的流程。 1.2.2 如何让分词器分词 知道分词器如何工作了,但是现在还不明白分词如何分词。再回到1.1.2节,看到WhitespaceAnalyzer分词器似乎是学习的好选择。因为这个分词器只有遇到空格才会进行分词操作。 根据1.2.1的经验,直接查看WhitespaceTokenizer类。 代码1.2.2.1
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
1
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
using System;
2
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
3
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
namespace Lucene.Net.Analysis
4
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
{
5
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
6
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>A WhitespaceTokenizer is a tokenizer that divides text at whitespace.
7
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// Adjacent sequences of non-Whitespace characters form tokens.
8
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
/// </summary>
9
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
10
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public class WhitespaceTokenizer : CharTokenizer
11
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
12
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>Construct a new WhitespaceTokenizer. </summary>
13
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public WhitespaceTokenizer(System.IO.TextReader in_Renamed) : base(in_Renamed)
14
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
15
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
16
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
17
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>Collects only characters which do not satisfy
18
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// {@link Character#isWhitespace(char)}.
19
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
/// </summary>
20
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
protected internal override bool IsTokenChar(char c)
21
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
22
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
return !System.Char.IsWhiteSpace(c);
23
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
24
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
25
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockEnd.gif)
} 很好,这段代码很短,可是没有看到我们想要的东西。继续看父类。 代码1.2.2.2
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
1
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
using System;
2
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
3
![](http://www.cnblogs.com/Images/OutliningIndicators/None.gif)
namespace Lucene.Net.Analysis
4
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockStart.gif)
{
5
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
6
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>An abstract base class for simple, character-oriented tokenizers.</summary>
7
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public abstract class CharTokenizer : Tokenizer
8
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
9
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public CharTokenizer(System.IO.TextReader input) : base(input)
10
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
11
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
12
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
13
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
private int offset = 0, bufferIndex = 0, dataLen = 0;
14
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
private const int MAX_WORD_LEN = 255;
15
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
private const int IO_BUFFER_SIZE = 1024;
16
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
private char[] ioBuffer = new char[IO_BUFFER_SIZE];
17
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
18
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>Returns true iff a character should be included in a token. This
19
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// tokenizer generates as tokens adjacent sequences of characters which
20
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// satisfy this predicate. Characters for which this is false are used to
21
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// define token boundaries and are not included in tokens.
22
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
/// </summary>
23
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
protected internal abstract bool IsTokenChar(char c);
24
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
25
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
/// <summary>Called on each token character to normalize it before it is added to the
26
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// token. The default implementation does nothing. Subclasses may use this
27
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
/// to, e.g., lowercase tokens.
28
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
/// </summary>
29
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
protected internal virtual char Normalize(char c)
30
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
31
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
return c;
32
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
33
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
34
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public override Token Next(Token token)
35
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
36
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
token.Clear();
37
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
int length = 0;
38
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
int start = bufferIndex;
39
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
char[] buffer = token.TermBuffer();
40
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
while (true)
41
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
42
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
43
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (bufferIndex >= dataLen)
44
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
45
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
offset += dataLen;
46
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
dataLen = input is Lucene.Net.Index.DocumentsWriter.ReusableStringReader ? ((Lucene.Net.Index.DocumentsWriter.ReusableStringReader) input).Read(ioBuffer) : input.Read((System.Char[]) ioBuffer, 0, ioBuffer.Length);
47
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (dataLen <= 0)
48
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
49
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (length > 0)
50
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
break;
51
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
else
52
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
return null;
53
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
54
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
bufferIndex = 0;
55
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
56
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
57
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
char c = ioBuffer[bufferIndex++];
58
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
59
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (IsTokenChar(c))
60
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
61
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
// if it's a token char
62
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
63
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (length == 0)
64
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
// start of token
65
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
start = offset + bufferIndex - 1;
66
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
else if (length == buffer.Length)
67
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
buffer = token.ResizeTermBuffer(1 + length);
68
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
69
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
buffer[length++] = Normalize(c); // buffer it, normalized
70
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
71
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
if (length == MAX_WORD_LEN)
72
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
// buffer overflow!
73
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
break;
74
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
75
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
else if (length > 0)
76
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
// at non-Letter w/ chars
77
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
break; // return 'em
78
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
79
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
80
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
token.termLength = length;
81
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
token.startOffset = start;
82
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
token.endOffset = start + length;
83
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
return token;
84
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
85
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
86
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
public override void Reset(System.IO.TextReader input)
87
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif)
{
88
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
base.Reset(input);
89
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
bufferIndex = 0;
90
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
offset = 0;
91
![](http://www.cnblogs.com/Images/OutliningIndicators/InBlock.gif)
dataLen = 0;
92
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
93
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif)
}
94
![](http://www.cnblogs.com/Images/OutliningIndicators/ExpandedBlockEnd.gif)
} 天公不作美,刚看到简单的,就来了个长的。无奈中。不过为什么要多一重继承呢?那就是有其他分词器也用到CharTokenizer了。而WhitespaceTokenizer中没有重写Next方法,而只是重写了IsTokenChar方法,几乎可以肯定。这个IsTokenChar才是重点。IsTokenChar故名思意,一看注释,果然!这个方法是判断是否遇到了分词的点的。这个其实和string类的Split方法相似。注意到Next方法关于IsTokenChar逻辑那一段,恩,果然是这样分词的。实际上就是拆分字符串嘛。