
Solr Getting Started: Official Documentation 6.0 Reading Notes, Part 7

2016-06-15 19:17
Part 3: Understanding Analyzers, Tokenizers, and Filters

Tokenizers 

Related Maven dependencies:

<!-- analyzer start -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>5.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>5.5.0</version>
</dependency>
<!-- analyzer end -->
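
With these dependencies on the classpath, a tokenizer can be exercised directly from plain Java. Below is a minimal sketch (the class and method names TokenizerDemo/printTokens are my own, not from the guide) that prints every token a Lucene Tokenizer emits for a given string; the later examples in these notes refer back to it.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {

  /** Feeds the text to the tokenizer and prints each emitted token on its own line. */
  public static void printTokens(Tokenizer tokenizer, String text) throws IOException {
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();                      // mandatory before the first incrementToken()
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // text of the current token
    }
    tokenizer.end();
    tokenizer.close();
  }
}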

Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters
are discarded, with the following exceptions:

1. Periods (dots) that are not followed by whitespace are kept as part of the token, so IP addresses and domain names are not split at the dots.
2. The "@" character is treated as splitting punctuation and is discarded, so email addresses are broken apart.
Factory class: solr.StandardTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

public class StandardTokenizerFactory extends TokenizerFactory {
  private final int maxTokenLength;

  /** Creates a new StandardTokenizerFactory */
  public StandardTokenizerFactory(Map<String,String> args) {
    super(args);
    maxTokenLength = getInt(args, "maxTokenLength", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory) {
    if (luceneMatchVersion.onOrAfter(Version.LUCENE_4_7_0)) {
      StandardTokenizer tokenizer = new StandardTokenizer(factory);
      tokenizer.setMaxTokenLength(maxTokenLength);
      return tokenizer;
    } else {
      StandardTokenizer40 tokenizer40 = new StandardTokenizer40(factory);
      tokenizer40.setMaxTokenLength(maxTokenLength);
      return tokenizer40;
    }
  }
}

The factory class is straightforward: it creates the tokenizer and sets its parameters. The actual tokenizer implementation is more involved and is not studied here; a quick usage sketch follows.
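
As a check of the behavior described above, the tokenizer can be created directly and driven by the printTokens helper sketched at the top of these notes (a hypothetical helper, not part of Lucene):

import org.apache.lucene.analysis.standard.StandardTokenizer;

// Same input as the guide's example above.
StandardTokenizer tokenizer = new StandardTokenizer();
tokenizer.setMaxTokenLength(255);  // corresponds to the maxTokenLength argument
TokenizerDemo.printTokens(tokenizer, "Please, email john.doe@foo.com by 03-09, re: m37-xq.");
// Expected output: Please, email, john.doe, foo.com, by, 03, 09, re, m37, xq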

Classic Tokenizer

This is much like the Standard Tokenizer, except that it does not use the Unicode standard annex UAX#29 word-boundary rules.

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
1. Periods (dots) that are not followed by whitespace are kept as part of the token.
2. Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
3. Internet domain names and email addresses are recognized and preserved as single tokens.
Factory class: solr.ClassicTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

public class ClassicTokenizerFactory extends TokenizerFactory {
  private final int maxTokenLength;

  /** Creates a new ClassicTokenizerFactory */
  public ClassicTokenizerFactory(Map<String,String> args) {
    super(args);
    maxTokenLength = getInt(args, "maxTokenLength", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public ClassicTokenizer create(AttributeFactory factory) {
    ClassicTokenizer tokenizer = new ClassicTokenizer(factory);
    tokenizer.setMaxTokenLength(maxTokenLength);
    return tokenizer;
  }
}

Keyword Tokenizer

This tokenizer treats the entire text field as a single token.
The entire input text is treated as a single token.
Factory class: solr.KeywordTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
In: "Please,
email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please, email
john.doe@foo.com by 03-09, re: m37-xq."

public class KeywordTokenizerFactory extends TokenizerFactory {

  /** Creates a new KeywordTokenizerFactory */
  public KeywordTokenizerFactory(Map<String,String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public KeywordTokenizer create(AttributeFactory factory) {
    return new KeywordTokenizer(factory, KeywordTokenizer.DEFAULT_BUFFER_SIZE);
  }
}

Letter Tokenizer

This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.

Tokens are built from runs of contiguous letters; all non-letter characters are discarded.

Factory class: solr.LetterTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
In: "I can't."
Out: "I", "can",
"t"

public class LetterTokenizerFactory extends TokenizerFactory {

  /** Creates a new LetterTokenizerFactory */
  public LetterTokenizerFactory(Map<String,String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public LetterTokenizer create(AttributeFactory factory) {
    return new LetterTokenizer(factory);
  }
}

Lower Case Tokenizer

Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and
non-letters are discarded.
Splits at non-letter characters; whitespace and non-letters are discarded, and all letters are converted to lowercase.
Factory class: solr.LowerCaseTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
In: "I just LOVE my
iPhone!"
Out: "i", "just",
"love", "my", "iphone"

public class LowerCaseTokenizerFactory extends TokenizerFactory implements MultiTermAwareComponent {

  /** Creates a new LowerCaseTokenizerFactory */
  public LowerCaseTokenizerFactory(Map<String,String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public LowerCaseTokenizer create(AttributeFactory factory) {
    return new LowerCaseTokenizer(factory);
  }

  @Override
  public AbstractAnalysisFactory getMultiTermComponent() {
    return new LowerCaseFilterFactory(new HashMap<>(getOriginalArgs()));
  }
}

N-Gram Tokenizer

Reads the field text and generates n-gram tokens of sizes in the given range.
Factory class: solr.NGramTokenizerFactory
Arguments:
minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default 2) The maximum n-gram size, must be >= minGramSize.
Example:
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace.
As a result, the space character is included in the encoding.
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
In: "hey man"
Out: "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"
Example:
With an n-gram size range of 4 to 5:
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
In: "bicycle"
Out: "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"

public class NGramTokenizerFactory extends TokenizerFactory {
  private final int maxGramSize;
  private final int minGramSize;

  /** Creates a new NGramTokenizerFactory */
  public NGramTokenizerFactory(Map<String, String> args) {
    super(args);
    minGramSize = getInt(args, "minGramSize", NGramTokenizer.DEFAULT_MIN_NGRAM_SIZE);
    maxGramSize = getInt(args, "maxGramSize", NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  /** Creates the {@link TokenStream} of n-grams from the given {@link Reader} and {@link AttributeFactory}. */
  @Override
  public Tokenizer create(AttributeFactory factory) {
    if (luceneMatchVersion.onOrAfter(Version.LUCENE_4_4_0)) {
      return new NGramTokenizer(factory, minGramSize, maxGramSize);
    } else {
      return new Lucene43NGramTokenizer(factory, minGramSize, maxGramSize);
    }
  }
}

This tokenizer has two arguments, the minimum and maximum gram size. The two examples above already illustrate the behavior.

With the minimum set to 2 and the maximum to 4, the input

这是一个测试分词

is tokenized as:
这是 是一 一个 个测 测试 试分 分词
这是一 是一个 一个测 个测试 测试分 试分词
这是一个 是一个测 一个测试 个测试分 测试分词

In other words, starting from the first position, substrings of the minimum size are taken from every position up to the end of the text; the size is then increased by one and the process repeats, until the maximum size is reached.

This can be used for search suggestion (autocomplete), though it has a drawback: the grams are not anchored at the start of the text, so matching from the middle of a term is also possible. A small sketch follows.
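
The same experiment in plain Java, again using the hypothetical printTokens helper defined at the beginning of these notes; NGramTokenizer takes the min and max gram sizes directly:

import org.apache.lucene.analysis.ngram.NGramTokenizer;

// minGramSize=2, maxGramSize=4, as in the example above.
NGramTokenizer tokenizer = new NGramTokenizer(2, 4);
TokenizerDemo.printTokens(tokenizer, "这是一个测试分词");
// Emits every substring of length 2 to 4; recent Lucene versions group the grams
// by start position (这是, 这是一, 这是一个, 是一, ...), but the set of grams is
// the same as listed above.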

Edge N-Gram Tokenizer

Reads the field text and generates edge n-gram tokens of sizes in the given range.
Factory class: solr.EdgeNGramTokenizerFactory
Arguments:
minGramSize: (integer, default is 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default is 1) The maximum n-gram size, must be >= minGramSize.
side: ("front" or "back", default is "front") Whether to compute the n-grams from the beginning (front) of the text
or from the end (back).
Example:
Default behavior (min and max default to 1):
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
In: "babaloo"
Out: "b"
Example:
Edge n-gram range of 2 to 5
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2"
maxGramSize="5"/>
</analyzer>
In: "babaloo"
Out:"ba", "bab", "baba",
"babal"
Example:
Edge n-gram range of 2 to 5, from the back side:
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"
side="back"/>
</analyzer>
In: "babaloo"
Out: "oo", "loo",
"aloo", "baloo"

public class EdgeNGramTokenizerFactory extends TokenizerFactory {
  private final int maxGramSize;
  private final int minGramSize;

  /** Creates a new EdgeNGramTokenizerFactory */
  public EdgeNGramTokenizerFactory(Map<String, String> args) {
    super(args);
    minGramSize = getInt(args, "minGramSize", EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE);
    maxGramSize = getInt(args, "maxGramSize", EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory) {
    if (luceneMatchVersion.onOrAfter(Version.LUCENE_4_4_0)) {
      return new EdgeNGramTokenizer(factory, minGramSize, maxGramSize);
    }
    return new Lucene43EdgeNGramTokenizer(factory, minGramSize, maxGramSize);
  }
}

This is a restricted version of the N-Gram tokenizer above: grams are taken only from the left (or right) edge, so the starting position never moves. That makes it a very good fit for autocomplete suggestions; a sketch follows.
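
A minimal front-edge sketch under the same assumptions as before (printTokens is the hypothetical helper, not part of Lucene):

import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;

// minGramSize=2, maxGramSize=5; grams are anchored at the start of the text.
EdgeNGramTokenizer tokenizer = new EdgeNGramTokenizer(2, 5);
TokenizerDemo.printTokens(tokenizer, "babaloo");
// Expected output: ba, bab, baba, babal
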
ICU Tokenizer

This tokenizer processes multilingual text and can be configured with per-script tokenization rules.

Factory class: solr.ICUTokenizerFactory
Arguments:
rulefiles: a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.
Example:
<analyzer>
<!-- no customization -->
<tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"
rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
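
The ICU tokenizer lives in a separate analysis module, so using it outside of Solr requires an extra artifact alongside the dependencies listed at the top of these notes. The coordinates below are my assumption, matching the 5.5.0 version used above:

<!-- ICU analysis module (assumed coordinates, same version as lucene-analyzers-common) -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-icu</artifactId>
    <version>5.5.0</version>
</dependency>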

Path Hierarchy Tokenizer

This tokenizer creates synonyms from file path hierarchies.
Factory class: solr.PathHierarchyTokenizerFactory
Arguments:
delimiter: (character, no default) You can specify the file path delimiter and replace it with a delimiter you
provide. This can be useful for working with backslash delimiters.
replace: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.
Example:
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\"
replace="/"/>
</analyzer>
</fieldType>
In: "c:\usr\local\apache"
Out: "c:", "c:/usr",
"c:/usr/local", "c:/usr/local/apache"

This tokenizer works on path hierarchies, replacing the configured path delimiter with the given replacement character.
Judging from the output, the path is effectively split on the delimiter and then reassembled into progressively longer prefixes, as the sketch below shows.
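
This is not the Lucene implementation, just a minimal plain-Java sketch of the split-and-reassemble idea described above (the method name pathPrefixTokens is my own):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Splits the path on the delimiter and emits every prefix, joined with the replacement.
static List<String> pathPrefixTokens(String path, char delimiter, char replace) {
  String[] parts = path.split(Pattern.quote(String.valueOf(delimiter)));
  List<String> tokens = new ArrayList<>();
  StringBuilder prefix = new StringBuilder();
  for (int i = 0; i < parts.length; i++) {
    if (i > 0) {
      prefix.append(replace);
    }
    prefix.append(parts[i]);
    tokens.add(prefix.toString());
  }
  return tokens;
}

// pathPrefixTokens("c:\\usr\\local\\apache", '\\', '/')
//   -> [c:, c:/usr, c:/usr/local, c:/usr/local/apache]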

Regular Expression Pattern Tokenizer

This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression
provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match
patterns that should be extracted from the text as tokens.
See the Javadocs for java.util.regex.Pattern for
more information on Java regular expression syntax.
Factory class: solr.PatternTokenizerFactory
Arguments:
pattern: (Required) The regular expression, as defined by in java.util.regex.Pattern.
group: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the
regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that
character sequences matching that regex group should be converted to tokens. Group zero refers to the entire
regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.
Example:
A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or
more spaces.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
In: "fee,fie,
foe , fum, foo"
Out: "fee",
"fie", "foe", "fum", "foo"
Example:
Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of
either case is extracted as a token.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*"
group="0"/>
</analyzer>
In: "Hello.
My name is Inigo Montoya. You killed my father. Prepare to die."
Out: "Hello",
"My", "Inigo", "Montoya", "You", "Prepare"
Example:
Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional
semi-colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups
are numbered by counting left parenthesis from left to right. Group 3 is the subexpression "[0-9-]+", which
matches one or more digits or hyphens.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory"
pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/>
</analyzer>
In: "SKU: 1234,
Part Number 5678, Part: 126-987"
Out: "1234", "5678",
"126-987"

This tokenizer splits with a regular expression and takes two arguments, which give three cases:
pattern: (Required) a regular expression in Java syntax
group: (Optional, default -1) the capture group number

When group is the default -1, the regular expression is treated as a delimiter: the text is split wherever it matches and the pieces are returned as tokens.
When group is 0, every complete match of the regular expression is returned as a token.
When group is greater than 0, the content of the corresponding capture group of each match is returned as a token. The sketch below illustrates the three cases with java.util.regex.
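
A minimal java.util.regex sketch of the three group settings; it only mimics the tokenizer's behavior and is not the Solr implementation:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "SKU: 1234, Part Number 5678, Part: 126-987";

// group = -1: the pattern is a delimiter, the split pieces are the tokens.
String[] split = text.split("\\s*,\\s*");
// -> ["SKU: 1234", "Part Number 5678", "Part: 126-987"]

// group = 0: every complete match of the pattern becomes a token.
Matcher whole = Pattern.compile("[A-Z][A-Za-z]*").matcher(text);
while (whole.find()) {
  System.out.println(whole.group(0));   // SKU, Part, Number, Part
}

// group = 3: only capture group 3 of each match becomes a token.
Matcher part = Pattern.compile("(SKU|Part(\\sNumber)?):?\\s([0-9-]+)").matcher(text);
while (part.find()) {
  System.out.println(part.group(3));    // 1234, 5678, 126-987
}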

UAX29 URL Email Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter
characters are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the token.
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
Recognizes and preserves as single tokens the following:
  Internet domain names containing top-level domains validated against the white list in the IANA Root Zone Database when the tokenizer was generated
  email addresses
  file://, http(s)://, and ftp:// URLs
  IPv4 and IPv6 addresses
The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
Factory class: solr.UAX29URLEmailTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by
maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
In: "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"
Out: "Visit", "http://accarol.com/contact.htm?from=external&a=10",
"or", "e", "mail", "bob.cratchet@accarol.com"

Whitespace and punctuation are treated as delimiters and are discarded, except that the following are recognized and kept as single tokens:

domain names
email addresses
IPv4 and IPv6 addresses
file://, http(s)://, and ftp:// URLs

A usage sketch follows.
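
A minimal sketch under the same assumptions as the earlier examples (printTokens is the hypothetical helper defined at the top of these notes):

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;

UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer();
tokenizer.setMaxTokenLength(255);  // corresponds to the maxTokenLength argument
TokenizerDemo.printTokens(tokenizer,
    "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com");
// Expected output: Visit, http://accarol.com/contact.htm?from=external&a=10, or, e, mail, bob.cratchet@accarol.com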

White Space Tokenizer

Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters
as tokens. Note that any punctuation will be
included in the tokens.
Factory class: solr.WhitespaceTokenizerFactory
Arguments:
rule: Specifies how to define whitespace for the purpose of tokenization. Valid values:
java: (Default) Uses Character.isWhitespace(int)
unicode: Uses Unicode's WHITESPACE property
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
</analyzer>
In: "To be, or
what?"
Out: "To", "be,",
"or", "what?"

Only whitespace is removed; every run of non-whitespace characters (punctuation included) is returned as a token.
The single rule argument controls how whitespace is defined (java or unicode). A sketch follows.
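
A minimal sketch, again relying on the hypothetical printTokens helper; WhitespaceTokenizer here corresponds to the default rule="java" behavior:

import org.apache.lucene.analysis.core.WhitespaceTokenizer;

WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
TokenizerDemo.printTokens(tokenizer, "To be, or what?");
// Expected output: To, be,, or, what?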

Related Topics
TokenizerFactories

That wraps up the tokenizer section. A quick recap of the tokenizers covered:
Standard Tokenizer: splits on whitespace and punctuation
Classic Tokenizer: splits on whitespace and punctuation, but keeps email addresses and Internet domain names intact
Keyword Tokenizer: treats the whole text as a single token, nothing is split
Letter Tokenizer: discards all non-letter characters
Lower Case Tokenizer: discards non-letters and lowercases all letters
N-Gram Tokenizer: sliding-window grams, very useful when applied well
Edge N-Gram Tokenizer: grams anchored at one edge, good for autocomplete
ICU Tokenizer: multilingual text
Path Hierarchy Tokenizer: converts path delimiters and emits hierarchy prefixes
Regular Expression Pattern Tokenizer: obtains tokens via a regular expression
UAX29 URL Email Tokenizer: standard-style tokenization that preserves URLs, email addresses, and IP addresses
White Space Tokenizer: splits on whitespace only

Next up: the filter section.
Tags: solr, documentation