Solr Getting Started: Reading Notes on the Official 6.0 Documentation, Part 7
2016-06-15 19:17
Part 3: Understanding Analyzers, Tokenizers, and Filters
Tokenizers

Related Maven dependencies:

```xml
<!-- analyzer start -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>5.5.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>5.5.0</version>
</dependency>
<!-- analyzer end -->
```
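The official examples below are Solr schema snippets, but with the Lucene artifacts above on the classpath you can also drive the underlying tokenizer classes directly. Here is a minimal sketch of the standard TokenStream consumption loop; the TokenizerDemo class and the printTokens name are my own, for illustration:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {

    /** Feeds text to any Lucene Tokenizer and prints each emitted token. */
    static void printTokens(Tokenizer tokenizer, String text) throws IOException {
        tokenizer.setReader(new StringReader(text));
        // The term attribute exposes the characters of the current token.
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();                    // required before incrementToken()
        while (tokenizer.incrementToken()) {  // advance to the next token
            System.out.println(term.toString());
        }
        tokenizer.end();                      // finalize offsets
        tokenizer.close();                    // release resources
    }
}
```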
Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

1. Periods (dots) are kept as part of the token, so IP addresses are not split apart.
2. The "@" character is discarded, so email addresses are split.

Factory class: solr.StandardTokenizerFactory

Arguments:

maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.

Example:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
```

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"
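Assuming the hypothetical TokenizerDemo helper above, the same In/Out can be reproduced against the Lucene class directly (StandardTokenizer lives in lucene-core, which the artifacts above pull in transitively):

```java
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StandardTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // Expected tokens: Please, email, john.doe, foo.com, by, 03, 09, re, m37, xq
        TokenizerDemo.printTokens(new StandardTokenizer(),
                "Please, email john.doe@foo.com by 03-09, re: m37-xq.");
    }
}
```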
Classic Tokenizer

Similar to the Standard Tokenizer, but it does not use the Unicode standard annex UAX#29 word-boundary rules. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

1. Text is split on whitespace.
2. Words containing a number are not split at hyphens.
3. Email addresses and IP addresses/domain names are preserved as single tokens.

Factory class: solr.ClassicTokenizerFactory

Arguments:

maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.

Example:

```xml
<analyzer>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
```

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Out: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"
Keyword Tokenizer

This tokenizer treats the entire text field as a single token; the input text is processed as one unit.

Factory class: solr.KeywordTokenizerFactory

Arguments: None

Example:

```xml
<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
```

In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Out: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Letter Tokenizer

This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.

Factory class: solr.LetterTokenizerFactory

Arguments: None

Example:

```xml
<analyzer>
  <tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
```

In: "I can't."

Out: "I", "can", "t"
Lower Case Tokenizer

Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded.

Factory class: solr.LowerCaseTokenizerFactory

Arguments: None

Example:

```xml
<analyzer>
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
```

In: "I just LOVE my iPhone!"

Out: "i", "just", "love", "my", "iphone"
N-Gram Tokenizer

Reads the field text and generates n-gram tokens of sizes in the given range.

Factory class: solr.NGramTokenizerFactory

Arguments:

minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default 2) The maximum n-gram size, must be >= minGramSize.

Example: Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the encoding.

```xml
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
```

In: "hey man"

Out: "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"

Example: With an n-gram size range of 4 to 5:

```xml
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
```

In: "bicycle"

Out: "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"
If I set the minimum to 2 and the maximum to 4, tokenizing "这是一个测试分词" gives:

2-grams: 这是 是一 一个 个测 测试 试分 分词
3-grams: 这是一 是一个 一个测 个测试 测试分 试分词
4-grams: 这是一个 是一个测 一个测试 个测试分 测试分词

In other words, starting from the first position, it slices substrings of the minimum size all the way to the end of the text, then increases the size by one and slices from the beginning again, until the maximum size is reached (a sketch of this enumeration follows below). This can be used to build search suggestion (autocomplete) features, though it has a drawback: the slices do not all start at the beginning of the text, so suggestions can also match from the middle of a word.
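To make that enumeration order concrete, here is a plain-Java sketch of the slicing described above. It is my own illustration of the size-then-position order shown in these examples, not Lucene's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

    /** Enumerates n-grams size-first, matching the order shown above. */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            // slide a window of the current size across the whole text
            for (int start = 0; start + size <= text.length(); start++) {
                grams.add(text.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints: [这是, 是一, 一个, 个测, 测试, 试分, 分词, 这是一, ...]
        System.out.println(ngrams("这是一个测试分词", 2, 4));
    }
}
```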
Edge N-Gram Tokenizer

Reads the field text and generates edge n-gram tokens of sizes in the given range.

Factory class: solr.EdgeNGramTokenizerFactory

Arguments:

minGramSize: (integer, default is 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default is 1) The maximum n-gram size, must be >= minGramSize.
side: ("front" or "back", default is "front") Whether to compute the n-grams from the beginning (front) of the text or from the end (back).

Example: Default behavior (min and max default to 1):

```xml
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
```

In: "babaloo"

Out: "b"

Example: Edge n-gram range of 2 to 5:

```xml
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
```

In: "babaloo"

Out: "ba", "bab", "baba", "babal"

Example: Edge n-gram range of 2 to 5, from the back side:

```xml
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5" side="back"/>
</analyzer>
```

In: "babaloo"

Out: "oo", "loo", "aloo", "baloo"
Perfect for building autocomplete suggestions.
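For example, reusing the hypothetical TokenizerDemo helper from earlier. Note a caveat: as far as I can tell, the Lucene 5.x EdgeNGramTokenizer class itself only produces front-side grams; the side option documented above belongs to the Solr factory and may not be honored by this class.

```java
import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;

public class EdgeNGramDemo {
    public static void main(String[] args) throws Exception {
        // Front edge n-grams of sizes 2..5: ba, bab, baba, babal
        TokenizerDemo.printTokens(new EdgeNGramTokenizer(2, 5), "babaloo");
    }
}
```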
ICU Tokenizer

Configurable tokenization for multilingual text, with per-script rules.

Factory class: solr.ICUTokenizerFactory

Arguments:

rulefiles: a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.

Example:

```xml
<analyzer>
  <!-- no customization -->
  <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
```

```xml
<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
```
Path Hierarchy Tokenizer

This tokenizer creates synonyms from file path hierarchies.

Factory class: solr.PathHierarchyTokenizerFactory

Arguments:

delimiter: (character, no default) You can specify the file path delimiter and replace it with a delimiter you provide. This can be useful for working with backslash delimiters.
replace: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.

Example:

```xml
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>
```

In: "c:\usr\local\apache"

Out: "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"

This tokenizer handles hierarchical directory paths, replacing the given path delimiter with the one you specify. Judging from the output, it effectively splits on the delimiter and then reassembles progressively longer prefixes.
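A hypothetical plain-Java sketch of that split-then-reassemble behavior (my own illustration, not the actual Lucene code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PathHierarchySketch {

    /** Splits on the delimiter, then re-joins progressively longer prefixes. */
    static List<String> pathTokens(String path, char delimiter, char replace) {
        String[] parts = path.split(Pattern.quote(String.valueOf(delimiter)));
        List<String> tokens = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) prefix.append(replace); // emit with the replacement delimiter
            prefix.append(parts[i]);
            tokens.add(prefix.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [c:, c:/usr, c:/usr/local, c:/usr/local/apache]
        System.out.println(pathTokens("c:\\usr\\local\\apache", '\\', '/'));
    }
}
```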
Regular Expression Pattern Tokenizer

This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens. See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.

Factory class: solr.PatternTokenizerFactory

Arguments:

pattern: (Required) The regular expression, as defined by java.util.regex.Pattern.
group: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.

Example: A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.

```xml
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
```

In: "fee,fie, foe , fum, foo"

Out: "fee", "fie", "foe", "fum", "foo"

Example: Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.

```xml
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
```

In: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

Out: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"

Example: Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional semi-colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parenthesis from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.

```xml
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>
</analyzer>
```

In: "SKU: 1234, Part Number 5678, Part: 126-987"

Out: "1234", "5678", "126-987"

To summarize: this tokenizer splits with a Java regular expression and takes two arguments, giving three cases.

pattern: (Required) A Java regular expression.
group: (Optional, default -1) The capture-group number to extract.

- When group is the default -1, the regex is treated as a delimiter: the text is split on matches and the pieces are returned as tokens (see the sketch below).
- When group is 0, each full regex match is returned as a token.
- When group is greater than 0, the text matched by that capture group within each match is returned as a token.
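The three group cases map directly onto java.util.regex. This sketch (class name mine) shows the delimiter case and the capture-group case using the examples above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternTokenizerSketch {
    public static void main(String[] args) {
        // group = -1: the pattern acts as a delimiter, like String.split()
        // Prints: fee | fie | foe | fum | foo
        System.out.println(String.join(" | ",
                "fee,fie, foe , fum, foo".split("\\s*,\\s*")));

        // group = 3: keep only what the third capture group matched
        Pattern p = Pattern.compile("(SKU|Part(\\sNumber)?):?\\s([0-9-]+)");
        Matcher m = p.matcher("SKU: 1234, Part Number 5678, Part: 126-987");
        while (m.find()) {
            System.out.println(m.group(3)); // prints 1234, 5678, 126-987
        }
    }
}
```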
UAX29 URL Email Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

- Periods (dots) that are not followed by whitespace are kept as part of the token.
- Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
- Recognizes and preserves as single tokens: Internet domain names containing top-level domains validated against the white list in the IANA Root Zone Database when the tokenizer was generated; email addresses; file://, http(s)://, and ftp:// URLs; IPv4 and IPv6 addresses.

The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.

Factory class: solr.UAX29URLEmailTokenizerFactory

Arguments:

maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.

Example:

```xml
<analyzer>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
```

In: "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"

Out: "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"

In short: whitespace and punctuation act as delimiters and are discarded, except that domain names, email addresses, IP addresses, and file://, http(s)://, and ftp:// URLs are preserved as single tokens.
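Again reusing the hypothetical TokenizerDemo helper, the URL and email preservation is easy to check (UAX29URLEmailTokenizer ships in the lucene-analyzers-common artifact declared above):

```java
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;

public class UrlEmailDemo {
    public static void main(String[] args) throws Exception {
        // The URL and the email address each come back as a single token.
        TokenizerDemo.printTokens(new UAX29URLEmailTokenizer(),
                "Visit http://accarol.com/contact.htm?from=external&a=10"
                        + " or e-mail bob.cratchet@accarol.com");
    }
}
```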
White Space Tokenizer

Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation will be included in the tokens.

Factory class: solr.WhitespaceTokenizerFactory

Arguments:

rule: Specifies how to define whitespace for the purpose of tokenization. Valid values:

- java: (Default) Uses Character.isWhitespace(int)
- unicode: Uses Unicode's WHITESPACE property

Example:

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
</analyzer>
```

In: "To be, or what?"

Out: "To", "be,", "or", "what?"

It only splits on whitespace; any other run of consecutive characters is returned as a token. The single rule argument, with its two possible values, controls how whitespace is defined.
Related Topics: TokenizerFactories

That wraps up the tokenizer section. A quick recap of the tokenizers covered:

- Standard Tokenizer: splits on whitespace and punctuation
- Classic Tokenizer: like Standard, but preserves IP/domain names and email addresses
- Keyword Tokenizer: treats the whole text as a single token
- Letter Tokenizer: discards everything that is not a letter
- Lower Case Tokenizer: discards non-letters and lowercases all letters
- N-Gram Tokenizer: sliding-window slicing, very useful when applied well
- Edge N-Gram Tokenizer: edge-anchored slicing, good for autocomplete suggestions
- ICU Tokenizer: multilingual script handling
- Path Hierarchy Tokenizer: converts path delimiters and emits hierarchy prefixes
- Regular Expression Pattern Tokenizer: extracts tokens with a regex
- UAX29 URL Email Tokenizer: standard tokenization that preserves URLs, emails, and IPs
- White Space Tokenizer: splits on whitespace only

Next up: the filters section.