ElasticStack Series, Part 13 & Autocomplete Suggestion Strategy
2017-10-09 19:20
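Business requirements
1. Implement prefix search for the search engine: Chinese prefix queries, full-pinyin prefix queries, and pinyin-initials prefix queries.
2. Implement full-text search over article summaries with title boosting: the title carries a higher weight and the content a relatively lower one, results are sorted by relevance, and the 20 most relevant articles are listed.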
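Prefix search
Chinese search:
1. Searching "刘" matches "刘德华", "刘斌", and "刘德志".
2. Searching "刘德" matches "刘德华" and "刘德志".
Summary: the search text must be a prefix of every name it matches in the collection.
Full-pinyin search:
1. Searching "li" matches "刘德华", "刘斌", and "刘德志".
2. Searching "liud" matches "刘德华" and "刘德志".
3. Searching "liudeh" matches "刘德华".
Summary: the search text must be a prefix of the full-pinyin form of every name it matches.
Pinyin-initials search:
1. Searching "w" matches "我是中国人" and "我爱我的祖国".
2. Searching "wszg" matches "我是中国人".
Summary: the first pinyin letter of each character is concatenated into one string, and the search text must be a prefix of that string.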
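The three matching rules above can be sketched in plain Java. This is only an illustration: the class name `PrefixRules` and the hand-written pinyin strings are assumptions made for the example (in the article, the pinyin forms are produced by the pinyin analysis plugin), and it says nothing about how Elasticsearch evaluates the match internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the three prefix rules; not Elasticsearch code.
final class PrefixRules {

    /** One searchable name with its hand-written (assumed) pinyin forms. */
    static final class Entry {
        final String text;        // original Chinese text
        final String fullPinyin;  // joined full pinyin, e.g. "liudehua"
        final String initials;    // first letter of each syllable, e.g. "ldh"

        Entry(String text, String fullPinyin, String initials) {
            this.text = text;
            this.fullPinyin = fullPinyin;
            this.initials = initials;
        }
    }

    /** The names used in the examples above. */
    static List<Entry> sample() {
        return Arrays.asList(
            new Entry("刘德华", "liudehua", "ldh"),
            new Entry("刘斌", "liubin", "lb"),
            new Entry("刘德志", "liudezhi", "ldz"),
            new Entry("我是中国人", "woshizhongguoren", "wszgr"),
            new Entry("我爱我的祖国", "woaiwodezuguo", "wawdzg"));
    }

    /** Returns every entry for which the query is a prefix of any of its three forms. */
    static List<String> search(List<Entry> entries, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Entry e : entries) {
            if (e.text.startsWith(q) || e.fullPinyin.startsWith(q) || e.initials.startsWith(q)) {
                hits.add(e.text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String q : Arrays.asList("刘德", "liud", "wszg")) {
            System.out.println(q + " -> " + search(sample(), q));
        }
    }
}
```

Each entry is reported at most once even when the query is a prefix of more than one of its forms, as happens with a single-letter query such as "l".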
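Solutions
Option 1: store the Chinese text, the pinyin initials, and the full pinyin of the field to be searched "like"-style in three separate index fields, none of them analyzed. This is the simplest and most direct approach: at search time, run a wildcard ("like") query against each of the three fields and return the matching records.
Pros: the storage format is simple, and the inverted index holds the least data.
Cons: "like"-style matching is expensive at search time; a prefix query costs far more than a term query.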
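Option 2:
Derive three sub-fields from a single field (multi-field) to store the Chinese text, its pinyin initials, and its full pinyin, and use an edge_ngram token filter in the analyzer to split the tokens further before they are written to the inverted index. At search time, query all of the fields.
Pros: the format is compact; search uses exact term matching and the query input needs no analysis, so queries are efficient.
Cons: it trades space for time, although that is acceptable for an index. Storing derived fields makes both storage and search more complex, and when three fields are searched their relevance scores are added together, which easily muddles the relevance ranking.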
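Option 3:
Store the data in a single field that is not split into words (keyword). At index time, generate all three kinds of inverted-index entries, and use edge_ngram to break those entries down further so that they cover the business scenario. The query input is not analyzed at search time.
Pros: storage is simple, and search needs only an exact term query against one field, so queries are extremely efficient.
Cons: it also trades space for time, though it uses less extra space than Option 2, which is acceptable for index data.
Elasticsearch offers many other ways to handle this scenario. These three are the more typical ones, and Option 3 is the one chosen here.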
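The idea behind Option 3, expanding each form into its edge n-grams at index time so that a prefix search becomes an exact term lookup, can be sketched as follows. This is a toy in-memory model under stated assumptions: the class `EdgeNgramIndex` is hypothetical, the three forms of each name are passed in by hand rather than produced by the pinyin plugin, and prefix length is counted in Java `char`s, which is adequate for the BMP characters used here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of "edge n-grams at index time, exact match at query time".
final class EdgeNgramIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();
    private final int minGram;
    private final int maxGram;

    EdgeNgramIndex(int minGram, int maxGram) {
        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    /** All prefixes of token whose length lies between minGram and maxGram. */
    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, token.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    /** Indexes a document under every edge n-gram of every (lowercased) form. */
    void add(String docId, String... forms) {
        for (String form : forms) {
            for (String gram : edgeNgrams(form.toLowerCase(Locale.ROOT), minGram, maxGram)) {
                postings.computeIfAbsent(gram, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Exact lookup: the query is used as-is, with no analysis or lowercasing. */
    Set<String> term(String query) {
        return postings.getOrDefault(query, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        EdgeNgramIndex index = new EdgeNgramIndex(1, 15);
        index.add("刘德华", "刘德华", "liudehua", "ldh");
        index.add("刘德志", "刘德志", "liudezhi", "ldz");
        System.out.println(index.term("liud"));
        System.out.println(index.term("ldh"));
    }
}
```

The space-for-time trade-off is visible in the model: every form contributes up to `maxGram` entries to `postings`, while a lookup is a single map access. Because `term` does not normalize its input, an uppercase query finds nothing, and neither does a query longer than `maxGram`.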
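Preparation
Installing the pinyin analysis plugin and understanding its parameters
Using Elasticsearch edge_ngram
Using Elasticsearch multi-field
Familiarity with Elasticsearch's various query features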
Code
myself_settings.json:

    {
      "refresh_interval": "2s",
      "number_of_replicas": 1,
      "number_of_shards": 2,
      "analysis": {
        "filter": {
          "autocomplete_filter": {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 15
          },
          "pinyin_first_letter_and_full_pinyin_filter": {
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_full_pinyin": false,
            "keep_joined_full_pinyin": true,
            "keep_none_chinese": false,
            "keep_original": false,
            "limit_first_letter_length": 16,
            "lowercase": true,
            "trim_whitespace": true,
            "keep_none_chinese_in_first_letter": true
          },
          "full_pinyin_filter": {
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_full_pinyin": false,
            "keep_joined_full_pinyin": true,
            "keep_none_chinese": false,
            "keep_original": true,
            "limit_first_letter_length": 16,
            "lowercase": true,
            "trim_whitespace": true,
            "keep_none_chinese_in_first_letter": true
          }
        },
        "analyzer": {
          "full_prefix_analyzer": {
            "type": "custom",
            "char_filter": [ "html_strip" ],
            "tokenizer": "keyword",
            "filter": [ "lowercase", "full_pinyin_filter", "autocomplete_filter" ]
          },
          "chinese_analyzer": {
            "type": "custom",
            "char_filter": [ "html_strip" ],
            "tokenizer": "keyword",
            "filter": [ "lowercase", "autocomplete_filter" ]
          },
          "pinyin_analyzer": {
            "type": "custom",
            "char_filter": [ "html_strip" ],
            "tokenizer": "keyword",
            "filter": [ "pinyin_first_letter_and_full_pinyin_filter", "autocomplete_filter" ]
          }
        }
      }
    }
myself_mapping.json:

    {
      "test_type": {
        "properties": {
          "full_name": {
            "type": "text",
            "analyzer": "full_prefix_analyzer"
          },
          "age": {
            "type": "integer"
          }
        }
      }
    }
Project layout:
![](https://images2017.cnblogs.com/blog/980882/201710/980882-20171009191714262-862727332.png)
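Test project code:

    import java.util.ArrayList;
    import java.util.List;

    import org.elasticsearch.client.transport.TransportClient;
    import org.junit.Test; // assuming JUnit 4

    // ESConnect, BaseIndex, BulkOperation and BulkInsert appear to be the project's
    // own helper classes; their sources are not listed in this article.
    // The JSON files are listed above as myself_settings.json / myself_mapping.json;
    // the paths, index name and type name below are kept as in the original test.
    public class PrefixTest {

        @Test
        public void testCreateIndex() throws Exception {
            TransportClient client = ESConnect.getInstance().getTransportClient();
            // Create the index with its settings
            BaseIndex.createWithSetting(client, "baidu_index", "esjson/baidu_settings.json");
            // Define the type and its detailed field mapping
            BaseIndex.createMapping(client, "baidu_index", "baidu_type", "esjson/baidu_mapping.json");
        }

        @Test
        public void testBulkInsert() throws Exception {
            TransportClient client = ESConnect.getInstance().getTransportClient();
            List<Object> list = new ArrayList<>();
            list.add(new BulkInsert(12L, "我们都有一个家名字叫中国", 12));
            list.add(new BulkInsert(13L, "兄弟姐妹都很多景色也不错 ", 13));
            list.add(new BulkInsert(14L, "家里盘着两条龙是长江与黄河", 14));
            list.add(new BulkInsert(15L, "还有珠穆朗玛峰儿是最高山坡", 15));
            list.add(new BulkInsert(16L, "我们都有一个家名字叫中国", 16));
            list.add(new BulkInsert(17L, "兄弟姐妹都很多景色也不错", 17));
            list.add(new BulkInsert(18L, "看那一条长城万里在云中穿梭", 18));
            // Bulk-index the documents and print whether the request succeeded
            boolean flag = BulkOperation.batchInsert(client, "baidu_index", "baidu_type", list);
            System.out.println(flag);
        }
    }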
Next, check what the analyzer defined above produces:
http://192.168.20.114:9200/baidu_index/_analyze?text=刘德华AT2016&analyzer=full_prefix_analyzer
The response is:

    {
      "tokens": [
        { "token": "刘", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "刘德", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "刘德华", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "刘德华a", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "刘德华at", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "刘德华at2", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "刘德华at20", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "刘德华at201", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "刘德华at2016", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "l", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "li", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "liu", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "liud", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "liude", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "liudeh", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "liudehu", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "liudehua", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "l", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "ld", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "ldh", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "ldha", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "ldhat", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "ldhat2", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "ldhat20", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "ldhat201", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 },
        { "token": "ldhat2016", "start_offset": 0, "end_offset": 9, "type": "word", "position": 0 }
      ]
    }
If you see the result above, everything is working!
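With the tokens indexed this way, the search side only needs an exact term query. The sketch below builds such a request body as a plain string. It is an assumption-laden illustration rather than the author's code: the class `PrefixTermQuery` is hypothetical, the field name `full_name` is taken from the mapping above, the size of 20 comes from the "top 20" requirement, and the body is assembled by hand instead of through the Elasticsearch client. Because a term query does not analyze its input while the indexed tokens are lowercased, the input is trimmed and lowercased on the client side, and inputs longer than `max_gram` (15 in the settings above) are flagged, since no indexed token is that long.

```java
import java.util.Locale;

// Hypothetical helper that builds a term-query body for the prefix field.
final class PrefixTermQuery {
    // Mirrors "max_gram" in the settings above.
    static final int MAX_GRAM = 15;

    /** Trim and lowercase the input, because the term query itself does no analysis. */
    static String normalize(String input) {
        return input.trim().toLowerCase(Locale.ROOT);
    }

    /** False for empty input or input longer than the longest indexed edge n-gram. */
    static boolean canMatch(String input) {
        String q = normalize(input);
        return !q.isEmpty() && q.codePointCount(0, q.length()) <= MAX_GRAM;
    }

    /** Builds the search body; only backslashes and quotes are escaped (simplified). */
    static String build(String field, String input, int size) {
        String q = normalize(input).replace("\\", "\\\\").replace("\"", "\\\"");
        return String.format(Locale.ROOT,
            "{\"size\":%d,\"query\":{\"term\":{\"%s\":\"%s\"}}}", size, field, q);
    }

    public static void main(String[] args) {
        String input = "LiuD";
        if (canMatch(input)) {
            // The resulting JSON would be sent as the body of a _search request.
            System.out.println(build("full_name", input, 20));
        }
    }
}
```

A production version would more likely use the client's query builders or a JSON library, which handle escaping fully.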