
Installing the ik Chinese Word Segmentation Plugin for Elasticsearch

2015-05-24 18:37

Elasticsearch's default analyzer splits Chinese text into individual characters rather than into the keywords we actually want. For example:

curl -XPOST "http://localhost:9200/userinfo/_analyze?analyzer=standard&pretty=true&text=我是中国人"

This returns the following result:

{
  "tokens": [
    {"token": "text", "start_offset": 2, "end_offset": 6, "type": "<ALPHANUM>", "position": 1},
    {"token": "我", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 2},
    {"token": "是", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 3},
    {"token": "中", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 4},
    {"token": "国", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 5},
    {"token": "人", "start_offset": 13, "end_offset": 14, "type": "<IDEOGRAPHIC>", "position": 6}
  ]
}

This is usually not what we want. We would rather get tokens such as "中国人", "中国", and "我". That requires a Chinese word segmentation plugin, and ik provides exactly this.

elasticsearch-analysis-ik is a Chinese word segmentation plugin that supports custom dictionaries.
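
To make the idea of dictionary-based segmentation concrete, here is a toy Python sketch. It is not ik's actual algorithm: the `DICTIONARY` set and the function name `segment_all_words` are made up for illustration, and the real plugin ships a far larger word list with its own matching rules. The sketch only shows how looking up every substring in a word list can yield overlapping words such as "中国人", "中国", and "国人".

```python
# Hypothetical mini-dictionary for illustration only.
DICTIONARY = {"我", "中国人", "中国", "国人"}


def segment_all_words(text, dictionary=DICTIONARY):
    """Return (word, start, end) for every dictionary word found in text.

    At each start position, longer matches are listed before shorter ones.
    Characters that begin no dictionary word produce no token.
    """
    longest = max(len(word) for word in dictionary)
    matches = []
    for start in range(len(text)):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(longest, len(text) - start), 0, -1):
            candidate = text[start:start + length]
            if candidate in dictionary:
                matches.append((candidate, start, start + length))
    return matches


for word, start, end in segment_all_words("我是中国人"):
    print(word, start, end)
```

With this tiny dictionary the loop prints 我, 中国人, 中国, and 国人 in that order. The offsets are relative to the bare string, so they start at 0.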

Installation steps:

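1. Download the source code from GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Click the "Download ZIP" button at the lower right of the page to download the source archive elasticsearch-analysis-ik-master.zip.

2. Go to the download directory and unpack elasticsearch-analysis-ik-master.zip: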

unzip elasticsearch-analysis-ik-master.zip

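3. Copy the config/ik folder from the unpacked directory into the config folder of the ES installation directory.
4. Because this is source code, it has to be packaged with Maven. Change into the unpacked folder and run: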

mvn clean package

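5. Copy the packaged jar, elasticsearch-analysis-ik-1.2.8.jar, into the lib directory of the ES installation directory. (The build also produces elasticsearch-analysis-ik-1.2.8-sources.jar, but a -sources jar normally holds only source files, so it is probably not the one to copy.)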
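
The two copy operations in steps 3 and 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `install_ik_files` is hypothetical, the jar is assumed to sit in Maven's default `target/` output directory, and the jar file name is passed in by the caller rather than guessed.

```python
import shutil
from pathlib import Path


def install_ik_files(source_dir, es_home, jar_name):
    """Copy config/ik and one built jar into an Elasticsearch installation.

    source_dir: the unpacked source directory, after `mvn clean package` has run.
    es_home:    the ES installation directory.
    jar_name:   file name of the jar to copy.
    """
    source_dir, es_home = Path(source_dir), Path(es_home)

    # Step 3: copy config/ik into <es_home>/config/ik.
    shutil.copytree(source_dir / "config" / "ik",
                    es_home / "config" / "ik",
                    dirs_exist_ok=True)  # dirs_exist_ok needs Python 3.8+

    # Step 5: copy the jar into <es_home>/lib.
    # Assumption: Maven wrote the jar to its default output directory, target/.
    lib_dir = es_home / "lib"
    lib_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source_dir / "target" / jar_name, lib_dir / jar_name))


# Example call (all three arguments are placeholders to adapt):
# install_ik_files("elasticsearch-analysis-ik-master", "/path/to/elasticsearch",
#                  "name-of-the-built.jar")
```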

6. Add the ik settings to the end of the ES configuration file config/elasticsearch.yml:

index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:
        type: ik
        use_smart: false
      ik_smart:
        type: ik
        use_smart: true

or

index.analysis.analyzer.ik.type : "ik"
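
The one-line form uses a dotted key, while the longer form nests the same kind of keys. As a rough illustration of how nested settings correspond to dotted keys, the Python sketch below flattens the nested settings from step 6. The helper name `flatten` is hypothetical, and the sketch only demonstrates the naming relationship. It does not claim that the two configurations in step 6 are equivalent, since the one-line form sets a single key.

```python
def flatten(settings, prefix=""):
    """Turn nested dicts into a flat dict keyed by dot-separated paths."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # descend into the nested section
        else:
            flat[path] = value
    return flat


# The nested settings from step 6, written as a Python dict.
nested = {
    "index": {"analysis": {"analyzer": {
        "ik": {
            "alias": ["ik_analyzer"],
            "type": "org.elasticsearch.index.analysis.IkAnalyzerProvider",
        },
        "ik_max_word": {"type": "ik", "use_smart": False},
        "ik_smart": {"type": "ik", "use_smart": True},
    }}}
}

for dotted_key, value in flatten(nested).items():
    print(f"{dotted_key}: {value}")
```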

7. Restart the elasticsearch service to complete the configuration, then enter the command:

curl -XPOST "http://localhost:9200/userinfo/_analyze?analyzer=ik&pretty=true&text=我是中国人"

The test result is as follows:

{
  "tokens": [
    {"token": "text", "start_offset": 2, "end_offset": 6, "type": "ENGLISH", "position": 1},
    {"token": "我", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 2},
    {"token": "中国人", "start_offset": 11, "end_offset": 14, "type": "CN_WORD", "position": 3},
    {"token": "中国", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 4},
    {"token": "国人", "start_offset": 12, "end_offset": 14, "type": "CN_WORD", "position": 5}
  ]
}
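
For repeated checks, the same request can be scripted. The following Python sketch uses only the standard library. The helper names `analyze_url`, `token_texts`, and `fetch_tokens` are hypothetical, and it assumes the server answers with JSON containing a `tokens` list as in the result above. It also percent-encodes the Chinese text in the query string, which the curl command leaves to the shell and curl.

```python
import json
import urllib.parse
import urllib.request


def analyze_url(base, index, analyzer, text):
    """Build the _analyze URL used above, percent-encoding the query values."""
    query = urllib.parse.urlencode(
        {"analyzer": analyzer, "pretty": "true", "text": text})
    return f"{base}/{index}/_analyze?{query}"


def token_texts(response):
    """Pull just the token strings out of a parsed _analyze response."""
    return [entry["token"] for entry in response["tokens"]]


def fetch_tokens(url):
    """POST to the URL and return the token strings (needs a running server)."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request) as reply:
        return token_texts(json.load(reply))


url = analyze_url("http://localhost:9200", "userinfo", "ik", "我是中国人")
print(url)
# With Elasticsearch running locally: print(fetch_tokens(url))
```

Swapping `"ik"` for `"standard"` in the call gives a quick side-by-side comparison of the two analyzers.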

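Notes:

1. ES plugins are normally installed with the plugin command, but installing ik that way kept failing on my machine, so I built and installed it from source instead.

2. For custom dictionaries, see https://github.com/medcl/elasticsearch-analysis-ik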
