
Processing Chinese with the CoreNLP Python Interface

2017-08-02 19:03

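CoreNLP is an open-source NLP toolkit developed at Stanford. It provides tokenization, part-of-speech tagging, parsing and more, much like spaCy. spaCy claims to be the fastest NLP system currently available and ships with a ready-made Python API, but at the time of writing it does not support Chinese. CoreNLP includes Chinese models and can process Chinese directly, but it is written in Java, so calling it from Python takes a little more work.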

Installation

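Installation is straightforward. From the CoreNLP download page, download the latest CoreNLP archive and the models jar for the language you need. Extract the archive to get the CoreNLP directory, then place the language jar inside that directory.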
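As a quick sanity check after unpacking, the directory can be inspected from Python. This is only a sketch: the helper names are made up for illustration, and the test for the Chinese models jar assumes that its file name contains the word "chinese", which may not hold for every release.

```python
import os


def find_jars(corenlp_dir):
    """Return the sorted names of all .jar files directly inside corenlp_dir."""
    return sorted(name for name in os.listdir(corenlp_dir) if name.endswith(".jar"))


def has_chinese_models(jar_names):
    """Guess from file names alone whether a Chinese models jar is present."""
    # Assumption: the Chinese models jar has "chinese" somewhere in its name.
    return any("chinese" in name.lower() for name in jar_names)


if __name__ == "__main__":
    corenlp_dir = "/path/to/corenlp"  # placeholder, replace with the real directory
    if os.path.isdir(corenlp_dir):
        jars = find_jars(corenlp_dir)
        print(len(jars), "jar files found; Chinese models present:", has_chinese_models(jars))
    else:
        print("Directory not found:", corenlp_dir)
```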

Starting the CoreNLP server

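Because CoreNLP is written in Java, there is no Python package that can be used directly. However, CoreNLP can run as a server that accepts HTTP requests, so a thin Python wrapper is enough to talk to the server and use it much like a native Python package.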

For Chinese, start the CoreNLP server by changing to the CoreNLP directory and running the following command:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000

CoreNLP currently requires JDK 1.8 or later. In the command above, -Xmx4g lets the server use up to 4 GB of memory, and -serverProperties names the properties file, which is packaged inside the Chinese models jar.

After the server has started, the first request is slow because the server has to load the various packages it needs.
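The same command can also be launched from Python. The sketch below only rebuilds the command line shown above as an argument list; the function names and their parameters are illustrative, not part of CoreNLP.

```python
import subprocess


def build_server_command(memory_gb=4, port=9000, timeout_ms=15000,
                         properties_file="StanfordCoreNLP-chinese.properties"):
    """Build the argument list for the server command shown above."""
    return [
        "java", "-Xmx{}g".format(memory_gb),
        # With an argument list there is no shell, so "*" reaches java unexpanded
        # and java itself treats it as "every jar in the current directory".
        "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-serverProperties", properties_file,
        "-port", str(port),
        "-timeout", str(timeout_ms),
    ]


def start_server(corenlp_dir, **options):
    """Launch the server from corenlp_dir and return the process handle."""
    # cwd must be the CoreNLP directory so that the "*" classpath finds the jars.
    return subprocess.Popen(build_server_command(**options), cwd=corenlp_dir)


print(" ".join(build_server_command()))
```

Calling start_server("/path/to/corenlp") would start the process in the background, and calling terminate() on the returned handle stops it again.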

Basic HTTP request

wget --post-data 'The quick brown fox jumped over the lazy dog.' 'localhost:9000/?properties={"annotators":"tokenize,ssplit,pos","outputFormat":"json"}'

This sends an HTTP POST request. The equivalent in Python is:

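import json

import requests

# Address of the running CoreNLP server; replace with your own host
url = 'http://192.168.200.169:9000'
properties = {'annotators': 'tokenize,ssplit,pos', 'outputFormat': 'json'}

# properties must be serialized to a JSON string, because requests
# does not appear to support nested dicts when it builds the URL
params = {'properties': json.dumps(properties)}

data = '天气非常好'

# Encode the text as UTF-8 explicitly so the Chinese characters are sent intact
resp = requests.post(url, data=data.encode('utf-8'), params=params)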
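With outputFormat set to json, the response body is a JSON document containing a list of sentences, each with a list of tokens. The sketch below shows one way to pull the word and POS tag out of such a document. The helper name is hypothetical, and the sample is a hand-written fragment in the shape of the server's output, not something a real server returned.

```python
import json


def extract_word_pos(document):
    """Collect (word, pos) pairs from a parsed CoreNLP JSON document."""
    pairs = []
    for sentence in document.get("sentences", []):
        for token in sentence.get("tokens", []):
            pairs.append((token["word"], token["pos"]))
    return pairs


# Hand-written sample in the shape of the JSON output (not real server output)
sample = json.loads('{"sentences": [{"index": 0, "tokens": ['
                    '{"index": 1, "word": "fox", "pos": "NN"}, '
                    '{"index": 2, "word": "jumped", "pos": "VBD"}]}]}')

for word, pos in extract_word_pos(sample):
    print(word, pos)
```

With a live server, the same helper could be applied to the response of the request above as extract_word_pos(resp.json()).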

Official Python interface

The CoreNLP team also provides an official Python wrapper: https://github.com/stanfordnlp/python-stanford-corenlp

git clone https://github.com/stanfordnlp/python-stanford-corenlp

Then, inside the python-stanford-corenlp directory, run

sudo python setup.py install

to complete the installation.

Setting the JAVANLP_HOME environment variable

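This Python interface is not a complete CoreNLP implementation in Python. It only wraps the approach described above: start the server, then send HTTP requests from a client. Underneath, it still depends on the CoreNLP server running in a JVM. The server can be started locally when the code runs, so the program has to know where the Java CoreNLP directory is. To avoid passing that path every time, the code reads it from an environment variable named JAVANLP_HOME.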

So add the JAVANLP_HOME environment variable to ~/.bashrc or ~/.bash_profile:

JAVANLP_HOME="/path/to/corenlp"
export JAVANLP_HOME
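Before starting the client, it can help to confirm that the variable is visible to Python, since a shell that was opened before the file was edited will not see it. A minimal sketch follows; the function name is made up, and it takes an optional mapping so it can be tried without touching the real environment.

```python
import os


def resolve_corenlp_home(environ=None):
    """Return the directory named by JAVANLP_HOME, or raise if it is not set."""
    environ = os.environ if environ is None else environ
    home = environ.get("JAVANLP_HOME")
    if not home:
        raise RuntimeError("JAVANLP_HOME is not set; add it to ~/.bashrc and reload the shell")
    return home


# Demonstration with an explicit mapping instead of the real environment
print(resolve_corenlp_home({"JAVANLP_HOME": "/path/to/corenlp"}))
```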

Modifying the code to handle Chinese

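Processing Chinese requires a few more changes, though. You can fork the project to your own GitHub account and modify it there, so that later you can simply clone your modified fork wherever you need it.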

The change goes in python-stanford-corenlp/corenlp/client.py, in the start_cmd command that launches the server inside the __init__ method of CoreNLPClient. The original code is:

start_cmd = "{javanlp}/bin/javanlp.sh edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port {port} -timeout {timeout}".format(
    javanlp=os.getenv("JAVANLP_HOME"),
    port=port,
    timeout=timeout)

Change it to:

start_cmd = 'java -Xmx{memory}g -cp "{javanlp}/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port {port} -timeout {timeout}'.format(
    memory=allocate_mem,
    javanlp=os.getenv("JAVANLP_HOME"),
    port=port,
    timeout=timeout)

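The original start_cmd is fairly rigid, and, possibly because I downloaded a different CoreNLP version, my CoreNLP directory contains neither a bin directory nor a javanlp.sh script. I therefore call java directly with java -Xmx{memory}g -cp "{javanlp}/*", where the memory parameter sets how much memory the server may use.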

The -serverProperties argument is added so that Chinese can be processed. The modified __init__ method is as follows:

    DEFAULT_ANNOTATORS = "tokenize ssplit pos ner depparse".split()
    DEFAULT_PROPERTIES = {}

    def __init__(self, start_server=True, endpoint="http://localhost:9000",
                 timeout=5000, annotators=DEFAULT_ANNOTATORS, properties=DEFAULT_PROPERTIES, allocate_mem=4):
        if start_server:
            host, port = urlparse(endpoint).netloc.split(":")
            assert host == "localhost", "If starting a server, endpoint must be localhost"

            assert os.getenv("JAVANLP_HOME") is not None, "Please define $JAVANLP_HOME where your CoreNLP Java checkout is"
            start_cmd = 'java -Xmx{memory}g -cp "{javanlp}/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port {port} -timeout {timeout}'.format(
                memory=allocate_mem,
                javanlp=os.getenv("JAVANLP_HOME"),
                port=port,
                timeout=timeout)
            stop_cmd = None
        else:
            start_cmd = stop_cmd = None

        super(CoreNLPClient, self).__init__(start_cmd, stop_cmd, endpoint)
        self.default_annotators = annotators
        self.default_properties = properties

I also removed lemma from DEFAULT_ANNOTATORS, because lemmatization is of no use when processing Chinese.
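To see what the modified start_cmd expands to without installing or patching anything, the string construction can be tried on its own. The standalone function below is a hypothetical helper written for this illustration; it mirrors the endpoint parsing and the format call from __init__ but takes the CoreNLP directory as an argument instead of reading JAVANLP_HOME.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def build_start_cmd(javanlp_home, endpoint="http://localhost:9000", timeout=5000, allocate_mem=4):
    """Reproduce the command string built inside the modified __init__."""
    # The endpoint must contain an explicit port, e.g. http://localhost:9000
    host, port = urlparse(endpoint).netloc.split(":")
    if host != "localhost":
        raise ValueError("If starting a server, endpoint must be localhost")
    return ('java -Xmx{memory}g -cp "{javanlp}/*" '
            'edu.stanford.nlp.pipeline.StanfordCoreNLPServer '
            '-serverProperties StanfordCoreNLP-chinese.properties '
            '-port {port} -timeout {timeout}').format(
                memory=allocate_mem, javanlp=javanlp_home, port=port, timeout=timeout)


print(build_start_cmd("/path/to/corenlp"))
```

Printing the command this way makes it easy to paste it into a terminal and check that the server starts before relying on the client to launch it.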

Example

After modifying the code, reinstall the package by running

sudo python setup.py install

once more.

Sample usage:

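# -*- coding: utf-8 -*-
from __future__ import print_function  # lets print() behave the same on Python 2 and 3

import corenlp

text = u'今天是一个大晴天'

# The client starts the server on entry and shuts it down on exit
with corenlp.CoreNLPClient(annotators='tokenize ssplit pos'.split()) as client:
    ann = client.annotate(text)
    sentence = ann.sentence[0]

    # Print each token together with its part-of-speech tag
    for token in sentence.token:
        print(token.word, token.pos)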

Running it prints:

今天 NT
是 VC
一 CD
个 M
大 JJ
晴天 NN
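The tags in this output come from the Chinese Treebank tag set rather than the English one. The sketch below attaches a short gloss to each tag that appears above. The glosses are my own paraphrases and cover only these six tags, and the helper name is made up for illustration.

```python
# Short glosses for the tags seen in the output above (paraphrased, not exhaustive)
TAG_GLOSS = {
    "NT": "temporal noun",
    "VC": "copula",
    "CD": "cardinal number",
    "M": "measure word",
    "JJ": "noun modifier",
    "NN": "common noun",
}

# The (word, tag) pairs exactly as printed above
tagged = [("今天", "NT"), ("是", "VC"), ("一", "CD"),
          ("个", "M"), ("大", "JJ"), ("晴天", "NN")]


def describe(pairs):
    """Format each (word, tag) pair as 'word (gloss)'."""
    return ["{} ({})".format(word, TAG_GLOSS.get(pos, "unknown tag")) for word, pos in pairs]


for line in describe(tagged):
    print(line)
```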
