Training Word Vectors on English Wikipedia with Gensim
2017-01-18 14:01
Recently I have been running relation classification experiments on SemEval 2010 Task 8, mainly reimplementing the model from the following paper:
A neural network framework for relation extraction: Learning entity semantic and relation pattern. Unfortunately my results came out about two points below those reported in the paper, which was frustrating, and I suspected the word vectors might be the reason. That paper uses word vectors trained on an English Wikipedia corpus, so there was nothing for it but to train my own.
1. Data Preparation
First, download the latest English Wikipedia data (wiki dumps). I downloaded it on 2017-01-17; it is a 12.7 GB bzip2-compressed XML file.

1.1 Converting to plain text
I had previously seen an experiment that processed the dump with process_wiki.py plus Gensim, but Gensim does not split articles into sentences by default, so after processing you will find that each "sentence" is extremely long, which may hurt the quality of the word vectors. I therefore made some changes to the source code. Gensim's WikiCorpus module lives in the following file:
python3.5/dist-packages/gensim/corpora/wikicorpus.py
I copied wikicorpus.py out and mainly modified the process_article function and the WikiCorpus class, using NLTK's sent_tokenize module to split each article into sentences, and changed ARTICLE_MIN_WORDS to 10 (it now effectively acts as the minimum sentence length). The modified process_wiki.py and wikicorpus.py are given in Attachment 1 and Attachment 2 below.
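To make the intent of the change concrete, here is a minimal sketch (not the attachment code itself) of splitting an article into sentences with sent_tokenize before tokenizing; the sample text and the plain .split() tokenizer are placeholders for illustration only, and NLTK's punkt model must be installed (nltk.download('punkt')).

```python
# Illustration only: split an article into sentences, then tokenize each
# sentence, keeping only sentences with at least 10 tokens (ARTICLE_MIN_WORDS).
from nltk.tokenize import sent_tokenize

article = ("Anarchism is a political philosophy that advocates self-governed "
           "societies based on voluntary institutions. Anarchism holds the "
           "state to be undesirable, unnecessary and harmful.")

for sentence in sent_tokenize(article):
    tokens = sentence.split()   # the real code uses gensim's utils.tokenize
    if len(tokens) >= 10:       # mirrors the new ARTICLE_MIN_WORDS threshold
        print(tokens)
```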
Once the above is done, run

python3 process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text

which produces a roughly 13 GB plain-text file wiki.en.text with one sentence per line (it took about 100 minutes on a 24-core machine). As the sample below shows, the output looks much better than the result without sentence splitting, but all punctuation has disappeared. I have not figured out how to keep the punctuation, so suggestions are welcome; one possible direction is sketched after the sample.
Anarchism is a political philosophy that advocates self governed societies based on voluntary institutions
These are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical free associations
Anarchism holds the state to be undesirable unnecessary and harmful
While anti statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of all human relations including but not limited to the state system
Anarchism does not offer a fixed body of doctrine from a single particular world view instead > fluxing and flowing as a philosophy
Many types and traditions of anarchism exist not all of which are mutually exclusive
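On the punctuation question: punctuation disappears because the tokenize() function in wikicorpus.py goes through gensim's utils.tokenize, which keeps only alphabetic tokens. The modified wikicorpus.py already imports word_tokenize from NLTK but does not use it; one possible (and so far untested by me) workaround would be to swap it in, since word_tokenize keeps punctuation marks as separate tokens. A sketch of such a variant:

```python
# Hypothetical variant of tokenize() in the modified wikicorpus.py:
# NLTK's word_tokenize keeps punctuation as separate tokens, unlike
# gensim's utils.tokenize, which drops everything non-alphabetic.
from nltk.tokenize import word_tokenize

def tokenize_keep_punct(content):
    # keep words and punctuation marks, up to 15 characters per token
    return [token.encode('utf8') for token in word_tokenize(content)
            if len(token) <= 15 and not token.startswith('_')]
```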
2. Training the Word Vectors
Once the wiki corpus has been converted to plain text, training can begin with Gensim's Word2Vec module, using the script train_word2vec_model.py below. Run python3 train_word2vec_model.py wiki.en.text model.bin. Training takes a long time; at the time of writing this post it was only 2.05% complete.
```python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys
import multiprocessing
from time import time

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    t0 = time()
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]  # corpus path and path to save model

    model = Word2Vec(sg=0, sentences=LineSentence(inp), size=300, window=5,
                     min_count=5, workers=16, iter=35)

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save_word2vec_format(outp, binary=True)
    print('done in %ds!' % (time() - t0))
```
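Once training finishes, a quick sanity check I find handy (not part of the script above) is to load the binary model back and inspect a few nearest neighbours; the word "king" below is just an arbitrary example, and the loading call uses the same pre-1.0 Gensim API as the rest of this post.

```python
# Load the trained vectors and inspect them (same gensim API version as above).
from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('model.bin', binary=True)
print(model['king'].shape)                 # should be (300,), matching size=300
print(model.most_similar('king', topn=5))  # a few nearest neighbours
```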
3. Saving the Word Vectors in pkl Format
Loading word vectors from a txt file is painfully slow. The bin file produced by Gensim is much faster, but it still has to be loaded through Gensim, which is inconvenient. When I use word vectors in Python I usually keep them in a dictionary whose keys are the words and whose values are the corresponding word vectors (numpy arrays). Dumping that dictionary to a file in binary form makes loading much more convenient and noticeably faster. This uses the pickle module; the script is as follows:

```python
# -*- encoding: utf-8 -*-
import pickle
import numpy as np
from time import time

from gensim.models import Word2Vec

if __name__ == '__main__':
    t0 = time()
    word_weights = {}
    model = Word2Vec.load_word2vec_format('model.bin', binary=True)
    # build a plain dict: word -> word vector (numpy array)
    for word in model.vocab:
        word_weights[word] = model[word]
    with open('model.pkl', 'wb') as file:
        pickle.dump(word_weights, file)
    print('Done in %ds!' % (time() - t0))
```
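Loading the dictionary back is then a one-liner; a minimal usage sketch (the file name model.pkl matches the script above, and "king" is again just an arbitrary lookup):

```python
import pickle

# load the word -> vector dictionary saved above
with open('model.pkl', 'rb') as f:
    word_weights = pickle.load(f)

print(len(word_weights))          # vocabulary size
print(word_weights['king'][:5])   # first 5 dimensions of an example word's vector
```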
4. Summary
The main change made here is to the gensim module: when processing the wiki text, the NLTK toolkit is used to split each article into sentences.

Attachment 1: process_wiki.py
```python
# -*- encoding: utf-8 -*-
import logging
import os.path
import sys

from wikicorpus import WikiCorpus  # note: import the modified local wikicorpus.py

# add by ljx
def decode_text(text):
    words = []
    for w in text:
        words.append(w.decode('utf-8'))
    return words

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(decode_text(text)) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " sentences")
    output.close()
    logger.info("Finished Saved " + str(i) + " sentences")
```
Attachment 2: wikicorpus.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
# Copyright (C) 2012 Lars Buitinck <larsmans@gmail.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

If you have the `pattern` package installed, this module will use a fancy
lemmatization to get a lemma of each token (instead of plain alphabetic
tokenizer). The package is available at https://github.com/clips/pattern .

See scripts/process_wiki.py for a canned (example) script based on this module.
"""

import bz2
import logging
import re
from xml.etree.cElementTree import iterparse  # LXML isn't faster, so let's go with the built-in solution
import multiprocessing

from gensim import utils

# cannot import whole gensim.corpora, because that imports wikicorpus...
from gensim.corpora.dictionary import Dictionary
from gensim.corpora.textcorpus import TextCorpus

from nltk.tokenize import sent_tokenize, word_tokenize

logger = logging.getLogger('gensim.corpora.wikicorpus')

# ignore articles shorter than ARTICLE_MIN_WORDS characters (after full preprocessing)
ARTICLE_MIN_WORDS = 10

RE_P0 = re.compile('<!--.*?-->', re.DOTALL | re.UNICODE)  # comments
RE_P1 = re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)  # footnotes
RE_P2 = re.compile("(\n\[\[[a-z][a-z][\w-]*:[^:\]]+\]\])+$", re.UNICODE)  # links to languages
RE_P3 = re.compile("{{([^}{]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P4 = re.compile("{{([^}]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P5 = re.compile('\[(\w+):\/\/(.*?)(( (.*?))|())\]', re.UNICODE)  # remove URL, keep description
RE_P6 = re.compile("\[([^][]*)\|([^][]*)\]", re.DOTALL | re.UNICODE)  # simplify links, keep description
RE_P7 = re.compile('\n\[\[[iI]mage(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of images
RE_P8 = re.compile('\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of files
RE_P9 = re.compile('<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL | re.UNICODE)  # outside links
RE_P10 = re.compile('<math([> ].*?)(</math>|/>)', re.DOTALL | re.UNICODE)  # math content
RE_P11 = re.compile('<(.*?)>', re.DOTALL | re.UNICODE)  # all other tags
RE_P12 = re.compile('\n(({\|)|(\|-)|(\|}))(.*?)(?=\n)', re.UNICODE)  # table formatting
RE_P13 = re.compile('\n(\||\!)(.*?\|)*([^|]*?)', re.UNICODE)  # table cell formatting
RE_P14 = re.compile('\[\[Category:[^][]*\]\]', re.UNICODE)  # categories
# Remove File and Image template
RE_P15 = re.compile('\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)

# MediaWiki namespaces (https://www.mediawiki.org/wiki/Manual:Namespace) that
# ought to be ignored
IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template',
                      'MediaWiki', 'User', 'Help', 'Book', 'Draft',
                      'WikiProject', 'Special', 'Talk']


def filter_wiki(raw):
    """
    Filter out wiki mark-up from `raw`, leaving only text. `raw` is either unicode
    or utf-8 encoded string.
    """
    # parsing of the wiki markup is not perfect, but sufficient for our purposes
    # contributions to improving this code are welcome :)
    text = utils.to_unicode(raw, 'utf8', errors='ignore')
    text = utils.decode_htmlentities(text)  # '&amp;nbsp;' --> '\xa0'
    return remove_markup(text)


def remove_markup(text):
    text = re.sub(RE_P2, "", text)  # remove the last list (=languages)
    # the wiki markup is recursive (markup inside markup etc)
    # instead of writing a recursive grammar, here we deal with that by removing
    # markup in a loop, starting with inner-most expressions and working outwards,
    # for as long as something changes.
    text = remove_template(text)
    text = remove_file(text)
    iters = 0
    while True:
        old, iters = text, iters + 1
        text = re.sub(RE_P0, "", text)  # remove comments
        text = re.sub(RE_P1, '', text)  # remove footnotes
        text = re.sub(RE_P9, "", text)  # remove outside links
        text = re.sub(RE_P10, "", text)  # remove math content
        text = re.sub(RE_P11, "", text)  # remove all remaining tags
        text = re.sub(RE_P14, '', text)  # remove categories
        text = re.sub(RE_P5, '\\3', text)  # remove urls, keep description
        text = re.sub(RE_P6, '\\2', text)  # simplify links, keep description only
        # remove table markup
        text = text.replace('||', '\n|')  # each table cell on a separate line
        text = re.sub(RE_P12, '\n', text)  # remove formatting lines
        text = re.sub(RE_P13, '\n\\3', text)  # leave only cell content
        # remove empty mark-up
        text = text.replace('[]', '')
        # stop if nothing changed between two iterations or after a fixed number of iterations
        if old == text or iters > 2:
            break

    # the following is needed to make the tokenizer see '[[socialist]]s' as a single word 'socialists'
    # TODO is this really desirable?
    text = text.replace('[', '').replace(']', '')  # promote all remaining markup to plain text
    return text


def remove_template(s):
    """Remove template wikimedia markup.

    Return a copy of `s` with all the wikimedia markup template removed. See
    http://meta.wikimedia.org/wiki/Help:Template for wikimedia templates details.

    Note: Since template can be nested, it is difficult remove them using
    regular expresssions.
    """
    # Find the start and end position of each template by finding the opening
    # '{{' and closing '}}'
    n_open, n_close = 0, 0
    starts, ends = [], []
    in_template = False
    prev_c = None
    for i, c in enumerate(iter(s)):
        if not in_template:
            if c == '{' and c == prev_c:
                starts.append(i - 1)
                in_template = True
                n_open = 1
        if in_template:
            if c == '{':
                n_open += 1
            elif c == '}':
                n_close += 1
            if n_open == n_close:
                ends.append(i)
                in_template = False
                n_open, n_close = 0, 0
        prev_c = c

    # Remove all the templates
    s = ''.join([s[end + 1:start] for start, end in
                 zip(starts + [None], [-1] + ends)])
    return s


def remove_file(s):
    """Remove the 'File:' and 'Image:' markup, keeping the file caption.

    Return a copy of `s` with all the 'File:' and 'Image:' markup replaced by
    their corresponding captions. See http://www.mediawiki.org/wiki/Help:Images
    for the markup details.
    """
    # The regex RE_P15 match a File: or Image: markup
    for match in re.finditer(RE_P15, s):
        m = match.group(0)
        caption = m[:-2].split('|')[-1]
        s = s.replace(m, caption, 1)
    return s


def tokenize(content):  # modified
    """
    Tokenize a piece of text from wikipedia. The input string `content` is
    assumed to be mark-up free (see `filter_wiki()`).

    Return list of tokens as utf8 bytestrings. Ignore words shorted than 2 or
    longer that 15 characters (not bytes!).
    """
    # TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
    return [token.encode('utf8') for token in utils.tokenize(content, lower=False, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]


def get_namespace(tag):
    """Returns the namespace of tag."""
    m = re.match("^{(.*?)}", tag)
    namespace = m.group(1) if m else ""
    if not namespace.startswith("http://www.mediawiki.org/xml/export-"):
        raise ValueError("%s not recognized as MediaWiki dump namespace" % namespace)
    return namespace
_get_namespace = get_namespace


def extract_pages(f, filter_namespaces=False):
    """
    Extract pages from a MediaWiki database dump = open file-like object `f`.

    Return an iterable over (str, str, str) which generates (title, content, pageid) triplets.
    """
    elems = (elem for _, elem in iterparse(f, events=("end",)))

    # We can't rely on the namespace for database dumps, since it's changed
    # it every time a small modification to the format is made. So, determine
    # those from the first element we find, which will be part of the metadata,
    # and construct element paths.
    elem = next(elems)
    namespace = get_namespace(elem.tag)
    ns_mapping = {"ns": namespace}
    page_tag = "{%(ns)s}page" % ns_mapping
    text_path = "./{%(ns)s}revision/{%(ns)s}text" % ns_mapping
    title_path = "./{%(ns)s}title" % ns_mapping
    ns_path = "./{%(ns)s}ns" % ns_mapping
    pageid_path = "./{%(ns)s}id" % ns_mapping

    for elem in elems:
        if elem.tag == page_tag:
            title = elem.find(title_path).text
            text = elem.find(text_path).text

            if filter_namespaces:
                ns = elem.find(ns_path).text
                if ns not in filter_namespaces:
                    text = None

            pageid = elem.find(pageid_path).text
            yield title, text or "", pageid  # empty page will yield None

            # Prune the element tree, as per
            # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            # except that we don't need to prune backlinks from the parent
            # because we don't use LXML.
            # We do this only for <page>s, since we need to inspect the
            # ./revision/text element. The pages comprise the bulk of the
            # file, so in practice we prune away enough.
            elem.clear()
_extract_pages = extract_pages  # for backward compatibility


def process_article(args):  # modified
    """
    Parse a wikipedia article, returning its content as a list of tokenized
    sentences (each a list of utf8-encoded strings).
    """
    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    sentences = []
    sentences_str = sent_tokenize(text)
    for sentence_str in sentences_str:
        sentences.append(tokenize(sentence_str))
    return sentences, title, pageid


class WikiCorpus(TextCorpus):  # modified
    """
    Treat a wikipedia articles dump (\*articles.xml.bz2) as a (read-only) corpus.

    The documents are extracted on-the-fly, so that the whole (massive) dump
    can stay compressed on disk.

    >>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')  # create word->word_id mapping, takes almost 8h
    >>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki)  # another 8h, creates a file in MatrixMarket format plus file with id->word

    """
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(),
                 dictionary=None, filter_namespaces=('0',)):
        """
        Initialize the corpus. Unless a dictionary is provided, this scans the
        corpus once, to determine its vocabulary.

        If `pattern` package is installed, use fancier shallow parsing to get
        token lemmas. Otherwise, use simple regexp tokenization. You can override
        this automatic logic by forcing the `lemmatize` parameter explicitly.
        """
        self.fname = fname
        self.filter_namespaces = filter_namespaces
        self.metadata = False
        if processes is None:
            processes = max(1, multiprocessing.cpu_count() - 1)
        self.processes = processes
        self.lemmatize = lemmatize
        if dictionary is None:
            self.dictionary = Dictionary(self.get_texts())
        else:
            self.dictionary = dictionary

    def get_texts(self):
        """
        Iterate over the dump, returning text version of each article as a list
        of tokens.

        Only articles of sufficient length are returned (short articles & redirects
        etc are ignored).

        Note that this iterates over the **texts**; if you want vectors, just use
        the standard corpus interface instead of this function::

        >>> for vec in wiki_corpus:
        >>>     print(vec)
        """
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
        texts = ((text, self.lemmatize, title, pageid)
                 for title, text, pageid
                 in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        # process the corpus in smaller chunks of docs, because multiprocessing.Pool
        # is dumb and would load the entire input into RAM at once...
        for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for sentences, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                articles_all += 1
                positions_all += len(sentences)
                # article redirects and short stubs are pruned here
                if any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                for sentence in sentences:
                    if len(sentence) < ARTICLE_MIN_WORDS:
                        continue
                    articles += 1
                    positions += len(sentence)
                    yield sentence
        pool.terminate()

        logger.info(
            "finished iterating over Wikipedia corpus of %i documents with %i positions"
            " (total %i articles, %i positions before pruning articles shorter than %i words)",
            articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
        self.length = articles  # cache corpus length
# endclass WikiCorpus
```