Training Word Vectors on English Wikipedia with Gensim
2017-01-18 14:01
Recently I have been running relation classification experiments on SemEval 2010 Task 8, mainly reimplementing the model from the following paper:
A neural network framework for relation extraction: Learning entity semantic and relation pattern. Unfortunately my results came out about two points below those reported in the paper, which was frustrating, and I suspected the word vectors might be the reason. That paper uses word vectors trained on an English Wikipedia corpus, so there was nothing for it but to train my own.
1. Data Preparation
First, download the latest English Wikipedia data (wiki dumps). I downloaded it on 2017-01-17; it is a 12.7 GB bzip2-compressed XML file.

1.1 Converting to plain text
I had previously seen an experiment that processed the dump with process_wiki.py plus Gensim, but Gensim does not split articles into sentences by default, so after processing you will find that each "sentence" is extremely long, which may hurt the quality of the word vectors. I therefore made some changes to the source code. Gensim's WikiCorpus module lives in the following file:
python3.5/dist-packages/gensim/corpora/wikicorpus.py
I copied wikicorpus.py out and mainly modified the process_article function and the WikiCorpus class, using NLTK's sent_tokenize module to split each article into sentences, and changed ARTICLE_MIN_WORDS to 10 (it now effectively acts as the minimum sentence length). The modified process_wiki.py and wikicorpus.py are given in Attachment 1 and Attachment 2 below.
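To make the intent of the change concrete, here is a minimal sketch (not the attachment code itself) of splitting an article into sentences with sent_tokenize before tokenizing; the sample text and the plain .split() tokenizer are placeholders for illustration only, and NLTK's punkt model must be installed (nltk.download('punkt')).

```python
# Illustration only: split an article into sentences, then tokenize each
# sentence, keeping only sentences with at least 10 tokens (ARTICLE_MIN_WORDS).
from nltk.tokenize import sent_tokenize

article = ("Anarchism is a political philosophy that advocates self-governed "
           "societies based on voluntary institutions. Anarchism holds the "
           "state to be undesirable, unnecessary and harmful.")

for sentence in sent_tokenize(article):
    tokens = sentence.split()   # the real code uses gensim's utils.tokenize
    if len(tokens) >= 10:       # mirrors the new ARTICLE_MIN_WORDS threshold
        print(tokens)
```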
Once the above is done, run

python3 process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text

which produces a roughly 13 GB plain-text file wiki.en.text with one sentence per line (it took about 100 minutes on a 24-core machine). As the sample below shows, the output looks much better than the result without sentence splitting, but all punctuation has disappeared. I have not figured out how to keep the punctuation, so suggestions are welcome; one possible direction is sketched after the sample.
Anarchism is a political philosophy that advocates self governed societies based on voluntary institutions
These are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical free associations
Anarchism holds the state to be undesirable unnecessary and harmful
While anti statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of all human relations including but not limited to the state system
Anarchism does not offer a fixed body of doctrine from a single particular world view instead > fluxing and flowing as a philosophy
Many types and traditions of anarchism exist not all of which are mutually exclusive
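On the punctuation question: punctuation disappears because the tokenize() function in wikicorpus.py goes through gensim's utils.tokenize, which keeps only alphabetic tokens. The modified wikicorpus.py already imports word_tokenize from NLTK but does not use it; one possible (and so far untested by me) workaround would be to swap it in, since word_tokenize keeps punctuation marks as separate tokens. A sketch of such a variant:

```python
# Hypothetical variant of tokenize() in the modified wikicorpus.py:
# NLTK's word_tokenize keeps punctuation as separate tokens, unlike
# gensim's utils.tokenize, which drops everything non-alphabetic.
from nltk.tokenize import word_tokenize

def tokenize_keep_punct(content):
    # keep words and punctuation marks, up to 15 characters per token
    return [token.encode('utf8') for token in word_tokenize(content)
            if len(token) <= 15 and not token.startswith('_')]
```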
2. Training the Word Vectors
Once the wiki corpus has been converted to plain text, training can begin with Gensim's Word2Vec module, using the script train_word2vec_model.py below. Run python3 train_word2vec_model.py wiki.en.text model.bin. Training takes a long time; at the time of writing this post it was only 2.05% complete.
```python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys
import multiprocessing
from time import time

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    t0 = time()
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]  # corpus path and path to save model

    model = Word2Vec(sg=0, sentences=LineSentence(inp), size=300, window=5,
                     min_count=5, workers=16, iter=35)

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save_word2vec_format(outp, binary=True)
    print('done in %ds!' % (time() - t0))
```
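Once training finishes, a quick sanity check I find handy (not part of the script above) is to load the binary model back and inspect a few nearest neighbours; the word "king" below is just an arbitrary example, and the loading call uses the same pre-1.0 Gensim API as the rest of this post.

```python
# Load the trained vectors and inspect them (same gensim API version as above).
from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('model.bin', binary=True)
print(model['king'].shape)                 # should be (300,), matching size=300
print(model.most_similar('king', topn=5))  # a few nearest neighbours
```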
3. Saving the Word Vectors in pkl Format
Loading word vectors from a txt file is painfully slow. The bin file produced by Gensim is much faster, but it still has to be loaded through Gensim, which is inconvenient. When I use word vectors in Python I usually keep them in a dictionary whose keys are the words and whose values are the corresponding word vectors (numpy arrays). Dumping that dictionary to a file in binary form makes loading much more convenient and noticeably faster. This uses the pickle module; the script is as follows:

```python
# -*- encoding: utf-8 -*-
import pickle
import numpy as np
from time import time

from gensim.models import Word2Vec

if __name__ == '__main__':
    t0 = time()
    word_weights = {}
    model = Word2Vec.load_word2vec_format('model.bin', binary=True)
    # build a plain dict: word -> word vector (numpy array)
    for word in model.vocab:
        word_weights[word] = model[word]
    with open('model.pkl', 'wb') as file:
        pickle.dump(word_weights, file)
    print('Done in %ds!' % (time() - t0))
```
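Loading the dictionary back is then a one-liner; a minimal usage sketch (the file name model.pkl matches the script above, and "king" is again just an arbitrary lookup):

```python
import pickle

# load the word -> vector dictionary saved above
with open('model.pkl', 'rb') as f:
    word_weights = pickle.load(f)

print(len(word_weights))          # vocabulary size
print(word_weights['king'][:5])   # first 5 dimensions of an example word's vector
```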
4. Summary
The main change made here is to the gensim module: when processing the wiki text, the NLTK toolkit is used to split each article into sentences.

Attachment 1: process_wiki.py
```python
# -*- encoding: utf-8 -*-
import logging
import os.path
import sys

from wikicorpus import WikiCorpus  # note: import the modified local wikicorpus.py

# add by ljx
def decode_text(text):
    words = []
    for w in text:
        words.append(w.decode('utf-8'))
    return words

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(decode_text(text)) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " sentences")
    output.close()
    logger.info("Finished Saved " + str(i) + " sentences")
```
Attachment 2: wikicorpus.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
# Copyright (C) 2012 Lars Buitinck <larsmans@gmail.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

If you have the `pattern` package installed, this module will use a fancy
lemmatization to get a lemma of each token (instead of plain alphabetic
tokenizer). The package is available at https://github.com/clips/pattern .

See scripts/process_wiki.py for a canned (example) script based on this module.
"""

import bz2
import logging
import re
from xml.etree.cElementTree import iterparse  # LXML isn't faster, so let's go with the built-in solution
import multiprocessing

from gensim import utils

# cannot import whole gensim.corpora, because that imports wikicorpus...
from gensim.corpora.dictionary import Dictionary
from gensim.corpora.textcorpus import TextCorpus

from nltk.tokenize import sent_tokenize, word_tokenize

logger = logging.getLogger('gensim.corpora.wikicorpus')

# ignore articles shorter than ARTICLE_MIN_WORDS characters (after full preprocessing)
ARTICLE_MIN_WORDS = 10

RE_P0 = re.compile('<!--.*?-->', re.DOTALL | re.UNICODE)  # comments
RE_P1 = re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)  # footnotes
RE_P2 = re.compile("(\n\[\[[a-z][a-z][\w-]*:[^:\]]+\]\])+$", re.UNICODE)  # links to languages
RE_P3 = re.compile("{{([^}{]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P4 = re.compile("{{([^}]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P5 = re.compile('\[(\w+):\/\/(.*?)(( (.*?))|())\]', re.UNICODE)  # remove URL, keep description
RE_P6 = re.compile("\[([^][]*)\|([^][]*)\]", re.DOTALL | re.UNICODE)  # simplify links, keep description
RE_P7 = re.compile('\n\[\[[iI]mage(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of images
RE_P8 = re.compile('\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of files
RE_P9 = re.compile('<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL | re.UNICODE)  # outside links
RE_P10 = re.compile('<math([> ].*?)(</math>|/>)', re.DOTALL | re.UNICODE)  # math content
RE_P11 = re.compile('<(.*?)>', re.DOTALL | re.UNICODE)  # all other tags
RE_P12 = re.compile('\n(({\|)|(\|-)|(\|}))(.*?)(?=\n)', re.UNICODE)  # table formatting
RE_P13 = re.compile('\n(\||\!)(.*?\|)*([^|]*?)', re.UNICODE)  # table cell formatting
RE_P14 = re.compile('\[\[Category:[^][]*\]\]', re.UNICODE)  # categories
# Remove File and Image template
RE_P15 = re.compile('\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)

# MediaWiki namespaces (https://www.mediawiki.org/wiki/Manual:Namespace) that
# ought to be ignored
IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template',
                      'MediaWiki', 'User', 'Help', 'Book', 'Draft',
                      'WikiProject', 'Special', 'Talk']


def filter_wiki(raw):
    """
    Filter out wiki mark-up from `raw`, leaving only text. `raw` is either unicode
    or utf-8 encoded string.
    """
    # parsing of the wiki markup is not perfect, but sufficient for our purposes
    # contributions to improving this code are welcome :)
    text = utils.to_unicode(raw, 'utf8', errors='ignore')
    text = utils.decode_htmlentities(text)  # '&amp;nbsp;' --> '\xa0'
    return remove_markup(text)


def remove_markup(text):
    text = re.sub(RE_P2, "", text)  # remove the last list (=languages)
    # the wiki markup is recursive (markup inside markup etc)
    # instead of writing a recursive grammar, here we deal with that by removing
    # markup in a loop, starting with inner-most expressions and working outwards,
    # for as long as something changes.
    text = remove_template(text)
    text = remove_file(text)
    iters = 0
    while True:
        old, iters = text, iters + 1
        text = re.sub(RE_P0, "", text)  # remove comments
        text = re.sub(RE_P1, '', text)  # remove footnotes
        text = re.sub(RE_P9, "", text)  # remove outside links
        text = re.sub(RE_P10, "", text)  # remove math content
        text = re.sub(RE_P11, "", text)  # remove all remaining tags
        text = re.sub(RE_P14, '', text)  # remove categories
        text = re.sub(RE_P5, '\\3', text)  # remove urls, keep description
        text = re.sub(RE_P6, '\\2', text)  # simplify links, keep description only
        # remove table markup
        text = text.replace('||', '\n|')  # each table cell on a separate line
        text = re.sub(RE_P12, '\n', text)  # remove formatting lines
        text = re.sub(RE_P13, '\n\\3', text)  # leave only cell content
        # remove empty mark-up
        text = text.replace('[]', '')
        # stop if nothing changed between two iterations or after a fixed number of iterations
        if old == text or iters > 2:
            break

    # the following is needed to make the tokenizer see '[[socialist]]s' as a single word 'socialists'
    # TODO is this really desirable?
    text = text.replace('[', '').replace(']', '')  # promote all remaining markup to plain text
    return text


def remove_template(s):
    """Remove template wikimedia markup.

    Return a copy of `s` with all the wikimedia markup template removed. See
    http://meta.wikimedia.org/wiki/Help:Template for wikimedia templates details.

    Note: Since template can be nested, it is difficult remove them using
    regular expresssions.
    """
    # Find the start and end position of each template by finding the opening
    # '{{' and closing '}}'
    n_open, n_close = 0, 0
    starts, ends = [], []
    in_template = False
    prev_c = None
    for i, c in enumerate(iter(s)):
        if not in_template:
            if c == '{' and c == prev_c:
                starts.append(i - 1)
                in_template = True
                n_open = 1
        if in_template:
            if c == '{':
                n_open += 1
            elif c == '}':
                n_close += 1
            if n_open == n_close:
                ends.append(i)
                in_template = False
                n_open, n_close = 0, 0
        prev_c = c

    # Remove all the templates
    s = ''.join([s[end + 1:start] for start, end in
                 zip(starts + [None], [-1] + ends)])
    return s


def remove_file(s):
    """Remove the 'File:' and 'Image:' markup, keeping the file caption.

    Return a copy of `s` with all the 'File:' and 'Image:' markup replaced by
    their corresponding captions. See http://www.mediawiki.org/wiki/Help:Images
    for the markup details.
    """
    # The regex RE_P15 match a File: or Image: markup
    for match in re.finditer(RE_P15, s):
        m = match.group(0)
        caption = m[:-2].split('|')[-1]
        s = s.replace(m, caption, 1)
    return s


def tokenize(content):  # modified
    """
    Tokenize a piece of text from wikipedia. The input string `content` is
    assumed to be mark-up free (see `filter_wiki()`).

    Return list of tokens as utf8 bytestrings. Ignore words shorted than 2 or
    longer that 15 characters (not bytes!).
    """
    # TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
    return [token.encode('utf8') for token in utils.tokenize(content, lower=False, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]


def get_namespace(tag):
    """Returns the namespace of tag."""
    m = re.match("^{(.*?)}", tag)
    namespace = m.group(1) if m else ""
    if not namespace.startswith("http://www.mediawiki.org/xml/export-"):
        raise ValueError("%s not recognized as MediaWiki dump namespace" % namespace)
    return namespace
_get_namespace = get_namespace


def extract_pages(f, filter_namespaces=False):
    """
    Extract pages from a MediaWiki database dump = open file-like object `f`.

    Return an iterable over (str, str, str) which generates (title, content, pageid) triplets.
    """
    elems = (elem for _, elem in iterparse(f, events=("end",)))

    # We can't rely on the namespace for database dumps, since it's changed
    # it every time a small modification to the format is made. So, determine
    # those from the first element we find, which will be part of the metadata,
    # and construct element paths.
    elem = next(elems)
    namespace = get_namespace(elem.tag)
    ns_mapping = {"ns": namespace}
    page_tag = "{%(ns)s}page" % ns_mapping
    text_path = "./{%(ns)s}revision/{%(ns)s}text" % ns_mapping
    title_path = "./{%(ns)s}title" % ns_mapping
    ns_path = "./{%(ns)s}ns" % ns_mapping
    pageid_path = "./{%(ns)s}id" % ns_mapping

    for elem in elems:
        if elem.tag == page_tag:
            title = elem.find(title_path).text
            text = elem.find(text_path).text

            if filter_namespaces:
                ns = elem.find(ns_path).text
                if ns not in filter_namespaces:
                    text = None

            pageid = elem.find(pageid_path).text
            yield title, text or "", pageid  # empty page will yield None

            # Prune the element tree, as per
            # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            # except that we don't need to prune backlinks from the parent
            # because we don't use LXML.
            # We do this only for <page>s, since we need to inspect the
            # ./revision/text element. The pages comprise the bulk of the
            # file, so in practice we prune away enough.
            elem.clear()
_extract_pages = extract_pages  # for backward compatibility


def process_article(args):  # modified
    """
    Parse a wikipedia article, returning its content as a list of tokenized
    sentences (each a list of utf8-encoded strings).
    """
    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    sentences = []
    sentences_str = sent_tokenize(text)
    for sentence_str in sentences_str:
        sentences.append(tokenize(sentence_str))
    return sentences, title, pageid


class WikiCorpus(TextCorpus):  # modified
    """
    Treat a wikipedia articles dump (\*articles.xml.bz2) as a (read-only) corpus.

    The documents are extracted on-the-fly, so that the whole (massive) dump
    can stay compressed on disk.

    >>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2')  # create word->word_id mapping, takes almost 8h
    >>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki)  # another 8h, creates a file in MatrixMarket format plus file with id->word

    """
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(),
                 dictionary=None, filter_namespaces=('0',)):
        """
        Initialize the corpus. Unless a dictionary is provided, this scans the
        corpus once, to determine its vocabulary.

        If `pattern` package is installed, use fancier shallow parsing to get
        token lemmas. Otherwise, use simple regexp tokenization. You can override
        this automatic logic by forcing the `lemmatize` parameter explicitly.
        """
        self.fname = fname
        self.filter_namespaces = filter_namespaces
        self.metadata = False
        if processes is None:
            processes = max(1, multiprocessing.cpu_count() - 1)
        self.processes = processes
        self.lemmatize = lemmatize
        if dictionary is None:
            self.dictionary = Dictionary(self.get_texts())
        else:
            self.dictionary = dictionary

    def get_texts(self):
        """
        Iterate over the dump, returning text version of each article as a list
        of tokens.

        Only articles of sufficient length are returned (short articles & redirects
        etc are ignored).

        Note that this iterates over the **texts**; if you want vectors, just use
        the standard corpus interface instead of this function::

        >>> for vec in wiki_corpus:
        >>>     print(vec)
        """
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
        texts = ((text, self.lemmatize, title, pageid)
                 for title, text, pageid
                 in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        # process the corpus in smaller chunks of docs, because multiprocessing.Pool
        # is dumb and would load the entire input into RAM at once...
        for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for sentences, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                articles_all += 1
                positions_all += len(sentences)
                # article redirects and short stubs are pruned here
                if any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                for sentence in sentences:
                    if len(sentence) < ARTICLE_MIN_WORDS:
                        continue
                    articles += 1
                    positions += len(sentence)
                    yield sentence
        pool.terminate()

        logger.info(
            "finished iterating over Wikipedia corpus of %i documents with %i positions"
            " (total %i articles, %i positions before pruning articles shorter than %i words)",
            articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
        self.length = articles  # cache corpus length
# endclass WikiCorpus
```