
Using word2vec (unfinished)

2017-09-14 20:15, 337 views

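Recently I wanted to use TextCNN for text classification and assumed it would be easy, so I cloned two projects from GitHub. My plan was to use gensim.word2vec, which I expected to give good results right away. It turned out that both projects used Google word2vec. gensim.word2vec is based on Google word2vec, but as far as I could tell it offers less convenient access to the results: I found no direct way to get the dimensions of the word vectors, the word-vector matrix, or the index of each word. So in the end I had to use Google word2vec.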

First, a note on how the two tools differ, since these differences are what made the work unexpectedly difficult.

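word2vec: provides convenient access to the word vectors, including the shape (the number of words and the vector size), vectors (the vector matrix), vocab (all the words), and vocab_hash (a dictionary mapping each word to its index).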
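To make these four attributes concrete, here is a minimal pure-Python sketch of the same layout. It does not call the word2vec package; the toy words and vectors and the helper name `build_lookup` are made up for illustration.

```python
def build_lookup(word_to_vec):
    """Lay out a word -> vector mapping as vocab, vectors, vocab_hash and shape."""
    vocab = list(word_to_vec)                         # all words, in row order
    vectors = [list(word_to_vec[w]) for w in vocab]   # one row per word
    vocab_hash = {w: i for i, w in enumerate(vocab)}  # word -> row index
    shape = (len(vectors), len(vectors[0]) if vectors else 0)
    return vocab, vectors, vocab_hash, shape


# Toy data, made up for illustration.
toy = {"cat": [0.1, 0.2, 0.3], "dog": [0.0, 0.5, 0.1]}
vocab, vectors, vocab_hash, shape = build_lookup(toy)

print(shape)                       # (number of words, vector size)
print(vectors[vocab_hash["dog"]])  # the row that belongs to "dog"
```

The pair of a word-to-index dictionary and a matrix with one row per word is what an embedding layer such as the one in TextCNN typically consumes, which is why these accessors matter here.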

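gensim.word2vec: does not expose the same attributes, but it is more robust when actually training word vectors. When training with word2vec, a single non-UTF-8 character anywhere in the corpus causes an error; gensim.word2vec does not have this problem.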
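One possible workaround, sketched here under my own assumptions, is to strip undecodable bytes from the corpus before handing it to word2vec. The helper name `clean_to_utf8` and the file names are hypothetical, and `errors="ignore"` silently drops the bad bytes, which may or may not be acceptable for your data.

```python
import os
import tempfile


def clean_to_utf8(src_path, dst_path):
    """Copy src to dst, dropping any bytes that are not valid UTF-8."""
    with open(src_path, "rb") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for raw_line in src:
            dst.write(raw_line.decode("utf-8", errors="ignore"))


# Small demonstration on a temporary file containing one invalid byte (0xff).
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "corpus_raw.txt")
clean_path = os.path.join(tmp_dir, "corpus_clean.txt")

with open(raw_path, "wb") as f:
    f.write("你好 世界\n".encode("utf-8") + b"bad \xff byte\n")

clean_to_utf8(raw_path, clean_path)

with open(clean_path, encoding="utf-8") as f:
    cleaned = f.read()
print(repr(cleaned))
```

The cleaned file can then be used as the training input in place of the original.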

GitHub repositories: word2vec and gensim.word2vec

Installing word2vec

>> pip install word2vec

1. Functions in word2vec

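PS: use this when training Chinese word vectors (Python 2 only)

import sys
reload(sys)
# Set the default encoding to UTF-8; otherwise errors may occur.
# reload() and sys.setdefaultencoding() are only available in Python 2.
sys.setdefaultencoding('utf-8')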

Methods (more to be added)

word2vec.word2phrase('input_file_path', 'output_file_path', verbose=True)


2. Functions in gensim.word2vec

Training word vectors

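from gensim.models import word2vec

class IterSentence(object):
    """Iterate over a corpus file, yielding one tokenized sentence per line."""
    def __init__(self, file_path):
        self.file_path = file_path

    def __iter__(self):
        with open(self.file_path, 'r') as f:
            for line in f:
                yield line.split()

# Placeholder path: a corpus with one whitespace-tokenized sentence per line.
file_path = 'corpus.txt'
# sentences is consumed as an iterable; each sentence is a list of words.
sentences = IterSentence(file_path)
# size: dimensionality of the word vectors (renamed vector_size in gensim 4.0+)
# window: maximum distance between the current word and a context word (loosely comparable to n-gram)
# min_count: minimum word frequency; words seen fewer than 5 times get no vector
# workers: number of worker threads used for training
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)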
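gensim reads the corpus more than once (one pass to build the vocabulary, then the training passes), which is why the class above implements `__iter__` instead of being a one-shot generator. The sketch below shows that behaviour without gensim and also imitates the effect of `min_count` with a plain `Counter`. The sample sentences and the threshold of 2 are made up.

```python
import os
import tempfile
from collections import Counter


class IterSentence(object):
    """Restartable iterable: every call to __iter__ reopens the file."""

    def __init__(self, file_path):
        self.file_path = file_path

    def __iter__(self):
        with open(self.file_path, "r", encoding="utf-8") as f:
            for line in f:
                yield line.split()


# Made-up corpus: one whitespace-tokenized sentence per line.
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    f.write("the cat sat\nthe dog sat\nthe bird flew\n")

sentences = IterSentence(corpus_path)

# Two full passes over the same object both see all three sentences.
first_pass = list(sentences)
second_pass = list(sentences)

# Rough imitation of min_count: keep only words seen at least min_count times.
min_count = 2
counts = Counter(word for sentence in sentences for word in sentence)
kept = sorted(w for w, c in counts.items() if c >= min_count)
print(len(first_pass), len(second_pass), kept)
```

A plain generator function would be exhausted after the first pass, so the later passes would see an empty corpus.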

Saving and loading the word-vector model

model.save(fname)
model = word2vec.Word2Vec.load(fname)  # you can continue training with the loaded model!

Looking up the vector of a word

>> model.wv['computer']  # numpy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

Loading word vectors produced by the C version

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format

word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format
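For reference, the C text format is simple enough to read by hand: a header line with the vocabulary size and the vector size, then one line per word holding the word followed by its components. The reader below is a plain-Python sketch for inspecting small files, not a replacement for `load_word2vec_format`; the function name `read_text_vectors` and the sample data are made up.

```python
import io


def read_text_vectors(handle):
    """Parse word2vec C text format from an open text handle."""
    vocab_size, vector_size = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != vector_size + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header says %d words, found %d" % (vocab_size, len(vectors)))
    return vectors


# Made-up sample in the same layout as a C text-format vector file.
sample = io.StringIO("2 3\ncomputer 0.1 0.2 0.3\nmouse -0.4 0.5 0.6\n")
vectors = read_text_vectors(sample)
print(vectors["mouse"])
```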

Saving word vectors in word2vec format

model.wv.save_word2vec_format(wordvec_file,
                              fvocab=vocabulary_file,
                              binary=True)
# wordvec_file: output file for the word vectors
# vocabulary_file: output file for the word counts

Incremental training of word vectors

train(sentences, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=None)

'''
Update the model’s neural weights from a sequence of sentences (can be a once-only generator stream). For Word2Vec, each sentence must be a list of unicode strings. (Subclasses may accept other examples.)

To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of sentences) or total_words (count of raw words in sentences) MUST be provided. (If the corpus is the same as was provided to build_vocab(), the count of examples in that corpus will be available in the model’s corpus_count property.)

To avoid common mistakes around the model’s ability to do multiple training passes itself, an explicit epochs argument MUST be provided. In the common and recommended case, where train() is only called once, the model’s cached iter value should be supplied as epochs value.
'''
update_weights()
'''
Copy all the existing weights, and reset the weights for the newly added vocabulary.
'''
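The linear decay mentioned in the train() docstring can be pictured with a few lines of arithmetic. This is a simplified sketch of the idea, not gensim's actual implementation; the function name `linear_alpha` and the values 0.025 and 0.0001 are illustrative choices.

```python
def linear_alpha(start_alpha, end_alpha, progress):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return start_alpha - (start_alpha - end_alpha) * progress


total_examples = 1000  # sentence count, needed to know how far along training is
for seen in (0, 500, 1000):
    alpha = linear_alpha(0.025, 0.0001, seen / total_examples)
    print(seen, round(alpha, 5))
```

Without total_examples or total_words there is no denominator for `progress`, which is why train() insists on one of them.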

3. Functions in gensim.doc2vec

Training document vectors

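from gensim.models import doc2vec

# file_name: corpus with one document per line, tokens separated by whitespace
sentences = doc2vec.TaggedLineDocument(file_name)
# Pass the hyperparameters by keyword; positionally they would be bound to other parameters.
model = doc2vec.Doc2Vec(sentences, min_count=min_count, size=size, window=window, workers=workers)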
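`TaggedLineDocument` expects one document per line with tokens separated by whitespace, and it tags each document with its line number. The generator below imitates that behaviour in plain Python so the input format is easy to see; `TaggedDoc` and `tagged_lines` are stand-in names, not gensim classes.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (a list of words plus a list of tags).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tagged_lines(lines):
    """Yield one TaggedDoc per line, tagged with the zero-based line number."""
    for line_no, line in enumerate(lines):
        yield TaggedDoc(words=line.split(), tags=[line_no])


# Made-up two-document corpus.
docs = list(tagged_lines(["first document here", "second one"]))
print(docs[1].words, docs[1].tags)
```

Because the tag is the line number, the trained vector of the n-th line can later be looked up by that integer index.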

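Saving and loading the document-vector model

model.save(model_name)
model = doc2vec.Doc2Vec.load(model_name)
# model_name: file name under which the model is saved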

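Saving the document vectors

# Excerpt from a class method: self.__d_model is the trained Doc2Vec model, self.__d_size the vector size.
with open(doc_name, 'w') as out:
    # Header line: number of documents and vector size.
    out.write(str(self.__d_model.docvecs.count) + ' ' + str(self.__d_size) + '\n')
    for i in range(0, self.__d_model.docvecs.count):
        docvec = self.__d_model.docvecs[i]
        # One document per line, components separated by spaces.
        out.write(' '.join(str(v) for v in docvec.tolist()) + '\n')
# doc_name: name of the file the document vectors are written to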
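A file written this way can be read back with a few lines of plain Python. The sketch assumes exactly the layout produced above (a header with the document count and vector size, then one space-separated vector per line); `read_doc_vectors` and the sample values are hypothetical.

```python
import io


def read_doc_vectors(handle):
    """Read the header and the per-document vectors from an open text handle."""
    count, size = (int(x) for x in handle.readline().split())
    vectors = [[float(x) for x in line.split()] for line in handle if line.strip()]
    if len(vectors) != count or any(len(v) != size for v in vectors):
        raise ValueError("file does not match its header")
    return vectors


# Made-up file content: 2 documents, 3 dimensions each.
sample = io.StringIO("2 3\n0.1 0.2 0.3 \n-0.4 0.5 0.6 \n")
doc_vectors = read_doc_vectors(sample)
print(len(doc_vectors), doc_vectors[1])
```

Splitting on any whitespace means a trailing space at the end of a line does not matter.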

Incremental training

train(sentences, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=None)

update_weights()

Looking up a document's vector (by index)

>> model.docvecs[0]  # numpy vector of a document
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)


Tags: nlp, word2vec
