
Computing word importance with Spark's TF-IDF algorithm

2016-10-18 14:16



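This post is a brief look at how to use Spark's TF-IDF implementation.
To compute the importance of each word, the sentences first have to be split into words (tokenized), and the words then have to be converted into numerical features. The example below does this with Tokenizer, HashingTF and IDF from pyspark.ml.feature.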

In [1]:

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# sqlContext is the SQLContext predefined by the pyspark shell / notebook
sentenceData = sqlContext.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])

# Step 1: lowercase each sentence and split it into words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.show(5, False)

# Step 2: hash each word into one of 20 buckets and count occurrences (term frequency)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
# alternatively, CountVectorizer can also be used to get term frequency vectors
featurizedData.select('rawFeatures', 'label').show(5, False)

# Step 3: fit IDF on the term frequencies, then rescale them into TF-IDF features
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "label").show(5, False)


+-----+-----------------------------------+------------------------------------------+
|label|sentence                           |words                                     |
+-----+-----------------------------------+------------------------------------------+
|0    |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
|0    |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|1    |Logistic regression models are neat|[logistic, regression, models, are, neat] |
+-----+-----------------------------------+------------------------------------------+

+-----------------------------------------+-----+
|rawFeatures                              |label|
+-----------------------------------------+-----+
|(20,[5,6,9],[2.0,1.0,2.0])               |0    |
|(20,[3,5,12,14,18],[2.0,2.0,1.0,1.0,1.0])|0    |
|(20,[5,12,14,18],[1.0,2.0,1.0,1.0])      |1    |
+-----------------------------------------+-----+
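
In the rawFeatures output above, the first sentence has five distinct words but only three indices (5, 6 and 9), so with numFeatures=20 some words collided in the same bucket. The following is a minimal pure-Python sketch of the hashing idea, not Spark's implementation: the helper hashing_tf is hypothetical, and zlib.crc32 stands in for the MurmurHash3 function Spark uses, so the bucket indices it produces will not match the table above.

```python
import zlib


def hashing_tf(words, num_features=20):
    """Map a list of tokens to {bucket_index: count} with the hashing trick.

    Hypothetical helper for illustration only. zlib.crc32 is a stand-in
    hash (an assumption), so indices differ from Spark's HashingTF.
    """
    counts = {}
    for word in words:
        # Hash the word, then fold the hash into the range [0, num_features)
        index = zlib.crc32(word.encode("utf-8")) % num_features
        # Words that land in the same bucket simply add up (a collision)
        counts[index] = counts.get(index, 0.0) + 1.0
    return dict(sorted(counts.items()))


sentence = "Hi I heard about Spark"
# Lowercasing and splitting on whitespace mirrors what Tokenizer does here
words = sentence.lower().split()
tf = hashing_tf(words)

print(words)
print(tf)
```

A larger num_features makes collisions less likely, at the cost of longer (though still sparse) vectors.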

+--------------------------------------------------------------------------------------------------------+-----+
|features                                                                                                |label|
+--------------------------------------------------------------------------------------------------------+-----+
|(20,[5,6,9],[0.0,0.6931471805599453,1.3862943611198906])                                                |0    |
|(20,[3,5,12,14,18],[1.3862943611198906,0.0,0.28768207245178085,0.28768207245178085,0.28768207245178085])|0    |
|(20,[5,12,14,18],[0.0,0.5753641449035617,0.28768207245178085,0.28768207245178085])                      |1    |
+--------------------------------------------------------------------------------------------------------+-----+
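
The values in the features column can be checked by hand. Spark's IDF uses idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents that contain term t, and each feature is the term frequency multiplied by that weight. The sketch below recomputes the column in plain Python from the rawFeatures values printed above; the variable names are illustrative.

```python
import math

# Term-frequency vectors copied from the rawFeatures output above,
# written as {bucket_index: count}
raw_features = [
    {5: 2.0, 6: 1.0, 9: 2.0},
    {3: 2.0, 5: 2.0, 12: 1.0, 14: 1.0, 18: 1.0},
    {5: 1.0, 12: 2.0, 14: 1.0, 18: 1.0},
]
num_docs = len(raw_features)

# Document frequency: in how many documents each bucket is non-zero
doc_freq = {}
for vector in raw_features:
    for index in vector:
        doc_freq[index] = doc_freq.get(index, 0) + 1

# Smoothed inverse document frequency: log((m + 1) / (df + 1))
idf = {i: math.log((num_docs + 1) / (df + 1)) for i, df in doc_freq.items()}

# TF-IDF: scale every term frequency by the IDF of its bucket
tfidf = [{i: tf * idf[i] for i, tf in vector.items()} for vector in raw_features]

for vector in tfidf:
    print(vector)
```

Bucket 5 occurs in all three sentences, so its weight is log(4/4) = 0 and it contributes nothing, while bucket 9 occurs in only one sentence and gets 2 * log(4/2), about 1.386.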