Computing Word Importance with Spark's TF-IDF Algorithm
2016-10-18 14:16
This post takes a quick look at how to use Spark's TF-IDF implementation. To compute the importance of each word, we first need to split each sentence into words and then convert the words into numeric feature vectors.
In [1]:

```python
# assumes a PySpark shell/notebook where sqlContext is already available
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = sqlContext.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])

# split each sentence into lowercase words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.show(5, False)

# hash the words into a fixed number of term-frequency buckets
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
# alternatively, CountVectorizer can also be used to get term frequency vectors
featurizedData.select('rawFeatures', 'label').show(5, False)

# rescale the term frequencies by inverse document frequency
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "label").show(5, False)
```
```
+-----+-----------------------------------+------------------------------------------+
|label|sentence                           |words                                     |
+-----+-----------------------------------+------------------------------------------+
|0    |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
|0    |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|1    |Logistic regression models are neat|[logistic, regression, models, are, neat] |
+-----+-----------------------------------+------------------------------------------+

+-----------------------------------------+-----+
|rawFeatures                              |label|
+-----------------------------------------+-----+
|(20,[5,6,9],[2.0,1.0,2.0])               |0    |
|(20,[3,5,12,14,18],[2.0,2.0,1.0,1.0,1.0])|0    |
|(20,[5,12,14,18],[1.0,2.0,1.0,1.0])      |1    |
+-----------------------------------------+-----+

+--------------------------------------------------------------------------------------------------------+-----+
|features                                                                                                |label|
+--------------------------------------------------------------------------------------------------------+-----+
|(20,[5,6,9],[0.0,0.6931471805599453,1.3862943611198906])                                                |0    |
|(20,[3,5,12,14,18],[1.3862943611198906,0.0,0.28768207245178085,0.28768207245178085,0.28768207245178085])|0    |
|(20,[5,12,14,18],[0.0,0.5753641449035617,0.28768207245178085,0.28768207245178085])                      |1    |
+--------------------------------------------------------------------------------------------------------+-----+
```
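Note that HashingTF maps each word to a bucket index with a hash function, so with only numFeatures=20 buckets distinct words can collide: the first sentence's five words land in just three indices (5, 6, 9). A minimal pure-Python sketch of the idea (using Python's built-in hash rather than the MurmurHash3 that Spark actually uses, so the bucket indices will differ from Spark's):

```python
# Toy term-frequency hasher: each word is hashed into one of num_features
# buckets and that bucket's count is incremented.  NOTE: Python's hash() is
# not Spark's MurmurHash3, so the indices here will not match Spark's output.
def hashing_tf(words, num_features=20):
    vec = {}
    for w in words:
        idx = hash(w) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

tf = hashing_tf(["hi", "i", "heard", "about", "spark"])
# the counts always sum to the number of tokens, even when buckets collide
assert sum(tf.values()) == 5.0
```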
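The rescaled values can be checked by hand. Spark's IDF uses the smoothed formula idf(t) = ln((numDocs + 1) / (docFreq(t) + 1)), and each feature value is tf * idf. A quick sanity check of the numbers in the last table (plain Python, not Spark code):

```python
import math

NUM_DOCS = 3  # three sentences in the DataFrame above

def idf(doc_freq, num_docs=NUM_DOCS):
    # Spark MLlib's smoothed IDF: ln((|D| + 1) / (df + 1))
    return math.log((num_docs + 1) / (doc_freq + 1))

# a bucket hit in all 3 docs gets weight 0 -- it carries no information
assert idf(3) == 0.0
# a bucket hit in exactly 1 doc: ln(4/2) = 0.6931...
assert math.isclose(idf(1), 0.6931471805599453)
# a bucket hit in 2 docs: ln(4/3) = 0.2876...
assert math.isclose(idf(2), 0.28768207245178085)
```

This is why bucket 5 (hit in every sentence) is rescaled to 0.0, while rarer buckets keep larger weights: common words are down-weighted, distinctive words are up-weighted.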