The Mahout Text Classification Process
2014-05-13 13:52
The following is the CBayes-based text classification process as given on the official Mahout site:
End-to-end commands to build a CBayes model for 20 Newsgroups:
The 20newsgroups example script issues the commands below. We can build a CBayes classifier from the command line by following the process in the script:
Be sure that MAHOUT_HOME/bin and HADOOP_HOME/bin are in your $PATH.
Create a working directory for the dataset and all input/output.
$ export WORK_DIR=/tmp/mahout-work-${USER}
$ mkdir -p ${WORK_DIR}
Download and extract the 20news-bydate.tar.gz from the 20newsgroups
dataset to the working directory.
$ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
$ mkdir -p ${WORK_DIR}/20news-bydate
$ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
$ mkdir ${WORK_DIR}/20news-all
$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
If you're running on a Hadoop cluster, put the dataset into HDFS:
$ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
Convert the full 20newsgroups dataset into a <Text, Text> sequence file.
$ mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow
Convert and preprocess the dataset into a <Text, VectorWritable> sequence file containing term frequencies for each document.
$ mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. -ng 2 for bigrams or -n 2 for L2 length normalization. See the "Creating vectors from text" page for a list of all seq2sparse options.
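As a sketch of what such a variant might look like (the output path 20news-vectors-ng2 is hypothetical, chosen here only to keep the result separate from the vectors built above):

```shell
# Hypothetical variant of the seq2sparse step above:
# -ng 2 adds bigrams, -n 2 applies L2 length normalization.
# Requires a working Mahout/Hadoop installation, as in the rest of this walkthrough.
$ mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors-ng2 \
    -ng 2 -n 2 -nv -wt tfidf
```

The later split/trainnb/testnb steps would then point at the new vector directory instead of ${WORK_DIR}/20news-vectors.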
Split the preprocessed dataset into training and testing sets (--randomSelectionPct 40 holds out 40% of the vectors for testing).
$ mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
Train the classifier.
$ mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow -c
Test the classifier.
$ mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing -c