
Mahout Text Classification Process

2014-05-13 13:52

The following is the CBayes-based text classification process as described on the official Mahout website:


End-to-end commands to build a CBayes model for 20 Newsgroups:

The 20 newsgroups example script issues the following commands. We can build a CBayes classifier from the command line by following the same process as the script:

Be sure that MAHOUT_HOME/bin and HADOOP_HOME/bin are in your $PATH.
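A quick way to confirm the $PATH requirement before starting is sketched below. The helper name check_tools is hypothetical (it is not part of Mahout or Hadoop); it only relies on the standard shell builtin `command -v`.

```shell
# Print the names of any required commands that cannot be found through $PATH.
# The body runs in a subshell (parentheses), so its variables do not leak out.
check_tools() (
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing}${missing:+ }${tool}"
    fi
  done
  printf '%s\n' "$missing"
)

# An empty line means both launchers were found.
check_tools mahout hadoop
```

If either name is printed, adding the corresponding bin directory to $PATH should resolve it.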

Create a working directory for the dataset and all input/output.

$ export WORK_DIR=/tmp/mahout-work-${USER}
$ mkdir -p ${WORK_DIR}


Download and extract 20news-bydate.tar.gz from the 20newsgroups dataset to the working directory.

$ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
$ mkdir -p ${WORK_DIR}/20news-bydate
$ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
$ mkdir ${WORK_DIR}/20news-all
$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
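After the copy, 20news-all should contain one subdirectory per newsgroup, with the train and test documents merged. A minimal sanity check, assuming a hypothetical helper named count_docs, could look like this:

```shell
# Print "<subdirectory><TAB><number of files>" for each subdirectory of $1.
count_docs() (
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue          # the glob matched nothing
    n=$(find "$dir" -type f | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$(basename "$dir")" "$n"
  done
)
```

Calling count_docs ${WORK_DIR}/20news-all should list 20 newsgroup names; an empty result suggests the extraction or copy step did not run as expected.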


If you're running on a Hadoop cluster, copy the dataset into HDFS:

$ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all



Convert the full 20newsgroups dataset into a <Text, Text> sequence file.

$ mahout seqdirectory \
    -i ${WORK_DIR}/20news-all \
    -o ${WORK_DIR}/20news-seq -ow


Convert and preprocess the dataset into a <Text, VectorWritable> sequence file containing term frequencies for each document.

$ mahout seq2sparse \
    -i ${WORK_DIR}/20news-seq \
    -o ${WORK_DIR}/20news-vectors \
    -lnorm \
    -nv \
    -wt tfidf


If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. -ng 2 for bigrams or -n 2 for L2 length normalization. See the "Creating vectors from text" page for a list of all seq2sparse options.
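To make the -n 2 option more concrete: L2 length normalization divides every component of a vector by the vector's Euclidean length, so the result has length 1. The awk sketch below illustrates only the arithmetic on plain whitespace-separated numbers; it is not how Mahout implements the option, and the function name l2_normalize is made up for this example.

```shell
# Scale each whitespace-separated vector (one per input line) to unit L2 length.
l2_normalize() {
  LC_ALL=C awk '{
    sum = 0
    for (i = 1; i <= NF; i++) { sum += $i * $i }
    norm = sqrt(sum)
    for (i = 1; i <= NF; i++) {
      printf "%s%.4f", (i > 1 ? " " : ""), (norm > 0 ? $i / norm : 0)
    }
    printf "\n"
  }'
}

echo "3 4" | l2_normalize   # prints: 0.6000 0.8000
```

The vector (3, 4) has length 5, so it becomes (0.6, 0.8). An all-zero vector is left as zeros to avoid dividing by zero.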

Split the preprocessed dataset into training and testing sets.

$ mahout split \
    -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors \
    --randomSelectionPct 40 \
    --overwrite --sequenceFiles -xm sequential
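With --randomSelectionPct 40, about 40 percent of the documents are randomly selected as test data and the rest are kept for training. The idea can be sketched on an ordinary text file with one record per line. This is a conceptual illustration only: Mahout's split operates on sequence files, and the function name split_lines and the fixed seed are assumptions made for this example.

```shell
# Send roughly $2 percent of the lines in $1 to $4 (test), the rest to $3 (train).
split_lines() {
  : > "$3"; : > "$4"                 # make sure both outputs exist, even if empty
  awk -v pct="$2" -v train="$3" -v held="$4" '
    BEGIN { srand(42) }              # arbitrary fixed seed for a repeatable split
    { if (rand() * 100 < pct) { print > held } else { print > train } }
  ' "$1"
}
```

For example, split_lines docs.txt 40 train.txt test.txt (file names are placeholders) leaves every input line in exactly one of the two output files.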


Train the classifier.

$ mahout trainnb \
    -i ${WORK_DIR}/20news-train-vectors -el \
    -o ${WORK_DIR}/model \
    -li ${WORK_DIR}/labelindex \
    -ow \
    -c


Test the classifier.

$ mahout testnb \
    -i ${WORK_DIR}/20news-test-vectors \
    -m ${WORK_DIR}/model \
    -l ${WORK_DIR}/labelindex \
    -ow \
    -o ${WORK_DIR}/20news-testing \
    -c
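
When scripting the training and testing steps, it can help to rehearse them before anything is executed. The sketch below wraps each command in a hypothetical run helper that only prints the command while DRY_RUN is 1 (the default chosen here). The DRY_RUN variable and the WORK_DIR fallback are assumptions of this example; the mahout flags are exactly the ones used above.

```shell
# Echo the command when DRY_RUN=1 (the default here); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Fallback only so the sketch runs on its own; normally WORK_DIR is already exported.
WORK_DIR=${WORK_DIR:-/tmp/mahout-work-${USER:-unknown}}

run mahout trainnb -i "${WORK_DIR}/20news-train-vectors" -el \
    -o "${WORK_DIR}/model" -li "${WORK_DIR}/labelindex" -ow -c
run mahout testnb -i "${WORK_DIR}/20news-test-vectors" \
    -m "${WORK_DIR}/model" -l "${WORK_DIR}/labelindex" \
    -ow -o "${WORK_DIR}/20news-testing" -c
```

Setting DRY_RUN=0 before the two run lines executes the commands for real.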
Tags: Mahout