Mahout's Bundled Examples: Classification
2013-11-14 11:43
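Introduction
The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, spread evenly across 20 different newsgroups. It has become a popular dataset for experiments in text applications of machine learning techniques, such as text classification and text clustering. We will use Mahout's Bayes Classifier to build a model that classifies a new document into one of the 20 newsgroups.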
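Prerequisites
- Mahout has been downloaded
- Maven is available
- The following environment variables are set:
HADOOP_HOME  the Hadoop installation path
MAHOUT_HOME  the Mahout installation path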
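Before going further it can help to confirm that both variables are actually set. The following is a minimal sketch, not part of Mahout or Hadoop: `check_env` is a hypothetical helper name, and it only checks that each variable is non-empty and points at an existing directory.

```shell
#!/bin/sh
# check_env is a hypothetical helper (not shipped with Mahout or Hadoop).
# It reports whether each named environment variable is set and points
# at an existing directory, and returns non-zero if any check fails.
check_env() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory"
            status=1
        else
            echo "$name=$value"
        fi
    done
    return "$status"
}

# Check the two variables this walkthrough relies on.
check_env HADOOP_HOME MAHOUT_HOME || echo "fix the variables above before continuing"
```

If either variable is reported as missing, export it in your shell profile and open a new shell before running the remaining steps.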
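Installing Mahout
If you have downloaded a Mahout distribution, unpack it with unzip/untar and change into the extracted directory.
1. Change into the trunk directory, then compile and build the Hadoop job: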
mvn install
For Mahout 0.2+:
1. Create a working directory for the 20newsgroups data:
$ mkdir $MAHOUT_HOME/examples/bin/work/
$ cd $MAHOUT_HOME/examples/bin/work/
2. Download 20news-bydate.tar.gz from the 20newsgroups dataset page
3. Extract the archive:
mkdir -p 20news-bydate
tar zxf 20news-bydate.tar.gz -C 20news-bydate
4. Generate the training input data (run from $MAHOUT_HOME, since the paths below are relative to it):
$> $MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
-p examples/bin/work/20news-bydate/20news-bydate-train \
-o examples/bin/work/20news-bydate/bayes-train-input \
-a org.apache.mahout.vectorizer.DefaultAnalyzer \
-c UTF-8
5. Generate the test input data:
$> $MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
-p examples/bin/work/20news-bydate/20news-bydate-test \
-o examples/bin/work/20news-bydate/bayes-test-input \
-a org.apache.mahout.vectorizer.DefaultAnalyzer \
-c UTF-8
Running the 20newsgroups example on a Hadoop cluster
Setting up the Hadoop cluster
1. Edit hadoop-site.xml and add your local settings (see the Hadoop quickstart):
emacs $HADOOP_HOME/conf/hadoop-site.xml
2. Format HDFS:
$ $HADOOP_HOME/bin/hadoop namenode -format
3. Start Hadoop:
$ $HADOOP_HOME/bin/start-all.sh
4. Upload the files to HDFS:
$ $HADOOP_HOME/bin/hadoop dfs -put $MAHOUT_HOME/examples/bin/work/20news-bydate/bayes-train-input 20news-input
Training the Bayes classifier with tri-grams
The following command runs 4 MapReduce jobs on Hadoop to train the classifier. On a single-node machine this will take a while.
$> $MAHOUT_HOME/bin/mahout trainclassifier \
-i 20news-input/bayes-train-input \
-o newsmodel \
-type bayes \
-ng 3 \
-source hdfs
You can monitor the job's status by opening a browser on the JobTracker machine and visiting http://localhost:50030/jobtracker.jsp
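The -ng 3 option in the training command sets the n-gram size to 3. As a rough illustration of what a word tri-gram is, the sketch below splits a sentence on whitespace and prints every run of n consecutive words. The `ngrams` function is a hypothetical helper written for this example; how Mahout tokenizes text and builds its features internally is not covered here and may differ.

```shell
#!/bin/sh
# ngrams is a hypothetical helper for illustration only.
# Usage: ngrams N "some text"
# Prints every run of N consecutive whitespace-separated words, one per line.
ngrams() {
    n=$1
    shift
    printf '%s\n' "$*" | awk -v n="$n" '{
        for (i = 1; i + n - 1 <= NF; i++) {
            gram = $i
            # Append the next n-1 words to the current starting word.
            for (j = 1; j < n; j++) gram = gram " " $(i + j)
            print gram
        }
    }'
}

# Tri-grams of a four-word sentence: two overlapping windows.
ngrams 3 "the quick brown fox"
```

Larger n-grams capture more context per feature but multiply the number of distinct features, which is one reason training with -ng 3 takes longer than with -ng 2.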
Running the test classifier on the input directory
$> $MAHOUT_HOME/bin/mahout testclassifier \
-m newsmodel \
-d 20news-input \
-type bayes \
-ng 3 \
-source hdfs \
-method mapreduce
Output:
08/11/07 16:52:39 INFO bayes.TestClassifier: Done loading model: # labels: 20
08/11/07 16:52:39 INFO bayes.TestClassifier: Done generating Model
08/11/07 16:52:57 INFO bayes.TestClassifier: alt.atheism 96.99624530663337 775/799.0
08/11/07 16:53:15 INFO bayes.TestClassifier: comp.graphics 99.28057553956835 966/973.0
08/11/07 16:53:45 INFO bayes.TestClassifier: comp.os.ms-windows.misc 96.95431472081218 955/985.0
08/11/07 16:53:59 INFO bayes.TestClassifier: comp.sys.ibm.pc.hardware 99.59266802443992 978/982.0
08/11/07 16:54:10 INFO bayes.TestClassifier: comp.sys.mac.hardware 99.47970863683663 956/961.0
08/11/07 16:54:28 INFO bayes.TestClassifier: comp.windows.x 99.59183673469387 976/980.0
08/11/07 16:54:38 INFO bayes.TestClassifier: misc.forsale 98.45679012345678 957/972.0
08/11/07 16:54:50 INFO bayes.TestClassifier: rec.autos 99.4949494949495 985/990.0
08/11/07 16:55:04 INFO bayes.TestClassifier: rec.motorcycles 100.0 994/994.0
08/11/07 16:55:16 INFO bayes.TestClassifier: rec.sport.baseball 99.89939637826961 993/994.0
08/11/07 16:55:36 INFO bayes.TestClassifier: rec.sport.hockey 99.89989989989989 998/999.0
08/11/07 16:55:54 INFO bayes.TestClassifier: sci.crypt 99.39455095862765 985/991.0
08/11/07 16:56:05 INFO bayes.TestClassifier: sci.electronics 98.98063200815494 971/981.0
08/11/07 16:56:27 INFO bayes.TestClassifier: sci.med 99.79797979797979 988/990.0
08/11/07 16:56:44 INFO bayes.TestClassifier: sci.space 99.3920972644377 981/987.0
08/11/07 16:57:06 INFO bayes.TestClassifier: soc.religion.christian 99.49849548645938 992/997.0
08/11/07 16:57:24 INFO bayes.TestClassifier: talk.politics.guns 99.45054945054945 905/910.0
08/11/07 16:57:51 INFO bayes.TestClassifier: talk.politics.mideast 98.82978723404256 929/940.0
08/11/07 16:58:13 INFO bayes.TestClassifier: talk.politics.misc 89.93548387096774 697/775.0
08/11/07 16:58:25 INFO bayes.TestClassifier: talk.religion.misc 61.78343949044586 388/628.0
08/11/07 16:58:25 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 18369 97.5621%
Incorrectly Classified Instances : 459 2.4379%
Total Classified Instances : 18828
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
994 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 994 a = rec.motorcycles
0 976 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 1 | 980 b = comp.windows.x
7 0 929 1 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 | 940 c = talk.politics.mideast
0 0 0 905 0 0 1 0 0 0 0 0 0 0 0 0 3 0 1 0 | 910 d = talk.politics.guns
4 1 4 27 388 1 0 1 0 5 1 1 2 2 149 7 2 33 0 0 | 628 e = talk.religion.misc
3 0 0 0 0 985 0 1 0 0 0 0 0 1 0 0 0 0 0 0 | 990 f = rec.autos
0 0 0 0 0 0 993 1 0 0 0 0 0 0 0 0 0 0 0 0 | 994 g = rec.sport.baseball
0 0 0 0 0 0 1 998 0 0 0 0 0 0 0 0 0 0 0 0 | 999 h = rec.sport.hockey
0 0 0 0 0 0 0 0 956 0 2 0 0 0 0 0 0 0 2 1 | 961 i = comp.sys.mac.hardware
0 0 0 0 0 0 0 0 0 981 0 0 5 0 0 1 0 0 0 0 | 987 j = sci.space
0 0 0 0 0 0 0 0 0 0 978 0 1 0 0 0 0 0 2 1 | 982 k = comp.sys.ibm.pc.hardware
1 0 3 36 0 1 2 1 0 5 0 697 4 0 3 3 19 0 0 0 | 775 l = talk.politics.misc
0 2 0 0 0 0 0 0 0 0 2 0 966 0 0 0 0 0 2 1 | 973 m = comp.graphics
1 0 0 0 0 0 0 0 0 0 6 0 0 971 0 0 0 0 3 0 | 981 n = sci.electronics
1 0 0 0 0 0 0 0 1 0 0 0 0 0 992 1 0 1 0 1 | 997 o = soc.religion.christian
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 988 0 0 0 1 | 990 p = sci.med
0 0 0 2 0 0 0 0 0 0 0 0 2 1 0 0 985 0 1 0 | 991 q = sci.crypt
0 0 0 1 1 0 0 0 0 1 0 0 1 0 19 0 1 775 0 0 | 799 r = alt.atheism
1 0 0 0 0 3 1 2 0 0 3 0 0 5 0 0 0 0 957 0 | 972 s = misc.forsale
0 0 0 8 0 0 0 0 0 0 6 0 6 0 0 0 0 0 10 955 | 985 t = comp.os.ms-windows.misc
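The percentages in the output above follow directly from the counts: overall accuracy is the number of correctly classified instances divided by the total, and each per-class figure is the diagonal entry of that class's confusion-matrix row divided by the row total. The sketch below reproduces the arithmetic with awk; `pct` is a hypothetical helper name, and the numbers are taken from the output shown above.

```shell
#!/bin/sh
# pct is a hypothetical helper for illustration only.
# Usage: pct CORRECT TOTAL
# Prints CORRECT as a percentage of TOTAL, to four decimal places.
pct() {
    awk -v correct="$1" -v total="$2" 'BEGIN { printf "%.4f\n", 100 * correct / total }'
}

pct 18369 18828   # overall accuracy from the Summary block: 97.5621
pct 459 18828     # overall error rate: 2.4379
pct 388 628       # talk.religion.misc (row e): 61.7834
```

Row e also shows where the weakest class goes wrong: 149 of its 628 documents land in column o (soc.religion.christian), which accounts for most of its errors.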
Complementary Naive Bayes
Training a CBayes classifier with bi-grams
$> $MAHOUT_HOME/bin/mahout trainclassifier \
-i 20news-input \
-o newsmodel \
-type cbayes \
-ng 2 \
-source hdfs
Testing a CBayes classifier with bi-grams
$> $MAHOUT_HOME/bin/mahout testclassifier \
-m newsmodel \
-d 20news-input \
-type cbayes \
-ng 2 \
-source hdfs \
-method mapreduce