
A small, up-to-date Chinese text mining example with code

2012-05-30 17:16, 316 views

http://bbs.pinggu.org/thread-853290-1-1.html

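Chinese word segmentation:

Because tm and openNLP handle Chinese poorly, segmentation here uses imdict-chinese-analyzer, an intelligent segmenter based on HHMM (hierarchical hidden Markov model). It is a Java reimplementation of ICTCLAS, the segmenter developed by Dr. Zhang Huaping at the Chinese Academy of Sciences.

Segmentation results: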

zw <- c("如果你聽到某人說他使用某軟體，然后看看效果，有些美中不足，那就叫《星光灿烂》吧！thus do not have the texts already

           stored on a hard disk, and want to save the text documents to disk")

1. With stop words removed:

zwfc(zw,zj1)

[1] "聽某人說使用軟體看看效果美中不足星光灿烂 thu text alreadi store hard disk save text document

disk time: 0.109 s"

2. Without removing stop words:

zwfc(zw,zj1)

[1] "如果你聽到某人說他使用某軟體 , 然后看看效果 , 有些美中不足 , 那就叫 , 星光灿烂 , 吧

, thu do not have the text alreadi store on a hard disk , and want to save the text document to disk time: 0.0

s"

The segmenter still handles person and place names poorly; most of them are split into single characters.
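The stop-word step in the two runs above can be sketched in base R. This is only an illustration on the English tail of the sample sentence: `zwfc` itself is not shown in the post, the stop list below is a hypothetical one chosen to match the words dropped in the sample output, and no stemming is applied (the post's output is also stemmed, e.g. "alreadi").

```r
# English tail of the sample sentence used in this post.
txt <- paste("thus do not have the texts already stored on a hard disk,",
             "and want to save the text documents to disk")

# Hypothetical stop list (an assumption, not the list used by zwfc).
stop_words <- c("do", "not", "have", "the", "on", "a", "and", "want", "to")

# Split on anything that is not a lowercase letter, then drop empty tokens.
tokens <- strsplit(tolower(txt), "[^a-z]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Keep only tokens that are not in the stop list.
kept <- tokens[!(tokens %in% stop_words)]
result <- paste(kept, collapse = " ")
print(result)
```

For Chinese text the same filter would be applied after segmentation, once the sentence has been split into word tokens.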

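Here is a simple example:

1. Install the tm and rJava packages, and install the Java runtime environment from the Sun website.

2. Unzip the archive below to the root of drive C.

3. Run the program in R.

Results:

Five files in total: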

$FileList

[1] "c:/text/荷兰队长上演惊天远射.txt"

[2] "c:/text/技术化转型路上德国人受重创.txt"

[3] "c:/text/普约尔贡献头球绝杀.txt"

[4] "c:/text/四大天王沉沦各有难念的经.txt"

[5] "c:/text/再战德班德西命运迥异.txt"

-----------------------------------------

1. Find the terms that occur at least 5 times ##

> findFreqTerms(dtm, 5)

[1] "乌拉圭" "西班牙"

--------------------------------------------

2. Find the terms whose correlation with "西班牙" is at least 0.8 ###

> findAssocs(dtm, "西班牙", 0.8)

西班牙 德意志

  1.00   0.92
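What the association score measures can be sketched in base R: the Pearson correlation, across documents, between one term's counts and every other term's counts. The counts below are copied from the three-term matrix printed at the end of this post; `assoc_sketch` is a hypothetical helper, not a tm function. "德意志" is not part of that three-term subset, so it does not appear here.

```r
# Counts from the three-term matrix printed in this post
# (rows = documents 1..5, columns = terms).
m <- matrix(
  c(1, 2, 0,
    1, 1, 5,
    1, 2, 4,
    0, 3, 1,
    1, 1, 7),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5), c("半决赛", "世界杯", "西班牙"))
)

# Hypothetical helper: correlate one term's column with all columns
# and keep the terms whose correlation reaches the limit.
assoc_sketch <- function(m, term, corlimit) {
  r <- cor(m[, term], m)[1, ]
  sort(r[r >= corlimit], decreasing = TRUE)
}

res <- assoc_sketch(m, "西班牙", 0.8)
print(res)
```

On this subset only "西班牙" itself passes the 0.8 limit, since a term always correlates perfectly with itself.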

--------------------------------------------

Document-term matrix after removing sparse terms (terms absent from 40% or more of the documents):

inspect(removeSparseTerms(dtm, 0.4))

A document-term matrix (5 documents, 5 terms)

Non-/sparse entries: 22/3

Sparsity           : 12%

Maximal term length: 5

Weighting          : term frequency (tf)

     Terms
Docs 0.0 time: 半决赛 世界杯 西班牙
   1   0     1      1      2      0
   2   1     1      1      1      5
   3   1     1      1      2      4
   4   1     1      0      3      1
   5   1     1      1      1      7
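The sparsity filter can be imitated in base R. The sketch assumes the rule is "keep a term only if its share of zero entries is below the threshold", which is my reading of `removeSparseTerms` rather than a quotation of its source. The first five columns are the counts printed above; `rare_term` is a hypothetical extra column added only so that something gets dropped.

```r
# Counts from the five-term matrix printed in this post.
m5 <- matrix(
  c(0, 1, 1, 2, 0,
    1, 1, 1, 1, 5,
    1, 1, 1, 2, 4,
    1, 1, 0, 3, 1,
    1, 1, 1, 1, 7),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5),
                  c("0.0", "time:", "半决赛", "世界杯", "西班牙"))
)

# Hypothetical term that occurs in only one of the five documents.
m6 <- cbind(m5, rare_term = c(0, 0, 0, 6, 0))

# Hypothetical helper: drop columns whose share of zeros is >= sparse.
drop_sparse <- function(m, sparse) {
  zero_share <- colMeans(m == 0)
  m[, zero_share < sparse, drop = FALSE]
}

kept <- drop_sparse(m6, 0.4)
print(colnames(kept))
print(sum(kept == 0))  # number of zero cells left in the matrix
```

The "0.0" and "time:" columns are fragments of the segmenter's timing message that ended up in the corpus; they survive the filter because they occur in almost every document.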

----------------------------------------

### Dictionary ### A dictionary is typically used to denote the terms relevant to a text mining task

A document-term matrix (5 documents, 3 terms)

Non-/sparse entries: 13/2

Sparsity           : 13%

Maximal term length: 3

Weighting          : term frequency (tf)

     Terms
Docs 半决赛 世界杯 西班牙
   1      1      2      0
   2      1      1      5
   3      1      2      4
   4      0      3      1
   5      1      1      7
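The effect of a dictionary can be imitated in base R by keeping only the columns named in it. In tm itself this is normally requested through the `dictionary` entry of the control list passed to `DocumentTermMatrix`; the sketch below does not call tm and simply subsets the five-term matrix printed earlier in this post. The name `dict` is an assumption for illustration.

```r
# Counts from the five-term matrix printed earlier in this post.
m5 <- matrix(
  c(0, 1, 1, 2, 0,
    1, 1, 1, 1, 5,
    1, 1, 1, 2, 4,
    1, 1, 0, 3, 1,
    1, 1, 1, 1, 7),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5),
                  c("0.0", "time:", "半决赛", "世界杯", "西班牙"))
)

# The dictionary: the only terms we want as columns.
dict <- c("半决赛", "世界杯", "西班牙")

# Restrict the matrix to dictionary terms.
m_dict <- m5[, dict, drop = FALSE]
print(m_dict)

# Non-sparse / sparse cell counts, as in the header tm prints.
print(c(nonzero = sum(m_dict != 0), zero = sum(m_dict == 0)))
```

The subset has 13 non-zero and 2 zero cells, which matches the "13/2" header of the matrix shown above.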

Source: 人大经济论坛 (Renda Economics Forum), S-Plus & R board. Original post: http://bbs.pinggu.org/forum.php?mod=viewthread&tid=853290&page=1
