
Data Sampling with SMOTE in R (notes from Xie Jiabiao's lecture)

2017-02-27 21:24

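It had been a while since I last did any sampling, and I had forgotten most of what I learned.

So before working through the example, here is a quick review of how the data is obtained.

hyper<-read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data",header=FALSE)

names<-read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.names",header=FALSE,sep='\t')[[1]]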

> names
 [1] hypothyroid, negative.     age:                       sex:                      
 [4] on_thyroxine:              query_on_thyroxine:        on_antithyroid_medication:
 [7] thyroid_surgery:           query_hypothyroid:         query_hyperthyroid:       
[10] pregnant:                  sick:                      tumor:                    
[13] lithium:                   goitre:                    TSH_measured:             
[16] TSH:                       T3_measured:               T3:                       
[19] TT4_measured:              TT4:                       T4U_measured:             
[22] T4U:                       FTI_measured:              FTI:                      
[25] TBG_measured:              TBG:

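Clean the column names: some contain "." and others end with ":", so strip both symbols.

name1<-gsub(".|:","",names)

Does this remove just the "." and ":" characters? No. In a regular expression an unescaped . matches any character, so this pattern deletes everything. To match a literal dot, wrap it in "[]":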

> gg<-gsub(":|[.]","",names)
> gg
[1] "hypothyroid, negative"     "age"                       "sex"
[4] "on_thyroxine"              "query_on_thyroxine"        "on_antithyroid_medication"
[7] "thyroid_surgery"           "query_hypothyroid"         "query_hyperthyroid"
[10] "pregnant"                  "sick"                      "tumor"
[13] "lithium"                   "goitre"                    "TSH_measured"
[16] "TSH"                       "T3_measured"               "T3"
[19] "TT4_measured"              "TT4"                       "T4U_measured"
[22] "T4U"                       "FTI_measured"              "FTI"
[25] "TBG_measured"              "TBG"
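As a quick check of the point about the dot, here is a small sketch on a hand-typed vector (`cols` is a made-up sample of the names above, not the downloaded file). It compares the unescaped pattern with two ways of matching a literal dot.

```r
# Three of the raw names from above, typed in by hand for illustration.
cols <- c("hypothyroid, negative.", "age:", "TSH_measured:")

# Unescaped "." matches any character, so every character is removed.
wrong <- gsub(".|:", "", cols)

# "[.]" matches a literal dot; a single character class handles both symbols.
right <- gsub(":|[.]", "", cols)
same  <- gsub("[.:]", "", cols)

print(wrong)
print(right)
print(identical(right, same))
```

The character class "[.:]" is arguably the easiest form to read, since nothing inside the brackets needs escaping.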

A further note on using gsub(pattern, replacement, x).

For example:

> x <- "line 4322: He is now 25 years old, and weights 130lbs"
> y <- gsub("\\d+","---",x)
> y

[1] "line ---: He is now --- years old, and weights ---lbs"

This replaces each run of digits with "---" (for more on replacement, see http://www.endmemo.com/program/R/gsub.php).

> colnames(hyper)
 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12" "V13" "V14" "V15" "V16"
[17] "V17" "V18" "V19" "V20" "V21" "V22" "V23" "V24" "V25" "V26"
> colnames(hyper)<-gg

colnames(hyper)[1]<-"target"
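The renaming steps above can be tried offline. The sketch below reads a three-row CSV from an inline string (`raw` and its values are invented for illustration) to show the default V1, V2, ... names that header = FALSE produces and how they are replaced.

```r
# A tiny header-less CSV, made up for illustration.
raw <- "hypothyroid,72,M\nnegative,15,F\nnegative,24,M"

# With header = FALSE, read.csv names the columns V1, V2, ...
toy <- read.csv(text = raw, header = FALSE)
default_names <- colnames(toy)

# Replace all names at once, then overwrite just the first one.
colnames(toy) <- c("hypothyroid, negative", "age", "sex")
colnames(toy)[1] <- "target"

print(default_names)
print(colnames(toy))
```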

> table(hyper$target)

hypothyroid    negative
151        2983

> hyper$target<-ifelse(hyper$target=="negative",0,1)# a neat recoding: negative becomes 0, hypothyroid becomes 1

> table(hyper$target)

0    1
2983  151

table() gives frequency counts. For proportions, use prop.table():

> prop.table(table(hyper$target))

0          1
0.95181876 0.04818124

Convert the 0/1 target into a factor.

> str(hyper$target)
num [1:3134] 1 1 1 1 1 1 1 1 1 1 ...
> hyper$target<-as.factor(hyper$target)
> str(hyper$target)
Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
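The recoding, counting and factor conversion can also be tried without downloading anything. The sketch below uses a made-up 10-element vector (`toy_target` is hypothetical, not the thyroid data) and follows the same sequence of calls.

```r
# Hypothetical labels: 3 positive cases and 7 negative ones.
toy_target <- c(rep("hypothyroid", 3), rep("negative", 7))

# Recode to 0/1 exactly as above: negative becomes 0, everything else 1.
toy01 <- ifelse(toy_target == "negative", 0, 1)

counts <- table(toy01)        # frequency counts
props  <- prop.table(counts)  # proportions that sum to 1

# SMOTE expects the target as a factor.
toy_factor <- as.factor(toy01)

print(counts)
print(props)
str(toy_factor)
```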

Now for the main topic.

1. The SMOTE function

SMOTE(form, data, perc.over = 200, k = 5, perc.under = 200, learner = NULL, ...)

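form: a formula, usually of the form V1~. where V1 is the name of the class (target) variable.

data: the data frame containing the whole data set.

perc.over = 200: for each minority case, perc.over/100 = 2 synthetic cases are generated, so the minority class grows to 3 times its original size.

perc.under = 200: for each synthetic minority case, perc.under/100 = 2 majority cases are sampled, i.e. 4 times the original minority count in total.

k: the number of nearest neighbours used when generating each synthetic minority case. The default of 5 can usually be left alone.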
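To give a feel for what k does, here is a minimal sketch of the interpolation idea behind SMOTE for numeric features only. It is not the DMwR source: `synth_one` and the random `minority` matrix are hypothetical, and the real function also handles nominal variables.

```r
set.seed(1)
# 10 hypothetical minority cases with 2 numeric features.
minority <- matrix(rnorm(20), ncol = 2)

# Build one synthetic case from case i and one of its k nearest neighbours.
synth_one <- function(X, i, k = 5) {
  # Euclidean distance from case i to every case in X.
  d <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))
  d[i] <- Inf                    # never pick the case itself
  nn <- order(d)[seq_len(k)]     # indices of the k nearest neighbours
  j <- nn[sample.int(k, 1)]      # choose one neighbour at random
  gap <- runif(1)                # random position along the segment
  X[i, ] + gap * (X[j, ] - X[i, ])
}

new_case <- synth_one(minority, i = 1, k = 5)
print(new_case)
```

A larger k lets the synthetic case be drawn towards more distant minority neighbours, while a small k keeps it close to the original case.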

Applying this to the data prepared above:

install.packages("DMwR")

> library(DMwR)
> hyper_new<-SMOTE(target~.,hyper,perc.over=200,perc.under=200)
> table(hyper_new$target)

0   1
604 453
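The class counts above follow from the two percentages and the 151 minority cases counted earlier. The helper below is a sketch of that arithmetic only (`smote_sizes` is a hypothetical name, and it assumes perc.over is at least 100); it does not call DMwR.

```r
# Expected class sizes after SMOTE, given the original minority count.
smote_sizes <- function(n_minority, perc.over = 200, perc.under = 200) {
  # Synthetic minority cases created: perc.over/100 per original case.
  n_synthetic <- n_minority * perc.over / 100
  # Majority cases sampled: perc.under/100 per synthetic case.
  n_majority <- n_synthetic * perc.under / 100
  c(majority = n_majority, minority = n_minority + n_synthetic)
}

sizes <- smote_sizes(151)
print(sizes)
```

With 151 minority cases this gives 302 synthetic cases, so 453 minority and 604 majority cases, matching the table.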

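The example in help(SMOTE) follows the same principle and is worth trying. Note that there, too, the target variable has to be converted to a factor.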
