Repost | Case Study: Feature Selection Based on a Greedy Algorithm
2016-10-07 11:13
This article is reposted from R语言中文社区; see the original post for details.
————————————————————————————————————————————————————————
Greedy Algorithm Feature Selection
A greedy algorithm is one that, when solving a problem, always makes the choice that looks best at the current step. In other words, it does not reason about the global optimum; what it produces is, in some sense, a locally optimal solution. A greedy algorithm does not reach the global optimum for every problem; the key is the choice of greedy strategy, and the chosen strategy must have no after-effects, meaning that whatever happened before a given state does not influence later states, which depend only on the current state. Algorithm design:
Initialize the objective value of the problem
while (the constraints of the optimization goal still hold) {
  Use the selection strategy to find one feasible solution in the solution space
}
Combine all feasible solutions into the target solution space.
options(warn = -1)
require(magrittr)
require(dplyr)
require(glmnet)
require(pROC)  # provides auc(), used inside GreedyAlgorithm()

# Greedy Algorithm
GreedyAlgorithm = function(dataSet) {
  # Greedy feature selection based on logistic regression, with AUC as the evaluation metric
  #
  # Args:
  #   dataSet: A dataframe that contains a feature "label"
  #
  # Returns:
  #   A vector of selected features
  features = data.frame(name = colnames(dataSet)) %>%
    dplyr::filter(name != "label")   # all features of the dataSet except "label"
  features = as.vector(features$name)
  featureSelect = c("label")         # init the feature vector to be selected
  scoreBefore = data.frame()         # stores the (feature, aucValue) pair kept at the end of each iteration
  while ((nrow(scoreBefore) < 2 ||
          scoreBefore[nrow(scoreBefore), 2] > scoreBefore[nrow(scoreBefore) - 1, 2]) &&
         nrow(scoreBefore) < length(features)) {
    score = data.frame()
    for (feature in features) {
      if (length(intersect(feature, featureSelect)) == 0) {
        trainData = dataSet[, append(featureSelect, feature)]
        model = glm(label ~ ., family = "binomial", data = trainData, epsilon = 1e-10)
        prediction = predict(model, trainData)
        aucValue = auc(trainData$label, prediction)
        score = rbind(score, data.frame(feature = feature, aucValue = aucValue))
      }
    }
    featureSelect = unique(append(featureSelect,
                                  as.character(score[which.max(score$aucValue), 1])))
    scoreBefore = rbind(scoreBefore, score[which.max(score$aucValue), ])
  }
  featureSelect = head(featureSelect, length(featureSelect) - 1)  # drop the last added feature, which no longer satisfies the loop condition
  return(featureSelect[-1])  # return the selected features except "label"
}
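To show how the function is meant to be called, here is a minimal usage sketch on simulated data. The toy data frame and its columns x1, x2, x3 are my own illustration and are not part of the original post.

# Hypothetical toy data: x1 and x2 carry signal, x3 is noise.
set.seed(42)
toy = data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$label = ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(200) > 0, 1, 0)
# Greedily add one feature at a time while the training AUC keeps improving.
GreedyAlgorithm(toy)
# Typically returns the informative features, e.g. something like c("x1", "x2").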
The KS value measures a model's ability to separate positive cases from negative ones: the larger the value, the better the model discriminates. As a rule of thumb, KS > 0.3 indicates reasonably good predictive performance.
How the KS value is computed:
Sort all samples by predicted score from low to high and split them into N equal groups. For each group, compute the actual numbers of good and bad samples, the cumulative numbers of good and bad samples, the cumulative good-sample rate, the cumulative bad-sample rate, and their difference. The actual good/bad counts are the counts within the group; the cumulative counts are the running totals up to and including that group; the cumulative rates are the cumulative counts divided by the total numbers of good and bad samples; and the difference is the cumulative bad-sample rate minus the cumulative good-sample rate. The KS value is the maximum absolute difference.
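In symbols (my restatement of the prose above, not from the original post), with good_j and bad_j the good and bad counts in group j out of N groups:

KS = \max_{1 \le k \le N} \left| \frac{\sum_{j \le k} \mathrm{bad}_j}{\sum_{j=1}^{N} \mathrm{bad}_j} - \frac{\sum_{j \le k} \mathrm{good}_j}{\sum_{j=1}^{N} \mathrm{good}_j} \right|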
# ksValue
KsValue = function(prediction, n) {
  # Compute the KS value of a model from its predicted probabilities
  #
  # Args:
  #   prediction: A vector of predicted probabilities from a model
  #   n: The number of groups
  #
  # Returns:
  #   A vector of absolute differences between the cumulative rates of the two
  #   classes, one entry per group; the KS value is the maximum of this vector
  dataResult = sort(prediction, decreasing = T)
  a = c()  # expected count of positive (label = 1) samples per group, approximated by summing predicted probabilities
  b = c()  # expected count of negative samples per group (group size minus a)
  c = c()
  a[1] = 0
  b[1] = 0
  c[1] = 0
  if (length(dataResult) %% n == 0) {
    cut = length(dataResult) / n
    for (i in 2:(n + 1)) {
      a[i] = sum(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))])
      b[i] = length(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))]) - a[i]
    }
  } else {
    cut = round(length(dataResult) / n)
    for (i in 2:n) {
      a[i] = sum(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))])
      b[i] = length(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))]) - a[i]
    }
    # the last group takes all remaining samples
    a[n + 1] = sum(dataResult[(cut * (n - 1) + 1):length(dataResult)])
    b[n + 1] = length(dataResult[(cut * (n - 1) + 1):length(dataResult)]) - a[n + 1]
  }
  c = abs(cumsum(a) / sum(a) - cumsum(b) / sum(b))
  return(c)
}
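A quick usage sketch follows; the simulated score vector is my own illustration, not data from the original post. KsValue() takes a vector of predicted probabilities and a group count, and the KS statistic is the maximum of the returned differences.

# Hypothetical scores: 1000 simulated probabilities split into 10 groups.
set.seed(1)
p = runif(1000)
diffs = KsValue(p, 10)
max(diffs)  # the KS value under this grouped, probability-weighted approximation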
require(caret)
require(pROC)
data = read.csv("/data/workspace/Rworkspace/data_test.csv", encoding = "UTF-8")
data %<>% mutate(label = ifelse(target > 30, 1, 0))
data = data[, -1]
# impute missing values with the column median
data = data.frame(apply(data, 2, function(x) ifelse(is.na(x), median(x, na.rm = T), x)))
# Remove near-constant variables
# feature1 = nearZeroVar(data)
# data = data[,-feature1]
# Remove highly correlated predictors
# dataCor = cor(data)
# highcor = findCorrelation(dataCor,0.8)
# data = data[,-highcor]
# Feature selection with the greedy algorithm
# feature = GreedyAlgorithm(dataSet = data)
# The dataset is large and the algorithm is slow while rendering the HTML,
# so load the features that were already selected during testing
load("/data/workspace/Rworkspace/featureSelect.RData")
set.seed(521)
ind = base::sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
trainData = data[ind == 1, ]
testData = data[ind == 2, ]
model = cv.glmnet(as.matrix(trainData[, feature]), trainData[, "label"],
                  family = "binomial", type.measure = "auc", alpha = 0,
                  lambda.min.ratio = 0.0001)
prediction = predict(model, as.matrix(testData[, feature]), s = "lambda.min", type = "response")
# compute ksValue
ksValue = KsValue(prediction, 10)
par(mfrow = c(2, 1))
plot(density(ksValue), type = 'l', main = "ksValue Plot", xlab = "cutPoint", ylab = "density_ks")
ks_value = max(ksValue)
text(.2, 1.0, paste("ksValue = ", ks_value))
roc(testData$label, as.vector(prediction), auc = T, plot = T, print.auc = T)
##
## Call:
## roc.default(response = testData$label, predictor = as.vector(prediction), auc = T, plot = T, print.auc = T)
##
## Data: as.vector(prediction) in 5130 controls (testData$label 0) < 429 cases (testData$label 1).
## Area under the curve: 0.7385
par(mfrow=c(1,1))
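As a sanity check (my addition, reusing the prediction and testData objects from the chunk above), the two-sample Kolmogorov-Smirnov statistic between the score distributions of the two classes gives a label-based KS value that can be compared with the grouped, probability-weighted approximation returned by KsValue():

# Label-based KS: distance between the score distributions of the two classes.
ks.test(as.vector(prediction)[testData$label == 1],
        as.vector(prediction)[testData$label == 0])$statistic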