Machine Learning in Action, Chapter 3: Study Notes
2018-01-07 14:43
Chapter 3 covers decision trees. Since I am new to both machine learning and Python, I wanted to get through the book once quickly rather than study every detail (for example, I skipped plotting the tree with Matplotlib for now).
How the tree is built: first find a feature to split the data set on. The book computes the information gain of each candidate feature to pick the current best one, splits the data set on it to create branches, and then recurses on each branch.
1. Information Gain
```python
from math import log

def calcShannonEnt(dataSet):
    """Compute the Shannon entropy of the class labels in dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}                       # occurrence count of each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]         # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
```
2. Splitting the Data Set
```python
def splitDataSet(dataSet, axis, value):
    """Return the subset of dataSet whose feature `axis` equals `value`,
    with that feature column removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]         # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    """Pick the feature with the largest information gain."""
    numFeatures = len(dataSet[0]) - 1               # the last column holds the class labels
    baseEntropy = calcShannonEnt(dataSet)           # entropy before any split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):                    # iterate over all features
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)                  # unique values of feature i
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)    # split on feature i
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy         # info gain = reduction in entropy
        if infoGain > bestInfoGain:                 # keep the best feature seen so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature                              # index of the best feature
```
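As a quick sanity check, the functions above can be run on the toy "fish" dataset the book builds in its createDataSet function (inlined here so the snippet is self-contained; five samples, two binary features, class label in the last column):

```python
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            retDataSet.append(featVec[:axis] + featVec[axis+1:])
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        uniqueVals = set(example[i] for example in dataSet)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

# the book's toy dataset: two binary features plus a yes/no class label
dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 0, 'no']]

print(round(calcShannonEnt(dataSet), 3))   # 0.971
print(splitDataSet(dataSet, 0, 1))         # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(chooseBestFeatureToSplit(dataSet))   # 0  (feature 0 has the larger info gain)
```

Splitting on feature 0 leaves one pure branch (all 'no'), so it wins the information-gain comparison against feature 1.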
3. Building the Tree
```python
def createTree(dataSet, labels):
    """Recursively build the decision tree as nested dicts."""
    classList = [example[-1] for example in dataSet]    # all class labels
    if classList.count(classList[0]) == len(classList):
        return classList[0]              # stop splitting when all classes are equal
    if len(dataSet[0]) == 1:             # stop when no features remain, only the class label
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)        # best feature to split on
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}         # start a new branch
    del(labels[bestFeat])                # drop the chosen feature from labels
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)         # every value the best feature takes in the data set
    for value in uniqueVals:
        subLabels = labels[:]            # copy labels so recursion doesn't clobber the caller's list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
```
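createTree calls majorityCnt, which this excerpt skips. A minimal majority-vote sketch, consistent in name and behavior with the book's version (the exact body here is my reconstruction):

```python
def majorityCnt(classList):
    # majority vote: return the class label that appears most often
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    # sort labels by count, descending, and return the most frequent one
    return sorted(classCount.items(), key=lambda kv: kv[1], reverse=True)[0][0]

print(majorityCnt(['yes', 'no', 'no']))   # no
```

With this in place, running createTree on the fish dataset with feature names like the book's ['no surfacing', 'flippers'] should yield a nested-dict tree of the form {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}.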