
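Machine Learning in Action, Chapter 4, Sections 4.6-4.7. Example 1: Spam Email Filtering; Example 2: Revealing Regional Attitudes from Personal Ads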

2017-02-27 20:07

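The "Machine Learning in Action" series of posts mainly implements and works through the code in the book, so it amounts to reading notes. Hands-on material cannot be learned by reading alone, and once you start typing you run into all sorts of odd problems. The posts are rough and should be read alongside the book. I am learning as I go and my knowledge is limited, so please point out any mistakes in the comments. The book's code and data are widely available online; please download them yourself.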

4.6 Spam email filtering

4.6.1 Prepare the data: tokenizing text

A text string can be split into words with the string's split() method:

>>> mySent = 'This book is the best book on python or M.L. I have ever laid eyes upon'
>>> mySent.split()
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>>

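Punctuation ends up as part of the words. A regular expression can do the splitting instead, with the delimiter being any run of characters other than letters and digits.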

>>> import re
>>> regEX = re.compile(r'\W+')
>>> listOfTokens = regEX.split(mySent)
>>> listOfTokens
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>>

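Remove empty strings (the split above left none here, because this sentence does not end with punctuation).

Convert the strings to lowercase.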

>>> [tok for tok in listOfTokens if len(tok)>0]
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> [tok.lower() for tok in listOfTokens if len(tok)>0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
>>>
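The three steps above (split, drop empty strings, lowercase) can be combined into one helper. The sketch below is written for Python 3 and uses the pattern `\W+`, which is my assumption rather than the book's choice: newer Python versions also split on empty matches, so a pattern that can match nothing behaves differently there. The name `tokenize` is hypothetical. The example also shows when the empty-string filter matters: a sentence that ends with punctuation leaves a trailing empty string after the split.

```python
import re

def tokenize(text):
    """Split on runs of non-word characters, drop empty strings, lowercase."""
    raw_tokens = re.split(r'\W+', text)
    return [tok.lower() for tok in raw_tokens if len(tok) > 0]

# Same sentence as above, but ending with a period.
sentence = 'This book is the best book on python or M.L. I have ever laid eyes upon.'
raw = re.split(r'\W+', sentence)
tokens = tokenize(sentence)

print(raw[-1] == '')  # the final period leaves a trailing empty string
print(tokens)         # the filter removes it
```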

4.6.2 Test the algorithm: cross-validation with naive Bayes

File parsing and the complete spam test function

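The spam and ham folders each contain 25 emails. Ten of the 50 emails are chosen at random as the test set and the rest form the training set. This approach is called hold-out cross-validation.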
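The random split can be tried on its own, without any emails. The following is a minimal sketch, assuming Python 3; the function name `holdout_split` and the `rng` parameter are hypothetical, added only so the split can be seeded and repeated. It works on indices, the same way spamTest() does.

```python
import random

def holdout_split(n_samples, n_test, rng=random):
    """Return (train_indices, test_indices) with n_test indices drawn at random."""
    train_indices = list(range(n_samples))
    test_indices = []
    for _ in range(n_test):
        # Pick a position in the shrinking training list and move it to the test set.
        pos = rng.randrange(len(train_indices))
        test_indices.append(train_indices.pop(pos))
    return train_indices, test_indices

# 50 emails, 10 held out for testing; seeded so the split is repeatable.
train_idx, test_idx = holdout_split(50, 10, rng=random.Random(0))
print(len(train_idx), len(test_idx))
```

Because each chosen index is removed from the training list, no email can end up in both sets.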

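Because the test set is chosen at random, the output differs from run to run. Repeat the experiment several times and average the error rates.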
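Averaging over repeated runs could look like the sketch below. It assumes a trial function that returns its error rate; spamTest() as listed in this post only prints the rate, so it would need a return statement first. `average_error_rate` and `fake_trial` are hypothetical names, and `fake_trial` produces made-up numbers purely to exercise the averaging.

```python
import random

def average_error_rate(run_trial, n_runs=10):
    """Call run_trial() n_runs times and return the mean of the returned error rates."""
    rates = [run_trial() for _ in range(n_runs)]
    return sum(rates) / len(rates)

# Hypothetical stand-in for a spamTest() that returns its error rate.
rng = random.Random(42)

def fake_trial():
    # Pretend that 0, 1, or 2 of the 10 test emails were misclassified.
    return rng.choice([0, 1, 2]) / 10.0

mean_rate = average_error_rate(fake_trial, n_runs=100)
print('mean error rate over 100 runs:', mean_rate)
```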

import random
import re
from numpy import array

# createVocabList, bagOfWords2VecMN, trainNB0 and classifyNB come from earlier in bayes.py

def textParse(bigString):  # parse one big string into a list of tokens
    listOfTokens = re.split(r'\W+', bigString)
    # drop tokens of 2 characters or fewer and lowercase the rest
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)   # list of lists: [[...], [...], ...]
        fullText.extend(wordList)  # one flat list: [...]
        classList.append(1)        # class label: spam
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)        # class label: ham
    vocabList = createVocabList(docList)         # build the vocabulary
    trainingSet = list(range(50)); testSet = []  # indices of all 50 documents
    for i in range(10):                          # move 10 random indices to the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))  # word vector
        trainClasses.append(classList[docIndex])                         # matching class label
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))     # train: returns 3 probabilities
    errorCount = 0
    for docIndex in testSet:  # classify the held-out emails
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])      # word vector
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1   # count the misclassification
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))
    #return vocabList,fullText

>>> bayes.spamTest()
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is:  0.1
>>> bayes.spamTest()
the error rate is:  0.0
>>> bayes.spamTest()
classification error ['experience', 'with', 'biggerpenis', 'today', 'grow', 'inches', 'more', 'the', 'safest', 'most', 'effective', 'methods', 'of_penisen1argement', 'save', 'your', 'time', 'and', 'money', 'bettererections', 'with', 'effective', 'ma1eenhancement', 'products', 'ma1eenhancement', 'supplement', 'trusted', 'millions', 'buy', 'today']
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is:  0.2
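spamTest() relies on helpers from earlier in the chapter that are not repeated in this post. As a reminder of what two of them are assumed to do, here is a minimal sketch of the vocabulary builder and the bag-of-words vectorizer. It follows my understanding of the book's versions and may differ from them in detail; trainNB0 and classifyNB are left out. The two toy documents are made up for illustration.

```python
def createVocabList(dataSet):
    """Collect every distinct word that appears in any document."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with this document's words
    return list(vocabSet)

def bagOfWords2VecMN(vocabList, inputSet):
    """Count how many times each vocabulary word occurs in the document."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

# Two toy documents (made up for illustration).
docs = [['jar', 'jar', 'plane'], ['buy', 'today']]
vocab = createVocabList(docs)
vec = bagOfWords2VecMN(vocab, docs[0])

print(sorted(vocab))
print(vec[vocab.index('jar')])  # 'jar' occurs twice in the first document
```

Unlike a set-of-words vector, which only records presence, the bag-of-words vector keeps the count, which is why 'jar' contributes 2 here.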

4.7 Revealing regional attitudes from personal ads
