
Association Rule Mining Algorithms: FP-Growth

2017-01-13 09:46

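The Apriori algorithm suffers from two significant costs:

It may still generate a huge number of candidate itemsets. For example, with 10^4 frequent 1-itemsets, Apriori has to generate on the order of 10^7 candidate 2-itemsets (about 5 × 10^7).

It may need to scan the database repeatedly and check a large set of candidates by pattern matching. Going over every transaction in the database to determine the support of the candidate itemsets is expensive.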
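To make the two costs concrete, the sketch below first counts the candidate 2-itemsets that 10^4 frequent 1-itemsets would produce, then shows a naive support count that tests every candidate against every transaction. The helper name `count_support` and the toy transactions are illustrative assumptions, not part of the original article or of any particular Apriori implementation.

```python
from itertools import combinations
from math import comb

# Cost 1: number of candidate 2-itemsets built from 10**4 frequent 1-itemsets.
num_candidates = comb(10**4, 2)
print(num_candidates)  # 49995000, i.e. on the order of 10**7

# Cost 2: naive support counting checks every candidate against every transaction.
def count_support(transactions, candidates):
    """Return {candidate: number of transactions that contain it}."""
    support = {c: 0 for c in candidates}
    for trans in transactions:      # one full pass over the database
        items = set(trans)
        for c in candidates:        # pattern-match each candidate itemset
            if c <= items:
                support[c] += 1
    return support

# Hypothetical toy data, for illustration only.
toy_transactions = [['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], ['a', 'b']]
toy_candidates = [frozenset(p) for p in combinations(['a', 'b', 'c'], 2)]
toy_support = count_support(toy_transactions, toy_candidates)
print(toy_support)
```

The inner loop runs once per (transaction, candidate) pair, and Apriori repeats such a pass for every itemset size, which is why both costs grow quickly on large databases.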

FP-Growth (Frequent-Pattern Growth)

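FP-Growth removes both of the Apriori costs described above: it generates no candidate itemsets, and it scans the database only twice to build its tree.

FP-Growth uses the following divide-and-conquer strategy. First, it compresses the database representing the frequent items into a frequent-pattern tree (FP-tree), which still retains the itemset association information. It then divides this compressed database into a set of conditional databases, each associated with one frequent pattern, and mines each conditional database separately.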
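The compression step relies on a preprocessing pass that may be easier to see in isolation: count each item, drop the infrequent ones, and rewrite every transaction in one global order (descending count) so that transactions with common items share a prefix. The sketch below is a minimal illustration of that pass only; the function name `order_transactions`, the tie-breaking rule, and the toy data are assumptions made for this example.

```python
from collections import Counter

def order_transactions(transactions, min_support):
    """Drop infrequent items and sort the rest by descending global count."""
    # First scan: count in how many transactions each item appears.
    counts = Counter(item for trans in transactions for item in set(trans))
    frequent = {item: n for item, n in counts.items() if n >= min_support}
    ordered = []
    # Second scan: filter and reorder every transaction.
    for trans in transactions:
        kept = [item for item in set(trans) if item in frequent]
        # Ties are broken by item name so every transaction uses the same order.
        kept.sort(key=lambda item: (-frequent[item], item))
        if kept:
            ordered.append(kept)
    return ordered

# Hypothetical toy data, for illustration only.
toy = [['b', 'a', 'x'], ['a', 'c'], ['c', 'b', 'a'], ['a', 'y']]
ordered_toy = order_transactions(toy, min_support=2)
print(ordered_toy)  # [['a', 'b'], ['a', 'c'], ['a', 'b', 'c'], ['a']]
```

Every reordered transaction starts with `a`, so in an FP-tree all four would share a single child of the root, which is where the compression comes from.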

The FP-Growth Algorithm

The tree is built as follows:

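class treeNode:
    """A node of the FP-tree."""

    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue        # item stored in this node
        self.count = numOccur        # number of transactions sharing this path
        self.nodeLink = None         # next node that holds the same item
        self.parent = parentNode     # parent node, used when walking up the tree
        self.children = {}           # item -> child treeNode

    def inc(self, numOccur):
        """Increase this node's count by numOccur."""
        self.count += numOccur

    def display(self, ind=1):
        """Print the subtree rooted at this node, indented by depth."""
        print('  '*ind, self.name, '  ', self.count)
        for child in self.children.values():
            child.display(ind+1)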

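def createInitSet(dataSet):
    """Convert a list of transactions into {frozenset(transaction): count}."""
    retDict = {}
    for trans in dataSet:
        key = frozenset(trans)
        # Accumulate so that duplicate transactions are counted, not overwritten
        retDict[key] = retDict.get(key, 0) + 1
    return retDict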

def createTree(dataSet, minSupport=1):
    ''' Build the FP-tree.

    :param dataSet: dict mapping frozenset(transaction) -> number of occurrences
    :param minSupport: minimum support, i.e. the minimum number of transactions
                       an item must appear in (an absolute count, not a ratio)
    :return: the FP-tree root and the header table, or (None, None) if no item is frequent
    '''
    # ---- Build the header table and the frequent 1-itemsets ----
    headerTable = {}
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    for k in list(headerTable.keys()):
        if headerTable[k] < minSupport:    # drop items below the minimum support
            del headerTable[k]
    freqItemSet = set(headerTable.keys())   # frequent 1-itemsets
    if len(freqItemSet) == 0:
        return None, None
    for k in headerTable:
        headerTable[k] = [headerTable[k], None]   # [count, first node in the link chain]

    root = treeNode('Null', 1, None)

    # For each transaction, keep only the frequent items, sort them by
    # descending count, then insert the ordered list into the FP-tree.
    for tranSet, count in dataSet.items():
        localD = {}
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            # Break ties by item (assumes items are mutually comparable) so that
            # equal-count items get the same order in every transaction
            orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: (-p[1], p[0]))]
            updateTree(orderedItems, root, headerTable, count)
    print('headerTable: ', headerTable)
    return root, headerTable

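def updateTree(items, inTree, headerTable, count):
    ''' Recursively update the FP-tree and the header table.

    :param items: frequent items of one transaction, sorted by descending count
    :param inTree: node under which the items are inserted
    :param headerTable: header table, item -> [count, first node in the link chain]
    :param count: number of occurrences of this transaction
    :return: None
    '''
    if items[0] in inTree.children:
        inTree.children[items[0]].inc(count)
    else:
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        # A new node was created, so link it into the header table
        if headerTable[items[0]][1] is None:
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
    if len(items) > 1:
        updateTree(items[1:], inTree.children[items[0]], headerTable, count)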

def updateHeader(nodeToTest, targetNode):
    """Append targetNode to the end of the node-link chain that starts at nodeToTest."""
    while nodeToTest.nodeLink is not None:
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode
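
To see why inserting frequency-ordered transactions compresses the data, the following self-contained sketch builds a stripped-down prefix tree out of nested dicts. It is not the `treeNode`/`createTree` code above: it has no header table and no node links, and the names `insert_path` and `count_nodes` as well as the toy input are hypothetical, chosen only for this illustration.

```python
def insert_path(node, items, count=1):
    """Insert an ordered item list into a nested-dict prefix tree.

    Each node has the form {'count': int, 'children': {item: node}}.
    """
    for item in items:
        child = node['children'].setdefault(item, {'count': 0, 'children': {}})
        child['count'] += count   # shared prefixes only bump a counter
        node = child

def count_nodes(node):
    """Number of item nodes below (not including) the given node."""
    return sum(1 + count_nodes(c) for c in node['children'].values())

root = {'count': 0, 'children': {}}
# Transactions assumed to be already filtered and sorted by descending frequency.
ordered = [['a', 'b'], ['a', 'c'], ['a', 'b', 'c'], ['a']]
for trans in ordered:
    insert_path(root, trans)

total_items = sum(len(t) for t in ordered)
print(total_items, count_nodes(root))  # 8 4
```

Eight item occurrences collapse into four nodes because the prefix `a` (and `a, b`) is stored once with a count. The full FP-tree adds the header table and node links so that all nodes for a given item can be reached without traversing the whole tree.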


The tree is mined as follows:

def ascendTree(leafNode, prefixPath):
    """Walk from leafNode up to the root, collecting item names (root excluded)."""
    if leafNode.parent is not None:
        prefixPath.append(leafNode.name)
        ascendTree(leafNode.parent, prefixPath)

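def findPrefixPath(basePattern, treeNode):
    """Collect the conditional pattern base of basePattern.

    treeNode is the first node in basePattern's link chain.
    Returns {frozenset(prefix path): count}.
    """
    conditionPatterns = {}
    while treeNode is not None:
        prefixPath = []
        ascendTree(treeNode, prefixPath)
        if len(prefixPath) > 1:
            # prefixPath[0] is the base item itself, so skip it
            conditionPatterns[frozenset(prefixPath[1:])] = treeNode.count
        treeNode = treeNode.nodeLink
    return conditionPatterns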

def mineTree(inTree, headerTable, minSupport, prefix, freqItemList):
    ''' Mine frequent itemsets from the FP-tree.

    :param inTree: FP-tree
    :param headerTable: header table
    :param minSupport: minimum support (an absolute count)
    :param prefix: itemset (a list) accumulated so far by the recursive calls
    :param freqItemList: list that collects the resulting frequent itemsets
    :return: None
    '''
    # Sort the header-table items by ascending count, starting from the least frequent
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1][0])]
    # Visit every item in the header table
    for basePat in bigL:
        newFreqSet = prefix.copy()
        newFreqSet.append(basePat)
        freqItemList.insert(0, newFreqSet)
        CPB = findPrefixPath(basePat, headerTable[basePat][1])      # conditional pattern base
        conditionTree, conditionHeadTab = createTree(CPB, minSupport)
        if conditionHeadTab is not None:
            print('conditional tree for: ', newFreqSet)
            conditionTree.display(1)
            mineTree(conditionTree, conditionHeadTab, minSupport, newFreqSet, freqItemList)
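
On small inputs, the itemsets collected in `freqItemList` can be cross-checked against an exhaustive enumeration. The sketch below is such a brute-force reference, written independently of the FP-tree code; the function name `brute_force_frequent` and the toy transactions are assumptions for this example, and the approach is only practical for a handful of distinct items.

```python
from itertools import combinations

def brute_force_frequent(transactions, min_support):
    """Return {frozenset: support count} for every itemset with count >= min_support."""
    tx = [frozenset(t) for t in transactions]
    items = sorted(set().union(*tx))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in tx if candidate <= t)
            if support >= min_support:
                result[candidate] = support
                found = True
        if not found:
            # No frequent itemset of this size, so no larger one can be frequent
            break
    return result

# Hypothetical toy data, for illustration only.
toy_transactions = [['a', 'b', 'c'], ['a', 'c'], ['b', 'c'], ['a', 'b']]
reference = brute_force_frequent(toy_transactions, min_support=2)
print(sorted(sorted(s) for s in reference))
```

With the functions from this article defined, a call such as `mineTree(tree, header, 2, [], found)` on the tree built from `createInitSet(toy_transactions)` should yield the same itemsets, so comparing `{frozenset(s) for s in found}` with `set(reference)` is a quick sanity check.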


That concludes this introduction to the FP-Growth algorithm.