Data Mining: A Detailed Look at the Apriori Algorithm, with a Python Implementation
What Is the Apriori Algorithm?
Apriori is an algorithm for discovering association rules in a dataset. It rests on two properties:
- Apriori property (downward closure): if an itemset is frequent, then all of its subsets are also frequent. In other words, if an itemset occurs at least as often as the minimum support threshold requires, every subset of it must occur at least that often too.
- Anti-monotonicity: if an itemset is infrequent, then all of its supersets are also infrequent. In other words, if an itemset occurs fewer times than the minimum support requires, no itemset containing it can meet the minimum support either. This is the contrapositive of the first property, and it is what lets Apriori prune candidates.
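The downward-closure property can be checked directly on a tiny transaction set. The data below is hypothetical, made up purely for illustration:

```python
from itertools import combinations

# Hypothetical transactions, for illustration only.
transactions = [
    {'A', 'B', 'C'},
    {'A', 'B'},
    {'A', 'C'},
    {'A', 'B', 'C'},
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

min_support = 0.5
frequent = {'A', 'B', 'C'}  # appears in 2 of 4 transactions -> support 0.5
assert support(frequent) >= min_support

# Downward closure: every non-empty proper subset of a frequent itemset
# is itself frequent.
for size in range(1, len(frequent)):
    for subset in combinations(frequent, size):
        assert support(set(subset)) >= min_support
print('every subset is frequent too')
```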
The core idea of Apriori is this: starting from single-item itemsets, repeatedly build longer candidate itemsets from the current frequent itemsets until no more can be found. Each round of the search has two phases:
- Candidate generation: combine the current frequent itemsets pairwise to produce new, longer candidate itemsets.
- Support counting: scan the transaction database, count the occurrences of each candidate itemset, discard the candidates that fall below the minimum support, and keep the rest as the new frequent itemsets.
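One round of these two phases, on hypothetical frequent 1-itemsets L1 and a toy database (item names invented for illustration), might look like:

```python
from itertools import combinations

# Phase 1 -- candidate generation: join pairs of frequent 1-itemsets
# whose union has exactly 2 items.
L1 = [frozenset({'Bread'}), frozenset({'Milk'}), frozenset({'Diapers'})]
C2 = {a | b for a, b in combinations(L1, 2) if len(a | b) == 2}

# Phase 2 -- support counting: keep candidates meeting minimum support.
transactions = [
    frozenset({'Bread', 'Milk'}),
    frozenset({'Bread', 'Diapers'}),
    frozenset({'Milk', 'Diapers'}),
    frozenset({'Bread', 'Milk', 'Diapers'}),
]
min_support = 0.5
L2 = {c for c in C2
      if sum(c <= t for t in transactions) / len(transactions) >= min_support}

print(sorted(tuple(sorted(c)) for c in L2))
# [('Bread', 'Diapers'), ('Bread', 'Milk'), ('Diapers', 'Milk')]
```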
The Flow of the Apriori Algorithm
- Initialization: take every individual item as a candidate 1-itemset, forming C1.
- Iteratively generate frequent itemsets:
  - Count the support of the current candidate itemsets and discard those that do not meet the minimum support, yielding the current frequent itemsets Lk.
  - Generate the candidate itemsets Ck+1 from the current frequent itemsets Lk.
  - If Ck+1 is empty, stop iterating; otherwise return to step 2.1.
- Output the collection of all frequent itemsets.
A Python Implementation of Apriori
Below is a Python implementation of the Apriori algorithm. It supports custom minimum support and minimum confidence thresholds, and can limit the output to the top N rules by confidence.
import itertools

def load_data(data_file):
    # Each line is one transaction: comma-separated item names.
    data = []
    with open(data_file) as f:
        for line in f:
            data.append(list(map(str.strip, line.split(','))))
    return data

def get_items(data):
    # All distinct items, each wrapped as a 1-item frozenset.
    items = set()
    for transaction in data:
        for item in transaction:
            items.add(frozenset([item]))
    return items

def support_count(data, itemset, min_support):
    # Return (support, is_frequent) for the given itemset.
    count = 0
    for transaction in data:
        if itemset.issubset(transaction):
            count += 1
    support = count / len(data)
    return support, count >= min_support * len(data)

def generate_candidates(items, k):
    # Join pairs of frequent (k-1)-itemsets whose union has exactly k items.
    candidates = set()
    for c1, c2 in itertools.combinations(items, 2):
        union = c1.union(c2)
        if len(union) == k:
            candidates.add(union)
    return candidates

def apriori(data, min_support=0.5, min_confidence=0.5, top_n=None):
    data = list(map(frozenset, data))
    freq_items = []
    # Pass 1: count the single items, so frequent 1-itemsets are kept too.
    frequent_k = []
    for item in get_items(data):
        support, is_freq = support_count(data, item, min_support)
        if is_freq:
            freq_items.append((item, support))
            frequent_k.append(item)
    k = 2
    # Subsequent passes: grow candidates only from the previous frequent itemsets.
    while frequent_k:
        candidates = generate_candidates(frequent_k, k)
        frequent_k = []
        for candidate in candidates:
            support, is_freq = support_count(data, candidate, min_support)
            if is_freq:
                freq_items.append((candidate, support))
                frequent_k.append(candidate)
        k += 1
    # Derive association rules from every frequent itemset of size >= 2.
    rules = []
    for itemset, support in freq_items:
        if len(itemset) > 1:
            for antecedent_size in range(1, len(itemset)):
                for antecedent in itertools.combinations(itemset, antecedent_size):
                    antecedent = frozenset(antecedent)
                    consequent = itemset - antecedent
                    conf = support / support_count(data, antecedent, 0)[0]
                    if conf >= min_confidence:
                        rules.append((antecedent, consequent, conf, support))
    # Keep only the top-N rules by confidence, if requested.
    if top_n is not None:
        rules = sorted(rules, key=lambda x: -x[2])[:top_n]
    return freq_items, rules
Example 1: Market-Basket Data
Suppose we have a set of market-basket transactions in the following format:
Bread,Milk
Bread,Diapers,Beer,Eggs
Milk,Diapers,Beer,Coke
Bread,Milk,Diapers,Beer
Bread,Milk,Diapers,Coke
Milk,Diapers,Beer
Bread,Milk,Diapers,Beer,Eggs,Coke
To find the frequent itemsets and association rules with a minimum support of 0.4 and a minimum confidence of 0.8, keeping the top 3 rules, we can run:
data = load_data('shopping_basket.csv')
freq_items, rules = apriori(data, min_support=0.4, min_confidence=0.8, top_n=3)
print('Frequent itemsets:')
for itemset, support in freq_items:
    print(itemset, support)
print('Association rules:')
for antecedent, consequent, conf, support in rules:
    print(antecedent, '->', consequent, conf, support)
The output will look similar to the following (an excerpt; itemset and rule order may vary, since sets are unordered):
Frequent itemsets:
frozenset({'Milk'}) 0.8571428571428571
frozenset({'Bread'}) 0.7142857142857143
frozenset({'Beer'}) 0.7142857142857143
frozenset({'Milk', 'Diapers'}) 0.7142857142857143
frozenset({'Beer', 'Diapers'}) 0.7142857142857143
frozenset({'Milk', 'Beer'}) 0.5714285714285714
...
Association rules:
frozenset({'Beer'}) -> frozenset({'Diapers'}) 1.0 0.7142857142857143
frozenset({'Coke'}) -> frozenset({'Milk'}) 1.0 0.42857142857142855
frozenset({'Coke'}) -> frozenset({'Diapers'}) 1.0 0.42857142857142855
This says, for example, that Beer and Diapers appear together in 71.43% of the baskets, and every basket containing Beer also contains Diapers (confidence 1.0); likewise, every basket containing Coke (42.86% of baskets) also contains Milk and Diapers.
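These numbers can be re-derived by hand from the seven baskets listed above. As a quick sketch, computing support and confidence for the rule {Beer} -> {Diapers}:

```python
# The seven baskets from the example above.
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Coke'},
    {'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Beer', 'Eggs', 'Coke'},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Support of the whole itemset, and confidence of Beer -> Diapers.
sup_rule = support({'Beer', 'Diapers'})  # 5 of 7 baskets
conf = sup_rule / support({'Beer'})      # every Beer basket also has Diapers
print(round(sup_rule, 4), conf)  # 0.7143 1.0
```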
Example 2: Website Visit Data
Suppose we have a set of page-visit sessions in the following format:
/Users/Login.html,/Products/Category1.html,/Products/Category2.html,/Products/Category3.html
/Users/Login.html,/About.html,/Products/Category2.html
/Users/Login.html,/Products/Category1.html,/Products/Category3.html,/Products/Category2.html,/Products/Category4.html
/Users/Login.html,/Products/Category2.html,/Products/Category3.html
/Users/Login.html,/Contact.html
/Users/Login.html,/Products/Category4.html
/Users/Login.html,/Products/Category1.html,/Products/Category2.html,/Products/Category3.html,/Products/Category4.html
To find the frequent itemsets and association rules with a minimum support of 0.2 and a minimum confidence of 0.8, keeping the top 5 rules, we can run:
data = load_data('page_visit.csv')
freq_items, rules = apriori(data, min_support=0.2, min_confidence=0.8, top_n=5)
print('Frequent itemsets:')
for itemset, support in freq_items:
    print(itemset, support)
print('Association rules:')
for antecedent, consequent, conf, support in rules:
    print(antecedent, '->', consequent, conf, support)
The output will look similar to the following (an excerpt; itemset and rule order may vary):
Frequent itemsets:
frozenset({'/Users/Login.html'}) 1.0
frozenset({'/Products/Category2.html'}) 0.7142857142857143
frozenset({'/Products/Category3.html'}) 0.5714285714285714
frozenset({'/Products/Category4.html'}) 0.42857142857142855
frozenset({'/Products/Category2.html', '/Products/Category3.html'}) 0.5714285714285714
frozenset({'/Users/Login.html', '/Products/Category2.html'}) 0.7142857142857143
frozenset({'/Products/Category2.html', '/Products/Category4.html'}) 0.2857142857142857
frozenset({'/Products/Category2.html', '/Products/Category3.html', '/Products/Category4.html'}) 0.2857142857142857
...
Association rules:
frozenset({'/Products/Category3.html'}) -> frozenset({'/Products/Category2.html'}) 1.0 0.5714285714285714
frozenset({'/Products/Category1.html'}) -> frozenset({'/Products/Category2.html'}) 1.0 0.42857142857142855
frozenset({'/Products/Category2.html'}) -> frozenset({'/Users/Login.html'}) 1.0 0.7142857142857143
frozenset({'/Products/Category3.html'}) -> frozenset({'/Users/Login.html'}) 1.0 0.5714285714285714
frozenset({'/Products/Category4.html'}) -> frozenset({'/Users/Login.html'}) 1.0 0.42857142857142855
This says, for example, that /Products/Category3.html is visited in 57.14% of the sessions, and every session that visits /Products/Category3.html also visits /Products/Category2.html (confidence 1.0); likewise, every session that visits /Products/Category1.html goes on to /Products/Category2.html.
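One caveat worth checking in code: /Users/Login.html appears in every session, so any rule predicting it has confidence 1.0 without indicating a real association. A common sanity check is the lift measure (not part of the implementation above); a sketch on the sessions from this example:

```python
# Sessions from the example above, as sets of visited pages.
transactions = [
    {'/Users/Login.html', '/Products/Category1.html', '/Products/Category2.html', '/Products/Category3.html'},
    {'/Users/Login.html', '/About.html', '/Products/Category2.html'},
    {'/Users/Login.html', '/Products/Category1.html', '/Products/Category3.html', '/Products/Category2.html', '/Products/Category4.html'},
    {'/Users/Login.html', '/Products/Category2.html', '/Products/Category3.html'},
    {'/Users/Login.html', '/Contact.html'},
    {'/Users/Login.html', '/Products/Category4.html'},
    {'/Users/Login.html', '/Products/Category1.html', '/Products/Category2.html', '/Products/Category3.html', '/Products/Category4.html'},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {'/Products/Category2.html'}
consequent = {'/Users/Login.html'}

conf = support(antecedent | consequent) / support(antecedent)
lift = conf / support(consequent)  # lift of 1 means "no real association"
print(conf, lift)  # 1.0 1.0 -- perfect confidence, yet zero extra information
```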
Unless otherwise noted, articles on this site are original. If you repost, please credit the source: Data Mining: A Detailed Look at the Apriori Algorithm, with a Python Implementation - Python技术站