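HNUST Data Mining Course Project: Lab 2, Design and Application of the Close Algorithm
I. Lab Content
1. Lab Requirements
2. Lab Principle
An itemset is closed when no proper superset of it has the same support. Every closed subset of a frequent closed itemset is itself frequent, and every closed superset of an infrequent closed itemset is itself infrequent. The question of whether an itemset is frequent can therefore be studied on the lattice of closed itemsets. Experiments show that, for certain kinds of data, this reduces the number of database scans. Close is a frequent itemset mining algorithm whose main goal is to find the frequent closed itemsets in a dataset. Working from the frequent closed itemsets avoids generating many unnecessary candidate itemsets, which lowers the cost of the later association rule mining and saves computing resources and time.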
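To make the support relationships above concrete, here is a minimal sketch of support counting. The five-transaction database and the class name `SupportDemo` are assumptions made for illustration; they are not the lab's dataset.txt or code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SupportDemo {
    // Hypothetical toy database, not the dataset.txt used in this lab.
    static List<Set<String>> sample() {
        return Arrays.asList(
                set("A", "C", "D"),
                set("B", "C", "E"),
                set("A", "B", "C", "E"),
                set("B", "E"),
                set("A", "B", "C", "E"));
    }

    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Support count: number of transactions that contain every item of the itemset.
    static int support(Set<String> itemset, List<Set<String>> transactions) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> db = sample();
        System.out.println(support(set("A"), db));      // 3
        System.out.println(support(set("A", "C"), db)); // 3: same support, so {A} is not closed
        System.out.println(support(set("A", "B"), db)); // 2: a superset never has higher support
    }
}
```

In this toy data {A} and {A, C} occur in exactly the same transactions, which is why only the closed itemset {A, C} needs to be stored.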
3. Program Flowchart
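II. Data Structures and Key Code
1. Data Structures
List<List<String>> transactions // one list of items per transaction record
Map<Set<String>, Integer> FrequentItemsets // all frequent itemsets with their support counts
Map<Set<String>, Integer> frequentItemsets // frequent generators found so far, with their support counts
Map<Set<String>, Set<String>> Closures // each generator and its closure
Map<Set<String>,Integer> closures // closed itemsets and their support counts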
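All of these maps use a `Set<String>` as the key. A small sketch of why that works: `HashSet` compares by content, so two sets holding the same items find the same entry regardless of insertion order. The class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ItemsetKeyDemo {
    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Count each single item, keyed by a one-element set as in the listing's first pass.
    static Map<Set<String>, Integer> countSingles(List<List<String>> transactions) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (List<String> transaction : transactions) {
            for (String item : transaction) {
                counts.merge(set(item), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> db = Arrays.asList(
                Arrays.asList("A", "C"),
                Arrays.asList("B", "C"));
        System.out.println(countSingles(db).get(set("C"))); // 2

        Map<Set<String>, Integer> pairs = new HashMap<>();
        pairs.put(set("A", "B"), 5);
        System.out.println(pairs.get(set("B", "A")));       // 5: order of items does not matter
    }
}
```

One caveat with this design: a set should not be modified after it has been used as a key, because its hash code would change and the entry could no longer be found.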
2. Key Code
// Compute the closure of a generator
public static Set<String> calculateClosure(Set<String> item, List<List<String>> transactions) {
    Set<String> currentClosure = null;
    // Visit every transaction
    for (List<String> transaction : transactions) {
        // The closure is the intersection of all transactions that contain the itemset
        if (transaction.containsAll(item)) {
            if (currentClosure == null) {
                currentClosure = new HashSet<>(transaction);
            } else {
                currentClosure.retainAll(transaction);
            }
        }
    }
    // An itemset contained in no transaction gets an empty closure
    return currentClosure != null ? currentClosure : new HashSet<>();
}
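As a usage illustration, the sketch below wraps the same closure computation in a small class and runs it on a toy database. The database, the class name `ClosureDemo` and the helper `set` are assumptions for the example; `TreeSet` is used only to print the result in a stable order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ClosureDemo {
    // Same logic as calculateClosure above: intersect all transactions containing the itemset.
    static Set<String> calculateClosure(Set<String> item, List<List<String>> transactions) {
        Set<String> currentClosure = null;
        for (List<String> transaction : transactions) {
            if (transaction.containsAll(item)) {
                if (currentClosure == null) {
                    currentClosure = new HashSet<>(transaction);
                } else {
                    currentClosure.retainAll(transaction);
                }
            }
        }
        return currentClosure != null ? currentClosure : new HashSet<>();
    }

    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Hypothetical toy database, not the lab's dataset.txt.
    static List<List<String>> sample() {
        return Arrays.asList(
                Arrays.asList("A", "C", "D"),
                Arrays.asList("B", "C", "E"),
                Arrays.asList("A", "B", "C", "E"),
                Arrays.asList("B", "E"),
                Arrays.asList("A", "B", "C", "E"));
    }

    public static void main(String[] args) {
        List<List<String>> db = sample();
        System.out.println(new TreeSet<>(calculateClosure(set("A"), db)));           // [A, C]
        System.out.println(new TreeSet<>(calculateClosure(set("B"), db)));           // [B, E]
        System.out.println(new TreeSet<>(calculateClosure(set("A", "D", "E"), db))); // []
    }
}
```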
// Derive the frequent itemsets from the frequent closed itemsets
while (maxLength > 1) {
    Iterator<Set<String>> iterator = closures.keySet().iterator();
    while (iterator.hasNext()) {
        Set<String> key = iterator.next();
        if (key.size() == maxLength) {
            List<Set<String>> subSetsk = generateKsubsets(key);
            System.out.println(key + "======>SUB " + subSetsk);
            for (Set<String> sub : subSetsk) {
                if (!closures.containsKey(sub)) {
                    int support = closures.get(key);
                    System.out.println(key + " newAdd => " + sub + "=>" + support);
                    endFrequentItemsets.add(sub);
                    // Several supersets can yield the same subset; keep the largest support
                    FrequentItemsets.merge(sub, support, Integer::max);
                    medium.merge(sub, support, Integer::max);
                    System.out.println("medium=>" + medium);
                }
            }
        }
    }
    // Make the new subsets visible to the next, shorter level
    for (Set<String> newClosure : medium.keySet()) {
        closures.put(newClosure, medium.get(newClosure));
    }
    maxLength--;
}
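The loop above works level by level. As a cross-check, the support of any frequent itemset can also be read directly from the closed itemsets: it equals the largest support among the closed itemsets that contain it. The sketch below shows that alternative lookup; it is not the lab's code, and the closed itemsets and their supports are assumed toy values.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SupportFromClosedDemo {
    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Hypothetical frequent closed itemsets of a toy database, with support counts.
    static Map<Set<String>, Integer> sampleClosed() {
        Map<Set<String>, Integer> closed = new HashMap<>();
        closed.put(set("C"), 4);
        closed.put(set("A", "C"), 3);
        closed.put(set("B", "E"), 4);
        closed.put(set("B", "C", "E"), 3);
        closed.put(set("A", "B", "C", "E"), 2);
        return closed;
    }

    // Support of x = maximum support over the closed itemsets containing x (0 if none does).
    static int supportFromClosed(Set<String> x, Map<Set<String>, Integer> closed) {
        int best = 0;
        for (Map.Entry<Set<String>, Integer> entry : closed.entrySet()) {
            if (entry.getKey().containsAll(x)) {
                best = Math.max(best, entry.getValue());
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<Set<String>, Integer> closed = sampleClosed();
        System.out.println(supportFromClosed(set("B"), closed));      // 4
        System.out.println(supportFromClosed(set("C", "E"), closed)); // 3
        System.out.println(supportFromClosed(set("A", "B"), closed)); // 2
    }
}
```

A result of 0 means the itemset is contained in no frequent closed itemset, so it is not frequent.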
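3. Complete Code
The data is read from dataset.txt through file I/O. The part I found hardest at the time was pruning. I wrote this code a long time ago, haha.
(1) Find the candidate 1-itemsets.
(2) Scan the database to obtain the candidate closed itemsets.
(3) Prune: delete every candidate closed itemset whose support is below the minimum support.
(4) This gives the frequent closed itemsets. Join them with themselves to obtain the candidate i-itemsets, and continue in this way until, for some value r, the set of candidate frequent closed r-itemsets is empty. The algorithm then stops.
(5) Derive the frequent itemsets from the frequent closed itemsets. First count the items in every closed itemset in FC and group together the itemsets that have the same number of items, recording the largest size as k. Then, starting from i = k, break every itemset in the group of i-itemsets down into all of its (i-1)-item subsets. Each subset that is not yet in the group of (i-1)-itemsets is added to it. Repeating this down to i = 2 yields all frequent itemsets.
(6) Mine the association rules.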
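Since pruning is named above as the hardest part, here is a minimal sketch of the closure-based filter that the listing below applies to each candidate: a candidate is dropped when the closure of one of its (k-1)-item subsets already contains it, because its closure and support are then already known. The class name, method name and toy closures are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PruneDemo {
    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Hypothetical generator -> closure map for the 1-item generators of a toy database.
    static Map<Set<String>, Set<String>> sampleClosures() {
        Map<Set<String>, Set<String>> closures = new HashMap<>();
        closures.put(set("A"), set("A", "C"));
        closures.put(set("B"), set("B", "E"));
        closures.put(set("C"), set("C"));
        closures.put(set("E"), set("B", "E"));
        return closures;
    }

    // True when some (k-1)-subset of the candidate has a closure that contains the candidate.
    static boolean coveredByClosure(Set<String> candidate, Map<Set<String>, Set<String>> closures) {
        for (String item : candidate) {
            Set<String> subset = new HashSet<>(candidate);
            subset.remove(item); // the (k-1)-subset without this item
            Set<String> closure = closures.get(subset);
            if (closure != null && closure.containsAll(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Set<String>, Set<String>> closures = sampleClosures();
        System.out.println(coveredByClosure(set("A", "C"), closures)); // true: closure of {A} is {A, C}
        System.out.println(coveredByClosure(set("A", "B"), closures)); // false: kept as a new generator
    }
}
```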
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

/**
 * Created by 23222 on 2023/12/18.
 */
public class Close {
    public static void main(String[] args) {
        double minSupport = 0.6;
        double minConfidence = 0.6;
        // Scanner scanner = new Scanner(System.in);
        //
        // System.out.print("Enter the minSupport: ");
        // double minSupport = scanner.nextDouble();
        //
        // System.out.print("Enter the minConfidence: ");
        // double minConfidence = scanner.nextDouble();
        //
        // Read the transaction database
        String filename = "../2-Close/dataset.txt";
        List<List<String>> transactions = readTransactions(filename);
        for (List<String> transaction : transactions) {
            System.out.println(transaction);
        }
        // Generate the frequent itemsets
        Map<Set<String>, Integer> FrequentItemsets = generateFrequentItemsets(transactions, minSupport);
        System.out.println(FrequentItemsets.entrySet());
        // Generate the association rules
        generateAssociationRules(minConfidence, FrequentItemsets);
    }
    public static List<List<String>> readTransactions(String filename) {
        List<List<String>> transactions = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            boolean firstLine = true;
            while ((line = br.readLine()) != null) {
                // Skip the header line
                if (firstLine) {
                    firstLine = false;
                    continue;
                }
                // Each line is "<id> <item>、<item>、..."
                String[] parts = line.trim().split("\\s+");
                if (parts.length < 2) {
                    continue; // blank or malformed line
                }
                String[] items = parts[1].split("、");
                transactions.add(Arrays.asList(items));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return transactions;
    }
    public static Map<Set<String>, Integer> generateFrequentItemsets(List<List<String>> transactions, double minSupport) {
        Map<Set<String>, Integer> FrequentItemsets = new HashMap<>();
        Map<Set<String>, Integer> frequentItemsets = new HashMap<>();
        Map<Set<String>, Set<String>> Closures = new HashMap<>();
        Map<Set<String>, Integer> closures = new HashMap<>();
        // Collect the 1-itemsets and count their initial support
        for (List<String> transaction : transactions) {
            for (String item : transaction) {
                Set<String> candidate = new HashSet<>();
                candidate.add(item);
                frequentItemsets.put(candidate, frequentItemsets.getOrDefault(candidate, 0) + 1);
            }
        }
        System.out.println("-------------------FCC1--------------------------------");
        for (Map.Entry<Set<String>, Integer> entry : frequentItemsets.entrySet()) {
            Set<String> itemset = entry.getKey();
            Set<String> closure = calculateClosure(itemset, transactions);
            Closures.put(itemset, closure);
            int support = entry.getValue();
            System.out.println(itemset + " => " + support);
            System.out.println(closure);
            System.out.println();
        }
        System.out.println("--------------------------------------------------------");
        // Keep only the 1-itemsets whose support is at least the minimum support
        frequentItemsets.values().removeIf(count -> (double) count / transactions.size() < minSupport);
        // Keep only the closures of the generators that are still frequent
        Set<Set<String>> frequentKeys = frequentItemsets.keySet();
        Closures.keySet().retainAll(frequentKeys);
        // Print the remaining generators and their closures
        for (Map.Entry<Set<String>, Set<String>> entry : Closures.entrySet()) {
            System.out.println("Key: " + entry.getKey());
            System.out.println("Value: " + entry.getValue());
            System.out.println("------------");
        }
        Map<Set<String>, Integer> ffrequentItemsets = new HashMap<>(frequentItemsets);
        System.out.println("-------------------FC1--------------------------------");
        for (Map.Entry<Set<String>, Integer> entry : frequentItemsets.entrySet()) {
            Set<String> itemset = entry.getKey();
            // Record the closure of each frequent 1-generator with the generator's support
            closures.put(Closures.get(itemset), entry.getValue());
            int support = entry.getValue();
            System.out.println(itemset + " => " + support);
        }
        System.out.println("--------------------------------------------------------");
        int k = 2;
        while (!ffrequentItemsets.isEmpty()) {
            Map<Set<String>, Integer> candidateItemsets = generateFCC(ffrequentItemsets.keySet(), k);
            ffrequentItemsets.clear();
            System.out.println("------------------" + k + "-connection---------------------------");
            if (!candidateItemsets.isEmpty()) {
                // Count the support of each candidate k-itemset
                for (List<String> transaction : transactions) {
                    for (Set<String> itemset : candidateItemsets.keySet()) {
                        if (transaction.containsAll(itemset)) {
                            candidateItemsets.put(itemset, candidateItemsets.get(itemset) + 1);
                        }
                    }
                }
                for (Map.Entry<Set<String>, Integer> entry : candidateItemsets.entrySet()) {
                    Set<String> itemset = entry.getKey();
                    int support = entry.getValue();
                    System.out.println(itemset + " => " + support);
                }
            } else {
                System.out.println("[empty]");
            }
            System.out.println("-------------------------------------------------------");
            System.out.println("-----------------------------filter-----------------------");
            // Drop a candidate when the closure of one of its (k-1)-subsets already contains it
            Iterator<Map.Entry<Set<String>, Integer>> iterator = candidateItemsets.entrySet().iterator();
            while (iterator.hasNext()) {
                Map.Entry<Set<String>, Integer> entry = iterator.next();
                Set<String> itemset = entry.getKey();
                int flag = 0;
                List<Set<String>> Sp = generateKsubsets(itemset);
                System.out.println("itemset=> " + itemset);
                for (Set<String> sp : Sp) {
                    Set<String> closure = Closures.get(sp);
                    System.out.println("closure of " + sp + ": " + closure);
                    if (closure != null && closure.containsAll(itemset)) {
                        iterator.remove(); // safe removal through the iterator
                        flag = 1;
                        break;
                    }
                }
                if (flag == 1) System.out.println("----------------------------------->" + itemset + "=>DELETE!");
            }
            for (Map.Entry<Set<String>, Integer> entry : candidateItemsets.entrySet()) {
                Set<String> itemset = entry.getKey();
                int support = entry.getValue();
                System.out.println(itemset + " => " + support);
            }
            System.out.println("Compute the closure and support of each generator");
            for (Map.Entry<Set<String>, Integer> entry : candidateItemsets.entrySet()) {
                Set<String> itemset = entry.getKey();
                Set<String> closure = calculateClosure(itemset, transactions);
                Closures.put(itemset, closure);
                int support = entry.getValue();
                System.out.println(itemset + " => " + support + " " + closure);
            }
            // Remove candidates that occur in no transaction
            candidateItemsets.values().removeIf(count -> count == 0);
            System.out.println("Prune");
            // Remove candidates whose support is below the minimum support
            candidateItemsets.values().removeIf(count -> (double) count / transactions.size() < minSupport);
            Closures.clear();
            for (Map.Entry<Set<String>, Integer> entry : candidateItemsets.entrySet()) {
                Set<String> itemset = entry.getKey();
                Set<String> closure = calculateClosure(itemset, transactions);
                Closures.put(itemset, closure);
                // Record the closure, not the generator, as is done for the 1-itemsets;
                // a closure has the same support as its generator
                closures.put(closure, entry.getValue());
                int support = entry.getValue();
                System.out.println(itemset + " => " + support + " " + closure);
            }
            frequentItemsets.putAll(candidateItemsets);
            ffrequentItemsets.putAll(candidateItemsets);
            /* for (Set<String> itemset : frequentItemsets.keySet()) {
                int support = frequentItemsets.get(itemset);
                System.out.println("frequent itemset: " + itemset + ", support: " + support);
            }*/
            System.out.println("-------------------FC " + k + "------------------------------");
            if (!candidateItemsets.isEmpty()) {
                for (Map.Entry<Set<String>, Integer> entry : ffrequentItemsets.entrySet()) {
                    Set<String> itemset = entry.getKey();
                    int support = entry.getValue();
                    System.out.println(itemset + " => " + support);
                }
            } else {
                System.out.println("[empty]");
            }
            System.out.println("--------------------------------------------------------");
            k++;
        }
        // Output the closed itemsets, sorted by size
        List<Map.Entry<Set<String>, Integer>> sortedclosures = new ArrayList<>(closures.entrySet());
        Collections.sort(sortedclosures, new Comparator<Map.Entry<Set<String>, Integer>>() {
            @Override
            public int compare(Map.Entry<Set<String>, Integer> entry1, Map.Entry<Set<String>, Integer> entry2) {
                int length = Integer.compare(entry1.getKey().size(), entry2.getKey().size());
                if (length != 0) {
                    return length; // sort by size first
                } else {
                    // break ties by hash code; Integer.compare avoids subtraction overflow
                    return Integer.compare(entry1.getKey().hashCode(), entry2.getKey().hashCode());
                }
            }
        });
        for (Map.Entry<Set<String>, Integer> entry : sortedclosures) {
            ffrequentItemsets.put(entry.getKey(), entry.getValue());
        }
        int maxLength = 0;
        System.out.println("-------------------[all closed itemsets]-----------------");
        for (Map.Entry<Set<String>, Integer> entry : sortedclosures) {
            if (entry.getKey().size() > maxLength) maxLength = entry.getKey().size();
            System.out.println(entry.getKey());
        }
        FrequentItemsets.putAll(closures);
        System.out.println("maxLength=>" + maxLength);
        System.out.println("closures=>" + closures);
        /* for (Map.Entry<Set<String>, Integer> entry : closures.entrySet()) {
            Set<String> key = entry.getKey();
            Integer value = entry.getValue();
            System.out.println("Key: " + key + ", Value: " + value);
        }*/
        System.out.println("FrequentItemsets=>" + FrequentItemsets);
        Set<Set<String>> endFrequentItemsets = new HashSet<>(closures.keySet());
        Map<Set<String>, Integer> medium = new HashMap<>();
        // Derive the frequent itemsets from the frequent closed itemsets, longest first
        while (maxLength > 1) {
            Iterator<Set<String>> iterator = closures.keySet().iterator();
            while (iterator.hasNext()) {
                Set<String> key = iterator.next();
                if (key.size() == maxLength) {
                    List<Set<String>> subSetsk = generateKsubsets(key);
                    System.out.println(key + "======>SUB " + subSetsk);
                    for (Set<String> sub : subSetsk) {
                        if (!closures.containsKey(sub)) {
                            int support = closures.get(key);
                            System.out.println(key + " newAdd => " + sub + "=>" + support);
                            endFrequentItemsets.add(sub);
                            // Several supersets can yield the same subset; keep the largest support
                            FrequentItemsets.merge(sub, support, Integer::max);
                            medium.merge(sub, support, Integer::max);
                            System.out.println("medium=>" + medium);
                        }
                    }
                }
            }
            // Make the new subsets visible to the next, shorter level
            for (Set<String> newClosure : medium.keySet()) {
                closures.put(newClosure, medium.get(newClosure));
            }
            maxLength--;
        }
        System.out.println("This is end");
        List<Set<String>> list = new ArrayList<>(endFrequentItemsets);
        Collections.sort(list, new Comparator<Set<String>>() {
            @Override
            public int compare(Set<String> entry1, Set<String> entry2) {
                int length = Integer.compare(entry1.size(), entry2.size());
                if (length != 0) {
                    return length; // sort by size first
                } else {
                    return Integer.compare(entry1.hashCode(), entry2.hashCode()); // then by hash code
                }
            }
        });
        System.out.println("END all=>" + list);
        System.out.println("END all=>" + FrequentItemsets);
        return FrequentItemsets;
    }
    // Compute the closure of an itemset
    public static Set<String> calculateClosure(Set<String> item, List<List<String>> transactions) {
        Set<String> currentClosure = null;
        // Visit every transaction
        for (List<String> transaction : transactions) {
            // The closure is the intersection of all transactions that contain the itemset
            if (transaction.containsAll(item)) {
                if (currentClosure == null) {
                    currentClosure = new HashSet<>(transaction);
                } else {
                    currentClosure.retainAll(transaction);
                }
            }
        }
        // An itemset contained in no transaction gets an empty closure
        return currentClosure != null ? currentClosure : new HashSet<>();
    }
    // Generate the candidate k-itemsets by joining the frequent (k-1)-itemsets
    public static Map<Set<String>, Integer> generateFCC(Set<Set<String>> frequentItemsets, int k) {
        Map<Set<String>, Integer> candidateItemsets = new HashMap<>();
        if (k < 2) {
            return candidateItemsets;
        }
        for (Set<String> itemset1 : frequentItemsets) {
            for (Set<String> itemset2 : frequentItemsets) {
                // Two (k-1)-itemsets share k-2 items exactly when their union has k items.
                // HashSet has no stable order, so the join tests the union size
                // instead of comparing a shared prefix.
                Set<String> connection = Connection(itemset1, itemset2);
                if (connection.size() == k && hasInfrequentSubsets(connection, frequentItemsets, k - 1)) {
                    candidateItemsets.put(connection, 0);
                }
            }
        }
        return candidateItemsets;
    }

    // Union of two itemsets
    private static Set<String> Connection(Set<String> itemset1, Set<String> itemset2) {
        Set<String> connection = new HashSet<>(itemset1);
        connection.addAll(itemset2);
        return connection;
    }
    // Despite its name, this returns true when every (k-1)-subset of the candidate is frequent
    private static boolean hasInfrequentSubsets(Set<String> connection, Set<Set<String>> frequentItemsets, int k) {
        List<String> list = new ArrayList<>(connection);
        for (int i = 0; i < list.size(); i++) {
            // Build the subset that leaves out the i-th item
            List<String> subList = new ArrayList<>(list);
            subList.remove(i);
            Set<String> subsetSet = new HashSet<>(subList);
            if (!containsSubset(frequentItemsets, subsetSet)) {
                return false;
            }
        }
        return true;
    }

    // Checks whether the (k-1)-subset is covered by some itemset in Fi; one match is enough
    private static boolean containsSubset(Set<Set<String>> frequentItemsets, Set<String> subsetSet) {
        for (Set<String> itemset : frequentItemsets) {
            if (itemset.containsAll(subsetSet)) {
                return true;
            }
        }
        return false;
    }
    // All non-empty subsets of an itemset, enumerated with a bit mask
    private static List<Set<String>> generateSubsets(Set<String> itemset) {
        List<Set<String>> subsets = new ArrayList<>();
        for (int i = 0; i < (1 << itemset.size()); i++) {
            Set<String> subset = new HashSet<>();
            int index = 0;
            for (String item : itemset) {
                if ((i & (1 << index)) > 0) {
                    subset.add(item);
                }
                index++;
            }
            if (subset.size() > 0 && subset.size() <= itemset.size()) {
                subsets.add(subset);
            }
        }
        return subsets;
    }

    // All subsets obtained by removing exactly one item
    private static List<Set<String>> generateKsubsets(Set<String> itemset) {
        List<String> list = new ArrayList<>(itemset);
        List<Set<String>> set = new ArrayList<>();
        for (int i = 0; i < list.size(); i++) {
            // Build the subset that leaves out the i-th item
            List<String> subList = new ArrayList<>(list);
            subList.remove(i);
            Set<String> subsetSet = new HashSet<>(subList);
            set.add(subsetSet);
        }
        return set;
    }
    public static Map<Set<String>, Integer> genrateMaxF(Map<Set<String>, Integer> FrequentItemsets) {
        Map<Set<String>, Integer> maxFrequentItemsets = new HashMap<>();
        for (Map.Entry<Set<String>, Integer> entry : FrequentItemsets.entrySet()) {
            Set<String> itemset = entry.getKey();
            int support = entry.getValue();
            boolean isMax = true;
            // Check whether this itemset is contained in another frequent itemset
            for (Map.Entry<Set<String>, Integer> entry2 : FrequentItemsets.entrySet()) {
                Set<String> otherItemset = entry2.getKey();
                if (otherItemset.equals(itemset)) {
                    continue;
                }
                if (otherItemset.containsAll(itemset)) {
                    isMax = false;
                    break;
                }
            }
            // An itemset contained in no other frequent itemset is maximal
            if (isMax) {
                maxFrequentItemsets.put(itemset, support);
            }
        }
        System.out.println("-------------------[maximal frequent itemsets, contained in no other frequent itemset]-----------------");
        for (Map.Entry<Set<String>, Integer> entry : maxFrequentItemsets.entrySet()) {
            System.out.println(entry.getKey());
        }
        System.out.println("-----------------------------------------------------------------------------");
        return maxFrequentItemsets;
    }
    public static void generateAssociationRules(double minConfidence, Map<Set<String>, Integer> FrequentItemsets) {
        Map<Set<String>, Set<String>> rules = new HashMap<>();
        for (Set<String> itemset : FrequentItemsets.keySet()) { // every frequent itemset
            if (itemset.size() > 1) {
                List<Set<String>> subsets = generateSubsets(itemset);
                System.out.println("-------------[iterating over frequent itemsets]-------------------------------");
                System.out.println(itemset);
                System.out.println(subsets);
                System.out.println("------------------------------------------------------------------");
                System.out.println("------------------------------------------------------------------");
                System.out.println(itemset);
                for (Set<String> subset : subsets) {
                    if (subset.equals(itemset)) {
                        continue;
                    }
                    Set<String> remaining = new HashSet<>(itemset);
                    remaining.removeAll(subset);
                    Integer subsetSupportCount = FrequentItemsets.get(subset); // support count of the antecedent
                    if (subsetSupportCount == null) {
                        continue; // support unknown, so no confidence can be computed
                    }
                    // confidence(subset => remaining) = support(itemset) / support(subset)
                    double confidence = (double) FrequentItemsets.get(itemset) / subsetSupportCount;
                    if (confidence >= minConfidence) {
                        System.out.println(subset + " => " + remaining);
                        // rules.put(subset, remaining);
                        // System.out.println(rules);
                    }
                }
            }
        }
    }
}
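The rule step in the listing above keeps a rule X => Y when support(X ∪ Y) / support(X) reaches the minimum confidence. A minimal sketch of that calculation on its own follows; the class name and the support counts are assumed toy values, not output of the lab program.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ConfidenceDemo {
    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Hypothetical support counts of a few frequent itemsets.
    static Map<Set<String>, Integer> sampleSupports() {
        Map<Set<String>, Integer> supports = new HashMap<>();
        supports.put(set("C"), 4);
        supports.put(set("B", "C"), 3);
        supports.put(set("B", "C", "E"), 3);
        return supports;
    }

    // confidence(antecedent => consequent) = support(antecedent ∪ consequent) / support(antecedent)
    static double confidence(Set<String> antecedent, Set<String> consequent,
                             Map<Set<String>, Integer> supports) {
        Set<String> union = new HashSet<>(antecedent);
        union.addAll(consequent);
        return (double) supports.get(union) / supports.get(antecedent);
    }

    public static void main(String[] args) {
        Map<Set<String>, Integer> supports = sampleSupports();
        System.out.println(confidence(set("B", "C"), set("E"), supports)); // 1.0
        System.out.println(confidence(set("C"), set("B", "E"), supports)); // 0.75
    }
}
```

With a minimum confidence of 0.6, as set in `main` of the listing, both example rules would be printed.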
dataset.txt
序号 商品
1 短裤、帽子、长裤、裙子、棉衣、短袖、衬衫、袜子
2 帽子、长裤、裙子、棉衣、短袖、衬衫、袜子
3 短裤、帽子、裙子、棉衣、短袖、衬衫、袜子
4 短裤、帽子、棉衣、短袖、衬衫、袜子
5 短裤、帽子、长裤、裙子、棉衣、短袖、袜子
6 短裤、帽子、长裤、裙子、棉衣、短袖、衬衫、袜子
7 短裤、帽子、长裤、裙子、棉衣、短袖、
8 帽子、长裤、裙子、棉衣、衬衫、袜子
9 短裤、帽子、长裤、裙子、棉衣、短袖、衬衫、袜子
10 短裤、帽子、长裤、裙子、短袖、衬衫、袜子
11 短裤、帽子、长裤、裙子、棉衣、短袖、衬衫、袜子
12 短裤、帽子、长裤、棉衣、短袖、衬衫、袜子
13 短裤、长裤、裙子、棉衣、短袖、衬衫、袜子
14 帽子、长裤、裙子、棉衣、短袖、衬衫、袜子
15 短裤、帽子、长裤、裙子、棉衣、短袖、衬衫、袜子
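For reference, a sketch of how one line of this file breaks down into items, mirroring the split used in `readTransactions`. The class and method names are hypothetical. The delimiter is written as the escape `\u3001`, which is the ideographic comma 、 used in the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DatasetLineDemo {
    // Parse "<id> <item>、<item>、..." into the list of items.
    static List<String> parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            return new ArrayList<>(); // blank line or line without items
        }
        // String.split drops trailing empty strings, so a trailing delimiter is harmless
        return new ArrayList<>(Arrays.asList(parts[1].split("\u3001")));
    }

    public static void main(String[] args) {
        // Record 7 of dataset.txt ends with a delimiter; it still yields 6 items
        System.out.println(parseLine("7 短裤、帽子、长裤、裙子、棉衣、短袖、").size());
        System.out.println(parseLine("4 短裤、帽子、棉衣、短袖、衬衫、袜子"));
    }
}
```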