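Apriori is an algorithm for association rule mining. Its logic is simple and widely documented online, so it is not repeated here. The core of implementing Apriori on Hadoop is an iterative computation: in each iteration the Map phase identifies the candidate itemsets, the Combine and Reduce phases count how often each candidate occurs, and the itemsets that meet the minimum support are written out.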
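To make the division of work in a single pass concrete, here is a minimal in-memory sketch in plain Java, without Hadoop. The class `SupportCountSketch`, its methods `map`, `reduce` and `onePass`, and the sample transactions are all hypothetical names and data chosen for illustration; they are not part of the code in this article, and a real job would distribute these steps across Mapper, Combiner and Reducer classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the article's Hadoop code.
class SupportCountSketch {

    // "Map": return every candidate contained in one transaction.
    // Each returned candidate stands for an emitted (candidate, 1) pair.
    static List<List<String>> map(List<String> transaction, List<List<String>> candidates) {
        List<List<String>> emitted = new ArrayList<>();
        for (List<String> candidate : candidates) {
            if (transaction.containsAll(candidate)) {
                emitted.add(candidate);
            }
        }
        return emitted;
    }

    // "Combine"/"Reduce": sum the counts per candidate and keep only
    // those whose count is at least minSupport.
    static Map<List<String>, Integer> reduce(List<List<String>> emitted, int minSupport) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> candidate : emitted) {
            counts.merge(candidate, 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }

    // One full pass: scan every transaction, then aggregate and filter.
    static Map<List<String>, Integer> onePass(List<List<String>> transactions,
            List<List<String>> candidates, int minSupport) {
        List<List<String>> emitted = new ArrayList<>();
        for (List<String> transaction : transactions) {
            emitted.addAll(map(transaction, candidates));
        }
        return reduce(emitted, minSupport);
    }

    public static void main(String[] args) {
        List<List<String>> transactions = Arrays.asList(
                Arrays.asList("A", "B", "C"),
                Arrays.asList("A", "B"),
                Arrays.asList("A", "C"),
                Arrays.asList("B"));
        List<List<String>> candidates = Arrays.asList(
                Arrays.asList("A", "B"),
                Arrays.asList("A", "C"),
                Arrays.asList("B", "C"));
        // With minSupport = 2, [A, B] and [A, C] survive with count 2; [B, C] is dropped.
        System.out.println(onePass(transactions, candidates, 2));
    }
}
```

Because the sum is associative, the same summing logic can serve as both the Combine step (local pre-aggregation) and the Reduce step, as long as the minimum-support filter is applied only in the final Reduce.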
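It can be understood as follows:
- Based on the input file, identify all candidate items A, B, C, …
- The Reduce phase counts the frequency of candidates such as A, B, and C, and outputs the itemsets whose count is not less than the minimum support count;
- At this point the previous MapReduce job has finished. After its output is saved, the next Map phase starts: it joins the itemsets from the previous output to form new candidates such as AB and AC, then scans the source data, much like the Map phase of WordCount;
- The Combine and Reduce phases simply count frequencies and output the itemsets that meet the condition;
- Repeat the previous two steps (the join-and-scan Map phase and the counting Combine/Reduce phase) until the output is empty.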
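The loop described above can be sketched in memory as follows. This is only an illustration under stated assumptions: the class `AprioriLoopSketch` and its methods `join`, `frequentOf` and `mine` are hypothetical names, the join is the classic prefix-based Apriori join (two sorted k-itemsets that share their first k-1 items are merged), which is not necessarily how the helper class in this article joins itemsets, and the subset-pruning step of full Apriori is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration class; not part of the article's Hadoop code.
class AprioriLoopSketch {

    // Join step: merge every pair of sorted k-itemsets that share their
    // first k-1 items into one sorted (k+1)-itemset.
    static List<List<String>> join(List<List<String>> frequent) {
        List<List<String>> next = new ArrayList<>();
        for (int i = 0; i < frequent.size(); i++) {
            for (int j = i + 1; j < frequent.size(); j++) {
                List<String> a = frequent.get(i);
                List<String> b = frequent.get(j);
                int k = a.size();
                if (a.subList(0, k - 1).equals(b.subList(0, k - 1))) {
                    List<String> merged = new ArrayList<>(a);
                    merged.add(b.get(k - 1));
                    Collections.sort(merged);
                    next.add(merged);
                }
            }
        }
        return next;
    }

    // Counting step: keep the candidates contained in at least minSupport transactions.
    static List<List<String>> frequentOf(List<List<String>> transactions,
            List<List<String>> candidates, int minSupport) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> candidate : candidates) {
            int count = 0;
            for (List<String> transaction : transactions) {
                if (transaction.containsAll(candidate)) {
                    count++;
                }
            }
            if (count >= minSupport) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Driver loop: count, save the frequent itemsets, join, and repeat until a pass outputs nothing.
    static List<List<String>> mine(List<List<String>> transactions, int minSupport) {
        // First pass: every distinct item is a candidate 1-itemset.
        TreeSet<String> items = new TreeSet<>();
        for (List<String> transaction : transactions) {
            items.addAll(transaction);
        }
        List<List<String>> candidates = new ArrayList<>();
        for (String item : items) {
            candidates.add(Collections.singletonList(item));
        }

        List<List<String>> all = new ArrayList<>();
        while (true) {
            List<List<String>> frequent = frequentOf(transactions, candidates, minSupport);
            if (frequent.isEmpty()) {
                break; // an empty output ends the iteration
            }
            all.addAll(frequent);
            candidates = join(frequent);
        }
        return all;
    }

    public static void main(String[] args) {
        List<List<String>> transactions = Arrays.asList(
                Arrays.asList("A", "B", "C"),
                Arrays.asList("A", "B"),
                Arrays.asList("A", "C"),
                Arrays.asList("B"));
        // prints [[A], [B], [C], [A, B], [A, C]]
        System.out.println(mine(transactions, 2));
    }
}
```

In the sample run, the third pass generates the single candidate [A, B, C], which occurs only once, so that pass outputs nothing and the loop stops. In the Hadoop version each iteration of the `while` loop corresponds to one MapReduce job.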
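Following this logic, the code is given below. First, a helper class, Assitance.java, is needed; it joins itemsets, saves output results, and performs similar operations: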
package myapriori;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;
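public class Assitance {

    /**
     * Reads the itemsets written by the previous pass and joins them into
     * the candidate itemsets for the next pass.
     *
     * @param nextrecord  path of the previous pass's output (a file or a directory)
     * @param isDirectory "true" if nextrecord is a directory
     */
    public static List<List<String>> getNextRecord(String nextrecord,
            String isDirectory) {
        boolean isdy = "true".equals(isDirectory);
        List<List<String>> result = new ArrayList<List<String>>();
        try {
            Path path = new Path(nextrecord);
            Configuration conf = new Configuration();
            FileSystem fileSystem = path.getFileSystem(conf);
            if (isdy) {
                // Process every file in the directory and merge the results.
                // Note: each file is joined on its own, so itemsets stored in
                // different files are never combined with each other.
                FileStatus[] listFile = fileSystem.listStatus(path);
                for (int i = 0; i < listFile.length; i++) {
                    result.addAll(getNextRecord(listFile[i].getPath()
                            .toString(), "false"));
                }
                return result;
            }
            FSDataInputStream fsis = fileSystem.open(path);
            LineReader lineReader = new LineReader(fsis, conf);
            Text line = new Text();
            while (lineReader.readLine(line) > 0) {
                List<String> tempList = new ArrayList<String>();
                // Keep only the comma-separated items before the first "]";
                // anything after it on the line is ignored.
                String text = line.toString();
                int end = text.indexOf("]");
                if (end < 0) {
                    // No bracketed itemset on this line: skip it instead of
                    // failing in substring().
                    continue;
                }
                String[] fields = text.substring(0, end)
                        .replaceAll("\\[", "").replaceAll("\t", "").split(",");
                for (int i = 0; i < fields.length; i++) {
                    tempList.add(fields[i].trim());
                }
                // Sort the items so that equal itemsets have the same order.
                Collections.sort(tempList);
                result.add(tempList);
            }
            lineReader.close();
            // Join the itemsets that were read into the next candidates.
            result = connectRecord(result);
        } catch (IOException e) {
            // Reading failed: print the error and return what was collected so far.
            e.printStackTrace();
        }
        return result;
    }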
    // Join step: builds the next candidate itemsets from the itemsets read above.
    private static List<List<String>> connectRecord(List<List<String>> result) {
        List<List<String>> nextCandidateItemset = new ArrayList<List<String>>();
        for (int i = 0; i < result.size()-1; i++) {
            HashSet<String> hsSet = new HashSet<String>();
HashSet<String> hsSette