Data processing and matching - batch multi-file matching, selective file de-duplication, selective insertion - Java implementation - Java


小圆酱

License: CC 4.0 BY-SA

Category: Data processing

Article link: https://blog.youkuaiyun.com/sinat_26933727/article/details/80700298

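Dataset description: rows are selectively added to or removed from a data set according to the values in one of its columns. The selection criteria are stored in a second data set. The original file is rewritten after the two are compared row by row.

File naming: original file master.*; comparison file compare.*

File format: comma-separated. If the data is not comma-separated, Excel's Text to Columns tool can help split the original data.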

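Selection rules: for a given group of rows in master,

if a row's comparison column value also exists in compare, keep the row (continue);
if a value listed in compare does not exist in the group, insert it;
if a row's value exists in master but is not one of the compare values, delete the row.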
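The three rules can be read as set operations on the filter values of one group. The following is a minimal sketch under that reading; the class name `GroupRules` and its method names are hypothetical and do not come from the original project.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class GroupRules {
    /** Values found in both the master group and compare: rows to keep. */
    static Set<String> keep(Set<String> master, Set<String> compare) {
        Set<String> result = new LinkedHashSet<>(master);
        result.retainAll(compare);
        return result;
    }

    /** Values found only in compare: rows to insert into master. */
    static Set<String> insert(Set<String> master, Set<String> compare) {
        Set<String> result = new LinkedHashSet<>(compare);
        result.removeAll(master);
        return result;
    }

    /** Values found only in master: rows to delete. */
    static Set<String> delete(Set<String> master, Set<String> compare) {
        Set<String> result = new LinkedHashSet<>(master);
        result.removeAll(compare);
        return result;
    }

    public static void main(String[] args) {
        // Filter values of group 01+a+b and of the matching compare rows
        Set<String> master = new LinkedHashSet<>(Arrays.asList("0203", "0204", "0205"));
        Set<String> compare = new LinkedHashSet<>(Arrays.asList("0203", "0206"));
        System.out.println("keep   = " + keep(master, compare));   // [0203]
        System.out.println("insert = " + insert(master, compare)); // [0206]
        System.out.println("delete = " + delete(master, compare)); // [0204, 0205]
    }
}
```

The rewritten group is the kept rows followed by the inserted rows; deleted rows are simply never written back.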

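Example (x = no match, the row is dropped; <-- = value inserted from compare):

| master | match condition | compare | filter item | result |
|---|---|---|---|---|
| 01, a, b, 0203 | a-a | a, 0203 | 0203-0203 | 01, a, b, 0203 |
| 01, a, b, 0204 | a-a | a, 0206 | 0204 x | 01, a, b, 0206 |
| 01, a, b, 0205 | a-a | | 0205 x | |
| 02, a, b, 0206 | a-a | | 0206-0206 | 02, a, b, 0206 |
| | | | 0203 <-- | 02, a, b, 0203 |
| 03, c, b, 0206 | c-a | | x | |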

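Implementation approach:

1. Group the data; 01+a+b forms a unique group key.
2. Store the compare file data in cmp_map.
3. Store the rows of each group in master_map.
4. Store all compare filter items that match the current master group in match_map.
5. At the end of a group, compare master_map with match_map to build record_map (the filter items that appear in both files).
6. Compare record_map with match_map to find the new filter items that must be inserted into the master file.
7. When a group is finished, clear master_map with hashMap.clear(), or create a new master_map at the start of each group.
8. For batch processing, add file and folder traversal in the outermost layer, between steps 2 and 3.
9. If each file is modified separately, use a .bat script at the end to merge the results into one complete CSV file.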
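The steps above can be sketched end to end in memory. This is a simplified illustration, not the code from the GitHub repository: it assumes the four-column master layout (id, dir, type, filter value) and the two-column compare layout (dir, filter value) of the small example, and the class name `GroupReconciler` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GroupReconciler {
    static List<String> reconcile(List<String> masterRows, List<String> compareRows) {
        // Step 2: dir -> filter values listed in compare
        Map<String, Set<String>> cmp = new LinkedHashMap<>();
        for (String row : compareRows) {
            String[] f = split(row);
            cmp.computeIfAbsent(f[0], k -> new LinkedHashSet<>()).add(f[1]);
        }
        // Steps 1 and 3: group key (first three columns) -> filter values in master
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String row : masterRows) {
            String[] f = split(row);
            String key = f[0] + "," + f[1] + "," + f[2];
            groups.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(f[3]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> group : groups.entrySet()) {
            String dir = split(group.getKey())[1];
            // Step 4: compare values whose dir matches this group
            Set<String> match = cmp.getOrDefault(dir, Collections.<String>emptySet());
            // Step 5: keep master rows whose value also appears in compare
            for (String code : group.getValue()) {
                if (match.contains(code)) out.add(group.getKey() + "," + code);
            }
            // Step 6: insert compare values missing from the group
            for (String code : match) {
                if (!group.getValue().contains(code)) out.add(group.getKey() + "," + code);
            }
            // Step 7 needs no clear() here: each group owns its own set
        }
        return out;
    }

    private static String[] split(String row) {
        String[] fields = row.split(",");
        for (int i = 0; i < fields.length; i++) fields[i] = fields[i].trim();
        return fields;
    }

    public static void main(String[] args) {
        List<String> master = Arrays.asList(
            "01,a,b,0203", "01,a,b,0204", "01,a,b,0205", "02,a,b,0206", "03,c,b,0206");
        List<String> compare = Arrays.asList("a,0203", "a,0206");
        // Prints: 01,a,b,0203 / 01,a,b,0206 / 02,a,b,0206 / 02,a,b,0203
        reconcile(master, compare).forEach(System.out::println);
    }
}
```

Unlike the file-based version, this sketch does not require the rows of a group to be adjacent in the input, at the cost of holding the whole file in memory.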

Merging files with Shell:

https://blog.youkuaiyun.com/sinat_26933727/article/details/80710914
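If a script is not convenient, the merge can also be done in Java. The sketch below is an assumption-laden alternative rather than the linked Shell approach: it assumes UTF-8 files without header rows, and the class name `CsvMerger` is hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class CsvMerger {
    /** Writes the lines of every part file into target, in the given order. */
    static void merge(List<Path> parts, Path target) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (Path part : parts) {
                // readAllLines loads one part at a time; fine for moderate file sizes
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
```

If the part files carry a header row, it would have to be skipped for every part after the first.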

Full code on GitHub:

https://github.com/Roundlet/Data-processing

Selected code blocks:

2. Store the compare data in cmp_map

/* Needs: import java.util.HashMap; import java.util.Map; */
/* Set up cmp_map: <Test ID, full line> */
Map<Integer,String> cmp_map = new HashMap<Integer,String>();
while ((readline_cmp = br_cmp.readLine()) != null) {
    String[] array = readline_cmp.split(",");
    /* trim() guards against spaces around the ID */
    int id = Integer.parseInt(array[0].trim());
    /* Keep only the first line seen for each Test ID */
    cmp_map.putIfAbsent(id, readline_cmp);
}
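For reference, here is a self-contained version of the same loading step that can be run on its own. It is a sketch: the class name `CompareLoader` is hypothetical, and it assumes the compare file has no header row and that the first column is always a valid integer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

class CompareLoader {
    /** Reads comma-separated lines into a map of Test ID -> full line. */
    static Map<Integer, String> load(Reader source) throws IOException {
        Map<Integer, String> cmpMap = new LinkedHashMap<>(); // keeps file order
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().isEmpty()) continue; // skip blank lines
                int id = Integer.parseInt(line.split(",")[0].trim());
                cmpMap.putIfAbsent(id, line); // first occurrence wins
            }
        }
        return cmpMap;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, String> map = load(new StringReader("203,x\n206,y\n203,z\n"));
        System.out.println(map); // {203=203,x, 206=206,y}
    }
}
```

Note that Integer.parseInt drops leading zeros, so an ID written as 0203 is stored under the key 203.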

3. Store the rows of each group in master_map

/* Boundary case: a group may contain only one member. */
/* Loop over one group. The loop stops at the first row of the next group,
 * so after the break readline_master still holds that row. */
while ((readline_master = br.readLine()) != null) {
    attr = readline_master.split(",");
    /* Build the current key from the first 18 columns */
    StringBuilder key = new StringBuilder();
    for (int i = 0; i < 18; i++) {
        key.append(attr[i]).append(",");
    }
    String cur_key = key.toString();
    /* Keys identify whether rows belong to the same group */
    if (!cur_key.equals(pre_key)) {
        break;
    }
    /* attr[22] is the Test ID that is compared against match_map in step 5 */
    master_map.put(readline_master, Integer.parseInt(attr[22].trim()));
}
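One subtlety of this loop is that the row which ends a group is already the first row of the next group and must not be lost. The sketch below shows one way to handle that boundary in a standalone form. The names `GroupReader`, `keyOf` and `readGroups` are hypothetical, and it assumes that rows of the same group are adjacent in the file and that every row has at least `keyColumns` fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class GroupReader {
    /** Group key: the first keyColumns fields, each followed by a comma. */
    static String keyOf(String line, int keyColumns) {
        String[] attr = line.split(",");
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < keyColumns; i++) {
            key.append(attr[i].trim()).append(',');
        }
        return key.toString();
    }

    /** Splits consecutive rows into groups that share the same key. */
    static List<List<String>> readGroups(BufferedReader br, int keyColumns) throws IOException {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String preKey = null;
        String line;
        while ((line = br.readLine()) != null) {
            String curKey = keyOf(line, keyColumns);
            if (preKey != null && !curKey.equals(preKey)) {
                groups.add(current);          // the previous group is complete
                current = new ArrayList<>();  // fresh container instead of clear()
            }
            current.add(line);                // the boundary row opens the next group
            preKey = curKey;
        }
        if (!current.isEmpty()) {
            groups.add(current);              // last group ends at end of file
        }
        return groups;
    }
}
```

For the example data, calling readGroups with keyColumns = 3 yields three groups with 3, 1 and 1 rows.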

4. Store all compare filter items that match the current master group in match_map

/* Collect every compare entry (Test ID) whose dir name matches the
 * dir name of the current master group, dir_master. */
Map<String,Integer> match_map = new HashMap<String,Integer>();
for (Map.Entry<Integer, String> entry : cmp_map.entrySet()) {
    attr = entry.getValue().split(",");
    String dir_cmp = attr[6];
    if (dir_master.equals(dir_cmp)) {
        /* New master row: group key, compare column 1, three empty
         * columns, then the full compare line */
        String cmp_line = pre_key + attr[1] + ",,,," + entry.getValue();
        match_map.put(cmp_line, Integer.parseInt(attr[0].trim()));
    }
}

5. At the end of a group, compare master_map with match_map to build record_map (the filter items that appear in both files)

/* Record the codes that exist in both master_map and match_map,
 * and write those master rows back out. Test IDs in match_map are
 * unique, so containsValue is equivalent to the nested loop. */
Map<String,Integer> record_map = new HashMap<String,Integer>();
for (Map.Entry<String,Integer> entry_master : master_map.entrySet()) {
    if (match_map.containsValue(entry_master.getValue())) {
        record_map.put(entry_master.getKey(), entry_master.getValue());
        writer.write(entry_master.getKey());
        writer.newLine();
    }
}

6. Compare record_map with match_map to find the new filter items that must be inserted into the master file

/* Find the rows we need to insert: every match_map entry whose code is
 * not already in record_map. When record_map is empty this writes all of
 * match_map. Testing each entry once with containsValue avoids writing a
 * row once per non-matching record, and avoids re-inserting a code that
 * is already recorded when record_map holds several entries. */
for (Map.Entry<String,Integer> entry_match : match_map.entrySet()) {
    if (!record_map.containsValue(entry_match.getValue())) {
        writer.write(entry_match.getKey());
        writer.newLine();
    }
}
} /* closes the enclosing block, which is not shown in this excerpt */