How to find duplicate rows with SQL

This article explains how to use SQL to find duplicate rows in a database, covering duplicates in a single column and across multiple columns, and shows how to delete the duplicate records.

This article shows how to find duplicated rows in a database table. This is a very common beginner question. The basic technique is straightforward. I’ll also show some variations, such as how to find “duplicates in two columns” (a recent question on the #mysql IRC channel).

How to find duplicated rows

The first step is to define what exactly makes a row a duplicate of another row. Most of the time this is easy: they have the same value in some column. I’ll take this as a working definition for this article, but you may need to alter the queries below if your notion of “duplicate” is more complicated.

For this article, I’ll use this sample data:

create table test(id int not null primary key, day date not null);

insert into test(id, day) values(1, '2006-10-08');
insert into test(id, day) values(2, '2006-10-08');
insert into test(id, day) values(3, '2006-10-09');

select * from test;
+----+------------+
| id | day        |
+----+------------+
|  1 | 2006-10-08 |
|  2 | 2006-10-08 |
|  3 | 2006-10-09 |
+----+------------+

The first two rows have the same value in the day column, so if I consider those to be duplicates, here’s a query to find them. The query uses a GROUP BY clause to put all the rows with the same day value into one “group” and then count the size of the group:

select day, count(*) from test GROUP BY day;
+------------+----------+
| day        | count(*) |
+------------+----------+
| 2006-10-08 |        2 |
| 2006-10-09 |        1 |
+------------+----------+

The duplicated rows have a count greater than one. If you only want to see rows that are duplicated, you need to use a HAVING clause (not a WHERE clause), like this:

select day, count(*) from test group by day HAVING count(*) > 1;
+------------+----------+
| day        | count(*) |
+------------+----------+
| 2006-10-08 |        2 |
+------------+----------+

This is the basic technique: group by the column that contains duplicates, and show only those groups having more than one row.

Why can’t you use a WHERE clause?

A WHERE clause filters rows before they are grouped, so the aggregate COUNT(*) hasn't been computed yet when WHERE runs. A HAVING clause filters after grouping, when the counts exist. That's why you can't use a WHERE clause in the above query.
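
To see it concretely, here is what happens (in MySQL, at least) if you try to put the aggregate condition into the WHERE clause:

select day, count(*) from test where count(*) > 1 group by day;
ERROR 1111 (HY000): Invalid use of group function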

How to delete duplicate rows

A related question is how to delete the ‘duplicate’ rows once you find them. A common task when cleaning up bad data is to delete all but one of the duplicates, so you can put proper indexes and primary keys on the table, and prevent duplicates from getting into the table again.
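
For example, once the duplicates are gone, a unique index on the day column (assuming day is the column that must be unique) makes the database reject any future duplicates:

alter table test add unique index(day);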

Again, the first thing to do is make sure your definition is clear. Exactly which row do you want to keep? The ‘first’ one? The one with the largest value of some column? For this article, I’ll assume you want to keep the ‘first’ row — the one with the smallest value of the id column. That means you want to delete every other row.

Probably the easiest way to do this is with a temporary table. In MySQL especially, there are restrictions on selecting from a table while updating it in the same query. You can work around these, as I explain in my article How to select from an update target in MySQL, but I'll just avoid these complications and use a temporary table.

The exact definition of the task is to delete every row that has a duplicate, except the row with the minimal value of id for that group. So you need to find not only the rows where there’s more than one in the group, you also need to find the row you want to keep. You can do that with the MIN() function. Here are some queries to create the temporary table and find the data you need to do the DELETE:

create temporary table to_delete (day date not null, min_id int not null);

insert into to_delete(day, min_id)
   select day, MIN(id) from test group by day having count(*) > 1;

select * from to_delete;
+------------+--------+
| day        | min_id |
+------------+--------+
| 2006-10-08 |      1 |
+------------+--------+
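
(If you'd rather keep the 'last' row — the one with the largest value of id — just select MAX(id) instead of MIN(id) here; the DELETE below keeps whichever id you stored in the table.)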

Now that you have this data, you can proceed to delete the ‘bad’ rows. There are many ways to do this, and some are better than others (see my article about many-to-one problems in SQL), but again I’ll avoid the finer points and just show you a standard syntax that ought to work in any RDBMS that supports subqueries:

delete from test
   where exists(
      select * from to_delete
      where to_delete.day = test.day and to_delete.min_id <> test.id
   );

If your RDBMS does not support subqueries, or if it’s more efficient, you may wish to do a multi-table delete. The syntax for this varies between systems, so you need to consult your system’s documentation. You may also need to do all of this in a transaction to avoid other users changing the data while you’re working, if that’s a concern.
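
For instance, in MySQL a multi-table DELETE for this task might look something like the following sketch — check your server's manual before relying on it:

delete test
   from test
   inner join to_delete on to_delete.day = test.day
      and to_delete.min_id <> test.id;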

How to find duplicates in multiple columns

Someone recently asked a question similar to this on the #mysql IRC channel:

I have a table with columns b and c that links two other tables b and c, and I want to find all rows that have duplicates in either b or c.

It was difficult to understand exactly what this meant, but after some conversation I grasped it: the person wanted to be able to put unique indexes on columns b and c separately.
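
In other words, the goal was to eventually be able to run something like alter table ... add unique(b), add unique(c) without a duplicate-key error.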

It’s pretty easy to find rows with duplicate values in one or the other column, as I showed you above: just group by that column and count the group size. And it’s easy to find entire rows that are exact duplicates of other rows: just group by as many columns as you need. But it’s harder to identify rows that have either a duplicated b value or a duplicated c value. Take the following sample table, which is roughly what the person described:

create table a_b_c(
   a int not null primary key auto_increment,
   b int,
   c int
);

insert into a_b_c(b,c) values (1, 1);
insert into a_b_c(b,c) values (1, 2);
insert into a_b_c(b,c) values (1, 3);
insert into a_b_c(b,c) values (2, 1);
insert into a_b_c(b,c) values (2, 2);
insert into a_b_c(b,c) values (2, 3);
insert into a_b_c(b,c) values (3, 1);
insert into a_b_c(b,c) values (3, 2);
insert into a_b_c(b,c) values (3, 3);

Now, you can easily see there are some ‘duplicate’ rows in this table, but no two rows actually have the same tuple {b, c}. That’s why this is a bit more difficult to solve.
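
You can confirm that with the grouping technique from before; on this data it returns the empty set:

select b, c, count(*) from a_b_c group by b, c having count(*) > 1;
Empty set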

Queries that don’t work

If you group by two columns together, you’ll get various results depending on how you group and count. This is where the IRC user was getting stumped. Sometimes queries would find some duplicates but not others. Here are some of the things this person tried:

select b, c, count(*) from a_b_c
group by b, c
having count(distinct b > 1)
   or count(distinct c > 1);

This query returns every row in the table, with a COUNT(*) of 1, which seems to be wrong behavior, but it’s actually not. Why? Because the > 1 is inside the COUNT(). It’s pretty easy to miss, but this query is actually the same as

select b, c, count(*) from a_b_c
group by b, c
having count(1)
   or count(1);

Why? Because (b > 1) is a boolean expression. That’s not what you want at all. You want

select b, c, count(*) from a_b_c
group by b, c
having count(distinct b) > 1
   or count(distinct c) > 1;

This returns zero rows, of course, because there are no duplicate {b, c} tuples. The person tried many other combinations of HAVING clauses and ORs and ANDs, grouping by one column and counting the other, and so forth:

select b, count(*) from a_b_c group by b having count(distinct c) > 1;
+------+----------+
| b    | count(*) |
+------+----------+
|    1 |        3 |
|    2 |        3 |
|    3 |        3 |
+------+----------+

Nothing found all the duplicates, though. What I think made it most frustrating is that it partially worked, making the person think it was almost the right query… perhaps just another variation would get it…

In fact, it’s impossible to do with this type of simple GROUP BY query. Why is this? It’s because when you group by one column, you distribute like values of the other column across multiple groups. You can see this visually by ordering by those columns, which is what grouping does. First, order by column b and see how they are grouped:
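
select * from a_b_c order by b;
+----+------+------+
| a  | b    | c    |
+----+------+------+
|  7 |    1 |    1 |
|  8 |    1 |    2 |
|  9 |    1 |    3 |
| 10 |    2 |    1 |
| 11 |    2 |    2 |
| 12 |    2 |    3 |
| 13 |    3 |    1 |
| 14 |    3 |    2 |
| 15 |    3 |    3 |
+----+------+------+

And then order by column c:

select * from a_b_c order by c;
+----+------+------+
| a  | b    | c    |
+----+------+------+
|  7 |    1 |    1 |
| 10 |    2 |    1 |
| 13 |    3 |    1 |
|  8 |    1 |    2 |
| 11 |    2 |    2 |
| 14 |    3 |    2 |
|  9 |    1 |    3 |
| 12 |    2 |    3 |
| 15 |    3 |    3 |
+----+------+------+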

When you order (group) by column b, the duplicate values in column c are distributed into different groups, so you can't count them with COUNT(DISTINCT c) as the person was trying to do. Aggregate functions such as COUNT() only operate within a group, and have no access to rows that are placed in other groups. Similarly, when you order by c, the duplicate values in column b are distributed into different groups. It is not possible to make this query do what's desired.

Some correct solutions

Probably the simplest solution is to find the duplicates for each column separately and UNION them together, like this:

select b as value, count(*) as cnt, 'b' as what_col
   from a_b_c group by b having count(*) > 1
union
select c as value, count(*) as cnt, 'c' as what_col
   from a_b_c group by c having count(*) > 1;
+-------+-----+----------+
| value | cnt | what_col |
+-------+-----+----------+
|     1 |   3 | b        |
|     2 |   3 | b        |
|     3 |   3 | b        |
|     1 |   3 | c        |
|     2 |   3 | c        |
|     3 |   3 | c        |
+-------+-----+----------+

The what_col column in the output indicates what column the duplicate value was found in. Another approach is to use subqueries:

select a, b, c from a_b_c
   where b in (select b from a_b_c group by b having count(*) > 1)
      or c in (select c from a_b_c group by c having count(*) > 1);
+----+------+------+
| a  | b    | c    |
+----+------+------+
|  7 |    1 |    1 |
|  8 |    1 |    2 |
|  9 |    1 |    3 |
| 10 |    2 |    1 |
| 11 |    2 |    2 |
| 12 |    2 |    3 |
| 13 |    3 |    1 |
| 14 |    3 |    2 |
| 15 |    3 |    3 |
+----+------+------+

This is probably much less efficient than the UNION approach, and it will show every duplicated row, not just the values that are duplicated. Still another approach is to do self-joins against grouped subqueries in the FROM clause. This is more complicated to write correctly, but might be necessary for some complex data, or for efficiency:

select a, a_b_c.b, a_b_c.c
from a_b_c
   left outer join (
      select b from a_b_c group by b having count(*) > 1
   ) as b on a_b_c.b = b.b
   left outer join (
      select c from a_b_c group by c having count(*) > 1
   ) as c on a_b_c.c = c.c
where b.b is not null or c.c is not null;

Any of these queries will do, and I’m sure there are other ways too. If you can use UNION, it’s probably the easiest.
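
As an aside, on systems that support window functions (MySQL didn't have them when this was written, but many databases do), here is one more sketch that lists every row with a duplicated b or c value:

select a, b, c
from (
   select a, b, c,
      count(*) over (partition by b) as b_cnt,
      count(*) over (partition by c) as c_cnt
   from a_b_c
) as t
where b_cnt > 1 or c_cnt > 1;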

From: http://www.xaprb.com/blog/2006/10/09/how-to-find-duplicate-rows-with-sql/
