Error: pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Rhys..

Published 2025-01-09 17:31:38

Tags: pandas, python, search engine

Article link: https://blog.youkuaiyun.com/m0_60008263/article/details/145038792

The error you received, pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects, usually occurs when pandas has to reindex a DataFrame whose index contains duplicate labels.

In your case, it is most likely raised by a concat or merge on DataFrames that carry duplicate index labels. Ensuring that the index is unique avoids the problem.
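To make the cause concrete, here is a minimal sketch. The frames `left` and `right` and the helper `duplicate_labels` are hypothetical and not part of the original code. For a row-wise `pd.concat`, pandas aligns the column labels of the frames, so duplicate column names can trigger this error just as duplicate row labels can for a column-wise concat. The exact behavior may differ between pandas versions, so treat it as an illustration rather than a guarantee.

```python
import pandas as pd

def duplicate_labels(df):
    """Return the duplicated row labels and column labels of df as two lists."""
    dup_rows = df.index[df.index.duplicated()].unique().tolist()
    dup_cols = df.columns[df.columns.duplicated()].unique().tolist()
    return dup_rows, dup_cols

# Hypothetical frames: `left` carries the column name "A" twice
left = pd.DataFrame([[1, 2, 3]], columns=["A", "A", "B"])
right = pd.DataFrame([[4, 5]], columns=["A", "C"])

print(duplicate_labels(left))  # ([], ['A'])

try:
    # The column sets differ, so pandas must reindex the columns of `left`,
    # which is not possible while they contain duplicates
    pd.concat([left, right], ignore_index=True)
except pd.errors.InvalidIndexError as exc:
    print("concat failed:", exc)
```

Running a check like `duplicate_labels` on both frames before the concat shows whether the duplicates sit in the rows or in the columns.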

Solution:

- Make sure the primary key column (the ID column) is unique.
- Reset the index before concatenating or merging.
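As a rough sketch of the first point, the following checks an ID column for repeated values and then removes them. The helper name `find_duplicate_ids` and the sample data are assumptions for illustration. Keeping the first occurrence is only one possible policy, and the right choice depends on your data.

```python
import pandas as pd

def find_duplicate_ids(df, id_column):
    """Return the rows of df whose id_column value occurs more than once."""
    return df[df[id_column].duplicated(keep=False)]

# Hypothetical table in which the ID 2 appears twice
table = pd.DataFrame({"A": [1, 2, 2, 3], "B": ["w", "x", "y", "z"]})

dupes = find_duplicate_ids(table, "A")
print(dupes)

# Assumption: keep the first row for each ID, then renumber the rows
deduplicated = table.drop_duplicates(subset="A", keep="first").reset_index(drop=True)
print(deduplicated)
```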

Updated code:

import pandas as pd
import os

def update_and_compare_tables(file1, file2, sheet_name1, sheet_name2, id_column, columns_to_update, output_file):
    # Verify that both files exist; raise instead of returning None,
    # because the caller unpacks two return values
    if not os.path.isfile(file1):
        raise FileNotFoundError(f"File not found: {file1}")
    if not os.path.isfile(file2):
        raise FileNotFoundError(f"File not found: {file2}")

    # Read the Excel files
    df1 = pd.read_excel(file1, sheet_name=sheet_name1)
    df2 = pd.read_excel(file2, sheet_name=sheet_name2)

    # Strip stray whitespace from the column names
    df1.columns = df1.columns.str.strip()
    df2.columns = df2.columns.str.strip()

    # Print the column names for debugging
    print("Table 1 columns:", df1.columns)
    print("Table 2 columns:", df2.columns)

    # Reset the row index so that the row labels are unique
    df1 = df1.reset_index(drop=True)
    df2 = df2.reset_index(drop=True)

    # Rows whose ID is in table 1 but not in table 2; their ID and update columns are appended to table 2
    unmatched_rows_from_df1 = df1[~df1[id_column].isin(df2[id_column])]

    # Make sure every column to be appended exists in table 2
    table2_schema = df2.columns.tolist()
    for col in [id_column] + columns_to_update:
        if col not in table2_schema:
            raise KeyError(f"Column {col} not in Table 2 schema.")

    updated_df2 = pd.concat([df2, unmatched_rows_from_df1[[id_column] + columns_to_update]], ignore_index=True)

    # IDs that are in table 2 but not in table 1
    unmatched_ids_from_df2 = df2[~df2[id_column].isin(df1[id_column])][id_column].tolist()

    # Save the updated table 2 to the output Excel file
    updated_df2.to_excel(output_file, index=False)

    return unmatched_rows_from_df1[id_column].tolist(), unmatched_ids_from_df2

# File paths, sheet names, ID column name, and the columns to update
file1 = 'C:/path/to/table1.xlsx'
file2 = 'C:/path/to/table2.xlsx'
output_file = 'C:/path/to/table2_updated.xlsx'  # a new file, so table2.xlsx is not overwritten
sheet_name1 = 'Sheet1'
sheet_name2 = 'Sheet2'
id_column = 'A'  # column used for matching
columns_to_update = ['B', 'C', 'D']  # columns of table 1 to copy into table 2

# Call the function and collect the unmatched IDs
unmatched_ids_from_df1, unmatched_ids_from_df2 = update_and_compare_tables(
    file1, file2, sheet_name1, sheet_name2, id_column, columns_to_update, output_file
)

print("IDs in table1 not in table2:", unmatched_ids_from_df1)
print("IDs in table2 not in table1:", unmatched_ids_from_df2)
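The ID comparison itself can be tried without any Excel files. The sketch below is an alternative formulation, not the method used above: it uses `pd.merge` with `indicator=True` on two small in-memory tables. The function name `compare_ids` and the sample data are hypothetical.

```python
import pandas as pd

def compare_ids(df1, df2, id_column):
    """Return (IDs only in df1, IDs only in df2) using an outer merge."""
    merged = pd.merge(
        df1[[id_column]].drop_duplicates(),
        df2[[id_column]].drop_duplicates(),
        on=id_column,
        how="outer",
        indicator=True,  # adds a "_merge" column: left_only / right_only / both
    )
    only_in_df1 = merged.loc[merged["_merge"] == "left_only", id_column].tolist()
    only_in_df2 = merged.loc[merged["_merge"] == "right_only", id_column].tolist()
    return only_in_df1, only_in_df2

# Hypothetical sample tables sharing the IDs 2 and 3
table1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
table2 = pd.DataFrame({"A": [2, 3, 4], "B": ["y", "z", "q"]})

only_in_table1, only_in_table2 = compare_ids(table1, table2, "A")
print("IDs in table1 not in table2:", only_in_table1)
print("IDs in table2 not in table1:", only_in_table2)
```

The `isin` filter in the function above gives the same two lists; the merge variant is mainly useful when you also want to see which IDs fall into the `both` category.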

Detailed explanation

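Check that the files exist:

Use os.path.isfile to confirm that each file exists before reading it, and raise an error if it does not

if not os.path.isfile(file1):
    raise FileNotFoundError(f"File not found: {file1}")
if not os.path.isfile(file2):
    raise FileNotFoundError(f"File not found: {file2}")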

Read the Excel files:

Use pd.read_excel(file1, sheet_name=sheet_name1) to read each file

df1 = pd.read_excel(file1, sheet_name=sheet_name1)
df2 = pd.read_excel(file2, sheet_name=sheet_name2)

Clean up the column names:

Use columns.str.strip() to remove leading and trailing whitespace from the column names

df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()
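One thing worth checking at this step: two headers that differ only in whitespace become identical after stripping, and duplicate column names are themselves a possible source of the reindexing error. The sketch below uses a hypothetical frame to show how to detect this. Keeping only the first column per name is an assumption; you may prefer to rename instead.

```python
import pandas as pd

# Hypothetical sheet where one header has a trailing space
df = pd.DataFrame([[1, 2, 3]], columns=["A", "A ", "B"])
print(df.columns.is_unique)  # True

df.columns = df.columns.str.strip()
print(df.columns.is_unique)  # False

# Names that now appear more than once
duplicated_names = df.columns[df.columns.duplicated()].tolist()
print(duplicated_names)  # ['A']

# Assumption: keep only the first column for each name
df = df.loc[:, ~df.columns.duplicated()]
print(df.columns.tolist())  # ['A', 'B']
```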

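Reset the index:

Use reset_index(drop=True) to replace the row index with a fresh 0..n-1 range, which is unique by construction; the column labels are not affected

df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)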
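A small illustration of what this step does, using a hypothetical frame with a repeated row label:

```python
import pandas as pd

# Hypothetical frame whose row label 0 appears twice
df = pd.DataFrame({"A": [10, 20, 30]}, index=[0, 0, 1])
print(df.index.is_unique)  # False

# drop=True discards the old labels instead of adding them as a column
df = df.reset_index(drop=True)
print(df.index.tolist())   # [0, 1, 2]
print(df.index.is_unique)  # True
```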

Make sure the column names match:

Check that every column to be appended (the ID column plus columns_to_update) exists in df2

table2_schema = df2.columns.tolist()
for col in [id_column] + columns_to_update:
    if col not in table2_schema:
        raise KeyError(f"Column {col} not in Table 2 schema.")
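The loop above stops at the first missing column. If you would rather see every missing column at once, a variant along these lines may help. The helper `missing_columns` and the sample frame are hypothetical.

```python
import pandas as pd

def missing_columns(df, required_columns):
    """Return the names in required_columns that df does not have, in order."""
    return [col for col in required_columns if col not in df.columns]

# Hypothetical table 2 that lacks column "D"
table2 = pd.DataFrame({"A": [1], "B": ["x"], "C": [0.5]})
id_column = "A"
columns_to_update = ["B", "C", "D"]

missing = missing_columns(table2, [id_column] + columns_to_update)
if missing:
    print(f"Columns missing from Table 2: {missing}")
```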

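Combine the data:

Use pd.concat with ignore_index=True to append the unmatched rows; the original row labels are discarded and the result is renumbered, while the columns are still aligned by name and must therefore be unique

updated_df2 = pd.concat([df2, unmatched_rows_from_df1[[id_column] + columns_to_update]], ignore_index=True)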
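To see what the append produces, here is a sketch with two hypothetical in-memory frames. Columns that the appended rows do not supply (here `E`) are filled with NaN.

```python
import pandas as pd

# Hypothetical table 2 with an extra column "E" that table 1 does not provide
table2 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "E": [0.1, 0.2]})
# Hypothetical unmatched row from table 1, still carrying its old row label 7
new_rows = pd.DataFrame({"A": [3], "B": ["z"]}, index=[7])

combined = pd.concat([table2, new_rows], ignore_index=True)
print(combined.index.tolist())        # [0, 1, 2]
print(combined.columns.tolist())      # ['A', 'B', 'E']
print(combined["E"].isna().tolist())  # [False, False, True]
```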

Save the data:
- Use to_excel with index=False to save the updated data without writing the row index
```
updated_df2.to_excel(output_file, index=False)
```

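With the steps above, the index stays unique and the column names stay consistent, which avoids the reindexing error while comparing and updating the two tables.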