pandas 入门

原创已于 2025-08-13 11:33:53 修改 · 347 阅读

7 ·

CC 4.0 BY-SA版权

文章标签：

#pandas

于 2025-08-04 14:10:17 首次发布

python 同时被 2 个专栏收录

50 篇文章

订阅专栏

数据库

10 篇文章

订阅专栏

Python pandas是一个强大的数据分析库，适合处理各种数据集。以下是一些入门的基础知识：

1. Pandas基础

Pandas是基于NumPy的专业数据分析工具，它提供了便捷的数据读写操作、SQL的增删改查功能、数据透视表功能等。

1.1 数据结构

Pandas主要有两种数据结构：

Series：类似于一维数组，由索引和值组成。
DataFrame：二维数据结构，类似于表格，有行和列这两个维度。

1.2 数据类型

Pandas支持多种数据类型，包括：

object：相当于Python的字符串类型
int64：整型
float64：浮点型
bool：布尔类型
datetime64：日期和时间类型
timedelta[ns]：时间间隔类型
category：分类类别

import pandas as pd

# 创建Series
ser = pd.Series([1, 2, 3, 4, 5])
print(ser)

# 创建DataFrame
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'age': [25, 30, 35, 40, 45],
                   'city': ['New York', 'London', 'Paris', 'Berlin', 'Tokyo']})
print(df)

# 访问数据
print(df['name'][0])  # 访问第一行的名字 Alice
print(df.loc[2, 'age'])  # 访问第三行的年龄 35

# 数据筛选
print(df[df['age'] > 35])  # 筛选出年龄大于35的行

2. Pandas的数据读取和写入操作

Pandas支持大部分的主流文件格式进行数据读写，包括：

文本文件：csv, txt等
Excel文件：xls, xlsx等
SQL文件：支持大部分主流关系型数据库

# 读写csv,txt
import pandas as pd
ser = pd.Series([1, 2, 3, 4, 5])
print(ser)
# 创建一个示例DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 28]}
df = pd.DataFrame(data)
print(df)
# 将DataFrame写入txt文件
df.to_csv('example.txt', index=False) # sep默认为逗号,index 为False不显示行索引
ser.to_csv('example.txt', index=False) # 不是追加重新写入了
# 读取txt文件
new_df = pd.read_csv('example.txt')

print(new_df)

read_csv参数说明

filepath_or_buffer：文件路径或缓冲区，可以是绝对路径或相对路径。
sep：分隔符，用于分隔文件中的数据。默认为逗号（,）。
header：指定列名的行号。默认为第一行。
names：指定列名列表。如果不指定，则使用文件中的第一行作为列名。
index_col：指定用作索引的列。默认为None，即不创建索引。
usecols：指定要读取的列的列名列表。默认为读取所有列。
skiprows：跳过指定行数的数据。默认为0，即不跳过任何行。
encoding：文件的编码方式。默认为'utf-8'。
engine：使用的解析引擎。默认为'c'，即使用C引擎，速度快但功能较少；或者使用'python'，功能更完备但速度较慢。

如何追加写入？

1、如果你想把 DataFrame 或 Series 的内容以文本形式追加写入 TXT 文件，可以使用 to_string() 方法，然后以 追加模式 ('a') 写入文件。

2、如果你不想用 to_string()，而是想 逐行写入数据（比如 CSV 格式或自定义格式），可以手动遍历 DataFrame 或 Series 并写入文件

import pandas as pd

# 示例 DataFrame
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)

# 1. 将 DataFrame 转换为字符串（表格形式）
df_str = df.to_string(index=False)  # index=False 不显示行索引

# 2. 追加写入 TXT 文件
with open('output.txt', 'a', encoding='utf-8') as f:
    f.write(df_str + '\n\n')  # 写入 DataFrame 并加空行分隔

print("DataFrame 已追加写入 output.txt")

import pandas as pd

# 示例 Series
s = pd.Series([10, 20, 30], name='Numbers')

# 1. 将 Series 转换为字符串
s_str = s.to_string()

# 2. 追加写入 TXT 文件
with open('output.txt', 'a', encoding='utf-8') as f:
    f.write(s_str + '\n\n')  # 写入 Series 并加空行分隔

print("Series 已追加写入 output.txt")

import pandas as pd

# 示例 DataFrame
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)

# 1. 以追加模式打开文件
with open('output.txt', 'a', encoding='utf-8') as f:
    # 2. 遍历 DataFrame 的每一行
    for _, row in df.iterrows():
        f.write(f"{row['Name']}, {row['Age']}\n")  # 写入 CSV 格式

print("DataFrame 已逐行追加写入 output.txt")


import pandas as pd

# 示例 Series
s = pd.Series([10, 20, 30], name='Numbers')

# 1. 以追加模式打开文件
with open('output.txt', 'a', encoding='utf-8') as f:
    # 2. 遍历 Series 的每一项
    for value in s:
        f.write(f"{value}\n")  # 每行写入一个数字

print("Series 已逐行追加写入 output.txt")

合并多个excel文件，并去重

import pandas as pd

import pandas as pd
import os

# 指定包含Excel文件的文件夹路径
folder_path = r'D:\bigdatacharge\data'

# 获取文件夹中所有Excel文件的列表
excel_files = [f for f in os.listdir(folder_path) if f.endswith('.xlsx')]
print(excel_files)
# 初始化一个空的DataFrame用于存储所有数据
combined_df = pd.DataFrame()

# 遍历所有的Excel文件
for file in excel_files:
    file_path = os.path.join(folder_path, file)
    # 读取Excel文件中的第一个sheet，可以根据需要调整sheet名称
    df = pd.read_excel(file_path, sheet_name=0)
    # 将读取的数据追加到combined_df中
    combined_df = pd.concat([combined_df, df], ignore_index=True)
combined_df = combined_df.drop_duplicates(subset='question', keep='first')
# 将合并后的DataFrame保存到一个新的Excel文件中
output_path = r'D:\bigdatacharge\data\combined_file2.xlsx'
combined_df.to_excel(output_path, index=False)

python pandas sql 的增删改查功能

import pyodbc
import pandas as pd

# 数据库连接字符串
conn_str = (
    r'DRIVER={ODBC Driver 17 for SQL Server};'
    r'SERVER=your_server_name;'
    r'DATABASE=your_database_name;'
    r'UID=your_username;'
    r'PWD=your_password;'
)

# 连接到数据库
conn = pyodbc.connect(conn_str)

# 查询SQL数据库并创建DataFrame
query = "SELECT * FROM your_table_name"
df = pd.read_sql_query(query, conn)

# 插入新数据
new_data = {'column1': 'value1', 'column2': 'value2', 'column3': 'value3'}
df.append(new_data, ignore_index=True).to_sql('your_table_name', conn, if_exists='append', index=False)

# 更新数据
df.loc[df['column1'] == 'target_value', 'column2'] = 'new_value'
df.to_sql('your_table_name', conn, if_exists='replace', index=False)

# 删除数据
df_to_delete = df[df['column1'] == 'target_value']
df_to_delete.drop(df_to_delete.index, inplace=True)
df.to_sql('your_table_name', conn, if_exists='replace', index=False)

# 关闭数据库连接
conn.close()