Pandas数据读取全攻略：从MySQL到Excel的实战指南-优快云博客

本文链接：https://blog.youkuaiyun.com/gitblog_01165/article/details/148682359

Pandas数据读取全攻略：从MySQL到Excel的实战指南

dealing_with_data Material that I use for a variety of classes and tutorials 项目地址: https://gitcode.com/gh_mirrors/de/dealing_with_data

前言

在数据分析工作中，数据读取是最基础也是最重要的一环。本文将基于实际项目案例，详细介绍如何使用Pandas库从多种数据源读取数据，包括MySQL数据库、CSV文件、Excel文件以及HTML网页表格等。通过本教程，你将掌握Pandas中各种数据读取方法的实际应用技巧。

环境准备

在开始之前，我们需要确保已安装必要的Python库：

!pip install pandas numpy matplotlib seaborn PyMySQL sqlalchemy xlrd

然后导入所需模块：

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1. 从MySQL数据库读取数据

1.1 建立数据库连接

使用SQLAlchemy创建与MySQL数据库的连接：

from sqlalchemy import create_engine

conn_string = 'mysql+pymysql://{user}:{password}@{host}/{db}?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org', 
    user = 'student',
    password = 'dwdstudent2015', 
    db = 'doh_restaurants',
    encoding = 'utf8mb4')

engine = create_engine(conn_string)

1.2 执行SQL查询并读取数据

使用read_sql方法执行查询并将结果读取为DataFrame：

sql = '''
    SELECT R.CAMIS, R.DBA, R.BUILDING, R.STREET, R.ZIPCODE, R.BORO, 
          R.CUISINE_DESCRIPTION, R.LATITUDE, R.LONGITUDE
        FROM doh_restaurants.restaurants R
'''

with engine.connect() as connection:
    restaurants = pd.read_sql(text(sql), con=connection)

2. 从CSV文件读取数据

2.1 下载并读取CSV文件

以2018年中央公园松鼠普查数据为例：

# 方法1：先下载文件再读取
!curl https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD -o squirrel_census.csv
squirrels = pd.read_csv("squirrel_census.csv")

# 方法2：直接从URL读取（适合小数据集）
url = 'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD'
squirrels = pd.read_csv(url)

2.2 数据清洗与转换

# 标准化列名
cols = squirrels.columns.map(lambda x: x.replace(' ', '_').upper())
squirrels.columns = cols

# 转换日期格式
squirrels['DATE'] = pd.to_datetime(squirrels['DATE'], format='%m%d%Y')

# 删除不需要的列
squirrels = squirrels.drop(["LAT/LONG"], axis="columns")

3. 从Excel文件读取数据

3.1 读取泰坦尼克号数据集

titanic_url = 'https://storage.googleapis.com/datasets_nyu/titanic.xls'
titanic = pd.read_excel(titanic_url)

3.2 读取数据字典

url = 'https://data.cityofnewyork.us/api/views/43nn-pn8j/files/ec33d2c8-81f5-499a-a238-0213a38239cd?download=true&filename=RestaurantInspectionDataDictionary_09242018.xlsx'
restaurants_data_dictionary = pd.read_excel(url, sheet_name=1, header=1)

# 数据清洗
restaurants_data_dictionary.columns = ['Column_Name', 'Description', 'Code_Definitions', 'Notes']
restaurants_data_dictionary = restaurants_data_dictionary.drop(0, axis='index')

4. 读取固定宽度文件

4.1 读取美国意外死亡数据

deaths = pd.read_fwf("https://storage.googleapis.com/datasets_nyu/acc-deaths.txt")

4.2 数据重塑与转换

# 使用melt函数转换数据结构
unpivoted_deaths = pd.melt(deaths, id_vars=['Year'], 
        value_vars=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
        var_name='Month', value_name='Deaths')

# 合并日期列
unpivoted_deaths["Date"] = unpivoted_deaths["Month"] + "-" + unpivoted_deaths["Year"].astype(str)
unpivoted_deaths["Date"] = pd.to_datetime(unpivoted_deaths["Date"], format='%b-%Y')

# 设置日期索引并排序
unpivoted_deaths = unpivoted_deaths.drop(["Month","Year"], axis='columns')
unpivoted_deaths = unpivoted_deaths.set_index(keys="Date")
unpivoted_deaths = unpivoted_deaths.sort_index()

# 绘制图表
unpivoted_deaths.plot()

5. 从HTML网页读取表格数据

5.1 从在线百科读取国家人口数据

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
df_list = pd.read_html(
            url,
            match='Population',
            header=0
        )