Pandas数据分析基础教程
- 1 第一章:数据载入及初步观察
- 1.1 载入数据
- 1.2 初步观察
- 1.3 保存数据
- 1.4 知道你的数据叫什么(Series与DataFrame两种基础数据结构)
- 1.4.1 任务一:pandas中有两个数据类型DateFrame和Series,通过查找简单了解他们。然后自己写一个关于这两个数据类型的小例子🌰[开放题]
- 1.4.2 任务二:根据上节课的方法载入"train.csv"文件
- 1.4.3 任务三:查看DataFrame数据的每列的名称
- 1.4.4任务四:查看"Cabin"这列的所有值[有多种方法]
- 1.4.5 任务五:加载文件"test_1.csv",然后对比"train.csv",看看有哪些多出的列,然后将多出的列删除
- 1.4.6 任务六: 将['PassengerId','Name','Age','Ticket']这几个列元素隐藏,只观察其他几个列元素
- 1.5 筛选的逻辑
- 1.6 了解你的数据吗?
- 参考资料
1 第一章:数据载入及初步观察
1.1 载入数据
数据集下载 https://www.kaggle.com/c/titanic/overview
下载后数据集存放于相对路径 path= r"…/MyAnswer/titanic"
1.1.1 任务一:导入numpy和pandas
#写入代码
import numpy as np
import pandas as pd
【提示】如果加载失败,学会如何在你的python环境下安装numpy和pandas这两个库
1.1.2 任务二:载入数据
(1) 使用相对路径载入数据
(2) 使用绝对路径载入数据
#写入代码
df_abs = pd.read_csv(r'D:\D SoftWare\坚果云同步\我的坚果云\MyJupyterFolder\hands-on-data-analysis\第一单元项目集合\MyAnswer\titanic\train.csv')
print(df_abs)
# 相对路径出错, os.getcwd()获取当前路径
import os
os.getcwd()
df_realtive = pd.read_csv(r".\MyAnswer\titanic\train.csv")
df_realtive.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
#写入代码 read_table 读取csv文件
df_table = pd.read_table(r".\MyAnswer\titanic\train.csv")
print(df_table) # 读取文件失败,需要设置正确的分隔符。 每行没遇到Tab键,因此只有一行
df_table = pd.read_table(r".\MyAnswer\titanic\train.csv")
print(df_table)
df_table.head()
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/...
1 2,1,1,"Cumings, Mrs. John Bradley (Florence Br...
2 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,S...
3 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May ...
4 5,0,3,"Allen, Mr. William Henry",male,35,0,0,3...
.. ...
886 887,0,2,"Montvila, Rev. Juozas",male,27,0,0,21...
887 888,1,1,"Graham, Miss. Margaret Edith",female,...
888 889,0,3,"Johnston, Miss. Catherine Helen ""Car...
889 890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,11...
890 891,0,3,"Dooley, Mr. Patrick",male,32,0,0,3703...
[891 rows x 1 columns]
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/...
1 2,1,1,"Cumings, Mrs. John Bradley (Florence Br...
2 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,S...
3 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May ...
4 5,0,3,"Allen, Mr. William Henry",male,35,0,0,3...
.. ...
886 887,0,2,"Montvila, Rev. Juozas",male,27,0,0,21...
887 888,1,1,"Graham, Miss. Margaret Edith",female,...
888 889,0,3,"Johnston, Miss. Catherine Helen ""Car...
889 890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,11...
890 891,0,3,"Dooley, Mr. Patrick",male,32,0,0,3703...
[891 rows x 1 columns]
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked | |
---|---|
0 | 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/... |
1 | 2,1,1,"Cumings, Mrs. John Bradley (Florence Br... |
2 | 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,S... |
3 | 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May ... |
4 | 5,0,3,"Allen, Mr. William Henry",male,35,0,0,3... |
df_table = pd.read_table(r".\MyAnswer\titanic\train.csv", sep= ',')
df_table.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
【提示】相对路径载入报错时,尝试使用os.getcwd()查看当前工作目录。
【思考】知道数据加载的方法后,试试pd.read_csv()和pd.read_table()的不同,如果想让他们效果一样,需要怎么做?了解一下’.tsv’和’.csv’的不同,如何加载这两个数据集?
1)read_table()的默认分隔符时tab键,因此读取csv文件时,需要指定特定的分隔符 sep=’,’
2)csv tsv txt 本质都是文本文件,csv采用半角逗号分割’,‘,tsv采用tab键分割,txt不指定分隔符,可以为逗号,空格,tab键等。 pd.read_csv(‘aaa.csv’) pd.read_csv(‘bbb.tsv’)
【总结】加载的数据是所有工作的第一步,我们的工作会接触到不同的数据格式(eg:.csv;.tsv;.xlsx),但是加载的方法和思路都是一样的,在以后工作和做项目的过程中,遇到之前没有碰到的问题,要多多查资料吗,使用googel,了解业务逻辑,明白输入和输出是什么。
1.1.3 任务三:每1000行为一个数据模块,逐块读取
#写入代码
df_chunker = pd.read_csv(r".\MyAnswer\titanic\train.csv", sep= ',', chunksize =300)
df_chunker
<pandas.io.parsers.TextFileReader at 0x162c75d4b08>
【思考】什么是逐块读取?为什么要逐块读取呢?
当文件较大的时候,需要逐块读取文件。
【提示】大家可以chunker(数据块)是什么类型?用for
循环打印出来出处具体的样子是什么?
指定chunkersize 后生成一个可迭代对象
for i in df_chunker: #逐块打印chunker
print(i)
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
295 296 0 1
296 297 0 3
297 298 0 1
298 299 1 1
299 300 1 1
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. ... ... ... ...
295 Lewy, Mr. Ervin G male NaN 0
296 Hanna, Mr. Mansour male 23.5 0
297 Allison, Miss. Helen Loraine female 2.0 1
298 Saalfeld, Mr. Adolphe male NaN 0
299 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. ... ... ... ... ...
295 0 PC 17612 27.7208 NaN C
296 0 2693 7.2292 NaN C
297 2 113781 151.5500 C22 C26 S
298 0 19988 30.5000 C106 S
299 1 PC 17558 247.5208 B58 B60 C
[300 rows x 12 columns]
PassengerId Survived Pclass \
300 301 1 3
301 302 1 3
302 303 0 3
303 304 1 2
304 305 0 3
.. ... ... ...
595 596 0 3
596 597 1 2
597 598 0 3
598 599 0 3
599 600 1 1
Name Sex Age SibSp Parch \
300 Kelly, Miss. Anna Katherine "Annie Kate" female NaN 0 0
301 McCoy, Mr. Bernard male NaN 2 0
302 Johnson, Mr. William Cahoone Jr male 19.0 0 0
303 Keane, Miss. Nora A female NaN 0 0
304 Williams, Mr. Howard Hugh "Harry" male NaN 0 0
.. ... ... ... ... ...
595 Van Impe, Mr. Jean Baptiste male 36.0 1 1
596 Leitch, Miss. Jessie Wills female NaN 0 0
597 Johnson, Mr. Alfred male 49.0 0 0
598 Boulos, Mr. Hanna male NaN 0 0
599 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") male 49.0 1 0
Ticket Fare Cabin Embarked
300 9234 7.7500 NaN Q
301 367226 23.2500 NaN Q
302 LINE 0.0000 NaN S
303 226593 12.3500 E101 Q
304 A/5 2466 8.0500 NaN S
.. ... ... ... ...
595 345773 24.1500 NaN S
596 248727 33.0000 NaN S
597 LINE 0.0000 NaN S
598 2664 7.2250 NaN C
599 PC 17485 56.9292 A20 C
[300 rows x 12 columns]
PassengerId Survived Pclass \
600 601 1 2
601 602 0 3
602 603 0 1
603 604 0 3
604 605 1 1
.. ... ... ...
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3
Name Sex Age SibSp \
600 Jacobsohn, Mrs. Sidney Samuel (Amy Frances Chr... female 24.0 2
601 Slabenoff, Mr. Petco male NaN 0
602 Harrington, Mr. Charles H male NaN 0
603 Torber, Mr. Ernst William male 44.0 0
604 Homer, Mr. Harry ("Mr E Haven") male 35.0 0
.. ... ... ... ...
886 Montvila, Rev. Juozas male 27.0 0
887 Graham, Miss. Margaret Edith female 19.0 0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1
889 Behr, Mr. Karl Howell male 26.0 0
890 Dooley, Mr. Patrick male 32.0 0
Parch Ticket Fare Cabin Embarked
600 1 243847 27.0000 NaN S
601 0 349214 7.8958 NaN S
602 0 113796 42.4000 NaN S
603 0 364511 8.0500 NaN S
604 0 111426 26.5500 NaN C
.. ... ... ... ... ...
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q
[291 rows x 12 columns]
# 一次读取前若干行
#df = pd.read_csv(r".\MyAnswer\titanic\train.csv")
df= pd.read_csv(r".\MyAnswer\titanic\train.csv", nrows=10)
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1.1.4 任务四:将表头改成中文,索引改为乘客ID [对于某些英文资料,我们可以通过翻译来更直观的熟悉我们的数据]
PassengerId => 乘客ID
Survived => 是否幸存
Pclass => 乘客等级(1/2/3等舱位)
Name => 乘客姓名
Sex => 性别
Age => 年龄
SibSp => 堂兄弟/妹个数
Parch => 父母与小孩个数
Ticket => 船票信息
Fare => 票价
Cabin => 客舱
Embarked => 登船港口
#方式1:读取文件后修改列名
df= pd.read_csv(r".\MyAnswer\titanic\train.csv")
df.head(3)
df.columns = ['乘客ID', '是否幸存', '乘客等级(1/2/3等舱位)','乘客姓名', '性别', '年龄','堂兄弟/妹个数', '父母与小孩个数', '船票信息','票价', '客舱', '登船港口']
df.head(3)
乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
# 方式2:读取文件过程中,直接设置对应的names
df = pd.read_csv('train.csv', names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)
#设置索引列为'乘客ID',取代自然生成的索引
df.head(3)
是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|
乘客ID | |||||||||||
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
【思考】所谓将表头改为中文其中一个思路是:将英文列名表头替换成中文。还有其他的方法吗?
1.2 初步观察
导入数据后,你可能要对数据的整体结构和样例进行概览,比如说,数据大小、有多少列,各列都是什么格式的,是否包含null等
1.2.1 任务一:查看数据的基本信息
df.describe()
是否幸存 | 仓位等级 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 票价 | |
---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 是否幸存 891 non-null int64
1 仓位等级 891 non-null int64
2 姓名 891 non-null object
3 性别 891 non-null object
4 年龄 714 non-null float64
5 兄弟姐妹个数 891 non-null int64
6 父母子女个数 891 non-null int64
7 船票信息 891 non-null object
8 票价 891 non-null float64
9 客舱 204 non-null object
10 登船港口 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
【提示】有多个函数可以这样做,你可以做一下总结
1.2.2 任务二:观察表格前10行的数据和后15行的数据
df.head(10)
是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|
乘客ID | |||||||||||
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
df.tail(15)
是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|
乘客ID | |||||||||||
877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S |
879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S |
882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
1.2.4 任务三:判断数据是否为空,为空的地方返回True,其余地方返回False
#写入代码
null_data = df.isnull()
print(null_data.head(10))
是否幸存 仓位等级 姓名 性别 年龄 兄弟姐妹个数 父母子女个数 船票信息 票价 客舱 \
乘客ID
1 False False False False False False False False False True
2 False False False False False False False False False False
3 False False False False False False False False False True
4 False False False False False False False False False False
5 False False False False False False False False False True
6 False False False False True False False False False True
7 False False False False False False False False False False
8 False False False False False False False False False True
9 False False False False False False False False False True
10 False False False False False False False False False True
登船港口
乘客ID
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
【总结】上面的操作都是数据分析中对于数据本身的观察
【思考】对于一个数据,还可以从哪些方面来观察?找找答案,这个将对下面的数据分析有很大的帮助
尝试思考:观察一个数据大小(多少行、多少列)、数据类型(每列数据是数值型or标称型)、每列数据数值分布、数据是否完整
1.3 保存数据
1.3.1 任务一:将你加载并做出改变的数据,在工作目录下保存为一个新文件train_chinese.csv
#写入代码
# 注意:不同的操作系统保存下来可能会有乱码。大家可以加入`encoding='GBK' 或者 ’encoding = ’utf-8‘‘`
df.to_csv(r".\MyAnswer\titanic\train_chinese_GBK.csv", encoding= 'GBK')
df.to_csv(r".\MyAnswer\titanic\train_chinese_utf8.csv", encoding= 'utf-8') # 'utf-8'仍然会乱码?
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-73-2d27237f9be7> in <module>
2 # 注意:不同的操作系统保存下来可能会有乱码。大家可以加入`encoding='GBK' 或者 ’encoding = ’utf-8‘‘`
3 df.to_csv(r".\MyAnswer\titanic\train_chinese_GBK.csv", encoding= 'GBK')
----> 4 df.to_csv(r".\MyAnswer\titanic\train_chinese_utf8.csv", encoding= utf-8)
NameError: name 'utf' is not defined
【总结】数据的加载以及入门,接下来就要接触数据本身的运算,我们将主要掌握numpy和pandas在工作和项目场景的运用。
1.4 知道你的数据叫什么(Series与DataFrame两种基础数据结构)
我们学习pandas的基础操作,那么上一节通过pandas加载之后的数据,其数据类型是什么呢?
开始前导入numpy和pandas
import numpy as np
import pandas as pd
1.4.1 任务一:pandas中有两个数据类型DateFrame和Series,通过查找简单了解他们。然后自己写一个关于这两个数据类型的小例子🌰[开放题]
'''
#我们举的例子
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
example_1
'''
# DateFrame和Series是pandas的两种基本数据格式
#Series 包含四部分,数据data(list,dict等), 索引index, 类型dtype, 列名name
my_sr= pd.Series( data=[1,2,3,4,5,6,7,8,9,10,11,12], dtype='float64', name='col_name' )
print(my_sr)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
Name: col_name, dtype: float64
'''
#我们举的例子
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
example_2
'''
#Series 包含四部分,数据data(list,dict等), 行索引index, 列索引columns, 类型dtype, 列名name
my_data = np.arange(0, 12).reshape(-1,3)
my_df= pd.DataFrame( data=my_data, dtype='float64', index=['a', 'b','c','d'],columns=['A', 'B', 'C'])
my_df.head()
A | B | C | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | 3.0 | 4.0 | 5.0 |
c | 6.0 | 7.0 | 8.0 |
d | 9.0 | 10.0 | 11.0 |
1.4.2 任务二:根据上节课的方法载入"train.csv"文件
#写入代码
df = pd.read_csv(r".\MyAnswer\titanic\train.csv", encoding= 'GBK') #存储为GBK格式,必须指定对应的读取格式
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
也可以加载上一节课保存的"train_chinese.csv"文件。通过翻译版train_chinese.csv熟悉了这个数据集,然后我们对trian.csv来进行操作
1.4.3 任务三:查看DataFrame数据的每列的名称
#写入代码
print(df.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
1.4.4任务四:查看"Cabin"这列的所有值[有多种方法]
df['Cabin']
0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object
df.Cabin
0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object
1.4.5 任务五:加载文件"test_1.csv",然后对比"train.csv",看看有哪些多出的列,然后将多出的列删除
经过我们的观察发现一个测试集test_1.csv有一列是多余的,我们需要将这个多余的列删去
df_test= pd.read_csv('test_1.csv')
df_test.head(3)
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100 |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100 |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100 |
df_train= pd.read_csv(r".\MyAnswer\titanic\train.csv")
df_train.head(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
# 删除多余的列
del df_test['a']
df_test.head(3)
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
【思考】还有其他的删除多余的列的方式吗?
# df.drop()方法删除列
df_train= pd.read_csv(r".\MyAnswer\titanic\train.csv")
del_df_train = df_train.drop(['Name', 'Sex'], axis=1) # 指定某一列删除,axis必须设置
print('df1:\n',del_df_train.head(3))
df_train= pd.read_csv(r".\MyAnswer\titanic\train.csv")
df_train.drop(['Name', 'Sex'], axis=1, inplace=True) # 指定某一列删除,axis必须设置
print('df2:\n',df_train.head(3))
df_train.drop(df_train.columns[0:3], axis=1, inplace=True)
print('df3:\n',df_train.head(3))
df1:
PassengerId Survived Pclass Age SibSp Parch Ticket \
0 1 0 3 22.0 1 0 A/5 21171
1 2 1 1 38.0 1 0 PC 17599
2 3 1 3 26.0 0 0 STON/O2. 3101282
Fare Cabin Embarked
0 7.2500 NaN S
1 71.2833 C85 C
2 7.9250 NaN S
df2:
PassengerId Survived Pclass Age SibSp Parch Ticket \
0 1 0 3 22.0 1 0 A/5 21171
1 2 1 1 38.0 1 0 PC 17599
2 3 1 3 26.0 0 0 STON/O2. 3101282
Fare Cabin Embarked
0 7.2500 NaN S
1 71.2833 C85 C
2 7.9250 NaN S
df3:
Age SibSp Parch Ticket Fare Cabin Embarked
0 22.0 1 0 A/5 21171 7.2500 NaN S
1 38.0 1 0 PC 17599 71.2833 C85 C
2 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
1.4.6 任务六: 将[‘PassengerId’,‘Name’,‘Age’,‘Ticket’]这几个列元素隐藏,只观察其他几个列元素
df_train= pd.read_csv(r".\MyAnswer\titanic\train.csv")
df_train.drop(['PassengerId','Name','Age','Ticket'], axis =1) #inplace =False, 不覆盖原数据,仅返回删除后的数据
【思考】对比任务五和任务六,是不是使用了不一样的方法(函数),如果使用一样的函数如何完成上面的不同的要求呢?
【思考回答】
如果想要完全的删除你的数据结构,使用inplace=True,因为使用inplace就将原数据覆盖了,所以这里没有用
隐藏某一列数据,与删除某一列数据,都可以使用drop方法
1.5 筛选的逻辑
表格数据中,最重要的一个功能就是要具有可筛选的能力,选出我所需要的信息,丢弃无用的信息。
下面我们还是用实战来学习pandas这个功能。
1.5.1 任务一: 我们以"Age"为筛选条件,显示年龄在10岁以下的乘客信息。
df_train= pd.read_csv(r".\MyAnswer\titanic\train.csv")
df_train['Age']< 10 #返回一个列向量
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Age, Length: 891, dtype: bool
df_train[ df_train['Age']< 10 ]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.00 | 3 | 1 | 349909 | 21.0750 | NaN | S |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.00 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.00 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.00 | 3 | 1 | 349909 | 21.0750 | NaN | S |
43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.00 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
827 | 828 | 1 | 2 | Mallet, Master. Andre | male | 1.00 | 0 | 2 | S.C./PARIS 2079 | 37.0042 | NaN | C |
831 | 832 | 1 | 2 | Richards, Master. George Sibley | male | 0.83 | 1 | 1 | 29106 | 18.7500 | NaN | S |
850 | 851 | 0 | 3 | Andersson, Master. Sigvard Harald Elias | male | 4.00 | 4 | 2 | 347082 | 31.2750 | NaN | S |
852 | 853 | 0 | 3 | Boulos, Miss. Nourelain | female | 9.00 | 1 | 1 | 2678 | 15.2458 | NaN | C |
869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.00 | 1 | 1 | 347742 | 11.1333 | NaN | S |
62 rows × 12 columns
1.5.2 任务二: 以"Age"为条件,将年龄在10岁以上和50岁以下的乘客信息显示出来,并将这个数据命名为midage
midage= df_train[ (df_train['Age'] <50) & (df_train['Age'] >10) ]
midage.head(10)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
【提示】了解pandas的条件筛选方式以及如何使用交集和并集操作
加括号! & | 交并集符号
(?)&(?) (?)|(?)
1.5.3 任务三:将midage的数据中第100行的"Pclass"和"Sex"的数据显示出来
#写入代码
midage.reset_index(drop= True)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
571 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
572 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
573 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
574 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
575 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
576 rows × 12 columns
【提示】在抽取数据中,我们希望数据的相对顺序保持不变,用什么函数可以达到这个效果呢?
midage.loc[[100], ['Pclass', 'Sex']]
Pclass | Sex | |
---|---|---|
100 | 3 | female |
1.5.4 任务四:使用loc方法将midage的数据中第100,105,108行的"Pclass","Name"和"Sex"的数据显示出来
midage.loc[[100,105,108], ['Pclass','Name', 'Sex']]
Pclass | Name | Sex | |
---|---|---|---|
100 | 3 | Petranec, Miss. Matilda | female |
105 | 3 | Mionoff, Mr. Stoytcho | male |
108 | 3 | Rekic, Mr. Tido | male |
1.5.5 任务五:使用iloc方法将midage的数据中第100,105,108行的"Pclass","Name"和"Sex"的数据显示出来
midage.iloc[[100,105,108],[2,3,4]]
Pclass | Name | Sex | |
---|---|---|---|
149 | 2 | Byles, Rev. Thomas Roussel Davids | male |
160 | 3 | Cribb, Mr. John Hatfield | male |
163 | 3 | Calic, Mr. Jovo | male |
【思考】对比iloc
和loc
的异同
loc 基于元素的索引
iloc 基于位置的索引
1.6 了解你的数据吗?
教材《Python for Data Analysis》第五章
1.6.1 任务一:利用Pandas对示例数据进行排序,要求升序
# 具体请看《利用Python进行数据分析》第五章 排序和排名 部分
#自己构建一个都为数字的DataFrame数据
my_data = np.arange(0, 12).reshape(-1,3)
my_df= pd.DataFrame( data=my_data, dtype='float64', index=['a', 'b','c','d'],columns=['A', 'B', 'C'])
my_df.head()
A | B | C | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | 3.0 | 4.0 | 5.0 |
c | 6.0 | 7.0 | 8.0 |
d | 9.0 | 10.0 | 11.0 |
【问题】:大多数时候我们都是想根据列的值来排序,所以将你构建的DataFrame中的数据根据某一列,升序排列
my_df.sort_values(by='B', ascending=False)
A | B | C | |
---|---|---|---|
d | 9.0 | 10.0 | 11.0 |
c | 6.0 | 7.0 | 8.0 |
b | 3.0 | 4.0 | 5.0 |
a | 0.0 | 1.0 | 2.0 |
【思考】通过书本你能说出Pandas对DataFrame数据的其他排序方式吗?
【总结】下面将不同的排序方式做一个总结
1.让行索引升序排序
#代码
my_df.sort_index()
A | B | C | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | 3.0 | 4.0 | 5.0 |
c | 6.0 | 7.0 | 8.0 |
d | 9.0 | 10.0 | 11.0 |
2.让列索引升序排序
#代码
my_df.sort_index(axis=1)
A | B | C | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | 3.0 | 4.0 | 5.0 |
c | 6.0 | 7.0 | 8.0 |
d | 9.0 | 10.0 | 11.0 |
3.让列索引降序排序
#代码
my_df.sort_index(axis=1, ascending=False)
C | B | A | |
---|---|---|---|
a | 2.0 | 1.0 | 0.0 |
b | 5.0 | 4.0 | 3.0 |
c | 8.0 | 7.0 | 6.0 |
d | 11.0 | 10.0 | 9.0 |
4.让任选两列数据同时降序排序
#代码
my_df.sort_values(by=['A', 'B'], ascending=False)
A | B | C | |
---|---|---|---|
d | 9.0 | 10.0 | 11.0 |
c | 6.0 | 7.0 | 8.0 |
b | 3.0 | 4.0 | 5.0 |
a | 0.0 | 1.0 | 2.0 |
1.6.2 任务二:对泰坦尼克号数据(trian.csv)按票价和年龄两列进行综合排序(降序排列),从这个数据中你可以分析出什么?
'''
在开始我们已经导入了train_chinese.csv数据,而且前面我们也学习了导入数据过程,根据上面学习,我们直接对目标列进行排序即可
head(20) : 读取前20条数据
'''
#代码
df = pd.read_csv(r".\MyAnswer\titanic\train_chinese.csv", encoding= 'GBK')
df.head()
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df.sort_values(by = ['票价','年龄'], ascending= False).head(30)
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.00 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.00 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.00 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.00 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
742 | 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
311 | 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
299 | 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
118 | 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
380 | 381 | 1 | 1 | Bidois, Miss. Rosalie | female | 42.00 | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
716 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38.00 | 0 | 0 | PC 17757 | 227.5250 | C45 | C |
700 | 701 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.00 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C |
557 | 558 | 0 | 1 | Robbins, Mr. Victor | male | NaN | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
527 | 528 | 0 | 1 | Farthing, Mr. John | male | NaN | 0 | 0 | PC 17483 | 221.7792 | C95 | S |
377 | 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.00 | 0 | 2 | 113503 | 211.5000 | C82 | C |
779 | 780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton Mc... | female | 43.00 | 0 | 1 | 24160 | 211.3375 | B3 | S |
730 | 731 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S |
689 | 690 | 1 | 1 | Madill, Miss. Georgette Alexandra | female | 15.00 | 0 | 1 | 24160 | 211.3375 | B5 | S |
856 | 857 | 1 | 1 | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45.00 | 1 | 1 | 36928 | 164.8667 | NaN | S |
318 | 319 | 1 | 1 | Wick, Miss. Mary Natalie | female | 31.00 | 0 | 2 | 36928 | 164.8667 | C7 | S |
268 | 269 | 1 | 1 | Graham, Mrs. William Thompson (Edith Junkins) | female | 58.00 | 0 | 1 | PC 17582 | 153.4625 | C125 | S |
609 | 610 | 1 | 1 | Shutes, Miss. Elizabeth W | female | 40.00 | 0 | 0 | PC 17582 | 153.4625 | C125 | S |
332 | 333 | 0 | 1 | Graham, Mr. George Edward | male | 38.00 | 0 | 1 | PC 17582 | 153.4625 | C91 | S |
498 | 499 | 0 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
708 | 709 | 1 | 1 | Cleaver, Miss. Alice | female | 22.00 | 0 | 0 | 113781 | 151.5500 | NaN | S |
297 | 298 | 0 | 1 | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
195 | 196 | 1 | 1 | Lurette, Miss. Elise | female | 58.00 | 0 | 0 | PC 17569 | 146.5208 | B80 | C |
【思考】排序后,如果我们仅仅关注年龄和票价两列。根据常识我知道发现票价越高的应该客舱越好,所以我们会明显看出,票价前20的乘客中存活的有14人,这是相当高的一个比例,那么我们后面是不是可以进一步分析一下票价和存活之间的关系,年龄和存活之间的关系呢?当你开始发现数据之间的关系了,数据分析就开始了。
当然,这只是我的想法,你还可以有更多想法,欢迎写在你的学习笔记中。
多做几个数据的排序
#代码
df.sort_values(by = ['年龄','是否幸存'], ascending= False).head(20)
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
630 | 631 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S |
851 | 852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S |
96 | 97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
493 | 494 | 0 | 1 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C |
116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70.5 | 0 | 0 | 370369 | 7.7500 | NaN | Q |
672 | 673 | 0 | 2 | Mitchell, Mr. Henry Michael | male | 70.0 | 0 | 0 | C.A. 24580 | 10.5000 | NaN | S |
745 | 746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0000 | B22 | S |
33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5000 | NaN | S |
54 | 55 | 0 | 1 | Ostby, Mr. Engelhart Cornelius | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C |
280 | 281 | 0 | 3 | Duane, Mr. Frank | male | 65.0 | 0 | 0 | 336439 | 7.7500 | NaN | Q |
456 | 457 | 0 | 1 | Millet, Mr. Francis Davis | male | 65.0 | 0 | 0 | 13509 | 26.5500 | E38 | S |
438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
545 | 546 | 0 | 1 | Nicholson, Mr. Arthur Ernest | male | 64.0 | 0 | 0 | 693 | 26.0000 | NaN | S |
275 | 276 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S |
483 | 484 | 1 | 3 | Turkula, Mrs. (Hedwig) | female | 63.0 | 0 | 0 | 4134 | 9.5875 | NaN | S |
570 | 571 | 1 | 2 | Harris, Mr. George | male | 62.0 | 0 | 0 | S.W./PP 752 | 10.5000 | NaN | S |
829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0000 | B28 | NaN |
252 | 253 | 0 | 1 | Stead, Mr. William Thomas | male | 62.0 | 0 | 0 | 113514 | 26.5500 | C87 | S |
555 | 556 | 0 | 1 | Wright, Mr. George | male | 62.0 | 0 | 0 | 113807 | 26.5500 | NaN | S |
170 | 171 | 0 | 1 | Van der hoef, Mr. Wyckoff | male | 61.0 | 0 | 0 | 111240 | 33.5000 | B19 | S |
df.sort_values(by = ['年龄','是否幸存'], ascending= False).tail(20)
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
766 | 767 | 0 | 1 | Brewe, Dr. Arthur Jackson | male | NaN | 0 | 0 | 112379 | 39.6000 | NaN | C |
768 | 769 | 0 | 3 | Moran, Mr. Daniel J | male | NaN | 1 | 0 | 371110 | 24.1500 | NaN | Q |
773 | 774 | 0 | 3 | Elias, Mr. Dibo | male | NaN | 0 | 0 | 2674 | 7.2250 | NaN | C |
776 | 777 | 0 | 3 | Tobin, Mr. Roger | male | NaN | 0 | 0 | 383121 | 7.7500 | F38 | Q |
778 | 779 | 0 | 3 | Kilgannon, Mr. Thomas J | male | NaN | 0 | 0 | 36865 | 7.7375 | NaN | Q |
783 | 784 | 0 | 3 | Johnston, Mr. Andrew G | male | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
790 | 791 | 0 | 3 | Keane, Mr. Andrew "Andy" | male | NaN | 0 | 0 | 12460 | 7.7500 | NaN | Q |
792 | 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
793 | 794 | 0 | 1 | Hoyt, Mr. William Fisher | male | NaN | 0 | 0 | PC 17600 | 30.6958 | NaN | C |
815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0000 | B102 | S |
825 | 826 | 0 | 3 | Flynn, Mr. John | male | NaN | 0 | 0 | 368323 | 6.9500 | NaN | Q |
826 | 827 | 0 | 3 | Lam, Mr. Len | male | NaN | 0 | 0 | 1601 | 56.4958 | NaN | S |
832 | 833 | 0 | 3 | Saad, Mr. Amin | male | NaN | 0 | 0 | 2671 | 7.2292 | NaN | C |
837 | 838 | 0 | 3 | Sirota, Mr. Maurice | male | NaN | 0 | 0 | 392092 | 8.0500 | NaN | S |
846 | 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
859 | 860 | 0 | 3 | Razi, Mr. Raihed | male | NaN | 0 | 0 | 2629 | 7.2292 | NaN | C |
863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
1.6.3 任务三:利用Pandas进行算术计算,计算两个DataFrame数据相加结果
# 具体请看《利用Python进行数据分析》第五章 算术运算与数据对齐 部分
#自己构建两个都为数字的DataFrame数据
"""
我们举了一个例子:
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
columns=['a', 'b', 'c'],
index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
columns=['a', 'e', 'c'],
index=['first', 'one', 'two', 'second'])
frame1_a
"""
#代码
df1 = pd.DataFrame(np.arange(0,9).reshape(3, 3),
columns=['a', 'b', 'c'],
index=['one', 'two', 'three'])
df1
a | b | c | |
---|---|---|---|
one | 0 | 1 | 2 |
two | 3 | 4 | 5 |
three | 6 | 7 | 8 |
df2 = pd.DataFrame(np.arange(1,13).reshape(4, 3),
columns=['one', 'two', 'three'],
index=['a', 'e', 'c','d' ])
df2
one | two | three | |
---|---|---|---|
a | 1 | 2 | 3 |
e | 4 | 5 | 6 |
c | 7 | 8 | 9 |
d | 10 | 11 | 12 |
将frame_a和frame_b进行相加
#代码
df1+ df2
a | b | c | one | three | two | |
---|---|---|---|---|---|---|
a | NaN | NaN | NaN | NaN | NaN | NaN |
c | NaN | NaN | NaN | NaN | NaN | NaN |
d | NaN | NaN | NaN | NaN | NaN | NaN |
e | NaN | NaN | NaN | NaN | NaN | NaN |
one | NaN | NaN | NaN | NaN | NaN | NaN |
three | NaN | NaN | NaN | NaN | NaN | NaN |
two | NaN | NaN | NaN | NaN | NaN | NaN |
【提醒】两个DataFrame相加后,会返回一个新的DataFrame,对应的行和列的值会相加,没有对应的会变成空值NaN。
当然,DataFrame还有很多算术运算,如减法,除法等,有兴趣的同学可以看《利用Python进行数据分析》第五章 算术运算与数据对齐 部分,多在网络上查找相关学习资料。
1.6.4 任务四:通过泰坦尼克号数据如何计算出在船上最大的家族有多少人?
'''
还是用之前导入的chinese_train.csv如果我们想看看在船上,最大的家族有多少人(‘兄弟姐妹个数’+‘父母子女个数’),我们该怎么做呢?
'''
df['兄弟姐妹个数']
0 1
1 1
2 0
3 1
4 0
..
886 0
887 0
888 1
889 0
890 0
Name: 兄弟姐妹个数, Length: 891, dtype: int64
df['父母子女个数']
0 0
1 0
2 0
3 0
4 0
..
886 0
887 0
888 2
889 0
890 0
Name: 父母子女个数, Length: 891, dtype: int64
max( df['兄弟姐妹个数']+ df['父母子女个数'])
10
【提醒】我们只需找出”兄弟姐妹个数“和”父母子女个数“之和最大的数,当然你还可以想出很多方法和思考角度,欢迎你来说出你的看法。
多做几个数据的相加,看看你能分析出什么?
1.6.5 任务五:学会使用Pandas describe()函数查看数据基本统计信息
调用 describe 函数,观察frame2的数据基本信息
df2.describe()
1.6.6 任务六:分别看看泰坦尼克号数据集中 票价、父母子女 这列数据的基本统计数据,你能发现什么?
'''
看看泰坦尼克号数据集中 票价 这列数据的基本统计数据
'''
#代码
df['票价'].describe()
df['父母子女个数'].describe()
count 891.000000
mean 0.381594
std 0.806057
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 6.000000
Name: 父母子女个数, dtype: float64