Notes: Learning pandas, Part 1

Importing pandas

import pandas as pd

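1. The dictionary keys become the column headers, and the values in each list become the entries of that column, one per row of the DataFrame.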

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"]}
)
df
                       Name  Age     Sex
0   Braund, Mr. Owen Harris   22    male
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female
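To make the key-to-column mapping concrete, here is a minimal sketch with made-up data (the names `people`, `column_labels` and the values are hypothetical, not taken from the Titanic set). It also checks what happens when the lists have different lengths.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
data = {
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [31, 45, 27],
}
people = pd.DataFrame(data)

# Each key becomes a column label; each list fills that column top to bottom.
column_labels = list(people.columns)
n_rows, n_cols = people.shape

# Lists of unequal length cannot form a rectangular table.
try:
    pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [31]})
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

The column order follows the order of the keys in the dictionary, and the row labels default to 0, 1, 2, and so on.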

2. Each column of a DataFrame is a Series. A Series is one-dimensional and is not itself a DataFrame; a Series can also be created on its own.

df["Age"]
0    22
1    35
2    58
Name: Age, dtype: int64
ages = pd.Series([22, 35, 58], name="Age")
ages
0    22
1    35
2    58
Name: Age, dtype: int64

3. Operating on the data

df["Age"].max()
58
ages.max()
58
df.describe()  # gives a quick statistical overview of the numeric columns of the DataFrame
# Name and Sex hold text data, so describe() leaves them out by default.
             Age
count   3.000000
mean   38.333333
std    18.230012
min    22.000000
25%    28.500000
50%    35.000000
75%    46.500000
max    58.000000
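The default only covers numeric columns. As a small sketch that goes slightly beyond the note above, `describe(include="all")` and calling `describe()` on a text column directly are two ways to get a summary of the text columns as well. The variable names here are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})

numeric_summary = df.describe()             # only the Age column
full_summary = df.describe(include="all")   # all three columns
sex_summary = df["Sex"].describe()          # count, unique, top, freq for one text column
```

For a text column the summary reports how many values there are, how many are distinct, and which value is the most frequent, instead of mean and quartiles.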

Reading and writing tabular data

1. read_csv() reads data stored in a CSV file into a DataFrame

titanic = pd.read_csv("data/titanic.csv")  # load the CSV file into a DataFrame
titanic
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
..   ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...
886  887          0         2       Montvila, Rev. Juozas                              male    27.0  0      0      211536            13.0000  NaN    S
887  888          1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      112053            30.0000  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"           female  NaN   1      2      W./C. 6607        23.4500  NaN    S
889  890          1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      111369            30.0000  C148   C
890  891          0         3       Dooley, Mr. Patrick                                male    32.0  0      0      370376            7.7500   NaN    Q

891 rows × 12 columns
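If data/titanic.csv is not at hand, the same call can be tried on CSV text held in memory. This is a minimal sketch: the three rows are a tiny excerpt typed inline (with only three of the twelve columns), and `csv_text` and `sample` are names I made up.

```python
import io
import pandas as pd

# A tiny inline excerpt; the last row has no Age value.
csv_text = """PassengerId,Name,Age
1,"Braund, Mr. Owen Harris",22
3,"Heikkinen, Miss. Laina",26
6,"Moran, Mr. James",
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

missing_ages = int(sample["Age"].isna().sum())
```

Two details carry over to the full file: quoted fields may contain commas, and an empty field is read as NaN, which is why the Age column ends up as float64 rather than int64.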

2. Use head() to view the first n rows and tail() to view the last m rows

titanic.head(8)
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
5    6            0         3       Moran, Mr. James                                   male    NaN   0      0      330877            8.4583   NaN    Q
6    7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
7    8            0         3       Palsson, Master. Gosta Leonard                     male    2.0   3      1      349909            21.0750  NaN    S
titanic.tail(3)
     PassengerId  Survived  Pclass  Name                                      Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"  female  NaN   1      2      W./C. 6607        23.45    NaN    S
889  890          1         1       Behr, Mr. Karl Howell                     male    26.0  0      0      111369            30.00    C148   C
890  891          0         3       Dooley, Mr. Patrick                       male    32.0  0      0      370376            7.75     NaN    Q

3. The dtypes attribute: the data type of each column

titanic.dtypes  # the column types in this DataFrame are integers (int64), floats (float64), and strings (object)
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

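4. Storing data

  1. read_* functions read data into pandas.
  2. to_* methods write data out.
  3. The to_excel() method stores the data as an Excel file.
titanic.to_excel('titanic.xlsx', sheet_name='passengers', index=False)  # the sheet is named "passengers"; with index=False the row index labels are not saved in the spreadsheet
titanic = pd.read_excel('titanic.xlsx', sheet_name='passengers')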
titanic.head()
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
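To see what index=False changes, here is a sketch of the same write-then-read round trip. Note the substitution: it uses to_csv()/read_csv() instead of to_excel()/read_excel() so that it runs without an Excel engine installed, and it writes to a string rather than a file. The frame and variable names are made up.

```python
import io
import pandas as pd

frame = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [31, 45]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = frame.to_csv()                 # row labels written as an unnamed first column
without_index = frame.to_csv(index=False)   # only the data columns

restored = pd.read_csv(io.StringIO(without_index))
restored_with_extra = pd.read_csv(io.StringIO(with_index))
extra_column = restored_with_extra.columns[0]
```

Reading back the version that kept the index produces an extra column, which is usually the reason to pass index=False when the row labels are just 0, 1, 2, and so on.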

5. The info() method provides a technical summary of a DataFrame

titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
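The "Non-Null Count" column is the quickest way to spot missing data: above, Age has 714 of 891 values and Cabin only 204. A small sketch on a made-up frame shows that the same counts are available programmatically through count(), and that the info() report can be captured as text through its buf parameter. The names `frame`, `report` and `non_null` are my own.

```python
import io
import pandas as pd

# Hypothetical frame with some missing values.
frame = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
})

# info() prints to stdout by default; buf redirects the report.
buffer = io.StringIO()
frame.info(buf=buffer)
report = buffer.getvalue()

# count() returns the number of non-null values per column.
non_null = frame.count()
```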

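Summary:

  1. read_* functions import data from many different file formats and data sources.

  2. The various to_* methods export data.

  3. The head/tail/info methods and the dtypes attribute give a first look at a DataFrame.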

Selecting subsets

1. Selecting a single column

ages=titanic['Age']
ages.head()
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
type(titanic["Age"])  # each column of a DataFrame is a Series
pandas.core.series.Series
titanic["Age"].shape  # a single column is returned as a one-dimensional Series, so the shape holds only the row count
(891,)

2. Selecting multiple columns

age_sex=titanic[["Age","Sex"]]
age_sex.head()
    Age     Sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male
type(titanic[["Age","Sex"]])
pandas.core.frame.DataFrame
titanic[["Age","Sex"]].shape  # a DataFrame with 891 rows and 2 columns; a DataFrame is two-dimensional, with a row and a column dimension
(891, 2)
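The difference between single and double brackets is easy to miss, so here is a minimal sketch on a made-up two-row frame (values borrowed from the first two passengers; the variable names are hypothetical).

```python
import pandas as pd

frame = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Sex": ["male", "female"],
    "Fare": [7.25, 71.2833],
})

one_col = frame["Age"]              # a string selects one column: Series, shape (2,)
one_col_frame = frame[["Age"]]      # a one-item list still gives a DataFrame, shape (2, 1)
two_cols = frame[["Age", "Sex"]]    # a list of names: DataFrame, shape (2, 2)
```

The inner brackets are an ordinary Python list of column names; the outer brackets are the selection itself.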

Filtering specific rows

above_35=titanic[titanic["Age"]>35]
above_35.head()
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
6    7            0         1       McCarthy, Mr. Timothy J                            male    54.0  0      0      17463             51.8625  E46    S
11   12           1         1       Bonnell, Miss. Elizabeth                           female  58.0  0      0      113783            26.5500  C103   S
13   14           0         3       Andersson, Mr. Anders Johan                        male    39.0  1      5      347082            31.2750  NaN    S
15   16           1         2       Hewlett, Mrs. (Mary D Kingcome)                    female  55.0  0      0      248706            16.0000  NaN    S
titanic["Age"]>35
0      False
1       True
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
above_35.shape
(217, 12)
class_23=titanic[titanic["Pclass"].isin([2,3])]
class_23.head()
     PassengerId  Survived  Pclass  Name                            Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris         male    22.0  1      0      A/5 21171         7.2500   NaN    S
2    3            1         3       Heikkinen, Miss. Laina          female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4    5            0         3       Allen, Mr. William Henry        male    35.0  0      0      373450            8.0500   NaN    S
5    6            0         3       Moran, Mr. James                male    NaN   0      0      330877            8.4583   NaN    Q
7    8            0         3       Palsson, Master. Gosta Leonard  male    2.0   3      1      349909            21.0750  NaN    S

1. The isin() conditional function

  1. Like a conditional expression, it returns a boolean Series.
  2. It returns True for each row whose value is in the provided list. The same filter can be written as two comparisons joined with |:
class_23=titanic[(titanic["Pclass"]==2)|(titanic["Pclass"]==3)]
class_23.head()
     PassengerId  Survived  Pclass  Name                            Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris         male    22.0  1      0      A/5 21171         7.2500   NaN    S
2    3            1         3       Heikkinen, Miss. Laina          female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4    5            0         3       Allen, Mr. William Henry        male    35.0  0      0      373450            8.0500   NaN    S
5    6            0         3       Moran, Mr. James                male    NaN   0      0      330877            8.4583   NaN    Q
7    8            0         3       Palsson, Master. Gosta Leonard  male    2.0   3      1      349909            21.0750  NaN    S

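Note

  1. When combining multiple conditions, each condition must be wrapped in parentheses ().
  2. The Python keywords or and and cannot be used; use the operators | and & instead.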
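Both notes can be checked on a small made-up frame. The sketch below confirms that isin() and the |-joined comparisons give the same mask, combines two conditions with &, and shows what happens when the keyword `and` is used instead. All names and values are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Age": [40.0, 20.0, 50.0, None],
})

# Two ways to express "Pclass is 2 or 3".
in_23 = frame["Pclass"].isin([2, 3])
either = (frame["Pclass"] == 2) | (frame["Pclass"] == 3)
same_mask = in_23.equals(either)

# Element-wise AND; each comparison sits in its own parentheses.
older_in_23 = frame[(frame["Age"] > 35) & in_23]

# The keyword `and` asks for a single True/False from a whole Series,
# which pandas refuses with a ValueError.
try:
    frame[(frame["Age"] > 35) and in_23]
    and_rejected = False
except ValueError:
    and_rejected = True
```

The parentheses matter because | and & bind more tightly than comparison operators such as > and ==, so without them Python would group the expression differently.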

2. notna(): returns True for each row whose value is not null

age_no_na=titanic[titanic["Age"].notna()]
age_no_na.shape
(714, 12)
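A minimal sketch of the same idea on a made-up Series. It also shows isna(), the opposite test, and dropna(), which as far as I know gives the same result as filtering a single Series with its own notna() mask. The names are hypothetical.

```python
import pandas as pd

ages = pd.Series([22.0, None, 26.0, None], name="Age")

known = ages[ages.notna()]                # keep rows where Age is present
missing_count = int(ages.isna().sum())    # True counts as 1 when summed
dropped = ages.dropna()                   # shortcut for the same filter on one Series
```

On the full dataset this matches the numbers above: 891 rows in total, 714 with a known age.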

Selecting specific rows and columns

1. loc[]: selection by row and column labels

adult_names=titanic.loc[titanic["Age"]>35,"Name"]
adult_names.head()
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object
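In loc[], the part before the comma picks rows and the part after it picks columns. A short sketch on a made-up frame shows how the type of the result depends on whether one column name or a list of names is given. The names and values are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [40.0, 20.0, 50.0, 36.0],
    "Sex": ["female", "male", "male", "female"],
})

over_35 = frame["Age"] > 35

names_only = frame.loc[over_35, "Name"]             # one column name: Series
name_and_age = frame.loc[over_35, ["Name", "Age"]]  # list of names: DataFrame
all_columns = frame.loc[over_35, :]                 # a colon keeps every column
```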

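2. iloc[]: selection by position

titanic.iloc[9:25,2:5]  # rows 10 through 25 and columns 3 through 5 (positions 9 to 24 and 2 to 4; the end of each slice is excluded)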
    Pclass  Name                                               Sex
9   2       Nasser, Mrs. Nicholas (Adele Achem)                female
10  3       Sandstrom, Miss. Marguerite Rut                    female
11  1       Bonnell, Miss. Elizabeth                           female
12  3       Saundercock, Mr. William Henry                     male
13  3       Andersson, Mr. Anders Johan                        male
14  3       Vestrom, Miss. Hulda Amanda Adolfina               female
15  2       Hewlett, Mrs. (Mary D Kingcome)                    female
16  3       Rice, Master. Eugene                               male
17  2       Williams, Mr. Charles Eugene                       male
18  3       Vander Planke, Mrs. Julius (Emelia Maria Vande...  female
19  3       Masselmani, Mrs. Fatima                            female
20  2       Fynney, Mr. Joseph J                               male
21  2       Beesley, Mr. Lawrence                              male
22  3       McGowan, Miss. Anna "Annie"                        female
23  1       Sloper, Mr. William Thompson                       male
24  3       Palsson, Miss. Torborg Danira                      female
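Slices behave differently in iloc[] and loc[], which is a common source of off-by-one mistakes. A sketch on a made-up 30-row frame with columns a to e (all names hypothetical):

```python
import pandas as pd

frame = pd.DataFrame({name: range(30) for name in "abcde"})

# iloc: positions, end excluded (like ordinary Python slicing).
block = frame.iloc[9:25, 2:5]

# loc: labels, end included.
by_label = frame.loc[9:25, "c":"e"]
```

With the default 0, 1, 2, ... row labels, `iloc[9:25]` stops at row label 24 (16 rows), while `loc[9:25]` also includes label 25 (17 rows). Both column slices cover c, d and e.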

3. Replacing values in the table

titanic.iloc[:3,3]="YuHongxia"  # set the fourth column (Name) of the first three rows to "YuHongxia"
titanic.head()
     PassengerId  Survived  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       YuHongxia                                     male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       YuHongxia                                     female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       YuHongxia                                     female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                      male    35.0  0      0      373450            8.0500   NaN    S
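Assignment works the same way through loc[], including with a condition as the row selector. A minimal sketch on a made-up frame, working on a copy so the original values stay available (all names and values are hypothetical):

```python
import pandas as pd

original = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [40, 20, 50, 36],
})
frame = original.copy()   # assignment modifies the frame in place, so keep a copy

frame.iloc[:2, 0] = "Unknown"                     # by position: first two rows, first column
frame.loc[frame["Age"] > 45, "Name"] = "Senior"   # by condition and column label

new_names = frame["Name"].tolist()
old_names = original["Name"].tolist()
```

Selecting rows and the column in a single loc[] or iloc[] call, rather than chaining two selections, makes sure the assignment reaches the frame itself.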

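Summary:

  1. When selecting subsets of data, use square brackets [], not parentheses ().

  2. Inside the brackets you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression, or a colon.

  3. Use loc to select specific rows and columns by their row and column labels.

  4. Use iloc to select specific rows and columns by their position in the table.

  5. New values can be assigned to a selection made with loc or iloc.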

