导入pandas
import pandas as pd
1. 字典键将用作列标题,而每个列表中的值将用作的行DataFrame。
In [ 2 ] : df = pd. DataFrame( {
. . . : "Name" : [ "Braund, Mr. Owen Harris" ,
. . . : "Allen, Mr. William Henry" ,
. . . : "Bonnell, Miss. Elizabeth" ] ,
. . . : "Age" : [ 22 , 35 , 58 ] ,
. . . : "Sex" : [ "male" , "male" , "female" ] }
. . . : )
df
Name Age Sex 0 Braund, Mr. Owen Harris 22 male 1 Allen, Mr. William Henry 35 male 2 Bonnell, Miss. Elizabeth 58 female
2. DataFrame的每一列都是一个Series,并且每个Series都又是一个DataFrame
df[ "Age" ]
0 22
1 35
2 58
Name: Age, dtype: int64
ages = pd. Series( [ 22 , 35 , 58 ] , name= "Age" )
ages
0 22
1 35
2 58
Name: Age, dtype: int64
3. 对数据进行操作
df[ "Age" ] . max ( )
58
ages. max ( )
58
df. describe( )
Age count 3.000000 mean 38.333333 std 18.230012 min 22.000000 25% 28.500000 50% 35.000000 75% 46.500000 max 58.000000
读取和写入表格数据
1.read_csv()将存储为csv文件的数据读入
titanic= pd. read_csv( "data/titanic.csv" )
titanic
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S ... ... ... ... ... ... ... ... ... ... ... ... ... 886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q
891 rows × 12 columns
2.使用head()查看前n行,tail()查看后m行
titanic. head( 8 )
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
titanic. tail( 3 )
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
3. dtypes属性:解释每种列数据类型
titanic. dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
4.数据存储
read_*函数用来读取数据 to_*方法用于存储数据 to_excel()方法将数据存储为excel文件。
titanic. to_excel( 'titanic.xlsx' , sheet_name= 'passengers' , index= False )
titanic= pd. read_excel( 'titanic.xlsx' , sheet_name= 'passengers' )
titanic. head( )
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5. 方法info()提供有关的技术信息
titanic. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
小结:
read_*函数支持将数据从许多不同的文件格式或数据源导入。
通过不同的to_*方法可以将数据导出 。
head/ tail/ info方法和dtypes属性。
选择子集
1.选择单列
ages= titanic[ 'Age' ]
ages. head( )
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
type ( titanic[ "Age" ] )
pandas.core.series.Series
titanic[ "Age" ] . shape
(891,)
2.选择多列
age_sex= titanic[ [ "Age" , "Sex" ] ]
age_sex. head( )
Age Sex 0 22.0 male 1 38.0 female 2 26.0 female 3 35.0 female 4 35.0 male
type ( titanic[ [ "Age" , "Sex" ] ] )
pandas.core.frame.DataFrame
titanic[ [ "Age" , "Sex" ] ] . shape
(891, 2)
过滤特定行
above_35= titanic[ titanic[ "Age" ] > 35 ]
above_35. head( )
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S 13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S 15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
titanic[ "Age" ] > 35
0 False
1 True
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Age, Length: 891, dtype: bool
above_35. shape
(217, 12)
class_23= titanic[ titanic[ "Pclass" ] . isin( [ 2 , 3 ] ) ]
class_23. head( )
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
1.isin()条件函数
与条件表达式类似,返回布尔型 为在列表中的每一行某个不为空的值都返回一个True
( the isin() conditional function returns a True for each row the values are in the provided list.)
class_23= titanic[ ( titanic[ "Pclass" ] == 2 ) | ( titanic[ "Pclass" ] == 3 ) ]
class_23. head( )
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
注意
组合多个条件语句时,每个条件必须用括号括起来()。 不能使用 or,and只能使用|和&。
2.notna():为每一行的某一个not Null的值返回一个True
age_no_na= titanic[ titanic[ "Age" ] . notna( ) ]
age_no_na. shape
(714, 12)
选择特定的行和列
1. loc[] :行和列名
adult_names= titanic. loc[ titanic[ "Age" ] > 35 , "Name" ]
adult_names. head( )
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
6 McCarthy, Mr. Timothy J
11 Bonnell, Miss. Elizabeth
13 Andersson, Mr. Anders Johan
15 Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object
2. iloc[]:基于位置
titanic. iloc[ 9 : 25 , 2 : 5 ]
Pclass Name Sex 9 2 Nasser, Mrs. Nicholas (Adele Achem) female 10 3 Sandstrom, Miss. Marguerite Rut female 11 1 Bonnell, Miss. Elizabeth female 12 3 Saundercock, Mr. William Henry male 13 3 Andersson, Mr. Anders Johan male 14 3 Vestrom, Miss. Hulda Amanda Adolfina female 15 2 Hewlett, Mrs. (Mary D Kingcome) female 16 3 Rice, Master. Eugene male 17 2 Williams, Mr. Charles Eugene male 18 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 19 3 Masselmani, Mrs. Fatima female 20 2 Fynney, Mr. Joseph J male 21 2 Beesley, Mr. Lawrence male 22 3 McGowan, Miss. Anna "Annie" female 23 1 Sloper, Mr. William Thompson male 24 3 Palsson, Miss. Torborg Danira female
3.将表中元素进行替换
titanic. iloc[ : 3 , 3 ] = "YuHongxia"
titanic. head( )
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 YuHongxia male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 YuHongxia female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 YuHongxia female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
小结:
选择数据子集时,使用方括号[]不是()。
在括号内,可以使用单个列/行标签,列/行标签列表,标签切片,条件表达式或冒号。
选择特定的行和列时,如果使用行和列名可以使用loc
选择特定的行和列时,如果使用表中的位置可以使用iloc
可以基于loc/ iloc为选择分配新值。