1. Viewing the data
- Viewing the first and last rows of the frame:
For details, see the pandas 0.25.0 documentation.
    data1 = data.head(6)  # first six rows
    data2 = data.tail(6)  # last six rows
    print(data1)
    print('----'*50)
    print(data2)

         Rank        City           State  Population  Date of census/estimate
    0       1   London[2]  United Kingdom   8,615,246                 1-Jun-14
    1       2      Berlin         Germany   3,437,916                31-May-14
    2       3      Madrid           Spain   3,165,235                 1-Jan-14
    3       4        Rome           Italy   2,872,086                30-Sep-14
    4       5       Paris          France   2,273,305                 1-Jan-13
    5       6   Bucharest         Romania   1,883,425                20-Oct-11
    ---------------------------------------------------------------------------
         Rank        City           State  Population  Date of census/estimate
    99    100  Valladolid           Spain     311,501                 1-Jan-12
    100   101        Bonn         Germany     309,869                31-Dec-12
    101   102       Malmö          Sweden     309,105                31-Mar-13
    102   103  Nottingham  United Kingdom     308,735                30-Jun-12
    103   104    Katowice          Poland     308,269                30-Jun-12
    104   105      Kaunas       Lithuania     306,888                 1-Jan-13
- Displaying the index, the columns, and the underlying NumPy data:
    print(data.index)
    print('--'*40)
    print(data.columns)
    print('--'*40)
    print(data.values)

    RangeIndex(start=0, stop=105, step=1)
    --------------------------------------------------------------------------------
    Index(['Rank', 'City', 'State', 'Population', 'Date of census/estimate'], dtype='object')
    --------------------------------------------------------------------------------
    [[1 'London[2]' ' United Kingdom' '8,615,246' '1-Jun-14']
     [2 'Berlin' ' Germany' '3,437,916' '31-May-14']
     [3 'Madrid' ' Spain' '3,165,235' '1-Jan-14']
     [4 'Rome' ' Italy' '2,872,086' '30-Sep-14']
     [5 'Paris' ' France' '2,273,305' '1-Jan-13']
     ...
     [101 'Bonn' ' Germany' '309,869' '31-Dec-12']
     [102 'Malmö' ' Sweden' '309,105' '31-Mar-13']
     [103 'Nottingham' ' United Kingdom' '308,735' '30-Jun-12']
     [104 'Katowice' ' Poland' '308,269' '30-Jun-12']
     [105 'Kaunas' ' Lithuania' '306,888' '1-Jan-13']]
- describe() gives a quick statistical summary of the data (here only Rank is numeric, so it is the only column summarized):
    print(data.describe())

                 Rank
    count  105.000000
    mean    53.057143
    std     30.428298
    min      1.000000
    25%     27.000000
    50%     53.000000
    75%     79.000000
    max    105.000000
- Transposing the data (swapping rows and columns):
    print(data)
    print('--'*40)
    print(data.T)

         Rank        City           State  Population  Date of census/estimate
    0       1   London[2]  United Kingdom   8,615,246                 1-Jun-14
    1       2      Berlin         Germany   3,437,916                31-May-14
    2       3      Madrid           Spain   3,165,235                 1-Jan-14
    3       4        Rome           Italy   2,872,086                30-Sep-14
    4       5       Paris          France   2,273,305                 1-Jan-13
    ..    ...         ...             ...         ...                      ...
    100   101        Bonn         Germany     309,869                31-Dec-12
    101   102       Malmö          Sweden     309,105                31-Mar-13
    102   103  Nottingham  United Kingdom     308,735                30-Jun-12
    103   104    Katowice          Poland     308,269                30-Jun-12
    104   105      Kaunas       Lithuania     306,888                 1-Jan-13

    [105 rows x 5 columns]
    --------------------------------------------------------------------------------
                                         0          1  ...        103        104
    Rank                                 1          2  ...        104        105
    City                         London[2]     Berlin  ...   Katowice     Kaunas
    State                   United Kingdom    Germany  ...     Poland  Lithuania
    Population                   8,615,246  3,437,916  ...    308,269    306,888
    Date of census/estimate       1-Jun-14  31-May-14  ...  30-Jun-12   1-Jan-13

    [5 rows x 105 columns]
- Sorting by an axis:
    # axis=0 sorts by the row labels, axis=1 sorts by the column labels;
    # ascending=False gives descending order
    print(data.sort_index(axis=0, ascending=False))

         Rank        City           State  Population  Date of census/estimate
    104   105      Kaunas       Lithuania     306,888                 1-Jan-13
    103   104    Katowice          Poland     308,269                30-Jun-12
    102   103  Nottingham  United Kingdom     308,735                30-Jun-12
    101   102       Malmö          Sweden     309,105                31-Mar-13
    100   101        Bonn         Germany     309,869                31-Dec-12
    ..    ...         ...             ...         ...                      ...
    4       5       Paris          France   2,273,305                 1-Jan-13
    3       4        Rome           Italy   2,872,086                30-Sep-14
    2       3      Madrid           Spain   3,165,235                 1-Jan-14
    1       2      Berlin         Germany   3,437,916                31-May-14
    0       1   London[2]  United Kingdom   8,615,246                 1-Jun-14
- Sorting by values:
    print(data.sort_values(['City']))

         Rank        City           State  Population  Date of census/estimate
    91     92      Aarhus         Denmark     326,676                 1-Oct-14
    85     86    Alicante           Spain     334,678                 1-Jan-12
    22     23   Amsterdam     Netherlands     813,562                31-May-14
    58     59     Antwerp         Belgium     510,610                 1-Jan-14
    33     34      Athens          Greece     664,046                24-May-11
    ..    ...         ...             ...         ...                      ...
    34     35     Wrocław          Poland     632,432                31-Mar-14
    82     83   Wuppertal         Germany     342,885                31-Dec-12
    23     24      Zagreb         Croatia     790,017                31-Mar-11
    32     33    Zaragoza           Spain     666,058                 1-Jan-14
    27     28        Łódź          Poland     709,757                31-Mar-14
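The snippets in this section rely on a `data` frame loaded earlier. To try the same calls without the original file, here is a minimal sketch that rebuilds a stand-in frame from the first six rows of the output above. The stand-in and the helper variable names (`first_two`, `by_city`, and so on) are assumptions made for illustration, not the author's loading code.

```python
import pandas as pd

# Hypothetical stand-in for the real table: six rows copied from the output above.
data = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5, 6],
    "City": ["London[2]", "Berlin", "Madrid", "Rome", "Paris", "Bucharest"],
    "State": ["United Kingdom", "Germany", "Spain", "Italy", "France", "Romania"],
    "Population": ["8,615,246", "3,437,916", "3,165,235",
                   "2,872,086", "2,273,305", "1,883,425"],
})

first_two = data.head(2)             # first two rows
last_two = data.tail(2)              # last two rows
by_city = data.sort_values("City")   # rows ordered alphabetically by City
reversed_rows = data.sort_index(axis=0, ascending=False)  # row labels in descending order
summary = data.describe()            # only the numeric Rank column is summarized

print(first_two)
print(by_city["City"].tolist())
print(summary)
```

Because Population is still stored as text at this point, `describe()` skips it; that is why the summary above only covers Rank.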
2. Selection
- Getting
- Selecting a single column returns a Series; this is equivalent to data.State
    print(data['State'])

    0       United Kingdom
    1              Germany
    2                Spain
    3                Italy
    4               France
                ...
    100            Germany
    101             Sweden
    102     United Kingdom
    103             Poland
    104          Lithuania
    Name: State, Length: 105, dtype: object
- Selecting with [] slices the rows
    print(data[:5])

         Rank        City           State  Population  Date of census/estimate
    0       1   London[2]  United Kingdom   8,615,246                 1-Jun-14
    1       2      Berlin         Germany   3,437,916                31-May-14
    2       3      Madrid           Spain   3,165,235                 1-Jan-14
    3       4        Rome           Italy   2,872,086                30-Sep-14
    4       5       Paris          France   2,273,305                 1-Jan-13
- Selection by label
- Getting a cross section using a label
    print(data.loc[data.index[0]])

    Rank                                    1
    City                            London[2]
    State                      United Kingdom
    Population                      8,615,246
    Date of census/estimate          1-Jun-14
    Name: 0, dtype: object
- Selecting on multiple axes by label
    print(data.loc[:, ['State', 'Population']])

                  State Population
    0    United Kingdom  8,615,246
    1           Germany  3,437,916
    2             Spain  3,165,235
    3             Italy  2,872,086
    4            France  2,273,305
    ..              ...        ...
    100         Germany    309,869
    101          Sweden    309,105
    102  United Kingdom    308,735
    103          Poland    308,269
    104       Lithuania    306,888

    [105 rows x 2 columns]
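- Label slicing
    print(data.loc[1: 4, ['State', 'Population']])
    print(data.loc[1: 4, 'Rank':'Population'])
    print(data.loc[[1, 3], 'City':'Population'])
    # With loc you can select by a slice of labels or by individual labels. Note that a
    # label slice is not an ordinary Python slice: both endpoints are included.
    # A two-dimensional array is really just a table: it has fields and values, and it is
    # built from two axes. axis=0 runs along the row index, axis=1 along the columns.
    # The axes are how elements are located. In statistical work we usually operate on data
    # in bulk, so the data is held in list-like containers and addressed by label. Each
    # field name is a key, and the value attached to it is the whole run of data behind it.
    # That may be the essence of a two-dimensional array: every row and every column is a
    # 'key' = 'value' pair, which is a Series, and the structure woven from Series is a
    # DataFrame.
    # This is only my own understanding.
The structure of this table is a DataFrame:
    type    index  Series  Series  Series
    -       -      Rank    State   Population
    Series  0      1       A       A
    Series  1      2       B       B
- Reducing the dimensions of the returned object
    # Selecting one row label with a list of columns returns a
    # <class 'pandas.core.series.Series'>
    print(data.loc[1, ['State', 'Population']])

    State           Germany
    Population    3,437,916
    Name: 1, dtype: object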
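- Getting a scalar value
    # Like peeling an onion: each extra label strips away one more layer
    print(data.loc[1, 'Population'])
    print(type(data.loc[1, 'Population']))

    3,437,916
    <class 'str'>
- Fast access to a scalar (equivalent to the previous method)
    # Equivalent to the .loc call above
    print(data.at[1, 'Population'])
    print(type(data.at[1, 'Population']))

    3,437,916
    <class 'str'>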
- Selection by position
- Selecting by passing an integer position (this selects a row)
    print(data.iloc[1])

    Rank                               2
    City                          Berlin
    State                        Germany
    Population                 3,437,916
    Date of census/estimate    31-May-14
    Name: 1, dtype: object
- Slicing by integer positions
    data.iloc[1:3, 0: 4]

       Rank    City     State Population
    1     2  Berlin   Germany  3,437,916
    2     3  Madrid     Spain  3,165,235
- Selecting with lists of integer positions
    data.iloc[[1, 3, 5], [0, 1, 2]]

       Rank       City     State
    1     2     Berlin   Germany
    3     4       Rome     Italy
    5     6  Bucharest   Romania
- Slicing rows
    print(data.iloc[1:3, :])

       Rank    City     State Population Date of census/estimate
    1     2  Berlin   Germany  3,437,916               31-May-14
    2     3  Madrid     Spain  3,165,235                1-Jan-14
- Slicing columns
    print(data.iloc[:, 0:3])

         Rank        City           State
    0       1   London[2]  United Kingdom
    1       2      Berlin         Germany
    2       3      Madrid           Spain
    3       4        Rome           Italy
    4       5       Paris          France
    ..    ...         ...             ...
    100   101        Bonn         Germany
    101   102       Malmö          Sweden
    102   103  Nottingham  United Kingdom
    103   104    Katowice          Poland
    104   105      Kaunas       Lithuania

    [105 rows x 3 columns]
- Getting a specific value
    print(data.iloc[1, 1])
    print(data.iat[1, 1])  # iat is the positional counterpart of at; at[1, 1] would look for a column labelled 1

    Berlin
    Berlin
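The difference between label-based and position-based selection is easiest to see side by side. The sketch below uses a small made-up frame (an assumption for illustration, not the article's full table) to show that a `loc` slice includes its end label, an `iloc` slice excludes its end position, and `at`/`iat` are the scalar counterparts of each.

```python
import pandas as pd

# Hypothetical five-row frame with the default integer index 0..4.
data = pd.DataFrame({
    "City": ["London", "Berlin", "Madrid", "Rome", "Paris"],
    "State": ["United Kingdom", "Germany", "Spain", "Italy", "France"],
})

by_label = data.loc[1:3, "City"]      # labels 1, 2 and 3: the end label is included
by_position = data.iloc[1:3, 0]       # positions 1 and 2: the end position is excluded

row = data.loc[1, ["City", "State"]]  # one row label plus a column list gives a Series
scalar_label = data.at[1, "City"]     # label-based scalar access
scalar_pos = data.iat[1, 0]           # position-based scalar access

print(by_label.tolist())
print(by_position.tolist())
print(type(row).__name__, scalar_label, scalar_pos)
```

With the default RangeIndex the labels and positions happen to coincide, which is why the two styles are easy to confuse; after sorting or filtering they no longer line up.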
- Boolean indexing
- Using the values of a single column to select data
    # Take every value in Population, strip the thousands separators with an anonymous
    # function, convert to int, and assign the result back to the Population column
    data.Population = data.Population.apply(lambda x: int(x.replace(',', '')))
    print(data[data.Population > 1000000])

         Rank          City           State  Population  Date of census/estimate
    0       1     London[2]  United Kingdom     8615246                 1-Jun-14
    1       2        Berlin         Germany     3437916                31-May-14
    2       3        Madrid           Spain     3165235                 1-Jan-14
    3       4          Rome           Italy     2872086                30-Sep-14
    4       5         Paris          France     2273305                 1-Jan-13
    5       6     Bucharest         Romania     1883425                20-Oct-11
    6       7        Vienna         Austria     1794770                 1-Jan-15
    7       8   Hamburg[10]         Germany     1746342                30-Dec-13
    8       9      Budapest         Hungary     1744665                 1-Jan-14
    9      10        Warsaw          Poland     1729119                31-Mar-14
    10     11     Barcelona           Spain     1602386                 1-Jan-14
    11     12        Munich         Germany     1407836                31-Dec-13
    12     13         Milan           Italy     1332516                30-Sep-14
    13     14         Sofia        Bulgaria     1291895                14-Dec-14
    14     15        Prague  Czech Republic     1246780                 1-Jan-13
    15     16  Brussels[17]         Belgium     1175831                 1-Jan-14
    16     17    Birmingham  United Kingdom     1092330                30-Jun-13
    17     18       Cologne         Germany     1034175                31-Dec-13
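- Selecting data with a where operation:
    # Restrict the comparison to the numeric columns, since strings cannot be compared with 0
    num = data.select_dtypes('number')
    print(num[num > 0])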
- Filtering with the isin() method
    a = pd.Series(range(len(data)), index=data.index)  # this column's index must match the original data
    data1 = data.copy()
    data1['E'] = a
    print(data1[data1['E'].isin([2, 4])])  # E holds integers, so test against integers

       Rank    City    State  Population Date of census/estimate  E
    2     3  Madrid    Spain     3165235                1-Jan-14  2
    4     5   Paris   France     2273305                1-Jan-13  4
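The three filtering styles above (a boolean mask on one column, a where operation, and `isin()`) can be compared on a small frame. This is a sketch on made-up data: the four rows are taken from the article's table, but the frame itself and the result names are assumptions. It also uses the vectorized `str.replace(...).astype(int)` as an alternative to the `apply` call shown earlier.

```python
import pandas as pd

# Hypothetical four-row excerpt of the cities table.
data = pd.DataFrame({
    "City": ["London", "Berlin", "Bonn", "Kaunas"],
    "State": ["United Kingdom", "Germany", "Germany", "Lithuania"],
    "Population": ["8,615,246", "3,437,916", "309,869", "306,888"],
})

# Strip the thousands separators and convert the column to integers.
data["Population"] = data["Population"].str.replace(",", "", regex=False).astype(int)

# Boolean mask on one column: keeps only the matching rows.
large = data[data["Population"] > 1_000_000]

# where: keeps the shape and puts NaN where the condition is False.
numeric = data[["Population"]]
masked = numeric.where(numeric > 1_000_000)

# isin: membership test; the candidates should have the same type as the column.
selected = data[data["State"].isin(["Germany", "Lithuania"])]

print(large)
print(masked)
print(selected)
```

The practical difference is that a mask drops rows, while `where` keeps every row and marks the non-matching cells as missing.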
- Setting
- Setting a new column:
    # Already added in the previous post
- Setting a new value by label:
    data1.at[data.index[0], 'f'] = 1
    print(data1)

         Rank        City           State  Population  Date of census/estimate     f
    0       1   London[2]  United Kingdom     8615246                 1-Jun-14     1
    1       2      Berlin         Germany     3437916                31-May-14     1
    2       3      Madrid           Spain     3165235                 1-Jan-14     2
    3       4        Rome           Italy     2872086                30-Sep-14     3
    4       5       Paris          France     2273305                 1-Jan-13     4
    ..    ...         ...             ...         ...                      ...   ...
    100   101        Bonn         Germany      309869                31-Dec-12   100
    101   102       Malmö          Sweden      309105                31-Mar-13   101
    102   103  Nottingham  United Kingdom      308735                30-Jun-12   102
    103   104    Katowice          Poland      308269                30-Jun-12   103
    104   105      Kaunas       Lithuania      306888                 1-Jan-13   104

    [105 rows x 6 columns]
- Setting a new value by position:
    data1.iat[1,2] = 0
    print(data1)

         Rank        City           State  Population  Date of census/estimate     f
    0       1   London[2]  United Kingdom     8615246                 1-Jun-14     0
    1       2      Berlin               0     3437916                31-May-14     1
    2       3      Madrid           Spain     3165235                 1-Jan-14     2
    3       4        Rome           Italy     2872086                30-Sep-14     3
    4       5       Paris          France     2273305                 1-Jan-13     4
    ..    ...         ...             ...         ...                      ...   ...
    100   101        Bonn         Germany      309869                31-Dec-12   100
    101   102       Malmö          Sweden      309105                31-Mar-13   101
    102   103  Nottingham  United Kingdom      308735                30-Jun-12   102
    103   104    Katowice          Poland      308269                30-Jun-12   103
    104   105      Kaunas       Lithuania      306888                 1-Jan-13   104

    [105 rows x 6 columns]
- Setting a group of new values from a NumPy array
    data1.loc[:, 'D'] = np.array([5] * len(data1))
    print(data1)

         Rank        City           State  ...  Date of census/estimate     f    D
    0       1   London[2]  United Kingdom  ...                 1-Jun-14     0    5
    1       2      Berlin         Germany  ...                31-May-14     1    5
    2       3      Madrid           Spain  ...                 1-Jan-14     2    5
    3       4        Rome           Italy  ...                30-Sep-14     3    5
    4       5       Paris          France  ...                 1-Jan-13     4    5
    ..    ...         ...             ...  ...                      ...   ...   ..
    100   101        Bonn         Germany  ...                31-Dec-12   100    5
    101   102       Malmö          Sweden  ...                31-Mar-13   101    5
    102   103  Nottingham  United Kingdom  ...                30-Jun-12   102    5
    103   104    Katowice          Poland  ...                30-Jun-12   103    5
    104   105      Kaunas       Lithuania  ...                 1-Jan-13   104    5

    [105 rows x 7 columns]
- Setting new values with a where operation:
    # Assign through .loc rather than data2.f[...] = ..., which is chained assignment
    # and may act on a temporary copy
    data2.loc[data2.f > 0, 'f'] = -data2.f
    print(data2)

         Rank        City           State  Population  Date of census/estimate     f
    0       1   London[2]  United Kingdom     8615246                 1-Jun-14     0
    1       2      Berlin         Germany     3437916                31-May-14    -1
    2       3      Madrid           Spain     3165235                 1-Jan-14    -2
    3       4        Rome           Italy     2872086                30-Sep-14    -3
    4       5       Paris          France     2273305                 1-Jan-13    -4
    ..    ...         ...             ...         ...                      ...   ...
    100   101        Bonn         Germany      309869                31-Dec-12  -100
    101   102       Malmö          Sweden      309105                31-Mar-13  -101
    102   103  Nottingham  United Kingdom      308735                30-Jun-12  -102
    103   104    Katowice          Poland      308269                30-Jun-12  -103
    104   105      Kaunas       Lithuania      306888                 1-Jan-13  -104

    [105 rows x 6 columns]
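Conditional assignment is a common place to hit pandas' chained-assignment warning, so here is a minimal sketch of two ways to negate the positive values of a column in one step. The three-row frame and the names `df`, `s` and `alt` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a helper column f, as in the example above.
df = pd.DataFrame({"City": ["London", "Berlin", "Madrid"], "f": [0, 1, 2]})

# Option 1: a single .loc call selects the rows and the column together.
# The right-hand side is aligned on the index, so each row gets its own negated value.
df.loc[df["f"] > 0, "f"] = -df["f"]
negated = df["f"].tolist()

# Option 2: Series.mask replaces values where the condition is True
# and returns a new Series instead of modifying in place.
s = pd.Series([0, 1, 2])
alt = s.mask(s > 0, -s)

print(negated)
print(alt.tolist())
```

Both forms avoid indexing twice (`df.f[...] = ...`), which is the pattern that can end up writing to a temporary copy.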
3. Missing data
pandas uses np.nan to represent missing values; by default these values are excluded from computations.
- reindex() lets you change, add, or delete the index on a specified axis. It returns a copy of the original data.
    data3 = data2.reindex(index=data2.index, columns=list(data2.columns) + ['E'])
    data3.loc[0:2, 'E'] = 1
    print(data3)

         Rank        City           State  ...  Date of census/estimate     f    E
    0       1   London[2]  United Kingdom  ...                 1-Jun-14     0  1.0
    1       2      Berlin         Germany  ...                31-May-14     1  1.0
    2       3      Madrid           Spain  ...                 1-Jan-14     2  1.0
    3       4        Rome           Italy  ...                30-Sep-14     3  NaN
    4       5       Paris          France  ...                 1-Jan-13     4  NaN
    ..    ...         ...             ...  ...                      ...   ...  ...
    100   101        Bonn         Germany  ...                31-Dec-12   100  NaN
    101   102       Malmö          Sweden  ...                31-Mar-13   101  NaN
    102   103  Nottingham  United Kingdom  ...                30-Jun-12   102  NaN
    103   104    Katowice          Poland  ...                30-Jun-12   103  NaN
    104   105      Kaunas       Lithuania  ...                 1-Jan-13   104  NaN

    [105 rows x 7 columns]
- Dropping the rows that contain missing values:
    data4 = data3.dropna()
    print(data4)

         Rank        City           State  Population  Date of census/estimate     f    E
    0       1   London[2]  United Kingdom     8615246                 1-Jun-14     0  0.0
    1       2      Berlin         Germany     3437916                31-May-14     1  1.0
    2       3      Madrid           Spain     3165235                 1-Jan-14     2  2.0
- Filling missing values:
    data2.loc[0, 'f'] = None
    data3 = data2.fillna(value=5)
    data3.f = data3.f.astype(int)  # the NaN turned the column into floats; convert back to int
    print(data3)

         Rank        City           State  Population  Date of census/estimate     f
    0       1   London[2]  United Kingdom     8615246                 1-Jun-14     5
    1       2      Berlin         Germany     3437916                31-May-14     1
    2       3      Madrid           Spain     3165235                 1-Jan-14     2
    3       4        Rome           Italy     2872086                30-Sep-14     3
    4       5       Paris          France     2273305                 1-Jan-13     4
    ..    ...         ...             ...         ...                      ...   ...
    100   101        Bonn         Germany      309869                31-Dec-12   100
    101   102       Malmö          Sweden      309105                31-Mar-13   101
    102   103  Nottingham  United Kingdom      308735                30-Jun-12   102
    103   104    Katowice          Poland      308269                30-Jun-12   103
    104   105      Kaunas       Lithuania      306888                 1-Jan-13   104

    [105 rows x 6 columns]
- Getting a boolean mask that marks where values are missing:
    data2.loc[0, 'f'] = None
    data3 = pd.isnull(data2)
    print(data3)

          Rank   City  State  Population  Date of census/estimate      f
    0    False  False  False       False                    False   True
    1    False  False  False       False                    False  False
    2    False  False  False       False                    False  False
    3    False  False  False       False                    False  False
    4    False  False  False       False                    False  False
    ..     ...    ...    ...         ...                      ...    ...
    100  False  False  False       False                    False  False
    101  False  False  False       False                    False  False
    102  False  False  False       False                    False  False
    103  False  False  False       False                    False  False
    104  False  False  False       False                    False  False

    [105 rows x 6 columns]
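The four missing-data steps in this section fit into one short, self-contained run. The sketch below uses a made-up four-row frame (an assumption; the names `with_e`, `complete_rows`, `filled` and `missing_mask` are illustrative) so the effect of each call is easy to check by eye.

```python
import pandas as pd

# Hypothetical small frame with a helper column f.
df = pd.DataFrame({"City": ["London", "Berlin", "Madrid", "Rome"], "f": [0, 1, 2, 3]})

# reindex returns a new frame; the added column E starts out entirely NaN.
with_e = df.reindex(columns=list(df.columns) + ["E"])
with_e.loc[0:1, "E"] = 1.0           # label slice: rows 0 and 1, end label included

complete_rows = with_e.dropna()      # keep only the rows without any NaN
filled = with_e.fillna(value=5)      # replace every NaN with 5
missing_mask = with_e.isna()         # True wherever a value is missing

print(with_e)
print(complete_rows)
print(filled)
print(missing_mask)
```

None of these calls modifies `df` itself: `reindex`, `dropna`, `fillna` and `isna` all return new objects, so the results must be assigned to a name to be kept.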
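4. Operations
- Statistics (these operations generally exclude missing values)
- Computing a descriptive statistic
    # numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
    print(round(data2.mean(numeric_only=True), 2))

    Rank              53.06
    Population    787679.09
    f                 52.50
    dtype: float64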
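- Performing the same operation on the other axis
    # axis=1 averages across the numeric columns of each row; rarely useful for this table
    print(round(data2.mean(axis=1, numeric_only=True), 2))

    0      4307623.50
    1      1145973.00
    2      1055080.00
    3       957364.33
    4       757771.33
              ...
    100     103356.67
    101     103102.67
    102     102980.00
    103     102825.33
    104     102365.67
    Length: 105, dtype: float64
- Operating on objects that have different dimensionality and need alignment. pandas automatically broadcasts along the specified dimension.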
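The broadcasting point has no example in the text, so here is a minimal sketch of what it means in practice: subtracting a one-dimensional Series from a two-dimensional DataFrame. The frame, the `offsets` Series and the result names are all made up for illustration.

```python
import pandas as pd

# Hypothetical numeric frame and a Series with one value per row.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
offsets = pd.Series([1.0, 2.0, 3.0])   # shares the row index 0..2 with df

# axis="index": align the Series with the row labels and
# broadcast it across every column.
by_row = df.sub(offsets, axis="index")

# axis="columns": align the Series with the column labels and
# broadcast it down every row (here: subtract each column's mean).
centered = df.sub(df.mean(), axis="columns")

print(by_row)
print(centered)
```

The `axis` argument names the labels the Series is matched against; the values are then repeated along the other dimension.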
- Apply
- Applying a function to the data
    data.Population = data.Population.apply(lambda x: int(x.replace(',', '')))
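`apply` behaves differently on a Series and on a DataFrame, which the one-liner above does not show. In this sketch (made-up data, illustrative names) the Series version calls the function once per element, while the DataFrame version calls it once per column.

```python
import pandas as pd

# Series.apply: the function receives one element at a time.
raw = pd.Series(["8,615,246", "3,437,916"])
parsed = raw.apply(lambda x: int(x.replace(",", "")))

# DataFrame.apply: the function receives one column (a Series) at a time.
df = pd.DataFrame({"Rank": [1, 2, 3], "Population": [8615246, 3437916, 3165235]})
ranges = df.apply(lambda col: col.max() - col.min())

print(parsed.tolist())
print(ranges)
```

For simple string clean-up like the Population column, the vectorized `.str` methods are usually a reasonable alternative to an element-wise `apply`.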
- Histogramming
For details, see: Histogramming and Discretization
    s = pd.Series(np.random.randint(0, 7, size=10))
    print(s)
    print(s.value_counts())

    0    0
    1    2
    2    1
    3    2
    4    1
    5    1
    6    3
    7    4
    8    1
    9    5
    dtype: int32
    1    4
    2    2
    5    1
    4    1
    3    1
    0    1
    dtype: int64
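Because the example above draws random numbers, its output changes on every run. The sketch below repeats it with the ten values from the printed Series hard-coded, so the counts can be checked against the output shown; the variable names are illustrative.

```python
import pandas as pd

# The ten values from the printed Series above, fixed instead of random.
s = pd.Series([0, 2, 1, 2, 1, 1, 3, 4, 1, 5])

counts = s.value_counts()        # index: distinct values, data: how often each occurs
most_common = counts.index[0]    # value_counts sorts by frequency, highest first

print(counts)
print(most_common)
```

The index of the result holds the distinct values and the data holds their frequencies, which is the reverse of how the original Series is laid out.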
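- String methods
A Series comes with a set of string-processing methods in its str attribute, which make it easy to operate on every element of the array, as in the snippet below. For more detail, see: Working with text data
    s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
    s.str.lower()

    0       a
    1       b
    2       c
    3    aaba
    4    baca
    5     NaN
    6    caba
    7     dog
    8     cat
    dtype: object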
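A few more `str` methods follow the same pattern as `lower()`. This is a small sketch on a shortened, made-up Series; it also shows that missing values pass through as NaN rather than raising an error.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "Aaba", np.nan, "dog"])

lower = s.str.lower()                     # element-wise lowercase
has_a = s.str.contains("a", case=False)   # case-insensitive substring test
lengths = s.str.len()                     # length of each string

print(lower)
print(has_a)
print(lengths)
```

Each call returns a new Series with the same index as `s`, so the results can be used directly as boolean masks or assigned back as new columns.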