Pandas笔记

最新推荐文章于 2024-07-22 21:21:17 发布

LanderXX

最新推荐文章于 2024-07-22 21:21:17 发布

阅读量278

点赞数 1

CC 4.0 BY-SA版权

分类专栏： Python 文章标签： numpy

本文链接：https://blog.youkuaiyun.com/LanderXX/article/details/105143542

Python 专栏收录该内容

4 篇文章

订阅专栏

Pandas Notebook

由于使用emacs-org进行编辑，为方便暂且使用英文

Series
1. create
2. manip
DataFrame
IO

import numpy as np
import pandas as pd

Series

one-dimensional, labeled list, hold any datatype.

labels are referred to as index.

s = pd.Series(data, index = index)

where data can be: python dict, np ndarray, or scalar.

non-unique index is enabled.

create

from ndarray

index should be as long as ndarray,

or by default 0 ~ len(data) - 1.

>>> s = pd.Series(np.random.rand(5), 
              index = ['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.845771
b    0.508507
c    0.964314
d    0.924016
e    0.687160
dtype: float64

>>> s1 = pd.Series(np.random.rand(5))
>>> s1
0    0.353030
1    0.419253
2    0.121847
3    0.631009
4    0.696326
dtype: float64

from dict

if index wasn’t passed, the key of dict would be used,

if index was passed, index will match the dict’s keys,
index is prior.

>>> d = {'a': 1, 'b': 2, 'v': 3}
>>> pd.Series(d)
a    1
b    2
v    3
dtype: int64
>>>
>>> pd.Series(d, index = ['a', 'b', 'c'])
a    1.0
b    2.0
c    NaN
dtype: float64

notice: nan is a float number,
generally showing missing values,
shouldn’t be compared with
somevalue == np.nan, but use
somevalue is np.nan instead.

from scalar

index is must.

value’ll be the same.

>>> pd.Series(7, index = ['a', 'b', 'c', 'z'])
a    7
b    7
c    7
z    7
dtype: int64

manip

much like numpy and python dict

>>> s = pd.Series(np.random.rand(5), 
                  index = ['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.208771
b    0.456772
c    0.055980
d    0.481634
e    0.215942
dtype: float64

>>> s[0]
0.20877084185831607

>>> s[:3]
a    0.208771
b    0.456772
c    0.055980
dtype: float64

>>> s[s > s.median()]
b    0.456772
d    0.481634
dtype: float64

>>> s[[4, 3, 1]]
e    0.215942
d    0.481634
b    0.456772
dtype: float64

>>> s * 2
a    0.417542
b    0.913543
c    0.111960
d    0.963267
e    0.431885
dtype: float64

# notice that operations recognize the index
>>> s
a    0.208771
b    0.456772
c    0.055980
d    0.481634
e    0.215942
dtype: float64
>>> s[1:] + s[:-1]
a         NaN
b    0.913543
c    0.111960
d    0.963267
e         NaN
dtype: float64

>>> s['a']
0.20877084185831607

>>> s.get('a')
0.20877084185831607

>>> s.get('f')
>>> # <- this is 'none'

>>> s.get('f', np.nan)
nan

# get rid of the index?
>>> s.array
<PandasArray>
[ 0.20877084185831607,   0.4567715456417446, 0.055979995016902984,
   0.4816336854012986,  0.21594244986110378]
Length: 5, dtype: float64

# get a genuine array?
>>> s.to_numpy()
array([0.20877084, 0.45677155, 0.05598   , 0.48163369, 0.21594245])

# name? 
>>> s.name
>>> # <- this is 'none'

>>> s = s.rename('seriesA')
>>> s.name
'seriesA'

DataFrame

SQL table

data can be:
python dict of 1-D ndarrays, lists, dicts, or Series,
2-D ndarray, structured or record ndarray,
a Series, another DataFrame.

pass index*/*columns to force row/column labels.

create

from dict of Series or dicts

nested dicts will be firstly convert to pd.Series.

the created DataFrame’s index length is decided firstly by
the passed index variable, or by the longest data (Series).

the created DataFrame’s column width is decided firstly by
the passed columns variable, or by the number of data.

Series as data don’t have to follow length requirement,

while list or ndarray as data should be as long as the
DataFrame’s index.

>>> d = {'one': pd.Series(np.arange(1., 4.), 
                          index = ['a', 'b', 'c']), 
         'two': pd.Series(np.arange(1., 5.), 
                          index = ['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

# follow index and columns --hard
>>> pd.DataFrame(d, index = ['d', 'c', 'a'])
   one  two
d  NaN  4.0
c  3.0  3.0
a  1.0  1.0

>>> pd.DataFrame(d, index = ['d', 'c', 'a'], columns = ['one', 'X'])
   one    X
d  NaN  NaN
c  3.0  NaN
a  1.0  NaN

>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')

from dict of ndarrays/lists

ndarrays and lists should be as long as the DataFrame’s index,
and as wide as the DataFrame’s columns.

from structured or record arrays

>>> arr = [[1, 2, 'asdf'], [11, 12, 'asd']]

>>> pd.DataFrame(arr, index = ['a', 'b'], columns = ['x', 'y', 'z'])
    x   y     z
a   1   2  asdf
b  11  12   asd

from a list of dicts

>>> ls = [{'a': 1, 'b': 2}, {'c': 3, 'd': 4, 'e': 5}]

>>> pd.DataFrame(ls)
     a    b    c    d    e
0  1.0  2.0  NaN  NaN  NaN
1  NaN  NaN  3.0  4.0  5.0

>>> pd.DataFrame(ls, index = ['x', 'y'], columns = ['a', 'e'])
     a    e
x  1.0  NaN
y  NaN  5.0

from a dict of tuples

>>> d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2, ('B', 'A'): 5},
...      ('b', 'c'): {('A', 'B'): 3, ('B', 'C'): 4}}

>>> pd.DataFrame(d)
       a    b
       b    c
A B  1.0  3.0
  C  2.0  NaN
B A  5.0  NaN
  C  NaN  4.0

>>> d = {('a', 'b'): {('A', 'B'): 1, 'A': 2},
...      'b': {('A', 'B'): 3, ('B', 'C'): 4}}

>>> pd.DataFrame(d)
        (a, b)    b
(A, B)     1.0  3.0
A          2.0  NaN
(B, C)     NaN  4.0

from a Series

the only columns is the Series’s name.

use np.nan to imply missing data

manip

like a dict

>>> df = pd.DataFrame(np.arange(1, 13).reshape(3, 4), 
                      index = ['a', 'b', 'c'], 
                      columns = ['one', 'two', 'three', 'four'])
>>> del df['one']

>>> df
   two  three  four
a    2      3     4
b    6      7     8
c   10     11    12

>>> two = df.pop('two')

>>> two
a     2
b     6
c    10
Name: two, dtype: int32

>>> df['foo'] = 'bar'

>>> df
   three  four  foo
a      3     4  bar
b      7     8  bar
c     11    12  bar

>>> df['three_trunc'] = df['three'][:2]

>>> df
   three  four  foo  three_trunc
a      3     4  bar          3.0
b      7     8  bar          7.0
c     11    12  bar          NaN

# insert some columns but not at the end?
>>> df.insert(1, 'new', np.random.rand(3))

>>> df
   three       new  four  foo  three_trunc
a      3  0.932232     4  bar          3.0
b      7  0.962772     8  bar          7.0
c     11  0.792010    12  bar          NaN

# numerical
>>> df['flag'] = df['new'] > df['new'].median()
>>> df['mult'] = df['three'] * df['new']

>>> df
   three       new  four  foo  three_trunc   flag      mult
a      3  0.932232     4  bar          3.0  False  2.796697
b      7  0.962772     8  bar          7.0   True  6.739403
c     11  0.792010    12  bar          NaN  False  8.712110

# transpose
>>> n = 'abcdef'
>>> df1 = pd.DataFrame(np.arange(30).reshape(5, 6), columns = [x for x in n])

>>> df1
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

>>> df1.T
   0   1   2   3   4
a  0   6  12  18  24
b  1   7  13  19  25
c  2   8  14  20  26
d  3   9  15  21  27
e  4  10  16  22  28
f  5  11  17  23  29

# turn to array
>>> df1.to_numpy()
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

>>> np.asarray(df1)
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

notice: when inserting a Series that isn’t as long as the DataFrame’s
index length, it will be forced to follow the index length.

but if insert a list or ndarray, make sure its length match the index.

indexing/selection

operation syntax result select column df[col] Series select row by label df.loc[label] Series select row by integer location df.iloc[loc] Series slice rows df[5:10] DataFrame select rows by boolean vector df[bool\_vec] DataFrame

date as index/column

>>> index = pd.date_range('1/1/2020', periods = 10)

>>> index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

>>> df = pd.DataFrame(np.arange(30).reshape(10, 3), index = index, columns = index[:3])
>>> df
            2020-01-01  2020-01-02  2020-01-03
2020-01-01           0           1           2
2020-01-02           3           4           5
2020-01-03           6           7           8
2020-01-04           9          10          11
2020-01-05          12          13          14
2020-01-06          15          16          17
2020-01-07          18          19          20
2020-01-08          21          22          23
2020-01-09          24          25          26
2020-01-10          27          28          29

show

>>> df1
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

>>> df1.describe()
               a          b          c          d          e          f
count   5.000000   5.000000   5.000000   5.000000   5.000000   5.000000
mean   12.000000  13.000000  14.000000  15.000000  16.000000  17.000000
std     9.486833   9.486833   9.486833   9.486833   9.486833   9.486833
min     0.000000   1.000000   2.000000   3.000000   4.000000   5.000000
25%     6.000000   7.000000   8.000000   9.000000  10.000000  11.000000
50%    12.000000  13.000000  14.000000  15.000000  16.000000  17.000000
75%    18.000000  19.000000  20.000000  21.000000  22.000000  23.000000
max    24.000000  25.000000  26.000000  27.000000  28.000000  29.000000

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
a    5 non-null int32
b    5 non-null int32
c    5 non-null int32
d    5 non-null int32
e    5 non-null int32
f    5 non-null int32
dtypes: int32(6)
memory usage: 248.0 bytes

>>> df1.head(2)
   a  b  c  d   e   f
0  0  1  2  3   4   5
1  6  7  8  9  10  11
>>> df1.tail(2)
    a   b   c   d   e   f
3  18  19  20  21  22  23
4  24  25  26  27  28  29
>>> df1.head()
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

sort

# by index
>>> df1
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

>>> df1.sort_index(axis = 1, ascending = False)
# axis: 0-> by index, 1-> by columns
    f   e   d   c   b   a
0   5   4   3   2   1   0
1  11  10   9   8   7   6
2  17  16  15  14  13  12
3  23  22  21  20  19  18
4  29  28  27  26  25  24

>>> df1.sort_values(by = 'f', ascending = False)
    a   b   c   d   e   f
4  24  25  26  27  28  29
3  18  19  20  21  22  23
2  12  13  14  15  16  17
1   6   7   8   9  10  11
0   0   1   2   3   4   5

IO

# csv
df.to_csv('./foo.csv')
df1 = pd.read_csv('./foo.csv')

# hdf5
df.to_hdf('./foo.h5', 'df')
df1 = pd.read_hdf('./foo.h5', 'df')

# excel
df.to_excel('./foo.xlsx', sheet_name = 'foooo')
df1 = pd.read_excel('./foo.xlsx', sheet_name = 'foooo')