Pandas Notes

Pandas Notebook

These notes are edited in Emacs Org mode, so for convenience they are written in English for now.

Table of Contents

  1. Series

    1. create
      1. from ndarray
      2. from dict
      3. from scalar
    2. manip
  2. DataFrame

    1. create
      1. from dict of Series or dicts
      2. from dict of ndarrays/lists
      3. from structured or record arrays
      4. from a list of dicts
      5. from a dict of tuples
      6. from a Series
      7. use np.nan to imply missing data
    2. manip
    3. indexing/selection
      1. date as index/column
    4. show
    5. sort
  3. IO

    import numpy as np
    import pandas as pd

Series

a one-dimensional labeled array that can hold any data type.

the labels are collectively referred to as the index.

s = pd.Series(data, index = index)

where data can be a python dict, a numpy ndarray, or a scalar.

index labels don’t have to be unique.
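a minimal sketch of what a non-unique index means in practice (the name dup and the values are made up for illustration): looking up a repeated label gives back every matching row, while a unique label still gives a scalar.

```python
import pandas as pd

# two rows share the label 'a'
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

both_a = dup['a']                 # a Series holding both 'a' rows
only_b = dup['b']                 # a scalar, because 'b' is unique
is_unique = dup.index.is_unique   # quick check for repeated labels

print(both_a.tolist(), only_b, is_unique)
```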

create

from ndarray

if an index is passed, it must be as long as the ndarray;

otherwise it defaults to 0 ~ len(data) - 1.

>>> s = pd.Series(np.random.rand(5), 
              index = ['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.845771
b    0.508507
c    0.964314
d    0.924016
e    0.687160
dtype: float64

>>> s1 = pd.Series(np.random.rand(5))
>>> s1
0    0.353030
1    0.419253
2    0.121847
3    0.631009
4    0.696326
dtype: float64

from dict

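if no index is passed, the dict’s keys are used as the index.

if an index is passed, values are pulled from the dict by matching keys:
the passed index takes priority, labels missing from the dict become NaN,
and dict keys that are not in the index are dropped.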

>>> d = {'a': 1, 'b': 2, 'v': 3}
>>> pd.Series(d)
a    1
b    2
v    3
dtype: int64
>>>
>>> pd.Series(d, index = ['a', 'b', 'c'])
a    1.0
b    2.0
c    NaN
dtype: float64

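notice: nan is a float, generally used to mark missing values.
nan never compares equal to anything (np.nan == np.nan is False),
so don’t test with somevalue == np.nan. somevalue is np.nan is
unreliable too, because a value read back from a Series is usually
a different object. use pd.isna(somevalue) or np.isnan(somevalue) instead.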
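a small sketch of how nan comparisons behave, assuming only numpy and pandas (the names s_nan, missing_mask and c_is_missing are made up here). pd.isna is a check that works both on a single value and element-wise on a whole Series.

```python
import numpy as np
import pandas as pd

# 'c' is not a key of the dict, so it becomes NaN
s_nan = pd.Series({'a': 1, 'b': 2}, index=['a', 'b', 'c'])

# NaN never compares equal to anything, not even to itself
nan_equals_nan = (np.nan == np.nan)

# pd.isna works element-wise on a Series ...
missing_mask = pd.isna(s_nan)

# ... and on a single value
c_is_missing = pd.isna(s_nan['c'])

print(nan_equals_nan, missing_mask.tolist(), c_is_missing)
```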

from scalar

an index should be passed.

the scalar is repeated for every label in the index.

>>> pd.Series(7, index = ['a', 'b', 'c', 'z'])
a    7
b    7
c    7
z    7
dtype: int64

manip

much like a numpy ndarray (slicing, boolean masks, vectorized math) and a python dict (lookup by label)

>>> s = pd.Series(np.random.rand(5), 
                  index = ['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.208771
b    0.456772
c    0.055980
d    0.481634
e    0.215942
dtype: float64

# select by position (plain s[0] on a label-indexed Series is deprecated in pandas 2.x)
>>> s.iloc[0]
0.20877084185831607

>>> s[:3]
a    0.208771
b    0.456772
c    0.055980
dtype: float64

>>> s[s > s.median()]
b    0.456772
d    0.481634
dtype: float64

>>> s.iloc[[4, 3, 1]]
e    0.215942
d    0.481634
b    0.456772
dtype: float64

>>> s * 2
a    0.417542
b    0.913543
c    0.111960
d    0.963267
e    0.431885
dtype: float64

# notice that operations align on the index labels, not on position
>>> s
a    0.208771
b    0.456772
c    0.055980
d    0.481634
e    0.215942
dtype: float64
>>> s[1:] + s[:-1]
a         NaN
b    0.913543
c    0.111960
d    0.963267
e         NaN
dtype: float64

>>> s['a']
0.20877084185831607

>>> s.get('a')
0.20877084185831607

>>> s.get('f')
>>> # <- returns None, so nothing is printed

>>> s.get('f', np.nan)
nan

# the underlying data without the index
>>> s.array
<PandasArray>
[ 0.20877084185831607,   0.4567715456417446, 0.055979995016902984,
   0.4816336854012986,  0.21594244986110378]
Length: 5, dtype: float64

# a plain numpy array
>>> s.to_numpy()
array([0.20877084, 0.45677155, 0.05598   , 0.48163369, 0.21594245])

# the Series name (None until one is set)
>>> s.name
>>> # <- returns None, so nothing is printed

>>> s = s.rename('seriesA')
>>> s.name
'seriesA'

DataFrame

a two-dimensional labeled table, like a SQL table or a spreadsheet.

data can be:
python dict of 1-D ndarrays, lists, dicts, or Series,
2-D ndarray, structured or record ndarray,
a Series, another DataFrame.

pass index / columns to force row / column labels.

create

from dict of Series or dicts

nested dicts are first converted to pd.Series.

the resulting index is the passed index if there is one,
otherwise the union of the indexes of all the Series.

the resulting columns are the passed columns if there are any,
otherwise the keys of the dict.

Series as data don’t have to match that length: they are aligned
by label and padded with NaN.

a list or ndarray as data must be exactly as long as the
DataFrame’s index.

>>> d = {'one': pd.Series(np.arange(1., 4.), 
                          index = ['a', 'b', 'c']), 
         'two': pd.Series(np.arange(1., 5.), 
                          index = ['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

# a passed index / columns always wins: rows are reordered, unknown labels are filled with NaN
>>> pd.DataFrame(d, index = ['d', 'c', 'a'])
   one  two
d  NaN  4.0
c  3.0  3.0
a  1.0  1.0

>>> pd.DataFrame(d, index = ['d', 'c', 'a'], columns = ['one', 'X'])
   one    X
d  NaN  NaN
c  3.0  NaN
a  1.0  NaN

>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')

from dict of ndarrays/lists

all the ndarrays / lists must have the same length.
if an index is passed, it must be that long too; otherwise the index is 0 ~ n - 1.
the dict’s keys become the columns.
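a minimal sketch of the length rule, with made-up names (d_arrays, df_default, df_labeled): every array in the dict has four values, so a passed index needs four labels as well, and a shorter one should raise a ValueError.

```python
import numpy as np
import pandas as pd

d_arrays = {'one': [1., 2., 3., 4.],
            'two': np.array([4., 3., 2., 1.])}

# no index passed: rows are labeled 0 .. 3
df_default = pd.DataFrame(d_arrays)

# index of matching length: rows are labeled a .. d
df_labeled = pd.DataFrame(d_arrays, index=['a', 'b', 'c', 'd'])

# index of the wrong length: pandas refuses to build the frame
try:
    pd.DataFrame(d_arrays, index=['a', 'b'])
    length_error = False
except ValueError:
    length_error = True

print(df_labeled)
print(length_error)
```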

from structured or record arrays

# note: this is a plain nested list of rows, not a true structured array
>>> arr = [[1, 2, 'asdf'], [11, 12, 'asd']]

>>> pd.DataFrame(arr, index = ['a', 'b'], columns = ['x', 'y', 'z'])
    x   y     z
a   1   2  asdf
b  11  12   asd
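the example above builds the frame from a nested list. for an actual numpy structured array, a minimal sketch might look like this. the field names x, y, z and the dtypes are chosen here only for illustration; the field names become the column names.

```python
import numpy as np
import pandas as pd

# a structured array: each element is a record with named, typed fields
rec = np.array([(1, 2.0, 'asdf'), (11, 12.0, 'asd')],
               dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'U10')])

# columns come from the field names, the index is passed as usual
df_rec = pd.DataFrame(rec, index=['a', 'b'])

# passing columns selects and reorders the fields
df_sub = pd.DataFrame(rec, columns=['z', 'x'])

print(df_rec)
print(df_sub)
```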

from a list of dicts

>>> ls = [{'a': 1, 'b': 2}, {'c': 3, 'd': 4, 'e': 5}]

>>> pd.DataFrame(ls)
     a    b    c    d    e
0  1.0  2.0  NaN  NaN  NaN
1  NaN  NaN  3.0  4.0  5.0

>>> pd.DataFrame(ls, index = ['x', 'y'], columns = ['a', 'e'])
     a    e
x  1.0  NaN
y  NaN  5.0

from a dict of tuples

>>> d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2, ('B', 'A'): 5},
...      ('b', 'c'): {('A', 'B'): 3, ('B', 'C'): 4}}

>>> pd.DataFrame(d)
       a    b
       b    c
A B  1.0  3.0
  C  2.0  NaN
B A  5.0  NaN
  C  NaN  4.0

>>> d = {('a', 'b'): {('A', 'B'): 1, 'A': 2},
...      'b': {('A', 'B'): 3, ('B', 'C'): 4}}

>>> pd.DataFrame(d)
        (a, b)    b
(A, B)     1.0  3.0
A          2.0  NaN
(B, C)     NaN  4.0

from a Series

the result has a single column, named after the Series; the index is kept.
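a short sketch with a made-up Series (named_s) to show where the column name comes from. when the Series has no name, the column is labeled 0.

```python
import pandas as pd

named_s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='seriesA')

df_from_s = pd.DataFrame(named_s)   # one column called 'seriesA'
df_same = named_s.to_frame()        # the equivalent Series method

# an unnamed Series gives a column labeled 0
df_unnamed = pd.DataFrame(pd.Series([1, 2, 3]))

print(df_from_s)
print(df_unnamed.columns.tolist())
```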

use np.nan to imply missing data
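a minimal sketch of marking and handling missing data with np.nan. the frame df_nan and its values are made up; isna, fillna and dropna are the usual pandas methods for counting, replacing and dropping missing entries.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                       'y': [np.nan, 5.0, 6.0]})

n_missing = df_nan.isna().sum()   # missing count per column
filled = df_nan.fillna(0.0)       # replace NaN with a default value
complete_rows = df_nan.dropna()   # keep only rows without any NaN

print(n_missing)
print(filled)
print(complete_rows)
```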

manip

columns can be read, set, and deleted much like the entries of a dict

>>> df = pd.DataFrame(np.arange(1, 13).reshape(3, 4), 
                      index = ['a', 'b', 'c'], 
                      columns = ['one', 'two', 'three', 'four'])
>>> del df['one']

>>> df
   two  three  four
a    2      3     4
b    6      7     8
c   10     11    12

>>> two = df.pop('two')

>>> two
a     2
b     6
c    10
Name: two, dtype: int32

>>> df['foo'] = 'bar'

>>> df
   three  four  foo
a      3     4  bar
b      7     8  bar
c     11    12  bar

>>> df['three_trunc'] = df['three'][:2]

>>> df
   three  four  foo  three_trunc
a      3     4  bar          3.0
b      7     8  bar          7.0
c     11    12  bar          NaN

# insert a column at a given position instead of at the end
>>> df.insert(1, 'new', np.random.rand(3))

>>> df
   three       new  four  foo  three_trunc
a      3  0.932232     4  bar          3.0
b      7  0.962772     8  bar          7.0
c     11  0.792010    12  bar          NaN

# numerical
>>> df['flag'] = df['new'] > df['new'].median()
>>> df['mult'] = df['three'] * df['new']

>>> df
   three       new  four  foo  three_trunc   flag      mult
a      3  0.932232     4  bar          3.0  False  2.796697
b      7  0.962772     8  bar          7.0   True  6.739403
c     11  0.792010    12  bar          NaN  False  8.712110

# transpose
>>> n = 'abcdef'
>>> df1 = pd.DataFrame(np.arange(30).reshape(5, 6), columns = list(n))

>>> df1
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

>>> df1.T
   0   1   2   3   4
a  0   6  12  18  24
b  1   7  13  19  25
c  2   8  14  20  26
d  3   9  15  21  27
e  4  10  16  22  28
f  5  11  17  23  29

# turn to array
>>> df1.to_numpy()
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

>>> np.asarray(df1)
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

notice: when inserting a Series, it is aligned to the DataFrame’s index by label:
labels missing from the Series become NaN, and extra labels are dropped.

but when inserting a list or ndarray, make sure its length matches the index.
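a small sketch of the difference, using a made-up frame df2: the Series is matched by label, while the list of the wrong length should be rejected with a ValueError.

```python
import pandas as pd

df2 = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])

# aligned on labels: 'c' has no match (NaN), 'z' is not in df2 (dropped)
df2['from_series'] = pd.Series([10, 20, 99], index=['a', 'b', 'z'])

# a list has no labels, so it needs exactly one value per row
try:
    df2['from_list'] = [1, 2]
    list_error = False
except ValueError:
    list_error = True

print(df2)
print(list_error)
```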

indexing/selection

operation                         syntax          result
select column                     df[col]         Series
select row by label               df.loc[label]   Series
select row by integer location    df.iloc[loc]    Series
slice rows                        df[5:10]        DataFrame
select rows by boolean vector     df[bool_vec]    DataFrame
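the selection forms listed above can be sketched on a small made-up frame (df3, with labels a..d and columns x, y, z chosen only for illustration):

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(12).reshape(4, 3),
                   index=['a', 'b', 'c', 'd'],
                   columns=['x', 'y', 'z'])

col = df3['x']                  # select column            -> Series
row_by_label = df3.loc['b']     # select row by label      -> Series
row_by_pos = df3.iloc[1]        # select row by position   -> Series
rows_slice = df3[1:3]           # slice rows               -> DataFrame
rows_bool = df3[df3['x'] > 3]   # rows by boolean vector   -> DataFrame

print(col.tolist(), row_by_label.tolist())
print(rows_slice)
print(rows_bool)
```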

date as index/column

>>> index = pd.date_range('1/1/2020', periods = 10)

>>> index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

>>> df = pd.DataFrame(np.arange(30).reshape(10, 3), index = index, columns = index[:3])
>>> df
            2020-01-01  2020-01-02  2020-01-03
2020-01-01           0           1           2
2020-01-02           3           4           5
2020-01-03           6           7           8
2020-01-04           9          10          11
2020-01-05          12          13          14
2020-01-06          15          16          17
2020-01-07          18          19          20
2020-01-08          21          22          23
2020-01-09          24          25          26
2020-01-10          27          28          29
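with dates as the index, rows can be selected by date. a minimal sketch, assuming a daily index like the one above but with plain column names x, y, z (made up here) to keep it short. note that a label slice with .loc includes both end points.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', periods=10)
ts = pd.DataFrame(np.arange(30).reshape(10, 3), index=idx,
                  columns=['x', 'y', 'z'])

# a single row, looked up by an exact timestamp
one_day = ts.loc[pd.Timestamp('2020-01-03')]

# a range of rows; date strings work as slice bounds, both ends included
span = ts.loc['2020-01-03':'2020-01-05']

print(one_day.tolist())
print(span)
```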

show

>>> df1
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

>>> df1.describe()
               a          b          c          d          e          f
count   5.000000   5.000000   5.000000   5.000000   5.000000   5.000000
mean   12.000000  13.000000  14.000000  15.000000  16.000000  17.000000
std     9.486833   9.486833   9.486833   9.486833   9.486833   9.486833
min     0.000000   1.000000   2.000000   3.000000   4.000000   5.000000
25%     6.000000   7.000000   8.000000   9.000000  10.000000  11.000000
50%    12.000000  13.000000  14.000000  15.000000  16.000000  17.000000
75%    18.000000  19.000000  20.000000  21.000000  22.000000  23.000000
max    24.000000  25.000000  26.000000  27.000000  28.000000  29.000000

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
a    5 non-null int32
b    5 non-null int32
c    5 non-null int32
d    5 non-null int32
e    5 non-null int32
f    5 non-null int32
dtypes: int32(6)
memory usage: 248.0 bytes

>>> df1.head(2)
   a  b  c  d   e   f
0  0  1  2  3   4   5
1  6  7  8  9  10  11
>>> df1.tail(2)
    a   b   c   d   e   f
3  18  19  20  21  22  23
4  24  25  26  27  28  29
>>> df1.head()
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

sort

# by index
>>> df1
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29

# axis: 0 -> sort by the row index, 1 -> sort by the column labels
>>> df1.sort_index(axis = 1, ascending = False)
    f   e   d   c   b   a
0   5   4   3   2   1   0
1  11  10   9   8   7   6
2  17  16  15  14  13  12
3  23  22  21  20  19  18
4  29  28  27  26  25  24

>>> df1.sort_values(by = 'f', ascending = False)
    a   b   c   d   e   f
4  24  25  26  27  28  29
3  18  19  20  21  22  23
2  12  13  14  15  16  17
1   6   7   8   9  10  11
0   0   1   2   3   4   5

IO

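# csv (the index is written as the first column, so read it back with index_col)
df.to_csv('./foo.csv')
df1 = pd.read_csv('./foo.csv', index_col = 0)

# hdf5 (needs the optional PyTables package)
df.to_hdf('./foo.h5', key = 'df')
df1 = pd.read_hdf('./foo.h5', key = 'df')

# excel (needs an optional engine such as openpyxl)
df.to_excel('./foo.xlsx', sheet_name = 'foooo')
df1 = pd.read_excel('./foo.xlsx', sheet_name = 'foooo', index_col = 0)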
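a sketch of a csv round trip, written to an in-memory buffer (io.StringIO) instead of a file so it leaves nothing on disk. the frame df_out is made up. it shows what happens to the index when it is read back with and without index_col.

```python
import io

import numpy as np
import pandas as pd

df_out = pd.DataFrame(np.arange(6).reshape(2, 3),
                      index=['a', 'b'], columns=['x', 'y', 'z'])

buf = io.StringIO()
df_out.to_csv(buf)              # the index is written as the first column

buf.seek(0)
df_plain = pd.read_csv(buf)     # the old index comes back as an ordinary column

buf.seek(0)
df_back = pd.read_csv(buf, index_col=0)   # first column restored as the index

print(df_plain.columns.tolist())
print(df_back)
```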