10 Minutes to pandas (Part 1)

Latest recommended article published on 2024-04-07 10:25:17

鹅厂程序小哥


Category: Python. Tags: python, pandas


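This is a short introduction to pandas, aimed mainly at new users. See the Cookbook for more complex recipes.

By convention, pandas is imported as pd and NumPy as np.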
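The code in this article refers to the aliases `pd` and `np`, so the conventional imports are presumably the following (a minimal sketch; any further imports the original may have listed are not recoverable here):

```python
# Conventional aliases used throughout the article
import numpy as np
import pandas as pd
```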

Object Creation

Create a Series by passing a list of values, letting pandas create a default integer index.
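A minimal sketch of this step. The concrete values, including the `np.nan` entry, are illustrative assumptions rather than the article's own data:

```python
import numpy as np
import pandas as pd

# A plain list of values; np.nan marks a missing entry
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# No index was passed, so pandas assigns the default integer index 0..5.
# The NaN forces the values to be stored as float64.
print(s)
```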

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns.

 
```python
In [6]: dates = pd.date_range('20130101', periods=6)

In [7]: dates
Out[7]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None

In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [9]: df
Out[9]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
```

Create a DataFrame by passing a dict of objects that can be converted to series-like structures.

```python
In [10]: df2 = pd.DataFrame({ 'A' : 1.,
   ....:                      'B' : pd.Timestamp('20130102'),
   ....:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
   ....:                      'D' : np.array([3] * 4,dtype='int32'),
   ....:                      'E' : pd.Categorical(["test","train","test","train"]),
   ....:                      'F' : 'foo' })
   ....:

In [11]: df2
Out[11]:
   A          B  C  D      E    F
0  1 2013-01-02  1  3   test  foo
1  1 2013-01-02  1  3  train  foo
2  1 2013-01-02  1  3   test  foo
3  1 2013-01-02  1  3  train  foo
```

Each column of the resulting DataFrame has its own specific dtype.
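One way to inspect the per-column types is the `dtypes` attribute. The sketch below rebuilds `df2` as defined above. The exact datetime and string dtype names may differ between pandas versions, so they are not spelled out here:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,                                                  # scalar, broadcast to every row
    "B": pd.Timestamp("20130102"),                             # scalar timestamp
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),  # sets the row index 0..3
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# One dtype per column
print(df2.dtypes)
```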

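If you are using IPython, tab completion for column names (as well as public attributes) is enabled automatically. Here is a subset of the attributes that will be completed: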

```python
In [13]: df2.<TAB>
df2.A                  df2.boxplot
df2.abs                df2.C
df2.add                df2.clip
df2.add_prefix         df2.clip_lower
df2.add_suffix         df2.clip_upper
df2.align              df2.columns
df2.all                df2.combine
df2.any                df2.combineAdd
df2.append             df2.combine_first
df2.apply              df2.combineMult
df2.applymap           df2.compound
df2.as_blocks          df2.consolidate
df2.asfreq             df2.convert_objects
df2.as_matrix          df2.copy
df2.astype             df2.corr
df2.at                 df2.corrwith
df2.at_time            df2.count
df2.axes               df2.cov
df2.B                  df2.cummax
df2.between_time       df2.cummin
df2.bfill              df2.cumprod
df2.blocks             df2.cumsum
df2.bool               df2.D
```

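As you can see, the columns A, B, C, and D are tab-completed automatically. E is there as well; the rest of the attributes have been truncated for brevity.

Viewing Data

See the Basics section.

View the top and bottom rows of the frame.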
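A minimal sketch using `head()` and `tail()`. To keep the result reproducible, `df` is filled with `np.arange` values here instead of the random numbers used in the article (an assumption made only for illustration):

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

top = df.head()      # first five rows by default
bottom = df.tail(3)  # last three rows
print(top)
print(bottom)
```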

Display the index, the columns, and the underlying NumPy data.

```python
In [16]: df.index
Out[16]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None

In [17]: df.columns
Out[17]: Index([u'A', u'B', u'C', u'D'], dtype='object')

In [18]: df.values
Out[18]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
```

describe() shows a quick statistical summary of your data.
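A short sketch of `describe()`, again assuming a deterministic `df` in place of the article's random one:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# count, mean, std, min, quartiles and max for each numeric column
summary = df.describe()
print(summary)
```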

Transposing your data.
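Transposition is available through the `T` attribute. A sketch on a deterministic stand-in frame (assumed data, not the article's):

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Rows become columns and vice versa
transposed = df.T
print(transposed)
```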

Sorting by an axis.

Sorting by values.
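Both kinds of sorting can be sketched with `sort_index` and `sort_values`, the method names in current pandas releases. The frame below is a deterministic stand-in, and the descending order is chosen only so the effect is visible on already-ordered data:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Sort by an axis: order the column labels (axis=1) in descending order
by_axis = df.sort_index(axis=1, ascending=False)

# Sort by values: order the rows by the contents of column B, largest first
by_value = df.sort_values(by="B", ascending=False)

print(by_axis)
print(by_value)
```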

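Selection

Note: While standard Python / NumPy expressions for selecting and setting are intuitive and handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc. (The .ix indexer listed in older versions of this tutorial was deprecated in pandas 0.20 and removed in pandas 1.0.)

See the indexing documentation: Indexing and Selecting Data, and MultiIndex / Advanced Indexing.

Getting

Selecting a single column, which yields a Series, equivalent to df.A.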
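A minimal sketch of single-column selection on a deterministic stand-in frame (assumed data):

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

col = df["A"]  # a Series indexed by the dates
print(col)
```

Attribute access (`df.A`) only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the more general one.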

Selecting via [], which slices the rows.
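Row slicing with `[]` accepts both positions and index labels. A sketch under the assumption of a deterministic stand-in frame:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

first_three = df[0:3]                      # positional slice, end excluded
by_date = df["20130102":"20130104"]        # label slice, both endpoints included

print(first_three)
print(by_date)
```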

Selection by Label

See more in Selection by Label.

For getting a cross section using a label.
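A cross section is a single row returned as a Series. A sketch with `.loc`, assuming a deterministic stand-in frame:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# One row label in, one Series out; the column names become its index
row = df.loc[dates[0]]
print(row)
```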

Selecting on multiple axes by label.
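A sketch of selecting on both axes at once with `.loc[rows, columns]` (stand-in data assumed):

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# ":" keeps every row; the list picks the columns by label
subset = df.loc[:, ["A", "B"]]
print(subset)
```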

Showing label slicing; both endpoints are included.
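Unlike positional slices, label slices keep the end label. A sketch on assumed stand-in data:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Rows from 2013-01-02 through 2013-01-04 inclusive, columns A and B
window = df.loc["20130102":"20130104", ["A", "B"]]
print(window)
```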

Reduction in the dimensions of the returned object.
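Passing a single row label together with a list of columns returns a one-dimensional Series instead of a DataFrame. A sketch on assumed stand-in data:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Scalar row label + list of column labels -> Series
reduced = df.loc[dates[0], ["A", "B"]]
print(reduced)
```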

For getting a scalar value:

 
```python
In [30]: df.loc[dates[0],'A']
Out[30]: 0.46911229990718628
```

For getting fast access to a scalar (equivalent to the prior method):

```python
In [31]: df.at[dates[0],'A']
Out[31]: 0.46911229990718628
```

Selection by Position

See more in Selection by Position.

Select via the position of the passed integers.
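A sketch of positional row selection with `.iloc`, assuming a deterministic stand-in frame:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Position 3 is the fourth row (0-based), returned as a Series
row = df.iloc[3]
print(row)
```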

By integer slices, acting similarly to NumPy/Python.
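Integer slices in `.iloc` exclude the end position, just like Python lists. A sketch on assumed stand-in data:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Rows 3 and 4, columns 0 and 1
block = df.iloc[3:5, 0:2]
print(block)
```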

By lists of integer position locations, similar to the NumPy/Python style.
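Lists of positions pick arbitrary, possibly non-contiguous rows and columns. A sketch on assumed stand-in data:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Rows at positions 1, 2 and 4; columns at positions 0 and 2
picked = df.iloc[[1, 2, 4], [0, 2]]
print(picked)
```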

For slicing rows explicitly.

For slicing columns explicitly.
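The two explicit slicing forms can be sketched side by side (stand-in data assumed):

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

rows = df.iloc[1:3, :]  # rows 1 and 2, every column
cols = df.iloc[:, 1:3]  # every row, columns 1 and 2 (B and C)

print(rows)
print(cols)
```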

For getting a value explicitly:

```python
In [37]: df.iloc[1,1]
Out[37]: -0.17321464905330861
```

For getting fast access to a scalar (equivalent to the prior method):

 
```python
In [38]: df.iat[1,1]
Out[38]: -0.17321464905330861
```

Boolean Indexing

Using a single column's values to select data.
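A sketch of filtering rows by a condition on one column. The frame mixes negative and positive values (an assumption chosen so the filter has something to drop):

```python
import numpy as np
import pandas as pd

# Deterministic values from -10 to 13 (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4) - 10, index=dates, columns=list("ABCD"))

# df["A"] > 0 is a boolean Series; indexing with it keeps the True rows
positive_a = df[df["A"] > 0]
print(positive_a)
```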

A where operation for getting.
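Indexing with a boolean DataFrame keeps the shape and replaces non-matching cells with NaN. A sketch on assumed mixed-sign data:

```python
import numpy as np
import pandas as pd

# Deterministic values from -10 to 13 (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4) - 10, index=dates, columns=list("ABCD"))

# Cells where the condition is False become NaN
positive = df[df > 0]
print(positive)
```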

Using the isin() method for filtering:
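A minimal sketch of `isin()`. The extra column `E` and its labels are hypothetical, added only to have something to filter on:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]  # hypothetical labels

# Keep the rows whose E value is one of the listed labels
selected = df2[df2["E"].isin(["two", "four"])]
print(selected)
```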

Setting

Setting a new column automatically aligns the data by the indexes.
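The alignment can be seen by assigning a Series whose index is shifted by one day relative to the frame. The names `s1` and `F` and all values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# s1 starts one day later than df and runs one day past its end
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

# Values are matched on index labels: 2013-01-01 has no match and gets NaN,
# and the entry for 2013-01-07 is dropped because df has no such row
df["F"] = s1
print(df)
```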

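Setting values by label.

Setting values by position.

Setting by assigning with a NumPy array.

The result of the prior setting operations.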
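The three setting styles and their combined result might look as follows. The frame and the assigned values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

df.at[dates[0], "A"] = -1.0                  # by label
df.iat[0, 1] = -2.0                          # by position (row 0, column 1)
df.loc[:, "D"] = np.array([5.0] * len(df))   # whole column from a NumPy array

# The frame now reflects all three assignments
print(df)
```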

A where operation with setting.
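A boolean mask can also appear on the left-hand side of an assignment. The sketch flips the sign of every positive cell, on assumed mixed-sign data:

```python
import numpy as np
import pandas as pd

# Deterministic values from -10 to 13 (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4) - 10, index=dates, columns=list("ABCD"))

df2 = df.copy()
# Only cells where the mask is True are overwritten
df2[df2 > 0] = -df2
print(df2)
```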

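Missing Data

pandas primarily uses the value np.nan to represent missing data. By default it is not included in computations. See the Missing Data section.

Reindexing allows you to change, add, or delete the index on a specified axis. It returns a copy of the data.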
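A sketch of `reindex` that keeps four rows and adds a new, initially empty column. The names `df1` and `E` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Keep the first four rows and add column E, which starts out as all NaN
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

# Fill E for the first two rows only
df1.loc[dates[0]:dates[1], "E"] = 1
print(df1)
```

The original `df` is left unchanged, since `reindex` returns a new object.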

To drop any rows that have missing data.

Filling missing data.

To get the boolean mask where values are NaN.
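These three operations can be sketched on a small frame with a partly empty column. The frame `df1`, the column `E` and the fill value 5 are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in frame with a partly empty column E (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1

dropped = df1.dropna(how="any")  # keep only rows without any NaN
filled = df1.fillna(value=5)     # replace every NaN with 5
mask = pd.isna(df1)              # True where a value is NaN

print(dropped)
print(filled)
print(mask)
```

All three calls return new objects; `df1` itself still contains its NaN values afterwards.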

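Operations

See the Basic section on Binary Ops.

Stats

Operations in general exclude missing data.

Performing a descriptive statistic.

Same operation on the other axis.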
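Taking the mean as the descriptive statistic (an assumption; any reduction such as `sum` or `max` behaves the same way), the two axes can be sketched as:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

col_means = df.mean()        # default axis=0: one value per column
row_means = df.mean(axis=1)  # axis=1: one value per row

print(col_means)
print(row_means)
```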

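Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.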
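A sketch of subtracting a Series from every column of a DataFrame. The Series values and the shift are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Shifting by 2 moves values down two rows and leaves NaN at the top
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Align s with df's row index and subtract it from every column;
# rows where s is NaN become NaN in the result
result = df.sub(s, axis="index")
print(result)
```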

Apply

Applying functions to the data.
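`apply` calls a function once per column by default. Both lambdas below are illustrative choices: one returns a Series per column, the other a single number:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the article's random df (illustrative assumption)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Function returning a Series per column -> result is a DataFrame
cumulative = df.apply(lambda col: col.cumsum())

# Function returning a scalar per column -> result is a Series
spread = df.apply(lambda col: col.max() - col.min())

print(cumulative)
print(spread)
```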

Histogramming

See more at Histogramming and Discretization.

 
```python
In [68]: s = pd.Series(np.random.randint(0,7,size=10))

In [69]: s
Out[69]:
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int32

In [70]: s.value_counts()
Out[70]:
4    5
6    2
2    2
1    1
dtype: int64
```

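String Methods

Series is equipped with a set of string processing methods, available through the str attribute, that make it easy to operate on each element of the array. Note that pattern matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.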
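A short sketch of the `str` accessor. The sample strings and the pattern are illustrative assumptions:

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

# Element-wise lowercase; the missing value stays missing
lowered = s.str.lower()

# contains() treats the pattern as a regular expression by default;
# na=False reports missing values as "no match"
starts_ab = s.str.contains(r"^[AB]", na=False)

print(lowered)
print(starts_ab)
```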