Notes on Series & DataFrame: building a 4*4 Series-like object from a hierarchical-index Cartesian product and converting it to a DataFrame-like structure - 优快云 Blog

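This article covers Series and DataFrame operations in pandas: creation, indexing, slicing, handling missing values, reindexing, and aggregation over multi-level indexes. It focuses on indexing with loc, iloc and ix and on handling missing values with fillna and dropna, and it also covers DataFrame reindexing, converting between row and column labels, and combining datasets.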


import pandas as pd
import numpy as np
Creating a Series:
   1) A Series as a generalized NumPy array:
       Example:   data = pd.Series([0.25, 0.5, 0.75, 1.0], index = ["a","b","c","d"])
   2) A Series as a specialized dictionary:
       Example:   data_dict = {"population":123,
               "text":456,
               "flower":789}
           data = pd.Series(data_dict)
           data
Indexers provided by pandas (loc, iloc, ix; ix has been deprecated since pandas 0.20.0 and was removed in 1.0):
   f = pd.Series([1,2,3],index = ["a","b","c"])
   print(f)
   print("*"*20)
   print(f.loc["c"])   #explicit (label-based) indexer
   print("*"*20)
   print(f.iloc[2])     #implicit (position-based) indexer

   a 1
   b 2
   c 3
   dtype: int64
   ********************
   3
   ********************
   3
   Both Series and DataFrame support these indexers.
Slicing a Series:
   Slicing by implicit (positional) index:
       data = pd.Series([0.25, 0.5, 0.75, 1.0], index = ["a","b","c","d"])
       data[0:2]

       a   0.25
       b   0.5
   Slicing by explicit (label) index:
       data = pd.Series([0.25, 0.5, 0.75, 1.0], index = ["a","b","c","d"])
       data["a":"c"]

       a   0.25
       b   0.5
       c   0.75
   Note: a slice on the explicit index includes the final label, whereas a slice on the implicit index excludes the final position.
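The inclusive/exclusive difference is easiest to see when the labels are themselves integers. The following is a minimal sketch; the Series and its labels are made up for illustration and are not from the original notes.

```python
import pandas as pd

# Integer labels make the difference between label and position visible.
data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = data.loc[1:3]      # label slice: both endpoints (labels 1 and 3) are included
by_position = data.iloc[1:3]  # position slice: positions 1 and 2, the stop is excluded

print(by_label)
print(by_position)
```

Using loc and iloc explicitly avoids having to remember which rule a bare data[...] slice follows.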
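Index alignment (Series and DataFrame):
   Positions whose label exists in only one operand are filled with NaN;
   Series index alignment:
       A = pd.Series([2,4,6],index = [0,1,2])
       B = pd.Series([1,3,5],index = [1,2,3])
       print(A+B)
       A.add(B,fill_value = 0)
   DataFrame index alignment:
       rng = np.random.RandomState(42)   #rng was not defined in the original; this generator and its seed are an assumption
       A = pd.DataFrame(rng.randint(0, 20, (2, 2)),columns = list("AB"))
       B = pd.DataFrame(rng.randint(0, 10, (3, 3)),columns = list("BAC"))
       print(A + B)
       fill = A.stack().mean()
       A.add(B,fill_value = fill)   #call A's add method with B and a fill_value; fill_value stands in for a value that is missing from one operand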
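Because the example above uses random numbers, the effect of alignment is hard to check by eye. The sketch below repeats the same steps with small fixed values (chosen here purely for illustration) so the NaN positions and the filled results can be verified by hand.

```python
import pandas as pd

# Fixed values instead of random ones, so every result can be checked by hand.
A = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
B = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]], columns=list("BAC"))

plain = A + B                       # row 2 and column C exist only in B, so they become NaN
fill = A.stack().mean()             # mean of all values in A: (1 + 2 + 3 + 4) / 4 = 2.5
filled = A.add(B, fill_value=fill)  # entries missing from A are treated as 2.5 before adding

print(plain)
print(filled)
```

Note that columns are aligned by label, not by position: column "A" of A is added to column "A" of B even though B lists its columns in the order B, A, C.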
Looking up DataFrame rows and columns:
   A DataFrame can be treated as an enhanced two-dimensional array. A row can be fetched with obj.values[i] (which returns a NumPy array), and a column
   can be fetched with obj["j"] or obj.j (which returns a Series);
   Example:   frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                   index = ["ohil","colorado","utah","new york"],
                   columns = ["one","two","three"])
       series = frame.values[0]
       print(series)
       print(type(series))
       series1 = frame["two"]
       print(series1)
       print(type(series1))

       [0 1 2]
       <class 'numpy.ndarray'>
       ohil         1
       colorado     4
       utah         7
       new york     10
       Name: two, dtype: int32
       <class 'pandas.core.series.Series'>
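As the output shows, obj.values[i] returns a bare NumPy array and loses the column labels. A minimal sketch of the alternative, assuming the same frame as above: the iloc and loc indexers return the row as a Series that keeps its labels.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=["ohil", "colorado", "utah", "new york"],
                     columns=["one", "two", "three"])

row_by_position = frame.iloc[0]   # first row, selected by position
row_by_label = frame.loc["ohil"]  # same row, selected by its index label

print(row_by_label)
print(type(row_by_label))
```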
Arithmetic between a DataFrame and a Series:
   The Series is broadcast to match the DataFrame's shape before the operation is applied; the point to watch is the choice of axis.
   Example 1:   frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                   index = ["ohil","colorado","utah","new york"],
                   columns = ["one","two","three"])
       series = frame.values[0]
       print(frame)
       print(series)
       print(frame.sub(series,axis = 1))

           one     two     three
       ohil     0    1    2
       colorado   3    4    5
       utah     6     7     8
       new york    9    10    11
       [0 1 2]
            one    two     three
       ohil    0     0     0
       colorado   3     3     3
       utah     6     6     6
       new york 9     9     9
   Example 2:   A = np.random.randint(10,size = (3,4))
       print(A)
       df = pd.DataFrame(A,columns = list("QRST"))
       print(df["R"])
       print(df.sub(df["R"],axis = 0))
       print(df.sub(df["R"],axis = 1))

       [[7    3    8    4]
        [2    0    0    9]
        [4   8    4    4]]
       0     3
       1     0
       2     8
       Name: R, dtype: int32
            Q     R    S     T
       0     4     0    5     1
       1     2     0    0     9
       2   -4   0   -4   -4
           Q    R    S    T    0    1    2
       0    NaN    NaN    NaN    NaN    NaN    NaN    NaN
       1    NaN    NaN    NaN    NaN    NaN    NaN    NaN
       2    NaN    NaN    NaN    NaN    NaN    NaN    NaN
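The all-NaN result above comes from the axis choice: with axis = 1 the Series index (0, 1, 2) is matched against the column labels (Q, R, S, T), and nothing matches. The sketch below reuses the values printed in Example 2 and shows one matching pairing for each axis; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[7, 3, 8, 4],
                   [2, 0, 0, 9],
                   [4, 8, 4, 4]], columns=list("QRST"))

# A row is indexed by column labels, so it lines up with axis=1 (columns).
first_row = df.iloc[0]
by_columns = df.sub(first_row, axis=1)

# A column is indexed by row labels, so it lines up with axis=0 (rows).
by_rows = df.sub(df["R"], axis=0)

print(by_columns)
print(by_rows)
```

A rule of thumb under these assumptions: pass the axis whose labels the Series index shares with the DataFrame.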
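Handling missing values (Series and DataFrame):
   1) Detecting missing values
       pandas data structures offer two methods for detecting missing values: isnull and notnull
   2) Dropping missing values
       Use the dropna method to drop missing values
       Example:   data = pd.Series([1,np.nan,"hello",None])
           data.dropna()

           0   1
           2   hello
           dtype:   object
       Note: a single value cannot be dropped from a DataFrame on its own; only whole rows or whole columns can be dropped. DataFrame.dropna() takes parameters to choose between them as needed;
       Example:   df = pd.DataFrame([[1,   np.nan,   2],
                   [2,   3 ,    5],
                   [np.nan,   4,   6]])
           print(df.dropna())

               0   1   2
           1   2   3   5
           print(df.dropna(axis = "columns"))#or: axis = 1
               2
           0   2
           1   5
           2   6
       Note: dropping this way is heavy-handed, because valid values in the same row or column are discarded along with the missing ones;
           The how and thresh parameters give finer control: thresh sets the minimum number of non-missing values a row or column must contain to be kept.
       The default is how = "any": any row or column (chosen with axis) that contains a missing value is dropped (see the example above);
           how = "all" drops only rows or columns in which every value is missing (see the example below);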
       Example:   df[3] = np.nan
           df
               0   1   2   3
           0   1   NaN   2   NaN
           1   2   3   5   NaN
           2   NaN   4   6   NaN
           df.dropna(axis = "columns",how = "all")
               0   1   2
           0   1   NaN   2
           1   2   3   5
           2   NaN   4   6
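The thresh parameter is mentioned above but not demonstrated. A minimal sketch on the same four-column frame: thresh = 3 keeps only rows that have at least three non-missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan  # add a column that is entirely missing

# Non-missing counts per row are 2, 3 and 2, so only row 1 meets thresh=3.
kept = df.dropna(axis="index", thresh=3)
print(kept)
```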
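   3) Filling missing values
       Filling works much the same for Series and DataFrame, except that a directional fill on a DataFrame also needs the axis parameter;
       Missing values are filled with fillna(), which accepts either a fill value or a method for forward fill (method = "ffill") or backward fill (method = "bfill"); pandas 2.1 and later deprecate the method argument in favor of the ffill() and bfill() methods;
       Filling a Series:
           Example:   data = pd.Series([1,np.nan,2,None,3],index = list("abcde"))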
               data
               a   1
               b   NaN
               c   2
               d   NaN
               e   3
               dtype:   float64

               data.fillna(0)
               a   1
               b   0
               c   2
               d   0
               e   3
               dtype:   float64

               data.fillna(method = "ffill")
               a   1
               b   1
               c   2
               d   2
               e   3
               dtype:   float64

               data.fillna(method = "bfill")
               a   1
               b   2
               c   2
               d   3
               e   3
               dtype:   float64

       Filling a DataFrame:
           Example:   df = pd.DataFrame([[1,   np.nan,   2,   np.nan],
                       [2,   3 ,    5,   np.nan],
                       [np.nan,   4,   6,   np.nan]])
               df
                   0   1   2   3
               0   1   NaN   2   NaN
               1   2   3   5   NaN
               2   NaN   4   6   NaN

               df.fillna(method = "ffill",axis = 1)
                   0   1   2   3
               0   1   1   2   2
               1   2   3   5   5
               2   NaN   4   6   6
   For more on how axis works in pandas, see: https://www.jianshu.com/p/bf60078103f2
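In pandas 2.1 and later, fillna(method = ...) emits a deprecation warning. The sketch below shows the same fills written with the dedicated ffill() and bfill() methods; it is meant as an equivalent spelling, not a change to the results shown above.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))
forward = data.ffill()   # propagate the previous valid value forward
backward = data.bfill()  # propagate the next valid value backward

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, 3, 5, np.nan],
                   [np.nan, 4, 6, np.nan]])
row_filled = df.ffill(axis=1)  # fill along each row, left to right

print(forward)
print(backward)
print(row_filled)
```

The leading NaN in the last row stays missing, because a forward fill has no earlier value to copy from.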
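Reindexing:
   Reindexing means calling reindex to rearrange the data according to a new index. Labels in the new index that are absent from the original index are filled with NaN;
   Example 1:   Reindexing a Series: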
       obj = pd.Series([4.5,7.2,-5.3,3.6],index = ["d","b","a","c"])
       print(obj)
       obj2 = obj.reindex(["a","b","c","d","e"])
       print(obj2)
       obj3 = pd.Series(["blue","purple","yellow"],index = [0,2,4])
       obj4 = obj3.reindex(range(6),method = "ffill")
       print(obj4)

       d     4.5
       b    7.2
       a    -5.3
       c    3.6
       dtype:    float64
       a    -5.3
       b    7.2
       c     3.6
       d     4.5
       e     NaN
       dtype:    float64
       0    blue
       1     blue
       2     purple
       3     purple
       4     yellow
       5     yellow
       dtype:    object
   Example 2:   Reindexing a DataFrame:
       frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ["a","c","d"],columns = ["ohil","texas","california"])
       #reindex the rows (index)
       print(frame)
       print(frame.reindex(["a","b","c","d"]))
       #reindex the columns
       states = ["texas","see","california"]
       print(frame.reindex(columns = states))
       #reindex rows and columns at the same time
       frame1 = frame.reindex(index = ["a","b","c","d"], columns = states)
       print(frame1)
       frame1.loc["b"] = 0
       print(frame1)

           ohil     texas     california
       a    0     1    2
       c    3     4    5
       d    6     7    8
           ohil     texas     california
       a    0.0     1.0    2.0
       b    NaN     NaN NaN
       c    3.0     4.0    5.0
       d    6.0     7.0    8.0
           texas     see     california
       a     1     NaN 2
       c     4     NaN 5
       d     7     NaN 8
            texas     see     california
       a     1.0     NaN 2.0
       b     NaN     NaN NaN
       c     4.0     NaN 5.0
       d     7.0     NaN 8.0
            texas     see     california
       a     1.0     NaN 2.0
       b     0.0     0.0    0.0
       c     4.0     NaN 5.0
       d     7.0     NaN 8.0
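   Open question:
       frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ["a","c","d"],columns = ["ohil","texas","california"])
       frame1 = frame.reindex(index = ["a","b","c","d"],method = "ffill", columns = states)
       print(frame1)

       ValueError:   index must be monotonic increasing or decreasing (the call raises an error, unlike the textbook answer; a likely cause is that method = "ffill" is applied to every axis being reindexed, and the column labels are not in sorted order)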
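One possible workaround for the error above, assuming it is caused by the fill method being applied to the unsorted column labels: reindex in two steps, forward-filling only the rows (whose labels "a", "c", "d" are sorted) and then selecting the columns without a fill method.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["ohil", "texas", "california"])
states = ["texas", "see", "california"]

# Step 1: forward-fill along the rows only; the row labels are monotonic.
filled_rows = frame.reindex(index=["a", "b", "c", "d"], method="ffill")
# Step 2: reorder the columns with no fill method; the new "see" column is NaN.
frame1 = filled_rows.reindex(columns=states)

print(frame1)
```

Row "b" takes its values from row "a", which is what the forward fill is expected to do.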
Dropping entries from an axis:
   Example:   data = pd.DataFrame(np.arange(16).reshape((4,4)),
           index = ["ohil","colorado","utah","new york"],
           columns = ["one","two","three","four"])
       print(data)
       print(data.drop("two",axis = 1))
       print(data.drop("ohil"))   #axis = 0 is the default and can be omitted
           one   two     three     four
       ohil    0     1    2    3
       colorado 4     5     6    7
       utah     8    9    10    11
       new york 12    13    14     15

           one     three    four
       ohil     0     2    3
       colorado 4    6    7
       utah 8    10     11
       new york 12     14     15

           one     two     three     four
       colorado 4     5    6    7
       utah     8     9    10     11
       new york 12    13    14     15
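Sorting and ranking:
   Sorting:
       To sort by row or column index (lexicographically), use sort_index, which returns a new, sorted object; for descending order, pass ascending = False, i.e. sort_index(ascending = False)
       To sort a Series by value, or a DataFrame by the values of one or more columns, use sort_values(); the by argument names the column or columns to sort on;
       Example:   #sorting by index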
           obj = pd.Series(range(4),index = ["d","b","c","a"])
           print(obj.sort_index())
           print(obj.sort_index(ascending = False))
           frame = pd.DataFrame(np.arange(8).reshape((2,4)),index = ["one","three"],
               columns = ["d","a","b","c"])
           print(frame.sort_index(axis = 1))
           #sorting by value
           print(frame.sort_values(by = "c",ascending = False))
           obj1 = pd.Series([4,7,-2,2])
           print(obj1.sort_values())

           a     3
           b     1
           c     2
           d     0
           dtype:    int64
           d     0
           c     2
           b     1
           a     3
           dtype:    int64
                a     b     c     d
           one     1     2     3     0
           three     5     6     7     4
                d     a     b     c
           three     4     5     6     7
           one     0     1     2     3
           2    -2
           3     2
           0     4
           1     7
           dtype:    int64
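   Ranking:
       method options for breaking ties when ranking:
       method       Description
       "average"    Default: assign the average rank to each value in a tied group
       "min"        Use the minimum rank for the whole group
       "max"        Use the maximum rank for the whole group
       "first"      Assign ranks in the order the values appear in the data
       Example:   obj = pd.Series([7,-5,4,2,0,4])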
           print(obj.rank())
           print(obj.rank(method = "first"))
           #descending order
           print(obj.rank(ascending = False,method = "first"))
           frame = pd.DataFrame({"b":[4.3,7,-3,2],"a":[0,1,0,1],
               "c":[-2,5,8,-2.5]})
           print(frame)
           print(frame.rank(axis = 0))

           0     6.0
           1     1.0
           2     4.5
           3     3.0
           4     2.0
           5     4.5
           dtype:    float64
           0     6.0
           1     1.0
           2     4.0
           3    3.0
           4     2.0
           5     5.0
           dtype:    float64
           0     1.0
           1     6.0
           2     2.0
           3     4.0
           4     5.0
           5     3.0
           dtype:    float64
                a     b     c
           0     0     4.3    -2.0
           1     1     7.0     5.0
           2     0    -3.0     8.0
           3     1     2.0    -2.5
                a     b     c
           0     1.5     3.0     2.0
           1     3.5     4.0     3.0
           2     1.5     1.0     4.0
           3     3.5     2.0     1.0
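The example above exercises "average" and "first" only. For completeness, here is a short sketch of the "min" and "max" tie-breaking options on the same Series; the two values of 4 occupy ranks 4 and 5, so they share rank 4 under "min" and rank 5 under "max".

```python
import pandas as pd

obj = pd.Series([7, -5, 4, 2, 0, 4])

rank_min = obj.rank(method="min")  # tied values get the lowest rank of their group
rank_max = obj.rank(method="max")  # tied values get the highest rank of their group

print(rank_min)
print(rank_max)
```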
Hierarchical indexing:
   Converting between Series and DataFrame with unstack() and stack()
   Example:   index = pd.MultiIndex.from_product([["a","b"],[1,2]],names = ["sex","age"])
       data = np.arange(4)
       data1 = pd.Series(data,index = index)
       print(data1)
       data2 = data1.unstack()
       print(data2)
       data3 = data2.stack()
       print(data3)

       sex     age
       a    1     0
           2     1
       b    1     2
           2     3
       dtype:    int32
       age    1     2
       sex
       a    0     1
       b    2     3
       sex    age
       a    1     0
           2     1
       b    1     2
           2     3
       dtype:    int32
   Creating a MultiIndex explicitly:
       Common approach: build the MultiIndex from a Cartesian product
           pd.MultiIndex.from_product([["a","b"],[1,2]])
           MultiIndex(levels=[['a', 'b'], [1, 2]],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
           Building a Series from a Cartesian product
           Example:   index = pd.MultiIndex.from_product([["a","b"],[1,2]],names = ["sex","age"])
               data = np.arange(4)
               data1 = pd.Series(data,index = index)
               print(data1)

               sex     age
               a    1     0
                    2     1
               b    1     2
                   2     3
               dtype:    int32
           Building a DataFrame from a Cartesian product
           Example:   index = pd.MultiIndex.from_product([["a","b"],[1,2]], names = ["sex","age"])
               columns = ["see","bob","lol"]
               data = np.arange(12).reshape((4,3))
               data1 = pd.DataFrame(data,index = index, columns = columns)
               print(data1)

                      see     bob     lol
               sex    age
               a    1     0     1     2
                   2     3     4     5
               b    1     6     7     8
                   2     9    10    11
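The page title mentions building a 4*4 Series-like object from a Cartesian product and converting it to a DataFrame, but the examples above use 2*2 indexes. A minimal sketch of the 4*4 case follows; the level names "outer" and "inner" and the label values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Cartesian product of two 4-element lists gives 16 (outer, inner) pairs.
index = pd.MultiIndex.from_product([list("abcd"), [1, 2, 3, 4]],
                                   names=["outer", "inner"])
series = pd.Series(np.arange(16), index=index)

# unstack() moves the inner level into the columns, giving a 4x4 DataFrame.
frame = series.unstack()

print(series)
print(frame)
```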
   Selecting and slicing with a MultiIndex:
       Series with a MultiIndex:
           Example:   index = pd.MultiIndex.from_product([["a","b"],[1,2]],names = ["single","age"])
               data = np.arange(4)
               data1 = pd.Series(data,index = index)
               print(data1)
               print("*"*10)
               print(data1["a",1])
               print("*"*10)
               print(data1["a"])
               print("*"*10)
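               print(data1.iloc[1])   #positional access; the original data1[1] relies on a positional fallback that newer pandas versions deprecate
               print(id(data1.iloc[1]))
               #partial slicing requires the MultiIndex to be sorted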
               print(data1.loc[("a",2)])
               print(id(data1.loc[("a",2)]))
               print("*"*10)
               print(data1.loc["a",2])
               print("*"*10)
               print(id(data1.loc["a",2]))

               single     age
               a    1     0
                    2     1
               b    1     2
                   2     3
               dtype: int32
               **********
               0
               **********
               age
               1     0
               2     1
               dtype: int32
               **********
               1
               2421847444384
               1
               2421847444288
               **********
               1
               **********
               2421847444288
       DataFrame with a MultiIndex:
           Example:   index = pd.MultiIndex.from_product([[2013,2014],[1,2]],names = ["year","visit"])
               columns = pd.MultiIndex.from_product([["bob","guido","sue"],["hr","temp"]],names = ["subject","type"])
               data = np.round(np.random.randn(4,6),1)
               data1 = pd.DataFrame(data,index = index,columns = columns)
               print(data1)
               print("*"*20)
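               #indexing a DataFrame the way a Series is indexed applies to its columns
               print(data1["guido","hr"])
               print("*"*20)
               #iloc selects purely by integer position, ignoring the labels at every level
               print(data1.iloc[:2,:2])
               print("*"*20)
               #loc can select on both rows and columns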
               print(data1.loc[:,("bob","hr")])
               print("*"*20)
               print(data1.loc[(2013,1),("bob","hr")])

               subject    bob         guido    sue
               type    hr    temp     hr    temp    hr    temp
               year    visit
               2013    1     1.0    -3.4     -0.6    -2.0     0.5    -0.2
                    2     0.5     2.4     -0.3    -2.1    -0.0    -0.1
               2014    1    -0.2    -0.5    0.5    -1.2    -2.0     1.8
                    2    -0.1     0.6     -1.2     0.3    -0.3    -1.2
               ********************
               year     visit
               2013     1    -0.6
                   2    -0.3
               2014     1     0.5
                   2    -1.2
               Name: (guido, hr), dtype: float64
               ********************
               subject    bob
               type    hr    temp
               year    visit
               2013    1     1.0    -3.4
                    2     0.5     2.4
               ********************
               year     visit
               2013     1     1.0
                   2     0.5
               2014     1    -0.2
                   2    -0.1
               Name: (bob, hr), dtype: float64
               ********************
               1.0
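Selecting on an inner level across all outer labels (for example, every first visit and every "hr" column) is awkward with plain tuples. A sketch of one option, pd.IndexSlice, is shown below; it is not used in the original notes, and the data here is a fixed arange rather than random numbers so the result can be checked.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=["year", "visit"])
columns = pd.MultiIndex.from_product([["bob", "guido", "sue"], ["hr", "temp"]],
                                     names=["subject", "type"])
data1 = pd.DataFrame(np.arange(24).reshape((4, 6)), index=index, columns=columns)

idx = pd.IndexSlice
# Rows: any year, visit 1. Columns: any subject, type "hr".
first_visit_hr = data1.loc[idx[:, 1], idx[:, "hr"]]
print(first_visit_hr)
```

This kind of selection requires both indexes to be sorted, which from_product guarantees here because its inputs are already in order.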
   Rearranging a MultiIndex:
       Sorted and unsorted indexes:
           Slicing fails if the MultiIndex is not sorted; the sort_index() method sorts it;
           Example:   index = pd.MultiIndex.from_product([["a","c","b"],[1,2]],names = ["year","visit"])
               data = np.random.rand(6)
               data1 = pd.Series(data,index = index)
               print(data1)
               try:
                   data1["a":"b"]
               except KeyError as e:
                   print(e)
               #use the sort_index() method
               data1 = data1.sort_index()
               print(data1["a":"b"])

               year     visit
               a    1     0.316982
                   2     0.988440
               c    1     0.553392
                   2     0.189816
               b    1     0.577293
                   2     0.894663
               dtype: float64
               'Key length (1) was greater than MultiIndex lexsort depth (0)'
               year visit
               a    1     0.316982
                   2    0.988440
               b    1     0.577293
                   2     0.894663
               dtype: float64
       Stacking and unstacking index levels:
           index = pd.MultiIndex.from_product([["a","b","c"],["d","e"],[1,2]],names = ["year","see","visit"])
           data = np.random.rand(12)
           data1 = pd.Series(data,index = index)
           print(data1)
           print("*"*20)
           print(data1.unstack(level = 1))
           print("*"*20)
           print(data1.unstack(level = 0))
           print("*"*20)
           print(data1.unstack(level = 2))

           year     see     visit
           a    d     1     0.092771
                  2     0.830678
                e     1     0.584222
                    2     0.391323
           b    d     1     0.381634
                   2     0.554762
               e     1     0.616809
                   2     0.226231
           c    d     1     0.389731
                    2     0.484459
               e     1     0.526906
                    2     0.405733
           dtype: float64
           ********************
           see     d        e
           year    visit
           a     1     0.092771     0.584222
                2     0.830678     0.391323
           b     1     0.381634     0.616809
                2     0.554762     0.226231
           c     1     0.389731     0.526906
                2     0.484459     0.405733
           ********************
           year     a        b        c
           see    visit
           d    1     0.092771     0.381634     0.389731
               2     0.830678     0.554762     0.484459
           e    1     0.584222     0.616809     0.526906
               2     0.391323     0.226231     0.405733
           ********************
           visit     1        2
           year    see
           a     d     0.092771     0.830678
                e     0.584222     0.391323
           b     d     0.381634     0.554762
                e     0.616809     0.226231
           c     d     0.389731     0.484459
               e     0.526906     0.405733
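       Setting and resetting the index:
           Another way to rearrange hierarchical data is to turn index labels into columns, which the reset_index method does;
           Example:   index = pd.MultiIndex.from_product([["a","b","c"],["d","e"],[1,2]],names = ["year","see","visit"])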
               data = np.random.rand(12)
               data1 = pd.Series(data,index = index)
               print(data1)
               print(data1.reset_index(name = "join"))

               year     see     visit
               a    d     1     0.260028
                        2     0.431808
                   e     1     0.651972
                        2     0.486900
               b    d     1     0.024664
                        2     0.466295
                   e     1     0.325269
                        2     0.175738
               c    d     1     0.507806
                        2     0.772308
                   e     1     0.913312
                        2     0.079500
               dtype:    float64
                    year    see     visit     join
               0    a    d     1     0.260028
               1    a    d     2     0.431808
               2    a    e     1     0.651972
               3    a    e     2     0.486900
               4    b    d     1     0.024664
               5    b    d     2     0.466295
               6    b    e     1     0.325269
               7    b    e     2     0.175738
               8    c    d     1     0.507806
               9    c    d     2     0.772308
               10     c    e     1     0.913312
               11     c    e     2     0.079500
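The heading covers setting as well as resetting the index, but only reset_index is shown. A minimal sketch of the reverse step, using a smaller fixed Series for brevity: set_index turns the label columns back into a MultiIndex.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["year", "visit"])
data1 = pd.Series(np.arange(4), index=index)

flat = data1.reset_index(name="join")        # index levels become ordinary columns
restored = flat.set_index(["year", "visit"])  # columns become a MultiIndex again

print(flat)
print(restored)
```

The round trip returns a one-column DataFrame rather than a Series; restored["join"] recovers the Series.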
       Aggregating data on a MultiIndex
           For hierarchically indexed data, passing the level parameter aggregates over subsets of the data;
               Example:   index = pd.MultiIndex.from_product([[2013,2014],[1,2]],names = ["year","visit"])
                   columns = pd.MultiIndex.from_product([["bob","guido","sue"],["hr","temp"]],names = ["subject","type"])
                   data = np.round(np.random.randn(4,6),1)
                   data1 = pd.DataFrame(data,index = index,columns = columns)
                   print(data1)
                   print("*"*20)
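                   #level names the index level to aggregate by; axis gives the dimension to aggregate along (the level argument of mean() was deprecated in pandas 1.3 and removed in 2.0)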
                   print(data1.mean(level = 0))
                   print("*"*20)
                   print(data1.mean(axis = 1,level = 0))
                   print("*"*20)
                   print(data1.mean(axis = 0,level = 1))
                   print("*"*20)
                   print(data1.mean(axis = 1,level = 1))

                   subject    bob         guido    sue
                   type    hr    temp     hr    temp    hr    temp
                   year    visit
                   2013    1    -0.3    -2.5     -0.8     1.4     0.7    -0.7
                        2    -0.9    -0.8    0.2    -0.8     1.1    -2.3
                   2014    1     0.2    -0.9    0.7     1.4     0.5     0.2
                        2    -0.0    -0.4    1.6     1.2    -0.5     0.4
                   ********************
                   subject         bob        guido    sue
                   type         hr     temp     hr    temp    hr    temp
                   year
                   2013         -0.6    -1.65    -0.30     0.3     0.9    -1.5
                   2014         0.1    -0.65     1.15     1.3     0.0     0.3
                   ********************
                   subject     bob     guido    sue
                   year    visit
                   2013    1    -1.40    0.30     0.00
                        2    -0.85     -0.30    -0.60
                   2014    1    -0.35    1.05     0.35
                        2    -0.20    1.40    -0.05
                   ********************
                   subject         bob         guido    sue
                   type        hr    temp     hr    temp    hr     temp
                       visit
                       1    -0.05    -1.7    -0.05     1.4     0.6    -0.25
                       2    -0.45   -0.6     0.90     0.2     0.3    -0.95
                   ********************
                   type     hr         temp
                   year    visit
                   2013    1    -0.133333    -0.600000
                        2     0.133333        -1.300000
                   2014    1     0.466667     0.233333
                        2     0.366667     0.400000
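On pandas 2.0 and later, mean(level = ...) is no longer available. The sketch below shows one way to get the same kind of result with groupby(level = ...); it uses fixed arange data instead of random numbers, and transposes for the column-wise case, which is an assumption about the most portable spelling rather than the only option.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=["year", "visit"])
columns = pd.MultiIndex.from_product([["bob", "guido", "sue"], ["hr", "temp"]],
                                     names=["subject", "type"])
data1 = pd.DataFrame(np.arange(24, dtype=float).reshape((4, 6)),
                     index=index, columns=columns)

# Counterpart of data1.mean(level=0): average the visits within each year.
by_year = data1.groupby(level="year").mean()

# Counterpart of data1.mean(axis=1, level=1): average across subjects per type.
# Transposing turns the column levels into row levels that groupby can use.
by_type = data1.T.groupby(level="type").mean().T

print(by_year)
print(by_type)
```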
Combining datasets: concat and append

Combining datasets: merge and join