Pandas: Indexing and selecting data
1.Introduction
In this article, I will summarize the various indexing methods in Pandas. The primary focus will be on Series and DataFrame.
Series is a one-dimensional labeled array capable of holding any data type. The axis labels are collectively referred to as the index.
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series object.
I will not discuss the details of these two types here. You can get more information about Series and DataFrame in pandas documentation.
Reference Document Links: https://pandas.pydata.org/docs/user_guide
2.Basics
Object Type | Selection | Return Value Type |
---|---|---|
Series | series[label] | scalar value |
DataFrame | frame[colname] | Series (corresponding to colname) |
2.1 basic indexing of Series
2.1.1 obj[…]
The basic indexing (obj[…]) of Series is similar to the indexing of ndarrays, except that the index values of Series can be more than just integers.
obj = pd.Series(np.arange(4),index=['a', 'b', 'c', 'd'])
'''
a 0
b 1
c 2
d 3
dtype: int32
'''
obj.index # Index(['a', 'b', 'c', 'd'], dtype='object')
I think the most confusing thing for the beginners is that you can index data in Series with both labels and positional subscripts. These two types of indexing can be respectively called label indexing and location indexing.
obj['b'] # 1
obj[1] #1
obj[['b', 'a', 'd']]
'''
b 1
a 0
d 3
dtype: int32
'''
obj[[1,0,3]]
'''
b 1
a 0
d 3
dtype: int32
'''
Of course you can also set the label to an integer. But this is not recommended, because you will not be able to distinguish between the label indexing and location indexing.
2.1.2 slicing
You can also use slicing to select data in Series.
obj['a':'c']
'''
a 0
b 1
c 2
dtype: int32
'''
obj[0:2]
'''
a 0
b 1
dtype: int32
'''
Can you spot the difference between slicing with subscripts and slicing with labels?
You should note that when slicing with labels, both the start bound AND the stop bound are included, if present in the index. BUT when slicing with subscripts, the stop bound are not included!
2.1.3 boolean indexing
Boolean indexing are still vaild in Pandas.
obj[obj<2]
'''
a 0
b 1
dtype: int32
'''
2.2 basic indexing of DataFrame
Select columns:
you can use a single value or sequence to select out one or more columns from DataFrame.
data = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
'''
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
'''
#use a single value
data['two']
'''
the result is a Series
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
'''
#use a sequence
data[['three','one']]
'''
the result is a DataFrame
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
'''
Select rows:
you can use slicing or boolean indexing to select out one or more rows.
#note the difference between slicing with labels and subscripts
data[:2]
'''
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
'''
data[:'Utah']
'''
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
'''
data[data['three']>5]
'''
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
'''
3. Selection by label (.loc)
pandas allows you to select a subset of rows and columns of an array from a DataFrame in NumPy style syntax.
The .loc attribute is the primary access method. The following are valid inputs:
- A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.)
- A list or array of labels [‘a’,’b’,’c’]
- A slice object with labels ‘a’:‘f’
- A boolean array.
#Select a single value
data.loc['Colorado','two'] # 5
#Select single row and multiple columns
data.loc['Colorado',['two','three']]
'''
the result is a Series
two 5
three 6
Name: Colorado, dtype: int32
'''
#Select multiple rows and single column
data.loc[['Colorado','New York'],'two']
'''
the result is a Series
Colorado 5
New York 13
Name: two, dtype: int32
'''
#Select multiple rows and multiple columns
data.loc[['Colorado','New York'],['two','three']]
'''
the result is a DataFrame
two three
Colorado 5 6
New York 13 14
'''
#you also can use slicing
data.loc[:'Utah','two']
'''
Ohio 1
Colorado 5
Utah 9
Name: two, dtype: int32
'''
4. Selection by position (.iloc)
pandas provides a suit of methods in order to get pure;y integer based indexing.
The .iloc attribute is the primary access method. The following are valid inputs:
- An integer e.g. 5.
- A list or array of integers [4, 3, 0].
- A slice object with ints 1:7.
- A boolean array.
#Select a single value
data.iloc[1,1] #5
#Select single row and multiple columns
data.iloc[1,[1,2]]
'''
two 5
three 6
Name: Colorado, dtype: int32
'''
#Select multiple rows and single column
data.iloc[[1,3],1]
'''
Colorado 5
New York 13
Name: two, dtype: int32
'''
#Select multiple rows and multiple columns
data.iloc[[1,3],[1,2]]
'''
two three
Colorado 5 6
New York 13 14
'''
#you also can use slicing
data.iloc[:2,1]
'''
Ohio 1
Colorado 5
Name: two, dtype: int32
'''
5. Combining positional and label-based indexing
If you wish to get the 0th and the 2nd elements from the index in the ‘A’ column, you can do:
dfd = pd.DataFrame({'A': [1, 2, 3],
'B': [4, 5, 6]},
index=list('abc'))
'''
A B
a 1 4
b 2 5
c 3 6
'''
dfd.loc[dfd.index[[0, 2]], 'A']
'''
a 1
c 3
Name: A, dtype: int64
'''
This can also be expressed using .iloc, by explicitly getting locations on the indexers, and using positional indexing to select things.
dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
'''
a 1
c 3
Name: A, dtype: int64
'''
For getting multiple indexers, using .get_indexer
dfd.columns.get_indexer(['A', 'B'])
# array([0, 1], dtype=int64)
dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
'''
A B
a 1 4
c 3 6
'''
oc[[0, 2], dfd.columns.get_loc(‘A’)]
‘’’
a 1
c 3
Name: A, dtype: int64
‘’’
For getting *multiple* indexers, using **.get_indexer**
```python
dfd.columns.get_indexer(['A', 'B'])
# array([0, 1], dtype=int64)
dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
'''
A B
a 1 4
c 3 6
'''