Common numpy and pandas operations
Preface
import numpy as np
import pandas as pd
I. numpy operations
1. Element-wise addition, subtraction, multiplication, and division on a numpy array:
v = np.array([1,2,3,4,5])
Addition: v+1
Result: array([2, 3, 4, 5, 6])
Subtraction: v-1
Result: array([0, 1, 2, 3, 4])
Multiplication: v*2
Result: array([ 2, 4, 6, 8, 10])
Division: v/10
Result: array([0.1, 0.2, 0.3, 0.4, 0.5])
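These operators also work element-wise between two arrays of the same shape; a minimal sketch (w is an extra array introduced here just for illustration):
w = np.array([10, 20, 30, 40, 50])
v + w   # array([11, 22, 33, 44, 55])
v * w   # array([ 10,  40,  90, 160, 250])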
2. Generating numpy arrays with a specified shape
https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.rand.html
Generate a 3-row, 2-column array of random numbers:
np.random.rand(3, 2)
Result:
array([[0.98776843, 0.33816671],
[0.2125091 , 0.63436048],
[0.54017721, 0.37424256]])
Generate an array filled with 1s, with a given shape and data type:
This creates a 3-row, 1-column array:
np.ones((3, 1), dtype=np.float32)
Result:
array([[1.],
[1.],
[1.]], dtype=float32)
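Other constructors follow the same shape convention; a short sketch using np.zeros, np.full, and arange plus reshape:
np.zeros((2, 3))              # 2x3 array of zeros
np.full((2, 3), 7)            # 2x3 array filled with 7
np.arange(6).reshape(2, 3)    # array([[0, 1, 2], [3, 4, 5]])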
Specify a different data type for each field when declaring an array (a structured array):
import numpy as np
dtype = [('Name', 'S10'), ('Height', float), ('Age', int)]
values = [('Li', 1.8, 41), ('Wang', 1.9, 38),('Duan', 1.7, 38)]
# [('Li', 1.8, 41), ('Wang', 1.9, 38), ('Duan', 1.7, 38)]
a = np.array(values, dtype=dtype)
# array([(b'Li', 1.8, 41), (b'Wang', 1.9, 38), (b'Duan', 1.7, 38)], dtype=[('Name', 'S10'), ('Height', '<f8'), ('Age', '<i4')])
You can then sort by a named field: np.sort(a, order='Height') (a short sketch of the result follows below).
Reference: https://blog.youkuaiyun.com/haiyang_duan/article/details/79221458
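For reference, a sketch of sorting this structured array by Height, which should order the rows Duan, Li, Wang:
np.sort(a, order='Height')
# array([(b'Duan', 1.7, 38), (b'Li', 1.8, 41), (b'Wang', 1.9, 38)],
#       dtype=[('Name', 'S10'), ('Height', '<f8'), ('Age', '<i4')])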
3. Matrix multiplication
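A minimal sketch: for 2-D arrays, np.dot or the @ operator computes the matrix product, while * multiplies element-wise.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
A @ B           # matrix product: array([[19, 22], [43, 50]])
np.dot(A, B)    # same result as A @ B
A * B           # element-wise product: array([[ 5, 12], [21, 32]])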
4. Finding the positions where an array equals a given value
label1=np.array([1,2,3,4,5,6,7,1,2,3,4,5,6,7])
label1==2
The returned data is:
array([False, True, False, False, False, False, False, False, True,
False, False, False, False, False])
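The comparison only returns a boolean mask; to get the actual index positions, np.where can be applied to the mask. A short sketch:
np.where(label1 == 2)    # (array([1, 8]),) -- the positions holding the value 2
label1[label1 == 2]      # array([2, 2]) -- the matching values themselves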
5. Row/column sums of a matrix
https://blog.youkuaiyun.com/lwplwf/article/details/55506896
test1 = np.array([[5, 10, 15],
[20, 25, 30],
[35, 40, 45]])
test1.sum(axis=1)
# output: array([ 30,  75, 120])
test1.sum(axis=0)
# output: array([60, 75, 90])
II. pandas operations
1. Reading csv/excel and building a DataFrame from a dict
1) Reading csv/excel
https://blog.youkuaiyun.com/sinat_29957455/article/details/79054126
df = pd.read_csv('XXX.csv')
df = pd.read_excel('excel.xls')
df = pd.DataFrame.from_dict({'source':keysarr,'target':valuesarr})
2) Writing to csv
df5.to_csv("C:/Users/17646/Desktop/result5.csv",sep=',')
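By default to_csv also writes the DataFrame index as an extra column; passing index=False drops it, sketched here on the same call:
df5.to_csv("C:/Users/17646/Desktop/result5.csv", sep=',', index=False)   # omit the index column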
2. Reading MongoDB data into a DataFrame
1) Reading data from MongoDB
import pandas as pd
import pymongo
client = pymongo.MongoClient("127.0.0.1:27017")
# client = pymongo.MongoClient('localhost',27017)
db = client['lvsuo']
collection = db["total2"]
data = pd.DataFrame(list(collection.find()))
2) Inserting DataFrame data into MongoDB
When I tried this, the _id field read back from MongoDB could not be serialized into JSON, so I ended up keeping only the specific fields I needed and saving those back to MongoDB.
import json
db["withIndex"].insert_many(list(json.loads(df1.T.to_json()).values()))
3. Inspecting data
View the first 5 rows (pass an argument to view a different number of rows): df.head()
Check how many rows and columns there are: df.shape
List all column names: df.columns
Show the full summary: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
State 3333 non-null object
Account length 3333 non-null int64
Area code 3333 non-null int64
International plan 3333 non-null object
Voice mail plan 3333 non-null object
Number vmail messages 3333 non-null int64
Total day minutes 3333 non-null float64
Total day calls 3333 non-null int64
Total day charge 3333 non-null float64
Total eve minutes 3333 non-null float64
Total eve calls 3333 non-null int64
Total eve charge 3333 non-null float64
Total night minutes 3333 non-null float64
Total night calls 3333 non-null int64
Total night charge 3333 non-null float64
Total intl minutes 3333 non-null float64
Total intl calls 3333 non-null int64
Total intl charge 3333 non-null float64
Customer service calls 3333 non-null int64
Churn 3333 non-null bool
dtypes: bool(1), float64(8), int64(8), object(3)
memory usage: 498.1+ KB
None
View the basic statistics of the data:
df.describe() shows statistics for the float64 and int64 features:
 | Account length | Area code | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 |
mean | 101.064806 | 437.182418 | 8.099010 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.083540 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
std | 39.822106 | 42.371290 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.791840 | 2.461214 | 0.753773 | 1.315491 |
min | 1.000000 | 408.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 23.200000 | 33.000000 | 1.040000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 74.000000 | 408.000000 | 0.000000 | 143.700000 | 87.000000 | 24.430000 | 166.600000 | 87.000000 | 14.160000 | 167.000000 | 87.000000 | 7.520000 | 8.500000 | 3.000000 | 2.300000 | 1.000000 |
50% | 101.000000 | 415.000000 | 0.000000 | 179.400000 | 101.000000 | 30.500000 | 201.400000 | 100.000000 | 17.120000 | 201.200000 | 100.000000 | 9.050000 | 10.300000 | 4.000000 | 2.780000 | 1.000000 |
75% | 127.000000 | 510.000000 | 20.000000 | 216.400000 | 114.000000 | 36.790000 | 235.300000 | 114.000000 | 20.000000 | 235.300000 | 113.000000 | 10.590000 | 12.100000 | 6.000000 | 3.270000 | 2.000000 |
max | 243.000000 | 510.000000 | 51.000000 | 350.800000 | 165.000000 | 59.640000 | 363.700000 | 170.000000 | 30.910000 | 395.000000 | 175.000000 | 17.770000 | 20.000000 | 20.000000 | 5.400000 | 9.000000 |
View the basic statistics for non-numeric data:
df.describe(include=['object', 'bool'])
 | State | International plan |
---|---|---|
count | 3333 | 3333 |
unique | 51 | 2 |
top | WV | No |
freq | 106 | 3010 |
For object and bool features, count how many times each distinct value appears in a given column:
df['Churn'].value_counts()
0 2850
1 483
Name: Churn, dtype: int64
Using df['Churn'].value_counts(normalize=True) turns the counts into fractions:
0 0.855086
1 0.144914
Name: Churn, dtype: float64
4. Data type operations
Which data types are there? The common ones are int64, float64, bool, object (strings), datetime64, and category.
Convert a given column to int64: df['Churn'] = df['Churn'].astype('int64')
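Two related operations, sketched here: df.dtypes lists every column's dtype, and astype also accepts a dict to convert several columns at once.
df.dtypes                            # dtype of every column
df = df.astype({'Churn': 'int64'})   # astype also accepts a dict of column -> dtype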
5. Filtering data
Select specific columns: df[['col1','col2','col3']]; the result is still a DataFrame.
1) Numeric comparisons:
Important: before comparing, make sure the data types on both sides match (see the sketch at the end of this subsection).
df[df['dists']==123], df[df['dists']>=123], df[df['dists']>123], and so on.
Get the row(s) holding the minimum value: df[df['dists']==df['dists'].min()]
To return only certain columns:
df.loc[df['fromId']=='9c91082ab89ab3d157a4d0174b9c7ffa',['fromName','toName','money','minFormatDt']]  (df.ix is deprecated; use df.loc)
or df[df['fromId']=='9c91082ab89ab3d157a4d0174b9c7ffa'][['fromName','toName','money','minFormatDt']]
To filter on several conditions, use & and | and wrap each condition in parentheses, for example
data1[(data1['fromId']=='004574fe986ed418066a581792a20368') & (data1['minFormatDt']>=1504546860)]
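As noted above, comparisons can silently match nothing (or raise an error) when the column's dtype does not match the value being compared; a sketch of checking and converting first, reusing the 'dists' column from above:
df['dists'].dtype                          # check the column's dtype first
df['dists'] = df['dists'].astype('int64')  # or pd.to_numeric(df['dists'])
df[df['dists'] >= 123]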
2) Removing duplicate rows
df.drop_duplicates(['name','address','area'])
df.drop_duplicates(subset=None, keep='first', inplace=False)
data.drop_duplicates()           # removes a row only when all of its values duplicate another row
data.drop_duplicates(['a','b'])  # removes duplicates based on the combination of columns 'a' and 'b'; keeps the first occurrence by default, pass keep='last' to keep the last one
Source: <https://www.cnblogs.com/zenan/p/8616508.html>
3) Getting a specific row
By integer position: df.iloc[[2]] (df.iloc[2] also works, returning a Series instead of a DataFrame)
By index label: df.loc[[2]]
In [2]: df
Out[2]:
A B
0 1.068932 -0.794307
2 -0.470056 1.192211
4 -0.284561 0.756029
6 1.037563 -0.267820
8 -0.538478 -0.800654
In [5]: df.iloc[[2]]
Out[5]:
A B
4 -0.284561 0.756029
In [6]: df.loc[[2]]
Out[6]:
A B
2 -0.470056 1.192211
4) Filling NaN or null values
Fill every NaN with 0:
data.fillna(0)
Fill different columns with different values:
data.fillna({'age':0,'money':1000})
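Another common variant, sketched here, fills a numeric column with its own mean:
data['age'] = data['age'].fillna(data['age'].mean())   # fill NaN in 'age' with the column mean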
6. Aggregating with pandas
1) Aggregating with groupby
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html
df.groupby(['fromId','toId']).agg({'money':['count','sum']}).reset_index()
The result of this expression looks like:
 | fromId | toId | money (count) | money (sum) |
---|---|---|---|---|
0 | XX | XXX | 15 | 3000000.00 |
An alternative form gives the result below; the only difference is that here only the money column is aggregated, so the resulting columns are flat rather than a MultiIndex:
df2 = df.groupby(['fromId','toId'])['money'].agg(['count','sum']).reset_index()
 | fromId | toId | count | sum |
---|---|---|---|---|
0 | XX | XXX | 15 | 3000000.00 |
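The first form above leaves a MultiIndex on the columns (money/count, money/sum); one way to flatten it, sketched here, is to join the two levels into single names:
g = df.groupby(['fromId','toId']).agg({'money': ['count','sum']}).reset_index()
g.columns = ['_'.join(c).strip('_') for c in g.columns]   # fromId, toId, money_count, money_sum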
7. Applying a function with apply
df.apply(function, axis)
import time
def appl(row):
    # pad the date string in column 'jyrq' out to 14 digits (YYYYMMDDHHMMSS)
    rq = (row['jyrq'] + '000000').replace('t', '0')[0:14]
    # parse the string and convert it to a Unix timestamp
    ts = int(time.mktime(time.strptime(rq, "%Y%m%d%H%M%S")))
    return ts
8. Constructing a DataFrame
From a list of lists:
d1 = pd.DataFrame([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
9. Transposing rows and columns
dataframe.T is all it takes to transpose a DataFrame.
10. DataFrame join operations
Remember to call set_index on the right-hand DataFrame first; otherwise the join uses its original index by default.
df7 = df6[["id","rank"]].join(df5.set_index("id"),how="left",on="id")
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.join.html
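pd.merge expresses the same left join without the set_index step; a sketch, assuming both frames carry an id column:
df7 = pd.merge(df6[["id", "rank"]], df5, on="id", how="left")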
11. Sorting
sort_values
ids.sort_values(by="index", axis=0, ascending=True)
https://blog.youkuaiyun.com/wendaomudong_l2d4/article/details/80648633
12. Sampling
# randomly select k rows
df.sample(n=k)
# sample a fraction of the rows
df.sample(frac=0.5)
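Passing random_state makes the sample reproducible, sketched here:
df.sample(n=10, random_state=42)   # the same 10 rows on every run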
Converting time strings to timestamps by applying the appl function from section 7 to each row:
df3['ndate'] = df3.apply(appl, axis=1)