The size method in pandas: getting basic counts of each value using the chunksize parameter

The author has a large CSV file and wants to count how many times each distinct value appears in each column. The original code used groupby, which ran into trouble on the large file. The solution is to apply value_counts column by column; for a large file, read it in chunks with chunksize, then sum the partial results and fill missing counts with 0.

I have a CSV file with the following columns: item1, item2, item3, item4, whose values are each exactly one of the following: 0, 1, 2, 3, 4.

For each item column, I would like to count how many rows there are for each value.

My code is the following, df being the corresponding DataFrame:

outputDf = pandas.DataFrame()
cat_list = list(df.columns.values)
for col in cat_list:
    s = df.groupby(col).size()
    outputDf[col] = s

I would like to do exactly the same using the chunksize parameter when I read my CSV with read_csv, because my CSV is very big.

My problem is that I can't find a way to determine cat_list, nor to build outputDf, when reading in chunks.

Can someone give me a hint?

Solution

I'd apply value_counts columnwise rather than doing groupby:

>>> df = pd.read_csv("basic.csv", usecols=["item1", "item2", "item3", "item4"])
>>> df.apply(pd.value_counts)
   item1  item2  item3  item4
0     17     26     17     20
1     21     21     22     19
2     17     18     22     23
3     24     14     20     24
4     21     21     19     14
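
A side note, not part of the original answer: newer pandas releases deprecate the top-level pd.value_counts in favour of the Series method, so the same column-wise table can also be produced with:

>>> df.apply(lambda col: col.value_counts())  # equivalent, counting via Series.value_counts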

And for the chunked version, we just need to assemble the parts (making sure to fillna(0), so that if a chunk doesn't have a 3 in one column, for example, we get 0 and not NaN).

>>> df_iter = pd.read_csv("basic.csv", usecols=["item1", "item2", "item3", "item4"], chunksize=10)
>>> sum(c.apply(pd.value_counts).fillna(0) for c in df_iter)
   item1  item2  item3  item4
0     17     26     17     20
1     21     21     22     19
2     17     18     22     23
3     24     14     20     24
4     21     21     19     14

(Of course, in practice you'd probably want to use as large a chunksize as you can get away with.)
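
To tie this back to the question's cat_list and outputDf, here is a minimal sketch, not from the original answer, assuming the file is called basic.csv, that all of its columns are item columns, and that every value is one of 0-4 as the question states; the chunksize of 100000 is arbitrary. Reindexing each partial table onto the known values keeps every chunk's result the same shape, so the plain sum cannot produce NaN even if a value is missing from an entire chunk:

import pandas as pd

# The question says every item column holds only these values.
values = [0, 1, 2, 3, 4]

# Read only the header row to recover the column names without loading the data.
cat_list = list(pd.read_csv("basic.csv", nrows=0).columns)

# Stream the file in chunks, count the values in each column of each chunk,
# reindex onto the known values (missing ones become 0), and add the parts up.
df_iter = pd.read_csv("basic.csv", usecols=cat_list, chunksize=100000)
outputDf = sum(c.apply(pd.value_counts).reindex(values).fillna(0).astype(int)
               for c in df_iter)

print(outputDf)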
