Pandas is the best-known data analysis tool in Python, and nearly everyone who works with datasets uses it. As data grows, however, some ways of performing an operation take far longer than others, so it pays to know the faster alternatives, especially on large datasets. This article covers a few tips for handling big data with Pandas; I hope you find them useful.
Data Generation
To make the walkthrough concrete, we first generate some demo data. faker is a Python package for producing fake data, and we use it directly here:
import random
from faker import Faker
fake = Faker()
car_brands = ["Audi", "Bmw", "Jaguar", "Fiat", "Mercedes", "Nissan", "Porsche", "Toyota", None]
tv_brands = ["Beko", "Lg", "Panasonic", "Samsung", "Sony"]

def generate_record():
    """Generates a fake row."""
    cid = fake.bothify(text='CID-###')
    name = fake.name()
    age = fake.random_number(digits=2)
    city = fake.city()
    plate = fake.license_plate()
    job = fake.job()
    company = fake.company()
    employed = fake.boolean(chance_of_getting_true=75)
    social_security = fake.boolean(chance_of_getting_true=90)
    healthcare = fake.boolean(chance_of_getting_true=95)
    iban = fake.iban()
    salary = fake.random_int(min=0, max=99999)
    car = random.choice(car_brands)
    tv = random.choice(tv_brands)
    record = [cid, name, age, city, plate, job, company, employed,
              social_security, healthcare, iban, salary, car, tv]
    return record
record = generate_record()
print(record)
"""
['CID-753', 'Kristy Terry', 5877566, 'North Jessicaborough', '988 XEE',
 'Engineer, control and instrumentation', 'Braun, Robinson and Shaw',
 True, True, True, 'GB57VOOS96765461230455', 27109, 'Bmw', 'Beko']
"""
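Since generate_record draws from both faker and the stdlib random module, seeding both makes a run reproducible. A minimal sketch with stdlib random only (Faker offers the analogous Faker.seed()):

```python
import random

brands = ["Audi", "Bmw", "Jaguar", "Fiat", "Toyota"]

random.seed(42)  # fix the PRNG state
first = [random.choice(brands) for _ in range(5)]

random.seed(42)  # same seed -> same sequence of picks
second = [random.choice(brands) for _ in range(5)]

print(first == second)  # True
```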
Next we create a DataFrame with 1,000,000 rows, using a multiprocessing pool to generate the records in parallel.
import os
import pandas as pd
from multiprocessing import Pool
N= 1_000_000
if __name__ == '__main__':
    cpus = os.cpu_count()
    pool = Pool(cpus - 1)
    async_results = []
    for _ in range(N):
        async_results.append(pool.apply_async(generate_record))
    pool.close()
    pool.join()
    data = []
    for async_result in async_results:
        data.append(async_result.get())
    df = pd.DataFrame(data=data, columns=["CID", "Name", "Age", "City", "Plate", "Job", "Company",
                                          "Employed", "Social_Security", "Healthcare", "Iban",
                                          "Salary", "Car", "Tv"])
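Before benchmarking, it is worth checking how much memory the frame actually occupies. With deep=True, memory_usage also counts the Python-object storage of string columns, which dominates in a frame like this one. A small sketch on a toy frame (column names borrowed from the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "CID": ["CID-753", "CID-101"],
    "Salary": [27109, 54000],
})

# deep=True walks each object column and includes the string payloads
shallow = df.memory_usage().sum()
deep = df.memory_usage(deep=True).sum()
print(deep >= shallow)  # True: deep accounting is never smaller
```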
Disk IO
Pandas can save a DataFrame in several different formats. Let's compare how fast they are.
#Write
%timeit df.to_csv("df.csv")
#3.77 s ± 339 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.to_pickle("df.pickle")
#948 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.to_parquet("df")
#2.77 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.to_feather("df.feather")
#368 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# datatable can write CSV considerably faster than pandas
import datatable as dt

def write_table(df):
    dtf = dt.Frame(df)
    dtf.to_csv("df_.csv")

%timeit write_table(df)
#559 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Read
%timeit df = pd.read_csv("df.csv")
#1.89 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df = pd.read_pickle("df.pickle")
#402 ms ± 6
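Beyond raw speed, the formats also differ in what they preserve: text CSV loses extended dtypes on a round trip, while binary formats keep them. A small sketch, using pickle as a stand-in for the binary formats and a categorical column like Car or Tv above:

```python
import io
import pickle
import pandas as pd

# a categorical column, as Car or Tv in the example frame could be stored
df = pd.DataFrame({"Car": pd.Categorical(["Bmw", "Audi", "Bmw"])})

# CSV round-trip: the category dtype degrades to plain object
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_csv = pd.read_csv(buf)

# binary round-trip (pickle here) preserves the dtype exactly
df_bin = pickle.loads(pickle.dumps(df))

print(df_csv["Car"].dtype, df_bin["Car"].dtype)  # object category
```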