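Problem scenario:
I have a CSV file with more than 7 million rows, plus a "loss" table in a database.
Problem description:
My first idea was to subtract one set from the other to get a list of the missing values, then loop over the DataFrame to pull out the matching rows.
Cause analysis:
The DataFrame turned out to be too large and the loop was far too slow, so I started wondering whether there was a better way to get the result without looping at all.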
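The loop I originally wrote is not recorded here, so the following is only a rough sketch of what a row-by-row version might look like. The toy data and the names `wanted` and `df_slow` are made up for illustration. It shows why the approach scales badly: `iterrows()` builds a separate Series object for every single row.

```python
import pandas as pd

# Hypothetical toy data; the real DataFrame has 7 million+ rows.
df = pd.DataFrame({
    "REGNO": ["A1", "A2", "A3"],
    "IMAGETYPE": ["x", "y", "z"],
})
wanted = {"A1", "A3"}  # values we want to keep (assumed for this sketch)

# Row-by-row filtering: each iteration creates a new Series for the row,
# which is what makes this slow on a very large DataFrame.
rows = []
for _, row in df.iterrows():
    if row["REGNO"] in wanted:
        rows.append(row)

df_slow = pd.DataFrame(rows)
print(df_slow)
```

On a handful of rows this is fine; the cost only becomes painful at the scale described above.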
Solution:
dfnew = df[df['REGNO'].isin(list3)]
Reference: https://blog.youkuaiyun.com/lzw2016/article/details/80472649
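To make the one-liner above concrete, here is a minimal sketch on toy data. The values and the names `loss_regnos` and `keep` are assumptions for illustration only. `isin()` builds a boolean mask in a single vectorized pass, and indexing the DataFrame with that mask keeps the whole matching rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the large CSV and the loss table.
df = pd.DataFrame({
    "REGNO": ["A1", "A2", "A3", "A4"],
    "IMAGETYPE": ["x", "y", "x", "z"],
})
loss_regnos = {"A2", "A4", "A9"}

# REGNO values that are in the CSV but not in the loss table.
keep = set(df["REGNO"]) - loss_regnos

# isin() returns a True/False Series; df[mask] selects the rows where it is True.
mask = df["REGNO"].isin(keep)
df_new = df[mask]
print(df_new)
```

Note that `df["REGNO"].isin(...)` on its own is only the mask, not the filtered data, so it has to be used as an index into `df`.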
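Addendum: when I had to process the data again later, I had forgotten how I did it the first time and paid dearly for it, so I am recording the whole script here.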
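import pymysql
import pandas as pd

# Database connections (host, password and database names are redacted).
# Newer pandas versions warn when read_sql gets a raw DBAPI connection
# instead of a SQLAlchemy connectable.
eng = pymysql.connect(host='xx', user='root', password='xx', database='xx')
eng1 = pymysql.connect(host='x', user='root', password='xx', database='xx')
# The cursor is not used below; pd.read_sql works on the connection directly.
cursor = eng.cursor()

# Load the REGNO column of the loss table and turn it into a set.
sql = 'select REGNO from xx'
df_loss = pd.read_sql(sql, eng)
# print('loss len is :{}'.format(len(df_loss)))
loss_list = df_loss['REGNO'].values.tolist()
print('loss len is :{}'.format(len(loss_list)))
set_loss = set(loss_list)

# Load the CSV and collect its REGNO values.
df_half = pd.read_csv('half_Image_data_2.csv', low_memory=False)
print('half len is :{}'.format(len(df_half)))
df_list = df_half['REGNO'].values.tolist()
set_half = set(df_list)

# REGNO values that are in the CSV but not in the loss table.
set_c = set_half - set_loss
list_new = list(set_c)
print('list_new is :{}'.format(list_new[0]))
# sql_new = 'select REGNO,IMAGETYPE from where REGNO in {}'.format(list_new)

# Index df_half with the mask so that whole rows are kept;
# isin() alone only returns a True/False Series.
df_new = df_half[df_half['REGNO'].isin(list_new)]
# df_new = pd.read_sql(sql_new, eng1)
df_new.to_csv('drop_loss_half_2.csv', index=False)
print('Processing finished')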
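As a possible simplification (not what I actually ran), the set subtraction can be folded into the filter itself: keeping the rows whose REGNO is not in the loss set should select the same rows as subtracting the sets first. The helper name `drop_loss_rows` and the toy data below are hypothetical.

```python
import pandas as pd


def drop_loss_rows(df, loss_values, key="REGNO"):
    """Return the rows of df whose `key` value does not appear in loss_values."""
    # ~ inverts the boolean mask, so no explicit set subtraction is needed.
    return df[~df[key].isin(set(loss_values))]


# Hypothetical toy data standing in for df_half and the loss table values.
df_half = pd.DataFrame({
    "REGNO": [1, 2, 3, 4],
    "IMAGETYPE": ["a", "b", "c", "d"],
})
df_kept = drop_loss_rows(df_half, [2, 4, 99])
print(df_kept)
```

This skips building `df_list`, `set_half` and `list_new`, which matters a little when the CSV has millions of rows.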