
Pandas
楓尘林间
Computer enthusiast
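Data exploration with Python pandas
Source: Kaggle Lending Club Loan Data, visualization analysis and bad-loan prediction.
# Handling missing feature values
A function that computes the share of missing values per feature:
def draw_missing_data_table(data):
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum() / data.shape[0]).sort_values(ascending=False)
    m...
Original · 2021-07-06 15:46:36 · 528 views · 0 comments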
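The preview above cuts off partway through the function. A minimal sketch of a missing-value table in the same spirit follows; the function name `missing_data_table`, the output column names and the toy DataFrame are assumptions for illustration, not the original article's code.

```python
import numpy as np
import pandas as pd


def missing_data_table(data: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and missing ratios, worst first."""
    total = data.isnull().sum()           # number of missing cells per column
    percent = total / len(data)           # share of rows that are missing
    table = pd.concat([total, percent], axis=1, keys=["total", "percent"])
    return table.sort_values("total", ascending=False)


# Hypothetical toy data: "a" has 2 missing values, "c" has 1, "b" has none.
df = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 2, 3, 4],
})
table = missing_data_table(df)
print(table)
```

Columns with a high ratio are the usual candidates for dropping or imputation.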
Garbled Chinese characters when plotting with sns or plt under Anaconda
sns.set_style("whitegrid") must be commented out or deleted; otherwise every configuration change fails (a lesson learned the hard way).
import seaborn as sns
sns.set_style({'font.sans-serif': ['SimHei']})
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = Fal...
Original · 2020-10-30 21:17:57 · 2758 views · 2 comments
Printing random forest and XGBoost feature importances alongside feature names
# XX is the training feature table, as a DataFrame
feature_names = XX.columns.tolist()
feature_names = np.array(feature_names)
feature_importances = clf.feature_importances_
indices = np.argsort(feature_importances)[::-1]
x = feature_importances[indices]
y = feature_names[indices]
Original · 2020-10-30 18:25:29 · 1005 views · 0 comments
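To try the pairing without a trained model, here is a small sketch. The feature names and the importance values are made up and simply stand in for `XX.columns` and `clf.feature_importances_`.

```python
import numpy as np

# Hypothetical stand-ins for XX.columns and clf.feature_importances_.
feature_names = np.array(["age", "income", "tenure", "balance"])
feature_importances = np.array([0.10, 0.55, 0.05, 0.30])

# argsort gives ascending positions; [::-1] flips them to descending order.
indices = np.argsort(feature_importances)[::-1]

# Index both arrays with the same positions so names and scores stay aligned.
ranked = list(zip(feature_names[indices].tolist(),
                  feature_importances[indices].tolist()))
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Indexing both arrays with the same `indices` is what keeps each score next to its own feature name.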
Applying a custom function to a DataFrame column in pandas
import math
def scale(x):
    if x > 2:
        x = int(math.log(float(x))**2)
    return x
data_value['I5'] = data_value['I5'].apply(scale)
Original data: data_value.head(10)
After applying the function: data_value.head(10)
Original · 2020-10-11 22:29:11 · 8309 views · 2 comments
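A runnable version of the same idea on a small made-up column. The values in `data_value` are assumptions, and the result goes into a new column `I5_scaled` so the input stays visible.

```python
import math

import pandas as pd


def scale(x):
    # Values above 2 are compressed to int(ln(x)^2); smaller values pass through.
    if x > 2:
        x = int(math.log(float(x)) ** 2)
    return x


# Hypothetical sample data for column I5.
data_value = pd.DataFrame({"I5": [1, 2, 10, 100, 1000]})
data_value["I5_scaled"] = data_value["I5"].apply(scale)
print(data_value)
```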
Duplicating a DataFrame in pandas
Reposted from: https://www.it1352.com/1689965.html
import numpy as np
import pandas as pd
arr = np.array([[1,0,1,1,1,5],
                [0,0,0,0,1,3],
                [1,0,0,0,1,1],
                [1,0,0,1,1,1],
                [1,0,0,0,1,1],
                [1,1,0,0,1,1]])
df = pd.DataFrame( np.repeat(...
Reposted · 2020-10-10 15:22:42 · 3425 views · 0 comments
The numpy.dtype.kind attribute in Python
numpy.dtype.kind
dtype.kind
A character code (one of 'biufcmMOSUV') identifying the general kind of data.
b  boolean
i  signed integer
u  unsigned integer
f  floating-point
c  complex floating-point
m  timedelta
M  datetime
O  object
S  (byte-)string
U  Un...
Original · 2020-08-24 14:54:01 · 1665 views · 0 comments
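A quick way to see these codes is to ask a few dtypes for their `kind`. The particular dtype strings below are just examples chosen for this sketch.

```python
import numpy as np

# Example dtype specifications, one per kind code listed above.
specs = ["bool", "int64", "uint8", "float32", "complex128",
         "timedelta64[s]", "datetime64[D]", "object", "S5", "U5"]

kinds = {spec: np.dtype(spec).kind for spec in specs}
for spec, kind in kinds.items():
    print(f"{spec:>16} -> {kind}")

# A common use: test for "any numeric type" without listing every width.
is_numeric = np.dtype("float32").kind in "biufc"
```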
Using numpy argsort to pick elements of list2 in the sorted order of list1
import numpy as np
list1 = np.array([4,1,5,8,7,9,2])
list2 = np.array(["44","11","55","88","77","99","22"])
# For descending order, add [::-1]
sort_1 = np.argsort(list1)[::-1]
res = list2[sort_1]
print(res)
Result: ['99' '88' '77' '55' '44' '22' '11']...
Original · 2020-05-26 23:12:34 · 316 views · 0 comments
[Repost] Two small ways to convert lists into a dict in Python
1. Converting two lists
list1 = ['key1','key2','key3']
list2 = ['1','2','3']
Turn them into this dict: {'key1':'1','key2':'2','key3':'3'}
list1 = ['key1','key2','key3']
list2 = ['1','2','3']
dict(zip(list1,list2))
{'key1':'1','key2':'2','key3':'3'}
2. Converting a nested list
There are two ways:
new_list= [['key1',...
Reposted · 2020-05-26 21:45:59 · 241 views · 0 comments
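Both conversions can be sketched as below. The nested-list part of the preview is cut off, so the `pairs` data and the two approaches shown for it (`dict()` and a dict comprehension) are assumptions and may differ from the original post.

```python
# 1. Two parallel lists: zip pairs them up, dict() builds the mapping.
list1 = ["key1", "key2", "key3"]
list2 = ["1", "2", "3"]
paired = dict(zip(list1, list2))

# 2. A nested list of [key, value] pairs (hypothetical data).
pairs = [["key1", "1"], ["key2", "2"], ["key3", "3"]]
from_dict_call = dict(pairs)                      # dict() accepts 2-item sequences
from_comprehension = {k: v for k, v in pairs}     # same result, spelled out

print(paired, from_dict_call, from_comprehension)
```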
Computing IV (information value) with pandas
def calculateIV(train_data, label_columnName):
    '''
    @description
    @param label_columnName: column name of label
    @return
    train_data: pd.DataFrame, include label
    test_data: pd.DataFrame
    '''
    import math
    print("calcul...
Original · 2020-05-26 16:43:01 · 1546 views · 1 comment
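The body of `calculateIV` is not visible in the preview. As a rough sketch of the usual definition, IV sums (share of label 1 minus share of label 0) times WoE over the bins, with WoE = ln(share of label 1 / share of label 0). The function name `information_value`, the column names and the toy data are assumptions, and sign conventions for WoE vary between authors.

```python
import math

import numpy as np
import pandas as pd


def information_value(df: pd.DataFrame, feature: str, label: str) -> float:
    """IV of an already-binned feature against a 0/1 label."""
    counts = pd.crosstab(df[feature], df[label])   # rows: bins, columns: 0 and 1
    dist_1 = counts[1] / counts[1].sum()           # share of label-1 rows per bin
    dist_0 = counts[0] / counts[0].sum()           # share of label-0 rows per bin
    woe = np.log(dist_1 / dist_0)                  # infinite if a bin has a zero count
    return float(((dist_1 - dist_0) * woe).sum())


# Hypothetical data: bin A holds 3 positives and 1 negative, bin B the reverse.
toy = pd.DataFrame({
    "bin": ["A"] * 4 + ["B"] * 4,
    "label": [1, 1, 1, 0, 1, 0, 0, 0],
})
iv = information_value(toy, "bin", "label")
print(iv)
```

Bins with a zero count for either label give an infinite WoE, which is why smoothed counts (for example adding 1 to each count) are often used in practice.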
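Implementing a string2index method in Python
import pandas as pd
### d is passed by reference (a shallow copy: only the address of the data is copied, not the data itself). Once d[cname] = d[cname].map(dic) has run, the input d has changed; even without a return value, d outside the function has changed too.
def string2index(d, cname_list):
    # Null values are not assigned an index
    for cname in cname_list:
        counts = d[cname].value_counts()
        s = counts.index.toli...
Original · 2020-05-26 14:53:56 · 307 views · 0 comments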
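A sketch of how such a function could be completed, and of the in-place effect the comment describes. The mapping rule (most frequent value gets index 0) and the sample column are assumptions, since the original body is truncated.

```python
import numpy as np
import pandas as pd


def string2index(d, cname_list):
    """Replace string categories with integer codes, modifying d in place."""
    for cname in cname_list:
        counts = d[cname].value_counts()      # NaN is excluded by default
        values = counts.index.tolist()        # most frequent value first
        mapping = {value: idx for idx, value in enumerate(values)}
        d[cname] = d[cname].map(mapping)      # NaN is not in mapping, so it stays NaN


# Hypothetical data with one missing entry.
df = pd.DataFrame({"city": ["a", "b", "a", np.nan, "a", "b", "c"]})
returned = string2index(df, ["city"])         # returns None; df itself is changed
print(df)
```

If the caller's DataFrame should stay untouched, pass `df.copy()` instead.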
Plotting a Pearson correlation heatmap with pandas
The function that computes Pearson correlation coefficients: data.corr()
This method supports null values: np.nan
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.DataFrame({"A":[np.nan,2,9], "B":[4,14,6], "c":[987,8,9]})
f, ax = plt.subplots(figsize = (14, 10))
corr = data.corr()
# print(...
Original · 2020-05-25 17:03:43 · 13628 views · 0 comments
Use np.log, not math.log, to take the log of a pandas column
When computing the WoE metric:
import numpy as np
import math
import pandas as pd
df_min_max_bin["ok"] = ((df_min_max_bin["nums_label_1"]+1)/sum_label_1)/((df_min_max_bin["nums_label_0"]+1)/sum_label_0)
df_min_max_bin["WoE"] = math.log(df_min_max_bin["ok"])
This raises an error: T...
Original · 2020-05-14 20:51:38 · 7673 views · 0 comments
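The difference is easy to reproduce on a small Series: `math.log` expects a single number, while `np.log` works element-wise. The `ratio` values below are made up.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical ratios standing in for df_min_max_bin["ok"].
ratio = pd.Series([0.5, 1.0, 2.0])

try:
    math.log(ratio)              # math.log cannot convert a multi-element Series
    math_log_failed = False
except TypeError as exc:
    math_log_failed = True
    print("math.log failed:", exc)

woe = np.log(ratio)              # element-wise natural log, returns a Series
print(woe)
```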
Stacking pandas DataFrames vertically
1. mn = ab.append(cd)
ab = pd.DataFrame({"aa":[1,2,3,np.nan,np.nan],"bb":[11,22,33,44,55]})
print(ab)
cd = pd.DataFrame({"aa":[1,2,5,123,4546],"bb":[11,22,55,111,222]})
print(cd)
mn = ab.append(cd, ignore_index = True)
print(mn)
Always assign the result, because append does not overwrite ab but ...
Original · 2020-05-14 20:32:38 · 11046 views · 1 comment
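`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same stacking is usually written with `pd.concat`. A sketch using the two frames from the post:

```python
import numpy as np
import pandas as pd

ab = pd.DataFrame({"aa": [1, 2, 3, np.nan, np.nan], "bb": [11, 22, 33, 44, 55]})
cd = pd.DataFrame({"aa": [1, 2, 5, 123, 4546], "bb": [11, 22, 55, 111, 222]})

# concat returns a new frame; ab and cd are left unchanged,
# so the result has to be assigned just as with append.
mn = pd.concat([ab, cd], ignore_index=True)
print(mn)
```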
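Making column c depend on columns a and b in pandas
Checking whether a single value in a df is null
Correct usage: pd.notnull(s.iloc[1,1])
Wrong approaches:
1. s.iloc[1,1].empty
2. s.iloc[1,1].isnull()
3. s.iloc[1,1]==np.nan
Making column c depend on columns a and b
In this example, column bb depends on columns cc and a
deo = pd.DataFrame({"a":[1,2,3,np.nan,5],"bb":[0,0,0,0,0], "cc":[22,22,22,22,22]})
deo.head()
Original · 2020-05-14 11:43:49 · 565 views · 0 comments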
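The preview does not show which rule links the columns, so the sketch below assumes a made-up one: bb is a + cc where a is present, and cc alone where a is null. It also shows the scalar null check on the same frame.

```python
import numpy as np
import pandas as pd

deo = pd.DataFrame({"a": [1, 2, 3, np.nan, 5],
                    "bb": [0, 0, 0, 0, 0],
                    "cc": [22, 22, 22, 22, 22]})

# Scalar null checks: comparing with == np.nan is always False.
cell = deo.iloc[3, 0]
is_null = pd.isnull(cell)
equals_nan = (cell == np.nan)

# Hypothetical rule: bb = a + cc when a is present, otherwise bb = cc.
deo["bb"] = np.where(deo["a"].notnull(), deo["a"] + deo["cc"], deo["cc"])
print(deo)
```

`np.where` evaluates the condition for the whole column at once, which avoids a row-by-row `apply`.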
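Assigning a when a df column is above a threshold and b when it is at or below it
# 天数 (days) >= 15: label is 1
# 天数 (days) < 15: label is 0
f = lambda s: 1 if s["天数"]>=15 else 0
## To set only the values that reach the threshold and keep the original value below it:
f = lambda s: 1 if s["天数"]>=15 else s["天数"]
result["label"] = result.apply(f, axis=1)
result["label"].value_counts()...
Original · 2020-05-14 10:41:12 · 4775 views · 0 comments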
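The same two rules can also be written without a row-wise `apply`. This is a vectorized sketch on made-up values; the column name `capped` is an assumption added for illustration.

```python
import pandas as pd

# Hypothetical sample of the 天数 (days) column.
result = pd.DataFrame({"天数": [3, 15, 20, 14]})

# Rule 1: label is 1 when 天数 >= 15, otherwise 0.
result["label"] = (result["天数"] >= 15).astype(int)

# Rule 2: replace values >= 15 with 1 and keep smaller values as they are.
# Series.where keeps entries where the condition is True and fills the rest.
result["capped"] = result["天数"].where(result["天数"] < 15, 1)

print(result)
print(result["label"].value_counts())
```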
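Keeping two decimal places for every value in a pandas df
Rounding the whole table to two decimal places:
import numpy as np
import pandas as pd
fmt = "{0:.02f}".format
# or
# fmt = lambda x: '%.2f' % x
x = x.applymap(fmt)
x.head()
Note: *** after this step, every column has dtype object (str) ***
Original · 2020-05-11 16:52:24 · 12050 views · 0 comments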
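If the numbers should stay numeric, `DataFrame.round(2)` is an alternative to string formatting. The sketch below compares both on made-up data. `applymap` has been deprecated since pandas 2.1 in favor of `DataFrame.map`, so the string version here goes through `apply` plus `Series.map`, which works on older and newer versions alike.

```python
import pandas as pd

# Hypothetical numeric table.
x = pd.DataFrame({"a": [1.23456, 2.0], "b": [3.14159, 10.5]})

# String formatting: every cell becomes text such as "1.23".
as_text = x.apply(lambda col: col.map("{:.2f}".format))

# Rounding: values stay floats, so later arithmetic still works.
rounded = x.round(2)

print(as_text)
print(rounded.dtypes)
```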
Creating a 2D array in Python (a small pitfall with lists)
Wrong:
lists = [[]] * 3
lists[0].append(3)
[[3], [3], [3]]
Right:
lists = [[] for i in range(3)]
lists[0].append(3)
lists[1].append(5)
lists[2].append(7)
[[3], [5], [7]]
https://www.cnblogs.com/PyLe...
Reposted · 2020-05-08 15:20:53 · 220 views · 0 comments
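Selecting data without certain columns in pandas
1. Excluding a single feature
# Take every column of p_feature whose name is not "index"
p_feature = p_feature.iloc[:, p_feature.columns != "index"]
2. Excluding several features
Extension: selecting the rows of a df where a column equals, or does not equal, certain values: isin()
Reference 1: https://www.cnblogs.com/nxf-rabbit75/p/10105271.h...
Original · 2020-05-08 09:51:27 · 8683 views · 0 comments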
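The preview does not show the code for the multi-column case, so this is a sketch of common ways to do it. The frame and its column names are made up.

```python
import pandas as pd

# Hypothetical feature table.
p_feature = pd.DataFrame({"index": [0, 1], "f1": [1, 2], "f2": [3, 4], "f3": [5, 6]})

# 1. Exclude one column with a boolean mask over the column names.
one = p_feature.loc[:, p_feature.columns != "index"]

# 2. Exclude several columns: negate an isin() mask, or simply drop them.
several = p_feature.loc[:, ~p_feature.columns.isin(["index", "f2"])]
dropped = p_feature.drop(columns=["index", "f2"])

# Extension: keep the rows whose f1 value is NOT in a given list.
rows = p_feature[~p_feature["f1"].isin([2])]

print(one.columns.tolist(), several.columns.tolist(), rows)
```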
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Check: https://blog.youkuaiyun.com/yi976263092/article/details/87878112
Original · 2020-05-07 18:57:15 · 903 views · 0 comments
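The linked post is not reproduced here. As a generic sketch, these are checks one might run on a feature table before fitting a model that rejects NaN or infinite values; the frame `X` is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with one NaN and one infinite value.
X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1.0, np.inf, 2.0]})

has_nan = bool(X.isnull().to_numpy().any())       # any missing value?
has_inf = bool(np.isinf(X.to_numpy()).any())      # any +inf or -inf?
print("NaN present:", has_nan, "| inf present:", has_inf)

# One possible cleanup: turn infinities into NaN, then drop incomplete rows.
clean = X.replace([np.inf, -np.inf], np.nan).dropna()
print(clean)
```

Whether to drop, clip or impute such rows depends on the data; dropping is only the simplest option.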
Equal-frequency binning with pandas
# Equal-frequency binning
def frequencybox(demo, name, new_name, n):
    demo["tmp"] = pd.qcut(demo[name], n)
    group_by_age_bin = demo.groupby(["tmp"], as_index=True)
    df_min_max_bin = pd.DataFrame()  # Used to record each bin's max...
Original · 2020-05-06 23:21:27 · 5683 views · 2 comments
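The function is cut off in the preview. A compact sketch of the same idea: `pd.qcut` splits the column into `n` bins of roughly equal size, and a group-by then summarizes each bin. The column name `age` and the data are assumptions.

```python
import pandas as pd

# Hypothetical column: the integers 1..12.
demo = pd.DataFrame({"age": list(range(1, 13))})

# Four quantile-based bins, so each holds about a quarter of the rows.
demo["bin"] = pd.qcut(demo["age"], 4)

# Per-bin minimum, maximum and row count.
summary = demo.groupby("bin", observed=True)["age"].agg(["min", "max", "count"])
print(summary)
```

With many repeated values, `qcut` can fail on duplicate bin edges; passing `duplicates="drop"` merges those bins.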
Building a table of column names and their data types with pandas dtypes, plus extensions
# Using the Titanic data
data.head()
   PassengerId  Survived  Pclass  Name                     Sex   Age   SibSp  Parch  Ticket     Fare    Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris  male  22.0  1      0      A/5 21171  7.2500  NaN    S
1  ...
Original · 2020-04-03 16:56:57 · 573 views · 0 comments
Pandas usage summary
1. Changing a column's type in pandas
Use astype as follows:
df[[column]] = df[[column]].astype(type)
2. ...
Original · 2019-10-24 18:45:05 · 175 views · 0 comments
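NaN in Python, pandas and numpy
NaN values are very common when processing data, but do you really understand them? Let 楓尘君 walk you through this curious NaN.
In a csv table it looks like this:
In Python it is displayed like this:
We create this value like this:
import numpy as np
np.nan
But checking whether a value is np.nan is trickier than it looks:
a = np.nan
print(a == np.nan)
The result is False
print(a.equal(np.na...
Original · 2019-10-22 20:15:31 · 4571 views · 0 comments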
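The preview stops before the working checks. A short sketch of tests that do detect NaN (these are standard functions, though not necessarily the ones the original post goes on to use):

```python
import math

import numpy as np
import pandas as pd

a = np.nan

# NaN is never equal to anything, including itself.
equal_to_itself = (a == np.nan)

checks = {
    "a != a": a != a,                    # only NaN is unequal to itself
    "math.isnan(a)": math.isnan(a),
    "np.isnan(a)": bool(np.isnan(a)),
    "pd.isnull(a)": bool(pd.isnull(a)),  # also works for None and NaT
}
print(equal_to_itself, checks)
```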
Assigning a value to one specific position in a DataFrame row with loc, iloc and ix
To get values out of a DataFrame, pandas offers:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
df = DataFrame({'CST_NO': [11, 22, 11, 22, 11, 22], 'AMOUNT': [10, 20, 30, np.nan...
Original · 2019-10-22 17:38:37 · 35945 views · 4 comments
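A small sketch of single-cell assignment. `.ix` was removed in pandas 1.0, so only `loc`, `iloc` and `at` are shown. The frame is a shortened, made-up variant of the one in the preview.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row version of the example frame.
df = pd.DataFrame({"CST_NO": [11, 22, 11], "AMOUNT": [10.0, 20.0, np.nan]})

# Label-based: row label 2, column "AMOUNT".
df.loc[2, "AMOUNT"] = 30.0

# Position-based: row 0, and the integer position of the "AMOUNT" column.
df.iloc[0, df.columns.get_loc("AMOUNT")] = 15.0

# at is a fast label-based accessor for exactly one cell.
df.at[1, "AMOUNT"] = 25.0

print(df)
```

Chained indexing such as `df["AMOUNT"][2] = 30.0` may write to a temporary copy, which is why a single `loc`, `iloc` or `at` call is preferred.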
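Sorting two Python lists together by position
Suppose there are two lists:
a = [2,3,1,5]
b = ["b", "c", "a", "e"]
We want a sorted in ascending order, giving sorta = [1,2,3,5],
and b reordered to match, giving sortb = ["a","b", "c", "e"]
Solution:
list1, list2 = (list(t) for t in zip(*sorted(zip(list1, list2))))
Or:
from opera...
Original · 2019-10-22 15:42:39 · 5332 views · 1 comment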
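Here is the zip-based solution applied to the `a` and `b` lists from the post, followed by an index-based variant. The second approach is an assumption added for comparison, not the truncated alternative from the original.

```python
a = [2, 3, 1, 5]
b = ["b", "c", "a", "e"]

# Approach 1: zip into pairs, sort the pairs, then unzip again.
sorta, sortb = (list(t) for t in zip(*sorted(zip(a, b))))

# Approach 2: compute the sorted order of a once and reuse it for both lists.
# This never compares elements of b, which matters if they are not orderable.
order = sorted(range(len(a)), key=a.__getitem__)
sorta2 = [a[i] for i in order]
sortb2 = [b[i] for i in order]

print(sorta, sortb)
```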
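TypeError: '>' not supported between instances of 'float' and 'str' when filtering a pandas DataFrame
I ran into a problem today:
pos_data = session_data.loc[(session_data['CST_NO'].isin(pos_list))&(session_data['BIZ_CODE']=='TRANSFER')]
It raised an error:
After some research, I found that the two sides of isin had different types; converting both to str avoids the problem.
Changed to (adding astype(str)):
pos_data = ses...
Original · 2019-10-12 23:12:58 · 14792 views · 0 comments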
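The corrected line is cut off in the preview. A sketch of what converting both sides to str can look like; the sample data is made up and this may differ from the original fix.

```python
import pandas as pd

# Hypothetical data: customer numbers stored as integers in the frame,
# but supplied as strings in the lookup list.
session_data = pd.DataFrame({
    "CST_NO": [11, 22, 33],
    "BIZ_CODE": ["TRANSFER", "TRANSFER", "PAY"],
})
pos_list = ["11", "22", "33"]

# Convert both sides to str so isin compares like with like.
cst_as_str = session_data["CST_NO"].astype(str)
wanted = [str(v) for v in pos_list]

mask = cst_as_str.isin(wanted) & (session_data["BIZ_CODE"] == "TRANSFER")
pos_data = session_data.loc[mask]
print(pos_data)
```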