Translation: The MICE Algorithm


Original: miceforest: Fast Imputation with Random Forests in Python

Multiple Imputation by Chained Equations (MICE) "fills in" (imputes) missing data in a dataset through a series of iterative predictive models. In each iteration, each specified variable in the dataset is imputed using the other variables, and the iterations continue until convergence is reached.

1. The MICE Algorithm

The MICE algorithm is shown in the figure below. The process continues until every specified variable has been imputed. If the means of the imputed values have not converged, additional iterations are run, although more than 5 iterations are rarely needed.
[Figure: the MICE iteration process]
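As a rough illustration of the chained loop described above (this is a sketch, not the miceforest implementation, and the function names are mine), here is the cycle for two numeric columns, using ordinary least squares as the per-variable model:

```python
def simple_linreg(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, slope

def mice_two_columns(a, b, iterations=5):
    """Impute None entries in columns a and b by chained regressions."""
    a, b = list(a), list(b)
    miss_a = [i for i, v in enumerate(a) if v is None]
    miss_b = [i for i, v in enumerate(b) if v is None]
    # Step 0: initialize every missing entry with its column's observed mean.
    for col, miss in ((a, miss_a), (b, miss_b)):
        obs = [v for v in col if v is not None]
        fill = sum(obs) / len(obs)
        for i in miss:
            col[i] = fill
    for _ in range(iterations):
        # Re-impute a from b, then b from a, using the current filled values.
        c0, c1 = simple_linreg(b, a)
        for i in miss_a:
            a[i] = c0 + c1 * b[i]
        c0, c1 = simple_linreg(a, b)
        for i in miss_b:
            b[i] = c0 + c1 * a[i]
    return a, b
```

On a toy pair of columns related by b ≈ 2a, the imputed entries drift toward that relationship over the iterations, which is the convergence behavior the figure depicts.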
Scenarios where MICE is useful:
Data Leakage: MICE is especially useful when missing values are directly tied to the target variable and therefore leak information. For example, suppose you are modeling customer retention, and a particular variable collects a customer's login information at sign-up or one month after sign-up. A missing value in that variable leaks data, because it tells you "this customer did not stay longer than one month."

Definition of data leakage:
The presence and exploitation of features that invert "cause" and "effect" is what data-science competitions call data leakage. This is entirely different from the "data leak" of data security. Data leakage here does not mean data was exposed; it means the causal relationship is flawed: a mistake made during data preparation lets the model predict along a flawed, or even inverted, causal chain and still obtain excellent results.
For example, when Chris was working on telecom customer churn, he effortlessly pushed the AUC above 0.99 with the original dataset, which was alarming. On inspecting the model and data, he found that one feature with extremely high weight was "payment records within 3 months." For many churned customers this feature was 0. Checking further with the accountants, it turned out that in the bookkeeping this feature recorded payments during the three months after the customer had already churned, which is of course 0. This is a classic case of inverted causality.

Funnel Analysis: information is often collected at different stages of a "funnel," and MICE can be used to make educated guesses about the characteristics of entities at different stages of the funnel.

Confidence Intervals: MICE can be used to impute (estimate) missing values, but keep in mind that these imputed values are predictions. Creating multiple datasets with different imputed values allows two kinds of inference:

  • Distribution of imputed values: a profile can be built for each imputed value, giving information about that value's likely distribution.
  • Distribution of model predictions: with multiple datasets you can build multiple models and create a distribution of predictions for each sample. For samples whose imputed values were not made with high confidence, the predictions will have large variance and may well be heavily biased.
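The multiple-dataset idea can be sketched for a single column as follows (the function name is illustrative, not miceforest's API; a real kernel would impute with predictive models rather than random draws):

```python
import random

def multiple_imputations(column, m=5, seed=0):
    """Illustrative multiple imputation for one column: return m completed
    copies, each filling the missing (None) entries with a random draw from
    the observed values. Fitting a model to each copy yields a distribution
    of predictions instead of a single point estimate."""
    rng = random.Random(seed)
    observed = [v for v in column if v is not None]
    return [[v if v is not None else rng.choice(observed) for v in column]
            for _ in range(m)]
```

The spread of a given missing entry's value across the m copies is a direct measure of imputation uncertainty, and the spread of downstream model predictions across the m fitted models reflects how that uncertainty propagates.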

2. Predictive Mean Matching

miceforest can use a procedure called predictive mean matching (PMM) to select the values to impute. PMM picks a data point from the original, fully observed data whose predicted value is close to the predicted value of the missing sample. The N closest values (the mean_match_candidates parameter) are chosen as candidates, and one of them is drawn at random. This process can be specified per column.
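The candidate-selection step just described can be sketched like this (the function `pmm_select` is my own illustration, not miceforest's internal code; only the `mean_match_candidates` name comes from the library):

```python
import random

def pmm_select(pred_missing, pred_observed, observed_values,
               mean_match_candidates=5, seed=0):
    """For each missing sample's prediction, find the mean_match_candidates
    observed rows whose predictions are closest, then draw one of their
    actual observed values at random as the imputation."""
    rng = random.Random(seed)
    imputed = []
    for p in pred_missing:
        # Rank observed rows by how close their predictions are to p.
        ranked = sorted(range(len(pred_observed)),
                        key=lambda i: abs(pred_observed[i] - p))
        candidates = ranked[:mean_match_candidates]
        imputed.append(observed_values[rng.choice(candidates)])
    return imputed
```

Because every imputed value is an actual observed value, the imputations stay inside the support of the real data, which is exactly why PMM preserves distributional features that a raw model prediction would smooth away.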

How this works in practice is shown in the figure below:
[Figure: predictive mean matching in practice]
This approach is very useful when the variable to be imputed has any of the following characteristics:

  • Multimodal
  • Integer-valued
  • Skewed
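The integer-valued case makes the benefit concrete. A quick comparison (both helper functions are illustrative; the mean here stands in for a model prediction): mean imputation produces a value that never occurred in the data, while matching to nearby observed values keeps the column integer-valued.

```python
import random

def mean_impute(column):
    """Fill missing (None) entries with the observed mean."""
    obs = [v for v in column if v is not None]
    fill = sum(obs) / len(obs)
    return [fill if v is None else v for v in column]

def nearest_observed_impute(column, seed=0):
    """Fill missing entries with a random draw from the 3 observed values
    closest to the prediction (here, the mean) -- the PMM idea."""
    rng = random.Random(seed)
    obs = [v for v in column if v is not None]
    fill = sum(obs) / len(obs)  # stand-in for a model prediction
    nearest = sorted(obs, key=lambda v: abs(v - fill))[:3]
    return [rng.choice(nearest) if v is None else v for v in column]
```

For a bimodal integer column like [1, 2, None, 9, 10], mean imputation fills 5.5, a non-integer lying between the two modes; the matching approach fills an actual observed integer.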