蓄水池抽样

本文通过逐步解析从数据流中随机选取样本的过程,介绍了储水池抽样算法的基本原理。该算法能够确保每个元素被选中的概率相等,适用于处理大量数据集的情况。

http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/


A Primer on Reservoir Sampling

For this problem, the simplest concrete example would be a stream that only contained a single item. In this case, our algorithm should return this single element with probability 1. Now let’s try a slightly harder problem, a stream with exactly two elements. We know that we have to hold on to the first element we see from this stream, because we don’t know if we’re in the case that the stream only has one element. When the second element comes along, we know that we want to return one of the two elements, each with probability 1/2. So let’s generate a random number R between 0 and 1, and return the first element if R is less than 0.5 and return the second element if R is greater than 0.5.

Now let’s try to generalize this approach to a stream with three elements. After we’ve seen the second element in the stream, we’re now holding on to either the first element or the second element, each with probability 1/2. When the third element arrives, what should we do? Well, if we know that there are only three elements in the stream, we need to return this third element with probability 1/3, which means that we’ll return the other element we’re holding with probability 1 – 1/3 = 2/3. That means that the probability of returning each element in the stream is as follows:

  1. First Element: (1/2) * (2/3) = 1/3
  2. Second Element: (1/2) * (2/3) = 1/3
  3. Third Element: 1/3

By considering the stream of three elements, we see how to generalize this algorithm to any N: at every step N, keep the next element in the stream with probability 1/N. This means that we have an (N-1)/N probability of keeping the element we are currently holding on to, which means that we keep it with probability (1/(N-1)) * (N-1)/N = 1/N.

This general technique is called reservoir sampling, and it is useful in a number of applications that require us to analyze very large data sets. You can find an excellent overview of a set of algorithms for performing reservoir sampling in this blog post by Greg Grothaus. I’d like to focus on two of those algorithms in particular, and talk about how they are used in Cloudera ML, our open-source collection of data preparation and machine learning algorithms for Hadoop.


"Mstar Bin Tool"是一款专门针对Mstar系列芯片开发的固件处理软件,主要用于智能电视及相关电子设备的系统维护与深度定制。该工具包特别标注了"LETV USB SCRIPT"模块,表明其对乐视品牌设备具有兼容性,能够通过USB通信协议执行固件读写操作。作为一款专业的固件编辑器,它允许技术人员对Mstar芯片的底层二进制文件进行解析、修改与重构,从而实现系统功能的调整、性能优化或故障修复。 工具包中的核心组件包括固件编译环境、设备通信脚本、操作界面及技术文档等。其中"letv_usb_script"是一套针对乐视设备的自动化操作程序,可指导用户完成固件烧录全过程。而"mstar_bin"模块则专门处理芯片的二进制数据文件,支持固件版本的升级、降级或个性化定制。工具采用7-Zip压缩格式封装,用户需先使用解压软件提取文件内容。 操作前需确认目标设备采用Mstar芯片架构并具备完好的USB接口。建议预先备份设备原始固件作为恢复保障。通过编辑器修改固件参数时,可调整系统配置、增删功能模块或修复已知缺陷。执行刷机操作时需严格遵循脚本指示的步骤顺序,保持设备供电稳定,避免中断导致硬件损坏。该工具适用于具备嵌入式系统知识的开发人员或高级用户,在进行设备定制化开发、系统调试或维护修复时使用。 资源来源于网络分享,仅用于学习交流使用,请勿用于商业,如有侵权请联系我删除!
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值