Keep track of the median?

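This article describes how to use a max-heap and a min-heap to compute, in real time, the median of a continuously changing stream of numbers. It explains the algorithm in detail, including initializing the heaps, handling each new number, and rebalancing the heaps, and it gives worked examples.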

From: CareerCup 150.

1. Question: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?

Answer: 

Heap? A heap is really good at basic ordering and at keeping track of maximums and minimums. This is actually interesting: if you had two heaps, you could keep track of the biggest half and the smallest half of the elements. The biggest half is kept in a min heap, such that the smallest element in the biggest half is at the root. The smallest half is kept in a max heap, such that the biggest element of the smallest half is at the root. Now, with these data structures, you have the potential median elements at the roots: the median is the root of the larger heap, or the average of the two roots when the heaps are the same size. If the heap sizes ever differ by more than one, you can quickly "rebalance" the heaps by popping an element off the larger heap and pushing it onto the other.
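The two-heap idea above can be sketched in Python as follows. This is a minimal illustration, not the book's code: the class name `RunningMedian` and its method names are made up for this example, and the max heap is simulated by storing negated values in `heapq`, which only provides a min heap.

```python
import heapq


class RunningMedian:
    """Two-heap median tracker (hypothetical names, for illustration only)."""

    def __init__(self):
        self.low = []   # max heap of the smallest half, stored as negated values
        self.high = []  # min heap of the biggest half

    def add(self, x):
        # Route x to the half it belongs to.
        if self.low and x > -self.low[0]:
            heapq.heappush(self.high, x)
        else:
            heapq.heappush(self.low, -x)
        # Rebalance so that len(low) equals len(high) or len(high) + 1.
        if len(self.low) > len(self.high) + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if not self.low:
            raise ValueError("no numbers seen yet")
        if len(self.low) > len(self.high):
            return -self.low[0]                     # odd count: root of the larger heap
        return (-self.low[0] + self.high[0]) / 2    # even count: average of both roots


tracker = RunningMedian()
for value in [9, 2, 8, 4, 1]:
    tracker.add(value)
print(tracker.median())
```

Each `add` costs O(log n) and each `median` query is O(1), since only the two roots are inspected.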

2. See also: http://yaronspace.cn/blog/archives/1306 http://www.cppblog.com/820986942/archive/2011/05/23/146991.html

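Problem:

The input is a continuous stream of numbers; display, in real time, the median of the numbers received so far.

Solution:

There are many ways to find a median. For large data sets the classic approach is bucket counting, but it does not suit this problem because the data keeps changing.

A max-heap and a min-heap can solve this problem: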

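1. Let m be the current median. The max-heap holds the numbers <= m and the min-heap holds the numbers >= m, but neither heap contains m itself.

2. When a new number a arrives, compare it with m: if a <= m, push it onto the max-heap; otherwise push it onto the min-heap.

3. If the sizes of the min-heap and the max-heap now differ by 2 or more, push m onto the smaller heap, then pop the root of the larger heap and assign it to m. Finally, restore the heap property of both heaps and return to step 2.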
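Steps 1 to 3 can be sketched as below. This is only an illustration under a few assumptions that the text does not spell out: the first number received becomes the initial m, and when the total count is even the reported median is the average of m and the root of the larger heap. The class name `PivotMedian` is hypothetical. With `heapq`, every push and pop already restores the heap property, so no separate rebuild step is needed.

```python
import heapq


class PivotMedian:
    """Keeps the median m outside both heaps (hypothetical name, illustrative)."""

    def __init__(self):
        self.m = None
        self.low = []   # max-heap (negated values) of numbers <= m
        self.high = []  # min-heap of numbers >= m

    def add(self, a):
        if self.m is None:
            self.m = a  # assumption: the first number is the initial median
            return
        # Step 2: route a by comparing it with m.
        if a <= self.m:
            heapq.heappush(self.low, -a)
        else:
            heapq.heappush(self.high, a)
        # Step 3: if the sizes differ by 2 or more, move m to the smaller
        # heap and take the root of the larger heap as the new m.
        if len(self.low) - len(self.high) >= 2:
            heapq.heappush(self.high, self.m)
            self.m = -heapq.heappop(self.low)
        elif len(self.high) - len(self.low) >= 2:
            heapq.heappush(self.low, -self.m)
            self.m = heapq.heappop(self.high)

    def median(self):
        if self.m is None:
            raise ValueError("no numbers seen yet")
        # Assumption: for an even count, average m with its middle neighbour.
        if len(self.low) > len(self.high):
            return (self.m - self.low[0]) / 2   # -low[0] is the max of the low half
        if len(self.high) > len(self.low):
            return (self.m + self.high[0]) / 2
        return self.m


stream = PivotMedian()
for number in [5, 2, 10, 4]:
    stream.add(number)
print(stream.median())
```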


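Going further: what if data can be removed from the array as well as added? Which data structure should be used then? The heaps above are no longer suitable, because finding an arbitrary element in a heap takes O(n). Consider, for example, the following problem: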

https://www.interviewstreet.com/challenges/dashboard/#problem/4fcf919f11817

The median of M numbers is defined as the middle number after sorting them in order, if M is odd or the average number of the middle 2 numbers (again after sorting) if M is even. You have an empty number list at first. Then you can add or remove some number from the list. For each add or remove operation, output the median of numbers in the list.
 
Example : For a set of m = 5 numbers, { 9, 2, 8, 4, 1 } the median is the third number in sorted set { 1, 2, 4, 8, 9 } which is 4. Similarly for set of m = 4, { 5, 2, 10, 4 }, the median is the average of second and the third element in the sorted set { 2, 4, 5, 10 } which is (4+5)/2 = 4.5  
 
Input:
 
The first line is an integer n that indicates the number of operations. Each of the next n lines is either "a x" or "r x", which indicates whether the operation is add or remove.
 
Output:
 
For each operation: If the operation is add, output the median after adding x in a single line. If the operation is remove and the number x is not in the list, output "Wrong!" in a single line. If the operation is remove and the number x is in the list, output the median after deleting x in a single line. (If the result is an integer, DO NOT output a decimal point. If the result is a double, DO NOT output trailing 0s.)
 
Constraints:
 
0 < n <= 100,000
 
For each "a x" or "r x", x will fit in a 32-bit integer.
 
Sample Input:
 
7
r 1
a 1
a 2
a 1
r 1
r 2
r 1
 
Sample Output:
Wrong!
1
1.5
1
1.5
1
Wrong!
 
Note: As the last line of the sample input shows, if the list becomes empty after a remove operation, you have to print "Wrong!" (quotes are for clarity).
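The text does not name a data structure for this add/remove variant, so the following is only one possible sketch, not the intended answer. It keeps the numbers in a sorted Python list and uses `bisect` to locate positions. The search is O(log n), but inserting into or deleting from a list still shifts elements, so each operation is O(n) in the worst case; a balanced tree or an order-statistic structure would avoid that. The function names `format_median` and `process` are made up for this example.

```python
import bisect


def format_median(values):
    """Median of a non-empty sorted list of ints, formatted per the problem."""
    n = len(values)
    if n % 2 == 1:
        return str(values[n // 2])
    total = values[n // 2 - 1] + values[n // 2]
    if total % 2 == 0:
        return str(total // 2)           # integer result: no decimal point
    sign = "-" if total < 0 else ""
    return f"{sign}{abs(total) // 2}.5"  # half-integer result: exactly one digit


def process(operations):
    """operations is a list of (op, x) pairs with op in {"a", "r"}."""
    values = []   # always kept sorted
    output = []
    for op, x in operations:
        if op == "a":
            bisect.insort(values, x)
        else:
            i = bisect.bisect_left(values, x)
            if i == len(values) or values[i] != x:
                output.append("Wrong!")  # x is not in the list
                continue
            values.pop(i)
            if not values:
                output.append("Wrong!")  # list became empty after the removal
                continue
        output.append(format_median(values))
    return output


sample = [("r", 1), ("a", 1), ("a", 2), ("a", 1), ("r", 1), ("r", 2), ("r", 1)]
print("\n".join(process(sample)))
```

Working with the integer sum of the two middle values, rather than a floating-point average, keeps the output exact and makes it easy to satisfy the "no decimal point, no trailing 0s" rule.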

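### Median algorithms and their nested applications

#### How to compute the median

For a set of numeric data, the median is the value in the middle position after the values are sorted by size. If the number of values is odd, the median sits exactly at the center of the sequence; if it is even, the median is the average of the two middle values.

In practice, the NumPy library simplifies this: it provides efficient array handling and a built-in `median()` function that returns the median of a given array directly[^1]:

```python
import numpy as np

data = [7, 5, 3, 9, 8, 6, 4]

# Compute the median with NumPy
median_value = np.median(data)

print(f"The median of the data is {median_value}")
```

#### The role of the median in algorithms

As a statistic, the median is widely used in data analysis, machine learning, and related fields. Compared with the mean, the median is more robust to outliers, so it is often used to estimate central tendency or to build robust models. For example, when removing noise in image processing, one can filter by taking the median of the pixel gray levels within a local window.

Note that although the focus here is the simple median of a one-dimensional data set, some situations involve higher-dimensional data structures, or more complex business logic that requires the medians of several subsets. In such cases you need to combine loops, conditionals, and even recursion to get the job done[^2].

#### Example: medians in a more complex scenario

Suppose nested lists represent the sample sets of different categories. The following example shows how to traverse such a composite container and compute the median of each group:

```python
from statistics import median

nested_lists = [
    [10, 20, 30],
    [40, 50, 60, 70],
    [[80], [90]],
]

def get_medians(nested_list):
    medians = []
    for sublist in nested_list:
        if all(isinstance(x, (int, float)) for x in sublist):
            # A flat list of numbers: take its median directly
            medians.append(median(sublist))
        else:
            # A deeper level of nesting: recurse, then take the
            # median of the sub-medians as this group's value
            medians.append(median(get_medians(sublist)))
    return medians

result = get_medians(nested_lists)
print(result)
```

This code uses `median()` from the `statistics` module. A flat list of numbers is passed to `median()` directly; anything else is treated as a deeper level of nesting, processed recursively, and summarized by the median of its sub-medians. The result is one median per top-level group; for the data above it prints `[20, 55.0, 85.0]`.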