org.apache.spark.util.collection.SizeTracker#takeSample
In both the shuffle read and write paths, Spark periodically samples its in-memory collections to estimate how much heap they occupy. The sampling logic is:
/**
 * Take a new sample of the current collection's size.
 */
private def takeSample(): Unit = {
  samples.enqueue(Sample(SizeEstimator.estimate(this), numUpdates))
  // Only use the last two samples to extrapolate
  if (samples.size > 2) {
    samples.dequeue()
  }
  val bytesDelta = samples.toList.reverse match {
    case latest :: previous :: tail =>
      (latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates)
    // If fewer than 2 samples, assume no change
    case _ => 0
  }
  bytesPerUpdate = math.max(0, bytesDelta)
  nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong
}
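To see why this is cheap, note that real measurements happen only at exponentially spaced update counts, and the size reported between samples is a linear extrapolation from the last sample. Below is a minimal, self-contained sketch of the same technique; the class and parameter names (MiniSizeTracker, measure) are illustrative, not Spark's API, though the 1.1 growth rate mirrors Spark's SAMPLE_GROWTH_RATE and estimateSize follows the shape of SizeTracker#estimateSize:

import scala.collection.mutable

// Illustrative sketch of Spark's sampling idea; `measure` stands in for
// SizeEstimator.estimate(this), which is the expensive call being amortized.
class MiniSizeTracker(measure: () => Long) {
  private val SAMPLE_GROWTH_RATE = 1.1   // sampling interval grows geometrically
  private case class Sample(size: Long, numUpdates: Long)
  private val samples = mutable.Queue[Sample]()
  private var bytesPerUpdate = 0.0
  private var numUpdates = 0L
  private var nextSampleNum = 1L

  // Call on every insert/update; only occasionally pays for a real measurement.
  def afterUpdate(): Unit = {
    numUpdates += 1
    if (numUpdates == nextSampleNum) takeSample()
  }

  private def takeSample(): Unit = {
    samples.enqueue(Sample(measure(), numUpdates))
    if (samples.size > 2) samples.dequeue()   // keep only the last two samples
    bytesPerUpdate = samples.toList.reverse match {
      case latest :: previous :: _ =>
        math.max(0.0, (latest.size - previous.size).toDouble /
          (latest.numUpdates - previous.numUpdates))
      case _ => 0.0
    }
    nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong
  }

  // Extrapolate from the last real sample using the per-update growth rate.
  def estimateSize(): Long = samples.lastOption match {
    case Some(s) => (s.size + bytesPerUpdate * (numUpdates - s.numUpdates)).toLong
    case None    => 0L
  }
}

With a growth rate of 1.1, a collection that receives a million updates triggers only on the order of log₁.₁(10⁶) ≈ 145 real measurements, while estimateSize stays O(1) per call.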
The measurement itself is done by
org.apache.spark.util.SizeEstimator#estimate
@DeveloperApi
object SizeEstimator extends Logging {

  /**
   * Estimate the number of bytes that the given object takes up on the JVM heap. The estimate
   * includes space taken up by objects referenced by the given object, their references, and so
   * on and so forth.
   *
   * This is useful for determining the amount of heap space a broadcast variable will occupy on
   * each executor or the amount of space each object will take when caching objects in
   * deserialized form. This is not the same as the serialized size of the object, which will
   * typically be much smaller.
   */
  def estimate(obj: AnyRef): Long = estimate(obj, new IdentityHashMap[AnyRef, AnyRef])
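Since SizeEstimator is a @DeveloperApi object, it can also be called directly. A minimal usage sketch; the sizes in the comments are rough expectations, since actual values depend on the JVM, pointer compression, and Spark version:

import org.apache.spark.util.SizeEstimator

// Deep, on-heap size: the walk follows references, so nested objects are
// counted too. This is not the serialized size, which is typically smaller.
val raw = new Array[Long](1024)
println(SizeEstimator.estimate(raw))   // roughly 8 KB of data plus array header

val map = scala.collection.mutable.HashMap[Int, String]()
(1 to 1000).foreach(i => map(i) = "value-" + i)
println(SizeEstimator.estimate(map))   // far larger than the raw payload: entries, boxing, string objects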