Background
This article is based on Delta 2.0.0.
Delta implements CDC (change data capture) by means of CDF (change data feed).
CDF is a table's ability to emit the changes made to its data; CDC is the ability to capture and identify those data changes and hand the changed rows to downstream consumers for further processing.
Let's analyze how row-level CDF is achieved.
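Before looking at the internals, note that CDF has to be switched on per table. A minimal sketch of enabling it from Spark (the table name mydb.events and the session default property are illustrative, but both properties are documented Delta settings):

```scala
// Enable CDF on an existing Delta table (table name is hypothetical).
spark.sql(
  "ALTER TABLE mydb.events SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

// Alternatively, enable CDF by default for tables created in this session.
spark.sql(
  "SET spark.databricks.delta.properties.defaults.enableChangeDataFeed = true")
```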
Analysis
With delta.enableChangeDataFeed = true set (mentioned in "Enable change data feed"), let's analyze the run method of DeleteCommand, the RunnableCommand corresponding to the logical plan DeltaDelete:
final override def run(sparkSession: SparkSession): Seq[Row] = {
  recordDeltaOperation(deltaLog, "delta.dml.delete") {
    deltaLog.assertRemovable()
    deltaLog.withNewTransaction { txn =>
      val deleteActions = performDelete(sparkSession, deltaLog, txn)
      if (deleteActions.nonEmpty) {
        txn.commit(deleteActions, DeltaOperations.Delete(condition.map(_.sql).toSeq))
      }
    }
    // Re-cache all cached plans (including this relation itself, if it's cached) that refer to
    // this data source relation.
    sparkSession.sharedState.cacheManager.recacheByPlan(sparkSession, target)
  }
  Seq.empty[Row]
}
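Once such a delete is committed on a CDF-enabled table, the row-level changes become queryable through the change data feed reader. A sketch of the consumer side, assuming a table stored at the hypothetical path /tmp/delta/events (the readChangeFeed option and the _change_type / _commit_version / _commit_timestamp columns are part of the documented CDF API):

```scala
// Read the change feed between two table versions; each returned row
// carries the extra columns _change_type, _commit_version and
// _commit_timestamp in addition to the table's own columns.
val changes = spark.read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 0)
  .option("endingVersion", 10)
  .load("/tmp/delta/events")

// Rows removed by a DELETE appear with _change_type = "delete".
changes.filter($"_change_type" === "delete").show()
```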
The most important method here is performDelete, which behaves differently depending on condition:
- If there is no condition, the whole table is deleted:
case None =>
  // Case 1: Delete the whole table if the condition is true
  val allFiles = txn.filterFiles(Nil)
  numRemovedFiles = allFiles.size
  scanTimeMs = (System.nanoTime() - startTime) / 1000 / 1000
  val (numBytes, numPartitions) = totalBytesAndDistinctPartitionValues(allFiles)
  numBytesRemoved = numBytes
  numFilesBeforeSkipping = numRemovedFiles
  numBytesBeforeSkipping = numBytes
  numFilesAfterSkipping = numRemovedFiles
  numBytesAfterSkipping = numBytes
  if (txn.metadata.partitionColumns.nonEmpty) {
    numPartitionsAfterSkipping = Some(numPartitions)
    num
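Beyond this no-condition case, performDelete's pattern match also handles predicates. The overall dispatch can be sketched as follows (a simplified sketch, not the verbatim Delta source; the helper names follow Delta's code but details are elided):

```scala
// Simplified sketch of the dispatch inside performDelete.
condition match {
  case None =>
    // Case 1: no predicate -- remove every file in the table. This is a
    // metadata-only operation; no data files are rewritten.
    txn.filterFiles(Nil).map(_.removeWithTimestamp())
  case Some(cond) =>
    val (metadataPredicates, otherPredicates) =
      DeltaTableUtils.splitMetadataAndDataPredicates(
        cond, txn.metadata.partitionColumns, sparkSession)
    if (otherPredicates.isEmpty) {
      // Case 2: the predicate touches only partition columns -- drop the
      // whole files selected by partition pruning, again without
      // rewriting any data.
      txn.filterFiles(metadataPredicates).map(_.removeWithTimestamp())
    } else {
      // Case 3: a general predicate -- locate the files that contain
      // matching rows, rewrite them without those rows, and (with CDF
      // enabled) also emit change files recording the deleted rows.
      // ...
    }
}
```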

This article has analyzed in detail how Delta Lake implements CDC (Change Data Capture) through CDF (Change Data Feed), in particular how row-level change capture is performed for Delete operations, and has outlined how it is handled at the metadata level.