Spark execution plan analysis:
https://blog.youkuaiyun.com/zyzzxycj/article/details/82704713
-----------
First, here is the overall flow diagram of how a SQL statement is parsed:
The diagram may be hard to follow on a first read, so let's start with a simple SQL statement:
select * from heguozi.payinfo where pay = 0 limit 10
Between receiving this sqlText and producing the final result, which execution plans does Spark go through? Run:
explain extended select * from heguozi.payinfo where pay = 0 limit 10
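The same output can be produced from the Scala API; a minimal sketch, assuming a live SparkSession named spark with access to the heguozi.payinfo table:

// Dataset.explain(true) is the API equivalent of EXPLAIN EXTENDED in SQL.
val df = spark.sql("select * from heguozi.payinfo where pay = 0 limit 10")
df.explain(true)  // prints the four plans shown below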
You will see four execution plans:
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('pay = 0)
         +- 'UnresolvedRelation `heguozi`.`payinfo`
The Parsed Logical Plan corresponds to the Unresolved LogicalPlan in the diagram. The leading single quotes (') mark unresolved nodes: at this point the parser has only built a syntax tree, and `heguozi`.`payinfo` is still an UnresolvedRelation that has not been checked against the catalog.
== Analyzed Logical Plan ==
pay_id: string, totalpay_id: string, kindpay_id: string, kindpayname: string, fee: double, operator: string, operator_name: string, pay_time: bigint, pay: double, charge: double, is_valid: int, entity_id: string, create_time: bigint, op_time: bigint, last_ver: bigint, opuser_id: string, card_id: string, card_entity_id: string, online_bill_id: string, type: int, code: string, waitingpay_id: string, load_time: int, modify_time: int, ... 8 more fields
GlobalLimit 10
+- LocalLimit 10
   +- Project [pay_id#10079, totalpay_id#10080, kindpay_id#10081, kindpayname#10082, fee#10083, operator#10084, operator_name#10085, pay_time#10086L, pay#10087, charge#10088, is_valid#10089, entity_id#10090, create_time#10091L, op_time#10092L, last_ver#10093L, opuser_id#10094, card_id#10095, card_entity_id#10096, online_bill_id#10097, type#10098, code#10099, waitingpay_id#10100, load_time#10101, modify_time#10102, ... 8 more fields]
      +- Filter (pay#10087 = cast(0 as double))
         +- SubqueryAlias payinfo
            +- Relation[pay_id#10079,totalpay_id#10080,kindpay_id#10081,kindpayname#10082,fee#10083,operator#10084,operator_name#10085,pay_time#10086L,pay#10087,charge#10088,is_valid#10089,entity_id#10090,create_time#10091L,op_time#10092L,last_ver#10093L,opuser_id#10094,card_id#10095,card_entity_id#10096,online_bill_id#10097,type#10098,code#10099,waitingpay_id#10100,load_time#10101,modify_time#10102,... 8 more fields] parquet
The Analyzed Logical Plan corresponds to the Resolved LogicalPlan in the diagram. The analyzer has looked the table up in the catalog (hence the SubqueryAlias over a parquet Relation), bound every column to a typed attribute with a unique expression ID (pay becomes pay#10087 of type double), and inserted cast(0 as double) so the literal matches the column type.
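These stages can also be pulled out programmatically through the Dataset's QueryExecution; a minimal sketch, continuing with the same assumed spark session:

val qe = spark.sql("select * from heguozi.payinfo where pay = 0 limit 10").queryExecution
println(qe.logical)   // parsed plan: still contains 'UnresolvedRelation
println(qe.analyzed)  // analyzed plan: attributes bound, cast(0 as double) inserted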
== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(pay#10087) && (pay#10087 = 0.0))
      +- Relation[pay_id#10079,totalpay_id#10080,kindpay_id#10081,kindpayname#10082,fee#10083,operator#10084,operator_name#10085,pay_time#10086L,pay#10087,charge#10088,is_valid#10089,entity_id#10090,create_time#10091L,op_time#10092L,last_ver#10093L,opuser_id#10094,card_id#10095,card_entity_id#10096,online_bill_id#10097,type#10098,code#10099,waitingpay_id#10100,load_time#10101,modify_time#10102,... 8 more fields] parquet
The Optimized Logical Plan corresponds to the Optimized LogicalPlan in the diagram. Catalyst has collapsed the redundant Project (select * keeps every column anyway), folded cast(0 as double) into the literal 0.0, and inferred an isnotnull(pay#10087) check from the equality predicate.
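Continuing the sketch above, the optimizer's output is one more field on QueryExecution:

println(qe.optimizedPlan) // Project collapsed, cast folded to 0.0, isnotnull(pay) inferred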
== Physical Plan ==
CollectLimit 10
+- *(1) LocalLimit 10
   +- *(1) Project [pay_id#10079, totalpay_id#10080, kindpay_id#10081, kindpayname#10082, fee#10083, operator#10084, operator_name#10085, pay_time#10086L, pay#10087, charge#10088, is_valid#10089, entity_id#10090, create_time#10091L, op_time#10092L, last_ver#10093L, opuser_id#10094, card_id#10095, card_entity_id#10096, online_bill_id#10097, type#10098, code#10099, waitingpay_id#10100, load_time#10101, modify_time#10102, ... 8 more fields]
      +- *(1) Filter (isnotnull(pay#10087) && (pay#10087 = 0.0))
         +- *(1) FileScan parquet heguozi.payinfo[pay_id#10079,totalpay_id#10080,kindpay_id#10081,kindpayname#10082,fee#10083,operator#10084,operator_name#10085,pay_time#10086L,pay#10087,charge#10088,is_valid#10089,entity_id#10090,create_time#10091L,op_time#10092L,last_ver#10093L,opuser_id#10094,card_id#10095,card_entity_id#10096,online_bill_id#10097,type#10098,code#10099,waitingpay_id#10100,load_time#10101,modify_time#10102,... 8 more fields] Batched: true, Format: Parquet, Location: CatalogFileIndex[hdfs://cluster-cdh/user/flume/heguozi/payinfo], PartitionCount: 0, PartitionFilters: [], PushedFilters: [IsNotNull(pay), EqualTo(pay,0.0)], ReadSchema: struct<pay_id:string,totalpay_id:string,kindpay_id:string,kindpayname:string,fee:double,operator:...
The Physical Plan is the final executable plan. Note the PushedFilters entry in the FileScan node: the IsNotNull and EqualTo predicates are pushed down into the Parquet reader, so rows can be skipped before they ever reach the Filter operator.
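The physical side is exposed the same way; QueryExecution distinguishes the plan before and after the physical preparation rules, and it is the preparation step that inserts the WholeStageCodegen stages discussed next:

println(qe.sparkPlan)     // physical plan before preparation rules
println(qe.executedPlan)  // final plan, with the *(n) WholeStageCodegen markers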
As already explained in the earlier post on Spark execution plan analysis (https://blog.youkuaiyun.com/zyzzxycj/article/details/82704713), the *(n) in a Physical Plan is a WholeStageCodegenId. So what exactly is this WholeStageCodegen?
Whole-stage code generation is a technique introduced in Spark 2.x. During job execution it collapses the operators of a stage into a single piece of generated code: Spark emits Java source for the fused operators at runtime and compiles it with the Janino compiler, so a row flows through one tight loop with no virtual function calls. This makes it faster than the Volcano Iterator Model used in Spark 1.x. Note, however, that whole-stage code generation only improves the CPU-intensive side of a job; it cannot speed up IO-intensive work, for example the disk reads and writes produced by a Shuffle.
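To see why this matters, here is a hand-written sketch (plain Scala, not Spark code) contrasting the two models: in the Volcano model every row is pulled through a chain of virtual next() calls, while whole-stage code generation fuses scan, filter and limit into one loop, much like the generated code shown below:

// Volcano model: each operator is an iterator; every row pays for virtual dispatch.
trait RowIterator { def next(): Option[Double] }
class FilterOp(child: RowIterator) extends RowIterator {
  // keep pulling from the child until a row with pay == 0.0 arrives (or input ends)
  def next(): Option[Double] = {
    var r = child.next()
    while (r.exists(_ != 0.0)) r = child.next()
    r
  }
}

// Whole-stage code generation, conceptually: one fused loop, no virtual calls.
def fusedScanFilterLimit(rows: Array[Double], limit: Int): Array[Double] = {
  val out = Array.newBuilder[Double]
  var i = 0
  var taken = 0
  while (i < rows.length && taken < limit) {
    val pay = rows(i)
    if (pay == 0.0) { out += pay; taken += 1 } // filter and limit fused inline
    i += 1
  }
  out.result()
}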
Now let's take the SQL above as an example and look at the code generated for *(1). It can be dumped as shown in the sketch below.
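A sketch of both ways to produce the dump, assuming the same spark session:

// From Scala, via the debug helpers:
import org.apache.spark.sql.execution.debug._
spark.sql("select * from heguozi.payinfo where pay = 0 limit 10").debugCodegen()

// Or directly in SQL:
//   explain codegen select * from heguozi.payinfo where pay = 0 limit 10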
The generated code is fairly long, but if you read through it, it is mostly the same operations repeated:
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
*(1) LocalLimit 10
+- *(1) Project [pay_id#10263, totalpay_id#10264, kindpay_id#10265, kindpayname#10266, fee#10267, operator#10268, operator_name#10269, pay_time#10270L, pay#10271, charge#10272, is_valid#10273, entity_id#10274, create_time#10275L, op_time#10276L, last_ver#10277L, opuser_id#10278, card_id#10279, card_entity_id#10280, online_bill_id#10281, type#10282, code#10283, waitingpay_id#10284, load_time#10285, modify_time#10286, ... 8 more fields]
   +- *(1) Filter (isnotnull(pay#10271) && (pay#10271 = 0.0))
      +- *(1) FileScan parquet heguozi.payinfo[pay_id#10263,totalpay_id#10264,kindpay_id#10265,kindpayname#10266,fee#10267,operator#10268,operator_name#10269,pay_time#10270L,pay#10271,charge#10272,is_valid#10273,entity_id#10274,create_time#10275L,op_time#10276L,last_ver#10277L,opuser_id#10278,card_id#10279,card_entity_id#10280,online_bill_id#10281,type#10282,code#10283,waitingpay_id#10284,load_time#10285,modify_time#10286,... 8 more fields] Batched: true, Format: Parquet, Location: CatalogFileIndex[hdfs://cluster-cdh/user/flume/heguozi/payinfo], PartitionCount: 0, PartitionFilters: [], PushedFilters: [IsNotNull(pay), EqualTo(pay,0.0)], ReadSchema: struct<pay_id:string,totalpay_id:string,kindpay_id:string,kindpayname:string,fee:dou