Spark set intersection and difference operations


License: CC 4.0 BY-SA

Original link: http://www.cnblogs.com/realzjx/p/5716292.html


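This post describes a practical way to compute a set difference with Spark. When two DataFrames do not share exactly the same schema, the difference can still be obtained by renaming a column and using a particular join type.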


intersect and except are the set intersection and set difference operations that Spark provides, but both require the two DataFrames involved to have the same schema.
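To make the same-schema requirement concrete, here is a minimal SparkR sketch. It assumes a Spark 2.x or later installation (where sparkR.session is available) running in local mode; the DataFrames a and b and the column name mobile are made up for illustration and are not from the original post.

```r
library(SparkR)
sparkR.session(master = "local[1]")  # assumption: Spark 2.x+ session API

# Two hypothetical DataFrames with an identical schema: one string column `mobile`
a <- createDataFrame(data.frame(mobile = c("111", "222", "333"),
                                stringsAsFactors = FALSE))
b <- createDataFrame(data.frame(mobile = c("222", "444"),
                                stringsAsFactors = FALSE))

common <- intersect(a, b)  # rows that occur in both a and b
only_a <- except(a, b)     # rows of a that do not occur in b

print(collect(common))
print(collect(only_a))
```

Both calls compare whole rows, which is why they stop being useful as soon as the two DataFrames carry different columns.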

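Suppose I want, from set 1 (attribute1, attribute2, attribute3), all rows whose attribute2 appears in another set 2 (attribute2, attribute4, attribute5).

In that case intersect does not help at all. I had only just started with Spark, so I took a detour. Here is what I did.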

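# Copy the join key under a new name, then drop the original column from multiple_orders
multiple_orders$forJoin <- multiple_orders$presentee_mobile
multiple_orders$presentee_mobile <- NULL
# Left outer join: rows of order_nonFastCar without a match get a NULL forJoin
order_nonFastCar <- join(order_nonFastCar, multiple_orders, order_nonFastCar$presentee_mobile == multiple_orders$forJoin, "left_outer")
# Keep only the unmatched rows, then drop the helper column
order_nonFastCar <- filter(order_nonFastCar, "forJoin is null")
order_nonFastCar$forJoin <- NULL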

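The column is renamed because order_nonFastCar also has a presentee_mobile column. Without the rename, the joined result contains two columns with the same name, and filter cannot single out the one that came from multiple_orders. Note that filtering on "forJoin is null" keeps the rows with no match in multiple_orders, which is the difference; filtering on "forJoin is not null" keeps the matching rows instead.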
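As an alternative to the rename, join, and filter detour, the same results can be sketched with semi and anti joins. This assumes a Spark release whose SparkR join accepts the "left_semi" and "left_anti" join types (they are not available in the oldest SparkR versions); the sample DataFrames orders and multi, and their contents, are hypothetical stand-ins for order_nonFastCar and multiple_orders.

```r
library(SparkR)
sparkR.session(master = "local[1]")  # assumption: Spark 2.x+ session API

# Hypothetical stand-in for order_nonFastCar
orders <- createDataFrame(data.frame(
  order_id = c(1, 2, 3),
  presentee_mobile = c("111", "222", "333"),
  stringsAsFactors = FALSE))

# Hypothetical stand-in for multiple_orders, with a different schema
multi <- createDataFrame(data.frame(
  presentee_mobile = c("222", "222"),
  times = c(2, 5),
  stringsAsFactors = FALSE))

# Difference: rows of `orders` whose presentee_mobile does NOT appear in `multi`
not_in_multi <- join(orders, multi,
                     orders$presentee_mobile == multi$presentee_mobile,
                     "left_anti")

# Intersection on the key: rows of `orders` whose presentee_mobile DOES appear in `multi`
in_multi <- join(orders, multi,
                 orders$presentee_mobile == multi$presentee_mobile,
                 "left_semi")

print(collect(not_in_multi))
print(collect(in_multi))
```

Semi and anti joins return only the columns of the left DataFrame, so no helper column has to be renamed or dropped afterwards, and repeated keys on the right side do not duplicate rows on the left.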
