head(), first(), take(), and isEmpty() on Spark RDD, Dataset, and DataFrame

License: CC 4.0 BY-SA

Tags: #rdd #dataframe #dataset


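This article looks at key operations on RDDs and Datasets in Spark: take, first, head, isEmpty, and count. It compares how they are used and how efficient they are, explains why picking the right function matters in different scenarios, and describes the role each plays in a data-processing pipeline.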

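RDD:

An RDD has no head function.

take(num: Int): Array[T]: returns the first num elements of the RDD to the driver as an array. An empty RDD yields an empty array. Calling it on an RDD whose element type is Nothing or Null throws an exception.

first(): T: returns the first element of the RDD. It throws an exception if the RDD is empty, or if its element type is Nothing or Null.

isEmpty(): Boolean: tests whether the RDD contains no elements. An RDD with zero partitions is empty, and so is an RDD that has one or more partitions but no elements in any of them. It throws an exception on an RDD whose element type is Nothing or Null, because it calls take(1) internally.

Note: `parallelize(Seq())` has type `RDD[Nothing]`. Avoid it by writing `parallelize(Seq[T]())` instead.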
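The RDD behavior described above can be tried out with a small local job. The following is a minimal sketch, not a definitive implementation: the object name `RddHeadSketch` and the helper `firstOption` are hypothetical names introduced for illustration, and the code assumes a Spark dependency on the classpath and a local master.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.util.Try

object RddHeadSketch {
  // Hypothetical helper: a non-throwing alternative to first().
  // take(1) returns an empty array on an empty RDD, so headOption is safe.
  def firstOption[T](rdd: RDD[T]): Option[T] = rdd.take(1).headOption

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("rdd-head-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Typed empty RDD: the element type is Int, not Nothing.
    val empty: RDD[Int] = sc.parallelize(Seq.empty[Int])
    val nums: RDD[Int] = sc.parallelize(Seq(1, 2, 3), numSlices = 2)

    println(empty.isEmpty())              // true
    println(empty.take(2).length)         // 0
    println(Try(empty.first()).isFailure) // true: first() throws on an empty RDD
    println(firstOption(empty))           // None
    println(nums.take(2).toList)          // List(1, 2)
    println(nums.first())                 // 1
    println(firstOption(nums))            // Some(1)

    spark.stop()
  }
}
```

Wrapping `take(1)` in `headOption` is one way to avoid the exception from `first()` when the RDD might be empty.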

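Dataset:

head(n: Int): Array[T]: returns the first n rows, pulling the data to the driver. On an empty Dataset it returns an empty array.

def head(): T: returns the first row. It is implemented as head(1).head, so it throws NoSuchElementException on an empty Dataset.

def first(): T: alias for head(). It also throws on an empty Dataset.

take(n: Int): Array[T]: alias for head(n). It returns an empty array on an empty Dataset.

isEmpty: Boolean: available only since Spark 2.4.0.

For an emptiness check, count() is less efficient than foreachPartition.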

/**
 * Returns the first `n` rows.
 *
 * @note this method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 *
 * @group action
 * @since 1.6.0
 */
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

/**
 * Returns the first row.
 * @group action
 * @since 1.6.0
 */
def head(): T = head(1).head

/**
 * Returns the first row. Alias for head().
 * @group action
 * @since 1.6.0
 */
def first(): T = head()

def take(n: Int): Array[T] = head(n)
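Since `head()` is defined as `head(1).head`, the array-returning variants and the single-row variants behave differently on an empty Dataset. The sketch below illustrates that difference under stated assumptions: `DatasetHeadSketch` and `firstOption` are hypothetical names, and the code assumes a local SparkSession.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.util.Try

object DatasetHeadSketch {
  // Hypothetical helper: head(1) yields an empty array on an empty Dataset,
  // so headOption gives a non-throwing alternative to head()/first().
  def firstOption[T](ds: Dataset[T]): Option[T] = ds.head(1).headOption

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ds-head-sketch")
      .getOrCreate()
    import spark.implicits._

    val empty: Dataset[Int] = spark.emptyDataset[Int]
    val nums: Dataset[Int] = Seq(10, 20, 30).toDS()

    println(empty.head(2).length)         // 0: head(n) returns an empty array
    println(empty.take(2).length)         // 0: take(n) is an alias for head(n)
    println(Try(empty.head()).isFailure)  // true: head(1).head fails on an empty array
    println(Try(empty.first()).isFailure) // true: first() is an alias for head()
    println(firstOption(empty))           // None
    println(nums.head(2).toList)          // List(10, 20)
    println(firstOption(nums))            // Some(10)

    spark.stop()
  }
}
```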

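An untyped empty Seq has type Seq[Nothing]. No Encoder exists for Nothing, so toDS() is not available on it:

scala> val r2 = Seq()
r2: Seq[Nothing] = List()

scala> val d3 = r2.toDS()
<console>:36: error: value toDS is not a member of Seq[Nothing]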
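Giving the empty collection an explicit element type lets the conversion compile. This is a minimal sketch of two ways to build a typed empty Dataset; the object and method names (`TypedEmptyDataset`, `emptyInts`, `emptyStrings`) are hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object TypedEmptyDataset {
  // Seq.empty[Int] is a Seq[Int], for which spark.implicits._ provides an Encoder.
  def emptyInts(spark: SparkSession): Dataset[Int] = {
    import spark.implicits._
    Seq.empty[Int].toDS()
  }

  // SparkSession.emptyDataset builds an empty Dataset directly from an Encoder.
  def emptyStrings(spark: SparkSession): Dataset[String] = {
    import spark.implicits._
    spark.emptyDataset[String]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("typed-empty-dataset")
      .getOrCreate()

    println(emptyInts(spark).head(1).isEmpty) // true
    println(emptyStrings(spark).count())      // 0

    spark.stop()
  }
}
```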

Efficiency of count() versus foreachPartition():

Among these options, ds.rdd.isEmpty gives the best performance.
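The article does not show how each check is written, so the following is only a sketch of what the three candidates might look like. The accumulator-based `foreachPartition` variant in particular is an assumption about what the comparison refers to, and all names (`EmptinessChecks`, `byCount`, `byForeachPartition`, `byRddIsEmpty`) are hypothetical. No timings are claimed here.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object EmptinessChecks {
  // Option 1: count every row, then compare with zero.
  def byCount[T](ds: Dataset[T]): Boolean = ds.count() == 0L

  // Option 2 (assumed form): probe each partition with hasNext and record
  // non-empty partitions in an accumulator, without iterating over the rows.
  def byForeachPartition[T](ds: Dataset[T]): Boolean = {
    val nonEmpty = ds.sparkSession.sparkContext.longAccumulator("nonEmptyPartitions")
    // Typed as a Scala function so the Scala overload of foreachPartition is chosen.
    val probe: Iterator[T] => Unit = it => if (it.hasNext) nonEmpty.add(1L)
    ds.foreachPartition(probe)
    nonEmpty.sum == 0L
  }

  // Option 3: convert to an RDD and use RDD.isEmpty, which relies on take(1).
  def byRddIsEmpty[T](ds: Dataset[T]): Boolean = ds.rdd.isEmpty()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("emptiness-checks")
      .getOrCreate()
    import spark.implicits._

    val empty = spark.emptyDataset[Int]
    val nums = Seq(1, 2, 3).toDS()

    println(byCount(empty))            // true
    println(byForeachPartition(empty)) // true
    println(byRddIsEmpty(empty))       // true
    println(byCount(nums))             // false
    println(byForeachPartition(nums))  // false
    println(byRddIsEmpty(nums))        // false

    spark.stop()
  }
}
```

All three return the same answer; they differ in how much work the job does, which is why measuring them on representative data is worthwhile before picking one.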

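DataFrame:

head(int n): also pulls the rows to the driver, so it performs very poorly when n is large.

head(): returns the first element of head(1).

first(): alias for head().

take(int n): alias for head(int n).

There is no isEmpty function in the Spark 1.x DataFrame class decompiled below. Since Spark 2.0, DataFrame is an alias for Dataset[Row], which has isEmpty from 2.4.0 onward.

df.rdd.isEmpty gives the best performance.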

public Row[] head(int n)
{
    return limit(n).collect();
}

public Row head()
{
    return (Row)Predef$.MODULE$.refArrayOps((Object[])head(1)).head();
}

public Row first()
{
    return head();
}

public Row[] take(int n)
{
    return head(n);
}
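Because `head(n)` is just `limit(n).collect()`, fetching at most one row is enough to tell whether a DataFrame has any rows. The sketch below shows that idea next to the `df.rdd.isEmpty` check recommended above. It is an illustration under assumptions, not a measured comparison: `DataFrameEmptiness`, `isEmptyViaHead`, and `isEmptyViaRdd` are hypothetical names, and the sample data is made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameEmptiness {
  // Fetch at most one row to the driver and test whether anything came back.
  def isEmptyViaHead(df: DataFrame): Boolean = df.head(1).isEmpty

  // The check recommended in the text: go through the underlying RDD.
  def isEmptyViaRdd(df: DataFrame): Boolean = df.rdd.isEmpty()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("df-emptiness")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows for illustration only.
    val people = Seq(("Ann", 31), ("Bob", 27)).toDF("name", "age")
    val nobody = people.filter($"age" > 100)

    println(isEmptyViaHead(people)) // false
    println(isEmptyViaRdd(people))  // false
    println(isEmptyViaHead(nobody)) // true
    println(isEmptyViaRdd(nobody))  // true

    spark.stop()
  }
}
```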