The output of df.show is nicely formatted, but we cannot capture it as a string, because the showString method that show calls internally is private. Is there a way to get the same output without writing a method that duplicates the same functionality?

Yes, there is.
The source of the showString method that show calls. Note how the public show translates its Boolean truncate argument into an Int for showString: true becomes 20 (truncate cells to 20 characters) and false becomes 0 (no truncation), so when calling showString directly you pass a column width, not a flag.
def show(truncate: Boolean): Unit = show(20, truncate)

def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
  println(showString(numRows, truncate = 20))
} else {
  println(showString(numRows, truncate = 0))
}

private[sql] def showString(
    _numRows: Int,
    truncate: Int = 20,
    vertical: Boolean = false): String = {
  val numRows = _numRows.max(0).min(ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH - 1)
  // Get rows represented by Seq[Seq[String]], we may get one more line if it has more data.
  val tmpRows = getRows(numRows, truncate)
  val hasMoreData = tmpRows.length - 1 > numRows
  val rows = tmpRows.take(numRows + 1)
  val sb = new StringBuilder
  // ......
  sb.toString()
}
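As a consequence, calling showString directly from application code does not compile, since private[sql] limits access to the org.apache.spark.sql package. A minimal sketch of the problem, assuming a SparkSession named spark:

```scala
val df = spark.range(3).toDF("id")

df.show() // fine: prints the formatted table to stdout

// df.showString(3, 20, vertical = false)
// does not compile outside the org.apache.spark.sql package,
// because showString is declared private[sql]
```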
Approach 1

Step 1: define a custom trait. Put the following code into a file named DatasetShow.scala and place it in your source directory:
package org.apache.spark.sql

trait DatasetShow {
  implicit class DatasetHelper[T](ds: Dataset[T]) {
    def toShowString(numRows: Int = 20, truncate: Int = 0, vertical: Boolean = false): String =
      "\n" + ds.showString(numRows, truncate, vertical)
  }
}
Note on the line `package org.apache.spark.sql`: declaring the trait inside this package is exactly what grants it access to showString, which is private[sql]. IDEA will flag the line with an error such as "Package name 'org.apache.spark.sql' does not correspond to the file path 'com.huawei.cbgai.dataprocess.utils'". Simply accept the suggested quick fix "Move to package org.apache.spark.sql": IDEA creates the directory org/apache/spark/sql under your scala source root and moves DatasetShow.scala there. (If you are not using IDEA, you can perform this step manually.)
Step 2: mix the trait into the class that needs the printout:
class MyClass extends DatasetShow {
  // .....
  LOGGER.info("table contents:" + df.toShowString)
}
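A slightly fuller sketch of Step 2, assuming an slf4j logger and a SparkSession named spark (the class name ReportJob and its contents are hypothetical, for illustration only):

```scala
import org.apache.spark.sql.{DatasetShow, SparkSession}
import org.slf4j.LoggerFactory

class ReportJob(spark: SparkSession) extends DatasetShow {
  private val LOGGER = LoggerFactory.getLogger(classOf[ReportJob])

  def run(): Unit = {
    val df = spark.range(10).toDF("id")
    // toShowString defaults to numRows = 20 and truncate = 0 (no truncation);
    // pass e.g. truncate = 20 to clip wide cells the way show does
    LOGGER.info("table contents:" + df.toShowString(numRows = 10))
  }
}
```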
Note that this approach may stop working on Java 9+ (once Spark eventually supports it), because Java 9+ enforces module boundaries more strictly than Java 8 and earlier. In that case you may need to access this API via reflection instead.
Approach 2

Use reflection to invoke showString directly, for example from the spark-shell:
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> val showString = classOf[org.apache.spark.sql.DataFrame].getDeclaredMethod("showString", classOf[Int], classOf[Int], classOf[Boolean])
showString: java.lang.reflect.Method = public java.lang.String org.apache.spark.sql.Dataset.showString(int,int,boolean)
scala> showString.setAccessible(true)
scala> showString.invoke(df, 10.asInstanceOf[Object], 20.asInstanceOf[Object], false.asInstanceOf[Object]).asInstanceOf[String]
res2: String =
"+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
"
References
https://www.appsloveworld.com/scala/100/10/how-to-redirect-scala-spark-dataset-show-to-log4j-logger
https://cloud.tencent.com/developer/ask/sof/148011
https://blog.youkuaiyun.com/rover2002/article/details/106242682