The difference between SQLContext and HiveContext in Spark SQL

The Spark SQL module is used for processing structured data and provides DataFrames and a SQL query engine. HiveContext is a superset of SQLContext that allows executing both SQL queries and Hive commands. After Spark 2.0 the dependence on Hive is reduced, but HiveContext is still needed to use Hive functionality. HiveContext has a more robust SQL parser and support for Hive UDFs, and it is required to start the Thrift server, but it comes with heavy dependencies.


I was confused about the difference between the two, so I googled it.
One of Spark's modules is SparkSQL. SparkSQL can be used to process structured data, so with SparkSQL your data must have a defined schema. In Spark 1.3.1, SparkSQL implements dataframes and a SQL query engine. SparkSQL has a SQLContext and a HiveContext. HiveContext is a superset of the SQLContext. Hortonworks and the Spark community suggest using the HiveContext. You can see below that when you run spark-shell, which is your interactive driver application, it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands. The same behavior occurs for pyspark. You can review the Spark 1.3.1 documentation for SQLContext and HiveContext at the SQLContext documentation and HiveContext documentation.
Source: https://blogs.msdn.microsoft.com/bigdatasupport/2015/09/14/understanding-sparks-sparkconf-sparkcontext-sqlcontext-and-hivecontext/
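As a minimal sketch of how the two contexts are constructed in a standalone Spark 1.x application (the app name and master setting here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("ContextDemo").setMaster("local[*]")
val sc = new SparkContext(conf)

// Plain SQLContext: DataFrames plus the built-in SQL dialect only
val sqlContext = new SQLContext(sc)

// HiveContext: a subclass of SQLContext that adds HiveQL parsing,
// Hive UDF resolution, and access to the Hive metastore
val hiveContext = new HiveContext(sc)
```

In spark-shell you never write this yourself: the shell constructs sc and the HiveContext (named sqlContext) for you, which is the behavior the quoted passage describes.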

Spark 2.0 provides native window functions (SPARK-8641) and features some additional improvements in parsing and much better SQL 2003 compliance, so it is significantly less dependent on Hive to achieve core functionality, and because of that HiveContext seems to be slightly less important.
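In Spark 2.0 both contexts are subsumed by SparkSession, where Hive support becomes an opt-in flag rather than a separate class. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SessionDemo")
  .enableHiveSupport() // opt-in; only needed for the Hive metastore and Hive UDFs
  .getOrCreate()

// The old context is still reachable for legacy code
val sqlContext = spark.sqlContext
```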

Spark < 2.0

Obviously if you want to work with Hive you have to use HiveContext. Beyond that, the biggest difference as of now (Spark 1.5) is support for window functions and the ability to access Hive UDFs.

Generally speaking, window functions are a pretty cool feature and can be used to solve quite complex problems in a concise way without going back and forth between RDDs and DataFrames. Performance is still far from optimal, especially without a PARTITION BY clause, but that is really nothing Spark-specific.
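A sketch of the kind of problem window functions solve concisely; the sales data here is hypothetical, and in Spark 1.4/1.5 this code needs the hiveContext built above, since a plain SQLContext rejects the window expression at analysis time:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical data: per-department sale amounts
val sales = hiveContext.createDataFrame(Seq(
  ("books", 100.0), ("books", 200.0), ("toys", 50.0)
)).toDF("department", "amount")

// Attach each department's average to every row, with no self-join
// and no round-trip through RDDs
val w = Window.partitionBy("department")
sales.withColumn("dept_avg", avg(col("amount")).over(w)).show()
```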

Regarding Hive UDFs, this is not a serious issue now, but before Spark 1.5 many SQL functions were expressed using Hive UDFs and required a HiveContext to work.
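For example, Hive built-ins such as percentile are resolvable only through a HiveContext in those versions. A sketch, reusing the hypothetical sales DataFrame from above:

```scala
// Register the DataFrame so it is visible to SQL, then call a Hive UDAF
// (Hive's percentile expects an integral column, hence the CAST)
sales.registerTempTable("sales")
hiveContext.sql(
  "SELECT department, percentile(CAST(amount AS BIGINT), 0.5) " +
  "FROM sales GROUP BY department"
).show()
```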

HiveContext also provides a more robust SQL parser. See, for example: py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement

Finally, HiveContext is required to start the Thrift server.
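Besides the standalone sbin/start-thriftserver.sh script, a Thrift JDBC/ODBC server can also be embedded in a running application, which again presupposes a HiveContext. A sketch, assuming the spark-hive-thriftserver artifact is on the classpath:

```scala
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Expose the tables registered in this HiveContext over JDBC/ODBC
HiveThriftServer2.startWithContext(hiveContext)
```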

The biggest problem with HiveContext is that it comes with large dependencies.
Source: http://stackoverflow.com/questions/33666545/what-is-the-difference-between-apache-spark-sqlcontext-vs-hivecontext

I also heard someone say it is because enterprises these days all use Hive.
