Overview of Spark, YARN, and HDFS

Spark is a relatively recent addition to the Hadoop ecosystem. It is an analytics engine and framework capable of running queries up to 100 times faster than traditional MapReduce jobs written in Hadoop. In addition to the performance boost, developers can write Spark jobs in Scala, Python, or Java. Spark can load data directly from disk or memory, as well as from other data storage technologies such as Amazon S3, the Hadoop Distributed File System (HDFS), HBase, and Cassandra.

Submitting Spark Jobs

Because Spark workflows are iterative, Spark scripts are often developed interactively, either as standalone scripts or in a notebook. The result is compact scripts of tight, functional code.

A Spark script can be submitted to a Spark cluster using various methods:

  1. Running the script directly on the head node
  2. Using the acluster submit command from the client
  3. Interactively in an IPython shell or Jupyter Notebook on the cluster
  4. Using the spark-submit script from the client

To run a script on the head node, simply execute python example.py on the cluster. Developing locally on test data and pushing the same analytics scripts to the cluster is a key feature of Anaconda Cluster. With a cluster created and Spark scripts developed, you can use the acluster submit command to automatically push the script to the head node and run it on the Spark cluster.

The examples below use the acluster submit command to run the Spark examples, but any of the above methods can be used to submit a job to the Spark cluster.
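
For example, pushing a script to the head node and running it on the cluster might look like the following sketch. It assumes the acluster client is already configured for your cluster and that the script is named spark_example.py, as in the later examples:

# Push spark_example.py to the head node and run it on the Spark cluster
acluster submit spark_example.py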

Running Spark in Different Modes

Anaconda Cluster can install Spark in standalone mode via the spark-standalone plugin or with the YARN resource manager via the spark-yarn plugin. YARN can be useful when resource management on the cluster is an issue, e.g., when resources need to be shared by many users, tasks, and applications.
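
As a sketch, installing either plugin on a running cluster might look like the following; this assumes the acluster install command accepts the plugin names above, and the exact invocation can vary between Anaconda Cluster versions:

# Install Spark in standalone mode
acluster install spark-standalone

# Or install Spark managed by YARN
acluster install spark-yarn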

Spark scripts can be configured to run in standalone mode:

from pyspark import SparkConf

conf = SparkConf()
conf.setMaster('spark://<HOSTNAME_SPARK_MASTER>:7077')

or with YARN by setting yarn-client as the master within the script:

conf = SparkConf()
conf.setMaster('yarn-client')

You can also submit jobs with YARN by setting --master yarn-client as an option to the spark-submit command:

spark-submit --master yarn-client spark_example.py
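
For reference, a minimal spark_example.py that could be submitted in either mode might look like the following sketch; the application name and the trivial count/sum job are illustrative, not part of the original example:

from pyspark import SparkConf, SparkContext

# Minimal sketch of a script runnable via spark-submit or acluster submit.
# Swap setMaster() for the standalone master URL to run in standalone mode.
conf = SparkConf().setAppName('spark_example').setMaster('yarn-client')
sc = SparkContext(conf=conf)

# Trivial job: count and sum a parallelized range of numbers
rdd = sc.parallelize(range(1000))
print("count: %d, sum: %d" % (rdd.count(), rdd.sum()))

sc.stop()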

Working with Data in HDFS

Moving data in and around HDFS can be difficult. If you need to move data from your local machine to HDFS, from Amazon S3 to HDFS, from Amazon S3 to Redshift, from HDFS to Hive, and so on, we recommend using odo, which is part of the Blaze ecosystem. Odo efficiently migrates data from the source to the target through a network of conversions.

Use odo to upload a file:

from odo import odo

# Load local data into HDFS, authenticating as the hdfs user on port 14000.
# HEAD_NODE_IP should hold the IP address of the cluster's head node.
auth = {'user': 'hdfs', 'port': '14000'}
odo('./iris.csv', 'hdfs://{}:/tmp/iris/iris.csv'.format(HEAD_NODE_IP),
    **auth)

Use odo to upload data from a URL:

from odo import odo

# Load data from a remote URL into HDFS
auth = {'user': 'hdfs', 'port': '14000'}
url = 'https://raw.githubusercontent.com/ContinuumIO/blaze/master/blaze/examples/data/iris.csv'
odo(url, 'hdfs://{}:/tmp/iris/iris.csv'.format(HEAD_NODE_IP),
    **auth)
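
Once the file is in HDFS, a Spark job can read it back directly. The following is a sketch only: it assumes HEAD_NODE_IP is the same head-node address used above and that the NameNode listens on the default port 8020, which may differ on your cluster:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('read_iris').setMaster('yarn-client')
sc = SparkContext(conf=conf)

# Read the uploaded CSV from HDFS into an RDD and count its lines
lines = sc.textFile('hdfs://{}:8020/tmp/iris/iris.csv'.format(HEAD_NODE_IP))
print(lines.count())

sc.stop()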

If you are unfamiliar with Spark and/or SQL, we recommend using Blaze to express selections, aggregations, group-bys, etc. in a dataframe-like style. Blaze provides Python users with a familiar interface to query data that exists in other data storage systems.
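
For example, a few dataframe-style Blaze expressions against the iris.csv file might look like the following sketch. It is shown against a local copy of the file, the column names follow the Blaze example dataset, and depending on your Blaze version the entry point may be Data rather than data:

from blaze import data, by

# Open the CSV as a Blaze expression (other backends work similarly)
iris = data('./iris.csv')

# Selection: rows with sepal_length greater than 5.0
print(iris[iris.sepal_length > 5.0])

# Group-by with aggregation: mean petal_length per species
print(by(iris.species, avg=iris.petal_length.mean()))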

Troubleshooting "Operation category READ is not supported in state standby"

When running Spark on YARN, the error "Operation category READ is not supported in state standby" usually occurs because Hadoop does not allow read operations against a NameNode that is in standby state. Some possible resolutions follow.

Let Spark try the available NameNodes. Spark obtains both NameNodes (nn1 and nn2) from the Hadoop configuration and tries them one after another until it finds an available one. If the first node it tries happens to be the standby node, the error message is displayed, but this does not stop Spark from moving on to the active NameNode. In this case no extra handling is usually needed; just wait for Spark to find the available node [3].

Handle both NameNodes being in standby. If http://namenode:50070/dfshealth.html#tab-overview and http://datanode01:50070/dfshealth.html#tab-overview both show standby NameNodes, run the following:

zkcli.sh
ls /
rmr /hadoop-ha
hdfs zkfc -formatZK

After these steps the cluster can be used normally. To test high availability, run kill -9 against the pid of the active NameNode; the NameNode that was previously standby should then become active [4].

Manually switch the NameNode state. To keep the cluster failing over automatically, another approach is to kill the NameNode process on one node so that the other node's state changes from standby to active, and then restart the NameNode process on the killed node, which comes back up as standby [5].

Use the NameNode logical name in configuration. If Hive connects to HDFS with a specific NameNode address while the Hadoop cluster uses HA, failover between the NameNodes causes problems. The solution is to configure the Hive metadata to use the NameNode's logical (nameservice) name rather than a specific hostname [4].

Make sure the required processes are running. When the cluster is started with start-dfs.sh and start-yarn.sh, the related processes, including both NameNode processes, zkfc, and the journal nodes, are started automatically and do not need to be started by hand. Note, however, that the ResourceManager process on the standby NameNode's host may not start automatically and may need to be started manually [4].
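
To confirm which NameNode is active before or after applying these steps, the hdfs haadmin command can be used. This is a sketch that assumes the NameNode IDs are nn1 and nn2, as in the Hadoop configuration described above:

# Report the HA state (active or standby) of each NameNode
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

If both commands report standby, the ZooKeeper-based recovery steps above apply.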