ktlinker1119 - CSDN Blog

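Original: Learning Spark 17: Accessing MySQL from Spark

This post describes how to access MySQL from the spark-shell environment using Spark 1.4. 1. Prepare the MySQL JDBC driver: upload the driver that matches your MySQL version to the server where spark-shell is launched. Here the MySQL driver is placed in an ext directory (created manually) under $SPARK_HOME. The connection is tested against MySQL 5.6.19 with the driver mysql-connector-java-5.1.31.jar. 2.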

2015-06-17 14:49:13 3586 1

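Original: Learning Spark 16: Spark Streaming Execution Flow (2)

Spark Streaming Execution Flow (1) described how SocketReceiver receives data and how BlockGenerator then turns that data into Blocks and stores them. This post describes how Blocks are turned into RDDs and submitted for execution. 2. Creating the Job: this diagram is part of the flow chart from the previous post. During JobGenerator startup, an anonymous Actor and a RecurringTimer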

2015-06-05 12:43:40 836

Original: Learning Spark 15: Spark Streaming Execution Flow (1)

This post uses the Socket Stream example from the Spark Streaming documentation to describe the execution flow of StreamingContext. Example code: val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word =>

2015-06-05 12:30:14 2224

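Original: Learning Spark 14: Spark on Yarn

Spark can run on Yarn in two modes: (1) yarn-cluster; (2) yarn-client. The difference is that in yarn-cluster mode the Driver runs inside a NodeManager in the cluster, whereas in yarn-client mode the Driver runs on the client that launched spark-submit. This post briefly describes the execution flow of both modes. 1. yarn-client 1.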

2015-05-28 16:22:03 960

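Original: Learning Spark 13: Standalone HA

Standalone mode provides high availability for the Master through ZooKeeper. It can use ZooKeeper for leader election among multiple Masters and for storing Worker and App state. When a Master starts, the corresponding Master actor object creates the appropriate Master failure-recovery mode according to RECOVERY_MODE. This post describes how the Master is recovered through ZooKeeper.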

2015-05-28 16:14:07 732

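Original: Learning Spark 12: checkpoint

To checkpoint an RDD, first call SparkContext's setCheckpointDir to set where the checkpoint data is stored. The RDD checkpoint operation is initiated by SparkContext.runJob. Once you understand how a Job executes end to end, RDD checkpointing is relatively easy to follow. 1. RDD.checkpoint def checkpoint() { i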

2015-05-25 16:25:59 2218

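Original: Learning Spark 11: Shuffle Read

This post describes how, after the ShuffleMapTasks finish, later Stages read the Shuffle Write results. RDDs involved in Shuffle Read include ShuffledRDD, CoGroupedRDD, and others. Shuffle Read is initiated from the compute method of these RDDs. The Shuffle Read process is described below using ShuffledRDD as an example. 1. Entry function: the entry point of the Shuffle Read operation is S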

2015-05-22 14:29:37 3065

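Original: Learning Spark 10: How Task Execution Results Are Returned

When a ShuffleMapTask or ResultTask finishes, its result is passed to the Driver. 1. Return flow: the return flow involves the Executor and the Driver. 2. TaskRunner.run override def run() { ...... try { ...... // Run the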

2015-05-20 14:57:20 3003

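Original: Learning Spark 9: Shuffle Write

During Stage submission, two kinds of Task are created: ShuffleMapTask and ResultTask. While a ShuffleMapTask runs, it writes Shuffle results to disk. 1. ShuffleMapTask.runTask: this method is the entry point of Task execution. override def runTask(context: TaskContext): MapStatus = {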

2015-05-19 15:54:40 1057

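Original: Learning Spark 8: From Stage Submission to Task Execution

1. Stage submission flow: once the RDD graph has been divided into Stages, Stage submission begins. The flow from Stage submission to Task execution is as follows: DAGScheduler.handleJobSubmitted first completes the Stage division and then submits the Stages. 1.1. DAGScheduler.submitStage private def subm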

2015-05-12 19:27:51 1321

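Original: Learning Spark 7: Job Triggering and Stage Division

1. Job submission trigger flow. Diagram: job submission is triggered by an RDD action, which in turn calls SparkContext.runJob. An RDD action may pass through several overloads of SparkContext.runJob, but the runJob that is ultimately called is shown in 1.1. 1.1. SparkContext.runJob def runJob[T, U: Cla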

2015-05-12 19:06:34 1450

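Original: Learning Spark 6: Broadcast and RDD cache

1. Broadcast 1.1. Creation flow: BlockManager's three put* methods (putIterator, putBytes, putArray) all take a (tellMaster: Boolean = true) parameter, which defaults to true. This parameter is a switch that controls whether the Master (BlockManagerMasterActor) is notified; when it is true, after the data is written to the local storage system,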

2015-05-04 09:18:12 4375

Original: Learning Spark 5: BlockManager Initialization

Every Driver and Executor has its own BlockManager, which manages RDD caches, Shuffle outputs, Broadcast storage, and so on. 1. BlockManager private[spark] class BlockManager( executorId: String, actorSystem: ActorSystem, val master: Block

2015-05-04 09:05:26 4287

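Original: Learning Spark 4: SparkContext Execution Process

This post mainly covers how SparkConf and SparkContext are created after an application starts in standalone mode. An application typically creates a SparkConf object first, sets the relevant parameters, and then uses that object to initialize a SparkContext. 1. SparkConf: SparkConf manages the configuration of a Spark application; parameters are expressed as key-value pairs. class SparkCo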

2015-04-24 09:58:07 1680

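Original: Learning Spark 3: How SparkSubmit Launches the Application Main Class

This post mainly covers, in standalone mode, the path from the bin/spark-submit script to the SparkSubmit class launching the application's main class. 1 Call flow diagram 2 Launch scripts 2.1 bin/spark-submit # For client mode, the driver will be launched in the same JVM that launches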

2015-04-22 11:13:41 9447

Original: Learning Spark 2: Worker Startup Flow

1. Launch script sbin/start-slaves.sh # Launch the slaves if [ "$SPARK_WORKER_INSTANCES" = "" ]; then exec "$sbin/slaves.sh" cd "$SPARK_HOME" \; "$sbin/start-slave.sh" 1 "spark://$SPARK_MASTER_IP:$SPARK_MASTER_P

2015-04-20 15:31:26 2582

Original: Learning Spark 1: Master Startup Flow

1. Launch script sbin/start-master.sh "$sbin"/spark-daemon.sh start org.apache.spark.deploy.master.Master 1 --ip $SPARK_MASTER_IP --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT Parameters:

2015-04-20 15:16:57 3263

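Original: Scala Partially Applied Functions

Scala partially applied functions can be used not only with ordinary functions but also with class methods. (1) With an ordinary function: def sum(a: Int, b: Int, c: Int) = { a + b + c } val sum1 = sum(_: Int, 2, 3) sum1: Int => Int = <function1> sum1(1) Result: res1: Int = 6 (2) With a class method: tra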

2015-03-24 15:19:12 699
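
The partial application summarized in the entry above can be sketched as follows. The `sum` function and `sum1` come from the excerpt; the `Scaler` class and its `scale` method are hypothetical stand-ins for the class-method case, since the excerpt is cut off before its own example.

```scala
object PartialApplicationSketch {
  // Ordinary function, as in the excerpt.
  def sum(a: Int, b: Int, c: Int): Int = a + b + c

  // Fix b = 2 and c = 3; the typed placeholder leaves `a` open,
  // so the result is a function of type Int => Int.
  val sum1: Int => Int = sum(_: Int, 2, 3)

  // Hypothetical class (not from the original post) used to show
  // that the same syntax works on an instance method.
  class Scaler(factor: Int) {
    def scale(x: Int, offset: Int): Int = x * factor + offset
  }

  val scaler = new Scaler(10)

  // Fix offset = 1 on the method; `x` stays open.
  val scalePlusOne: Int => Int = scaler.scale(_: Int, 1)

  def main(args: Array[String]): Unit = {
    println(sum1(1))          // 1 + 2 + 3 = 6
    println(scalePlusOne(4))  // 4 * 10 + 1 = 41
  }
}
```

In both cases the compiler builds a function value that captures the fixed arguments, which is why `sum1` and `scalePlusOne` can be passed anywhere an `Int => Int` is expected.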

Original: Accessing Hive with Spark HiveContext in IDEA

1 Project configuration: the hive-site.xml configuration file must be placed in the src directory. 2 Code import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.hive.HiveContext object SparkHive { def main(

2015-02-04 10:20:28 8549

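Original: Accessing HBase through Spark on Windows

1 Configure the IDE environment (1) Install IDEA: http://www.jetbrains.com/idea/download/ (2) Install the Scala plugin 2 Create the HBaseTest project: create the HBaseTest project and add the dependency libraries: 3 Code import org.apache.hadoop.hbase.HBaseConfiguration import org.apache.hadoop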

2015-01-29 15:10:30 1043

Original: Running the Spark Driver on Windows

1 Install Scala IDE: download Scala IDE from http://scala-ide.org/ 2 Install Hadoop for Windows: running Hadoop on Windows requires recompiling it. https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip provides a precompiled build of hadoop 2.2.0. Use

2015-01-23 09:46:31 1502

Packt.Mastering.PostgreSQL.15.5th.Edition.2023.1

2023-07-28

Spark 2.3.0 API documentation. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming

2018-06-05

Lean Publishing Software Architecture for Developers(2014).rar

Software Architecture for Developers is a practical and pragmatic guide to modern software architecture, specifically aimed at software developers. You'll learn: The essence of software architecture. Why the software architecture role should include coding, coaching and collaboration. The things that you really need to think about before coding. How to visualise your software architecture using the C4 model. A lightweight approach to documenting your software. Why there is no conflict between agile and architecture. What "just enough" up front design means. How to identify risks with risk-storming.

2020-03-23

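精通特征工程.pdf (Mastering Feature Engineering)

This book introduces a wide range of feature engineering techniques and explains the basic principles of feature engineering. Topics include: basic concepts in the machine learning pipeline, basic feature engineering for numeric data, feature engineering for natural text, term frequency-inverse document frequency (TF-IDF), efficient encoding techniques for categorical variables, principal component analysis, model stacking, image processing, and more.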

2019-05-24

Spark 1.2.0 documentation (spark-1.2.0-doc)

spark-1.2.0 documentation and API. Spark Overview: Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

2015-01-04
