Conquer Big Data through Spark

This course comprehensively covers Apache Spark application development, from the architecture to advanced features, including the core modules Spark SQL, MLlib, GraphX, and Spark Streaming, as well as Spark on Yarn, SparkR, testing, and tuning. It is suitable for anyone interested in Big Data development.


Course Background:

Apache Spark™ is a fast and general engine for large-scale data processing. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. You can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

You can run Spark readily using its standalone cluster mode, on EC2, or run it on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.

Write applications quickly in Java, Scala, or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use them interactively from the Scala and Python shells.
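As a taste of those operators, here is a minimal word-count application in Scala; the application name, the local master setting, and the input path are illustrative assumptions, not course material:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local master and input path are placeholders for illustration.
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        val counts = sc.textFile("input.txt")        // read a text file as an RDD of lines
          .flatMap(_.split("\\s+"))                  // split each line into words
          .map(word => (word, 1))                    // pair each word with a count of 1
          .reduceByKey(_ + _)                        // sum the counts per word

        counts.take(10).foreach(println)             // action: triggers the computation
        sc.stop()
      }
    }

The same few operators can also be typed line by line in the spark-shell, where sc is already defined.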

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters ranging from a handful to thousands of nodes.

In the past few years, Databricks, with the help of the Spark community, has contributed many improvements to Apache Spark to improve its performance, stability, and scalability. This enabled Databricks to use Apache Spark to sort 100 TB of data on 206 machines in 23 minutes, which is 3x faster than the previous Hadoop 100 TB result on 2100 machines. Similarly, Databricks sorted 1 PB of data on 190 machines in less than 4 hours, which is over 4x faster than the previous Hadoop 1 PB result on 3800 machines.

 

Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes. Spark enables equally dramatic improvements in time and cost for all Big Data users.

 

Course Introduction:

This course covers almost everything an application developer needs to build diverse Spark applications that fulfill all kinds of business requirements: the architecture of Spark, the Spark programming model, Spark internals, Spark SQL, MLlib, GraphX, Spark Streaming, testing, tuning, Spark on Yarn, JobServer, and SparkR.

In addition, this course covers the essential skills you need to write Scala code for Spark, to help those who are not yet familiar with Scala.

 

Who Needs to Attend

Anyone who is interested in Big Data development;

Hadoop Developers;

Other Big Data Developers.

 

Prerequisites

Be familiar with the basics of object-oriented programming.

Course Outline

 

Day 1

Class 1: The Architecture of Spark

1 Ecosystem of Spark

2 Design of Spark

3 RDD

4  Fault-tolerance in Spark

 

Class 2: Programming with Scala (a short sketch follows this topic list)

1 Classes and Objects in Scala

2 Functional Objects

3 Traits

4 Case Classes and Pattern Matching

5 Collections

6 Implicit Conversions and Parameters

7 Actors and Concurrency
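
As a taste of the Scala topics above, here is a minimal sketch combining a case class hierarchy, pattern matching, and collection operations; the Shape example is an illustrative assumption, not course material:

    // Case classes give constructors, structural equality, and pattern-matching support for free.
    sealed trait Shape
    case class Circle(radius: Double) extends Shape
    case class Rectangle(width: Double, height: Double) extends Shape

    def area(shape: Shape): Double = shape match {
      case Circle(r)       => math.Pi * r * r
      case Rectangle(w, h) => w * h
    }

    val shapes = List(Circle(1.0), Rectangle(2.0, 3.0))
    shapes.map(area).foreach(println)    // collections with higher-order functions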

 

Class 3: Spark Programming Model (a short sketch follows this list)

1 RDD

2 Transformations

3 Actions

4 Lineage

5 Dependencies
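
A spark-shell style sketch of the model (sc is the shell's predefined SparkContext; the data and operators are illustrative assumptions):

    val numbers = sc.parallelize(1 to 100)       // create an RDD
    val evens   = numbers.filter(_ % 2 == 0)     // transformation: lazy, recorded in the lineage
    val squared = evens.map(n => n * n)          // transformation: still nothing executed

    println(squared.toDebugString)               // inspect the lineage (dependency chain)
    println(squared.count())                     // action: schedules a job and returns a result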

 

Class 4: Spark Internals

1 Spark Cluster

2 Job Scheduling

3 DAGScheduler

4 TaskScheduler

5 Task Internal

 

 

 

 


Day 2

Class 5: Broadcasts and Accumulators (a short sketch follows this list)

1 Broadcast Internals

2 Best Practices for Broadcasts

3 Accumulator Internals

4 Best Practices for Accumulators
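
A spark-shell style sketch of both features, assuming the Spark 2.x accumulator API (the lookup table and counter below are illustrative assumptions):

    // Broadcast: ship a read-only lookup table to every executor once.
    val countryNames = sc.broadcast(Map("CN" -> "China", "US" -> "United States"))

    // Accumulator: a counter that tasks add to and only the driver reads.
    val unknownCodes = sc.longAccumulator("unknownCodes")

    val codes = sc.parallelize(Seq("CN", "US", "FR"))
    val resolved = codes.map { code =>
      countryNames.value.getOrElse(code, { unknownCodes.add(1); "unknown" })
    }

    resolved.collect().foreach(println)                      // action: runs the job
    println(s"unknown codes seen: ${unknownCodes.value}")    // read the accumulator on the driver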

 

Class 6: Spark Programming in Action (a build.sbt sketch follows this list)

1 Data Sources: File, HDFS, HBase, S3

2 IDEA

3 Maven

4 sbt

5 Code

6 Deployment
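
For the sbt route, a minimal build.sbt could look like the following; the project name and the Scala and Spark versions are placeholder assumptions, not the versions used in class:

    // build.sbt -- minimal Spark project definition (versions are placeholders)
    name := "spark-course-demo"

    version := "0.1.0"

    scalaVersion := "2.11.12"

    // "provided" because spark-submit supplies the Spark jars on the cluster.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.4.8" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.4.8" % "provided"
    )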

 

Class 7: Deep Dive into the Spark Driver

1 The Secret of SparkContext

2 The Secret of SparkConf

3 The Secret of SparkEnv

 

Class 8: Deep Dive into RDDs

1 DAG

2 Scala RDD Functions

3 Java RDD Functions

4 RDD Tuning

 

 

 

 


Day 3

Class 9: Machine Learning on Spark (a K-Means sketch follows this list)

1 Linear Regression

2 K-Means

3 Collaborative Filtering
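
A spark-shell style K-Means sketch using the RDD-based MLlib API (the toy points and the choice of two clusters are illustrative assumptions):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Toy 2-D points; a real job would load feature vectors from HDFS or another source.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
      Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.8)
    ))

    val model = KMeans.train(points, 2, 20)            // k = 2 clusters, 20 iterations
    model.clusterCenters.foreach(println)              // inspect the learned centroids
    println(model.predict(Vectors.dense(0.05, 0.1)))   // assign a new point to a cluster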

 

Class 10: Graph Computation on Spark (a GraphX sketch follows this list)

1 Table Operators

2 Graph Operators

3 GraphX Algorithms
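
A spark-shell style GraphX sketch touching the three topic areas above (the tiny follower graph is an illustrative assumption):

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    val graph = Graph(vertices, edges)           // build a property graph
    println(graph.numEdges)                      // graph operator
    graph.vertices.collect().foreach(println)    // table operator: vertices as an RDD

    val ranks = graph.pageRank(0.001).vertices   // GraphX algorithm: PageRank
    ranks.collect().foreach(println)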

 

Class 11: Spark SQL (a short sketch follows this list)

1 Parquet, JSON, JDBC

2 DSL

3 SQL on RDD
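
A Spark 1.x style sketch of the DSL and SQL-on-RDD topics (newer releases use SparkSession and createOrReplaceTempView instead; the Person data is an illustrative assumption):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    case class Person(name: String, age: Int)
    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25))).toDF()

    people.registerTempTable("people")                               // SQL on an RDD-backed table
    sqlContext.sql("SELECT name FROM people WHERE age > 26").show()

    people.filter($"age" > 26).select("name").show()                 // the same query via the DSL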

 

Class 12: Spark Streaming (a DStream sketch follows this list)

1 DStream

2 Transformations

3 Checkpointing

4 Tuning
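
A minimal DStream sketch (the 5-second batch interval, the local socket source on port 9999, and the checkpoint directory are illustrative assumptions):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))   // 5-second micro-batches
    ssc.checkpoint("/tmp/spark-checkpoint")          // enable checkpointing

    val lines  = ssc.socketTextStream("localhost", 9999)   // DStream of text lines
    val counts = lines.flatMap(_.split("\\s+"))             // transformations on the DStream
                      .map((_, 1))
                      .reduceByKey(_ + _)
    counts.print()                                           // output operation per batch

    ssc.start()              // start receiving and processing
    ssc.awaitTermination()   // block until stopped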

 

 

 


Day 4

Class 13: Spark on Yarn

1 Internals of Spark on Yarn

2 Best practice of Spark on Yarn

 

Class 14: JobServer

1 Restful Architecture of JobServer

2 JobServer APIs

3 Best Practice of JobServer

 

Class 15: SparkR

1 Programming in R

2 R on Spark

3 Internals of SparkR

4 SparkR API

 

Class 16: Spark Tuning

1 Logs

2 Concurrency

3 Memory

4 GC

5 Serializers

6 Safety

7 Fourteen Tuning Case Studies

 

 

 
