Spark

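These notes center on distributed machine learning: they point out that communication overhead grows with the number of machines and list solutions such as P2P. They cover Spark's deployment, scheduling, and fault-tolerance mechanisms, weigh Spark's pros and cons, compare them with the characteristics of MapReduce, and explain why RDDs in Spark are immutable and what the lineage graph is for.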

### Distributed Machine Learning

 

  • Problem - Scale out - more iterations/second - increase throughput
    1. The problem: communication overhead scales badly with the number of machines; it grows as more machines are added
    2. Solution: P2P
      1. Higher network overhead, because there are more hops between nodes and so longer transmission time. But less bandwidth is needed - so scale-out is easier
      2. Data center: put all user nodes under a data center, which leads to central monitoring problems.
  • (& Solution 2) Problem - BSP consistency - barrier, wait for stragglers.
    1. Nodes can accept slightly stale state and still converge
    2. N-bounded delay BSP
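
The bounded-delay idea above can be sketched as a simple gate: a worker may start its next iteration only if it is at most N iterations ahead of the slowest worker. This is a minimal illustration, not any particular system's implementation; the function name `may_proceed` and the list of per-worker clocks are hypothetical. With N = 0 the check reduces to a plain BSP barrier.

```python
def may_proceed(worker_clock, all_clocks, max_staleness):
    """Return True if the worker may start its next iteration.

    worker_clock: iterations this worker has completed so far.
    all_clocks: completed-iteration counts of every worker (including this one).
    max_staleness: the bound N; 0 means a strict BSP barrier.
    """
    slowest = min(all_clocks)
    return worker_clock - slowest <= max_staleness


clocks = [5, 5, 3]  # hypothetical cluster: the third worker is a straggler

# Strict BSP (N = 0): the fast workers must wait for the straggler.
bsp_decisions = [may_proceed(c, clocks, 0) for c in clocks]

# 2-bounded delay: everyone within 2 iterations of the slowest may continue.
bounded_decisions = [may_proceed(c, clocks, 2) for c in clocks]
```

Here the two fast workers are blocked under strict BSP but allowed to continue under the 2-bounded variant, which is the sense in which slightly stale state removes the wait for stragglers.
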
  • Problems (of Spark)
    1. Lineage grows very large: checkpoint manually to HDFS
    2. Immutable RDDs cannot support randomness: recomputing from lineage needs deterministic transformations
    3. When a job needs a lot of memory, there may not be enough!
    4. Overhead (time and space) of copying data, because there is no mutate-in-place
  • Deployment
    1. Master Server
    2. Cluster manager
    3. Worker nodes
  • Scheduling
    1. Partitioning/Cache-aware scheduling to minimize shuffles (data skew)
  • TODO: https://www.quora.com/What-is-the-difference-between-HDFS-and-GFS
  • Lazy evaluation:
    1. 3 kinds of operations:
      1. Transformation (e.g. map)
      2. Persist (cache in memory)
      3. Action (e.g. reduce)
    2. Execution (evaluation) is triggered by actions, not by transformations (maps)
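
The split between transformations, persist, and actions can be illustrated with a toy class in plain Python. This is only a sketch of the idea, not Spark's API: `LazyDataset` and its methods are hypothetical names that mimic `map`, `persist`, and `reduce`, and unlike Spark there is no partitioning or distribution.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self._source = source    # input records
        self._ops = tuple(ops)   # pending transformations, oldest first
        self._cached = None      # materialized result, kept only if persisted
        self._persist = False

    def map(self, fn):
        # Transformation: only extends the recorded plan, runs nothing.
        return LazyDataset(self._source, self._ops + (fn,))

    def persist(self):
        # Mark the dataset so the first action keeps its result in memory.
        self._persist = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        data = list(self._source)
        for fn in self._ops:
            data = [fn(x) for x in data]
        if self._persist:
            self._cached = data
        return data

    def reduce(self, fn):
        # Action: this is the point where evaluation is triggered.
        return reduce(fn, self._compute())


calls = []  # records every time the mapped function really executes


def square(x):
    calls.append(x)
    return x * x


squares = LazyDataset([1, 2, 3]).map(square).persist()
calls_before_action = len(calls)                  # nothing has run yet
total = squares.reduce(lambda a, b: a + b)        # the action runs the map
calls_after_action = len(calls)
total_again = squares.reduce(lambda a, b: a + b)  # served from the cache
calls_after_second_action = len(calls)
```

The `calls` list makes the laziness observable: it stays empty after `map`, fills on the first `reduce`, and does not grow on the second `reduce` because the persisted result is reused.
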
  • (HDFS) GFS's fault tolerance
    1. Chunk-server fault tolerance
      1. The master grants a lease to the primary chunk replica (only during a modification operation!)
      2. The lease is revoked if the file is being renamed or deleted!
      3. The lease is updated each time a new one is assigned (the chunk version number is increased when a new lease is granted)
      4. The lease is refreshed every 60 seconds if the modification is not finished
    2. Why lease/version number?
      1. Network partition
      2. Primary failed - TODO: what if the primary fails?
      3. Revoke the lease when it expires or when the file is renamed/deleted
      4. Detect an outdated chunk server by its version number, and treat that server as failed (e.g. it was shut down during lease renewal).
    3. Master fault tolerance
      1. Use master-slave replication
      2. Use a WAL (write-ahead log)
      3. Log cannot be too long -- why?
      4. Checkpoint periodically
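
One reason the log cannot grow without bound is recovery time: after a crash the master rebuilds its state by replaying the log, so the replay cost grows with the log length, and a checkpoint lets the replay start from a recent snapshot instead of from the beginning. The sketch below illustrates that relationship under simplifying assumptions: `ToyMaster` is a hypothetical class, and the "log" and "checkpoint" are in-memory Python objects, whereas a real master keeps both on persistent, replicated storage.

```python
class ToyMaster:
    """In-memory metadata store with a write-ahead log and checkpoints."""

    def __init__(self):
        self.state = {}       # live metadata, e.g. file name -> chunk ids
        self.checkpoint = {}  # snapshot of the state at the last checkpoint
        self.log = []         # operations recorded since the last checkpoint

    def put(self, key, value):
        # Write-ahead: record the operation before applying it.
        self.log.append((key, value))
        self.state[key] = value

    def take_checkpoint(self):
        # Snapshot the state; older log entries are no longer needed.
        self.checkpoint = dict(self.state)
        self.log.clear()

    def recover(self):
        """Rebuild the state after a crash; return the number of replayed entries."""
        self.state = dict(self.checkpoint)
        for key, value in self.log:
            self.state[key] = value
        return len(self.log)


master = ToyMaster()
for i in range(1000):
    master.put(f"/file{i}", [i])
master.take_checkpoint()
master.put("/new", [1000])

master.state = {}            # simulate losing the in-memory state
replayed = master.recover()  # replays only what was logged after the checkpoint
```

Without the checkpoint, recovery here would replay all 1001 operations; with it, only the single entry written after the snapshot is replayed.
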
  • Why is a Spark RDD immutable?
    1. Consistency model: the same RDD can be cached and shared across nodes.
    2. Enables lineage
      1. An RDD can be recreated at any time
      2. Stricter, in fact: transformations need to be deterministic functions of the input
    3. Compatibility with the HDFS storage interface, which is append-only. Modification (overwrite) is not allowed.
  • Spark's fault tolerance (recovery) is efficient!!!
    1. Uses coarse-grained operations - stores a lineage graph instead of the real data.
    2. In contrast to Hadoop/GFS, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many (small-size partition) data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data. If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication.
    3. Source: https://www.quora.com/What-is-the-difference-between-fine-grained-and-coarse-grained-transformation-in-context-of-Spark
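
The recovery path described above (recompute just the lost partition from its lineage) can be sketched in plain Python. This is a toy model under stated assumptions, not Spark code: `ToyRDD` is a hypothetical class, it only handles narrow (per-partition) dependencies such as map and filter, it computes partitions eagerly for simplicity, and the base dataset is assumed to live in stable storage so it is never lost.

```python
class ToyRDD:
    """Partitions plus the lineage (parent and function) needed to rebuild them."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # the ToyRDD this one was derived from
        self.fn = fn                  # per-partition function applied to the parent

    def map(self, f):
        return self._derive(lambda part: [f(x) for x in part])

    def filter(self, pred):
        return self._derive(lambda part: [x for x in part if pred(x)])

    def _derive(self, fn):
        # Only the function and the parent are logged as lineage, not a data copy.
        new_parts = [fn(p) for p in self.partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)

    def get_partition(self, i):
        if self.partitions[i] is None:
            # Lost: recompute only partition i from the parent's partition i.
            self.partitions[i] = self.fn(self.parent.get_partition(i))
        return self.partitions[i]


base = ToyRDD([[1, 2, 3], [4, 5, 6]])  # assumed to be in stable storage
result = base.map(lambda x: x * 10).filter(lambda x: x % 20 == 0)

result.partitions[1] = None         # simulate losing one partition
result.parent.partitions[1] = None  # and the intermediate partition it came from
recovered = result.get_partition(1)
```

Recovery walks the lineage back to the nearest surviving ancestor and replays the recorded functions for partition 1 only; partition 0 is never touched. The replay gives the same answer only because the functions are deterministic.
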
  • Why weren't in-memory computation and data sharing used before?
    1. Because there was no efficient way to provide fault tolerance for in-memory computation; you had to
      1. Checkpoint often
      2. Replicate data across nodes
      3. Log each operation to node-local persistent storage
    2. The latency is still the same.
  • Spark: Pros and Cons
    1. Pros
      1. Fast response time!
      2. In-memory computation
      3. Data sharing and fault tolerance???
    2. Cons
      1. Cannot handle fine-grained operations
      2. Cannot handle tasks whose datasets are too large to fit in memory
      3. Cannot handle tasks that require very high efficiency - SIMD or GPU
  • What is Spark?
    1. Berkeley's extensions to Hadoop
    2. Goal 1: keep and share data sets in main memory
      1. Which leads to the problem of fault tolerance
      2. Which is solved by lineage and small partitions
    3. Goal 2: Unified computation abstraction
      1. Which makes iterations viable (local work + message passing = BSP = bulk synchronous parallel)
      2. Which leads to the problem of how to balance ease of use and generality
  • MapReduce: pros and cons
    1. Pros:
      1. High throughput
      2. Fault-tolerance
    2. Cons:
      1. Response time (latency) is high due to the I/O penalty, so interactive data analysis is not possible
      2. Iterative applications (e.g. ML) are slow: 90% of the time is spent on I/O and network
      3. The abstraction is not expressive enough; different use cases now have their own separate applications.
  • What is lineage in Spark?
    1. The dependencies between RDDs are logged in a graph -- the sequence of operations -- the lineage graph
    2. Calling "toDebugString" on an RDD shows its lineage graph
    3. Enables fault tolerance
      1. With the lineage graph, we can recompute partitions that are missing or damaged due to node failure.
    4. Aliases:
      1. RDD operator graph
      2. RDD dependency graph
      3. The output of applying transformations in Spark: a logical execution plan.
    5. Source: https://data-flair.training/blogs/rdd-lineage/