Read These 100 Papers and Become a Big Data Expert

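This article recommends 100 important papers on big data technology, covering areas from data storage to real-time processing, to help readers gain a deep understanding of the core technologies and trends in the big data field.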


Original English article: https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan

CSDN translation: http://www.youkuaiyun.com/article/2015-07-07/2825148

 

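Anil Madan, Senior Director of Engineering at PayPal, recommends 100 big data papers that span the big data technology stack. It is said that anyone who fully understands all of them can become a top-tier big data expert.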

Links to the 100 papers:

  1. Polyglot storage
  2. BDAS (Spark)
  3. Lambda - Established architecture for a typical data pipeline. More details.
  4. Kappa – An alternative architecture which moves the processing upstream to the Stream layer.
  5. SummingBird – a reference model on bridging the online and traditional processing models.
  6. Data center as a computer – provides a great background on warehouse scale computing.
  7. NOSQL Data Stores – background on a diverse set of key-value, document and column oriented stores.
  8. NoSQL Thesis – great background on distributed systems, first generation NoSQL systems.
  9. Large Scale Data Management – covers the data model, the system architecture and the consistency model, ranging from traditional database vendors to new emerging internet-based enterprises.
  10. Eventual Consistency – background on the different consistency models for distributed systems.
  11. CAP Theorem – a nice background on CAP and its evolution.
  12. Pro parallel DBMS paper
  13. another parallel DBMS paper
  14. MapReduce
  15. SQL on Hadoop
  16. Tachyon and Spark RDD
  17. Google File System – The seminal work on Distributed File Systems which shaped the Hadoop File System.
  18. Hadoop File System – Historical context/architecture on evolution of HDFS.
  19. Ceph File System
  20. An alternative to HDFS. 
  21. Tachyon – An in-memory storage system to handle modern low latency data processing.
  22. Column Oriented vs Row-Stores – good overview of data layout, compression and materialization.
  23. RCFile – Hybrid PAX structure which takes the best of both the column and row oriented stores.
  24. Parquet – column oriented format first covered in Google’s Dremel paper.
  25. ORCFile – an improved column oriented format used by Hive.
  26. Compression – compression techniques and their comparison on the Hadoop ecosystem.
  27. Erasure Codes – background on erasure codes and techniques; improvement on the default triplication on Hadoop to reduce storage cost.
  28. Hadoop.
  29. Dynamo – key-value distributed storage system
  30. Cassandra – Inspired by Dynamo; a multi-dimensional key-value/column oriented data store.
  31. Voldemort – another one inspired by Dynamo, developed at LinkedIn.
  32. BigTable – seminal paper from Google on distributed column oriented data stores.
  33. HBase – while there is no definitive paper, this provides a good overview of the technology.
  34. Hypertable – provides a good overview of the architecture.
  35. CouchDB – a popular document oriented data store.
  36. MongoDB – a good introduction to MongoDB architecture
  37. Neo4j – most popular Graph database.
  38. Titan – open source Graph database under the Apache license.
  39. Megastore – a highly available distributed consistent database. Uses Bigtable as its storage subsystem.
  40. Spanner – Globally distributed synchronously replicated linearizable database which supports SQL access.
  41. MESA – provides consistency, high availability, reliability, fault tolerance and scalability for large data and query volumes.
  42. CockroachDB – An open source version of Spanner (led by former engineers) in active development.
  43. YARN – The next generation Hadoop compute framework.
  44. Mesos – scheduling between multiple diverse cluster computing frameworks.
  45. Capacity Scheduler - introduction to different features of capacity scheduler. 
  46. FairShare Scheduler - introduction to different features of fair scheduler.
  47. Delayed Scheduling - introduction to Delayed Scheduling for FairShare scheduler.
  48. Fair & Capacity schedulers – a survey of Hadoop schedulers.
  49. Paxos – a simple version of the classical paper; used for distributed systems consensus and coordination. 
  50. the classical paper for Paxos
  51. Chubby – Google’s distributed locking service that implements Paxos.
  52. Zookeeper – open source system inspired by Chubby, though it is a general coordination service rather than simply a locking service.
  53. Spark – its popularity and adoption are challenging the traditional Hadoop ecosystem.
  54. Flink – very similar to Spark ecosystem; strength over Spark is in iterative processing.
  55. MapReduce – The seminal paper from Google on MapReduce.
  56. MapReduce Survey – A dated yet good paper; a survey of Map Reduce frameworks.
  57. Pregel – Google’s paper on large scale graph processing
  58. Giraph - large-scale distributed Graph processing system modelled around Pregel
  59. GraphX - graph computation framework that unifies graph-parallel and data parallel computation.
  60. Hama - general BSP computing engine on top of Hadoop
  61. Open source graph processing – survey of open source systems modelled around Pregel BSP.
  62. Stream Processing – A great overview of the distinct real time processing systems 
  63. Storm – Real time big data processing system
  64. Samza - stream processing framework from LinkedIn
  65. Spark Streaming – introduced the micro batch architecture bridging the traditional batch and interactive processing.
  66. Dremel – Google’s paper on how it processes interactive big data workloads, which laid the groundwork for multiple open source SQL systems on Hadoop.
  67. Impala – MPP style processing to make Hadoop performant for interactive workloads.
  68. Drill – An open source implementation of Dremel.
  69. Shark – provides a good introduction to the data analysis capabilities on the Spark ecosystem.
  70. Shark – another great paper which goes deeper into SQL access.
  71. Dryad – Configuring & executing parallel data pipelines using DAG.
  72. Tez – open source implementation of Dryad using YARN.
  73. BlinkDB - enabling interactive queries over data samples and presenting results annotated with meaningful error bars
  74. Druid – a real time OLAP data store; an operationalized time series analytics database.
  75. Pinot – LinkedIn OLAP data store very similar to Druid. 
  76. Pig – Provides a good overview of Pig Latin.
  77. Pig – provides an introduction to building data pipelines using Pig.
  78. Hive – provides an introduction to Hive.
  79. Hive – another good paper to understand the motivations behind Hive at Facebook.
  80. Phoenix – SQL on HBase.
  81. Join Algorithms for Map Reduce – provides a great introduction to different join algorithms on Hadoop. 
  82. Join Algorithms for Map Reduce – another great paper on the different join techniques.
  83. MLlib – Machine learning library on Spark.
  84. SparkR – Distributed R on Spark framework.
  85. Mahout – Machine learning framework on traditional Map Reduce.
  86. Flume – a framework for collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
  87. Sqoop – a tool to move data between Hadoop and Relational data stores.
  88. Kafka – distributed messaging system for data processing
  89. Crunch – library for writing, testing, and running MapReduce pipelines.
  90. Falcon – data management framework that helps automate movement and processing of Big Data.
  91. Cascading – data manipulation through scripting.
  92. Oozie – a workflow scheduler system to manage Hadoop jobs.
  93. HCatalog - a table and storage management layer for Hadoop.
  94. ProtocolBuffers – language neutral serialization format popularized by Google. 
  95. Avro – modeled around Protocol Buffers for the Hadoop ecosystem.
  96. OpenTSDB – a time series metrics system built on top of HBase.
  97. Ambari - system for collecting, aggregating and serving Hadoop and system metrics
  98. YCSB – performance evaluation of NoSQL systems.
  99. GridMix – provides benchmark for Hadoop workloads by running a mix of synthetic jobs
  100. Background on big data benchmarking with the key challenges associated.

Well, I plan to read through these.




 
