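Original English article: https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan
CSDN translation: http://www.youkuaiyun.com/article/2015-07-07/2825148
Anil Madan, Senior Director of Engineering at PayPal, recommends 100 big data papers that cover the big data technology stack. Reportedly, anyone who understands all of them can become a top big data expert.
Links to the 100 papers: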
- Polyglot storage
- BDAS (Spark)
- Lambda - Established architecture for a typical data pipeline. More details.
- Kappa – an alternative architecture that moves the processing upstream to the stream layer.
- Summingbird – a reference model for bridging the online and traditional processing models.
- Data center as a computer – provides a great background on warehouse-scale computing.
- NoSQL Data Stores – background on a diverse set of key-value, document, and column-oriented stores.
- NoSQL Thesis – great background on distributed systems and first-generation NoSQL systems.
- Large Scale Data Management – covers the data model, the system architecture, and the consistency model, ranging from traditional database vendors to new emerging internet-based enterprises.
- Eventual Consistency – background on the different consistency models for distributed systems.
- CAP Theorem – a nice background on CAP and its evolution.
- Pro parallel DBMS paper
- another parallel DBMS paper
- MapReduce
- SQL on Hadoop
- Tachyon and Spark RDD
- Google File System – the seminal work on distributed file systems, which shaped the Hadoop File System.
- Hadoop File System – historical context and architecture on the evolution of HDFS.
- Ceph File System – an alternative to HDFS.
- Tachyon – an in-memory storage system to handle modern low-latency data processing.
- Column Oriented vs Row-Stores – good overview of data layout, compression, and materialization.
- RCFile – hybrid PAX structure which takes the best of both the column-oriented and row-oriented stores.
- Parquet – column-oriented format first covered in Google's Dremel paper.
- ORCFile – an improved column-oriented format used by Hive.
- Compression – compression techniques and their comparison on the Hadoop ecosystem.
- Erasure Codes – background on erasure codes and techniques; an improvement on the default triplication in Hadoop to reduce storage cost.
- Dynamo – key-value distributed storage system
- Cassandra – Inspired by Dynamo; a multi-dimensional key-value/column oriented data store.
- Voldemort – another one inspired by Dynamo, developed at LinkedIn.
- BigTable – seminal paper from Google on distributed column oriented data stores.
- HBase – while there is no definitive paper, this provides a good overview of the technology.
- Hypertable – provides a good overview of the architecture.
- CouchDB – a popular document oriented data store.
- MongoDB – a good introduction to MongoDB architecture
- Neo4j – the most popular graph database.
- Titan – open source Graph database under the Apache license.
- Megastore – a highly available distributed consistent database. Uses Bigtable as its storage subsystem.
- Spanner – globally distributed, synchronously replicated, linearizable database which supports SQL access.
- MESA – provides consistency, high availability, reliability, fault tolerance and scalability for large data and query volumes.
- CockroachDB – An open source version of Spanner (led by former engineers) in active development.
- YARN – The next generation Hadoop compute framework.
- Mesos – scheduling between multiple diverse cluster computing frameworks.
- Capacity Scheduler - introduction to different features of capacity scheduler.
- FairShare Scheduler - introduction to different features of fair scheduler.
- Delayed Scheduling - introduction to Delayed Scheduling for FairShare scheduler.
- Fair & Capacity schedulers – a survey of Hadoop schedulers.
- Paxos – a simple version of the classical paper; used for distributed systems consensus and coordination.
- the classical paper for Paxos
- Chubby – Google’s distributed locking service that implements Paxos.
- ZooKeeper – open source system inspired by Chubby, though it is a general coordination service rather than simply a locking service.
- Spark – its popularity and adoption are challenging the traditional Hadoop ecosystem.
- Flink – very similar to the Spark ecosystem; its strength over Spark is in iterative processing.
- MapReduce – The seminal paper from Google on MapReduce.
- MapReduce Survey – a dated yet good paper; a survey of MapReduce frameworks.
- Pregel – Google’s paper on large scale graph processing
- Giraph - large-scale distributed Graph processing system modelled around Pregel
- GraphX - graph computation framework that unifies graph-parallel and data parallel computation.
- Hama - general BSP computing engine on top of Hadoop
- Open source graph processing – survey of open source systems modelled around Pregel BSP.
- Stream Processing – A great overview of the distinct real time processing systems
- Storm – Real time big data processing system
- Samza - stream processing framework from LinkedIn
- Spark Streaming – introduced the micro batch architecture bridging the traditional batch and interactive processing.
- Dremel – Google’s paper on how it processes interactive big data workloads, which laid the groundwork for multiple open source SQL systems on Hadoop.
- Impala – MPP-style processing to make Hadoop performant for interactive workloads.
- Drill – an open source implementation of Dremel.
- Shark – provides a good introduction to the data analysis capabilities on the Spark ecosystem.
- Shark – another great paper which goes deeper into SQL access.
- Dryad – Configuring & executing parallel data pipelines using DAG.
- Tez – open source implementation of Dryad using YARN.
- BlinkDB - enabling interactive queries over data samples and presenting results annotated with meaningful error bars
- Druid – a real time OLAP data store. Operationalized time series analytics databases
- Pinot – LinkedIn OLAP data store very similar to Druid.
- Pig – Provides a good overview of Pig Latin.
- Pig – provides an introduction to building data pipelines using Pig.
- Hive – provides an introduction to Hive.
- Hive – another good paper to understand the motivations behind Hive at Facebook.
- Phoenix – SQL on HBase.
- Join Algorithms for Map Reduce – provides a great introduction to different join algorithms on Hadoop.
- Join Algorithms for Map Reduce – another great paper on the different join techniques.
- MLlib – machine learning library on Spark.
- SparkR – Distributed R on Spark framework.
- Mahout – Machine learning framework on traditional Map Reduce.
- Flume – a framework for collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
- Sqoop – a tool to move data between Hadoop and relational data stores.
- Kafka – distributed messaging system for data processing
- Crunch – library for writing, testing, and running MapReduce pipelines.
- Falcon – data management framework that helps automate movement and processing of Big Data.
- Cascading – data manipulation through scripting.
- Oozie – a workflow scheduler system to manage Hadoop jobs.
- HCatalog - a table and storage management layer for Hadoop.
- ProtocolBuffers – language neutral serialization format popularized by Google.
- Avro – modeled around Protocol Buffers for the Hadoop ecosystem.
- OpenTSDB – a time series metrics system built on top of HBase.
- Ambari - system for collecting, aggregating and serving Hadoop and system metrics
- YCSB – performance evaluation of NoSQL systems.
- GridMix – provides benchmark for Hadoop workloads by running a mix of synthetic jobs
- Background on big data benchmarking with the key challenges associated.
Hmm, I plan to read through these.