• The Motivation For Hadoop
· Problems with traditional large-scale systems
· Requirements for a new approach
• Hadoop Basic Concepts
· An Overview of Hadoop
· The Hadoop Distributed File System
· How MapReduce Works
· Anatomy of a Hadoop Cluster
· Other Hadoop Ecosystem Components
• Writing a MapReduce Program
· The MapReduce Flow
· Examining a Sample MapReduce Program
· Basic MapReduce API Concepts
· The Driver Code
· The Mapper
· The Reducer
· Hadoop’s Streaming API
· Using Eclipse for Rapid Development
• Integrating Hadoop Into The Workflow
· Relational Database Management Systems
· Storage Systems
· Creating workflows with Oozie
· Importing Data from RDBMSs With Sqoop
· Importing Real-Time Data with Flume
· Accessing HDFS Using FuseDFS and Hoop
• Delving Deeper Into The Hadoop API
· Using Combiners
· Using LocalJobRunner Mode for Faster Development
· Reducing Intermediate Data with Combiners
· The configure and close methods for MapReduce Setup and Teardown
· Writing Partitioners for Better Load Balancing
· Directly Accessing HDFS
· Using The Distributed Cache
• Using Hive and Pig
· Hive Basics
· Pig Basics
• Common MapReduce Algorithms
· Sorting and Searching
· Indexing
· Machine Learning with Mahout
· Term Frequency - Inverse Document Frequency
· Word Co-Occurrence
• Practical Development Tips and Techniques
· Testing with MRUnit
· Debugging MapReduce Code
· Using LocalJobRunner Mode for Easier Debugging
· Eclipse development techniques
· Retrieving Job Information with Counters
· Logging
· Splittable File Formats
· Determining the Optimal Number of Reducers
· Map-Only MapReduce Jobs
· Implementing Multiple Mappers using ChainMapper
• More Advanced MapReduce Programming
· Custom Writables and WritableComparables
· Saving Binary Data using SequenceFiles and Avro Files
· Creating InputFormats and OutputFormats
• Joining Data Sets in MapReduce Jobs
· Map-Side Joins
· The Secondary Sort
· Reduce-Side Joins
• Graph Manipulation in Hadoop
· Introduction to graph techniques
· Representing Graphs in Hadoop
· Implementing a sample algorithm: Single Source Shortest Path
• Creating Workflows with Oozie
· The Motivation for Oozie
· Oozie’s Workflow Definition Format
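This article covers Hadoop's basic concepts, the Hadoop Distributed File System, how MapReduce works, and the structure of a cluster, and shows how to use Eclipse for rapid development. It then explains how to integrate Hadoop into the workflow, including interaction with relational database management systems, real-time data processing, and ways of accessing HDFS. It goes deeper into advanced use of the Hadoop API: using combiners, LocalJobRunner mode, reducing intermediate data, the configure and close methods, writing partitioners, accessing HDFS directly, and using the distributed cache. It also discusses Hive and Pig, and offers practical development tips and testing strategies, such as testing with MRUnit, debugging MapReduce code, easier debugging in LocalJobRunner mode, Eclipse development techniques, retrieving job information with counters, logging, splittable file formats, determining the optimal number of reducers, map-only jobs, and using multiple mappers.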