Getting Started with Cascading (Part 1)

License: CC 4.0 BY-SA

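Cascading is a Hadoop-based API for building fault-tolerant data processing workflows. It hides the complexity of MapReduce so that developers can build distributed applications quickly. It is particularly suited to large-volume computation and to simplifying the use of Hadoop. This article introduces the basic concepts of Cascading: a Hello World example, the main classes, source and sink taps, and the kinds of schemes and taps.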

1 What is Cascading

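Cascading is an API layered on top of Hadoop for building complex, fault-tolerant data processing workflows. It abstracts away the cluster topology and configuration, so complex distributed applications can be developed quickly without thinking about the underlying MapReduce jobs. Cascading currently relies on Hadoop for storage and execution, but the Cascading API insulates developers from Hadoop's technical details and makes it possible to run a workflow on different computing frameworks without changing its original definition.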

2 Why use Cascading

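Cascading lets organizations quickly develop complex data processing applications on Hadoop. Typical use cases are:

1 Growing data volumes. As the data grows, the required computation grows linearly with it, until a single machine can no longer do the work and distributed computing with Hadoop becomes necessary.

2 Complex computations. Cascading simplifies the use of Hadoop. The Cascading API (and the DSLs built on it) serves as a query language that lets developers and analysts run ad-hoc queries for data mining and exploration.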

3 Hello World

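import java.util.Properties;

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.pipe.SubAssembly;
import cascading.property.AppProps;
import cascading.scheme.hadoop.SequenceFile;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {

    public static void main(String[] args) {
        // Tell Cascading which jar contains the application classes.
        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, Main.class);
        FlowConnector flowConnector = new HadoopFlowConnector(properties);

        Pipe helloPipe = new HelloPipe("HelloPipe");
        String inputPath = args[0];
        String outPath = args[1];

        // Source tap: read the input as lines of text.
        Tap inputPathT = new Hfs(new TextLine(), inputPath);
        // Sink tap: write the three fields as a sequence file.
        Tap outPathT = new Hfs(new SequenceFile(new Fields("time", "ip", "url")), outPath);

        // Bind source, sink and pipe assembly into a flow, then run it inside a cascade.
        Flow f = flowConnector.connect(inputPathT, outPathT, helloPipe);
        Cascade cascade = new CascadeConnector().connect(f);
        cascade.complete();
    }

    private static class HelloPipe extends SubAssembly {

        public HelloPipe(String name) {
            // Split the "line" field of each input record into time, ip and url.
            RegexSplitter regexSplitter = new RegexSplitter(new Fields("time", "ip", "url"));
            Pipe importPipe = new Each(name, new Fields("line"), regexSplitter);

            // MyFileter is a user-defined filter class that this post does not show.
            importPipe = new Each(importPipe, new MyFileter(new Fields("time", "ip", "url")));

            setTails(importPipe);
        }
    }
}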
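To make the per-record behaviour of the Hello World pipe easier to picture, here is a minimal plain-Java sketch of the two steps: split a line into `time`, `ip` and `url`, then filter. It does not use Cascading at all. The class `HelloPipeSketch`, the tab delimiter, and the filter rule (drop records with an empty `url`) are all assumptions made for illustration; the real `MyFileter` logic is not shown in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names are Cascading classes.
class HelloPipeSketch {
    static final String[] FIELDS = {"time", "ip", "url"};

    // Mirrors the splitting step: cut one line on tabs (assumed delimiter)
    // and bind the pieces to the declared field names.
    static Map<String, String> split(String line) {
        String[] parts = line.split("\t", -1);
        if (parts.length != FIELDS.length) {
            throw new IllegalArgumentException("expected 3 tab-separated values: " + line);
        }
        Map<String, String> tuple = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            tuple.put(FIELDS[i], parts[i]);
        }
        return tuple;
    }

    // Hypothetical filter rule: returning true means "remove this record".
    static boolean isRemove(Map<String, String> tuple) {
        return tuple.get("url").isEmpty();
    }

    // Runs both steps over all lines and returns the surviving records.
    static List<Map<String, String>> run(List<String> lines) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (String line : lines) {
            Map<String, String> tuple = split(line);
            if (!isRemove(tuple)) {
                kept.add(tuple);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2013-01-01\t10.0.0.1\t/index.html",
            "2013-01-01\t10.0.0.2\t");
        System.out.println(run(lines));
    }
}
```

In the sketch, the second sample line is dropped because its `url` part is empty. In Cascading the same two steps are expressed as two chained `Each` pipes, as in the `HelloPipe` assembly.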

4 Main classes

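Tap: the "faucet", i.e. where data comes from (a source) or goes to (a sink)

Pipe: the "pipe" that data flows through

Pipe types: Each, Every, Merge, GroupBy, CoGroup, HashJoin

Pipe Assemblies: assemblies built from pipes

Flow: a flow, which connects tap, pipe assemblies, tap

FlowConnector: the flow connector, which connects source taps, sink taps and pipe assemblies into a Flow (in the Hello World example, CascadeConnector then connects Flows into a Cascade)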
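As a rough analogy for how the pipe types divide the work, the sketch below uses plain Java streams instead of Cascading. An `Each` pipe applies an operation to every record independently, while `GroupBy` followed by `Every` groups records on a key and then aggregates each group. The class `PipeAnalogy` and its method names are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy only; no Cascading classes are used here.
class PipeAnalogy {

    // Each-like step: look at every record on its own and keep only the url
    // (records are assumed to be {time, ip, url} arrays).
    static List<String> eachExtractUrl(List<String[]> records) {
        return records.stream()
                .map(r -> r[2])
                .collect(Collectors.toList());
    }

    // GroupBy + Every-like step: group on the url, then reduce each group to a count.
    static Map<String, Long> groupAndCount(List<String> urls) {
        return urls.stream()
                .collect(Collectors.groupingBy(u -> u, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] {"t1", "10.0.0.1", "/a"},
            new String[] {"t2", "10.0.0.2", "/b"},
            new String[] {"t3", "10.0.0.1", "/a"});
        System.out.println(groupAndCount(eachExtractUrl(records))); // prints {/a=2, /b=1}
    }
}
```

The analogy stops at a single input stream. `Merge`, `CoGroup` and `HashJoin` combine several streams and have no counterpart in this sketch.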

5 Source and Sink Taps

Schemes

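TextLine: lines of plain text

TextDelimited: text whose fields are separated by a configurable delimiter. TextLine can be viewed as the special case where the only delimiter is the line break ("\n"), so each line stays a single field.

SequenceFile: a sequence file based on Hadoop

WritableSequenceFile: a sequence file holding custom Hadoop `Writable` types (see the table below)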

| Description | Cascading local platform | Hadoop platform |
| --- | --- | --- |
| Package name | `cascading.scheme.local` | `cascading.scheme.hadoop` |
| Read lines of text | `TextLine` | `TextLine` |
| Read delimited text (CSV, TSV, etc) | `TextDelimited` | `TextDelimited` |
| Cascading proprietary efficient binary | | `SequenceFile` |
| External Hadoop application binary (custom `Writable` type) | | `WritableSequenceFile` |
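The difference between the two text schemes can be sketched in plain Java: a TextLine-style read keeps the whole line in one field, while a TextDelimited-style read splits the line into named fields. This is only an illustration of the idea, not the Cascading implementation. `SchemeSketch` and its methods are hypothetical names, and the sketch leaves out the position field that the real TextLine scheme also emits.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java illustration only; these are not Cascading scheme classes.
class SchemeSketch {

    // TextLine-like: the whole line becomes a single field named "line".
    static Map<String, String> asTextLine(String line) {
        Map<String, String> tuple = new LinkedHashMap<>();
        tuple.put("line", line);
        return tuple;
    }

    // TextDelimited-like: split on a literal delimiter and name each part.
    static Map<String, String> asTextDelimited(String line, String delimiter, List<String> fieldNames) {
        String[] parts = line.split(Pattern.quote(delimiter), -1);
        if (parts.length != fieldNames.size()) {
            throw new IllegalArgumentException("expected " + fieldNames.size() + " values: " + line);
        }
        Map<String, String> tuple = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            tuple.put(fieldNames.get(i), parts[i]);
        }
        return tuple;
    }

    public static void main(String[] args) {
        String line = "2013-01-01,10.0.0.1,/index.html";
        System.out.println(asTextLine(line));
        System.out.println(asTextDelimited(line, ",", Arrays.asList("time", "ip", "url")));
    }
}
```

This is why the Hello World example needs a splitting step after reading with `TextLine`: the source hands over one `line` field, and the pipe has to break it into `time`, `ip` and `url` itself.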