1. Topics covered
MapReduce
InputFormat
RecordReader
Splits: by default one input split per HDFS block (split size = block size); a split is a logical division, a block is physical storage
File…
Text…
NLine…
DB…
Mapper
setup
map (the per-record business logic)
cleanup
Combiner
a local, map-side Reducer that pre-aggregates map output
note the applicable scenarios: the operation must tolerate partial aggregation (e.g. sum, max), not operations like average
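The applicability caveat above can be demonstrated in plain Java, without Hadoop (the `CombinerDemo` class name and sample data are made up for illustration): pre-aggregating per split is harmless for sum, but averaging per-split averages gives a wrong answer.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch (no Hadoop) of why a Combiner suits sum but not average.
public class CombinerDemo {
    static int sum(List<Integer> xs) { return xs.stream().mapToInt(Integer::intValue).sum(); }
    static double avg(List<Integer> xs) { return sum(xs) / (double) xs.size(); }

    public static void main(String[] args) {
        // Partial map outputs for the same key, from two splits.
        List<Integer> split1 = Arrays.asList(1, 2, 3);
        List<Integer> split2 = Arrays.asList(10, 20);

        // SUM: combining per-split sums first changes nothing (sum is associative).
        System.out.println("sum: " + (sum(split1) + sum(split2)));             // 36

        // AVERAGE: averaging the per-split averages is WRONG.
        System.out.println("avg of avgs: " + (avg(split1) + avg(split2)) / 2); // 8.5
        System.out.println("true avg:    " + 36 / 5.0);                        // 7.2
    }
}
```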
Partitioner
dispatches each key to a reduce task according to some rule
Hash: the default HashPartitioner, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
Custom
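The default hash rule can be sketched in plain Java (the `HashPartitionDemo` class is hypothetical; the formula is the one Hadoop's default `HashPartitioner` uses):

```java
// Plain-Java sketch of the default hash partitioning rule:
// partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
public class HashPartitionDemo {
    static int getPartition(Object key, int numReduceTasks) {
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, which is what
        // guarantees all its records meet in the same reduce call.
        System.out.println(getPartition("deptno=10", 3));
        System.out.println(getPartition("deptno=10", 3)); // same as above
    }
}
```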
Reducer
setup
reduce
cleanup
OutputFormat
RecordWriter
DB…
Text…
File…
Sorting
Total (global) sort
Per-partition sort
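The difference between the two can be sketched in plain Java (hypothetical `SortDemo` class; the range rule here stands in for what Hadoop's TotalOrderPartitioner does): with hash partitioning each partition is sorted, but the concatenation is not; with range partitioning, concatenating the sorted partitions yields a total order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch: per-partition sort vs. total (global) sort.
public class SortDemo {
    // Distribute keys into 2 buckets by the chosen rule, then sort each bucket
    // (each reducer sorts only its own keys).
    static List<List<Integer>> partitionAndSort(List<Integer> keys, boolean byRange) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < 2; i++) parts.add(new ArrayList<>());
        for (int k : keys) {
            int p = byRange ? (k < 5 ? 0 : 1)                  // range rule: small keys left
                            : (k & Integer.MAX_VALUE) % 2;     // hash rule: default
            parts.get(p).add(k);
        }
        for (List<Integer> part : parts) Collections.sort(part);
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(5, 3, 9, 1, 8, 2);
        // Hash: each partition sorted, concatenation NOT globally ordered.
        System.out.println("hash:  " + partitionAndSort(keys, false)); // [[2, 8], [1, 3, 5, 9]]
        // Range: concatenating the sorted partitions gives a total order.
        System.out.println("range: " + partitionAndSort(keys, true));  // [[1, 2, 3], [5, 8, 9]]
    }
}
```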
2. How Hive implements join on MapReduce
join
select e.empno,e.ename,e.deptno,d.dname from emp e join dept d on e.deptno=d.deptno;
Reduce/Shuffle/Common Join
The join work is done on the reduce side.
In MR terms:
Two tables: emp and dept
Two sets of MapTasks read the emp and dept data
MR must shuffle; the Mapper output key is the ON condition: deptno
Rows with the same deptno are dispatched to the same reduce task
A flag is needed to mark which table each record came from
select e.empno,e.ename,e.deptno,d.dname from emp e join dept d on e.deptno=d.deptno;
Final output fields: empno ename deptno dname flag
Requires a custom serializable class: Info
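The flow above can be simulated in plain Java without Hadoop (hypothetical `ReduceJoinDemo` class; string arrays stand in for the emp/dept files, and a TreeMap stands in for the shuffle's grouping by key):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the reduce-side join: the map phase tags each record
// with a flag and keys it by deptno; the shuffle groups records by deptno;
// the reduce joins within each group.
public class ReduceJoinDemo {
    static List<String> join(String[][] emp, String[][] dept) {
        // "Shuffle": group flagged records by the join key (deptno).
        Map<String, List<String[]>> byDeptno = new TreeMap<>();
        for (String[] e : emp)   // flag "0" = emp record: {flag, empno, ename}
            byDeptno.computeIfAbsent(e[2], k -> new ArrayList<>()).add(new String[]{"0", e[0], e[1]});
        for (String[] d : dept)  // flag "1" = dept record: {flag, dname}
            byDeptno.computeIfAbsent(d[0], k -> new ArrayList<>()).add(new String[]{"1", d[1]});

        // "Reduce": within one deptno group, separate the two sides by flag, then emit.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> group : byDeptno.entrySet()) {
            String dname = null;
            List<String[]> emps = new ArrayList<>();
            for (String[] rec : group.getValue()) {
                if (rec[0].equals("1")) dname = rec[1];
                else emps.add(rec);
            }
            for (String[] e : emps)
                out.add(e[1] + "," + e[2] + "," + group.getKey() + "," + dname);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] emp  = {{"7369", "SMITH", "20"}, {"7499", "ALLEN", "30"}};
        String[][] dept = {{"20", "RESEARCH"}, {"30", "SALES"}};
        join(emp, dept).forEach(System.out::println); // empno,ename,deptno,dname
    }
}
```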
Dirty (malformed) data must be handled
File (MR view): MapTask
Table (Hive view): TableScan
1) Read the data ==> Shuffle ==> Reduce performs the join
2) One join key carries a disproportionate amount of data ==> skew
e.g. 100M rows in total,
90M of which have deptno=10 and all land on a single ReduceTask
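The usual answer to this skew is the map(-side) join: load the small dept table into memory on every map task and join emp as it streams past, skipping the shuffle entirely so no single ReduceTask is overloaded. A plain-Java sketch (hypothetical `MapJoinDemo` class; a HashMap stands in for the broadcast small table):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a map(-side) join: the small table lives in memory,
// the big table streams through -- no shuffle, no reduce phase.
public class MapJoinDemo {
    static List<String> mapJoin(String[][] emp, Map<String, String> deptByNo) {
        List<String> out = new ArrayList<>();
        for (String[] e : emp) {
            String dname = deptByNo.get(e[2]); // pure in-memory lookup per emp row
            out.add(e[0] + "," + e[1] + "," + e[2] + "," + dname);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> dept = new HashMap<>(); // small table, "broadcast" to every map task
        dept.put("10", "ACCOUNTING");
        dept.put("20", "RESEARCH");
        String[][] emp = {{"7782", "CLARK", "10"}, {"7369", "SMITH", "20"}};
        mapJoin(emp, dept).forEach(System.out::println);
    }
}
```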
Reduce-side join implementation (project pom.xml, then the code):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.cc.pxj.wfy</groupId>
<artifactId>phoneWcRuoZe</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<hadoop.version>2.6.0-cdh5.16.2</hadoop.version>
</properties>
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<!-- Hadoop dependency -->
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.17</version>
</dependency>
</dependencies>
<build>
<pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
<plugins>
<!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
<plugin>
<artifactId>maven-clean-plugin</artifactId>
<version>3.1.0</version>
</plugin>
<!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.22.1</version>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-install-plugin</artifactId>
<version>2.5.2</version>
</plugin>
<plugin>
<artifactId>maven-deploy-plugin</artifactId>
<version>2.8.2</version>
</plugin>
<!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
<plugin>
<artifactId>maven-site-plugin</artifactId>
<version>3.7.1</version>
</plugin>
<plugin>
<artifactId>maven-project-info-reports-plugin</artifactId>
<version>3.0.0</version>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
package com.ccj.pxj.join;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Custom serializable bean carrying the joined emp/dept fields across the shuffle.
 */
public class Info implements Writable {
    private int empno;
    private String ename;
    private int deptno;
    private String dname;
    private int flag; // flag marking which table a record came from (emp or dept)

    public Info() {
    }

    public Info(int empno, String ename, int deptno, String dname) {
        this.empno = empno;
        this.ename = ename;
        this.deptno = deptno;
        this.dname = dname;
    }

    public Info(int empno, String ename, int deptno, String dname, int flag) {
        this(empno, ename, deptno, dname);
        this.flag = flag;
    }

    // Serialization order must match deserialization order exactly;
    // String fields must be non-null (use "" for a missing value).
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(empno);
        out.writeUTF(ename);
        out.writeInt(deptno);
        out.writeUTF(dname);
        out.writeInt(flag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.empno = in.readInt();
        this.ename = in.readUTF();
        this.deptno = in.readInt();
        this.dname = in.readUTF();
        this.flag = in.readInt();
    }
}
