Big Data Learning: Word Count with a MapReduce Program

This article walks through building a WordCount program on Hadoop: the project structure, the pom.xml configuration, the Mapper and Reducer logic, and how to run the job against HDFS and inspect the results.


Project structure

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

  <modelVersion>4.0.0</modelVersion>

  <groupId>com.cyf</groupId>
  <artifactId>MyWordCount</artifactId>
  <packaging>jar</packaging>
  <version>1.0</version>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>2.6.4</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <classpathPrefix>lib/</classpathPrefix>
              <mainClass>cn.itcast.mapreduce.WordCountDriver</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

WordCountMapper.java

package cn.itcast.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;


/**
 * @author AllenWoon
 *         <p>
 *         Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 *         KEYIN: the type of the keys the framework reads in.
 *         With the default input component (InputFormat), the key is the byte offset of a line of text, so it is a long.
 *         <p>
 *         VALUEIN: the type of the values the framework reads in.
 *         With the default input component, the value is the content of one line of text, so it is a String.
 *         <p>
 *         KEYOUT: the key type returned by the user-defined logic; it is determined by the business logic.
 *         In word count we emit the word as the key, so the type is String.
 *         <p>
 *         VALUEOUT: the value type returned by the user-defined logic; it is also determined by the business logic.
 *         In word count we emit the word count as the value, so the type is Integer.
 *         <p>
 *         However, String and Long are plain JDK types and serialize inefficiently. To speed up serialization,
 *         Hadoop defines its own set of writable types.
 *         <p>
 *         Therefore, any data that has to be serialized in a Hadoop program (written to disk or sent over the
 *         network) must use a type that implements Hadoop's serialization framework:
 *         <p>
 *         <p>
 *         Long    -> LongWritable
 *         String  -> Text
 *         Integer -> IntWritable
 *         null    -> NullWritable
 */


public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * This map method is the user-defined business logic called by the framework's MapTask.
     * The MapTask drives the input component (InputFormat) to read (KEYIN, VALUEIN) pairs; every pair
     * that is read triggers one call to this map method.
     * With the default InputFormat, the key is the starting byte offset of a line and the value is the line itself.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Split the line into words and emit (word, 1) for every word.
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }

}
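One practical note: the map method above creates a new Text and a new IntWritable for every word. A common optional refinement, shown here only as a sketch and not part of the original code, is to reuse the output objects across calls; this is safe because Hadoop serializes each pair as soon as context.write returns.

    // Optional sketch: the same mapper logic with reused output objects.
    // These two fields would be declared inside WordCountMapper.
    private final Text outKey = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        for (String word : value.toString().split(" ")) {
            outKey.set(word);           // reuse the same Text instance for every word
            context.write(outKey, one); // safe: the framework serializes the pair immediately
        }
    }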

WordCountReducer.java

package cn.itcast.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;


/***
 * @author AllenWoon
 *         <p>
 *         The ReduceTask is what calls our reduce method.
 *         <p>
 *         Each ReduceTask receives a portion of the data emitted by all MapTasks of the previous (map) phase;
 *         a pair goes to the ReduceTask for which (key.hashCode() % numReduceTasks == this ReduceTask's number).
 *         <p>
 *         When the ReduceTask processes the (k, v) pairs it has received, it calls our reduce method like this:
 *         <p>
 *         it first groups all received (k, v) pairs by key (pairs with equal keys form one group),
 *         <p>
 *         then passes the key of one group to the reduce method's key parameter, and all the values of that
 *         group to the values parameter as an iterator.
 */

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum up all the counts emitted for this word by the map phase.
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        // Emit (word, total count).
        context.write(key, new IntWritable(count));
    }


}
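Because summing counts is associative and commutative, the same reducer class can also serve as a combiner, pre-aggregating counts on the map side before the shuffle. This is an optional optimization that the driver below does not enable; if you want it, a single extra line in the driver is enough:

        // Optional: pre-aggregate (word, 1) pairs on the map side by reusing the reducer as a combiner.
        job.setCombinerClass(WordCountReducer.class);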

WordCountDriver.java

package cn.itcast.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * @author AllenWoon
 *         <p>
 *         This class is the client-side driver that specifies the parameters the wordcount job needs at runtime,
 *         <p>
 *         for example: which class implements the map-phase logic and which implements the reduce-phase logic,
 *         which components read the input data and write the output,
 *         where the wordcount jar is located,
 *         <p>
 *         ....
 *         and any other required parameters.
 */
public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // Tell the framework where the jar that contains our program is located.
        job.setJar("/root/wordcount.jar");

        // Tell the framework which mapper class and which reducer class to use.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Tell the framework the output data types of the map phase ...
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // ... and of the final (reduce) output.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Tell the framework which components read the input data and write the results.
        // TextInputFormat is MapReduce's built-in input component for reading plain text files.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Tell the framework which path holds the input files ...
        FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
        // ... and where to write the output.
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));

        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
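A portability note: hardcoding the jar path and the HDFS directories is fine for this walkthrough, but a common variant (not what the code above does, so treat it as an optional sketch) is to let Hadoop locate the jar from the driver class and to take the input and output paths from the command line:

        // Optional variant: resolve the jar from the driver class and read paths from the arguments.
        job.setJarByClass(WordCountDriver.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

With that change the job would be submitted as hadoop jar wordcount.jar cn.itcast.mapreduce.WordCountDriver /wordcount/input /wordcount/output, and the same jar could be reused against other directories.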

 

 

First create two files, 1.txt and 2.txt, with the following contents.

1.txt

hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello

2.txt

hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello hello aleen hello nana hello city hello ciounty hello
zhangsan helllo lisi hello wangwu hello hello hello 
zhaoliu zhousna hello

 

Create the input directory on HDFS:

hadoop fs -mkdir -p /wordcount/input

 

Put 1.txt and 2.txt into the /wordcount/input directory:

 hadoop fs -put 1.txt 2.txt /wordcount/input
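To double-check the upload you can list the input directory (an optional sanity check; the exact listing depends on your cluster):

hadoop fs -ls /wordcount/input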

 

Upload wordcount.jar to the node from which you will submit the job.

 

Run the job:

hadoop jar wordcount.jar cn.itcast.mapreduce.WordCountDriver

View the generated result file:

hadoop fs -cat /wordcount/output/part-r-00000
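The reduce output is written as one part-r-NNNNN file per reduce task (with the default single reducer that is just part-r-00000), together with an empty _SUCCESS marker. If you are unsure of the file name, list the output directory first:

hadoop fs -ls /wordcount/output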

Reposted from: https://www.cnblogs.com/feifeicui/p/10217409.html

### Background: MapReduce in Big Data System Architecture

#### How MapReduce works

MapReduce is a programming model, together with its implementations, for processing and generating large data sets. It reduces complex, large-scale parallel computation to two functions, `Map` and `Reduce`, so that developers can write efficient data-intensive applications without having to understand the internals of the distributed system.

Concretely, the input data set is divided into smaller pieces (usually called "splits"). Map tasks running on different nodes read these pieces concurrently and apply the user-defined logic to them; the intermediate key-value pairs produced in this phase are shuffled and sorted, then handed to the reduce tasks, which aggregate the results from the different sources into the final output.

#### Architectural characteristics

To improve efficiency and reduce network bandwidth consumption, MapReduce follows the principle of "moving computation close to the data": whenever possible, computation is scheduled on or near the nodes that store replicas of the corresponding data. This design helps avoid the performance bottleneck of frequently shipping large volumes of data between nodes.

In addition, the system is built on a master/slave architecture, with a central coordinator (the JobTracker on the master) and many workers (TaskTrackers on the slaves). When a job is submitted, the former handles resource allocation and progress monitoring, while the latter carry out the actual task execution.

#### Typical applications

Given these characteristics, MapReduce is well suited to scenarios such as:

- **Log analysis**: internet companies generate huge volumes of log records every day. MapReduce can extract valuable information from this unstructured text, for example statistics on user behaviour within a given time window.
- **Search-engine index construction**: pages fetched by a crawler can be parsed with MapReduce to build inverted indexes quickly, speeding up subsequent query responses.
- **Bioinformatics**: genome-sequencing projects often involve comparing and matching petabytes of sequence fragments; MapReduce can significantly shorten the experimental cycle.

```python
def map(key, value):
    # Split one document (value) into words and emit (word, 1) for each.
    words = value.split()
    for word in words:
        yield (word, 1)

def reduce(key, values):
    # Sum all partial counts emitted for the same word.
    sum_count = sum(values)
    yield (key, sum_count)
```

This simple Python pseudocode shows how MapReduce counts word occurrences in a set of documents: the `map()` step breaks each document into individual words and emits a count of 1 with each word as the key; the `reduce()` step then aggregates all contributions for the same word to obtain its total count.