Hadoop Study Notes (4): A Brief Look at Map/Reduce Data Analysis, with an Example: Phone Call Records

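This post shows how to use Hadoop MapReduce to process a large phone call log and find the incoming callers for each phone number. Once the data reaches the TB level, traditional methods become inefficient, whereas Hadoop automatically distributes the data across the cluster nodes, processes it with a Mapper, and optionally aggregates it with a Reducer. The Mapper reads and splits the raw data, handles abnormal records, and emits output; the Reducer merges the records that share the same key.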


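Suppose our cluster or pseudo-distributed Hadoop system is already set up. We usually run the wordcount example from the official site or other material to check that the system works. Assuming wordcount runs without problems, we can start writing M/R programs and begin analyzing data.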

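A Hadoop cluster has other components that need to be installed, but they are not covered here and can be ignored for now. All you need to do is upload the data you want to analyze to HDFS. The remaining components can be learned when you run into them. This post does not go deep into concepts, but the essential concepts and the steps of program execution are things you must understand.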

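Task: analyze the call records and find, for each phone number, which numbers have called it.

  - We have a phone call log that records calls from user A to user B

  - For each phone number, find all the numbers that have called it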

Example: 120   13688893333 13733883333 1393302942

Explanation: 120 received calls from the three numbers that follow it.
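To make the task concrete, here is a minimal single-machine sketch in Python. The post does not specify the input format, so the code assumes each record is one line of the form `caller callee` separated by whitespace; the function name `callers_by_callee` and the sample list are made up for illustration.

```python
from collections import defaultdict

def callers_by_callee(records):
    """Group caller numbers under the number they called."""
    incoming = defaultdict(list)
    for line in records:
        # Assumed record format: "<caller> <callee>"
        caller, callee = line.split()
        incoming[callee].append(caller)
    return dict(incoming)

# Hypothetical sample records built from the numbers in the example above
sample = [
    "13688893333 120",
    "13733883333 120",
    "1393302942 120",
]

for callee, callers in callers_by_callee(sample).items():
    print(callee, " ".join(callers))
```

This works well while the whole log fits in the memory of one machine, which is exactly the limit the rest of the post is about.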

Traditional approaches can also do this, but once the data grows large enough they hit a bottleneck:

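1. Write a C or Java program that extracts the data above directly.

2. Load the data into a database and solve the problem with a plain SELECT.

3. When the data reaches the TB level, however, even simple computations hit a bottleneck with these traditional methods. This is where M/R comes in: Hadoop automatically distributes the data we upload to HDFS across the machines in the cluster.

4. The mapper reads the data line by line. For each line it splits the raw record, emits the data that is needed, and handles abnormal records (bad data can crash the job, so you must decide how to deal with it). The map function runs once for every line of input, and in a job without a reduce step its output is written to HDFS.

5. A reduce function is optional, depending on the case. Without one, the mapper's output goes straight to the output files, so the map output format must match the job's output format.

6. With a reduce function, the system first sends all mapper output records that share the same key to the same reducer and then writes out the reducer's result. The map output format must match the reduce input format.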
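The map, group-by-key, and reduce flow described in steps 4 to 6 can be imitated in plain Python. This is only a local simulation under stated assumptions, not Hadoop API code: the record format `caller callee` is assumed, all function names (`map_line`, `shuffle`, `reduce_group`, `run_job`) are hypothetical, and the number `13900001111` is invented sample data.

```python
from itertools import groupby
from operator import itemgetter

def map_line(line):
    """Map step: split one raw line and emit (callee, caller) pairs.

    Malformed lines are skipped so one bad record cannot crash the run.
    """
    fields = line.split()
    if len(fields) != 2:
        return []  # abnormal record: drop it
    caller, callee = fields
    return [(callee, caller)]

def shuffle(pairs):
    """Stand-in for the framework's shuffle: bring equal keys together."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort by key
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_group(callee, callers):
    """Reduce step: merge all callers of one callee into one output line."""
    return f"{callee}\t{' '.join(callers)}"

def run_job(lines):
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return [reduce_group(key, values) for key, values in shuffle(pairs)]

# Hypothetical input, including one malformed line
lines = [
    "13688893333 120",
    "this line is broken",
    "13733883333 120",
    "13900001111 110",
    "1393302942 120",
]

for out in run_job(lines):
    print(out)
```

Note how the pairs emitted by `map_line` have exactly the shape that `reduce_group` consumes, which is the format constraint from step 6. In this simulation callers keep their input order because Python's sort is stable; on a real cluster the order of values arriving at a reducer is generally not guaranteed.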


What the Mapper does:
1. Split the raw data
2