# **Complete Implementation Plan: A "Dead Startups" Big Data Analysis System on a Distributed Hadoop Cluster**
---
## ✅ 1. Environment
| Role | IP Address | Hostname | Services |
|------|-----------|----------|----------|
| Master node | `10.23.175.100` | master | NameNode, ResourceManager, Hive Metastore |
| Slave1 node | `10.23.175.101` | slave1 | DataNode, NodeManager |
| Slave2 node | `10.23.175.102` | slave2 | DataNode, NodeManager |
- OS: CentOS 7.x (minimal install)
- Hadoop: 3.3.6
- Java: OpenJDK 1.8
- Hive: 3.1.3
- MySQL: runs on the master node
- DataX: deployed on the master node to export results
---
## ✅ 2. Preparation (all nodes)
### 2.1 Install base tools and Python 3
```bash
# run on all nodes
yum update -y
yum install -y vim wget net-tools git epel-release python3 python3-pip
```
Verify:
```bash
python3 --version
pip3 --version
```
Optionally create symlinks (do not touch `/usr/bin/python`, which yum depends on):
```bash
ln -s /usr/bin/python3 /usr/local/bin/python
ln -s /usr/bin/pip3 /usr/local/bin/pip
```
---
### 2.2 Set hostnames and IP mappings (all nodes)
#### Set each machine's hostname:
**On master:**
```bash
hostnamectl set-hostname master
```
**On slave1:**
```bash
hostnamectl set-hostname slave1
```
**On slave2:**
```bash
hostnamectl set-hostname slave2
```
#### Configure `/etc/hosts` identically on all three machines
```bash
vim /etc/hosts
```
Add the following lines:
```
10.23.175.100 master
10.23.175.101 slave1
10.23.175.102 slave2
```
Save and exit.
---
### 2.3 Disable the firewall and SELinux
```bash
systemctl stop firewalld
systemctl disable firewalld
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
```
---
### 2.4 Create the hadoop user (all nodes) and set up passwordless SSH from master
Create the user on **every node** (all later Hadoop steps run as this user):
```bash
useradd hadoop
echo "hadoop:your_password" | chpasswd
mkdir -p /home/hadoop/.ssh
chown -R hadoop:hadoop /home/hadoop/.ssh
```
On master, switch to the hadoop user:
```bash
su - hadoop
```
Generate a key pair:
```bash
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```
Copy the public key to the other nodes (the hosts must be reachable and the hadoop user must already exist there):
```bash
ssh-copy-id slave1
ssh-copy-id slave2
```
Test passwordless login:
```bash
ssh slave1 date
ssh slave2 hostname
```
> Both commands must work without a password prompt!
---
## ✅ 3. Install Java and Hadoop (master and slaves)
### 3.1 Install OpenJDK 8
```bash
yum install -y java-1.8.0-openjdk-devel
java -version
```
Set the `JAVA_HOME` environment variable on all nodes (run these lines as the hadoop user):
```bash
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk' >> ~/.bash_profile
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bash_profile
source ~/.bash_profile
```
---
### 3.2 Download and extract Hadoop (on master, then distribute to the slaves)
```bash
su - hadoop
cd ~
mkdir -p software apps data scripts
cd software
# download Hadoop
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -zxvf hadoop-3.3.6.tar.gz -C ../apps/
mv ../apps/hadoop-3.3.6 ../apps/hadoop
```
Set environment variables:
```bash
echo 'export HADOOP_HOME=/home/hadoop/apps/hadoop' >> ~/.bash_profile
echo 'export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH' >> ~/.bash_profile
source ~/.bash_profile
```
---
### 3.3 Distribute Hadoop to the slave nodes
```bash
# create the target directory on the slaves first
ssh slave1 "mkdir -p /home/hadoop/apps"
ssh slave2 "mkdir -p /home/hadoop/apps"
scp -r ../apps/hadoop slave1:/home/hadoop/apps/
scp -r ../apps/hadoop slave2:/home/hadoop/apps/
# sync the environment variables
scp ~/.bash_profile slave1:~/
scp ~/.bash_profile slave2:~/
```
On slave1 and slave2, run:
```bash
source ~/.bash_profile
```
---
## ✅ 4. Configure the Fully Distributed Hadoop Cluster
### 4.1 Edit `core-site.xml` (master)
```bash
vim $HADOOP_HOME/etc/hadoop/core-site.xml
```
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/data/hadoop/tmp</value>
  </property>
</configuration>
```
---
### 4.2 Edit `hdfs-site.xml`
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/data/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/data/hadoop/datanode</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
```
---
### 4.3 Edit `mapred-site.xml`
```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
---
### 4.4 Edit `yarn-site.xml`
```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
```
---
### 4.5 Edit the `workers` file (formerly `slaves`)
```bash
vim $HADOOP_HOME/etc/hadoop/workers
```
Content:
```
slave1
slave2
```
---
### 4.6 Set `JAVA_HOME` in `hadoop-env.sh` and distribute the configuration
The Hadoop start scripts do not read `~/.bash_profile` over SSH, so declare Java explicitly, then copy the configuration to the slave nodes:
```bash
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk' >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
cd $HADOOP_HOME/etc/hadoop
scp *.xml workers hadoop-env.sh slave1:$HADOOP_HOME/etc/hadoop/
scp *.xml workers hadoop-env.sh slave2:$HADOOP_HOME/etc/hadoop/
```
---
## ✅ 5. Format and Start HDFS/YARN
### 5.1 Create the local directory structure
On master:
```bash
mkdir -p /home/hadoop/data/hadoop/{tmp,namenode,datanode}
```
On slave1 and slave2:
```bash
mkdir -p /home/hadoop/data/hadoop/{tmp,datanode}
```
---
### 5.2 Format the NameNode (on master, first run only)
```bash
hdfs namenode -format
```
The format succeeded if the output contains `Storage directory ... has been successfully formatted.`
---
### 5.3 Start the cluster services
```bash
# start HDFS
start-dfs.sh
# start YARN
start-yarn.sh
# check running processes
jps
```
**master should show:**
- NameNode
- ResourceManager
- SecondaryNameNode (if configured)
**slave1/slave2 should show:**
- DataNode
- NodeManager
---
### 5.4 Verify via the Web UIs
- HDFS NameNode UI: [http://10.23.175.100:9870](http://10.23.175.100:9870)
- YARN ResourceManager UI: [http://10.23.175.100:8088](http://10.23.175.100:8088)
---
## ✅ 6. Upload and Clean the Data
### 6.1 Upload the file with XFTP
Create the local working directories on master, then upload `com(all).csv` into the raw directory:
```bash
mkdir -p /home/hadoop/data/raw /home/hadoop/data/cleaned
```
Rename the file to avoid the parentheses in its name:
```bash
mv /home/hadoop/data/raw/com\(all\).csv /home/hadoop/data/raw/com_all.csv
```
---
### 6.2 Data cleaning script (Python 3)
Install pandas if it is not already present (`pip3 install pandas`), then create the cleaning script:
```bash
vim /home/hadoop/scripts/clean_companies.py
```
```python
# clean_companies.py  usage: python3 clean_companies.py <input_csv> <output_csv>
import sys
import pandas as pd

# skip malformed rows instead of aborting
df = pd.read_csv(sys.argv[1], on_bad_lines='skip', encoding='utf-8')

# drop fully empty columns, and rows missing both company name and category
df.dropna(how='all', axis=1, inplace=True)
df.dropna(subset=['com_name', 'cat'], how='all', inplace=True)

# normalize numeric fields; write live_days as an integer so MapReduce/Hive can parse it
df['total_money'] = pd.to_numeric(df['total_money'], errors='coerce').fillna(0)
df['live_days'] = pd.to_numeric(df['live_days'], errors='coerce').fillna(-1).astype(int)

# normalize dates to YYYY-MM-DD strings (unparseable dates become NaN and are filled below)
df['born_data'] = pd.to_datetime(df['born_data'], errors='coerce').dt.strftime('%Y-%m-%d')
df['death_data'] = pd.to_datetime(df['death_data'], errors='coerce').dt.strftime('%Y-%m-%d')

df.fillna('NULL', inplace=True)
df.to_csv(sys.argv[2], index=False, encoding='utf-8')
print("✅ Cleaned data saved.")
```
Run the cleaning step:
```bash
python3 /home/hadoop/scripts/clean_companies.py \
/home/hadoop/data/raw/com_all.csv \
/home/hadoop/data/cleaned/com_cleaned.csv
```
---
### 6.3 Upload to HDFS
```bash
hdfs dfs -mkdir -p /input/death_companies
hdfs dfs -put /home/hadoop/data/cleaned/com_cleaned.csv /input/death_companies/
```
Verify:
```bash
hdfs dfs -ls /input/death_companies/
hdfs dfs -cat /input/death_companies/com_cleaned.csv | head -3
```
---
## ✅ 7. MapReduce Offline Batch Analysis
### MR Task 1: Top 10 industries by number of failed companies
#### Mapper
```java
// IndustryCountMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IndustryCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text industry = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // skip the CSV header row (its first field name is "bianh")
        if (key.get() == 0 && line.startsWith("bianh")) return;
        String[] fields = line.split(",", -1);
        // fields[3] = cat (industry category)
        if (fields.length > 3 && !fields[3].trim().isEmpty() && !fields[3].equals("NULL")) {
            industry.set(fields[3]);
            context.write(industry, one);
        }
    }
}
```
#### Reducer
```java
// IndustryCountReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IndustryCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // sum the 1s emitted for this industry
        int sum = 0;
        for (IntWritable val : values) sum += val.get();
        result.set(sum);
        context.write(key, result);
    }
}
```
#### Driver
```java
// IndustryCountDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndustryCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Industry Count");
        job.setJarByClass(IndustryCountDriver.class);
        job.setMapperClass(IndustryCountMapper.class);
        job.setCombinerClass(IndustryCountReducer.class);
        job.setReducerClass(IndustryCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
---
### Compile, package, and run
```bash
mkdir -p /home/hadoop/mr/classes
javac -cp $(hadoop classpath) -d /home/hadoop/mr/classes /home/hadoop/mr/*.java
cd /home/hadoop/mr/classes
jar cf industry_count.jar *.class
mv industry_count.jar ..
cd ..
hadoop jar industry_count.jar IndustryCountDriver \
/input/death_companies/com_cleaned.csv \
/output/industry_count
```
---
### View the results (HDFS shell)
```bash
hdfs dfs -cat /output/industry_count/part-r-00000 | sort -k2nr | head -10
```
Sample output (actual values depend on the dataset):
```
金融 1203
企业服务 890
教育 432
...
```
---
### MR Task 2: Top 10 longest-surviving companies
Sort by the `live_days` field and keep the ten largest values. The original plan omits this code; a minimal sketch is given below. Output path: `/output/top_survivors`.
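A minimal single-file sketch (the class name `TopSurvivorsJob` and the single-reducer top-N approach are my own choices, not part of the original plan). It assumes the cleaned CSV layout described in section 8.4 (`com_name` in column 2, index 1; `live_days` in column 9, index 8) and reuses the same naive comma split as Task 1, so quoted commas inside description fields are not handled:
```java
// TopSurvivorsJob.java -- hypothetical sketch: Top 10 companies by live_days
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopSurvivorsJob {

    public static class SurvivorMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            if (key.get() == 0 && line.startsWith("bianh")) return;   // skip the CSV header row
            String[] fields = line.split(",", -1);                    // naive split, same as Task 1
            if (fields.length > 8) {
                try {
                    long days = (long) Double.parseDouble(fields[8].trim()); // fields[8] = live_days
                    if (days > 0) {
                        // key = live_days so the framework sorts for us; value = com_name
                        context.write(new LongWritable(days), new Text(fields[1]));
                    }
                } catch (NumberFormatException ignored) { /* non-numeric live_days */ }
            }
        }
    }

    public static class SurvivorReducer extends Reducer<LongWritable, Text, Text, LongWritable> {
        private int emitted = 0;   // keys arrive in descending order, so keep only the first 10 rows

        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text name : values) {
                if (emitted >= 10) return;
                context.write(name, key);
                emitted++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Top 10 Survivors");
        job.setJarByClass(TopSurvivorsJob.class);
        job.setMapperClass(SurvivorMapper.class);
        job.setReducerClass(SurvivorReducer.class);
        job.setNumReduceTasks(1);                                               // single reducer -> global order
        job.setSortComparatorClass(LongWritable.DecreasingComparator.class);    // sort keys descending
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
Package and run it the same way as Task 1, writing to `/output/top_survivors`.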
View the output:
```bash
hdfs dfs -cat /output/top_survivors/part-r-00000
```
---
### Reading results with the Java API (example program)
```java
// ReadResultFromHDFS.java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class ReadResultFromHDFS {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:9000"), conf);
        Path path = new Path("/output/industry_count/part-r-00000");
        // open the result file and print it line by line
        FSDataInputStream in = fs.open(path);
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        br.close();
        fs.close();
    }
}
```
Compile and run:
```bash
javac -cp $(hadoop classpath):. ReadResultFromHDFS.java
java -cp $(hadoop classpath):. ReadResultFromHDFS
```
---
## ✅ 8. Data Analysis with Hive
### 8.1 Install Hive (master)
```bash
cd /home/hadoop/software
wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -zxvf apache-hive-3.1.3-bin.tar.gz -C ../apps/
mv ../apps/apache-hive-3.1.3-bin ../apps/hive
```
If the download returns 404 (dlcdn only hosts current releases), use the same path under `https://archive.apache.org/dist/hive/hive-3.1.3/`.
Environment variables:
```bash
echo 'export HIVE_HOME=/home/hadoop/apps/hive' >> ~/.bash_profile
echo 'export PATH=$HIVE_HOME/bin:$PATH' >> ~/.bash_profile
source ~/.bash_profile
```
---
### 8.2 Install MySQL and initialize the Metastore database
CentOS 7's default repositories ship MariaDB rather than MySQL, so install the MySQL 5.7 community Yum repository first (the `mysql57-community-release-el7` RPM from dev.mysql.com), then:
```bash
yum install -y mysql-community-server
systemctl start mysqld
grep 'temporary password' /var/log/mysqld.log
mysql -u root -p
```
Change the root password, then create the metastore database and the hive user:
```sql
ALTER USER 'root'@'localhost' IDENTIFIED BY 'HiveMeta123!';
CREATE DATABASE metastore CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'hive'@'%' IDENTIFIED BY 'HiveMeta123!';
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';
FLUSH PRIVILEGES;
EXIT;
```
---
### 8.3 Configure Hive
Do not copy the full `hive-default.xml.template` (it contains `system:` variable references that break Hive 3). Create a minimal `hive-site.xml` instead:
```bash
vim $HIVE_HOME/conf/hive-site.xml
```
With the following content:
```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>HiveMeta123!</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
```
Download the MySQL JDBC driver (Connector/J, which provides `com.mysql.cj.jdbc.Driver`):
```bash
wget https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.0.33/mysql-connector-j-8.0.33.jar
cp mysql-connector-j-8.0.33.jar $HIVE_HOME/lib/
```
Hive 3.1.3 bundles an older Guava than Hadoop 3.3.6; if the next step fails with a Guava `NoSuchMethodError`, replace `$HIVE_HOME/lib/guava-*.jar` with the `guava-*.jar` from `$HADOOP_HOME/share/hadoop/common/lib/`. Initialize the metastore schema:
```bash
schematool -initSchema -dbType mysql
```
---
### 8.4 Start Hive and create the source table
```bash
hive
```
Table DDL (column order matches the cleaned CSV):
```sql
CREATE DATABASE IF NOT EXISTS startup_analysis;
USE startup_analysis;
CREATE EXTERNAL TABLE death_companies (
bianh STRING,
com_name STRING,
com_addr STRING,
cat STRING,
se_cat STRING,
com_des STRING,
born_data STRING,
death_data STRING,
live_days INT,
financing STRING,
total_money DOUBLE,
death_reason STRING,
invest_name STRING,
ceo_name STRING,
ceo_des STRING,
ceo_per_des STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar"=",", "quoteChar"="\"")
STORED AS TEXTFILE
LOCATION '/input/death_companies'
TBLPROPERTIES ("skip.header.line.count"="1");
```
> Note: `skip.header.line.count` drops the CSV header row, and OpenCSVSerde exposes every column as STRING regardless of the declared types, so the queries in 8.5 CAST numeric fields explicitly.
---
### 8.5 Run six Hive queries (each creates a result table)
```sql
-- Q1: average survival days per industry
CREATE TABLE result_avg_survival AS
SELECT cat, AVG(CAST(live_days AS INT)) AS avg_days FROM death_companies GROUP BY cat ORDER BY avg_days DESC;
-- Q2: company count per financing stage
CREATE TABLE result_financing_count AS
SELECT financing, COUNT(*) AS cnt FROM death_companies GROUP BY financing ORDER BY cnt DESC;
-- Q3: industries shut down due to policy or regulation
CREATE TABLE result_policy_impact AS
SELECT cat, COUNT(*) AS policy_cnt FROM death_companies
WHERE death_reason LIKE '%政策%' OR death_reason LIKE '%监管%'
GROUP BY cat ORDER BY policy_cnt DESC;
-- Q4: top 10 companies by total funding
CREATE TABLE result_top_funded AS
SELECT com_name, cat, CAST(total_money AS DOUBLE) AS total_money FROM death_companies
ORDER BY total_money DESC LIMIT 10;
-- Q5: top 10 cities by number of failed companies
CREATE TABLE result_city_top AS
SELECT com_addr, COUNT(*) AS cnt FROM death_companies GROUP BY com_addr ORDER BY cnt DESC LIMIT 10;
-- Q6: investors backing short-lived companies (< 1 year)
CREATE TABLE result_investor_risk AS
SELECT invest_name, COUNT(*) AS short_count, AVG(CAST(total_money AS DOUBLE)) AS avg_money
FROM death_companies WHERE CAST(live_days AS INT) <= 365 AND invest_name != 'NULL'
GROUP BY invest_name HAVING short_count >= 3 ORDER BY short_count DESC;
```
---
## ✅ 9. Migrate Hive Results to MySQL with DataX
### 9.1 Install DataX
```bash
cd /home/hadoop/software
wget https://github.com/alibaba/DataX/archive/refs/tags/v20230818.tar.gz
tar -zxvf v20230818.tar.gz -C ../apps/
mv ../apps/DataX-20230818 ../apps/datax
```
> Note: a GitHub tag archive contains source code that must be built with Maven before use; alternatively, download the pre-built `datax.tar.gz` package linked from the DataX project README and extract it to `/home/hadoop/apps/datax`.
---
### 9.2 Create the target database and tables in MySQL
```sql
CREATE DATABASE analysis_report CHARACTER SET utf8mb4;
USE analysis_report;
CREATE TABLE result_avg_survival (
cat VARCHAR(100),
avg_days DECIMAL(10,2)
);
CREATE TABLE result_financing_count (
financing VARCHAR(50),
cnt BIGINT
);
CREATE TABLE result_investor_risk (
invest_name VARCHAR(200),
short_count BIGINT,
avg_money DOUBLE
);
```
---
### 9.3 Write the DataX JSON job (example: `result_avg_survival`)
```json
{
  "job": {
    "setting": { "speed": { "channel": 1 } },
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "path": "/user/hive/warehouse/startup_analysis.db/result_avg_survival/*",
            "defaultFS": "hdfs://master:9000",
            "fileType": "text",
            "fieldDelimiter": "\u0001",
            "column": [ {"index":0,"type":"string"}, {"index":1,"type":"double"} ],
            "encoding": "UTF-8"
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "hive",
            "password": "HiveMeta123!",
            "column": ["cat", "avg_days"],
            "connection": [
              {
                "jdbcUrl": "jdbc:mysql://master:3306/analysis_report?useUnicode=true&characterEncoding=utf8",
                "table": ["result_avg_survival"]
              }
            ]
          }
        }
      }
    ]
  }
}
```
> `fieldDelimiter` is set to `\u0001` because tables created with `CREATE TABLE ... AS SELECT` in Hive store text data with the default ^A (Ctrl-A) field separator.
Run the job (save the JSON above as `job/hive_to_mysql_avg.json` under the DataX directory):
```bash
python3 /home/hadoop/apps/datax/bin/datax.py job/hive_to_mysql_avg.json
```
> If `datax.py` fails under Python 3 (older DataX releases only support Python 2), run it with `python2` instead.
Repeat with similar JSON jobs for the other two result tables.
---
### 9.4 Verify the data in MySQL
```sql
SELECT * FROM analysis_report.result_avg_survival LIMIT 5;
SELECT * FROM analysis_report.result_investor_risk LIMIT 3;
```
---
## ✅ 10. Key Concepts Explained (no more than three)
1. **HDFS distributed storage**
   Data is split into blocks stored across multiple DataNodes while the NameNode manages all metadata, providing fault tolerance and horizontal scalability (see the sketch after this list).
2. **The MapReduce batch-processing model**
   The Map phase processes input records in parallel and the Reduce phase aggregates the intermediate results, which suits large-scale offline analysis jobs.
3. **Hive as a data warehouse layer**
   Hive offers a SQL interface that is automatically translated into underlying jobs, lowering the barrier to big data analysis and supporting complex queries such as JOINs and nested subqueries.
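To make point 1 concrete, here is a small illustrative sketch (the class name `ShowBlockLocations` is hypothetical, not part of the original plan) that uses the HDFS Java API to print which DataNodes hold each block of the cleaned file uploaded in section 6. It assumes the same `hdfs://master:9000` address and file path used earlier, and can be compiled and run the same way as `ReadResultFromHDFS`:
```java
// ShowBlockLocations.java -- hypothetical example: list the DataNodes storing each block of a file
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:9000"), conf);
        Path file = new Path("/input/death_companies/com_cleaned.csv");
        FileStatus status = fs.getFileStatus(file);
        // each BlockLocation describes one block: its offset, length, and the hosts holding replicas
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```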
---
✅ **Final deliverables checklist:**
- [ ] Screenshots of all configuration files
- [ ] Screenshot of the `jps` process listings
- [ ] Screenshots of HDFS `ls`/`cat` output
- [ ] Screenshots of Hive query results
- [ ] Screenshot of a successful DataX run log
- [ ] Screenshots of MySQL query results
- [ ] Screenshots of MapReduce output
- [ ] Screenshot of the Java API output
> If you need an automated deployment script (one-click install of Hadoop + Hive + MySQL), I can generate one for you.