Configuring RHadoop and Running the WordCount Example

This article describes how to install and configure the R language and its IDE, RStudio, on Linux, details the dependency packages and build environment the installation requires, and walks through installing, configuring, and testing RHadoop.

1. Install the R language environment

su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm'

su -c 'yum install foo' # "foo" is just the placeholder from the EPEL setup instructions; the actual package (R) is installed below

yum list R-\*

yum install R
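To confirm the installation succeeded, check the version from the shell:

R --version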

2. Install RStudio Desktop and Server

The Desktop edition is an RPM package; double-click it to install.

Server installation commands:

yum install openssl098e # Required only for RedHat/CentOS 6 and 7

wget http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm

yum install --nogpgcheck rstudio-server-0.98.1091-x86_64.rpm

Add an r-user account
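RStudio Server authenticates against local system accounts, so a regular (non-root) user is needed to sign in through the web UI (port 8787 by default). A minimal sketch, using the account name mentioned above; the password is yours to choose:

useradd r-user
passwd r-user                         # set the password used to log in to RStudio Server
rstudio-server verify-installation    # optional: check that the service is installed correctly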

3. Install gcc, git, and pkg-config

yum install gcc git pkg-config

4. Install Thrift 0.9.0

yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel

Build and installation steps:

Update the System

    yum -y update

Install the Platform Development Tools

    yum -y groupinstall "Development Tools"

Upgrade autoconf/automake/bison

    yum install -y wget

Upgrade autoconf

    wget http://ftp.gnu.org/gnu/autoconf/autoconf-2.69.tar.gz

    tar xvf autoconf-2.69.tar.gz

    cd autoconf-2.69

    ./configure --prefix=/usr

    make

    make install

Upgrade automake

    wget http://ftp.gnu.org/gnu/automake/automake-1.14.tar.gz

    tar xvf automake-1.14.tar.gz

    cd automake-1.14

    ./configure --prefix=/usr

    make

    make install

Upgrade bison

    wget http://ftp.gnu.org/gnu/bison/bison-2.5.1.tar.gz

    tar xvf bison-2.5.1.tar.gz

    cd bison-2.5.1

    ./configure --prefix=/usr

    make

    make install

Install C++ Lib Dependencies

    yum -y install libevent-devel zlib-devel openssl-devel

Upgrade Boost

    wget http://sourceforge.net/projects/boost/files/boost/1.55.0/boost_1_55_0.tar.gz

    tar xvf boost_1_55_0.tar.gz

    cd boost_1_55_0

    ./bootstrap.sh

    ./b2 install

Build and Install the Apache Thrift IDL Compiler

    git clone https://git-wip-us.apache.org/repos/asf/thrift.git

    cd thrift

    ./bootstrap.sh

    ./configure --with-lua=no

    Edit /thrift-0.9.1/lib/cpp/thrift.pc so that includedir=${prefix}/include/thrift (one way to do this is sketched below)
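    One possible way to apply that edit non-interactively (the path is the one given above):

    sed -i 's|^includedir=.*|includedir=${prefix}/include/thrift|' /thrift-0.9.1/lib/cpp/thrift.pc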

    make

    make install

Update PKG_CONFIG_PATH:

    export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/

Verify that the pkg-config path is correct:

    pkg-config --cflags thrift

    returns:

    -I /usr/local/include/thrift

Copy the Thrift shared library into the system lib directory

    cp /usr/local/lib/libthrift-1.0.0-dev.so /usr/lib/
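Copying the .so by hand works, but an alternative (an assumption, not part of the original steps) is to register /usr/local/lib with the dynamic linker so every library installed there is found automatically:

    echo "/usr/local/lib" > /etc/ld.so.conf.d/usr-local-lib.conf   # hypothetical conf file name
    ldconfig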

5. Set Linux environment variables

export HADOOP_PREFIX=/usr/lib/hadoop

export HADOOP_CMD=/usr/lib/hadoop/bin/hadoop

export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar
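These exports only affect the current shell session. To make them persistent (one option; adjust the file to your setup), append them to ~/.bashrc and reload it:

cat >> ~/.bashrc <<'EOF'
export HADOOP_PREFIX=/usr/lib/hadoop
export HADOOP_CMD=/usr/lib/hadoop/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar
EOF
source ~/.bashrc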

6. As root, start an R session and install the dependency packages

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest",

                    "functional", "stringr", "plyr", "reshape2", "dplyr",

                    "R.methodsS3", "caTools", "Hmisc", "data.table", "memoise"))

7. As root, start an R session and install the RHadoop packages

install.packages("/root/RHadoop/rhdfs_1.0.8.tar.gz", repos=NULL, type="source")

install.packages("/root/RHadoop/rmr2_3.3.0.tar.gz", repos=NULL, type="source")

install.packages("/root/RHadoop/plyrmr_0.5.0.tar.gz", repos=NULL, type="source")

install.packages("/root/RHadoop/rhbase_1.2.1.tar.gz", repos=NULL, type="source")

8. Configure Ant and Maven

export MAVEN_HOME=/root/apache-maven-3.2.5

export PATH=/root/apache-maven-3.2.5/bin:$PATH

export ANT_HOME=/root/apache-ant-1.9.4

export PATH=$ANT_HOME/bin:$PATH
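To confirm both tools are picked up from the updated PATH:

mvn -version
ant -version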

9. Test RHadoop

Sys.setenv("HADOOP_PREFIX"="/usr/lib/hadoop")

Sys.setenv("HADOOP_CMD"="/usr/lib/hadoop/bin/hadoop")

Sys.setenv("HADOOP_STREAMING"="/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")

 

    library(rmr2)

    bp = rmr.options("backend.parameters")

    trans <- list(D="mapreduce.map.java.opts=-Xmx400M",

                 D="mapreduce.reduce.java.opts=-Xmx400M",

                 D="mapreduce.map.memory.mb=4096",

                 D="mapreduce.reduce.memory.mb=4096",

                 D="mapreduce.task.io.sort.mb=100")

    bp <- list(hadoop=trans)

    #### Unused code: begin #######################

    bp$hadoop[1]="mapreduce.map.java.opts=-Xmx400M"

    bp$hadoop[2]="mapreduce.reduce.java.opts=-Xmx400M"

    bp$hadoop[3]="mapreduce.map.memory.mb=1024"

    bp$hadoop[4]="mapreduce.reduce.memory.mb=2048"

    bp$hadoop[5]="mapreduce.task.io.sort.mb=100"

    #### Unused code: end #########################

    rmr.options(backend.parameters = bp)

    rmr.options("backend.parameters")

    ## map function

    map <- function(k,lines) {

        words.list <- strsplit(lines, '\\s')

        words <- unlist(words.list)

        return( keyval(words, 1) )

    }

    ## reduce function

    reduce <- function(word, counts) {

        keyval(word, sum(counts))

    }

    wordcount <- function (input, output=NULL) {

        mapreduce(input=input, output=output, input.format="text",
                  map=map, reduce=reduce)

    }

 

    ## delete previous result if any

    system("/usr/lib/hadoop/bin/hadoop fs -rm -r /tmp/zhengcong/out")

    

    ## Submit job

    hdfs.root <- '/tmp/zhengcong'

    hdfs.data <- file.path(hdfs.root, 'hp')

    hdfs.out <- file.path(hdfs.root, 'out')

    out <- wordcount(hdfs.data, hdfs.out)

 

    ## Fetch results from HDFS

    results <- from.dfs(out)

 

    ## check top 30 frequent words

    results.df <- as.data.frame(results, stringsAsFactors=F)

    colnames(results.df) <- c('word', 'count')

    head(results.df[order(results.df$count, decreasing=T), ], 30)

 

10. Troubleshooting

    If rJava fails to load, run R CMD javareconf -e as root,

    and add export LD_LIBRARY_PATH=$JAVA_HOME/lib/amd64:$JAVA_HOME/jre/lib/amd64/server to the environment
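    After re-running javareconf, a quick check from a fresh R session (a minimal sketch):

    library(rJava)
    .jinit()   # returns 0 when the JVM initializes successfully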
