Getting Started with Big Data Technologies (CentOS 7)

This article walks through installing, configuring, and using a set of big data tools. It covers installing JDK and Hadoop on CentOS 7, installing JDK, Eclipse, and Maven on Windows 10, installing and configuring HBase, Redis, MongoDB, Hive, and Spark, and gives application examples for each tool, such as word-count programs.


(1) Installing and Configuring JDK + Hadoop (CentOS 7)

Tools: Xshell (to connect to the CentOS host), Xftp (to transfer files between Windows and the CentOS host).
Tip: all installation packages downloaded in this article are recommended to go into /usr/local.

Pre-installation setup

  1. Turn off the firewall;
  2. Disable SELINUX;
  3. Download the JDK (version 1.8 with the tar.gz suffix is recommended) and upload it to the CentOS host;
  4. Unpack the JDK archive, configure the Java environment variables on CentOS, and make the configuration take effect (a command sketch follows this list);
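
A minimal shell sketch of these four steps; the JDK file name and the /usr/java symlink are assumptions, so adjust them to the package you actually downloaded:

systemctl stop firewalld && systemctl disable firewalld        # 1. turn off the firewall
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config && setenforce 0    # 2. disable SELinux
cd /usr/local
tar -zxvf jdk-8u191-linux-x64.tar.gz                           # 3./4. unpack the JDK (example file name)
ln -s /usr/local/jdk1.8.0_191 /usr/java                        # assumed link so JAVA_HOME=/usr/java matches the configs below
echo 'export JAVA_HOME=/usr/java' >> /etc/profile
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> /etc/profile
source /etc/profile                                            # make the configuration take effect
java -version                                                  # verify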

Installing Hadoop (pseudo-distributed mode)

  1. Download Hadoop (this article uses version 2.7.7) and upload it to the CentOS host;
  2. Unpack the archive;
  3. Configure the Hadoop environment variables and make the configuration take effect (see the sketch after this list);
  4. Edit Hadoop's five configuration files (go into Hadoop's etc/hadoop directory); this step is not needed in standalone mode;
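
A sketch of steps 1–3, assuming the hadoop-2.7.7.tar.gz archive has already been uploaded to /usr/local:

cd /usr/local
tar -zxvf hadoop-2.7.7.tar.gz
echo 'export HADOOP_HOME=/usr/local/hadoop-2.7.7' >> /etc/profile
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> /etc/profile
source /etc/profile
hadoop version                                  # verify the installation
cd /usr/local/hadoop-2.7.7/etc/hadoop           # directory holding the five configuration files edited below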

File 1: hadoop-env.sh

[root@hadoop hadoop]# vi hadoop-env.sh  //add the following variable (one line)
#in hadoop-2.7.7 the JAVA_HOME line is around line 25
#use :set number to display line numbers
export JAVA_HOME=/usr/java

File 2: core-site.xml (the Hadoop/HDFS core configuration file)

[root@hadoop hadoop]# vi core-site.xml		//add the following lines
<configuration>
<!-- File system schema (URI) used by Hadoop, i.e. the address of the HDFS master (NameNode) -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop:9000</value>			<!-- "hadoop" is the hostname -->
  </property>
  <!-- Directory where Hadoop stores its runtime files -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/tmp</value>
  </property>

 <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
</configuration>

File 3: hdfs-site.xml

[root@hadoop hadoop]# vi hdfs-site.xml   //add the following lines
<configuration>
<!-- Number of HDFS replicas -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

File 4: mapred-site.xml (if the file does not exist yet, copy it from mapred-site.xml.template)

[root@hadoop hadoop]# vi mapred-site.xml		//add the following lines
<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>

File 5: yarn-site.xml

[root@hadoop hadoop]# vi yarn-site.xml		//add the following lines
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Address of the YARN master (ResourceManager) -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop</value>
    </property>
    <!-- How reducers fetch data -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
  1. Set up SSH trust (configure ssh and generate a key pair so that ssh can log in to localhost without a password; a sketch is given below, after the format command);
  2. Start the Hadoop cluster;
    First format the NameNode.
    Note: if this is not the first format, delete the tmp and logs directories under /usr/local/hadoop-<version>/ before formatting.
[root@hadoop hadoop]# hdfs namenode -format #the format succeeded if no errors appear in the middle and the final lines report a successful format
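
For reference, a minimal sketch of the passwordless-SSH setup from step 1 (needed before start-all.sh can launch the daemons); run as root and accept the defaults when prompted:

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost        # should now log in without asking for a password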

Start the Hadoop cluster: start-all.sh
Stop the Hadoop cluster: stop-all.sh
Check the Hadoop processes:

[root@hadoop hadoop]# jps		//the configuration is correct if the following processes appear
1776 SecondaryNameNode
2340 Jps
1478 NameNode
1609 DataNode
1930 ResourceManager
2219 NodeManager
  1. Web UI ports: 50070 (HDFS management UI, NameNode) and 8088 (MapReduce/YARN management UI);
  2. Word-count experiment (a consolidated command sketch follows this list):
    hdfs dfs -put in.txt /input uploads the file in.txt from the current local directory to the HDFS directory /input;
    run hadoop jar hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output;
    view the word-count result in the file /output/part-r-00000 through the UI on port 50070.
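
Putting the experiment together, a hedged end-to-end sketch; the example jar ships under $HADOOP_HOME/share/hadoop/mapreduce, and the file contents and directory names are examples:

echo "hello world hello hadoop" > in.txt
hdfs dfs -mkdir /input
hdfs dfs -put in.txt /input
hadoop jar /usr/local/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output
hdfs dfs -cat /output/part-r-00000        # same result that the port-50070 UI shows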

(2) Installing JDK + Eclipse + Maven (Windows 10)

JDK

  1. Download the JDK (again version 1.8 is recommended; download the .exe installer);
    Official link: download JDK 1.8
  2. Configure the JDK environment variables;

Eclipse

  1. Download the Eclipse installer eclipse-inst-win64 from the official site: download Eclipse
  2. Install Eclipse;
  3. Configure the Java environment in Eclipse;

Maven

  1. Download Maven;
    Maven official site: download Maven, pick the closest mirror, and download the archive apache-maven-3.6.0-bin.tar.gz.
  2. Unpack the Maven archive;
    Unpack apache-maven-3.6.0-bin.tar.gz; the extracted folder is \apache-maven-3.6.0. Copy it to a path of your choice, for example C:\eclipse\apache-maven-3.6.0.
  3. Configure the Maven environment variables;
  4. Configure Maven inside Eclipse;
    ① Edit settings.xml:
    under the installation folder \apache-maven-3.6.0, create a new \repository folder as the Maven local repository, then point settings.xml at it by adding <localRepository>C:\eclipse\apache-maven-3.6.0\repository</localRepository>.
    ② Configure Maven's Installations and User Settings:
    [Preferences] → [Maven] → [Installations] sets the Maven installation path; [User Settings] sets the path to settings.xml.
  5. Create a new Maven project in Eclipse;
  6. Edit pom.xml in the Maven project:
    the dependencies (Maven Repository: hadoop) are listed at: Maven dependencies
//find the three dependencies for the matching version (below), copy them between <project> and </project> in pom.xml; after saving, the Maven Dependencies are generated automatically
<dependencies>
  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.7.7</version>
  </dependency>
  <dependency>  
      <groupId>org.apache.hadoop</groupId>  
      <artifactId>hadoop-client</artifactId>  
      <version>2.7.7</version>  
  </dependency> 
  <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.7</version>
  </dependency>
</dependencies>

Java programs for HDFS

  1. HDFSMKdir.java creates the HDFS directory /aadir.
package hdfs.files;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSMKdir {

	public static void main(String[] args) throws IOException {
		
		System.setProperty("HADOOP_USER_NAME", "root");
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://hadoop:9000");
		FileSystem client = FileSystem.get(conf);
		client.mkdirs(new Path("/aadir"));
		client.close();
		System.out.println("successfully!");
	}

}
  2. HDFSUpload.java writes/uploads a local file to the HDFS /aadir directory (the code below reads /usr/local/hdfs/aa.txt, since the jar is run on the CentOS host).
package hdfs.files;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSUpload {

	private static InputStream input;
	private static OutputStream output;
	public static void main(String[] args) throws IOException{
		System.setProperty("HADOOP_USER_NAME", "root");
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://hadoop:9000");
		FileSystem client = FileSystem.get(conf);
		input = new FileInputStream("/usr/local/hdfs/aa.txt");
		output = client.create(new Path("/aadir/aaout.txt"));
		byte[] buffer = new byte[1024];
		int len = 0;
		while ((len=input.read(buffer))!=-1){
			output.write(buffer, 0, len);
		}
		output.flush();
		input.close();
		output.close();
	}

}
  3. HDFSDownload.java reads/downloads the HDFS root-directory file /bb.txt to the local file system (the code below writes it to /usr/local/hdfs/bbout.txt on the CentOS host).
package hdfs.files;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSDownload {

	// declare the input and output streams
	private static InputStream input;
	private static OutputStream output;

	public static void main(String[] args) throws IOException {
		// run as the root user
		System.setProperty("HADOOP_USER_NAME", "root");
		// create the HDFS client connection
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://hadoop:9000");
		FileSystem client = FileSystem.get(conf);
		// open the HDFS file /bb.txt as the input stream
		input = client.open(new Path("/bb.txt"));
		// create the local file as the output stream
		output = new FileOutputStream("/usr/local/hdfs/bbout.txt");
		// copy the file from HDFS to the local file system
		byte[] buffer = new byte[1024];
		int len = 0;
		while ((len = input.read(buffer)) != -1) {
			output.write(buffer, 0, len);
		}
		// flush so the output is complete
		output.flush();
		// the IOUtils utility class could be used for the copy instead
		// close the streams
		input.close();
		output.close();

	}

}
  4. HDFSFileIfExist.java checks whether the HDFS file /bb.txt exists.
package hdfs.files;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSFileIfExist {

	public static void main(String[] args) throws IOException{
		System.setProperty("HADOOP_USER_NAME", "root");
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://hadoop:9000");
		FileSystem client = FileSystem.get(conf);
		String fileName = "/bb.txt";
		if (client.exists(new Path(fileName))) {
			System.out.println("file exists!");
		}else {
			System.out.println("file does not exist!");
		}
	}
}
  5. Create four Maven projects and put one of the four Java programs into each;
  6. Package the four Maven projects into jar files;
    add the packaging plugin to each project's pom.xml:
//add inside <project> </project>
<build>
        <plugins>
            <plugin>
                   <artifactId>maven-assembly-plugin</artifactId>
                   <configuration>
                        <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                        <archive>
                             <manifest>
                                  <mainClass>hdfs.files.HDFSMKdir</mainClass> <!-- set this to each project's main class -->
                             </manifest>
                        </archive>
                   </configuration>
                   <executions>
                        <execution>
                             <id>make-assembly</id>
                             <phase>package</phase>
                             <goals>
                                  <goal>single</goal>
                             </goals>
                        </execution>
                   </executions>
              </plugin>
        </plugins>
    </build>
  7. Upload the generated jar files to the CentOS host and run them;
    command: java -jar <jar file name> (run it from the directory containing the jar); an example run is sketched below;
    result: the program prints its output (screenshot omitted)
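
For example (the jar names below are hypothetical; they depend on the artifactId and version in each project's pom.xml):

cd /usr/local/jars                                               # assumed upload directory
java -jar hdfsmkdir-0.0.1-jar-with-dependencies.jar              # creates /aadir
java -jar hdfsupload-0.0.1-jar-with-dependencies.jar             # uploads aa.txt as /aadir/aaout.txt
hdfs dfs -ls /aadir                                              # verify on HDFS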

WordCount Java program experiment

  1. Create a Maven project
  2. Add the following dependencies to pom.xml:
//**Note**: add inside the <project></project> tags
<dependencies>

	<dependency>
    	<groupId>org.apache.hadoop</groupId>
    	<artifactId>hadoop-client</artifactId>
    	<version>2.7.7</version>
	</dependency>
	<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.7.7</version>
	</dependency>
	
	<dependency>
    	<groupId>org.apache.hadoop</groupId>
    	<artifactId>hadoop-mapreduce-client-common</artifactId>
    	<version>2.7.7</version>
	</dependency>
	
    <dependency>
   	 	<groupId>org.apache.hadoop</groupId>
    	<artifactId>hadoop-common</artifactId>
    	<version>2.7.7</version>
 	</dependency>
 	<dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.0</version>
        </dependency>
  </dependencies>
  3. Create the following Java program (**note:** the directory and the file /usr/local/hdfs/input/cc.txt must be created on the CentOS host first)

WordCountDriver.java:

package hdfs.files;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountDriver {
	
	public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
		   //break the input data into words
		 protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		   //input data, e.g. "hello world love work"
		   String line = value.toString();
		   //split the line
		   String[] words=line.split(" ");
		   //emit <hello, 1>
		   for(String w:words) {
		   //send to the reducer side
		   context.write(new Text(w), new IntWritable(1));
		   }
		  }
	}

	public static class WordCountReducer extends Reducer <Text, IntWritable, Text, IntWritable>{
		  protected void reduce(Text Key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
			    //count of occurrences
			    int sum=0;
			    //accumulate and output the sum
			    for(IntWritable v:values) {
			      sum +=v.get();
			    }
			    context.write(Key, new IntWritable(sum));
			  }
		}

	public static void main(String[] args) throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException {
		  // run as the root user
		  System.setProperty("HADOOP_USER_NAME", "root");
		  //create the job
		  Configuration conf=new Configuration();
		  Job job=Job.getInstance(conf);
		  //locate the jar by its driver class
		  job.setJarByClass(WordCountDriver.class);
		  //Mapper class to use
		  job.setMapperClass(WordCountMapper.class);
		  //Reducer class to use
		  job.setReducerClass(WordCountReducer.class);
		  //output types of the Mapper stage
		  job.setMapOutputKeyClass(Text.class);
		  job.setMapOutputValueClass(IntWritable.class);
		  //output types of the Reducer stage
		  job.setOutputKeyClass(Text.class);
		  job.setOutputValueClass(IntWritable.class);
		  //input path and file name
		  FileInputFormat.setInputPaths(job, new Path("/usr/local/hdfs/input/cc.txt"));
		  //output path
		  FileOutputFormat.setOutputPath(job, new Path("/usr/local/hdfs/output"));
		  //submit the job and wait for completion
		  Boolean rs=job.waitForCompletion(true);
		  //exit
		  System.exit(rs?0:1);
	}
}
  4. Package this Maven project, upload it to the CentOS host, and run it; the steps are the same as before
    (add the packaging plugin from earlier; just copy it over);

(3) Installing and Configuring HBase

Installing HBase

  1. Download the HBase archive and transfer it to the CentOS host with Xftp;
    note: choose a release that is compatible with your installed Hadoop version;
    download from the official site: download HBase
    choose the stable release hbase-1.4.9-bin.tar.gz and download it on Windows;
  2. Unpack it into /usr/local (see the sketch after this list);
  3. Configure the environment variables and make them take effect;
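
A sketch of steps 2–3; renaming the directory to /usr/local/hbase is an assumption that matches the paths used in the configuration steps below:

cd /usr/local
tar -zxvf hbase-1.4.9-bin.tar.gz
mv hbase-1.4.9 hbase
echo 'export HBASE_HOME=/usr/local/hbase' >> /etc/profile
echo 'export PATH=$PATH:$HBASE_HOME/bin' >> /etc/profile
source /etc/profile
hbase version        # verify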

HBase configuration (pseudo-distributed mode)

  1. Edit the configuration files (in /usr/local/hbase/conf):
    ① Configure hbase-env.sh
    Set the Java installation path
[root@hadoop conf]# vi hbase-env.sh		//add the following lines
export JAVA_HOME=/usr/java

Set the HBase configuration directory

export HBASE_CLASSPATH=/usr/local/hbase/conf

Use the ZooKeeper instance bundled with HBase by setting this parameter to true

export HBASE_MANAGES_ZK=true

② Configure hbase-site.xml

<configuration>
<property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop:9000/hbase</value>
</property>
<!-- Distributed mode; false (the default) means standalone mode -->
<property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
</property>

<!-- List of ZooKeeper quorum addresses; pseudo-distributed mode uses the default localhost -->
<property>
        <name>hbase.zookeeper.quorum</name>
        <value>localhost</value>
</property>
</configuration>

③ Start and run HBase (start Hadoop first); the web UI port is 16010
Start HBase (start-hbase.sh) and check with jps
Stop HBase (stop-hbase.sh)

Using the HBase database

  1. Enter the shell: hbase shell
  2. Create a table hbase_1102 with two column families, cf1 and cf2
hbase(main):041:0> create 'hbase_1102',  {NAME=>'cf1'}, {NAME=>'cf2'}
  3. Add data to the table. When adding data to an HBase table, only one column can be put at a time; multiple columns cannot be added in a single put.
hbase(main):042:0> put 'hbase_1102', '001','cf1:name','Tom'
hbase(main):043:0> put 'hbase_1102', '001','cf1:gender','man'
hbase(main):044:0> put 'hbase_1102', '001','cf2:chinese','90'
hbase(main):045:0> put 'hbase_1102', '001','cf2:math','91'
  4. View all the data in the table
hbase(main):046:0> scan 'hbase_1102'
  5. View the data for a particular row key
hbase(main):048:0> get 'hbase_1102','001'
  6. Delete a single cell
hbase(main):050:0> delete 'table name','row key','column family:column'
  7. Delete an entire row (use deleteall)
hbase(main):052:0> deleteall 'table name','row key'
  8. Delete a table (first disable 'table name', then drop 'table name')

HBase Java API

1. As in the previous experiments, create a Maven project first;
2. Add the relevant dependencies to pom.xml;
as follows:

//add inside the <project></project> tags
<dependencies>
	<dependency>
    	<groupId>org.apache.hadoop</groupId>
    	<artifactId>hadoop-client</artifactId>
    	<version>2.7.7</version>
	</dependency>
  	
 	<dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-it</artifactId>
      <version>1.4.9</version>
    </dependency>
  </dependencies>
  3. Write the Java program:
package hbase.tables;

import java.io.IOException;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseTables {

	public static Configuration conf;
	public static Connection con;
	public static Admin adm;
	
	@SuppressWarnings("all")
	public static void init() throws IOException {
		  conf=HBaseConfiguration.create();
		  conf.set("hbase.rootdir", "hdfs://47.106.78.4:9000/hbase");
		  con= ConnectionFactory.createConnection(conf);
		  adm = con.getAdmin();
		  System.out.println(adm);
	}  
	
	public static void createTable(String myTableName, String[] colFamily) throws IOException {
		 init();         
		 TableName tableName = TableName .valueOf(myTableName);
		 if (adm.tableExists(tableName)) {
			 System.out.println("table already exists!"); }else {
				 HTableDescriptor htd=new HTableDescriptor(tableName);
				 for(String str:colFamily) {     
					 HColumnDescriptor hcd =new HColumnDescriptor(str);
					 htd.addFamily(hcd); 
				 }         
				 adm.createTable(htd);
			 }
			   	close();
	}

	public static void close() {
		try {
			if (adm != null) { adm.close();
			    	}
			    	if (con != null) { con.close();
			    	}
			  	}catch (IOException e) {
			  		e.printStackTrace();}
	}

	public static void deleteTable(String myTableName) throws IOException {
		init();
		TableName tableName = TableName .valueOf(myTableName);
		if (adm.tableExists(tableName)) {
			adm.disableTable(tableName);
			adm.deleteTable(tableName);
			 }
		close();   
	}

	public static void listTables() throws IOException {
		init();   
		HTableDescriptor htds[] =adm.listTables();
		for(HTableDescriptor  htd : htds) {
			System.out.println(htd.getNameAsString());
		}
		close();
	}

	public static void insertRow(String myTableName, String rowKey, String colFamily, String col, String val) throws IOException {
		init();  
		TableName tableName =  TableName .valueOf(myTableName);
		@SuppressWarnings("deprecation")
		HTable table = new HTable(conf,tableName);
		Put put=new Put(rowKey.getBytes());
		put.addColumn(colFamily.getBytes(), col.getBytes(), val.getBytes());
		table.put(put);  
		table.close();
		close();
	}
	
	private static void deleteRow(String myTableName, String rowKey, String colFamily, String col) throws IOException {
		init();
		TableName tableName =TableName .valueOf(myTableName);
		@SuppressWarnings("deprecation")
		HTable table = new HTable(conf, tableName);
		Delete delete=new Delete(rowKey.getBytes());
		delete.addFamily(Bytes.toBytes(colFamily));
		delete.addColumn(Bytes.toBytes(colFamily), Bytes.toBytes(col));  
		table.delete(delete);
		table.close();
		close();
	}

	public static void getData(String myTableName, String rowKey, String colFamily, String col) throws IOException {   
		init();
		TableName tableName = TableName .valueOf(myTableName);
		@SuppressWarnings("deprecation")
		HTable table = new HTable(conf, tableName);
		Get get= new Get(rowKey.getBytes());
		Result result = table.get(get);
		showCell(result);  
		table.close(); 
		close();
	}

	private static void showCell(Result result) {
		
		Cell[] cells = result.rawCells();
		for (Cell cell : cells) {
			System.out.println("RowName:" + new String(CellUtil.cloneRow(cell)) + " ");
			System.out.println("Timetamp:" + cell.getTimestamp() + " ");
			System.out.println("column Family:" + new String(CellUtil.cloneFamily(cell)) + " ");
			System.out.println("row Name:" + new String(CellUtil.cloneQualifier(cell)) + " ");
			System.out.println("value:" + new String(CellUtil.cloneValue(cell)) + " ");
	        }

	}

	public static void main(String[] args) throws IOException {

		System.out.println("*****Please enter the number:1.createtable/2.insertRow/3.getData/4.deleteRow/5.listTables/6.deleteTable*****");
		for(int j=0;j<7;j++) {
		int i = 0;
		@SuppressWarnings("resource")
		Scanner scan = new Scanner(System.in);
		i = scan.nextInt();
		switch (i) {
		case 1: 
			System.out.println("please enter tablename:");
			String tbn = scan.next();
			String[] cf = {"cf1", "cf2"}; 
			HbaseTables.createTable(tbn, cf);
			System.out.println("createTable success!!!");
			break;
		case 2:
			System.out.println("please enter tablename:");
			String tbn1 = scan.next();
			System.out.println("please enter rowkey:");
			String rk1 = scan.next();
			System.out.println("please enter column:");
			String clm1 = scan.next();
			System.out.println("please enter colname:");
			String cn1 = scan.next();
			System.out.println("please enter colvalue:");
			String cv1 = scan.next();
			HbaseTables.insertRow(tbn1, rk1, clm1, cn1, cv1);
			System.out.println("insertRow success!!!");
			break;
		case 3: 
			System.out.println("please enter tablename:");
			String tbn2 = scan.next();
			System.out.println("please enter rowkey:");
			String rk2 = scan.next();
			System.out.println("please enter colname:");
			String cn2 = scan.next();
			System.out.println("please enter colvalue:");
			String cv2 = scan.next();
			HbaseTables.getData(tbn2, rk2, cn2, cv2);
			System.out.println("getData success!!!");
			break;
		case 4:
			System.out.println("please enter tablename:");
			String tbn3 = scan.next();
			System.out.println("please enter rowkey:");
			String rk3 = scan.next();
			System.out.println("please enter column:");
			String clm3 = scan.next();
			System.out.println("please enter colname:");
			String cn3 = scan.next();
			HbaseTables.deleteRow(tbn3, rk3, clm3, cn3);
			System.out.println("deleteRow success!!!");
			break;
		case 5: 
			HbaseTables.listTables();
			System.out.println("listTables success!!!");
			break;
		case 6: 
			System.out.println("please enter tablename:");
			String tbn4 = scan.next();
			HbaseTables.deleteTable(tbn4);
			System.out.println("deleteTable success!!!");
			break;

		default:
			System.out.println("input error!!!");
			break;
		}
		}
	}

}
  4. Package the project into a jar and upload it to the CentOS host to run (start HBase first)

(4) Using Redis and MongoDB

Installing and using Redis

  1. Download the package:
wget http://download.redis.io/releases/redis-4.0.2.tar.gz
  2. Unpack and build:
tar xzf redis-4.0.2.tar.gz
cd redis-4.0.2
make
make install
  3. Start Redis
[root@hadoop bin]# redis-server ./redis.conf
  4. Use the Redis shell (a few basic commands are sketched below)
redis-cli		//enter the shell
quit			//exit the shell
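
A few basic commands; they can be typed inside the shell, or passed directly to redis-cli as shown here:

redis-cli set name Tom        # store a key/value pair
redis-cli get name            # returns "Tom"
redis-cli ping                # returns PONG if the server is running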

Installing and using MongoDB

  1. Create the yum repository file:
vi /etc/yum.repos.d/mongodb-org-3.4.repo

Add the following configuration, then save and exit:
[mongodb-org-3.4]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.4/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-3.4.asc

  2. Install with yum
yum install -y mongodb-org
  3. Start
service mongod start
  4. Stop
service mongod stop
  5. Restart
service mongod restart
  6. Enter the database shell (a quick usage sketch follows)
mongo
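
A quick sanity check; the database and collection names are examples, and the same statements can also be typed interactively after running mongo:

mongo testdb --eval 'db.users.insert({name: "Tom", age: 20})'      # insert one document
mongo testdb --eval 'printjson(db.users.findOne())'                # read it back
mongo --eval 'db.version()'                                        # print the server version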

(5) Installing, Configuring, and Using Hive

Installing MySQL

  1. Download mysql-server from the official site (yum install)
    wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
    if wget is not available, install it first: yum -y install wget
  2. Install the release package: rpm -ivh mysql-community-release-el7-5.noarch.rpm
  3. Install the server: yum install mysql-community-server
  4. Restart the MySQL service: service mysqld restart
  5. Enter MySQL: mysql -u root -p
  6. Add a hive user, set its password, then grant privileges
mysql> CREATE DATABASE hive; 
mysql> USE hive; 
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive';
mysql> GRANT ALL ON hive.* TO 'hive'@'localhost' IDENTIFIED BY 'hive'; 
mysql> GRANT ALL ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hive'; 
mysql> FLUSH PRIVILEGES; 
mysql> quit;

Installing Hive

  1. Download the package: apache-hive-2.3.4-bin.tar.gz
  2. Unpack it (into /usr/local)
  3. Configure the Hive environment variables
  4. Configure Hive
cd /usr/local/hive/conf
touch hive-site.xml  	//then copy over the header section from hive-default.xml.template
[root@hadoop conf]# vi hive-site.xml		//open it and add the following content
<configuration>
  <!-- WARNING!!! This file is auto generated for documentation purposes ONLY! -->
  <!-- WARNING!!! Any changes you make to this file will be ignored by Hive.   -->
  <!-- WARNING!!! You must make your changes in hive-site.xml instead.         -->
  <!-- Hive Execution Parameters -->
<!-- HDFS root (scratch) directory for Hive jobs -->
<property>
    <name>hive.exec.scratchdir</name>
    <value>/user/hive/tmp</value>
</property>
<!-- Permission used when creating the HDFS scratch directories for Hive jobs -->
<property>
    <name>hive.scratch.dir.permission</name>
    <value>733</value>
</property>
<!-- Location on HDFS where Hive stores its warehouse (table data) -->
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
<!-- JDBC connection URL (address and database name) for the metastore database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://47.106.78.4:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<!-- JDBC driver for the metastore database -->
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<!-- Database user name -->
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<!-- Database password -->
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>350321</value>
</property>
<!-- Show column headers for queried tables in the CLI -->
 <property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>
<!-- Show the current database name in the CLI -->
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
     <value>false</value>
 </property>
<!-- HiveServer2 settings -->
        <property>
                 <name>hive.server2.thrift.port</name>
                 <value>10000</value>
        </property>

        <property>
                <name>hive.server2.thrift.bind.host</name>
                <value>192.168.1.14</value>
        </property>
</configuration>

  5. Install MySQL Connector/J
    (1) Download the archive and upload it to the CentOS host;
    official download page: http://ftp.ntu.edu.tw/MySQL/Downloads/Connector-J/
    mysql-connector-java-5.1.47.tar.gz
    (2) Unpack it into /usr/local;
    (3) Copy the driver jar mysql-connector-java-5.1.47-bin.jar into /usr/local/hive/lib.

Starting and using Hive

  1. Start Hadoop
  2. Initialize the metastore schema: schematool -dbType mysql -initSchema
  3. Start Hive with the command hive (this enters the Hive shell)
  4. Hive example: wordcount
    (1) create a data source file and upload it to the HDFS directory /user/input (see the sketch after these steps);
    (2) create the data source table t1: create table t1 (line string);
    (3) load the data: load data inpath '/user/input' overwrite into table t1;
    (4) write a HiveQL statement implementing wordcount, creating table wct1 to hold the result:
create table wct1 as select word, count(1) as count from (select explode (split (line, ' ')) as word from t1) w group by word order by word;

(5) view the wordcount result: select * from wct1;
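
A sketch of step (1), creating a small data file and uploading it to /user/input on HDFS (the file contents are an example):

echo "hello world hello hive" > in.txt
hdfs dfs -mkdir -p /user/input
hdfs dfs -put in.txt /user/input
hdfs dfs -ls /user/input        # verify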

(6) Installing and Using Spark

Installing Scala

  1. Download Scala (scala-2.12.8.tgz) and upload it to /usr/local on the CentOS host
  2. Unpack Scala: tar -zxvf scala-2.12.8.tgz
  3. Rename the directory: mv scala-2.12.8 scala
  4. Check the installation: scala -version (a command sketch follows this list)
  5. Start Scala: scala
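
A sketch of steps 1–5; adding scala to the PATH is an assumption (the original steps only unpack and rename):

cd /usr/local
tar -zxvf scala-2.12.8.tgz
mv scala-2.12.8 scala
echo 'export PATH=$PATH:/usr/local/scala/bin' >> /etc/profile
source /etc/profile
scala -version        # verify
scala                 # start the interactive shell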

Installing Spark

  1. Download spark-2.4.2-bin-hadoop2.7.tgz and upload it to the same directory as above;
  2. Unpack it;
  3. Rename the directory (a command sketch follows this list);
  4. Start it:
    (1) start the Hadoop cluster first;
    (2) go to Spark's sbin directory and start Spark: ./start-all.sh
    (3) jps should now also show the Worker and Master processes;
    (4) the Spark web UI is on port 8080;
    (5) enter the spark-shell with the command spark-shell (from the bin directory)
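
A sketch of the unpack/rename/start sequence; renaming to /usr/local/spark is an assumption that matches the paths used later in this section:

cd /usr/local
tar -zxvf spark-2.4.2-bin-hadoop2.7.tgz
mv spark-2.4.2-bin-hadoop2.7 spark
start-all.sh                            # start the Hadoop cluster first
/usr/local/spark/sbin/start-all.sh      # then start the Spark Master and Worker
jps                                     # Master and Worker should now appear
/usr/local/spark/bin/spark-shell        # enter the interactive shell (web UI on port 8080)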

Spark application: WordCount

Loading a local file

  1. Create a directory;
cd /usr/local/spark
mkdir mycode
cd mycode
mkdir wordcount
cd wordcount
  2. Create a file and write a few words into it, separated by spaces;
vi word.txt
  3. Start spark-shell;
  4. Write the contents of the textFile variable back out to another location, the writeback directory:
val textFile = sc.textFile("file:///usr/local/spark/mycode/wordcount/word.txt")
textFile.saveAsTextFile("file:///usr/local/spark/mycode/wordcount/writeback")
  5. View the result:
cd /usr/local/spark/mycode/wordcount/writeback/
cat part-00000

Loading from HDFS

  1. Start Hadoop first
  2. Create a directory:
hdfs dfs -mkdir -p /user/hadoop
  3. Upload the local word.txt to HDFS:
hdfs dfs -put /usr/local/spark/mycode/wordcount/word.txt /user/hadoop
  4. Back in the spark-shell window, write the contents of the textFile variable back out to another location, the writeback directory:
val textFile = sc.textFile("hdfs://hadoop:9000/user/hadoop/word.txt")
textFile.saveAsTextFile("hdfs://hadoop:9000/user/hadoop/writeback")
  5. View the result: hdfs dfs -cat /user/hadoop/writeback/part-00000

Word frequency count

In the spark-shell:

scala> val textFile = sc.textFile("file:///usr/local/spark/mycode/wordcount/word.txt")
scala> val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
scala> wordCount.collect()

Writing a Scala program for the word count

  1. Create the directories
cd /usr/local/spark/mycode/wordcount/
mkdir -p src/main/scala
  2. Create a Scala file
cd /usr/local/spark/mycode/wordcount/src/main/scala
vi test.scala   		//open it, write the following code, then save and exit;
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object WordCount {
    def main(args: Array[String]) {
        val inputFile =  "file:///usr/local/spark/mycode/wordcount/word.txt"
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
        val sc = new SparkContext(conf)
                val textFile = sc.textFile(inputFile)
                val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
                wordCount.foreach(println)       
    }
}
  3. Install sbt
    first download sbt: download sbt
mkdir /usr/local/sbt
cd /usr/local/sbt		//then transfer the downloaded sbt-launch.jar into this directory
vi ./sbt			add the following content, then save and exit;
#!/bin/bash
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
chmod u+x ./sbt
./sbt sbt-version			//this step takes quite a while; be patient
  4. Package the program project
cd /usr/local/spark/mycode/wordcount/
vi simple.sbt		//add the following lines

Pay attention to the Scala and Spark versions

name := "Simple Project"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.2"

After saving and exiting, continue as follows

cd /usr/local/spark/mycode/wordcount/
find .
cd /usr/local/spark/mycode/wordcount/  //make sure this directory is the current directory
/usr/local/sbt/sbt package
  5. Run the jar
/usr/local/spark/bin/spark-submit --class "WordCount"  /usr/local/spark/mycode/wordcount/target/scala-2.12/simple-project_2.12-1.0.jar

Writing the Spark WordCount program in Java

  1. Create a new Maven project in Eclipse
  2. Edit the generated pom.xml (it needs both the dependency and the packaging plugin below)
//add inside the <project></project> tags; **note**: do not copy this comment line into the pom
<dependencies> 
  	<dependency>
    	<groupId>org.apache.spark</groupId>
    	<artifactId>spark-core_2.12</artifactId>
    	<version>2.4.2</version>
	</dependency>
</dependencies>	

<build>
        <plugins>
            <plugin>
                   <artifactId>maven-assembly-plugin</artifactId>
                   <configuration>
                        <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                        <archive>
                             <manifest>
                                  <mainClass>spark.files.WordCountJava</mainClass>
                             </manifest>
                        </archive>
                   </configuration>
                   <executions>
                        <execution>
                             <id>make-assembly</id>
                             <phase>package</phase>
                             <goals>
                                  <goal>single</goal>
                             </goals>
                        </execution>
                   </executions>
              </plugin>
        </plugins>
    </build>

  3. Write the Java program
package spark.files;

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class WordCountJava {
	public static void main(String[] args) {
		// 1. create the SparkConf
		SparkConf sparkConf = new SparkConf()
				.setAppName("wordCountLocal")
				.setMaster("local");
		
		// 2. create the JavaSparkContext
		// the SparkContext is the entry point of the program
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// 3. read the input file
		JavaRDD<String> lines = sc.textFile("/user/hadoop/word.txt");
		
		// 4. split each line on spaces
		JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
			public Iterator<String> call(String t) throws Exception {
				return Arrays.asList(t.split(" ")).iterator();
			}
		});
		
		// 5. map each word to the <word, 1> format
		JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
			public Tuple2<String, Integer> call(String t) throws Exception {
				return new Tuple2<String, Integer>(t, 1);
			}
		});
		
		// 6. count the occurrences of each word
		JavaPairRDD<String, Integer> wordCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
			public Integer call(Integer v1, Integer v2) throws Exception {
				return v1 + v2;
			}
		});
		
		// 7. run an action and print the result
		wordCount.foreach(new VoidFunction<Tuple2<String,Integer>>() {
			public void call(Tuple2<String, Integer> t) throws Exception {
				System.out.println(t._1()+" "+t._2());
			}
		});
		
		// 8. close the SparkContext explicitly
		sc.close();
	}
}
  4. Package this Maven project and upload it to the CentOS host;
  5. Start Hadoop and Spark before running the jar;
  6. Run the jar to get the word-count result;
/usr/local/spark/bin/spark-submit followed by the jar's directory and file name (a full example is sketched below)
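
A full example; the jar name is hypothetical (it depends on the artifactId/version in pom.xml), and --class can be omitted if the assembly manifest already sets the main class:

/usr/local/spark/bin/spark-submit \
  --class spark.files.WordCountJava \
  /usr/local/jars/wordcount-java-1.0-jar-with-dependencies.jar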

The end …
