Chapter 1  Hadoop Overview
1. The two core designs of the Hadoop framework are HDFS and MapReduce. HDFS provides distributed storage for massive amounts of data, and MapReduce provides the distributed computation over it.
2. How is massive data stored? With the distributed file system, HDFS.
3. How is the data used? Analysis and processing are done with the MapReduce framework, which lets you implement big-data analysis jobs by writing code.
Chapter 2  Hadoop Installation
1. Hadoop can be installed in three modes: standalone mode, pseudo-distributed mode, and fully distributed mode.
2. Preparation before installing Hadoop (configuring the Linux system)
Ø Set the IP address and verify it with the command: ifconfig
Ø Stop the firewall (a firewall stopped this way will run again after the system reboots)
Run the command: service iptables stop
Verify: service iptables status
Ø Disable the firewall permanently
Check the firewall's startup status with:
chkconfig --list | grep iptables
Disable it at startup with:
chkconfig iptables off
Ø Set the hostname
Command: vi /etc/sysconfig/network
Set the following:
NETWORKING=yes
HOSTNAME=hadoop0  #hostname
Ø Bind the IP address to the hostname
Command: vi /etc/hosts
Add the following:
127.0.0.1 localhost
192.168.233.129 hadoop0
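As a quick check that the binding works (the hostname and address are the example values above), the following commands can be used:
ping -c 3 hadoop0
hostname
ping should resolve hadoop0 to 192.168.233.129, and hostname should print hadoop0 after logging in again.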
① Standalone installation
Ø Install the JDK
Edit the profile with: vi /etc/profile
Append at the end of the file:
export JAVA_HOME=/opt/tools/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH
Make the configuration take effect with: source /etc/profile
Verify the installation by running the javac command; if a prompt like the following appears, the JDK was installed successfully
where possible options include:
  -g                        Generate all debugging info
  -g:none                   Generate no debugging info
  -g:{lines,vars,source}    Generate only some debugging info
  -nowarn                   Generate no warnings
  -verbose                  Output messages about what the compiler is doing
  -deprecation              Output source locations where deprecated APIs are used
  -classpath <path>         Specify where to find user class files and annotation processors
  -cp <path>                Specify where to find user class files and annotation processors
  -sourcepath <path>        Specify where to find input source files
  -bootclasspath <path>     Override location of bootstrap class files
  -extdirs <dirs>           Override location of installed extensions
......
Ø Install Hadoop
Configure the Hadoop environment variables: vi /etc/profile
export HADOOP_HOME=/opt/hadoop/hadoop-1.2.1
export PATH=$HADOOP_HOME/bin:$PATH
Make the configuration take effect with: source /etc/profile
Set JAVA_HOME in hadoop-env.sh under the conf directory of the Hadoop installation:
export JAVA_HOME=/opt/tools/jdk1.7.0_67  #the Java home directory installed above
Note: before starting Hadoop, the NameNode must be formatted:
hadoop namenode -format
Verify: go to the bin directory of the Hadoop installation, run start-all.sh, and use the jps command to check whether Hadoop started successfully.
② Pseudo-distributed mode
Ø Set up password-less SSH login
Run ssh-keygen -t rsa to generate a public/private key pair; the keys are stored under ~/.ssh. Since .ssh is a hidden directory, use ll -a to see it.
Create authorized_keys, which holds the public key, with the command: cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
Finally, test the login with the ssh command:
ssh localhost
ssh hadoop0
Use more authorized_keys to view the stored public keys for localhost and hadoop0.
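If the password-less login still prompts for a password, a common cause (not covered in these notes) is that the key files are too permissive; tightening the modes usually fixes it:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys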
Ø Configure core-site.xml
This is the Hadoop core configuration file, used to configure HDFS, MapReduce, and so on.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop0:9000</value>
</property>
<!-- HDFS storage path -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/tmp</value>
</property>
</configuration>
Ø Configure hdfs-site.xml
This file configures the HDFS daemons: the NameNode, Secondary NameNode, and DataNodes.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Number of replicas -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
Ø Configure mapred-site.xml
This file configures the MapReduce daemons: the JobTracker and TaskTrackers.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop0:9001</value>
</property>
</configuration>
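With core-site.xml, hdfs-site.xml, and mapred-site.xml in place, the pseudo-distributed setup can be checked end to end; since hadoop.tmp.dir now points to a new location, the NameNode is formatted first (commands as used elsewhere in these notes):
hadoop namenode -format
start-all.sh
jps
hadoop fs -ls /
jps should list NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker.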
③ Fully distributed mode
Ø Cluster layout. Master node (1, hadoop0): NameNode, JobTracker, SecondaryNameNode
Slave nodes (2, hadoop1 and hadoop2): DataNode, TaskTracker
Ø Regenerate the SSH key files on each node
On every node, regenerate the key pair with ssh-keygen -t rsa
Create authorized_keys, which holds the public key, with the command: cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
Ø Password-less SSH login between every pair of nodes
hadoop0 and hadoop2 each send their public key to hadoop1; afterwards ssh hadoopX no longer asks for a password:
ssh-copy-id -i hadoop1    (run on hadoop0)
ssh-copy-id -i hadoop1    (run on hadoop2)
Then copy hadoop1's merged authorized_keys back to hadoop0 and hadoop2 with scp /root/.ssh/authorized_keys hadoopX:/root/.ssh/ (run on hadoop1, once with hadoopX = hadoop0 and once with hadoopX = hadoop2).
Finally, run the following on hadoop1:
ssh hadoop0
ssh hadoop1
ssh hadoop2
ssh localhost
Then run scp /root/.ssh/known_hosts hadoopX:/root/.ssh/ to copy hadoop1's known_hosts to hadoop0 and hadoop2.
Ø Edit masters
This file lists the machines (one per line) that run the SecondaryNameNode.
Change localhost to hadoop0.
Ø Edit slaves
This file lists the machines (one per line) that run the DataNode and TaskTracker.
Add the following entries:
hadoop1
hadoop2
Ø Starting and stopping Hadoop
① start-all.sh and stop-all.sh start or stop all of the daemons.
② start-dfs.sh and stop-dfs.sh start or stop the NameNode, SecondaryNameNode, and DataNodes.
③ hadoop-daemon.sh and hadoop-daemons.sh start or stop a single daemon (on the local node or on the slave nodes, respectively).
Start in the following order:
hadoop-daemon.sh start namenode
hadoop-daemons.sh start datanode    (use hadoop-daemons.sh to start the DataNodes on the slave nodes; use hadoop-daemon.sh to start a DataNode on the local machine)
hadoop-daemon.sh start secondarynamenode
hadoop-daemon.sh start jobtracker
hadoop-daemons.sh start tasktracker
Stop in the following order:
hadoop-daemons.sh stop tasktracker
hadoop-daemon.sh stop jobtracker
hadoop-daemon.sh stop secondarynamenode
hadoop-daemons.sh stop datanode
hadoop-daemon.sh stop namenode
Check the daemons on hadoop0 with jps (NameNode, SecondaryNameNode, and JobTracker are expected).
Check the daemons on hadoop1 with jps (DataNode and TaskTracker are expected).
Check the daemons on hadoop2 with jps (DataNode and TaskTracker are expected).
To be able to start and stop the whole cluster from any node, the configuration files under conf on the slave nodes must be kept identical to the master node's, e.g. core-site.xml, masters, slaves, and so on.
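A simple way to keep the slaves in sync is to push the master's conf directory to them (paths assume the installation directory used earlier in these notes):
scp /opt/hadoop/hadoop-1.2.1/conf/* hadoop1:/opt/hadoop/hadoop-1.2.1/conf/
scp /opt/hadoop/hadoop-1.2.1/conf/* hadoop2:/opt/hadoop/hadoop-1.2.1/conf/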
Ø Access the HDFS web page
To reach the Linux machines from Windows, add the following to C:\Windows\System32\drivers\etc\hosts:
192.168.233.129 hadoop0
192.168.233.130 hadoop1
192.168.233.131 hadoop2
Then enter hadoop0:50070 in a browser to open the HDFS status page.
Ø Access the MapReduce web page
Enter hadoop0:50030 in a browser to open the JobTracker status page.
Chapter 3  MapReduce Quick Start
Ø Create the files file1 and file2 in /opt/hadoop/data/input:
file1 content:
hello, i love coding
are you ok?
hello, i love hadoop
are you ok?
file2 content:
hello i love coding
are you ok ?
hello i love hadoop
are you ok
Put the two files into HDFS with:
hadoop fs -put /opt/hadoop/data/input/file* /usr/hadoop/input
/usr/hadoop/input is a directory in HDFS and must be created first with:
hadoop fs -mkdir /usr/hadoop/input
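The upload can be confirmed with the standard HDFS listing commands (paths as above):
hadoop fs -ls /usr/hadoop/input
hadoop fs -cat /usr/hadoop/input/file1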
Ø In Eclipse, create a Java project named wordcount, add a package org.dragon.hadoop.mr, and create three classes in the package: WordCountMapper, WordCountReducer, and WordCountMain.
WordCountMapper class:
package org.dragon.hadoop.mr;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper class for the word count example: emits (word, 1) for every token.
 * @author ACER
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();
    private final static IntWritable one = new IntWritable(1);

    /*
     * (non-Javadoc)
     * @see org.apache.hadoop.mapreduce.Mapper#map(KEYIN, VALUEIN, org.apache.hadoop.mapreduce.Mapper.Context)
     * map() is called once for every input line.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get the text of the current line
        String lineValue = value.toString();
        // Split the line into tokens
        StringTokenizer st = new StringTokenizer(lineValue);
        // Iterate over the tokens
        while (st.hasMoreTokens()) {
            // Get the next token
            String wordValue = st.nextToken();
            // Set the map output key
            word.set(wordValue);
            // Emit (word, 1)
            context.write(word, one);
        }
    }
}
WordCountReducer class:
package org.dragon.hadoop.mr;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Reducer class for the word count example.
 * @author ACER
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    /*
     * (non-Javadoc)
     * @see org.apache.hadoop.mapreduce.Reducer#reduce(KEYIN, java.lang.Iterable, org.apache.hadoop.mapreduce.Reducer.Context)
     * reduce() sums the counts of each distinct key.
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Running total for this key
        int sum = 0;
        // Iterate over all values of the key and accumulate
        for (final IntWritable value : values) {
            sum += value.get();
        }
        // Set and emit the total
        result.set(sum);
        context.write(key, result);
    }
}
WordCountMain class:
package org.dragon.hadoop.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Driver class for the word count example.
 * @author ACER
 */
public class WordCountMain {

    /**
     * Word count driver.
     * @param args input and output paths
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // Load the configuration
        Configuration configuration = new Configuration();
        String[] otherArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();
        // An input path and an output path are required
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // Create the job with its configuration and name
        Job job = new Job(configuration, "wc");
        // 1. Set the class that contains the job
        job.setJarByClass(WordCountMain.class);
        // 2. Set the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        // Use the reducer as a combiner as well
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        // 3. Set the input and output directories
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        // 4. Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 5. Submit the job, wait for completion, show progress on the client, and exit
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Ø Export the project as a jar through Eclipse's Export wizard:
Use the main function of WordCountMain as the entry point.
Ø Copy the jar into the Hadoop installation directory, /opt/hadoop/hadoop-1.2.1, and run it with:
hadoop jar wc.jar /usr/hadoop/input/file* /usr/hadoop/output6
After the job finishes, the results can be viewed through hadoop0:50070,
or directly on hadoop0 with the command:
hadoop fs -text /usr/hadoop/output6/part-r-00000
! 5
;;; 1
? 2
aaa 3
aaaaaaaaaaaaaaaaaaalll 1
are 4
bbbb 1
beiging! 1
chongqing! 1
coding 2
dangtu! 1
ddd 2
hadoop 2
hdiohh 1
hello 3
hello, 2
i 6
jkk 1
jn 1
kkk 4
love 5
nnn 1
nnnn 1
ok 2
ok? 2
qqq 2
s 2
sqq 1
ss 1
sss 1
sws 1
to 1
welcome 1
you 4
Chapter 4  The Hadoop Distributed File System in Detail
Ø Setting the block replication factor for the cluster
Option 1: edit hdfs-site.xml under conf with the following configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
With this option the parameter only takes effect for files written to DFS afterwards; although the configuration file is changed, the replication count of files written earlier does not change.
Option 2: change the replication count with a command:
hadoop fs -setrep -R 3 /
This changes the replication count of everything in HDFS and does not require an HDFS restart. Option 1 requires restarting HDFS to take effect; option 2 also changes the replication count of files written earlier.
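The effective replication factor of a path can be checked with standard HDFS commands (the directory below is the one used earlier in these notes):
hadoop fs -ls /usr/hadoop/input
hadoop fsck /usr/hadoop/input -files -blocks
In the -ls output, the second column is the replication factor of each file.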
Ø Commonly used Hadoop API classes
1. The FsUrlStreamHandlerFactory class
package org.apache.hadoop.fs.api;

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

/**
 * Reads file data from HDFS through a java.net.URL.
 * @author ACER
 */
public class HdfsURLReader {

    /**
     * setURLStreamHandlerFactory may only be called once per JVM,
     * so it is invoked from a static initializer.
     */
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        InputStream inputStream = null;
        try {
            inputStream = new URL("hdfs://hadoop0:9000/usr/hadoop/output8/part-r-00000").openStream();
            // Copy the stream to stdout with a 1024-byte buffer; do not close automatically
            IOUtils.copyBytes(inputStream, System.out, 1024, false);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the stream explicitly, since copyBytes was told not to
            IOUtils.closeStream(inputStream);
        }
    }
}
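These API examples need the Hadoop jars and configuration on the classpath; one convenient way to run them is through the hadoop launcher, assuming the classes are packed into a jar whose name here, hdfsapi.jar, is only illustrative:
hadoop jar hdfsapi.jar org.apache.hadoop.fs.api.HdfsURLReader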
2. The FileSystem class
package org.apache.hadoop.fs.api;

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/**
 * Reads a file using the FileSystem class.
 * @author ACER
 */
public class FileSystemReader {

    /**
     * @param args
     */
    public static void main(String[] args) {
        InputStream inputStream = null;
        try {
            Configuration conf = new Configuration();
            String uri = "hdfs://hadoop0:9000/usr/hadoop/output7/part-r-00000";
            // Obtain a FileSystem handle for the HDFS URI
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            // Open the file and copy it to stdout with a 4096-byte buffer
            inputStream = fs.open(new Path(uri));
            IOUtils.copyBytes(inputStream, System.out, 4096, false);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeStream(inputStream);
        }
    }
}
3. The FileStatus class
package org.apache.hadoop.fs.api;

import java.net.URI;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Reads the metadata of a file or directory through the FileStatus class.
 * @author ACER
 */
public class FileStatusMetaData {

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            String dirURI = "hdfs://hadoop0:9000/usr/hadoop/output8";
            FileSystem fs = FileSystem.get(URI.create(dirURI), conf);
            FileStatus fileStatus = fs.getFileStatus(new Path(dirURI));
            if (fileStatus.isDir()) {
                System.out.println("============ This is a directory!");
            }
            System.out.println("Directory path: " + fileStatus.getPath());
            System.out.println("Directory length: " + fileStatus.getLen());
            System.out.println("Directory modification time: " + new Date(fileStatus.getModificationTime()));
            System.out.println("Directory last access time: " + new Date(fileStatus.getAccessTime()));
            System.out.println("Directory replication: " + fileStatus.getReplication());
            System.out.println("Directory block size: " + fileStatus.getBlockSize());
            System.out.println("Directory owner: " + fileStatus.getOwner());
            System.out.println("Directory group: " + fileStatus.getGroup());
            System.out.println("Directory permissions: " + fileStatus.getPermission());
            System.out.println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>");
            System.out.println();

            String fileURI = "hdfs://hadoop0:9000/usr/hadoop/output8/part-r-00000";
            FileStatus fileStat = fs.getFileStatus(new Path(fileURI));
            if (!fileStat.isDir()) {
                System.out.println("============ This is a file!");
            }
            System.out.println("File path: " + fileStat.getPath());
            System.out.println("File length: " + fileStat.getLen());
            System.out.println("File modification time: " + new Date(fileStat.getModificationTime()));
            System.out.println("File last access time: " + new Date(fileStat.getAccessTime()));
            System.out.println("File replication: " + fileStat.getReplication());
            System.out.println("File block size: " + fileStat.getBlockSize());
            System.out.println("File owner: " + fileStat.getOwner());
            System.out.println("File group: " + fileStat.getGroup());
            System.out.println("File permissions: " + fileStat.getPermission());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
4. The BlockLocation class
package org.apache.hadoop.fs.api;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Finds the locations of a file's blocks in the cluster.
 * @author ACER
 */
public class BlockLocation {

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            String filePath = "hdfs://hadoop0:9000/usr/hadoop/input/access.log";
            FileSystem fs = FileSystem.get(URI.create(filePath), conf);
            Path fsPath = new Path(filePath);
            FileStatus fileStatus = fs.getFileStatus(fsPath);
            // The fully qualified name is needed because this class shadows
            // org.apache.hadoop.fs.BlockLocation
            org.apache.hadoop.fs.BlockLocation[] blkLocations =
                    fs.getFileBlockLocations(fileStatus, 0, fileStatus.getLen());
            int blkLen = blkLocations.length;
            for (int i = 0; i < blkLen; i++) {
                // Hosts that store a replica of block i
                String[] hosts = blkLocations[i].getHosts();
                for (int j = 0; j < hosts.length; j++) {
                    System.out.println("blk_" + i + "_location:" + hosts[j]);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Appendix A
I. Hive Installation
Step 1: unpack the Hive package
tar -zxvf apache-hive-1.2.1-bin.tar.gz
Step 2: add the Hive environment variables
vi /etc/profile
#hive1.2.1
export HIVE_HOME=/opt/hive/hive-1.2.1
export PATH=$PATH:$HIVE_HOME/bin
Apply the changes: source /etc/profile
Step 3: create Hive's default directory /user/hive/warehouse on HDFS and grant permissions
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod -R 777 /user
hadoop fs -chmod -R 777 /tmp
Step 4: install MySQL (it can also be on another machine) and create a database named hive in it
Create hive-site.xml in Hive's conf directory and configure MySQL as the metastore database:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.233.1:3306/hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>sxp</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>1234</value>
</property>
</configuration>
Note: besides configuring the metastore in the XML above, the following alternative can be used (it did not seem to work when tried, possibly a version issue):
① Put only the following configuration in hive-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Controls whether the client creates and uses a local metastore server; false means a remote metastore server is used -->
<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.233.1:3306/hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
</configuration>
② Create a jpox.properties file under conf:
javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
javax.jdo.option.ConnectionURL=jdbc:mysql://192.168.233.1:3306/hive
javax.jdo.option.ConnectionUserName=sxp
javax.jdo.option.ConnectionPassword=1234
Step 5: move the MySQL driver jar into Hive's lib directory.
Step 6: start Hive by typing hive.
Step 7: create a table in Hive:
Create table user(
userid bigint,
user string,
age tinyint,
address string
)row format delimited fields terminated by '\t';
If the following error occurs:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.)
go into MySQL and change the character set of the hive database to latin1, for example:
alter database hive character set latin1;
Ø Loading data
① Load data from the local file system into Hive:
load data local inpath '/opt/data/input/access.log' overwrite into table access_log;
② Load data from HDFS into Hive:
load data inpath '/opt/data/input/access.log' overwrite into table access_log;
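The access_log table used here is not defined earlier in these notes; a hypothetical definition consistent with the later query on http_refer could look like:
create table access_log(
ip string,               -- hypothetical column
request_time string,     -- hypothetical column
http_refer string        -- referenced by the group-by query below
)row format delimited fields terminated by '\t';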
Ø Using Hive
① List the tables in Hive with show tables;
hive> show tables
> ;
OK
access_log
usr
Time taken: 4.795 seconds, Fetched: 2 row(s)
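The target table access_log_count used in the next step must also exist; a hypothetical definition matching the two columns produced by the query:
create table access_log_count(
http_refer string,       -- the grouping key
refer_count bigint       -- hypothetical name for count(http_refer)
)row format delimited fields terminated by '\t';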
② Use insert overwrite to write the result of a query over one table into another table:
hive> insert overwrite table access_log_count
> select http_refer,count(http_refer) from access_log
> group by http_refer;
The result can then be queried with select * from access_log_count:
hive> select * from access_log_count;
OK
www.baidu.com 337216
www.google.com 338326
www.qq.com 338436
www.sina.com 338494
www.taobao.com 336945
Time taken: 0.237 seconds, Fetched: 5 row(s)
II. Sqoop Installation
Step 1: unpack sqoop-1.4.6.bin__hadoop-1.0.0.tar.gz
Step 2: vi /etc/profile
Add the following configuration:
#sqoop1.4.6
export SQOOP_HOME=/opt/sqoop/sqoop-1.4.6
export PATH=$PATH:$SQOOP_HOME/bin
If the following error appears when Sqoop is used, the mysql-connector-java-xxx.jar version is at fault; mysql-connector-java-5.1.32-bin.jar can be used instead:
16/05/24 16:32:11 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/05/24 16:32:11 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/05/24 16:32:12 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/05/24 16:32:12 INFO tool.CodeGenTool: Beginning code generation
16/05/24 16:32:12 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `access_log` AS t LIMIT 1
16/05/24 16:32:12 ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@499fd0 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@499fd0 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:930)
at com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:2622)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1846)
Import the data of the MySQL table access_log into HDFS:
sqoop import --append --connect jdbc:mysql://192.168.233.1:3306/access --username sxp --password 1234 --target-dir /usr/hadoop/input/access.log --m 1 --table access_log --fields-terminated-by '\t'
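After the import, the target directory can be inspected with standard HDFS commands; for a map-only import the output typically lands in part-m-00000 files:
hadoop fs -ls /usr/hadoop/input/access.log
hadoop fs -text /usr/hadoop/input/access.log/part-m-00000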
Export data from the HDFS file back into MySQL:
sqoop export --connect jdbc:mysql://192.168.233.1:3306/access --username sxp --password 1234 --export-dir /usr/hadoop/input/access.log --table access_log --fields-terminated-by '\t'
The import/export can also be scheduled as a cron job on the Linux system:
crontab -e and add the following:
30 17 * * * /opt/sqoop/sqoop-1.4.6/bin/sqoop import --append --connect jdbc:mysql://192.168.233.1:3306/access --username sxp --password 1234 --target-dir /usr/hadoop/input/access.log --m 1 --table access_log --fields-terminated-by '\t' 2>>/opt/data/input/error.log
Note: when the import/export task is started from cron, the following points matter:
1. The full path to sqoop must be used: /opt/sqoop/sqoop-1.4.6/bin/sqoop
2. sqoop-env-template.sh in Sqoop's conf directory must be copied and renamed to sqoop-env.sh
3. HADOOP_COMMON_HOME and HADOOP_MAPRED_HOME must be set in sqoop-env.sh, as follows:
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/hadoop/hadoop-1.2.1
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-1.2.1
If they are not set, errors like the following are reported:
Cron mails the error output to root, for example:
Error: /opt/sqoop/sqoop-1.4.6/bin/../../hadoop does not exist!
Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.
Error: /opt/sqoop/sqoop-1.4.6/bin/../../hadoop-mapreduce does not exist!
Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.