hive

最新推荐文章于 2023-06-16 10:22:46 发布

小鑫鑫吖

最新推荐文章于 2023-06-16 10:22:46 发布

阅读量291

点赞数

CC 4.0 BY-SA版权

本文链接：https://blog.youkuaiyun.com/dingwei_zhaoxin/article/details/82851344

Hive
   Hive是一款数据仓库的基础架构
   Hive是一款SQL的解析引擎
复合数据类型
   array   ----java中的数组
       hobby array<string>
   map       ----java中的map
       score map<string, int>
   struct   ----java中的对象
       address struct<province:string, city:string, zip:int>
   create table tbl(
       id int,
       name string,
       birthday date,
       hobby array<string>,
       score map<string, int>,
       address struct<province:string, city:string, zip:int>
   ) row format delimited
   fields terminated by '|'
   collection items terminated by ','
   map keys terminated by ':'
   lines terminated by '\n';
Hive中的两种表：
   内部表(MANAGED_TABLE)

   外部表(EXTERNAL_TABLE)
       表中的数据只是表定义对地址的一个引用，所以其数据的存在与否和表定义无关
       外部表的两大好处：
           可以提高数据的安全
           操作共享数据
   内---->外
       alter table xxx set tblproperties("EXTERNAL", "TRUE");
   外---->内
       alter table xxx set tblproperties("EXTERNAL", "FALSE");
-------------------------------------------------------------------------------
Hive中的两种功能表
   分区表
       /user/hive/warehouser/t/
                   http-access.log.2017-06-25
                   http-access.log.2017-06-26
                   http-access.log.2017-06-27
                   http-access.log.2017-06-28
       我要查询在2017-06-26这一天的用户数据
           select * form t where dt = '2017-06-26';
       ====>这么做会将所有的数据加载到内存中，然后才判断每一条数据过滤中其中符合条件的数据，
       这样很有可能造成内存OOM，效率极低
       ------>可以采用Hive中的分区表的方式将上述的大表拆分成多个小表
           按照某个特定字段进行拆分，把这个拆分字段就成分区字段，
           把拆分后的整个表称之为分区表

           在分区表中的分区字段，就相当于原来表中的一个普通的字段
       /user/hive/warehouser/t/
                           dt=2017-06-25/
                       http-access.log.2017-06-25
                           dt=2017-06-26/
                       http-access.log.2017-06-26
                           dt=2017-06-27/
                       http-access.log.2017-06-27
                           dt=2017-06-28/
                       http-access.log.2017-06-28

           select * form t where dt = '2017-06-26';
       定义一张分区表：
           create table t9_partition (
               id int,
               name string,
               birthday date,
               online boolean
           ) partitioned by (dt string comment "partition fields");
       加载数据：
           load data local inpath '/opt/data/hive/hive-t9.txt' into table t9_partition;
           FAILED: SemanticException [Error 10062]: Need to specify partition columns because the destination table is partitioned
           加载数据需要加载到特定的分区中
           正确的操作方式：没有分区会自动创建分区
               load data local inpath '/opt/data/hive/hive-t9.txt' into table t9_partition partition(dt='2017-06-25');
       分区表的基本操作：
           查询分区：
               show partitions t9_partition;
           添加分区：
               alter table t9_partition add partition(dt='2017-06-27');
           删除分区：
               alter table t9_partition add partition(dt='2017-06-28');
       以上的分区表可以提高我们对查询一整张大表的时候的查询效率，在工作过程中使用最频繁。
   桶表
       /user/hive/warehouser/t/
                           province='甘肃'/
                       person-information.gs
                           province='河南'/
                       person-information.hn
                           province='西藏'/
                       person-information.xz
                           province='广东'/
                       person-information.gd
       就是说，在做一些分区操作的时候，因为分区的特点，造成分区数据不均匀，进而影响整体的查询效率，
       有的分区查询很快，有的很慢，所以需要将这些数据重新进行一个划分，做到相对均匀。

       ----->解决方式，可以使用分桶的方式将数据相对均匀的分散到各个文件中。
       分桶的方式实际上就是采用hash分桶，hash分桶到底是怎么来做的呢？

       桶表的两大好处：
           1、在多表关联的时候，可以提高查询效率，前提需要对关联字段做一个分桶
           2、抽样
       创建一张桶表：
           create table t10_bucket(id int) clustered by (id) into 3 buckets;
           向桶表中加载数据：
               只能通过其他的进行转换，不能直接使用load加载
               insert into t10_bucket select id from t1;
视图：
   view
本地模式：
   可以通过set hive.exec.mode.local.auto=true;来开启运行的本地模式
   本地模式可以提高运行的效率，主要用作平时测试，
   运行不会再集群中区执行，只是从集群中获取一部分数据，在本地执行。
   我们在操作桶表的时候，不支持本地模式
数据的加载和导出
   []==>可选，<> ==>必须
   加载
       load
           load data [local] inpath 'path' [overwrite] into table [partition_psc];
           local：
               有==>从linux本地加载数据
               无==>从hdfs加载数据
           overwrite
               有==>覆盖掉表中原来的数据
               无==>在原来的基础上追加新的数据
       从其他表加载
           insert <overwrite|into> table t_des select [...] from t_src [...];
               overwrite
                   有==>覆盖掉表中原来的数据
                   无==>在原来的基础上追加新的数据
           ==>会转化成为MR执行
           需要注意的地方：t_des中列要和select [...] from t_src这里面的[...]一一对应起来。
       创建表的时候加载
           create table t_des as select [...] from t_src [...];
           这样会创建一张表，表结构为select [...] from t_src中的[...]
       动态分区的加载
           快速复制表结构
               create table t_d_partition like t_partition_1;
           hive (default)> show partitions t_partition_1;
               OK
               partition
               year=2015/class=bigdata
               year=2015/class=linux
               year=2016/class=bigdata
               year=2016/class=linux
           要将2016的数据都到入到t_d_partition的相关的分区中
               insert into table t_d_partition partition(class, year=2016) select id, name, class from t_partition_1 where year=2016;
           要将t_partition_1中所有数据都到入到t_d_partition的相关的分区中
               insert overwrite table t_d_partition partition(year, class) select id, name, year, class from t_partition_1;
           从hdfs上面删除的数据，并没有删除表结构，我们show partitions t_d_partition;是从metastore中查询出来的内容，如果你
           之手动删除的hdfs上面数据，它的元数据信息依然在。
           insert into t10_p_1 partition(year=2016, class) select * from t_partition_1;
           FAILED: SemanticException [Error 10094]: Line 1:30 Dynamic partition cannot be the parent of a static partition 'professional'
           动态分区不能成为静态分区的父目录
               需要将hive.exec.dynamic.partition.mode设置为nonstrict
           <property>
               <name>hive.exec.max.dynamic.partitions</name>
               <value>1000</value>
               <description>Maximum number of dynamic partitions allowed to be created in total.</description>
           </property>
       import
           import table stu from '/data/stu';
   导出
       1°、在hdfs的直接上操作
           hadoop fs -cp src_uri dest_uri
或者
           hive> export table tblName to 'hdfs_uri';
       2°、在终端使用directory
           insert overwrite [local] directory 'linux_fs_path' select ...from... where ...;
--------------------------------------------------------------------------------------
排序
   传统的排序order by
   hive中提供的排序方式有
       distribute by   将数据按照对应字段做拆分
       sort by           单个reducer的排序
       二者综合起来的作用就是order by
       distributed by field sort by field [asc|desc]
--------------------------------------------------------------------------------------
Hive的函数
   函数的定义和java、mysql一样
   函数有三种
       UDF（User Definition Function 用户定义函数）
           一路输入，一路输出
           sin(30°)=1/2
       UDAF（User Definition Aggregation Function 聚合函数）
           多路输入，一路输出
           max min count sum avg等等
       UDTF（User Definition Table Function 表函数）
           一路输入，多路输出
           explode
       show functions;列出hive中可用的函数列表
       desc function func_name;
   case when   ---->switch或if else
   if           ---->三元运算符
   explode       ---->将数组中的元素转换成多行数据
       a = [1, 2, 3, 4] explode(a) ===>
           1
           2
           3
           4
   split       ---->就是字符串中的split函数

   array       ---->
   collect_set
   collect_list
   concat_ws   ---->使用给定的字符串来连接元素
   --------------
   row_number   ---->分组排序或者二次排序

   第一个案例：使用hql统计wordcount(必须掌握)
       分析：
           hello   you
           hello   me
           hello   he
       使用mr的的过程
           step1----->split("\t")--->
               ["hello", "you"]
               ["hello", "me"]
               ["hello", "he"]
           step2----->遍历每一个数组，将数组中的每一个值，作为key，value为1写出去<key, 1>
               <"hello", 1>
               <"you", 1>
               <"hello", 1>
               <"me", 1>
               <"hello", 1>
               <"he", 1>

           step3,shuffle--->
               <"hello", [1, 1, 1]>
               <"you", 1>
               <"me", 1>
               <"he", 1>
           step 4, reduce ====>reduceByKey
       使用hql
           step 1 (mydb1)> select split(line, "\t") from test;
                   ["hello","you"]
                   ["hello","he"]
                   ["hello","me"]
           step 2 将数组中的每一行数据转化为多行
                   (mydb1)> select explode(split(line, "\t")) from test;
                       hello
                       you
                       hello
                       he
                       hello
                       me
           step 3 在step2的基础之上进行group by 即可
               select
                   w.word, count(w.word) as count
               from (select explode(split(line, "\t")) word from test) w
               group by w.word order by count desc;
   第二个案例：case when将一下对应的部门名称显示出来
           1--->学工组，2--->行政组,3---->销售组,4---->研发组,5---->其它
           hive (mydb1)> select * from t1;
           1
           2
           3
           4
           5
           select
           id,
           case id
           when 1 then "学工组"
           when 2 then "行政组"
           when 3 then "销售组"
           when 4 then "研发组"
           else "行政组"
           end
           from t1;
           分类显示
           1   学工组
           2   行政组
           3   销售组
           4   研发组
           5   其它
   第三个案例：row_number 二次排序
       三种连接
           交叉连接
               across join，会有笛卡尔积，所以不用
           内连接(等值连接)
               inner join
               将左表和右表中能够匹配的上的数据做输出
           外链接
               outer join
               左外连接(left outer join)

               右外链接(right outer join)

       根据员工、部分、薪资，这三张表，
           1、分组显示每一个部分员工的信息（启动显示部分名称，员工姓名，员工性别[男|女]，员工薪资），同时分组按照员工薪资降序排序
               select
               e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary,
               row_number() over(partition by e.deptid order by s.salary desc) rank
               from t_dept d
               left join t_employee e on d.id = e.deptid
               left join t_salary s on e.id = s.empid
               where s.salary is not null;
           2、获取显示部门薪资top2的员工信息
               select
               tmp.*
               from
               (select
               e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary,
               row_number() over(partition by e.deptid order by s.salary desc) rank
               from t_dept d
               left join t_employee e on d.id = e.deptid
               left join t_salary s on e.id = s.empid
               where s.salary is not null) tmp
               where tmp.rank < 3;
           如果查询的是单表，则可以不用子查询，只用用having来获取即可(having rank < 3)
   第四个案例、行转列
       create table t11_user(id int, name string) row format delimited fields terminated by ',';
       create table t11_address(id int, address string) row format delimited fields terminated by ',';
       第一步：做表关联，分析结果
           select u.id, u.name, a.address from t11_user u join t11_address a on u.id = a.uid;
       第二步：对出现的多个结果按照id来拼接字符串
           select u.id, max(u.name), concat_ws("," collect_set(a.address)) as addr
               from t11_user u join t11_address a
               on u.id = a.id group by u.id;
   第五个案例、列转行
       准备数据
       create table t_user_addr as
           select u.id, max(u.name), concat_ws("," collect_set(a.address)) as addr
                   from t11_user u join t11_address a
                   on u.id = a.id group by u.id;
       就使用explode行数就可以了
       select explode(split(addr, ",") from t_user_addr;
       查看多个字段
       select id, name, address from t12_user_addr lateral view explode(split(addr, ",")) a as address;
Hive的自定义函数
   UDF/UDTF/UDAF....
   UDF
   自定义函数需要遵循的6个步骤：
   1°、自定义一个Java类来继承UDF类
   2°、覆盖其中的evaluate()的函数，有系统去调用
   3°、将写好的程序打成一个jar，上传至服务器
   4°、将3°中的jar加载到hive的classpath
       hive终端执行add jar jar_path;
   5°、给自定义函数设置一个临时的名称，也就是说要创建一个临时的函数
       create tempapory function 函数名 as '写的evalutor所在类的全类名';
   6°、执行函数结束之后，可以手动销毁临时函数，或者不用管，因为当前会话消失，函数自动销毁
   需求：
   要根据用户的birthday，统计对应的生肖和星座
Hive中的MR
   1、什么情况下Hive可以避免MR
       如果我们的hql语句转化为hdfs的命令来查看hdfs中的数据化，自然不会转化为MR，
       那查看HDFS数据有哪些命令呢？
           hadoop fs -text/cat/tail ...
           这些都是全表，或是全字段的扫描，所以联想到我们hql---->
           select * from tblName的时候可以避免。
           第一种情况：普通的表
               select * from tblName [limit limit_num];
           第二种情况：分区
               select * from tblName where partition_sec [limit limit_num];
   2、hql转化为MR的执行过程
       通过查看hive日志，发现在日志文件中主要执行的类是org.apache.hadoop.hive.ql.Driver里面的各个方法，
       下面将各个方法的执行次序按照日志文件中的顺序提取出来了
       查看源码，读懂MR的过程，可以按照下面的顺序来读
       <run>
           <TimeToSubmit>
               <compile>
                   <parse></parse> --->hql拆解的过程
                   <semanticAnalyze> -->词义分析阶段
                       <partition-retrieving></partition-retrieving>
                   </semanticAnalyze>
               </compile>---->有了mr
           <execute> --->开始执行mr
           </TimeToSubmit>
               <runTasks>---------------------->org.apache.hadoop.hive.ql.exec.Task
                   <serializePlan></serializePlan>
                   <getSplits></getSplits> ---map的获取inputSplits
               </runTasks>
           </execute>
       </run>
   3、控制Hive中MR的数量
       1°、控制Map Task的数量
           Mapper task数量有inputsplit数量决定。Block size由参数dfs.block.size决定。默认的dfs.block.size大小为134217728字节（128M）
           但是在执行mapper task时使用的InputFormat是由hive.input.format参数决定的，
           参数的值是org.apache.hadoop.hive.ql.io.CombineHiveInputFormat，意味着在执行mapper task时，会对大量的小文件进行合并。
           这样会提高我们map task的执行效率。
       2°、控制Reduce Task的数量
           决定参数主要有：
           参数1   mapred.reduce.tasks（默认为-1）如果是负数，那么推测reducer任务数量。
           参数2 hive.exec.reducers.bytes.per.reducer决定每个reducer task可以处理的最大数据量，默认是256MB。
           参数3 hive.exec.reducers.max决定了最大可以只写的任务数量，默认是1009.
               当mapred.reduce.tasks值为负数是，会自动设置reudce的数量，但是是有上限的最大为hive.exec.reducers.max
           两个经验参考值
               计算reducer数的公式很简单N=min(参数2，总输入数据量/参数1)
               依据Hadoop的经验，可以将参数2设定为0.95*(集群中TaskTracker个数)。
                   正确的reduce任务的个数应该是0.95或者1.75 ×（节点数 ×mapred.tasktracker.tasks.maximum参数值）。
                   如果任务数是节点个数的0.95倍，那么所有的reduce任务能够在 map任务的输出传输结束后同时开始运行。
                   如果任务数是节点个数的1.75倍，那么高速的节点会在完成他们第一批reduce任务计算之后开始计算第二批 reduce任务，
                   这样的情况更有利于负载均衡。
       3°、Reduce数量只有一个的情况：
           1°进行汇总操作的时候，没有使用group by 分组
           2°使用了子句order by 排序（全局排序，只能在单节点进行排序）
           3°笛卡尔积M * N （表连接，但是没有on条件）
           为了提高reduce的执行效率，我们可以开启一个参数hive.exec.parallel=true（默认为false，开启并行执行mr），
           来提高我们的MR的执行效率，有点类似多线程。同时，有最大的线程并行数量限制，可通过set hive.exec.parallel.thread.number获取
二、Hive中的排序
   order by---->全局排序
       但是会有一个reduce task，这样效率不高
   可以用hive中特有的排序---->局部排序
       sort by
       distributed by
       通常这两个是联合起来用的 distributed by col sort by col,表示到各个reducer中进行局部排序
       同时也可以用cluster by来代替distributed by col sort by col，但是cluster by不支持asc，desc
三、Hive中文件的存储类型【必须记住hive支持的集中存储类型，parquet】
   hive中文件的存储类型，我们可以通过hive-site.xml配置文件的配置项（hive.default.fileformat）来获取
       <property>
           <name>hive.default.fileformat</name>
           <value>TextFile</value>
           <description>
           Expects one of [textfile, sequencefile, rcfile, orc].
           Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
           </description>
       </property>
   1、TextFile
       Hive默认格式，数据不做压缩，磁盘开销大，数据解析开销大。可结合Gzip、Bzip2、Snappy等使用（系统自动检查，执行查询时自动解压），但使用这种方式，hive不会对数据进行切分，
       从而无法对数据进行并行操作。
       在执行之前开启压缩输出
           set hive.exec.compress.output=true;
           set mapred.output.compress=true;
           set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;/snappy/lz4/biz/default
           set io.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
       create table t_text(
           line string
       ) stored as textfile;
   2、SequenceFile
       SequenceFile是Hadoop API提供的一种二进制文件支持，其具有使用方便、可分割、可压缩的特点。
       SequenceFile支持三种压缩选择：NONE，RECORD，BLOCK。Record压缩率低，一般建议使用BLOCK压缩。
       set mapred.output.compression.type=BLOCK;(在上述的基础上来做的)
       create table t_sequence(
           line string
       ) stored as sequencefile;
       insert overwrite table t_sequence select * from t_text;
       发现数据被大幅度的压缩
   3、RCFile
       RCFILE是一种行列存储相结合的存储方式。首先，其将数据按行分块，保证同一个record在一个块上，避免读一个记录需要读取多个block。其次，块数据列式存储，有利于数据压缩和快速的列存取。
       create table t_rcfile(
           line string
       ) stored as rcfile;
       insert overwrite table t_rcfile select * from t_text;

   总结：
       textfile 存储空间消耗比较大，并且压缩的text 无法分割和合并查询的效率最低,可以直接存储，加载数据的速度最高
       sequencefile 存储空间消耗大,压缩的文件可以分割和合并查询效率高，需要通过text文件转化来加载
       rcfile 存储空间最小，查询的效率最高，需要通过text文件转化来加载，加载的速度最低
       相比TEXTFILE和SEQUENCEFILE，RCFILE由于列式存储方式，数据加载时性能消耗较大，但是具有较好的压缩比和查询响应。数据仓库的特点是一次写入、多次读取，因此，整体来看，RCFILE相比其余两种格式具有较明显的优势。

   在上述的过程中遇到一个问题：
       SequenceFile doesn't work with GzipCodesc without native-hadoop code!
       原因：
           这是因为默认的hadoop安装包是不支持压缩的，需要我们编译过之后才可以，
           当然我们也可以使用命令hadoop checknative -a检测一下可以使用的压缩依赖lib库,结果发现，都不支持，令人非常沮丧！！！
       解决：
           我们就可以将编译过的hadoop的lib($HADOOP_HOME/lib/native)包覆盖掉原来的（最好把之前的备份），
           再来执行就可以了。

hive的maven依赖：
   <properties>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   <hive-api.version>2.1.0</hive-api.version>
   <hadoop-api.version>2.6.4</hadoop-api.version>
   <hadoop-core.version>1.2.1</hadoop-core.version>
   </properties>

<dependencies>
   <dependency>
   <groupId>junit</groupId>
   <artifactId>junit</artifactId>
   <version>4.12</version>
   <scope>test</scope>
   </dependency>
   <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-common</artifactId>
   <version>${hadoop-api.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-mapreduce-client-core</artifactId>
   <version>${hadoop-api.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-core</artifactId>
   <version>${hadoop-core.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-exec</artifactId>
   <version>${hive-api.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-serde</artifactId>
   <version>${hive-api.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-service</artifactId>
   <version>${hive-api.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-metastore</artifactId>
   <version>${hive-api.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-common</artifactId>
   <version>${hive-api.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-cli</artifactId>
   <version>${hive-api.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-jdbc</artifactId>
   <version>${hive-api.version}</version>
   </dependency>
   <dependency>
   <groupId>org.apache.thrift</groupId>
   <artifactId>libfb303</artifactId>
   <version>0.9.0</version>
   </dependency>
</dependencies>

记录：
   hive中文注释乱码解决：
   在hive的元数据库中，执行一下脚本
       ALTER TABLE COLUMNS_V2 MODIFY COLUMN COMMENT VARCHAR(256) CHARACTER SET utf8;
       ALTER TABLE TABLE_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
       ALTER TABLE PARTITION_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
       ALTER TABLE PARTITION_KEYS MODIFY COLUMN PKEY_COMMENT VARCHAR(4000) CHARACTER SET utf8;
       ALTER TABLE INDEX_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
   同时将url，加上utf-8
       &useUnicode=true&characterEncoding=UTF-8
       <property>
           <name>javax.jdo.option.ConnectionURL</name>
           <value>jdbc:mysql://uplooking01:3306/hive?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=UTF-8</value>
       </property>

在安装Hive程序HWI的时候的异常（2.1.0）
   在启动hwi的过程中，出现的问题：
   ar:file:/opt/hive/lib/ant-1.9.1.jar!/org/apache/tools/ant/antlib.xml:37: Could not create task or type of type: componentdef.
   原因是引用hive新版本2.1.0依赖的ant jar包没有相关的资源，需要使用最新的ant jar包

   解决方式：
       下载ant资源：ant-1.10.1.zip 安装包
       下载好之后上传到hive终端所在的服务器里面
       解压之后需要配置ANT_HOME
       将$ANT_HOME/lib/ant.jar 拷贝到$HIVE_HOME/lib/ant-1.10.1.jar

       需要提高$HIVE_HOME/lib/ant-1.10.1.jar的权限
       简单点设置最高权限：chmod 777 $HIVE_HOME/lib/ant-1.10.1.jar

   第二个可能会遇到的问题：
       之后重启hwi服务，
       需要大家将$JAVA_HOME/lib/tools.jar拷贝$HIVE_HOME/lib/目录下面才可以
在执行JDBC的时候，访问不了远程的Hive的ThriftServer服务
报的错误：root不能伪装为root
   是因为版本在进行升级的时候考虑到的安全策略，需要我们手动对root进行配置，需要将
hadoop中的root用户和hive中的root用户进行打通，配置在$HADOOP_HOME/etc/hadoop/core-site.xml
中进行配置：添加一下配置项
   <property>
       <name>hadoop.proxyuser.root.hosts</name>
       <value>*</value>
       <description>这是root用户访问的本机地址</description>
   </property>
   <property>
       <name>hadoop.proxyuser.root.groups</name>
       <value>root</value>
       <description>代理root设置的组用户</description>
   </property>
配置成功之后，需要同步到集群中的各个节点，
要想让集群重新加载配置信息，至少hdfs需要重启