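Big Data: Hive (Part 2) --- hiveserver2, operating Hive over JDBC, table create/drop/alter/query, common aggregate queries, the beeline client, partitioned tables, bucketed tables, join queries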

License: CC 4.0 BY-SA
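This article covers starting the Hive service, connecting over JDBC, table operations, aggregate queries, client connections, and the use of partitioned and bucketed tables.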
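I. Starting the hiveserver2 service: a service dedicated to remote socket connections
-------------------------------------------------------------------------------------
    1.$> hive --service hiveserver2 start &     //run the service in the background
    2.$> netstat -ano | grep 10000              //if port 10000 is in the LISTEN state, the service has started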
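
The netstat check can also be done from a client machine. The following is a minimal sketch, not part of the original setup: the class name PortCheck and the 2-second timeout are assumptions chosen for illustration. It simply tries to open a TCP connection to the hiveserver2 port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class PortCheck {
    private PortCheck() {}

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing is accepting connections there.
            return false;
        }
    }

    public static void main(String[] args) {
        // 10000 is the hiveserver2 port used in this article; the host defaults to localhost.
        String host = args.length > 0 ? args[0] : "localhost";
        boolean up = isListening(host, 10000, 2000);
        System.out.println(up ? "port 10000 is accepting connections" : "port 10000 is not reachable");
    }
}
```

An open port only shows that something is listening on 10000; it does not prove that the listener is hiveserver2.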


II. Connecting to and operating Hive over JDBC
-----------------------------------------------
    1.Create a Maven project

    2.Add the dependency
         <dependencies>
            <dependency>
                <groupId>org.apache.hive</groupId>
                <artifactId>hive-jdbc</artifactId>
                <version>2.1.0</version>
            </dependency>
         </dependencies>

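    3.Write the app: note the driver class and the jdbc:hive2 protocol in the URL
         import java.sql.Connection;
         import java.sql.DriverManager;

         Class.forName("org.apache.hive.jdbc.HiveDriver");
         //open the connection (empty user name and password)
         Connection conn = DriverManager.getConnection("jdbc:hive2://192.168.43.131:10000/mydata","","");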
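
The URL above follows the pattern jdbc:hive2://host:port/database. As a small sketch of how the connection step might be wrapped for reuse, the helper below builds that URL and opens the connection. The class and method names (HiveConnections, url, open) are hypothetical and not part of hive-jdbc; only the driver class name and the URL scheme come from the steps above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

final class HiveConnections {
    private HiveConnections() {}

    /** Builds a hiveserver2 JDBC URL of the form jdbc:hive2://host:port/database. */
    static String url(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    /** Opens a connection with an empty user name and password, as in the step above. */
    static Connection open(String host, int port, String database)
            throws ClassNotFoundException, SQLException {
        // Requires the hive-jdbc dependency on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        return DriverManager.getConnection(url(host, port, database), "", "");
    }
}
```

Because Connection is AutoCloseable, a caller can write try (Connection conn = HiveConnections.open("192.168.43.131", 10000, "mydata")) { ... } so the connection is closed even if a query throws.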


III. Table operations: create, drop, alter, query
---------------------------------------------------
    1.Create / insert
        //Create: the table must exist before rows can be inserted
        ppst = conn.prepareStatement("create table if not exists mytable (id int , name string , age int) ");
        ppst.executeUpdate();
        //Insert
        ppst = conn.prepareStatement("insert into mytable (id,name,age) values (?,?,?)");
        for (int i = 0; i < 5; i++) {
            ppst.setInt(1, i);
            ppst.setString(2,"tom" + i);
            ppst.setInt(3,i + 12);
            ppst.executeUpdate();
        }

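    2.Delete: on ordinary tables only a whole table can be removed, with drop table; row-level update and delete are not supported (they require an ACID transactional table)
        //Drop
        ppst = conn.prepareStatement("drop table if exists mytable");
        ppst.executeUpdate();

    3.Modify: update cannot change rows on ordinary tables, but alter table can change the table itself
        //Alter
        ppst = conn.prepareStatement("alter table mytable1 rename to mytable");  //rename the table
        ppst.executeUpdate();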

    4.Query
        PreparedStatement ppst = conn.prepareStatement("select * from mytable");
        ResultSet set = ppst.executeQuery();
        while (set.next()) {
            //mytable has exactly three columns: id, name, age
            System.out.print(set.getInt(1) + "\t");
            System.out.print(set.getString(2) + "\t");
            System.out.print(set.getInt(3));
            System.out.println();
        }


IV. Common aggregate queries
-------------------------------------------------------
    1.count()

    2.sum()

    3.avg()

    4.max()

    5.min()

    /**
     * Aggregate queries
     */
    @Test
    public void tsAggregateFunctions() throws Exception {

        //count, max and avg: run each query in turn and print its single result value
        String[] queries = {
            "select count(*) from mytable",
            "select max(age) from mytable",
            "select avg(age) from mytable"
        };
        for (String sql : queries) {
            ppst = conn.prepareStatement(sql);
            ResultSet set = ppst.executeQuery();
            while (set.next()) {
                System.out.println(set.getString(1));
            }
            ppst.close();
        }
        conn.close();
    }



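V. Connecting to the server remotely with the beeline client
-----------------------------------------------------------
    1.The hive command line can operate on the database, but only on the machine where Hive is installed; it cannot connect remotely

    2.JDBC is one way to connect remotely. A second way is the beeline client, which plays a role similar to a desktop client such as Eclipse on Windows

    3.beeline also connects to hiveserver2 and interacts with the database through it

    4.Ways to connect
        a.$> hive --service beeline -u jdbc:hive2://s100:10000/mydata      //connect to the specified Hive database

        b.$> beeline                //enter the beeline client
          $beeline>  !connect  jdbc:hive2://s100:10000/mydata  //connect to the specified database; no user name or password is needed, just press Enter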


VI. Partitioned tables: one of Hive's optimization techniques
--------------------------------------------------------
    1.A partition is essentially a directory; it subdivides the data at the directory level to narrow the range that has to be searched

    2.Create a partitioned table
        $hive> CREATE TABLE p1(id int , name string, age int ) PARTITIONED BY (Year INT, Month INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    3.Create the partition directories
        $hive> ALTER TABLE p1 ADD PARTITION (year=2014, month=11) PARTITION (year=2014, month=12);

    4.Show the partitions
        $hive> SHOW PARTITIONS p1;

    5.The resulting directory structure
        drwxr-xr-x   - ubuntu supergroup          0 2018-09-09 15:43 /user/hive/warehouse/mydata.db/p1
        drwxr-xr-x   - ubuntu supergroup          0 2018-09-09 15:43 /user/hive/warehouse/mydata.db/p1/year=2014
        drwxr-xr-x   - ubuntu supergroup          0 2018-09-09 15:43 /user/hive/warehouse/mydata.db/p1/year=2014/month=11
        drwxr-xr-x   - ubuntu supergroup          0 2018-09-09 15:43 /user/hive/warehouse/mydata.db/p1/year=2014/month=12

    6.Drop a partition; the files under it are deleted as well
        $hive> ALTER TABLE p1 DROP IF EXISTS PARTITION (year=2014, month=11);

    7.Load data
        $hive> LOAD DATA LOCAL INPATH '/home/downloads/1.txt' OVERWRITE INTO TABLE p1 PARTITION (year=2014, month=11);

    8.Query the partitioned table
        $hive> SELECT name, year, month FROM p1;
        $hive> select * from p1 where year = 2014 and month = 11;   //reads only the data under the matching partition directory
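
The directory listing in step 5 shows that each partition column adds one key=value level under the table directory. The sketch below reproduces that naming rule to make clear why a filter on year and month maps to a single directory. It only illustrates the layout and is not Hive code; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PartitionPaths {
    private PartitionPaths() {}

    /** Appends one key=value directory level per partition column, in the map's iteration order. */
    static String partitionDir(String tableDir, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> column : partitionSpec.entrySet()) {
            path.append('/').append(column.getKey()).append('=').append(column.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so year comes before month as in PARTITIONED BY.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("year", "2014");
        spec.put("month", "11");
        System.out.println(partitionDir("/user/hive/warehouse/mydata.db/p1", spec));
    }
}
```

Run with the values above, this prints /user/hive/warehouse/mydata.db/p1/year=2014/month=11, matching the listing in step 5.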



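VII. Bucketed tables
------------------------------------------------------------
    1.Suppose you want to group rows by id. With partitioning, every id would get its own partition, leaving a large number of small, fragmented directories. A bucketed table is worth considering in that case.
    A bucketed table hashes the value of one column, for example id, and uses the user-specified number of buckets to place rows with the same hash result in the same bucket (a file),
    which makes the data easier to manage and query. Note that partitioning puts data into different directories for easier lookup, while bucketing scatters a table's data and stores it as several smaller pieces, one per bucket.

    2.Create a bucketed table -- bucket by id into 2 buckets
        $hive> CREATE TABLE p2(id int , name string,  age int ) CLUSTERED BY (id) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    3.Load data. A plain load (load data local ...) only copies the file and does not bucket it; to get bucketed data, populate the table through an MR job, for example with insert ... select
        a.$hive> LOAD DATA LOCAL INPATH '/home/ubuntu/downloads/1.txt' OVERWRITE INTO TABLE p2;   //the data is not bucketed

        b.Correct bucketing steps: query another table and copy the result into the bucketed table
            1)Set the number of MR reduce tasks (one per bucket)
            $hive> set mapred.reduce.tasks = 2;

            2)Enforce bucketing (needed before Hive 2.0; later versions always enforce it)
            $hive> set hive.enforce.bucketing = true;

            3)Copy data from another table into the bucketed table: the column types and the number of columns must match, so select only p1's three data columns and leave out its partition columns
            $hive> INSERT OVERWRITE TABLE p2 SELECT id, name, age FROM p1;

    4.Inspect the layout: the data is stored in as many files as there are buckets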
        drwxr-xr-x   - ubuntu supergroup          0 2018-09-09 16:50 /user/hive/warehouse/mydata.db/p2
        -rwxr-xr-x   3 ubuntu supergroup         30 2018-09-09 16:50 /user/hive/warehouse/mydata.db/p2/000000_0
        -rwxr-xr-x   3 ubuntu supergroup         40 2018-09-09 16:50 /user/hive/warehouse/mydata.db/p2/000001_0

    5.How to choose the number of buckets
        Estimate the data volume -- aim for each bucket to hold about twice the block size
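
The hashing rule from step 1 and the sizing rule from step 5 can be sketched in a few lines. This is an illustration under stated assumptions, not Hive's actual implementation: it assumes that the hash of an int column is the value itself and that the bucket index is the non-negative hash modulo the bucket count. Newer Hive versions may use a different hash function, so the real file a row lands in can differ. The class and method names are hypothetical.

```java
final class Buckets {
    private Buckets() {}

    /**
     * Bucket index for an int column value, assuming the hash is the value itself
     * and the index is (hash & Integer.MAX_VALUE) % numBuckets.
     */
    static int bucketFor(int id, int numBuckets) {
        return (id & Integer.MAX_VALUE) % numBuckets;
    }

    /** Bucket count such that each bucket holds roughly twice the block size (rounded up, at least 1). */
    static int suggestedBuckets(long dataBytes, long blockSizeBytes) {
        long perBucket = 2 * blockSizeBytes;
        long buckets = (dataBytes + perBucket - 1) / perBucket; // ceiling division
        return (int) Math.max(1, buckets);
    }

    public static void main(String[] args) {
        // With 2 buckets, as in table p2, even and odd ids fall into different buckets.
        for (int id = 0; id < 5; id++) {
            System.out.println("id " + id + " -> bucket " + bucketFor(id, 2));
        }
        // Example sizing: 1 GiB of data with a 128 MiB block size.
        System.out.println(suggestedBuckets(1024L * 1024 * 1024, 128L * 1024 * 1024));
    }
}
```

Under these assumptions, 1 GiB of data with a 128 MiB block size gives 4 buckets of about 256 MiB each.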


VIII. Join queries
--------------------------------------------------------------
    1.Prepare the tables
        $hive> CREATE TABLE customer(id int , name string,  age int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
        $hive> CREATE TABLE orders(id int , orderno string,  price float, cid int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    2.Load data into the tables
        $hive> LOAD DATA LOCAL INPATH '/mnt/hgfs/share/customer.txt' OVERWRITE INTO TABLE customer;
        $hive> LOAD DATA LOCAL INPATH '/mnt/hgfs/share/orders.txt' OVERWRITE INTO TABLE orders;


    3.Join queries
        select a.* ,b.* from customer a,orders b where a.id = b.cid;        //inner join
        select a.* ,b.* from customer a left outer join orders b on a.id = b.cid;        //left outer join
        select a.* ,b.* from customer a right outer join orders b on a.id = b.cid;        //right outer join
        select a.* ,b.* from customer a full outer join orders b on a.id = b.cid;        //full outer join (supported by Hive, not by MySQL)

        select id,name from customer union select id,orderno from orders;           //union query
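
To make the difference between the inner and left outer joins concrete, here is a small in-memory sketch of their semantics on the join keys alone (customer.id against orders.cid). It is a nested-loop illustration with hypothetical names and made-up key values, and says nothing about how Hive actually executes joins.

```java
import java.util.ArrayList;
import java.util.List;

final class JoinSketch {
    private JoinSketch() {}

    /** Inner join: one output row per (customer, order) pair where customer id == order cid. */
    static List<String> innerJoin(int[] customerIds, int[] orderCids) {
        List<String> rows = new ArrayList<>();
        for (int id : customerIds) {
            for (int cid : orderCids) {
                if (id == cid) {
                    rows.add(id + "," + cid);
                }
            }
        }
        return rows;
    }

    /** Left outer join: like the inner join, but a customer with no order appears once with NULL. */
    static List<String> leftOuterJoin(int[] customerIds, int[] orderCids) {
        List<String> rows = new ArrayList<>();
        for (int id : customerIds) {
            boolean matched = false;
            for (int cid : orderCids) {
                if (id == cid) {
                    rows.add(id + "," + cid);
                    matched = true;
                }
            }
            if (!matched) {
                rows.add(id + ",NULL");
            }
        }
        return rows;
    }
}
```

For customers 1, 2, 3 and orders whose cid values are 1, 1, 4, the inner join returns two rows (both for customer 1), while the left outer join returns four: the same two plus one NULL-padded row each for customers 2 and 3. The order with cid 4 appears in neither result; it would show up in a right or full outer join.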