【hive】How to use Elephant Bird with Hive

A guide to using Elephant-Bird with Hive

Reposted from:

https://github.com/kevinweil/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive

Overview

Let's quickly remind ourselves how Hive reads records so we understand how Elephant-Bird fits into the picture.

Hive data is organized as follows:

  • Databases contain Tables
  • Tables contain Partitions
  • Partitions have a StorageDescriptor

When reading data from a partition, Hive uses the storage descriptor to determine which InputFormat and SerDe to use. Each partition has its own storage descriptor, so you can change file and/or serialization formats at any time, and your existing partitions are still usable.
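As a rough illustration of why per-partition storage descriptors matter, the following Python sketch models a table whose partitions each carry their own input format and SerDe. The `StorageDescriptor` and `Table` classes and the partition dates are hypothetical stand-ins for illustration only, not the metastore's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class StorageDescriptor:
    """Hypothetical model: the class names Hive would instantiate to read a partition."""
    input_format: str
    serde: str


@dataclass
class Table:
    """Hypothetical model: a table maps each partition spec to its own descriptor."""
    name: str
    partitions: Dict[str, StorageDescriptor] = field(default_factory=dict)

    def add_partition(self, spec: str, descriptor: StorageDescriptor) -> None:
        self.partitions[spec] = descriptor

    def reader_for(self, spec: str) -> StorageDescriptor:
        # The lookup is per partition, so older partitions keep their original formats.
        return self.partitions[spec]


text = StorageDescriptor(
    "org.apache.hadoop.mapred.TextInputFormat",
    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
)
thrift = StorageDescriptor(
    "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat",
    "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer",
)

table = Table("elephantbird_test")
table.add_partition("dt=2011-01-01", text)    # older partition, plain text (assumed)
table.add_partition("dt=2011-01-02", thrift)  # newer partition, written by Elephant-Bird
```

Switching formats for new partitions leaves the descriptor of the older partition untouched, which is why existing data stays readable.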

Hive creates an instance of the InputFormat for the partition and uses it to read bytes from disk, splitting the input byte stream into chunks that each hold one record's worth of bytes.

Hive then creates an instance of the SerDe and initializes it with the properties you provided when creating the table (serialization.class and serialization.format). Each record's worth of bytes from the input format is handed to the SerDe, which converts it to the correct object type.

Elephant-Bird typically provides the functionality of both input format and SerDe: it reads bytes from disk, splits them into one record's worth of bytes at a time, and deserializes those bytes into an object of the correct type. To integrate with Hive in the least intrusive way, Elephant-Bird was updated to include an input format that returns the raw record bytes instead of deserializing the record internally.
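The division of labour between the two stages can be sketched in a few lines of Python. This is only a conceptual model: the length-prefixed framing, the record layout, and every function name below are assumptions made for illustration, and none of it reflects Elephant-Bird's on-disk format or Hive's Java interfaces.

```python
import struct
from typing import Callable, Iterator, List, Tuple


def write_records(records: List[bytes]) -> bytes:
    """Frame each record as a 4-byte big-endian length followed by its payload (assumed framing)."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)


def raw_input_format(stream: bytes) -> Iterator[bytes]:
    """Input format role: split the byte stream into raw per-record bytes, without deserializing."""
    offset = 0
    while offset < len(stream):
        (length,) = struct.unpack_from(">I", stream, offset)
        offset += 4
        yield stream[offset:offset + length]
        offset += length


def int_string_serde(record: bytes) -> Tuple[int, str]:
    """SerDe role: turn one record's bytes into a typed object (here an int plus a UTF-8 string)."""
    (myint,) = struct.unpack_from(">i", record, 0)
    return myint, record[4:].decode("utf-8")


def read_partition(stream: bytes, serde: Callable[[bytes], Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Mimics the read path: the input format yields raw bytes, the SerDe converts each record."""
    return [serde(raw) for raw in raw_input_format(stream)]


payloads = [struct.pack(">i", 1) + b"one", struct.pack(">i", 2) + b"two"]
rows = read_partition(write_records(payloads), int_string_serde)
```

Because the input format stage hands back undecoded bytes, the deserializer can be swapped without touching the code that finds record boundaries.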

An alternative would be to implement Elephant-Bird support at the SerDe layer, with a SerDe that expects an already-deserialized record. The problem with this approach is that the HiveMetaStore stores properties only for SerDes, not for input formats. Adding metastore support for input format properties is certainly possible, but it is a larger change for little benefit. For this reason we chose to add raw-bytes support to Elephant-Bird so that it "just works" with Hive.

Example create table statement

You can create Hive tables for data stored by Elephant-Bird as follows; partitions you add to such a table will then be readable.

create external table elephantbird_test
  -- no need to specify a schema - it will be discovered at runtime
  -- (myint int, myString string, underscore_int int)
  partitioned by (dt string)
    -- hive provides a thrift deserializer
    row format serde "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer"
    with serdeproperties (
      -- full name of our thrift struct class
      "serialization.class"="org.apache.hadoop.hive.serde2.thrift.test.IntString",
      -- use the binary protocol
      "serialization.format"="org.apache.thrift.protocol.TBinaryProtocol")
  stored as
    -- elephant-bird provides an input format for use with hive
    inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
    -- placeholder as we will not be writing to this table
    outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

Reading Protocol Buffers

Elephant-Bird provides ProtobufDeserializer for reading records stored as serialized protocol buffers. This deserializer expects BytesWritable records, making it suitable for use with a variety of input formats, such as DeprecatedRawMultiInputFormat, SequenceFile, and other file formats that store binary values.

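create table users
  row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
  with serdeproperties (
    -- full name of the generated protocol buffer message class
    "serialization.class"="com.example.proto.gen.Storage$User")
  stored as
    inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
    -- placeholder, as in the example above: this table is only read, never written
    outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";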

Dynamic Schemas

Don't we need to store the write-time schema? Won't things break horribly if we change the schema?

One benefit of storing Thrift structs is that there are well-known rules for evolving them: never remove fields, never reuse field IDs, and add new fields as optional. If you follow those rules, bytes stored on disk can be restored to Thrift structs based on the current schema, with fields missing from the stored record set to null or to their default values, as the case may be.
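To make the evolution rules concrete, here is a minimal Python sketch of reading an old record against a newer schema. It is a toy model, not Thrift's wire format or API: stored records are plain dicts keyed by field ID, and the `Field` class, `CURRENT_SCHEMA`, and the `source` field are hypothetical.

```python
from typing import Any, Dict, NamedTuple


class Field(NamedTuple):
    name: str
    default: Any = None  # used when the stored record has no value for this field ID


# Hypothetical current schema. IDs 1 and 2 existed at write time;
# IDs 3 and 4 were added later as optional fields, and no ID was reused.
CURRENT_SCHEMA: Dict[int, Field] = {
    1: Field("myint"),
    2: Field("myString"),
    3: Field("underscore_int"),         # optional, no default: reads back as None
    4: Field("source", default="web"),  # optional with a default value
}


def decode(stored: Dict[int, Any], schema: Dict[int, Field]) -> Dict[str, Any]:
    """Restore a stored record using the current schema.

    Missing field IDs fall back to the field's default; IDs the schema
    does not know about are ignored.
    """
    return {f.name: stored.get(field_id, f.default) for field_id, f in schema.items()}


old_record = {1: 42, 2: "hello"}  # written before fields 3 and 4 existed
row = decode(old_record, CURRENT_SCHEMA)
```

Reusing a field ID for a different meaning would break this lookup, since old bytes would be read as the new field, which is why the rules forbid it.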

Hive Configuration

The following hive-site.xml property must be set for Elephant-Bird to work correctly.

<property>
  <name>hive.input.format</name>
  <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
  <description>Hive uses a wrapping input format so each partition can be handled differently.
    The default CombineHiveInputFormat does not actually ask input formats for their splits,
    instead looks at the files themselves and attempts to optimize based on file size and
    location. We use HiveInputFormat to disable this optimization as we want our loaders
    to report their splits.
  </description>
</property>