Classifying the SQL-on-Hadoop Solutions

This post surveys six leading SQL-on-Hadoop solutions, compares their strengths and weaknesses, and argues that the market is moving toward bringing SQL technology directly to Hadoop clusters, a shift that will significantly advance Hadoop's dominance in data processing and analysis.


Posted on October 2, 2013 at 10:00 am.

Almost a year and a half ago on this blog, I went on something that is probably best described as an anti-DBMS/Hadoop-connector rant. There was then (as there still is now) an incredible number of use cases that require the combination of DBMS and Hadoop technologies, and at the time, both the Hadoop vendors and the DBMS vendors were pushing a “connector” approach, where the customer buys both a Hadoop product and a DBMS product and data can be passed back and forth between the two systems. I explained the architectural wastefulness associated with this approach, and why, given the way that parallel database systems and Hadoop are designed, it is relatively easy to combine them (architecturally speaking) into a single system. At the time there were only two solutions that took the combined-system approach: Hive and Hadapt.

Since that post was written, it is good to see that several vendors have abandoned the connector approach and have instead launched initiatives (such as Stinger, Impala, and Drill) that, while still immature, are following (or extending) Hive and Hadapt, and going in the direction of bringing SQL technologies directly to Hadoop clusters. In my opinion, this is absolutely the right direction for the market, and will result in the furthering of Hadoop’s dominance in the data processing and analysis space.

Given the rapid entrance of these new “SQL-on-Hadoop” initiatives, now is a good time to classify them and study the similarities and differences between these approaches.

Before comparing and contrasting six approaches to SQL-on-Hadoop (Hive, Hadapt, Stinger, Impala, Polybase, and Drill), I should explain why these are the only approaches compared in this post: since the DBMS/Hadoop connector approach is so fundamentally flawed from an architectural perspective, vendors that use it remain in a different category and are not directly competitive with the direct approaches to SQL-on-Hadoop. Even recent attempts from Greenplum and Aster Data to retrofit their MPP databases to work on Hadoop clusters, through the HAWQ and SQL-H projects respectively, still fundamentally use the connector approach: at query time, data is extracted out of HDFS and sent over the network into their MPP execution engines for further processing. Even if the MPP execution engine sits on the same physical cluster as HDFS, if processing is not pushed down to the same nodes that store the data, the MPP database is essentially treating HDFS as a large (cheap) shared-disk storage system, with all of the scalability constraints and network bottlenecks associated with that approach. Shared-disk architectures are fundamentally antithetical to the Google-made-famous “shared-nothing” design that Hadoop emulates, where processing is pushed as close to the data as possible. This is why these MPP+Hadoop vendors typically bundle hardware with software, so that high-end and expensive networking gear can be integrated into the cluster to hide the fundamental limitations of the shared-disk architecture.
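
To put the data-shipping cost of the connector/shared-disk pattern in perspective, here is a back-of-envelope calculation. Every figure in it (table size, cluster size, disk and network bandwidth) is an assumption chosen only to illustrate the argument above, not a measurement of any particular product.

```python
# Back-of-envelope comparison (illustrative figures only) of the
# connector/shared-disk pattern versus pushing processing to the data nodes.

TABLE_TB = 50.0                 # size of the table being scanned, in TB (assumption)
NODES = 100                     # number of nodes in the cluster (assumption)
DISK_SCAN_GB_S_PER_NODE = 1.0   # aggregate local disk scan rate per node (assumption)
NETWORK_GB_S_PER_NODE = 0.125   # ~1 GbE of usable bandwidth per node (assumption)

table_gb = TABLE_TB * 1024

# Shared-nothing: every node scans only its local share of the table, in parallel.
local_scan_seconds = (table_gb / NODES) / DISK_SCAN_GB_S_PER_NODE

# Connector / shared-disk: the same bytes must also cross the network into the
# external execution engine before any operator can run on them.
network_ship_seconds = (table_gb / NODES) / NETWORK_GB_S_PER_NODE

print(f"local scan per node:   {local_scan_seconds:,.0f} s")
print(f"network transfer only: {network_ship_seconds:,.0f} s")
```

Under these assumed numbers, the network transfer alone takes roughly eight times longer than the local scan, which is why pulling data out of HDFS into an external engine tends to become the bottleneck long before the disks do.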

Therefore, we are left with the above-mentioned six technologies to compare. (It’s possible that there are additional SQL-on-Hadoop solutions that I’m not aware of – if so, please add them via the comment thread below). They are best divided into three categories, with two technologies placed inside each category:

(1)   SQL translated to MapReduce jobs over a Hadoop cluster. Both Hive and Stinger (without Tez) fall into this category. A SQL query that is sent to a Hadoop cluster is translated into a series of MapReduce jobs, which are then processed by the cluster (see the first sketch after this list). A major advantage of this approach is that by integrating with Hadoop’s version of MapReduce, queries are run with Hadoop’s dynamic scheduler and are therefore highly tolerant of unexpected performance issues and other forms of heterogeneous performance across the cluster. Furthermore, they leverage MapReduce’s mid-query fault tolerance, so that nodes that fail in the middle of query processing do not cause the entire query to fail. Combined, these two properties lead to consistent and reliable execution of queries across clusters containing thousands of nodes. Disadvantages include: (a) in order to facilitate the translation of SQL into MapReduce jobs, the dialect of SQL spoken by these systems is not quite standard SQL, which complicates integration with third-party tools; (b) due to the need to automatically generate MapReduce jobs for any type of SQL clause, SQL coverage is coming along slowly; and (c) because processing happens exclusively within the MapReduce framework (Stinger with Tez falls in a different category), the per-query MapReduce overhead prevents these technologies from processing queries interactively (this category is fundamentally a “batch processing” category).

(2)   SQL processed by a specialized (Google-inspired) SQL engine that sits on a Hadoop cluster. Both Impala and Drill fall into this category. Impala is inspired by Google’s F1 project and Drill by Google’s Dremel project. Both push SQL (or, in the case of Drill/Dremel, SQL-like) operators down to where the data is stored in the distributed file system (HDFS) and therefore have the advantage of collocating data with data processing. However, since both systems are building the SQL query execution engine from scratch, both suffer from the same (a) and (b) disadvantages of category (1) – non-standard SQL and poor SQL coverage. Furthermore, by completely eschewing MapReduce, they do not get the fault tolerance and dynamic scheduling (and therefore scalability) benefits that are inherent in MapReduce.

(3)   Processing of SQL queries is split between MapReduce and storage that natively speaks SQL. Both Hadapt and Polybase fall into this category. These systems attempt to get the best of both worlds, doing some processing in MapReduce and some processing in native SQL operators. When a SQL query is submitted to the Hadoop cluster, an optimizer analyzes the query and decides which parts should be performed via MapReduce and which via SQL operators (a toy version of this decision appears in the second sketch after this list). For queries that require interactive (sub-second) response times, MapReduce is typically avoided, and the entire query is performed via native SQL operators. But for queries that require massive scale and mid-query fault tolerance, more work is left to the MapReduce engine.
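
To make category (1) concrete, here is a minimal sketch of how a simple aggregation query might be turned into a single map/reduce pair. This is a conceptual illustration only, not Hive’s or Stinger’s actual plan generation, and the table and column names are made up.

```python
# Conceptual sketch: how a category (1) engine might translate
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept
# into one MapReduce job. Real Hive query plans are far more involved.

from collections import defaultdict

def map_phase(rows):
    """Map step: emit (group key, 1) for every input row."""
    for row in rows:
        yield (row["dept"], 1)

def shuffle(pairs):
    """Hadoop's shuffle groups all emitted values by key before the reduce step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(key, values):
    """Reduce step: sum the partial counts for one group key."""
    return (key, sum(values))

rows = [{"dept": "eng"}, {"dept": "eng"}, {"dept": "sales"}]
for key, values in shuffle(map_phase(rows)):
    print(reduce_phase(key, values))   # ('eng', 2), ('sales', 1)
```

And for category (3), a toy heuristic for the split-execution decision described above. The thresholds and the rule itself are assumptions made purely for illustration, not Hadapt’s or Polybase’s actual optimizer logic.

```python
# Toy split-execution heuristic in the spirit of category (3); the threshold
# and the decision rule are assumptions, not any vendor's real optimizer.

def choose_engine(estimated_runtime_s, interactive=False):
    """Route a query fragment to native SQL operators or to MapReduce."""
    if interactive or estimated_runtime_s < 60:
        # Short, latency-sensitive fragments skip MapReduce's per-job overhead.
        return "native SQL operators"
    # Long-running fragments keep MapReduce's mid-query fault tolerance.
    return "MapReduce"

print(choose_engine(estimated_runtime_s=5, interactive=True))   # native SQL operators
print(choose_engine(estimated_runtime_s=7200))                  # MapReduce
```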

Although each of these “SQL-on-Hadoop” categories has different advantages and disadvantages, as a group, they significantly bring Hadoop forward from where it was a year ago, and greatly expand the use cases for which Hadoop technology can be used. As vendors continue to abandon the DBMS-connector approach, customers win through cleaner architectures, fewer data silos, and simplified systems administration.

from http://hadapt.com/blog/2013/10/02/classifying-the-sql-on-hadoop-solutions/
