Flink SQL: Queries (Group Aggregation)

Last modified: 2023-05-20 09:59:14

License: CC 4.0 BY-SA

Tags: #flink #sql #apache

First published: 2022-10-26 17:39:48

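This article covers SQL aggregation in Apache Flink, including GROUP BY, DISTINCT aggregation, GROUPING SETS (such as ROLLUP and CUBE), and the HAVING clause. Flink supports both batch and stream processing. For streaming queries, state size may grow without bound; setting a state time-to-live (TTL) keeps state from growing too large, but it may affect the correctness of the query result.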

Group Aggregation

Batch Streaming

Like most data systems, Apache Flink supports aggregate functions, both built-in and user-defined. User-defined functions must be registered in a catalog before use.

An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the COUNT, SUM, AVG (average), MAX (maximum), and MIN (minimum) over a set of rows.

SELECT COUNT(*) FROM Orders

For streaming queries, it is important to understand that Flink runs continuous queries that never terminate. Instead of producing a final result, they update their result table as their input tables change. For the above query, Flink outputs an updated count each time a new row is inserted into the Orders table.
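The continuous-query behaviour described above can be imitated in a few lines of plain Python. This is only a sketch of the semantics, not Flink code: the `continuous_count` function and the sample `orders` rows are hypothetical names made up for illustration.

```python
from typing import Iterable, Iterator


def continuous_count(rows: Iterable[dict]) -> Iterator[int]:
    """Yield the updated COUNT(*) after each inserted row."""
    count = 0  # the only state this query has to keep
    for _ in rows:
        count += 1
        yield count  # every new row produces an updated result


# Hypothetical Orders rows, arriving one at a time.
orders = [
    {"order_id": 101, "amount": 10},
    {"order_id": 102, "amount": 25},
    {"order_id": 103, "amount": 7},
]

updates = list(continuous_count(orders))
print(updates)  # [1, 2, 3]
```

A batch engine would report only the final value 3. The streaming view is the whole sequence of updates, and it never ends as long as rows keep arriving.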

Apache Flink supports the standard GROUP BY clause for aggregating data.

SELECT COUNT(*)
FROM Orders
GROUP BY order_id
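To make the per-group behaviour concrete, here is a small Python sketch of a grouped count that keeps one counter per key and reports each change as changelog rows. The `GroupedCount` class is hypothetical. The `+I`, `-U`, and `+U` labels borrow Flink's short names for insert, update-before, and update-after rows, but the sketch does not claim to reproduce Flink's exact output.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, int, int]  # (row kind, order_id, count)


class GroupedCount:
    """Toy model of SELECT COUNT(*) ... GROUP BY order_id on a stream."""

    def __init__(self) -> None:
        self.state: Dict[int, int] = {}  # one counter per group key

    def insert(self, order_id: int) -> List[Change]:
        old = self.state.get(order_id)
        new = (old or 0) + 1
        self.state[order_id] = new
        if old is None:
            # First row of a new group: a plain insert.
            return [("+I", order_id, new)]
        # Existing group: retract the old result, then emit the new one.
        return [("-U", order_id, old), ("+U", order_id, new)]


agg = GroupedCount()
changelog: List[Change] = []
for order_id in [101, 102, 101]:
    changelog.extend(agg.insert(order_id))

print(changelog)
print(len(agg.state))  # number of groups held in state
```

The state dictionary gains one entry for every distinct `order_id` and never shrinks, which is the root of the state growth discussed for streaming queries.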

For streaming queries, the state required to compute the query result might grow infinitely. State size depends on the number of groups and on the number and type of aggregate functions. For example, MIN/MAX are heavy on state size, while COUNT is cheap. You can provide a query configuration with an appropriate state time-to-live (TTL) to prevent excessive state size. Note that this might affect the correctness of the query result. See query configuration for details.
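The trade-off between bounded state and correctness can be shown with a toy model. The sketch below assumes a TTL that expires a group's state once it has not been updated for `ttl_seconds`. The `CountWithTtl` class and its methods are hypothetical, and timestamps are passed in explicitly to keep the example deterministic. In Flink SQL itself, the retention time is, as far as I know, controlled by the `table.exec.state.ttl` option.

```python
from typing import Dict, Tuple


class CountWithTtl:
    """Toy per-key COUNT(*) whose state expires after a period of inactivity."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.state: Dict[str, Tuple[int, float]] = {}  # key -> (count, last update)

    def insert(self, key: str, now: float) -> int:
        entry = self.state.get(key)
        if entry is not None and now - entry[1] >= self.ttl:
            entry = None  # state expired: earlier rows are forgotten
        count = (entry[0] if entry else 0) + 1
        self.state[key] = (count, now)
        return count

    def evict_expired(self, now: float) -> None:
        """Drop every idle entry; this is what keeps state size bounded."""
        self.state = {k: v for k, v in self.state.items() if now - v[1] < self.ttl}


agg = CountWithTtl(ttl_seconds=60)
first = agg.insert("a", now=0)     # 1
second = agg.insert("a", now=30)   # 2
third = agg.insert("a", now=200)   # 1, although three rows were seen in total
print(first, second, third)

agg.insert("b", now=210)
agg.evict_expired(now=265)         # "a" has been idle for 65 s, "b" for 55 s
print(sorted(agg.state))           # ['b']
```

The third result is 1 rather than 3 because the earlier state had already expired. This is the kind of incorrect result a TTL can introduce, so the TTL should be longer than the longest gap expected between updates to the same group.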

Apache Flink provides several performance tuning options for group aggregation; see Performance Tuning for details.

DISTINCT Aggregation

Distinct aggregates remove duplicate values before applying an aggregate function. The following example counts the number of distinct order_ids instead of the total number of rows in the Orders table.

SELECT COUNT(DISTINCT order_id) FROM Orders

For streaming queries, the state required to compute the query result might grow infinitely. State size depends mostly on the number of distinct rows and on how long a group is maintained; short-lived GROUP BY windows are not a problem. You can provide a query configuration with an appropriate state time-to-live (TTL) to prevent excessive state size. Note that this might affect the correctness of the query result. See query configuration for details.
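Why the state of a distinct aggregate grows with the number of distinct values can be seen in a minimal Python sketch. It is an illustration of the idea, not Flink's implementation: the function name `continuous_count_distinct` and the sample `order_ids` are hypothetical.

```python
from typing import Hashable, Iterable, Iterator


def continuous_count_distinct(values: Iterable[Hashable]) -> Iterator[int]:
    """Yield the running COUNT(DISTINCT value) after each input row."""
    seen = set()  # state: every distinct value observed so far
    for value in values:
        seen.add(value)  # duplicates leave the set unchanged
        yield len(seen)


order_ids = [101, 102, 101, 103, 102]
running = list(continuous_count_distinct(order_ids))
print(running)         # [1, 2, 2, 3, 3]
print(len(order_ids))  # 5, the result a plain COUNT(*) would reach
```

To recognise a duplicate, the query has to remember every value it has already counted, so the `seen` set grows by one entry per new distinct value. A plain count needs only a single number.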