ClickHouse 25.5 版本发布说明

图片

本文字数:11882;估计阅读时间:30 分钟

作者:ClickHouse Team

本文在公众号【ClickHouseInc】首发

图片

又到了每月一次的版本更新时间!

发布概要

ClickHouse 25.5 带来了 15 项新功能 🌸 23 项性能优化 🦋 以及 64 个缺陷修复 🐝

本次发布的亮点包括:向量相似度索引进入 Beta 阶段、Iceberg 表支持 Hive metastore catalog、支持查看每个函数何时被添加到 ClickHouse、本地工具 clickhouse-local 引入隐式表支持等诸多功能更新。

新贡献者

我们热烈欢迎所有在 25.5 版本中首次做出贡献的开发者!ClickHouse 社区正在持续壮大,正是因为大家的努力,ClickHouse 才能不断发展壮大。

以下是本次的新贡献者名单:

Andrian Iliev, Colerar, Dasha Wessely, Denis, KovalevDima, Kyamran, Marta Paes, Mojtaba Ghahari, Sachin Singh, YjyJeff, andrei tinikov, caicre, codeworse, gvoelfin, morsapaes, samay-sharma, shanfengp, tdufour

Iceberg 表支持 Hive metastore catalog

贡献者:scanhex12

ClickHouse 正在不断拓展对 Lakehouse 架构的支持。此前的版本已支持 Unity 和 AWS Glue 数据目录,本次新增了对 Hive metastore catalog 的支持。

用户可以通过 DataLakeCatalog 表引擎创建一个表来连接 Hive catalog:

SET allow_experimental_database_hms_catalog=true;
CREATE DATABASE hive
ENGINE = DataLakeCatalog('thrift://hive:9083', '', '')
SETTINGS
  catalog_type = 'hive',
  warehouse = 'demo',
  storage_endpoint = 'http://minio:9000/warehouse';

随后,便可以像操作普通表一样查询该 catalog 中的表:

SELECT count()
FROM default.`iceberg.hits`;

clickhouse-local 中的隐式表

贡献者:Alexey Milovidov

在使用 clickhouse-local 工具处理标准输入文件时,现在可以省略 FROM 和 SELECT 子句。

举个例子,假设我们想要解析 GitHub 上 football.json 项目中的一个 JSON 文件,可以使用如下查询来探索文件中的属性结构:

curl -sS 'https://raw.githubusercontent.com/openfootball/football.json/refs/heads/master/2024-25/en.1.json' | 
./clickhouse -q "JSONAllPathsWithTypes(json)" --input-format JSONAsObject
{'matches':'Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))','name':'String'}

从查询结果中,我们可以看到两个属性:name 和 matches。接下来,我们可以进一步探索 matches 属性的内容:

curl -sS https://raw.githubusercontent.com/openfootball/football.json/refs/heads/master/2024-25/en.1.json | 
./clickhouse -q "arrayJoin(json.matches::Array(JSON)) LIMIT 5" --input-format JSONAsObject
{"date":"2024-08-16","round":"Matchday 1","score":{"ft":["1","0"],"ht":["0","0"]},"team1":"Manchester United FC","team2":"Fulham FC","time":"20:00"}
{"date":"2024-08-17","round":"Matchday 1","score":{"ft":["0","2"],"ht":["0","0"]},"team1":"Ipswich Town FC","team2":"Liverpool FC","time":"12:30"}
{"date":"2024-08-17","round":"Matchday 1","score":{"ft":["2","0"],"ht":["1","0"]},"team1":"Arsenal FC","team2":"Wolverhampton Wanderers FC","time":"15:00"}
{"date":"2024-08-17","round":"Matchday 1","score":{"ft":["0","3"],"ht":["0","1"]},"team1":"Everton FC","team2":"Brighton & Hove Albion FC","time":"15:00"}
{"date":"2024-08-17","round":"Matchday 1","score":{"ft":["1","0"],"ht":["1","0"]},"team1":"Newcastle United FC","team2":"Southampton FC","time":"15:00"}

如果我们希望对 matches 数组执行更深入的查询操作,就需要将查询语句改写为包含 FROM 子句的形式,如下所示:

curl -sS https://raw.githubusercontent.com/openfootball/football.json/refs/heads/master/2024-25/en.1.json |
./clickhouse -q \
"m.date, m.team1, m.team2, m.score.ft::Array(String)[1] || '-' || m.score.ft::Array(String)[2] AS score
FROM table
ARRAY JOIN json.matches::Array(JSON) AS m
LIMIT 5" \
--input-format JSONAsObject \
--output-format Pretty \
--output_format_pretty_row_numbers=0

此外,我们还支持通过指定输出格式,在查询结果中包含列名:

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ m.date     ┃ m.team1              ┃ m.team2                    ┃ score ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ 2024-08-16 │ Manchester United FC │ Fulham FC                  │ 1-0   │
├────────────┼──────────────────────┼────────────────────────────┼───────┤
│ 2024-08-17 │ Ipswich Town FC      │ Liverpool FC               │ 0-2   │
├────────────┼──────────────────────┼────────────────────────────┼───────┤
│ 2024-08-17 │ Arsenal FC           │ Wolverhampton Wanderers FC │ 2-0   │
├────────────┼──────────────────────┼────────────────────────────┼───────┤
│ 2024-08-17 │ Everton FC           │ Brighton & Hove Albion FC  │ 0-3   │
├────────────┼──────────────────────┼────────────────────────────┼───────┤
│ 2024-08-17 │ Newcastle United FC  │ Southampton FC             │ 1-0   │
└────────────┴──────────────────────┴────────────────────────────┴───────┘

introduced_in 字段用于追踪函数的新增版本

贡献者:Robert Schulzesy

stem.functions 系统表新增了一个字段 introduced_in,用于记录每个函数首次被添加到 ClickHouse 的版本号。目前该字段已经覆盖了一部分新添加的函数,未来也会陆续补充旧函数的信息。

下面是 25.5 版本中新增的函数列表:

SELECT name, introduced_in, returned_value
FROM system.functions
WHERE introduced_in = '25.5'
FORMAT PrettySpace;
 name                   description

 icebergBucket          Implements logic of iceberg bucket transform: https://iceberg.apache.org/spec/#bucket-transform-details.
 icebergHash            Implements logic of iceberg hashing transform: https://iceberg.apache.org/spec/#appendix-b-32-bit-hash-requirements.
 mapContainsValue       Checks whether the map has the specified value.
 mapContainsValueLike   Checks whether map contains value LIKE specified pattern.
 mapContainsKey         Checks whether the map has the specified key.
 mapExtractValueLike    Returns a map with elements which value matches the specified pattern.
 stringBytesEntropy     Calculates Shannon's entropy of byte distribution in a string.
 stringBytesUniq        Counts the number of distinct bytes in a string.

Parquet 格式支持 Geo 类型

贡献者:scanhex12

ClickHouse 的 Parquet 读取器现已支持 Geo 类型。我们可以通过以下查询语句,对 Overture Maps 数据集中的几何字段进行解析:

SELECT id, geometry, bbox, 
       COLUMNS('^id|geometry|bbox') APPLY(toTypeName)
FROM s3('s3://overturemaps-us-west-2/release/2025-04-23.0/theme=places/type=place/*')
WHERE categories.primary = 'pizza_restaurant'
LIMIT 1
SETTINGS schema_inference_make_columns_nullable = 0;
Row 1:
──────
id:                   08f95008132998a6036bed7d0b7e7661
geometry:             (119.4407421,-5.15814)
bbox:                 (119.44073,119.44074,-5.15814,-5.158139)
toTypeName(id):       String
toTypeName(geometry): Tuple(Float64, Float64)
toTypeName(bbox):     Tuple(
    xmin Float32,
    xmax Float32,
    ymin Float32,
    ymax Float32)

现在,geometry 字段可以被正确识别并解析。而在旧版本中,该字段通常会以二进制的形式显示,如下所示:

Row 1:
──────
id:                   08ff2a1b744f064403498c7aa7fce7a1
geometry:             (-149.069345,-83.5991503)
bbox:                 (-149.06935,-149.06932,-83.59915,-83.599144)
toTypeName(id):       String
toTypeName(geometry): Point
toTypeName(bbox):     Tuple(
    xmin Float32,
    xmax Float32,
    ymin Float32,
    ymax Float32)

有了这个改进,ClickHouse 已可用于查询 GeoParquet 数据集。我们也一直在寻找有趣的地理数据集,欢迎大家推荐值得尝试的数据源!

向量搜索进入 Beta 阶段

贡献者:Shankar Iyer

ClickHouse 中的向量相似度索引功能现已从实验阶段升级为 Beta 阶段,并新增了对混合搜索中前置过滤和后置过滤策略的支持。

我们以下面这段示例,展示如何基于 Cohere 的 Wikipedia 数据子集使用向量索引。首先,我们启用相似度索引功能:

SET allow_experimental_vector_similarity_index=1;

接着,我们创建一个包含 emb 向量字段的表,并基于该字段建立 HNSW 索引:

CREATE TABLE wikiEmbeddings (
  id Int32,
  title String,
  text String,
  url String,
  wiki_id Int32,
  views Float32,
  paragraph_id Int32,
  langs Int32,
  emb Array(Float32),
  INDEX emb_hnsw emb TYPE vector_similarity('hnsw', 'L2Distance', 768)
)
ORDER BY id;

然后从 Hugging Face 导入相应数据:

INSERT INTO wikiEmbeddings
select *
FROM url('https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings/resolve/main/data/train-00000-of-00253-8d3dffb4e6ef0304.parquet')
SETTINGS max_http_get_redirects=10;

现在就可以开始查询了。以下查询将返回数据集中最相似的一条记录:

SELECT * EXCEPT emb
FROM wikiEmbeddings
WHERE id = 120356;
Row 1:
──────
id:           120356
title:        Germany national football team
text:         The Germany national football team () represents Germany in men's international football and played its first match in 1908. The team is governed by the German Football Association ("Deutscher Fußball-Bund"), founded in 1900. Between 1949 and 1990, separate German national teams were recognised by FIFA due to Allied occupation and division: the DFB's team representing the Federal Republic of Germany (commonly referred to as West Germany in English between 1949 and 1990), the Saarland team representing the Saar Protectorate (1950–1956) and the East Germany team representing the German Democratic Republic (1952–1990). The latter two were absorbed along with their records; the present team represents the reunified Federal Republic. The official name and code "Germany FR (FRG)" was shortened to "Germany (GER)" following reunification in 1990.
url:          https://en.wikipedia.org/wiki?curid=250204
wiki_id:      250204
views:        3271.844
paragraph_id: 0
langs:        98

结果显示,这条记录来自 Wikipedia 上关于德国国家足球队的页面。我们可以进一步查询与该记录最相似的其他条目:

WITH (SELECT emb FROM wikiEmbeddings WHERE id = 120356) AS lookup
SELECT id, title, text, url, L2Distance(emb, lookup) AS dist
FROM wikiEmbeddings
ORDER BY dist
LIMIT 3
FORMAT Vertical;
Row 1:
──────
id:    120356
title: Germany national football team
text:  The Germany national football team () represents Germany in men's international football and played its first match in 1908. The team is governed by the German Football Association ("Deutscher Fußball-Bund"), founded in 1900. Between 1949 and 1990, separate German national teams were recognised by FIFA due to Allied occupation and division: the DFB's team representing the Federal Republic of Germany (commonly referred to as West Germany in English between 1949 and 1990), the Saarland team representing the Saar Protectorate (1950–1956) and the East Germany team representing the German Democratic Republic (1952–1990). The latter two were absorbed along with their records; the present team represents the reunified Federal Republic. The official name and code "Germany FR (FRG)" was shortened to "Germany (GER)" following reunification in 1990.
url:   https://en.wikipedia.org/wiki?curid=250204
dist:  0

Row 2:
──────
id:    120367
title: Germany national football team
text:  The Federal Republic of Germany, which was referred to as West Germany, continued the DFB. With recognition by FIFA and UEFA, the DFB maintained and continued the record of the pre-war team. Switzerland was the first team that played West Germany in 1950, with the latter qualifying for the 1954 World Cup and the former hosting it.
url:   https://en.wikipedia.org/wiki?curid=250204
dist:  5.537961

Row 3:
──────
id:    120366
title: Germany national football team
text:  After World War II, Germany was banned from competition in most sports until 1950. The DFB was not a full member of FIFA, and none of the three new German states – West Germany, East Germany, and Saarland – entered the 1950 World Cup qualifiers.
url:   https://en.wikipedia.org/wiki?curid=250204
dist:  6.296622

不过,这些结果都来自同一篇 Wikipedia 文章的不同段落,因此参考意义有限。为了让结果更加丰富多样,我们可以添加 WHERE 子句,排除与查询记录具有相同 wiki_id 的条目:

WITH (SELECT emb FROM wikiEmbeddings WHERE id = 120356) AS lookup,
     (SELECT wiki_id FROM wikiEmbeddings WHERE id = 120356) AS lookup_wiki_id
SELECT id, title, text, url, L2Distance(emb, lookup) AS dist
FROM wikiEmbeddings
WHERE wiki_id <> lookup_wiki_id
ORDER BY dist
LIMIT 3
FORMAT Vertical;
Row 1:
──────
id:    6420
title: Germany
text:  Football is the most popular sport in Germany. With more than 7 million official members, the German Football Association ("Deutscher Fußball-Bund") is the largest single-sport organisation worldwide, and the German top league, the Bundesliga, attracts the second-highest average attendance of all professional sports leagues in the world. The German men's national football team won the FIFA World Cup in 1954, 1974, 1990, and 2014, the UEFA European Championship in 1972, 1980 and 1996, and the FIFA Confederations Cup in 2017.
url:   https://en.wikipedia.org/wiki?curid=11867
dist:  6.9714794

Row 2:
──────
id:    6336
title: Germany
text:  Formal unification of Germany into the modern nation-state was commenced on 18 August 1866 with the North German Confederation Treaty establishing the Prussia-led North German Confederation later transformed in 1871 into the German Empire. After World War I and the German Revolution of 1918–1919, the Empire was in turn transformed into the semi-presidential Weimar Republic. The Nazi seizure of power in 1933 led to the establishment of a totalitarian dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany as a whole was organized into two separate polities with limited sovereignity: the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany, while Berlin continued its Four Power status. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic.
url:   https://en.wikipedia.org/wiki?curid=11867
dist:  6.98255

Row 3:
──────
id:    7767
title: FIFA World Cup
text:  After FIFA was founded in 1904, it tried to arrange an international football tournament between nations outside the Olympic framework in Switzerland in 1906. These were very early days for international football, and the official history of FIFA describes the competition as having been a failure.
url:   https://en.wikipedia.org/wiki?curid=11370
dist:  7.5603456
dist:  7.736428

正如前文所述,WHERE 子句中的过滤条件可以采用后置或前置方式执行。

后置过滤指的是先使用向量索引检索候选数据,再对其执行过滤条件;而前置过滤则是先筛选符合条件的记录,再进行向量相似度匹配。

你可以通过设置系统参数 vector_search_filter_strategy 来控制过滤方式。该参数默认值为 auto,表示由查询引擎根据启发式规则自动决定最优策略。你也可以将其显式设置为 postfilter 或 prefilter,以指定过滤顺序。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值