Translated from: Best way to select random rows PostgreSQL
I want a random selection of rows in PostgreSQL. I tried this:
select * from table where random() < 0.01;
But others recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 million rows, and I want it to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
#1
Reference: https://stackoom.com/question/AoGO/选择随机行PostgreSQL的最佳方法
#2
PostgreSQL ORDER BY random(), selecting rows in random order:
select your_columns from your_table ORDER BY random()
PostgreSQL ORDER BY random() with DISTINCT:
select * from
(select distinct your_columns from your_table) table_alias
ORDER BY random()
PostgreSQL ORDER BY random(), limited to one row:
select your_columns from your_table ORDER BY random() limit 1
#3
A variation of the materialized view "Possible alternative" outlined by Erwin Brandstetter is possible.
Say, for example, that you don't want duplicates in the randomized values that are returned. You will then need to set a boolean value on the primary table containing your (non-randomized) set of values.
Assuming this is the input table:
id_values
 id | used
----+-------
  1 | FALSE
  2 | FALSE
  3 | FALSE
  4 | FALSE
  5 | FALSE
...
Populate the id_values table as needed. Then, as described by Erwin, create a materialized view that randomizes the id_values table once:
CREATE MATERIALIZED VIEW id_values_randomized AS
SELECT id
FROM id_values
ORDER BY random();
Note that the materialized view does not contain the used column, because it would quickly become out of date. Nor does the view need to contain other columns that may be in the id_values table.
To obtain (and "consume") random values, use an UPDATE ... RETURNING on id_values, selecting the candidate ids from id_values_randomized with a join and applying the desired criteria to obtain only relevant possibilities. For example:
UPDATE id_values
SET used = TRUE
WHERE id_values.id IN
(SELECT i.id
FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
WHERE (NOT i.used)
LIMIT 5)
RETURNING id;
Change LIMIT as necessary: if you only need one random value at a time, change LIMIT to 1.
With the proper indexes on id_values, I believe the UPDATE ... RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new ids are added to the id_values table.
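The maintenance steps described above could look roughly like the following. This is only a sketch: the partial index is one guess at what "proper indexes" might mean here, and the index name id_values_unused_idx is hypothetical.

```sql
-- Hypothetical partial index (name is an assumption): it covers only
-- rows that have not been consumed yet, which is what the
-- "NOT i.used" filter in the UPDATE looks for.
CREATE INDEX id_values_unused_idx ON id_values (id) WHERE NOT used;

-- Rebuild the shuffled view after new ids were added to id_values.
-- This rereads and re-sorts the whole table, so run it off-peak.
REFRESH MATERIALIZED VIEW id_values_randomized;
```

Whether the planner actually uses such an index depends on the data, so checking the plan with EXPLAIN is advisable.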
#4
If you want just one row, you can use a calculated offset derived from count.
select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
#5
Add a column called r with type serial. Index r.
Assume we have 200,000 rows. We are going to generate a random number n, where 0 < n <= 200,000.
Select rows with r > n, sort them ASC, and select the smallest one.
Code:
select * from YOUR_TABLE
where r > (
select (
select reltuples::bigint AS estimate
from pg_class
where oid = 'public.YOUR_TABLE'::regclass) * random()
)
order by r asc limit(1);
The code is self-explanatory. The subquery in the middle quickly estimates the table's row count, using the approach from https://stackoverflow.com/a/7945274/1271094 .
At the application level, you need to execute the statement again if n is greater than the number of rows (the query then returns nothing), or if you need to select multiple rows.
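One possible way to avoid the retry is to bound n by the largest existing r instead of the row-count estimate. This is a variation, not the method from the answer above; it assumes r is indexed (so max(r) is cheap) and reuses the placeholder name YOUR_TABLE.

```sql
-- Variation (an assumption, not the original answer's query):
-- draw n from 1 .. max(r), so at least one row always satisfies r >= n
-- as long as the table is not empty.
SELECT *
FROM YOUR_TABLE
WHERE r >= (
    SELECT floor(random() * max(r))::bigint + 1
    FROM YOUR_TABLE
)
ORDER BY r ASC
LIMIT 1;
```

As with the original query, gaps in r (for example after deletes) make rows that follow a gap more likely to be picked, so the result is not perfectly uniform.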
#6
Starting with PostgreSQL 9.5, there is a new syntax dedicated to getting random elements from a table:
SELECT * FROM mytable TABLESAMPLE SYSTEM (5);
This example will give you roughly 5% of the rows from mytable.
See the documentation for more explanation: http://www.postgresql.org/docs/current/static/sql-select.html
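A couple of variations on the TABLESAMPLE clause, as a sketch only: the percentages, the seed value 42, and the limit of 1000 are arbitrary choices for illustration.

```sql
-- BERNOULLI decides row by row, while SYSTEM picks whole pages.
-- BERNOULLI is slower but the sample is less clustered.
-- REPEATABLE fixes the seed, so the same sample comes back
-- while the table is unchanged.
SELECT * FROM mytable TABLESAMPLE BERNOULLI (1) REPEATABLE (42);

-- Sample about 1% of the pages first, then shuffle only that subset
-- and keep at most 1000 rows, instead of sorting the whole table.
SELECT * FROM mytable TABLESAMPLE SYSTEM (1)
ORDER BY random()
LIMIT 1000;
```

The argument is a percentage, so the number of rows returned varies from run to run; the second query can return fewer than 1000 rows if the sample is small.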