GBase 8c 函数与操作符——34-全文检索 | 表和索引

1、搜索一个表

在没有索引的情况下的全文检索:查询文档中每行title匹配friend:

gbase=# SELECT * FROM pgweb;
 id |                                                                                     body                                                                                     |   title   | last_mod_date
----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+---------------
  3 | England is a country that is part of the United Kingdom. It shares land borders with Scotland to the north and Wales to the west.                                            | England   | 2010-01-01
  5 | Russia, also officially known as the Russian Federation, is a sovereign state in northern Eurasia.                                                                           | Russia    | 2010-01-01
  1 | China, officially the People's Republic of China(PRC), located in Asia, is the world's most populous state.                                                                  | China     | 2010-01-01
  1 | China, officially the People's Republic of China(PRC), located in Asia, is the world's most populous state.                                                                  | China     | 2010-01-01
  4 | Australia, officially the Commonwealth of Australia, is a country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. | Australia | 2010-01-01
  8 | France, is a sovereign state comprising territory in western Europe and several overseas regions and territories.                                                            | France    | 2010-01-01
  9 | Italy officially the Italian Republic, is a unitary parliamentary republic in Europe.                                                                                        | Italy     | 2010-01-01
 10 | India, officially the Republic of India, is a country in South Asia.                                                                                                         | India     | 2010-01-01
 13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America.                                                                   | Mexico    | 2010-01-01
  2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley.                                                      | America   | 2010-01-01
  6 | Japan is an island country in East Asia.                                                                                                                                     | Japan     | 2010-01-01
  7 | Germany, officially the Federal Republic of Germany, is a sovereign state and federal parliamentary republic in central-western Europe.                                      | Germany   | 2010-01-01
 11 | Brazil, officially the Federative Republic of Brazil,is the largest country in both South America and Latin America.                                                         | Brazil    | 2010-01-01
 12 | Canada is a country in the northern half of North America.                                                                                                                   | Canada    | 2010-01-01
(14 rows)

gbase=# SELECT id, body, title FROM pgweb WHERE to_tsvector('english', body) @@to_tsquery('english', 'america');
 id |                                                          body                                                           |  title
----+-------------------------------------------------------------------------------------------------------------------------+---------
 13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America.              | Mexico
  2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America
 11 | Brazil, officially the Federative Republic of Brazil,is the largest country in both South America and Latin America.    | Brazil
 12 | Canada is a country in the northern half of North America.                                                              | Canada
(4 rows)

该查询还会返回friends和friendly等类似的结果。

上面的查询指定使用english配置参数。也可以忽略配置参数:

gbase=# SELECT id, body, title FROM pgweb WHERE to_tsvector(body) @@to_tsquery('america');
 id |                                                          body                                                           |  title
----+-------------------------------------------------------------------------------------------------------------------------+---------
 13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America.              | Mexico
  2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America
 11 | Brazil, officially the Federative Republic of Brazil,is the largest country in both South America and Latin America.    | Brazil
 12 | Canada is a country in the northern half of North America.                                                              | Canada
(4 rows)

这个查询将使用由default_text_search_config设置的配置。

2、创建索引

可以创建一个GIN索引来加速文本搜索:

gbase=# CREATE INDEX pgweb_idx ON pgweb USING GIN(to_tsvector('english', body));
CREATE INDEX

to_tsvector()函数有两个版本。只输一个参数的版本和输两个参数的版本。只输一个参数时,系统默认采用default_text_search_config所指定的分词器。

注意:创建索引时必须使用to_tsvector的两参数版本。只有指定了分词器名称的全文检索函数才可以在索引表达式中使用。这是因为索引的内容必须不受default_text_search_config的影响,否则索引内容可能不一致。由于default_text_search_config的值可以随时调整,从而导致不同条目生成的tsvector采用了不同的分词器,并且没有办法区分究竟使用了哪个分词器。正确地转储和恢复这样的索引也是不可能的。

因为在上述创建索引中to_tsvector使用了两个参数,只有当查询时也使用了两个参数,且参数值与索引中相同时,才会使用该索引。也就是说,WHERE to_tsvector(‘english’,body) @@ ‘a & b’ 可以使用索引,但WHERE to_tsvector(body) @@ 'a & b’不能使用索引。这确保只使用这样的索引——索引各条目是使用相同的分词器创建的。

索引中的分词器名称由另一列指定时可以创建更复杂的表达式索引,例如:

gbase=# CREATE INDEX pgweb_idx_2 ON pgweb USING GIN(to_tsvector(config_name, body));
CREATE INDEX

这里config_name是pgweb表中的一个列。可以在同一个索引中混合配置,同时记录配置所属的索引项。

索引可以连接列:

gbase=# CREATE INDEX pgweb_idx_3 ON pgweb USING GIN(to_tsvector('english', title || ' ' || body));
CREATE INDEX

另一种方法是创建一个单独的tsvector列来保存to_tsvector的输出。这个例子是title和body的连接,使用coalesce来保证当其他域为NULL时一个域仍然能留在索引中:

gbase=# ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
ALTER TABLE
gbase=# UPDATE pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
UPDATE 14

然后创建一个GIN索引来加速搜索:

gbase=# CREATE INDEX textsearch_idx ON pgweb USING GIN(textsearchable_index_col);
CREATE INDEX

最后执行一个快速的全文检索:

gbase=# SELECT title FROM pgweb WHERE textsearchable_index_col @@ to_tsquery('north & america') ORDER BY last_mod_date DESC LIMIT 10;
 title
--------
 Canada
 Mexico
(2 rows)

在使用一个单独的列来存储tsvector表示时,有必要创建一个触发器在title或body改变时保证tsvector列为当前值。

单独列方法相对于表达式索引的一个优势在于,它不必为了利用索引而在查询中显式地指定文本搜索配置。另一个优势是搜索将会更快,因为它不必重做to_tsvector调用来验证索引匹配。表达式索引方法更容易建立,但是它要求更少的磁盘空间,因为tsvector表示没有被显式地存储下来。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值