1、搜索一个表
在没有索引的情况下的全文检索:查询文档中每行title匹配friend:
gbase=# SELECT * FROM pgweb;
id | body | title | last_mod_date
----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+---------------
3 | England is a country that is part of the United Kingdom. It shares land borders with Scotland to the north and Wales to the west. | England | 2010-01-01
5 | Russia, also officially known as the Russian Federation, is a sovereign state in northern Eurasia. | Russia | 2010-01-01
1 | China, officially the People's Republic of China(PRC), located in Asia, is the world's most populous state. | China | 2010-01-01
1 | China, officially the People's Republic of China(PRC), located in Asia, is the world's most populous state. | China | 2010-01-01
4 | Australia, officially the Commonwealth of Australia, is a country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. | Australia | 2010-01-01
8 | France, is a sovereign state comprising territory in western Europe and several overseas regions and territories. | France | 2010-01-01
9 | Italy officially the Italian Republic, is a unitary parliamentary republic in Europe. | Italy | 2010-01-01
10 | India, officially the Republic of India, is a country in South Asia. | India | 2010-01-01
13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America. | Mexico | 2010-01-01
2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America | 2010-01-01
6 | Japan is an island country in East Asia. | Japan | 2010-01-01
7 | Germany, officially the Federal Republic of Germany, is a sovereign state and federal parliamentary republic in central-western Europe. | Germany | 2010-01-01
11 | Brazil, officially the Federative Republic of Brazil,is the largest country in both South America and Latin America. | Brazil | 2010-01-01
12 | Canada is a country in the northern half of North America. | Canada | 2010-01-01
(14 rows)
gbase=# SELECT id, body, title FROM pgweb WHERE to_tsvector('english', body) @@to_tsquery('english', 'america');
id | body | title
----+-------------------------------------------------------------------------------------------------------------------------+---------
13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America. | Mexico
2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America
11 | Brazil, officially the Federative Republic of Brazil,is the largest country in both South America and Latin America. | Brazil
12 | Canada is a country in the northern half of North America. | Canada
(4 rows)
该查询还会返回friends和friendly等类似的结果。
上面的查询指定使用english配置参数。也可以忽略配置参数:
gbase=# SELECT id, body, title FROM pgweb WHERE to_tsvector(body) @@to_tsquery('america');
id | body | title
----+-------------------------------------------------------------------------------------------------------------------------+---------
13 | Mexico, officially the United Mexican States, is a federal republic in the southern part of North America. | Mexico
2 | America is a rock band, formed in England in 1970 by multi-instrumentalists Dewey Bunnell, Dan Peek, and Gerry Beckley. | America
11 | Brazil, officially the Federative Republic of Brazil,is the largest country in both South America and Latin America. | Brazil
12 | Canada is a country in the northern half of North America. | Canada
(4 rows)
这个查询将使用由default_text_search_config设置的配置。
2、创建索引
可以创建一个GIN索引来加速文本搜索:
gbase=# CREATE INDEX pgweb_idx ON pgweb USING GIN(to_tsvector('english', body));
CREATE INDEX
to_tsvector()函数有两个版本。只输一个参数的版本和输两个参数的版本。只输一个参数时,系统默认采用default_text_search_config所指定的分词器。
注意:创建索引时必须使用to_tsvector的两参数版本。只有指定了分词器名称的全文检索函数才可以在索引表达式中使用。这是因为索引的内容必须不受default_text_search_config的影响,否则索引内容可能不一致。由于default_text_search_config的值可以随时调整,从而导致不同条目生成的tsvector采用了不同的分词器,并且没有办法区分究竟使用了哪个分词器。正确地转储和恢复这样的索引也是不可能的。
因为在上述创建索引中to_tsvector使用了两个参数,只有当查询时也使用了两个参数,且参数值与索引中相同时,才会使用该索引。也就是说,WHERE to_tsvector(‘english’,body) @@ ‘a & b’ 可以使用索引,但WHERE to_tsvector(body) @@ 'a & b’不能使用索引。这确保只使用这样的索引——索引各条目是使用相同的分词器创建的。
索引中的分词器名称由另一列指定时可以创建更复杂的表达式索引,例如:
gbase=# CREATE INDEX pgweb_idx_2 ON pgweb USING GIN(to_tsvector(config_name, body));
CREATE INDEX
这里config_name是pgweb表中的一个列。可以在同一个索引中混合配置,同时记录配置所属的索引项。
索引可以连接列:
gbase=# CREATE INDEX pgweb_idx_3 ON pgweb USING GIN(to_tsvector('english', title || ' ' || body));
CREATE INDEX
另一种方法是创建一个单独的tsvector列来保存to_tsvector的输出。这个例子是title和body的连接,使用coalesce来保证当其他域为NULL时一个域仍然能留在索引中:
gbase=# ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
ALTER TABLE
gbase=# UPDATE pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
UPDATE 14
然后创建一个GIN索引来加速搜索:
gbase=# CREATE INDEX textsearch_idx ON pgweb USING GIN(textsearchable_index_col);
CREATE INDEX
最后执行一个快速的全文检索:
gbase=# SELECT title FROM pgweb WHERE textsearchable_index_col @@ to_tsquery('north & america') ORDER BY last_mod_date DESC LIMIT 10;
title
--------
Canada
Mexico
(2 rows)
在使用一个单独的列来存储tsvector表示时,有必要创建一个触发器在title或body改变时保证tsvector列为当前值。
单独列方法相对于表达式索引的一个优势在于,它不必为了利用索引而在查询中显式地指定文本搜索配置。另一个优势是搜索将会更快,因为它不必重做to_tsvector调用来验证索引匹配。表达式索引方法更容易建立,但是它要求更少的磁盘空间,因为tsvector表示没有被显式地存储下来。