全文检索:
http://www.postgresql.org/docs/9.3/interactive/textsearch.html
postgres 切词:
http://blog.163.com/digoal@126/blog/static/163877040201422410175698/
ts_parse:
获取切词后的所有词组
SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围');
SELECT * FROM ts_parse('english', 'hello world! 2010年保障房建设在全国范围');
to_tsvector,to_tsquery:获取一个text的所有切词
select to_tsvector('english', ' example is a concatenation of title and body');
select ' before, full text search functionality includes the ability to do many m' @@ 'ability';
select to_tsvector(' before, full text search functionality includes the ability to do many m') @@ to_tsquery('ability');
在tsvector字段上加Gin索引可以加快检索速度:
http://www.postgresql.org/docs/9.3/interactive/textsearch-tables.html#TEXTSEARCH-TABLES-SEARCH
建立一个索引需要格式如CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));#body==>text字段
建索引的时候需注意
to_tsvector函数有两个版本
:
That is, WHERE to_tsvector('english', body) @@ 'a & b' can use the index, but WHERE to_tsvector(body) @@ 'a & b' cannot.
CREATE INDEX name ON table USING gist(column);
Creates a GiST (Generalized Search Tree)-based index. The column can be of tsvector or tsquery type.
Creates a GIN (Generalized Inverted Index)-based index. The column must be of tsvector type.
使用zhparser插件:
一般postgresql的切词是english的,对于中文切词不支持。
http://blog.163.com/digoal@126/blog/static/163877040201422410175698/
相似度pg_trgm :
此插件暂时对中文不支持。
1:安装中间件:
postgres=# create extension pg_trgm;
CREATE EXTENSION
查看pg_trgm的所有方法:
postgres=# \dx+ pg_trgm
Objects in extension "pg_trgm"
Object Description
--------------------------------------------------------------------------------------------------
function gin_extract_query_trgm(text,internal,smallint,internal,internal,internal,internal)
function gin_extract_value_trgm(text,internal)
function gin_trgm_consistent(internal,smallint,text,integer,internal,internal,internal,internal)
function gtrgm_compress(internal)
function gtrgm_consistent(internal,text,integer,oid,internal)
function gtrgm_decompress(internal)
function gtrgm_distance(internal,text,integer,oid)
function gtrgm_in(cstring)
function gtrgm_out(gtrgm)
function gtrgm_penalty(internal,internal,internal)
function gtrgm_picksplit(internal,internal)
function gtrgm_same(gtrgm,gtrgm,internal)
function gtrgm_union(bytea,internal)
function set_limit(real)
function show_limit()
function show_trgm(text)
function similarity(text,text)
function similarity_dist(text,text)
function similarity_op(text,text)
operator %(text,text)
operator <->(text,text)
operator class gin_trgm_ops for access method gin
operator class gist_trgm_ops for access method gist
operator family gin_trgm_ops for access method gin
operator family gist_trgm_ops for access method gist
type gtrgm
(26 rows)
2:函数介绍:
1:similarity 计算两个字符串的相似度
postgres=# select similarity('entries','entry');
similarity
------------
0.4
2:show_limit、set_limit查看和修改系统的相似度阀值:
postgres=# select show_limit();
show_limit
------------
0.3
postgres=# select set_limit(0.3);
set_limit
-----------
0.3
3:similarity_op,和操作符“%”一样都是判断两个字符串的相似度是否大于等于系统相似度阀值:
postgres=# select similarity_op('entries','ries');
similarity_op
---------------
t
(1 row)
postgres=# select similarity_op('entries','ri');
similarity_op
---------------
f
<span style="font-family: Arial, Helvetica, sans-serif;">postgres=# select 'entries' % 'ries';</span>
?column?
----------
t
postgres=# select 'entries' % 'ri';
?column?
----------
f
4:操作符“<->” 返回两个相似度的距离(1减去其两个字符串的相似度)
postgres=# select similarity('entries' ,'entri');
similarity
------------
0.555556
postgres=# select 'entries' <-> 'entri';
?column?
----------
0.444444
(1 row)
3:比较like/ilike,%,<->走索引和不走索引的时间比
like是区分大小写的。
ilike是不区分大小写的。
%和<->是pg_trgm的操作符。
postgres=# create table test_pg_trgm (id serial primary key,info text);
CREATE TABLE
postgres=# insert into test_pg_trgm (info) select generate_series(1,1000000)||'digoal';
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (5 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
INSERT 0 1000000
建立gist索引:
postgres=# create index idx_test_pg_trgm_1 on test_pg_trgm using gist (info gist_trgm_ops);
LOG: checkpoints are occurring too frequently (6 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
CREATE INDEX
Time: 35253.859 ms
建立gin索引:
postgres=# create index idx_test_pg_trgm_1 on test_pg_trgm using gin (info gin_trgm_ops);
LOG: checkpoints are occurring too frequently (3 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (4 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (3 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
CREATE INDEX
Time: 11286.458 ms
1:like
不走索引:
postgres=# select count(*) from test_pg_trgm where info ilike '%123444%al%';
count
-------
1
(1 row)
Time: 373.475 ms
走gist索引:
postgres=# select count(*) from test_pg_trgm where info ilike '%123444%al%';
count
-------
1
(1 row)
Time: 16.443 ms
走gin索引:
postgres=# select count(*) from test_pg_trgm where info ilike '%123444%al%';
count
-------
1
(1 row)
Time: 3.065 ms
2:%
不走索引:
postgres=# select count(*) from test_pg_trgm where info % '%123%oal%';
count
-------
1132
(1 row)
Time: 1532.612 ms
走gist索引:
postgres=# select count(*) from test_pg_trgm where info % '%123%oal%';
count
-------
1132
(1 row)
Time: 166.685 ms
走gin索引:
postgres=# select count(*) from test_pg_trgm where info % '%123%oal%';
count
-------
1132
(1 row)
Time: 2478.780 ms
3:<->
不走索引:postgres=# select * from test_pg_trgm order by info <-> '1234digoal' limit 1;
id | info
------+------------
1234 | 1234digoal
(1 row)
Time: 1759.215 ms
走gist索引:
postgres=# select * from test_pg_trgm order by info <-> '1234digoal' limit 1;
id | info
------+------------
1234 | 1234digoal
(1 row)
Time: 126.471 ms
走gin索引:
postgres=# select * from test_pg_trgm order by info <-> '1234digoal' limit 1;
id | info
------+------------
1234 | 1234digoal
(1 row)
Time: 1690.692 ms