ZT oracle全文索引

最新推荐文章于 2023-12-16 13:30:18 发布

转载最新推荐文章于 2023-12-16 13:30:18 发布 · 144 阅读

本文详细介绍了Oracle数据库中全文索引的创建与维护方法，包括索引的不同类型、创建语法及参数配置，并提供了实例演示如何使用全文索引进行高效文本搜索。

全文索引通常用在文本或网页搜索中,比如类似对 "大学生活真好!" 或者一个文件进行索引,查询方法跟使用平常的SELECT的区别是,需要在WHERE条件中使用一个特定的操作符,全文索引分几个类别,分别使用于不同的搜索环境,不同的类别使用不同的操作符进行搜索.没有全文索引时,这类查询通常用下面的sql查询：

SELECT id,doc FROM table WHERE doc like ‘%国家%’;

[@more@]

SELECT id,doc FROM table WHERE doc like ‘%国家%’;

这些查询能够很好search我们想要的,但随着数据库的增长,查询速度是不可忍受的,数据库只能通过full table scan来完成这项工作.有了全文索引后,能够很好的解决这类问题,oracle会采用一种类似普通索引的技术来索引文档.其实在内部oracle利用了特殊的词法(lexer)分割器,词法分割器跟语言有关系,比如:english、日语、汉字等都有不同的分割器.分割器将文本根据平常的使用习惯分割成最小的词组单元进行存储,oracle 称他为token.比如"大学生活真好!" oracke可能会分割成"大学生"、"生活"、“真好"、"生活真好"等小单元进行存储,所以创建后的全文索引的容量是表的好多倍.索引的内容可以来自表中的某个列或者多个列或子表的列或者某个文件或某个网页等.

索引类别:

到10GR2为止,共有四种全文搜索类型:

Index Type	Description	Supported Preferences and Parameters	Query Operator	Notes
`CONTEXT`	Use this index to build a text retrieval application when your text consists of large coherent documents. You can index documents of different formats such as MS Word, HTML or plain text. With a context index, you can customize your index in a variety of ways. This index type requires `CTX_DDL.SYNC_INDEX` after DML on base table.	All `CREATE` `INDEX` preferences and parameters supported except for `INDEX` `SET`. These supported parameters include the index partition clause, and the format, charset, and language columns.	`CONTAINS` Grammar is called the `CONTEXT` grammar, which supports a rich set of operations. The `CTXCAT` grammar can be used with query templating.	Supports all documents services and query services. Supports indexing of partitioned text tables.
`CTXCAT`	Use this index type for better mixed query performance. Typically, with this index type, you index small documents or text fragments. Other columns in the base table, such as item names, prices and descriptions can be included in the index to improve mixed query performance. This index type is transactional, automatically updating itself after DML to base table. No `CTX_DDL.SYNC_INDEX` is necessary.	`INDEX` `SET` `LEXER` `STOPLIST` `STORAGE` `WORDLIST` (only prefix_index attribute supported for Japanese data) Format, charset, and language columns not supported. Table and index partitioning not supported.	`CATSEARCH` Grammar is called `CTXCAT`, which supports logical operations, phrase queries, and wildcarding. The `CONTEXT` grammar can be used with query templating. Theme querying is supported.	This index is larger and takes longer to build than a `CONTEXT` index. The size of a `CTXCAT` index is related to the total amount of text to be indexed, number of indexes in the index set, and number of columns indexed. Carefully consider your queries and your resources before adding indexes to the index set. The `CTXCAT` index does not support table and index partitioning, documents services (highlighting, markup, themes, and gists) or query services (explain, query feedback, and browse words.)
`CTXRULE`	Use `CTXRULE` index to build a document classification or routing application. The `CTXRULE` index is an index created on a table of queries, where the queries define the classification or routing criteria.	See "CTXRULE Parameters and Limitations".	`MATCHES`	Single documents (plain text, HTML, or XML) can be classified using the `MATCHES` operator, which turns a document into a set of queries and finds the matching rows in the `CTXRULE` index.
`CTXPATH`	Create this index when you need to speed up existsNode() queries on an XMLType column.	`STORAGE`	Use with existsNode()	Can only create this index on XMLType column. Although this index type can be helpful for `existsNode()` queries, it is not required for XML searching. See "XML Searching"

创建语法

CREATE INDEX [schema.]index ON [schema.]table(column) INDEXTYPE IS

ctxsys.context [ONLINE]
[LOCAL [(PARTITION [partition] [PARAMETERS('paramstring')]

[, PARTITION [partition] [PARAMETERS('paramstring')]])]

[PARAMETERS(paramstring)] [PARALLEL n] [UNUSABLE]];

这里的context还可以换成ctxcat、ctxrule、ctxpath几个类型,支持分区索引.这里很重要的部分

是PARAMETERS部分,定义了索引的各种创建条件.

PARAMETERS(paramstring)

The syntax for paramstring is as follows:

paramstring =

'[DATASTORE datastore_pref]  定义要所索引的数据来自哪里,
 [FILTER filter_pref] 定义过滤器,
 [CHARSET COLUMN charset_column_name]  当选用显示指定

                                   一个过滤器需要的列
 [FORMAT COLUMN format_column_name] 过滤器需要格式化的列

 [LEXER lexer_pref] 定义一个此法分析器
 [LANGUAGE COLUMN language_column_name]  多语言环境中,显示指定lexer分析的多语言列

 [WORDLIST wordlist_pref]  定义一个常用搜索词表
 [STORAGE storage_pref]  定义索引的存储参数
 [STOPLIST stoplist] 定义不需要index的词组列表
 [SECTION GROUP section_group] 定义一个索引区域(通常用在索引html、xml的环境）
 [MEMORY memsize] 分配索引创建需要的内存空间(in 10G)
 [POPULATE | NOPOPULATE] 是否创建一个空的索引
 [[METADATA] SYNC (MANUAL | EVERY "interval-string" | ON COMMIT)] 定义索引同步时间(in 10G)
 [TRANSACTIONAL]' 是否开启事务更新(in 10G)

```
索引创建      
```
```
 [oracle@ique ~]$ sqlplus mayp/mayp
```

SQL*Plus : Release 10.2.0.4.0 - Production on Thu Aug 21 19:45:43 2008

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Connected to:Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - ProductionWith the Partitioning, OLAP, Data Mining and Real Application Testing option

SQL> create table oratext(id number,name varchar2(30),address varchar2(200))

pctfree 10 tablespace assm/  2

Table created.

SQL> alter table oratext add constraint pk_oratext primary key(id) using index/  2

Table altered.

SQL> exec ctx_ddl.create_preference('mayp_storage','basic_storage');Elapsed: 00:00:00.22

SQL> exec ctx_ddl.set_attribute('mayp_storage','I_TABLE_CLAUSE','TABLESPACE TEST');Elapsed: 00:00:00.11SQL> exec ctx_ddl.set_attribute('mayp_storage','K_TABLE_CLAUSE','TABLESPACE TEST');Elapsed: 00:00:00.04SQL> exec ctx_ddl.set_attribute('mayp_storage','R_TABLE_CLAUSE','TABLESPACE TEST');Elapsed: 00:00:00.06

SQL> set feedback onSQL> exec ctx_ddl.set_attribute('mayp_storage','N_TABLE_CLAUSE','TABLESPACE TEST');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01SQL> exec ctx_ddl.set_attribute('mayp_storage','I_INDEX_CLAUSE','TABLESPACE TEST COMPRESS 2');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01SQL> exec ctx_ddl.set_attribute('mayp_storage','P_TABLE_CLAUSE','TABLESPACE TEST');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04SQL> exec ctx_ddl.create_preference('mayp_lexer','chinese_lexer')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.02SQL> exec ctx_ddl.create_preference('mayp_datastore','direct_datastore');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.06

SQL> create index oratext_address on oratext(address) indextype is ctxsys.context

parameters('lexer mayp_lexer datastore mayp_datastore storage mayp_storage')/  2

Index created.

Elapsed: 00:00:04.48

这里我重新定了了三个索引的参数:mayp_datastore(要索引的数据的存储方式),mayp_storage(索引的存储位置),

mayp_lexer(词法分割器).在索引创建完之后,会发现多了以dr$index_name开头的四张表和

一个dr$索引和oratext_address索引:

SQL> select table_name,tablespace_name from user_tables;

Tab.Name                       TABLESPACE_NAME------------------------------ ------------------------------DR$ORATEXT_ADDRESS$I           TESTDR$ORATEXT_ADDRESS$R           TESTSYS_IOT_OVER_45158             MSSMSYS_IOT_MAP_45158              MSSMORATEXT                        ASSMDR$ORATEXT_ADDRESS$NDR$ORATEXT_ADDRESS$K

SQL> select index_name,tablespace_name from user_indexes;

Ind.Name                       TABLESPACE_NAME------------------------------ ------------------------------SYS_IL0000011906C00002$$       TESTPK_P                           TESTSYS_IL0000011907C00002$$       TESTDR$ORATEXT_ADDRESS$X           TESTSYS_IL0000046825C00006$$       TESTSYS_IL0000046830C00002$$       TESTORATEXT_ADDRESSSYS_IL0000046726C00006$$       TESTSYS_IL0000046731C00002$$       TEST

其中DR$ORATEXT_ADDRESS$I这张表最关键,它存储被lexer分割后的token,也就是我们可以利用搜索的关键字 .

SQL> select token_text from dr$oratext_address$i;

TOKEN_TEXT----------------------------------------------------------------1128工业海街号街南路南苏桐苏州市桐路星海园区

13 rows selected.

Elapsed: 00:00:00.01

通过一个查询来看看:

SQL> select name,address from  oratext where contains(address,'园区')>0;

NAME                           ADDRESS------------------------------ --------------------------------------------------王皓                           苏州市工业园区星海街南11号司马格达                       苏州市工业园区苏桐路28号张磊                           苏州市工业园区苏桐路28号

3 rows selected.

Elapsed: 00:00:01.49

SQL> select name,address from  oratext where contains(address,'苏桐路')>0;

NAME                           ADDRESS------------------------------ --------------------------------------------------司马格达                       苏州市工业园区苏桐路28号张磊                           苏州市工业园区苏桐路28号

2 rows selected.

Elapsed: 00:00:00.07

SQL> select name,address from  oratext where contains(address,'星海街')>0;

NAME                           ADDRESS------------------------------ --------------------------------------------------王皓                           苏州市工业园区星海街南11号

1 row selected.

```
索引维护
```

ctxsys.context全文索引在发生DML语句后,不同自动同步索引,必须通过ctx_ddl.sync_index同步,

可以定义一个job进行同步,

SQL> insert into oratext values(4,'马成','苏州市观前街7号');

1 row created.

Elapsed: 00:00:00.08SQL> commit;

Commit complete.

Elapsed: 00:00:00.00

SQL> select name,address from oratext where contains(address,'观前街')>0;

no rows selected

Elapsed: 00:00:00.09

SQL> exec ctx_ddl.sync_index('oratext_address','2m');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.18

SQL> select name,address from oratext where contains(address,'观前街')>0;

NAME                           ADDRESS------------------------------ --------------------------------------------------马成                           苏州市观前街7号

1 row selected.

Elapsed: 00:00:00.02

在索引的同步过程中,可能会产生碎片,我们可以定时通过ctx_ddl.optimize_index维护索引.

ctx_report package可以报告全文索引的一些统计信息:

ctx_report.index_size：报告索引的大小

Elapsed: 00:00:00.00SQL> select ctx_report.index_size('oratext_address') from dual;

CTX_REPORT.INDEX_SIZE('ORATEXT_ADDRESS')--------------------------------------------------------------------------------===========================================================================                    INDEX SIZE FOR MAYP.ORATEXT_ADDRESS===========================================================================TABLE:                               MAYP.DR$ORATEXT_ADDRESS$ITABLESPACE NAME:                     TESTBLOCKS ALLOCATED:                                                        8BLOCKS USED:                                                             3BYTES ALLOCATED:                                         65,536 (64.00 KB)BYTES USED:                                              24,576 (24.00 KB)

LOB SEGMENT:                         MAYP.SYS_LOB0000046825C00006$$TABLE NAME:                          MAYP.DR$ORATEXT_ADDRESS$ILOB COLUMN:                          TOKEN_INFOTABLESPACE NAME:                     TESTBLOCKS ALLOCATED:                                                        8BLOCKS USED:                                                             2BYTES ALLOCATED:                                         65,536 (64.00 KB)BYTES USED:                                              16,384 (16.00 KB)

INDEX (NORMAL):                      MAYP.DR$ORATEXT_ADDRESS$XTABLE NAME:                          MAYP.DR$ORATEXT_ADDRESS$ITABLESPACE NAME:                     TEST.........................

TOTALS FOR INDEX MAYP.ORATEXT_ADDRESS---------------------------------------------------------------------------TOTAL BLOCKS ALLOCATED:                                                 72TOTAL BLOCKS USED:                                                      19TOTAL BYTES ALLOCATED:                                 589,824 (576.00 KB)TOTAL BYTES USED:                                      155,648 (152.00 KB)

ctx_report.create_index_script:创建索引定义脚本

SQL> select ctx_report.create_index_script('oratext_address') from dual;

CTX_REPORT.CREATE_INDEX_SCRIPT('ORATEXT_ADDRESS')--------------------------------------------------------------------------------begin  ctx_ddl.create_preference('"ORATEXT_ADDRESS_DST"','DIRECT_DATASTORE');end;/

begin  ctx_ddl.create_preference('"ORATEXT_ADDRESS_FIL"','NULL_FILTER');end;..........................

create index "MAYP"."ORATEXT_ADDRESS"  on "MAYP"."ORATEXT"      ("ADDRESS")  indextype is ctxsys.context  parameters('    datastore       "ORATEXT_ADDRESS_DST"    filter          "ORATEXT_ADDRESS_FIL"    section group   "ORATEXT_ADDRESS_SGP"    lexer           "ORATEXT_ADDRESS_LEX"    wordlist        "ORATEXT_ADDRESS_WDL"    stoplist        "ORATEXT_ADDRESS_SPL"    storage         "ORATEXT_ADDRESS_STO"  ')/

begin  ctx_output.end_log;end;/1 row selected.

Elapsed: 00:00:00.23

ctx_report.index_stats：报告索引的内部结构,已经碎片的程度

SQL> declare    l_index_stats clob;    begin     ctx_report.index_stats(index_name=>'oratext_address',report=>l_index_stats);     dbms_output.put_line(l_index_stats);  end;   /  2    3    4    5    6    7  ===========================================================================                  STATISTICS FOR"MAYP"."ORATEXT_ADDRESS"===========================================================================

indexed documents:                                                      3allocated docids:4$I rows:                                                               17

---------------------------------------------------------------------------                             TOKENSTATISTICS---------------------------------------------------------------------------

unique tokens:                                                         17average $I rows per token:1.00tokens with most $I rows:  园区 (0:TEXT)                                                         1  星海 (0:TEXT)                                                         1  桐路 (0:TEXT)1  苏州市 (0:TEXT)                                                       1  苏桐 (0:TEXT)                                                         1  前街 (0:TEXT)1

...........

---------------------------------------------------------------------------                         FRAGMENTATIONSTATISTICS---------------------------------------------------------------------------

total size of $I data:                                                 93

$I rows:17estimated $I rows if optimal:                                          17estimated row fragmentation:                                          0 %

garbage docids:1estimated garbage size:                                                19

most fragmented tokens:  园区 (0:TEXT)                                                       0 %  星海 (0:TEXT)0 %  桐路 (0:TEXT)                                                       0 %  苏州市 (0:TEXT)                                                     0 %  苏桐 (0:TEXT)0 %  前街 (0:TEXT)                                                       0 %  南 (0:TEXT)                                                         0 %  路 (0:TEXT)0 %  街南 (0:TEXT)                                                       0 %  街 (0:TEXT)                                                         0 %  号 (0:TEXT)0 %  海街 (0:TEXT)                                                       0 %  观前 (0:TEXT)                                                       0 %  工业 (0:TEXT)0 %  7 (0:TEXT)                                                          0 %  28 (0:TEXT)                                                         0 %  11 (0:TEXT)0 %

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03这个索引没有碎片.

--结束

参考文档:

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/82387/viewspace-1017150/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/82387/viewspace-1017150/