Storing Documents in Multiple Languages in LOB Data Types

最新推荐文章于 2025-11-24 15:52:13 发布

原创最新推荐文章于 2025-11-24 15:52:13 发布 · 109 阅读

1 ·

CC 4.0 BY-SA版权

文章标签：

#数据库

Oracle 专栏收录该内容

193 篇文章

订阅专栏

本文介绍了OracleText如何支持在CLOB、NCLOB和BLOB数据类型中存储多语言文档，并进行内容搜索。CLOB依赖数据库字符集，NCLOB独立于字符集但不支持内容搜索，BLOB则不转换数据，适合存储二进制文档。创建多语种索引时，需建立特定语言的词法器并指定语言列。对于CLOB和BLOB类型的文档，可以创建相应的索引来实现内容搜索。

You can store documents in multiple languages in CLOB, NCLOB, or BLOB data types and set up Oracle Text to enable content search for the documents.

Data in CLOB columns is stored in the AL16UTF16 character set when the database character set is multibyte, such as UTF8 or AL32UTF8. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document in a CLOB column requires less storage space than the same document in a LONG column using AL32UTF8, typically around 30% less, depending on the contents of the document.

Documents in NCLOB format are also stored in the AL16UTF16 character set regardless of the database character set or national character set. The storage space requirement is the same as for CLOB data. Document contents are converted to UTF-16 when they are inserted into a NCLOB column. If you want to store multilingual documents in a non-Unicode database, then choose NCLOB. However, content search on NCLOB with Oracle Text is not supported.

Documents in BLOB format are stored as they are. No data conversion occurs during insertion and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation functions (such as NLS_SORT and ORDER BY) cannot be applied to the BLOB data type.

The following table lists the advantages and disadvantages of the CLOB, NCLOB, and BLOB data types when storing documents:

Table 6-3 Comparison of LOB Data Types for Document Storage

Data Types	Advantages	Disadvantages
CLOB	Content search support with Oracle Text String manipulation support	Depends on database character set Data conversion is necessary for insertion Cannot store binary documents
NCLOB	Independent of database character set String manipulation support	No content search support Data conversion is necessary for insertion Cannot store binary documents
BLOB	Independent of database character set Content search support No data conversion, data stored as is Can store binary documents such as Microsoft Word or Microsoft Excel	No string manipulation support

Creating Indexes for Searching Multilingual Document Contents

Oracle Text enables you to build indexes for content search on multilingual documents stored in CLOB format and BLOB format. It uses a language-specific lexer to parse the CLOB or BLOB data and produces a list of searchable keywords.

Create a multilexer to search multilingual documents. The multilexer chooses a language-specific lexer for each row, based on a language column. This section describes the high level steps to create indexes for documents in multiple languages. It contains the following topics:

Creating Multilexers

The first step in creating the multilexer is the creation of language-specific lexer preferences for each language supported. The following example creates English, German, and Japanese lexers with PL/SQL procedures:

ctx_ddl.create_preference('english_lexer', 'basic_lexer');

ctx_ddl.set_attribute('english_lexer','index_themes','yes');

ctx_ddl.create_preference('german_lexer', 'basic_lexer');

ctx_ddl.set_attribute('german_lexer','composite','german');

ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');

ctx_ddl.set_attribute('german_lexer','mixed_case','yes');

ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');

After the language-specific lexer preferences are created, they need to be gathered together under a single multilexer preference. First, create the multilexer preference, using the MULTI_LEXER object:

ctx_ddl.create_preference('global_lexer','multi_lexer');

Now add the language-specific lexers to the multilexer preference using the add_sub_lexer call:

ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');

ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');

ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');

This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle Japanese documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.

Creating Indexes for Documents Stored in the CLOB Data Type

The multilexer decides which lexer to use for each row based on a language column in the table. This is a character column that stores the language of the document in a text column. Use the Oracle language name to identify the language of a document in this column. For example, if you use the CLOB data type to store your documents, then add the language column to the table where the documents are stored:

CREATE TABLE globaldoc

(doc_id NUMBER PRIMARY KEY,

language VARCHAR2(30),

text CLOB);

To create an index for this table, use the multilexer preference and specify the name of the language column:

CREATE INDEX globalx ON globaldoc(text)

indextype IS ctxsys.context

parameters ('lexer

global_lexer

language

column

language');

Creating Indexes for Documents Stored in the BLOB Data Type

In addition to the language column, the character set and format columns must be added in the table where the documents are stored. The character set column stores the character set of the documents using the Oracle character set names. The format column specifies whether a document is a text or binary document. For example, the CREATE TABLE statement can specify columns called characterset and format:

CREATE TABLE globaldoc (

doc_id NUMBER PRIMARY KEY,

language VARCHAR2(30),

characterset VARCHAR2(30),

format VARCHAR2(10),

text BLOB

);

You can put word-processing or spreadsheet documents into the table and specify binary in the format column. For documents in HTML, XML and text format, you can put them into the table and specify text in the format column.

Because there is a column in which to specify the character set, you can store text documents in different character sets.

When you create the index, specify the names of the format and character set columns:

CREATE INDEX globalx ON globaldoc(text)

indextype is ctxsys.context

parameters ('filter inso_filter

lexer global_lexer

language column language

format column format

charset column characterset');

You can use the charset_filter if all documents are in text format. The charset_filter converts data from the character set specified in the charset column to the database character set.