Storing Documents in Multiple Languages in LOB Data Types

本文介绍了OracleText如何支持在CLOB、NCLOB和BLOB数据类型中存储多语言文档,并进行内容搜索。CLOB依赖数据库字符集,NCLOB独立于字符集但不支持内容搜索,BLOB则不转换数据,适合存储二进制文档。创建多语种索引时,需建立特定语言的词法器并指定语言列。对于CLOB和BLOB类型的文档,可以创建相应的索引来实现内容搜索。

You can store documents in multiple languages in CLOBNCLOB, or BLOB data types and set up Oracle Text to enable content search for the documents.

Data in CLOB columns is stored in the AL16UTF16 character set when the database character set is multibyte, such as UTF8 or AL32UTF8. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document in a CLOB column requires less storage space than the same document in a LONG column using AL32UTF8, typically around 30% less, depending on the contents of the document.

Documents in NCLOB format are also stored in the AL16UTF16 character set regardless of the database character set or national character set. The storage space requirement is the same as for CLOB data. Document contents are converted to UTF-16 when they are inserted into a NCLOB column. If you want to store multilingual documents in a non-Unicode database, then choose NCLOB. However, content search on NCLOB with Oracle Text is not supported.

Documents in BLOB format are stored as they are. No data conversion occurs during insertion and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation functions (such as NLS_SORT and ORDER BY) cannot be applied to the BLOB data type.

The following table lists the advantages and disadvantages of the CLOBNCLOB, and BLOB data types when storing documents:

Table 6-3 Comparison of LOB Data Types for Document Storage

Data Types

Advantages

Disadvantages

CLOB

Content search support with Oracle Text

String manipulation support

Depends on database character set

Data conversion is necessary for insertion

Cannot store binary documents

NCLOB

Independent of database character set

String manipulation support

No content search support

Data conversion is necessary for insertion

Cannot store binary documents

BLOB

Independent of database character set

Content search support

No data conversion, data stored as is

Can store binary documents such as Microsoft Word or Microsoft Excel

No string manipulation support

Creating Indexes for Searching Multilingual Document Contents

Oracle Text enables you to build indexes for content search on multilingual documents stored in CLOB format and BLOB format. It uses a language-specific lexer to parse the CLOB or BLOB data and produces a list of searchable keywords.

Create a multilexer to search multilingual documents. The multilexer chooses a language-specific lexer for each row, based on a language column. This section describes the high level steps to create indexes for documents in multiple languages. It contains the following topics:

  1. Creating Multilexers

The first step in creating the multilexer is the creation of language-specific lexer preferences for each language supported. The following example creates English, German, and Japanese lexers with PL/SQL procedures:

ctx_ddl.create_preference('english_lexer', 'basic_lexer');

ctx_ddl.set_attribute('english_lexer','index_themes','yes');

ctx_ddl.create_preference('german_lexer', 'basic_lexer');

ctx_ddl.set_attribute('german_lexer','composite','german');

ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');

ctx_ddl.set_attribute('german_lexer','mixed_case','yes');

ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');

After the language-specific lexer preferences are created, they need to be gathered together under a single multilexer preference. First, create the multilexer preference, using the MULTI_LEXER object:

ctx_ddl.create_preference('global_lexer','multi_lexer');

Now add the language-specific lexers to the multilexer preference using the add_sub_lexer call:

ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');

ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');

ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');

This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle Japanese documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.

  1. Creating Indexes for Documents Stored in the CLOB Data Type

The multilexer decides which lexer to use for each row based on a language column in the table. This is a character column that stores the language of the document in a text column. Use the Oracle language name to identify the language of a document in this column. For example, if you use the CLOB data type to store your documents, then add the language column to the table where the documents are stored:

CREATE TABLE globaldoc

  (doc_id    NUMBER       PRIMARY KEY,

  language  VARCHAR2(30),

  text      CLOB);

To create an index for this table, use the multilexer preference and specify the name of the language column:

CREATE INDEX globalx ON globaldoc(text)

  indextype IS ctxsys.context

  parameters ('lexer

               global_lexer

               language

               column

               language');

  1. Creating Indexes for Documents Stored in the BLOB Data Type

In addition to the language column, the character set and format columns must be added in the table where the documents are stored. The character set column stores the character set of the documents using the Oracle character set names. The format column specifies whether a document is a text or binary document. For example, the CREATE TABLE statement can specify columns called characterset and format:

CREATE TABLE  globaldoc (

   doc_id       NUMBER       PRIMARY KEY,

   language     VARCHAR2(30),

   characterset VARCHAR2(30),

   format       VARCHAR2(10),

   text         BLOB

  );

You can put word-processing or spreadsheet documents into the table and specify binary in the format column. For documents in HTML, XML and text format, you can put them into the table and specify text in the format column.

Because there is a column in which to specify the character set, you can store text documents in different character sets.

When you create the index, specify the names of the format and character set columns:

CREATE INDEX globalx ON globaldoc(text)

  indextype is ctxsys.context

  parameters ('filter inso_filter

                lexer global_lexer

                language column language

                format  column format

                charset column characterset');

You can use the charset_filter if all documents are in text format. The charset_filter converts data from the character set specified in the charset column to the database character set.

在ABAP开发中,SAP内存(SAP Memory)是一种用于在不同程序或会话之间共享数据的全局内存区域。当尝试在SAP内存中存储大量数据时,可能会遇到“分页溢出”(Paging Overflow)错误。该错误通常表示SAP内存的容量限制被超出,导致无法继续写入数据[^1]。 SAP内存的大小是有限的,默认情况下,每个SPA/GPA参数的最大数据长度为64KB。当尝试存储的数据超过该限制,或者在多个参数之间累计占用内存超出系统允许的上限时,就会触发分页溢出错误[^1]。 ### 解决方案与建议 1. **优化数据存储结构**: 避免将大量数据一次性写入SAP内存。可以考虑仅存储关键标识符(如订单号、客户编号等),并在需要时通过数据库查询获取完整数据,而不是将全部数据缓存在SAP内存中。 2. **使用ABAP内存(ABAP Memory)代替SAP内存**: 如果数据仅需在同一个内部会话(Internal Session)中的程序之间共享,可以使用`EXPORT TO MEMORY`和`IMPORT FROM MEMORY`语句操作ABAP内存。ABAP内存的容量通常比SAP内存更大,适合处理较大的数据量。 3. **检查SPA/GPA参数使用情况**: 确保没有不必要的SPA/GPA参数被设置。使用`SET PARAMETER`时应仅保留必要的数据,并在数据使用完毕后及时清除,例如使用`DELETE PARAMETER`语句。 4. **增加SAP内存大小(需谨慎)**: 在某些情况下,可以通过调整SAP系统的参数(如`ztta/roll_extension`或`rdisp/roll_max_offheap`)来增加内存分配,但这应由系统管理员在性能评估后进行,并非推荐的常规做法。 5. **使用替代机制传递数据**: 对于跨程序或跨事务的数据传递,可以考虑使用数据库表、临时表(如`GT`表)、共享内存(Shared Memory)或全局类变量(Global Class Attributes)等方法,以避免SAP内存的限制。 --- ### 示例代码:使用ABAP内存代替SAP内存 ```abap * 存储数据到ABAP内存 DATA: lv_data TYPE string VALUE 'Some large data to store'. EXPORT data = lv_data TO MEMORY ID 'MYKEY'. * 从ABAP内存读取数据 DATA: lv_result TYPE string. IMPORT data = lv_result FROM MEMORY ID 'MYKEY'. WRITE: / 'Data from memory:', lv_result. ``` --- ### 注意事项 - 在使用SAP内存时,应始终确保数据的及时清理,以避免内存泄漏和溢出问题。 - 如果问题频繁发生,建议通过事务码`ST22`查看DUMP信息,分析具体出错的程序位置和上下文,以便进一步优化代码逻辑或数据处理方式[^2]。 ---
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值