Unicode Bidirectional Algorithm

Latest recommended article published 2024-06-24 09:37:10

maimang09



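This article introduces the basic principles of the Unicode Bidirectional Algorithm, including the concepts of explicit directional embeddings, overrides, and terminating codes, and explains in detail the four main phases of the basic display algorithm: separation into paragraphs, initialization, resolution of the embedding levels, and reordering.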


http://www.unicode.org/reports/tr9/tr9-27.html

When working with bidirectional text, the characters are still interpreted in logical order; only the display is affected.

The directional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The directional types associated with numbers are called weak types, and characters of those types are called weak directional characters.
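The strong/weak distinction can be inspected with Python's standard `unicodedata` module. The sketch below is only an illustration: the helper name `strength` and the two sets are assumptions introduced here, and the set members follow the grouping of Bidi_Class values in the full report (strong: L, R, AL; weak: EN, ES, ET, AN, CS, NSM, BN).

```python
import unicodedata

# Grouping of Bidi_Class values, as listed in the full report (assumed here).
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}

def strength(ch):
    """Classify one character as 'strong', 'weak', or 'other' by Bidi_Class."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    return "other"  # neutrals and explicit formatting codes

# Latin letter, Hebrew letter, Arabic letter, digit, plus sign, space
for ch in ("a", "\u05d0", "\u0627", "7", "+", " "):
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), strength(ch))
```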

Although the term embedding is used for some explicit codes, the text within the scope of the codes is not independent of the surrounding text. Characters within an embedding can affect the ordering of characters outside, and vice versa. The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation will be defined by reference to the behavior of the explicit codes in this algorithm.

2.1 Explicit Directional Embedding

Abbr.	Code	Name	Description
LRE	U+202A	LEFT-TO-RIGHT EMBEDDING	Treat the following text as embedded left-to-right.
RLE	U+202B	RIGHT-TO-LEFT EMBEDDING	Treat the following text as embedded right-to-left.

The effect of right-left line direction, for example, can be accomplished by embedding the text with RLE...PDF.
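As a minimal sketch of the RLE...PDF pattern, the helper below wraps a string in an embedding code and its terminator. The function name `embed` and its `rtl` parameter are hypothetical; only the code points come from the tables in this section.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def embed(text, rtl):
    """Wrap text in RLE...PDF (rtl=True) or LRE...PDF (rtl=False)."""
    opener = RLE if rtl else LRE
    return opener + text + PDF

wrapped = embed("abc 123", rtl=True)
# The logical order of the original characters is untouched;
# only two invisible formatting characters were added around them.
print([f"U+{ord(ch):04X}" for ch in wrapped])
```

Because the codes are ordinary characters in the text stream, they stay attached to the text when it is copied or stored.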

2.2 Explicit Directional Overrides

Abbr.	Code	Name	Description
LRO	U+202D	LEFT-TO-RIGHT OVERRIDE	Force following characters to be treated as strong left-to-right characters.
RLO	U+202E	RIGHT-TO-LEFT OVERRIDE	Force following characters to be treated as strong right-to-left characters.

The right-to-left override, for example, can be used to force a part number made of mixed English, digits and Hebrew letters to be written from right to left.

2.3 Terminating Explicit Directional Code

Abbr.	Code	Name	Description
PDF	U+202C	POP DIRECTIONAL FORMATTING	Restore the bidirectional state to what it was before the last LRE, RLE, RLO, or LRO.
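A PDF only has something to restore when an LRE, RLE, LRO, or RLO precedes it. The following sketch counts unterminated openers and unmatched PDFs in one paragraph; the function name `explicit_code_balance` is hypothetical, and the depth limit on embeddings is deliberately left out to keep the example short.

```python
import unicodedata

OPENERS = {"LRE", "RLE", "LRO", "RLO"}

def explicit_code_balance(paragraph):
    """Return (unclosed_openers, stray_pdfs) for one paragraph."""
    depth = 0   # openers that have not yet been terminated
    stray = 0   # PDFs with no opener left to terminate
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in OPENERS:
            depth += 1
        elif bidi_class == "PDF":
            if depth:
                depth -= 1
            else:
                stray += 1
    return depth, stray

print(explicit_code_balance("\u202babc\u202c"))   # balanced RLE...PDF
print(explicit_code_balance("\u202cabc\u202a"))   # PDF first, then an unclosed LRE
```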

2.4 Implicit Directional Marks

Abbr.	Code	Name	Description
LRM	U+200E	LEFT-TO-RIGHT MARK	Left-to-right zero-width character.
RLM	U+200F	RIGHT-TO-LEFT MARK	Right-to-left zero-width character.

There is no special mention of the implicit directional marks in the following algorithm. That is because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.
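This can be checked against the character database: the two marks carry the strong classes L and R, while their general category is Cf (format), which is why they take part in ordering but are not displayed. The snippet uses only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

for name, ch in (("LRM", LRM), ("RLM", RLM)):
    # Bidi_Class is a strong type; General_Category marks it as a format character.
    print(name, unicodedata.bidirectional(ch), unicodedata.category(ch))
```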

3 Basic Display Algorithm

The Unicode Bidirectional Algorithm (UBA) takes a stream of text as input and proceeds in four main phases:

Separation into paragraphs. The rest of the algorithm is applied separately to the text within each paragraph.
Initialization. A list of directional character types is initialized, with one entry for each character in the original text. The value of each entry is the Bidi_Class property of the respective character. After this point, the original characters are no longer referenced until the reordering phase. A list of embedding levels, with one level per character, is then initialized.
Resolution of the embedding levels. A series of rules are applied to the lists of embedding levels and directional character types. Each rule is based on the current values of those lists, and can modify those values. Each rule is applied to each of the values in sequence before continuing to the next rule. The result of this phase is a modified list of embedding levels; the list of directional character types is no longer needed.
Reordering. The text within each paragraph is reordered for display: first, the text in the paragraph is broken into lines, then the resolved embedding levels are used to reorder the text of each line for display.
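The first two phases can be sketched as follows. This is a simplified illustration, not the reference implementation: the function names are hypothetical, a paragraph is assumed to end after any character whose Bidi_Class is B, a CR LF pair is not treated as a single separator, and the paragraph level is passed in rather than computed.

```python
import unicodedata

def split_paragraphs(text):
    """Phase 1: end a paragraph after each character of Bidi_Class B."""
    paragraphs, current = [], []
    for ch in text:
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs

def initialize(paragraph, paragraph_level=0):
    """Phase 2: build the two parallel lists the later rules operate on."""
    types = [unicodedata.bidirectional(ch) for ch in paragraph]
    levels = [paragraph_level] * len(paragraph)
    return types, levels

for paragraph in split_paragraphs("abc 1\n\u05d0\u05d1"):
    print(initialize(paragraph))
```

The resolution phase would then rewrite `types` and `levels` rule by rule, and the reordering phase would use only the final `levels`.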

3.1 Definitions

BD1. The bidirectional character types are values assigned to each Unicode character, including unassigned characters. The formal property name in the Unicode Character Database [UCD] is Bidi_Class.

BD2. Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is level 61.

Embedding levels are explicitly set by both override format codes and by embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guarantee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings.

BD3. The default direction of the current embedding level (for the character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd.
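BD3 reduces to a parity test. The sketch below also shows how an explicit code would pick its new level: a right-to-left code moves to the least greater odd level and a left-to-right code to the least greater even level. That second part comes from the explicit-level rules of the full report, which are not quoted in this post, and all function names are hypothetical.

```python
def embedding_direction(level):
    """BD3: even levels are left-to-right, odd levels are right-to-left."""
    return "L" if level % 2 == 0 else "R"

def next_rtl_level(level):
    """Least odd level greater than the current one (for RLE/RLO)."""
    return (level + 1) | 1

def next_ltr_level(level):
    """Least even level greater than the current one (for LRE/LRO)."""
    return (level + 2) & ~1

for level in range(4):
    print(level, embedding_direction(level),
          next_rtl_level(level), next_ltr_level(level))
```

A new level computed this way would only be accepted if it does not exceed the maximum explicit depth of 61 given in BD2.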

For example, in a particular piece of text, Level 0 is plain English text. Level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embedding level will become clear when the reordering algorithm is discussed.
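To make the role of the levels concrete, here is a sketch of the reversal step used during reordering (rule L2 in the full report, which is not quoted in this post): from the highest level down to the lowest odd level, every contiguous run of characters at that level or higher is reversed. The function name `reorder_line` is hypothetical, and the levels are supplied by hand rather than resolved by the algorithm.

```python
def reorder_line(chars, levels):
    """Reverse runs at each level, from the highest down to the lowest odd one."""
    chars = list(chars)
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return "".join(chars)  # purely left-to-right line: nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])  # reverse one maximal run
                i = j
            else:
                i += 1
    return "".join(chars)

# Uppercase letters stand in for right-to-left characters at level 1.
print(reorder_line("abcXYZ", [0, 0, 0, 1, 1, 1]))
# Level 2 text nested inside level 1 text.
print(reorder_line("ABcdE", [1, 1, 2, 2, 1]))
```

In the second call the level 2 run is reversed first and then reversed again as part of the enclosing level 1 run, so it reads left to right inside the right-to-left text.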

BD4. The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph.

BD5. The direction of the paragraph embedding level is called the paragraph direction.

In some contexts the paragraph direction is also known as the base direction.
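When no higher-level protocol supplies the base direction, the full report derives the paragraph embedding level from the first strong character (rules P2 and P3, not quoted in this post). A minimal sketch of that default, with a hypothetical function name:

```python
import unicodedata

def paragraph_level(paragraph):
    """Return 1 if the first strong character is R or AL, otherwise 0."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return 0
        if bidi_class in ("R", "AL"):
            return 1
    return 0  # no strong character found: default to left-to-right

print(paragraph_level("abc \u05d0"))   # starts with a Latin letter
print(paragraph_level("123 \u0627b"))  # digits are weak, the Arabic letter decides
```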

BD6. The directional override status determines whether the bidirectional type of characters is to be reset. The override status is set by using explicit directional controls. This status has three states, as shown in Table 2.

Table 2. Directional Override Status

Status	Interpretation
Neutral	No override is currently active
Right-to-left	Characters are to be reset to R
Left-to-right	Characters are to be reset to L
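The interaction between embedding levels and the override status can be sketched with a small stack. This is a simplified illustration under stated assumptions, not the report's exact rules: the function name and the `ignored` counter for openers beyond level 61 are introduced here, paragraph separators are not handled, and the explicit codes are simply dropped from the result.

```python
import unicodedata

MAX_LEVEL = 61  # maximum explicit depth stated in BD2
OPENERS = {"LRE", "RLE", "LRO", "RLO"}
STATUS_FOR = {"RLO": "right-to-left", "LRO": "left-to-right"}

def apply_explicit_codes(paragraph, paragraph_level=0):
    """Return (char, level, type) triples with explicit codes removed."""
    level, status = paragraph_level, "neutral"
    stack = []    # remembered (level, status) pairs
    ignored = 0   # openers skipped because they would exceed MAX_LEVEL
    resolved = []
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in OPENERS:
            if bidi_class in ("RLE", "RLO"):
                new_level = (level + 1) | 1    # least greater odd level
            else:
                new_level = (level + 2) & ~1   # least greater even level
            if ignored == 0 and new_level <= MAX_LEVEL:
                stack.append((level, status))
                level = new_level
                status = STATUS_FOR.get(bidi_class, "neutral")
            else:
                ignored += 1
        elif bidi_class == "PDF":
            if ignored:
                ignored -= 1
            elif stack:
                level, status = stack.pop()  # restore the previous state
        else:
            if status == "right-to-left" and bidi_class != "BN":
                bidi_class = "R"
            elif status == "left-to-right" and bidi_class != "BN":
                bidi_class = "L"
            resolved.append((ch, level, bidi_class))
    return resolved

# A part number of mixed Latin letters, digits, and a Hebrew letter under RLO.
print(apply_explicit_codes("\u202eAB12\u05d0\u202c!"))
```

Inside the RLO...PDF span every character is reset to R at level 1; the exclamation mark after the PDF keeps its own type at the paragraph level.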

BD7. A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level (a level run is also known as a directional run).
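Given a list of resolved embedding levels, the level runs of BD7 are the maximal groups of equal adjacent values. A short sketch using `itertools.groupby`; the function name and the `(start, end, level)` tuple layout are choices made for this example.

```python
from itertools import groupby

def level_runs(levels):
    """Return (start, end, level) for each maximal run; end is exclusive."""
    runs = []
    start = 0
    for level, group in groupby(levels):
        length = sum(1 for _ in group)
        runs.append((start, start + length, level))
        start += length
    return runs

print(level_runs([0, 0, 1, 1, 1, 0]))
```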

5.1 Reference Code

There are two versions of BIDI reference code available. Both have been tested to produce identical results. One version is written in Java, and the other is written in C++. The Java version is designed to closely follow the steps of the algorithm as described below. The C++ code is designed to show one of the optimization methods that can be applied to the algorithm, using a state table for one phase.

One of the most effective optimizations is to first test for right-to-left characters and not invoke the Bidirectional Algorithm unless they are present.
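A minimal sketch of that pre-check, assuming a left-to-right paragraph direction: scan for any character whose Bidi_Class could introduce a right-to-left level, and skip the full algorithm when none is found. The function name and the exact set of classes tested are assumptions; the set is chosen conservatively.

```python
import unicodedata

# Classes that can introduce right-to-left levels (a conservative, assumed set).
RTL_CLASSES = frozenset({"R", "AL", "AN", "RLE", "RLO"})

def needs_bidi(text):
    """Return True if the text contains any right-to-left character or code."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

for sample in ("hello 123", "abc \u05d0", "x\u202ey"):
    print(repr(sample), needs_bidi(sample))
```

The scan is a single linear pass, so for text that is entirely left-to-right it replaces the whole resolution and reordering work with one table lookup per character.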

There are two directories containing source code for reference implementations at [Code9]. Implementers are encouraged to use this resource to test their implementations. There is an online demo of bidi code at http://unicode.org/cldr/utility/bidi.jsp, which shows the results, plus the levels and the rules invoked for each character.