Unicode Bidirectional Algorithm

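This article introduces the basic principles of the Unicode Bidirectional Algorithm, including the concepts of explicit directional embedding, override, and terminating codes, and explains the four main phases of the basic display algorithm: separation into paragraphs, initialization, resolution of the embedding levels, and reordering.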


http://www.unicode.org/reports/tr9/tr9-27.html


When working with bidirectional text, the characters are still interpreted in logical order; only the display is affected.


The directional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The directional types associated with numbers are called weak types, and characters of those types are called weak directional characters.
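As a rough illustration, Python's standard `unicodedata.bidirectional` function returns a character's bidirectional class, which is enough to tell strong characters from weak ones. The `STRONG` and `WEAK` groupings below are an assumption made for this sketch (they leave out the explicit formatting codes), and `strength` is a hypothetical helper name.

```python
import unicodedata

# Assumed grouping for this sketch; explicit formatting codes are left out.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}

def strength(ch: str) -> str:
    """Classify one character as 'strong', 'weak', or 'other'."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    return "other"  # neutrals, explicit codes, unassigned code points

if __name__ == "__main__":
    for ch in ["a", "\u05d0", "\u0627", "1", " "]:
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), strength(ch))
```

Latin and Hebrew letters come out as strong, European and Arabic-Indic digits as weak, and whitespace falls into the remaining group.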


Although the term embedding is used for some explicit codes, the text within the scope of the codes is not independent of the surrounding text. Characters within an embedding can affect the ordering of characters outside, and vice versa. The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation will be defined by reference to the behavior of the explicit codes in this algorithm.



2.1 Explicit Directional Embedding

Abbr.  Code    Name                     Description
LRE    U+202A  LEFT-TO-RIGHT EMBEDDING  Treat the following text as embedded left-to-right.
RLE    U+202B  RIGHT-TO-LEFT EMBEDDING  Treat the following text as embedded right-to-left.

The effect of right-to-left line direction, for example, can be accomplished by embedding the text with RLE...PDF.


2.2 Explicit Directional Overrides

Abbr.  Code    Name                    Description
LRO    U+202D  LEFT-TO-RIGHT OVERRIDE  Force following characters to be treated as strong left-to-right characters.
RLO    U+202E  RIGHT-TO-LEFT OVERRIDE  Force following characters to be treated as strong right-to-left characters.


The right-to-left override, for example, can be used to force a part number made of mixed English, digits and Hebrew letters to be written from right to left.
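A minimal sketch of the part-number case: the text is wrapped in RLO (U+202E) and closed with PDF (U+202C). The helper name `force_rtl` and the sample part number are hypothetical, and how the result is drawn depends on the renderer that processes the string.

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def force_rtl(text: str) -> str:
    """Wrap text so every character in it is treated as strong right-to-left."""
    return f"{RLO}{text}{PDF}"

if __name__ == "__main__":
    part_number = "AB12\u05d0\u05d1"  # hypothetical mix of English, digits, Hebrew
    wrapped = force_rtl(part_number)
    # Print the code points so the invisible format codes are visible.
    print(" ".join(f"U+{ord(ch):04X}" for ch in wrapped))
    print(unicodedata.name(wrapped[0]), "/", unicodedata.name(wrapped[-1]))
```

The wrapped string stays in logical order in memory; only its display order is affected.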



2.3 Terminating Explicit Directional Code

Abbr.  Code    Name                        Description
PDF    U+202C  POP DIRECTIONAL FORMATTING  Restore the bidirectional state to what it was before the last LRE, RLE, RLO, or LRO.
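The "restore the state" behavior of PDF can be pictured as a stack: each LRE, RLE, LRO, or RLO pushes an entry and each PDF pops the most recent one. The sketch below only tracks which codes are still open. It does not compute embedding levels or enforce a depth limit, and ignoring an unmatched PDF is an assumption of this sketch. The name `open_codes_after` is hypothetical.

```python
LRE, RLE, PDF, LRO, RLO = "\u202A", "\u202B", "\u202C", "\u202D", "\u202E"
OPENERS = {LRE: "LRE", RLE: "RLE", LRO: "LRO", RLO: "RLO"}

def open_codes_after(text: str) -> list[str]:
    """Return the explicit codes still open at the end of text, outermost first."""
    stack: list[str] = []
    for ch in text:
        if ch in OPENERS:
            stack.append(OPENERS[ch])  # enter a new embedding or override
        elif ch == PDF and stack:
            stack.pop()                # restore the state before the last opener
        # a PDF with nothing to pop is simply skipped in this sketch
    return stack

if __name__ == "__main__":
    sample = RLE + "abc" + LRO + "def" + PDF + "ghi"
    print(open_codes_after(sample))        # the RLE is still open
    print(open_codes_after(sample + PDF))  # everything closed
```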



2.4 Implicit Directional Marks



Abbr.  Code    Name                Description
LRM    U+200E  LEFT-TO-RIGHT MARK  Left-to-right zero-width character
RLM    U+200F  RIGHT-TO-LEFT MARK  Right-to-left zero-width character


There is no special mention of the implicit directional marks in the following algorithm. That is because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.
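This can be checked against the character database that ships with Python: the marks carry the same bidirectional class as ordinary strong characters, while their general category is `Cf` (format), which is why they are not displayed. The helper `describe` is a hypothetical name used only for this sketch.

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK

def describe(ch: str) -> tuple[str, str, str]:
    """Return (name, bidirectional class, general category) for one character."""
    return (unicodedata.name(ch),
            unicodedata.bidirectional(ch),
            unicodedata.category(ch))

if __name__ == "__main__":
    for ch in (LRM, "a", RLM, "\u05d0"):
        print(f"U+{ord(ch):04X}", describe(ch))
```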



3 Basic Display Algorithm

The Unicode Bidirectional Algorithm (UBA) takes a stream of text as input and proceeds in four main phases:

  • Separation into paragraphs. The rest of the algorithm is applied separately to the text within each paragraph.
  • Initialization. A list of directional character types is initialized, with one entry for each character in the original text. The value of each entry is the Bidi_Class property of the respective character. After this point, the original characters are no longer referenced until the reordering phase. A list of embedding levels, with one level per character, is then initialized.
  • Resolution of the embedding levels. A series of rules are applied to the lists of embedding levels and directional character types. Each rule is based on the current values of those lists, and can modify those values. Each rule is applied to each of the values in sequence before continuing to the next rule. The result of this phase is a modified list of embedding levels; the list of directional character types is no longer needed.
  • Reordering. The text within each paragraph is reordered for display: first, the text in the paragraph is broken into lines, then the resolved embedding levels are used to reorder the text of each line for display.
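The first two phases can be sketched in a few lines of Python. This is a simplified illustration, not the reference implementation: it assumes that a paragraph ends at any character whose bidirectional class is `B` and that the separator stays with the preceding paragraph, it treats CR and LF as two separate separators, and it initializes every level to a caller-supplied paragraph level. The function names are hypothetical. Note that `unicodedata.bidirectional` returns an empty string for unassigned code points, which this sketch does not handle.

```python
import unicodedata

def split_paragraphs(text: str) -> list[str]:
    """Phase 1: split at paragraph separators (class B), keeping each separator."""
    paragraphs, start = [], 0
    for i, ch in enumerate(text):
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        paragraphs.append(text[start:])  # trailing text without a separator
    return paragraphs

def initialize(paragraph: str, paragraph_level: int = 0):
    """Phase 2: build the parallel lists of character types and embedding levels."""
    types = [unicodedata.bidirectional(ch) for ch in paragraph]
    levels = [paragraph_level] * len(paragraph)
    return types, levels

if __name__ == "__main__":
    for para in split_paragraphs("ab 12\n\u05d0\u05d1"):
        print(initialize(para))
```

From this point on, the resolution rules operate only on the two lists; the characters themselves are needed again only for reordering.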

 

3.1 Definitions

BD1. The bidirectional character types are values assigned to each Unicode character, including unassigned characters. The formal property name in the Unicode Character Database [UCD] is Bidi_Class.

BD2. Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is level 61.

Embedding levels are explicitly set by both override format codes and by embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guarantee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings.

BD3. The default direction of the current embedding level (for the character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd.

For example, in a particular piece of text, Level 0 is plain English text. Level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embedding level will become clear when the reordering algorithm is discussed, but the following provides an example of how the algorithm works.
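The parity rule in BD3 reduces to a one-line check. The function name `embedding_direction` is hypothetical.

```python
def embedding_direction(level: int) -> str:
    """Return 'L' for even embedding levels and 'R' for odd ones (BD3)."""
    if level < 0:
        raise ValueError("embedding levels start at zero")
    return "L" if level % 2 == 0 else "R"

if __name__ == "__main__":
    # Levels 0, 1, 2 match the English / Arabic / English nesting described above.
    for level in range(3):
        print(level, embedding_direction(level))
```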

BD4. The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph.

BD5. The direction of the paragraph embedding level is called the paragraph direction.

  • In some contexts the paragraph direction is also known as the base direction.

BD6. The directional override status determines whether the bidirectional type of characters is to be reset. The override status is set by using explicit directional controls. This status has three states, as shown in Table 2.

Table 2. Directional Override Status

Status         Interpretation
Neutral        No override is currently active
Right-to-left  Characters are to be reset to R
Left-to-right  Characters are to be reset to L
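One way to model the three states is a small enumeration whose value is the type that characters are reset to. This is only a sketch of the idea in Table 2. The names `OverrideStatus` and `apply_override` are hypothetical and do not come from the reference code.

```python
from enum import Enum

class OverrideStatus(Enum):
    NEUTRAL = None        # no override is currently active
    RIGHT_TO_LEFT = "R"   # characters are to be reset to R
    LEFT_TO_RIGHT = "L"   # characters are to be reset to L

def apply_override(bidi_class: str, status: OverrideStatus) -> str:
    """Return the character type after taking the override status into account."""
    if status is OverrideStatus.NEUTRAL:
        return bidi_class
    return status.value

if __name__ == "__main__":
    for status in OverrideStatus:
        print(status.name, [apply_override(t, status) for t in ("L", "EN", "R")])
```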

BD7. A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level (a level run is also known as a directional run).
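Given a list of resolved levels, the level runs can be found by grouping consecutive equal values. The sketch below returns each run as a half-open index range plus its level; the function name and the tuple layout are choices made for this illustration.

```python
from itertools import groupby

def level_runs(levels: list[int]) -> list[tuple[int, int, int]]:
    """Return (start, end, level) for each maximal run; end is exclusive."""
    runs, start = [], 0
    for level, group in groupby(levels):
        length = sum(1 for _ in group)  # size of this run of equal levels
        runs.append((start, start + length, level))
        start += length
    return runs

if __name__ == "__main__":
    print(level_runs([0, 0, 1, 1, 1, 0]))
```

Because `groupby` only merges adjacent equal values, each returned run is maximal in the sense of BD7: its neighbors always have a different level.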

 

 

 

5.1 Reference Code

There are two versions of BIDI reference code available. Both have been tested to produce identical results. One version is written in Java, and the other is written in C++. The Java version is designed to closely follow the steps of the algorithm as described below. The C++ code is designed to show one of the optimization methods that can be applied to the algorithm, using a state table for one phase.

One of the most effective optimizations is to first test for right-to-left characters and not invoke the Bidirectional Algorithm unless they are present.
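A minimal sketch of that pre-check, assuming a left-to-right paragraph direction: scan the text once and run the full algorithm only if something right-to-left turns up. Which classes count as "right-to-left" here (`R`, `AL`, `AN`, `RLE`, `RLO`) is an assumption of this sketch, and `needs_bidi` is a hypothetical name.

```python
import unicodedata

# Assumed trigger set: strong RTL letters, Arabic numbers, and explicit RTL codes.
RTL_CLASSES = {"R", "AL", "AN", "RLE", "RLO"}

def needs_bidi(text: str) -> bool:
    """Return True if the text contains any character that may require reordering."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

if __name__ == "__main__":
    for sample in ["hello 123", "abc \u05d0\u05d1\u05d2", "price: \u0661\u0662"]:
        print(repr(sample), needs_bidi(sample))
```

Text that fails the check can be displayed in logical order directly, which avoids building the type and level lists for the common all-left-to-right case.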

There are two directories containing source code for reference implementations at [Code9]. Implementers are encouraged to use this resource to test their implementations. There is an online demo of bidi code at http://unicode.org/cldr/utility/bidi.jsp, which shows the results, plus the levels and the rules invoked for each character.


 
