Automatic conversion between simplified and traditional Chinese - Meta

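Simplified-traditional Chinese conversion
This article describes the automatic conversion system between simplified and traditional Chinese, which has been deployed on several projects including the Chinese Wikipedia. It covers the background and the implementation, and mentions possible applications to conversion systems for other languages.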

From Meta, a Wikimedia project coordination wiki





This page is outdated, but if it were updated, it may still be useful. Please help by correcting, augmenting and revising the text into an up-to-date form.

This page describes the writing system conversion system currently implemented at the Chinese Wikipedia (zh). It has been running since December 23, 2004[1] and has generally been well received by the community there. There has been some interest in implementing similar conversion systems for other languages[citation needed], and it is hoped that this page can shed some light on how to use the infrastructure that was developed as a by-product of the Chinese conversion system.

For an essay expounding on some of the problems with the current conversion system, see Problems with server-side Chinese to Chinese translation in MediaWiki.

In addition to the Chinese Wikipedia, the Chinese Wiktionary, Wikiquote, and Wikibooks also use the conversion system. While Multilingual Wikisource hosted texts in all languages without a conversion system, it was unclear how simplified and traditional Chinese texts should be handled there. The problem was resolved in 2005, when Chinese Wikisource opened as part of the move to language subdomains.

Multilingual sites such as Meta, Wikimedia Commons, and Multilingual Wikisource have no automatic conversion between simplified and traditional Chinese. Multilingual Wikisource no longer accepts Chinese texts, which have to be posted on Chinese Wikisource instead.


Background

Chinese speakers today use two different writing systems: simplified and traditional. Traditional Chinese characters are used primarily in the Republic of China (Taiwan), Hong Kong, Macau, and the Chinese diaspora in North America. Simplified Chinese is used in Mainland China and Singapore. In Malaysia, where Chinese has no official status, written Chinese is commonly simplified as well.

[Table: Sample differences of Chinese writing systems; columns: Traditional, Simplified]

Automatic conversion between them is critical for the future of the Chinese Wikipedia. Existing conversion tools such as libiconv convert characters between two different encodings (such as GB and BIG5). The problem for Chinese is different: it is a mapping between characters within the single Unicode character set, which includes both simplified and traditional Chinese characters. For most simplified and traditional Chinese characters there exists a one-to-one mapping, but about 100 pairs of simplified and traditional characters are not one-to-one.[citation needed] In addition, because Chinese sentences have no spaces between adjacent characters, the text cannot easily be split into words, which makes parsing more complex.
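The one-to-many problem can be illustrated with a small sketch. The simplified character 发 corresponds to both 發 (as in 发现, "discover") and 髮 (as in 头发, "hair"), so a character table alone cannot choose correctly. The sketch below assumes a tiny hypothetical phrase table and relies on PHP's strtr(), which, when given an array, tries the longest keys first and does not search text it has already replaced. The table contents and the function name toTraditional are illustrative assumptions, not the tables or code used on the Chinese Wikipedia.

```php
<?php
// Hypothetical, deliberately tiny simplified-to-traditional table.
// Multi-character entries disambiguate characters that have more than one
// traditional counterpart; single-character entries act as the fallback.
$zh2Hant = array(
    '头发' => '頭髮', // "hair": 发 becomes 髮
    '发现' => '發現', // "discover": 发 becomes 發
    '发'   => '發',   // fallback for 发 (wrong for "hair" on its own)
    '头'   => '頭',
    '现'   => '現',
);

// strtr() with an array replaces the longest matching key at each position
// and never re-translates its own output, so no word segmentation is needed.
function toTraditional($text, array $table) {
    return strtr($text, $table);
}

echo toTraditional('我发现头发', $zh2Hant), "\n";
```

Without the two phrase entries, the fallback entry would turn 头发 into 頭發, which is the kind of error a plain character-to-character table cannot avoid.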

A fully automatic conversion would require some form of artificial intelligence and would be difficult to achieve. Instead, a new kind of markup can be introduced to handle the cases that automatic conversion gets wrong.

Implementing similar conversion systems for other languages

In MediaWiki 1.4

Most conversion-related functions are implemented in the language object.[citation needed] For Chinese, this is in LanguageZh.php. At a minimum, the language object should override the following methods[citation needed] to support multiple variants:

  • getVariants()
function getVariants() {
        return array('zh', 'zh-cn', 'zh-tw', 'zh-sg', 'zh-hk');
}

This function returns an array of the language variants that are supported. A language file should exist for each supported variant.

  • getPreferredVariant()
function getPreferredVariant()

This function should return the user's preferred variant.

  • convert()
function convert( $text, $isTitle=false )

This function implements the actual conversion of the text. The flag $isTitle signals whether the input text is the article title; in Chinese, titles are sometimes converted slightly differently. This function is called near the end of the parser (Parser.php).[citation needed] Anything that is parsed by the parser will be converted. Note that the parsing of the manual markup -{}- is also done in this function. This should probably be split out so that other languages can reuse it easily.[citation needed]
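As a rough sketch of how the manual markup could be separated from the automatic conversion so that other languages can reuse it, consider the code below. It assumes two forms of the markup: -{text}-, which leaves text unconverted, and -{zh-cn:A; zh-tw:B}-, which selects the text for the current variant. The function names pickVariantText and convertWithMarkup, the callback-based design, and the simplified rule parsing are assumptions made for illustration; this is not the code in LanguageZh.php.

```php
<?php
// Pick the text for $variant from "code:text; code:text" rules.
// Returns null when the content is plain text or has no rule for $variant.
function pickVariantText($content, $variant) {
    $rules = array();
    foreach (explode(';', $content) as $part) {
        if (trim($part) === '') {
            continue; // tolerate a trailing semicolon
        }
        $pair = explode(':', $part, 2);
        if (count($pair) !== 2) {
            return null; // plain text, not a rule list
        }
        $rules[trim($pair[0])] = trim($pair[1]);
    }
    return isset($rules[$variant]) ? $rules[$variant] : null;
}

// Convert $text with the $autoConvert callback, except inside -{ ... }-.
function convertWithMarkup($text, $variant, $autoConvert) {
    // With PREG_SPLIT_DELIM_CAPTURE, odd-indexed pieces are markup contents.
    $pieces = preg_split('/-\{(.*?)\}-/su', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 0) {
            $out .= call_user_func($autoConvert, $piece); // ordinary text
        } else {
            $chosen = pickVariantText($piece, $variant);
            $out .= ($chosen !== null) ? $chosen : $piece; // keep as written
        }
    }
    return $out;
}

// strtoupper stands in for a real converter so the effect is easy to see.
echo convertWithMarkup('abc -{def}- -{zh-cn:x; zh-tw:y}-', 'zh-tw', 'strtoupper'), "\n";
```

Because the converter is passed in as a callback, the markup handling itself contains nothing specific to Chinese.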

  • getExtraHashOptions()
function getExtraHashOptions() {
        $variant = $this->getPreferredVariant();
        return '!' . $variant;
}

This function returns a string identifying the user's preferred language variant. It is used for caching purposes, so that pages rendered for different variants can be told apart.
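To see why the variant has to be part of the cache key, consider the sketch below. It assumes a hypothetical in-memory cache keyed by the page title plus the string returned by getExtraHashOptions(); the renderCached helper, the array cache, and the stand-in renderers are illustrative and do not reflect how MediaWiki's own cache is implemented.

```php
<?php
// Hypothetical render cache: one entry per (title, variant) pair.
$cache = array();

// $hashOptions plays the role of getExtraHashOptions(), e.g. "!zh-tw".
function renderCached(array &$cache, $title, $hashOptions, $render) {
    $key = $title . $hashOptions;
    if (!isset($cache[$key])) {
        $cache[$key] = call_user_func($render, $title); // cache miss
    }
    return $cache[$key];
}

// Stand-in renderers that only label their output with the variant.
$asTw = function ($t) { return "[zh-tw] $t"; };
$asCn = function ($t) { return "[zh-cn] $t"; };

echo renderCached($cache, 'Main Page', '!zh-tw', $asTw), "\n";
echo renderCached($cache, 'Main Page', '!zh-cn', $asCn), "\n";
echo count($cache), "\n"; // one entry per variant
```

If the variant were left out of the key, the second reader would be served the page as converted for the first reader's variant.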

Note that some of these functions are not Chinese-specific, for example getExtraHashOptions() and getPreferredVariant(), as well as the parsing code for the manual markup in convert(). Some code refactoring can be done if and when there is a need to implement conversion systems for other languages.[citation needed]
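As an illustration of what a language object for another language might look like, the sketch below overrides the same four methods for a script conversion from Cyrillic to Latin. The class name LanguageSrSketch, the variant codes, the constructor argument, and the five-letter table are all assumptions chosen to keep the example short; it is not MediaWiki's implementation for any language.

```php
<?php
// Hypothetical language class; not MediaWiki's actual implementation.
class LanguageSrSketch {
    private $preferred;

    // Tiny Cyrillic-to-Latin table covering only the demo word below.
    private $toLatin = array(
        'с' => 's', 'р' => 'r', 'п' => 'p', 'к' => 'k', 'и' => 'i',
    );

    function __construct($preferred) {
        $this->preferred = $preferred;
    }

    function getVariants() {
        return array('sr', 'sr-ec', 'sr-el');
    }

    function getPreferredVariant() {
        // Fall back to the base variant for unknown preferences.
        return in_array($this->preferred, $this->getVariants(), true)
            ? $this->preferred
            : 'sr';
    }

    function convert($text, $isTitle = false) {
        // Only the Latin variant needs conversion in this sketch.
        if ($this->getPreferredVariant() === 'sr-el') {
            return strtr($text, $this->toLatin);
        }
        return $text;
    }

    function getExtraHashOptions() {
        return '!' . $this->getPreferredVariant();
    }
}

$lang = new LanguageSrSketch('sr-el');
echo $lang->convert('српски'), "\n";
echo $lang->getExtraHashOptions(), "\n";
```

Only the table and the variant list are language-specific here, while getPreferredVariant() and getExtraHashOptions() would look the same for any language.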

The code for supporting customizable conversion tables is currently a bit messy.[citation needed] Code cleanup is underway, and its usage will be documented here shortly.[citation needed]

In MediaWiki 1.5

Since April 2005, there has been a LanguageConverter object that encapsulates most conversion-related functionality. A conversion system for the Serbian language is being actively worked on, and a preliminary test site was set up (all links are now broken).

There is also a test site with conversion support for English, currently just for fun.
