Article link: https://blog.youkuaiyun.com/u013669912/article/details/140420715

Note: this is an old article from 20 years ago; it was machine-translated and has had a first proofreading pass.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Ever wonder about that mysterious Content-Type tag? You know, the one you’re supposed to put in HTML and you never quite know what it should be?

Did you ever get an email from your friends in Bulgaria with the subject line “??? ??? ??? ???”?

I’ve been dismayed to discover just how many software developers aren’t really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, I discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they “couldn’t do anything about it.” Like many programmers, he just wished it would all blow over somehow.

But it won’t. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

And one more thing: IT’S NOT THAT HARD.

In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ASCII = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I’m really just trying to set a minimum bar here so that everyone can understand what’s going on and can write code that has a hope of working with text in any language other than the subset of English that doesn’t include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it’s character sets.

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I’m going to talk about very old character sets like EBCDIC here. Well, I won’t. EBCDIC is not relevant to your life. We don’t have to go that far back in time.

ASCII table

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.
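
The numbers in that paragraph are easy to check for yourself. Here is a minimal Python sketch, not anything from the original article: the helper names ascii_code, mark_word_end, and strip_high_bit are made up for illustration, and the high-bit function only imitates the idea attributed to WordStar above, not its actual file format.

```python
# Illustrative sketch of the ASCII facts above. All helper names are
# hypothetical; nothing here reproduces real WordStar behavior.


def ascii_code(ch: str) -> int:
    """Return the ASCII code of one character, or raise if it is not ASCII."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} does not fit in 7-bit ASCII")
    return code


def mark_word_end(word: str) -> bytes:
    """Turn on the high (8th) bit of the last letter of an ASCII word."""
    data = bytearray(word.encode("ascii"))
    if data:
        data[-1] |= 0x80  # the spare bit that plain ASCII never uses
    return bytes(data)


def strip_high_bit(data: bytes) -> str:
    """Clear the high bit of every byte and decode what is left as ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii")


if __name__ == "__main__":
    print(ascii_code(" "))                 # 32
    print(ascii_code("A"))                 # 65
    print(format(ascii_code("A"), "07b"))  # 1000001 (fits in 7 bits)
    print(repr(chr(7)), repr(chr(12)))     # '\x07' '\x0c' (bell, form feed)

    marked = mark_word_end("cat")
    print(marked)                          # b'ca\xf4'
    print(strip_high_bit(marked))          # cat
```

Once the eighth bit is spoken for like this, no byte value above 127 is left over to mean anything else, which is the sense in which the trick condemned a program to English text only.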