What Is UTF-8 And Why Is It Important?

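This article introduces the basic principles of the UTF-8 encoding and why it is important. UTF-8 is a multibyte encoding that represents the ASCII character set efficiently, remains compatible with single-byte encodings, and simplifies multilingual text processing.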

Unicode is a character set supported across many commonly used software applications and operating systems. For example, many popular web browser, e-mail, and word processing applications support Unicode. Operating systems that support Unicode include Solaris Operating Environment, Linux, Microsoft Windows 2000, and Apple's Mac OS X. Applications that support Unicode are often capable of displaying multiple languages and scripts within the same document. In a multilingual office or business setting, Unicode's importance as a universal character set cannot be overlooked.

Unicode is the only practical character set option for applications that support multilingual documents. However, applications do have several options for how they encode Unicode. An encoding is the mapping of Unicode code points to a stream of storable code units or octets. The most common encodings include the following:

  • UTF-8
  • UTF-16
  • UTF-32
Each encoding has advantages and drawbacks. However, one encoding in particular has gained widespread acceptance. That encoding is UTF-8. This article describes UTF-8, what it is, and why it is important.

Table 1 defines some terms that are used in this document.

Table 1 Common Definitions


Term                  Definition
Character Set         A repertoire of characters that have been collected together for some purpose.
Coded Character Set   An ordered character set in which each character has an assigned integer value.
Code Point            The integer value of a character within a coded character set.
Character Encoding    A mapping of code points to a series of bytes.
Code Unit             A single octet or byte of an encoded character.
Charset               A set of characters that has been encoded using a character encoding. Often used as a synonym for character encoding.

What is it?

Unicode 3.1 code points exist in the range U+0000 - U+10FFFF. Although each code point can be stored and manipulated as a 32-bit integer, convincing the world to use a 32-bit wide character encoding won't be immediately successful everywhere. This is especially true for Western European and non-Asian nations in general, which can encode their legacy character sets in as little as one byte per character.

UTF-8 is a multibyte encoding in which each character can be encoded in as little as one byte and as many as four bytes. Most Western European languages require fewer than two bytes per character. For example, characters from Latin-based scripts require only 1.1 bytes on average. Greek, Arabic, Hebrew, and Russian require an average of 1.7 bytes. Finally, Japanese, Korean, and Chinese typically require three bytes per character. [1]

The encoding algorithm is straightforward. Table 2 below shows how bits from a Unicode code point are arranged in the encoding for different character ranges.

Table 2 UTF-8 Bit Encoding of a Unicode Code Point


Character Range       1st Byte   2nd Byte   3rd Byte   4th Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF     80..BF
U+0800..U+0FFF        E0         A0..BF     80..BF
U+1000..U+CFFF        E1..EC     80..BF     80..BF
U+D000..U+D7FF        ED         80..9F     80..BF
U+D800..U+DFFF        ill-formed
U+E000..U+FFFF        EE..EF     80..BF     80..BF
U+10000..U+3FFFF      F0         90..BF     80..BF     80..BF
U+40000..U+FFFFF      F1..F3     80..BF     80..BF     80..BF
U+100000..U+10FFFF    F4         80..8F     80..BF     80..BF

As the above table shows, characters in the range U+0000 - U+007F can be encoded as a single byte. This means that the ASCII charset can be represented unchanged with a single byte of storage space. The next range, U+0080 - U+07FF, contains the remaining characters for most of the world's scripts and includes characters with diacritics. This range requires two bytes of encoded storage. The notable scripts in the range U+0800 - U+FFFF are Chinese, Korean, and Japanese. These scripts require three bytes of storage for each character. Finally, the non-BMP range contains characters that can be represented as surrogate pairs in UTF-16. Most of the new characters in this range are Chinese ideographs. The newly defined characters in this range require four bytes in the UTF-8 encoding.
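The byte counts described above can be checked against the Java standard library. The following is a minimal sketch, not part of the original article; the class name Utf8LengthDemo and the four sample code points are chosen only for illustration, one from each length class.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {
    /** Returns how many UTF-8 bytes the standard library produces for one code point. */
    static int utf8Length(int codePoint) {
        // Character.toChars yields one char, or a surrogate pair for non-BMP code points.
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Illustrative samples: ASCII letter, Latin letter with diacritic,
        // CJK ideograph, and a non-BMP ideograph.
        int[] samples = {0x41, 0xE9, 0x5B57, 0x20000};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

Run as written, the four samples should report 1, 2, 3, and 4 bytes respectively.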

Algorithms for producing a UTF-8 encoded character can be very simple. The following Java code shows how you can easily create your own UTF-8 encoder [2]:

/**
 * Converts an array of Unicode scalar values (code points) into
 * UTF-8. This algorithm works under the assumption that all
 * surrogate pairs have already been converted into scalar code
 * point values within the argument.
 *
 * @param ch an array of Unicode scalar values (code points)
 * @return a byte[] containing the UTF-8 encoded characters
 */
public static byte[] encode(int[] ch) {
    // determine how many bytes are needed for the complete conversion
    int bytesNeeded = 0;
    for (int i = 0; i < ch.length; i++) {
        if (ch[i] < 0x80) {
            ++bytesNeeded;
        } else if (ch[i] < 0x0800) {
            bytesNeeded += 2;
        } else if (ch[i] < 0x10000) {
            bytesNeeded += 3;
        } else {
            bytesNeeded += 4;
        }
    }
    // allocate a byte[] of the necessary size
    byte[] utf8 = new byte[bytesNeeded];
    // do the conversion from character code points to utf-8
    for (int i = 0, bytes = 0; i < ch.length; i++) {
        if (ch[i] < 0x80) {
            // one byte: 0xxxxxxx
            utf8[bytes++] = (byte) ch[i];
        } else if (ch[i] < 0x0800) {
            // two bytes: 110xxxxx 10xxxxxx
            utf8[bytes++] = (byte) (ch[i] >> 6 | 0xC0);
            utf8[bytes++] = (byte) (ch[i] & 0x3F | 0x80);
        } else if (ch[i] < 0x10000) {
            // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            utf8[bytes++] = (byte) (ch[i] >> 12 | 0xE0);
            utf8[bytes++] = (byte) (ch[i] >> 6 & 0x3F | 0x80);
            utf8[bytes++] = (byte) (ch[i] & 0x3F | 0x80);
        } else {
            // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            utf8[bytes++] = (byte) (ch[i] >> 18 | 0xF0);
            utf8[bytes++] = (byte) (ch[i] >> 12 & 0x3F | 0x80);
            utf8[bytes++] = (byte) (ch[i] >> 6 & 0x3F | 0x80);
            utf8[bytes++] = (byte) (ch[i] & 0x3F | 0x80);
        }
    }
    return utf8;
}
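For comparison, the same conversion can be delegated to the Java standard library rather than written by hand. This is a sketch under stated assumptions, not the article's method: the class name LibraryUtf8Encode and the helper toHex are hypothetical names introduced here, and the sample code points are illustrative.

```java
import java.nio.charset.StandardCharsets;

final class LibraryUtf8Encode {
    /** Encodes code points to UTF-8 using the standard library instead of manual bit shifting. */
    static byte[] encode(int[] codePoints) {
        // String(int[], int, int) throws IllegalArgumentException for values above U+10FFFF.
        String s = new String(codePoints, 0, codePoints.length);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated, two-digit uppercase hex (illustrative helper). */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One sample from each length class: 1, 2, 3, and 4 bytes.
        System.out.println(toHex(encode(new int[] {0x41, 0xE9, 0x5B57, 0x20000})));
        // prints "41 C3 A9 E5 AD 97 F0 A0 80 80"
    }
}
```

The three bytes E5 AD 97 for U+5B57 are the same sequence used in the character boundary discussion later in this article.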

Why is it Important?

UTF-8 is an important encoding because of the following reasons:
  • ASCII compatible
  • easily supported
  • compact and efficient for most scripts
  • easily processed, unlike other multibyte encodings
At the recent Unicode Conference in Hong Kong, one company said that their move to Unicode was simplified by the adoption of UTF-8. Instead of changing their products' code to support 16-bit or 32-bit wide Unicode characters, they chose UTF-8. What was their reason? They said that their system had lots of hard-coded comparisons to find specific ASCII characters in text. Instead of modifying their code everywhere, they simply changed their character encoding to UTF-8, which is compatible with ASCII. In other words, single-byte ASCII characters retain their encoded value in UTF-8. For example, code that checks for a '\' can continue checking for the byte value 0x5C instead of changing the code to check for 0x005C. Modifying hundreds of lines of text processing code scattered throughout thousands of lines of miscellaneous code can be time consuming and error prone. Sometimes selecting the UTF-8 encoding can provide the easiest and most cost-effective way to get a basic level of Unicode support in a legacy application.
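The kind of hard-coded byte comparison described above might look like the following sketch. It is not the company's actual code; the class name AsciiByteSearch, the method indexOfByte, and the sample string are assumptions made for illustration. The search keeps working on UTF-8 text because every byte of a multibyte sequence has its high bit set, so the byte 0x5C can only ever be a real backslash.

```java
import java.nio.charset.StandardCharsets;

final class AsciiByteSearch {
    /** Returns the index of the first byte equal to target, or -1 if it is absent. */
    static int indexOfByte(byte[] data, byte target) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "dir" (3 bytes), U+00E9 (2 bytes), U+5B57 (3 bytes), backslash, "file"
        byte[] utf8 = "dir\u00E9\u5B57\\file".getBytes(StandardCharsets.UTF_8);
        // The legacy-style search for 0x5C needs no change for UTF-8 input.
        System.out.println(indexOfByte(utf8, (byte) 0x5C)); // prints 8
    }
}
```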

Most applications have basic text-handling algorithms. Many of those algorithms make flawed assumptions about a character's storage requirements. For example, many programmers assume that a character requires only a single byte of storage. Another common assumption, especially for C programmers, is that a text string never contains the value 0x00. If this value does appear, it typically marks the end of the text string. Encodings like UTF-16 and UTF-32 store characters as 16- or 32-bit values. When a string of 16- or 32-bit values is processed as a series of byte values, the value 0x00 often appears, especially in Latin-based scripts. This complicates and confuses existing text processing algorithms, leading to miscalculated string lengths, oddly concatenated strings, and search failures. On the other hand, because UTF-8's basic code unit is a byte, legacy algorithms can typically run with only minor adjustments, if any.
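The 0x00 problem described above is easy to observe. The sketch below, with the hypothetical class name ZeroByteCount and an illustrative sample word, counts zero bytes in the same text encoded as UTF-8 and as big-endian UTF-16 without a byte-order mark.

```java
import java.nio.charset.StandardCharsets;

final class ZeroByteCount {
    /** Counts how many bytes in the array have the value 0x00. */
    static int countZeroBytes(byte[] data) {
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) {
                zeros++;
            }
        }
        return zeros;
    }

    public static void main(String[] args) {
        String text = "Hello";
        // UTF-8 keeps ASCII as single non-zero bytes.
        System.out.println(countZeroBytes(text.getBytes(StandardCharsets.UTF_8)));    // prints 0
        // UTF-16BE stores each ASCII letter as 00 xx, so every character adds a zero byte.
        System.out.println(countZeroBytes(text.getBytes(StandardCharsets.UTF_16BE))); // prints 5
    }
}
```

A C-style routine that stops at the first 0x00 would treat the UTF-16 form as an empty string, while the UTF-8 form passes through untouched.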

One complaint often aimed at Unicode is that it requires so much more space than legacy encodings for Latin-based scripts. In other words, UTF-16 or UTF-32 require 16 or 32 bits of storage for most characters instead of a single byte required by the series of ISO-8859 encodings. However, UTF-8 stores the ASCII subset of all these charsets in as little as one byte. The ASCII subset is definitely the most used set of characters for Western European and American languages. As mentioned earlier, most Western European languages can be written with 1.1 bytes per character on average. This is almost as efficient as ASCII, but it allows for up to four bytes per character for rare characters and obscure scripts when necessary.
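The storage comparison above can be made concrete for a single word. This is only a sketch: the class name EncodingSizes and the sample word are illustrative, and the UTF-32 size is computed as four bytes per code point rather than by calling a UTF-32 charset.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Size(String s) {
        // UTF_16BE writes no byte-order mark, so only the characters are counted.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    static int utf32Size(String s) {
        // UTF-32 uses exactly four bytes per code point (no byte-order mark counted).
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String word = "caf\u00E9"; // three ASCII letters plus U+00E9
        System.out.println(utf8Size(word));  // prints 5
        System.out.println(utf16Size(word)); // prints 8
        System.out.println(utf32Size(word)); // prints 16
    }
}
```

For this word UTF-8 averages 1.25 bytes per character, in line with the low averages cited for Latin-based scripts.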

Although many new development projects standardize quickly on Unicode, older projects often used legacy character sets that supported a small set of related languages. Experienced internationalization and localization engineers remember updating text processing algorithms to handle both "single-byte" and "multibyte" character sets. Do you remember updating your code to check "lead" bytes and possibly "trail" bytes during processing? Remember how difficult it was to find the beginning of a character if your index into the text was an arbitrary location? The problem was that trail bytes could also be lead bytes in some encodings. The Shift-JIS encoding, for example, was difficult to process backwards for this reason.

When Unicode became available as a fixed-width 16-bit encoding, many were excited to throw out multibyte encodings. Understandably, you may be hesitant to adopt a multibyte Unicode encoding after all the troubles you may have had with multibyte Asian character sets. However, UTF-8 is different, and it doesn't have all of the same problems as those legacy encodings. For example, it is much easier to find the start of a character from any arbitrary point in a text string. So-called "trail" bytes of a UTF-8 character sequence always have the bit pattern 10xxxxxx, so it is easy to find one's way back to the beginning of a character. A character pointer is at most three bytes away from the character's beginning. Even with most Asian ideographs, character boundaries are at most just a couple of bytes away. Figure 1 shows several characters and their encoding in UTF-8. Notice the hexadecimal byte sequence E5, AD, 97. If asked to find the character's beginning from the location marked 1, we could proceed as follows to find the character boundary at location 2 in the figure:

  1. Does the current byte start with the bit pattern 10xxxxxx?
  2. If yes, move left one byte and go to step #1.
  3. If no, the current byte is the start of the character. Finished.
Figure 1 Finding Character Boundaries is Relatively Simple

Unlike some legacy character encodings, UTF-8 is fairly easy to parse and manipulate. The bit patterns of the encoding allow you to quickly determine whether your character index points to a character's beginning or somewhere else. Moving backward or forward within a string is easy.
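The backward scan can be sketched in a few lines of Java. This is a minimal illustration that assumes well-formed UTF-8 input; the class name Utf8Boundary and the method startOfCharacter are hypothetical names, and the sample array places the E5 AD 97 sequence between two ASCII letters.

```java
final class Utf8Boundary {
    /**
     * Returns the index of the first byte of the character that contains
     * the byte at index. Assumes the array holds well-formed UTF-8.
     */
    static int startOfCharacter(byte[] utf8, int index) {
        int i = index;
        // Trail bytes match 10xxxxxx: masking the top two bits leaves 0x80.
        while (i > 0 && (utf8[i] & 0xC0) == 0x80) {
            i--; // move left until a lead byte or a single-byte character is found
        }
        return i;
    }

    public static void main(String[] args) {
        // 'A', then E5 AD 97 (one three-byte character), then 'B'
        byte[] text = {0x41, (byte) 0xE5, (byte) 0xAD, (byte) 0x97, 0x42};
        System.out.println(startOfCharacter(text, 3)); // prints 1
        System.out.println(startOfCharacter(text, 2)); // prints 1
        System.out.println(startOfCharacter(text, 4)); // prints 4
    }
}
```

Because a well-formed sequence has at most three trail bytes, the loop runs at most three times for any starting index.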

Summary

UTF-8 is a compact, efficient Unicode encoding. The encoding distributes a Unicode code value's bit pattern across one, two, three, or even four bytes. This encoding is a multibyte encoding.

UTF-8 encodes ASCII in only one byte. That means that languages that use Latin-based scripts can be represented with only 1.1 bytes per character on average. Other languages may require more bytes per character. Only the Asian scripts have significant encoding overhead in UTF-8 compared to UTF-16.

UTF-8 is useful for legacy systems that want Unicode support because developers don't have to drastically modify text processing code. Code that assumes single-byte code units typically doesn't fail completely when provided UTF-8 text instead of ASCII or even Latin-1.

Finally, unlike some legacy encodings, UTF-8 is easy to parse. So-called lead and trail bytes are easily distinguished. Moving forwards or backwards in a text string is easier in UTF-8 than in many other multibyte encodings.


[1] Forms of Unicode, Mark Davis, September 1999, http://www.ibm.com/developerworks/unicode/library/utfencodingforms/index.html .
[2] This code has not been optimized for size or speed.

© 2001 John O'Conner. John O'Conner is a staff engineer specializing in Java internationalization.
