UTF-8 and Unicode Standards
What is UTF-8?
UTF-8 stands for Unicode Transformation Format-8. It is a lossless encoding of Unicode characters into sequences of octets (8-bit bytes).
UTF-8 encodes each Unicode character as a sequence of one to four octets, where the number of octets depends on the code point (the integer value) assigned to the character. It is an efficient encoding for Unicode documents that use mostly US-ASCII characters, because each character in the range U+0000 through U+007F is represented as a single octet. UTF-8 is also the default encoding for XML.
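The relationship between a code point and its encoded length can be sketched in a few lines of Python. This is a minimal illustration, not a full encoder: the helper name utf8_length is hypothetical, and the range boundaries are those given in RFC 3629.

```python
def utf8_length(code_point: int) -> int:
    """Return how many octets UTF-8 uses for one Unicode scalar value."""
    # Surrogates (U+D800..U+DFFF) and values above U+10FFFF cannot be encoded.
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:      # US-ASCII range
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4                    # U+10000 through U+10FFFF


# Sample characters: "A", e-acute, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Compare the helper against Python's built-in encoder.
lengths = [(utf8_length(ord(ch)), len(ch.encode("utf-8"))) for ch in samples]

# The encoding is lossless: decoding the octets gives back the original text.
text = "".join(samples)
round_trip = text.encode("utf-8").decode("utf-8")
```

Each pair in lengths holds the predicted and the actual octet count for one sample character, and round_trip equals text.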
Standards
- RFC 3629: UTF-8, a transformation format of ISO 10646. November 2003.
- The Unicode Standard 4.0, August 2003. In particular, see the informal description of UTF-8 in sections 2.5 and 2.6, pages 30-32, and a much more formal definition in sections 3.9 and 3.10, pages 77-81.
Articles and background reading
- UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn
- Forms of Unicode, an excellent overview by Mark Davis
- Wikipedia: UTF-8, which contains a good discussion of why five- and six-octet sequences are now illegal in UTF-8
- Unicode Transformation Formats [czyborra.com]
- Unicode UTF-8 FAQ
- Unicode in XML and other Markup Languages: Unicode Technical Report #20
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), an amusing and informative article by Joel Spolsky
Character Sets
The MIME charset name for UTF-8 is UTF-8. Charset names are case-insensitive, so utf-8 is equally valid [IANA Character Sets].
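As a small illustration of that case-insensitivity, the sketch below uses Python's standard library to read the charset parameter from a Content-Type value and to resolve both spellings to the same codec. The helper name charset_of is hypothetical.

```python
import codecs
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset name coerced to lower case.
    return msg.get_content_charset()


upper = charset_of("text/html; charset=UTF-8")
lower = charset_of("text/html; charset=utf-8")

# Both spellings resolve to the same codec in Python's codec registry.
same_codec = codecs.lookup("UTF-8").name == codecs.lookup("utf-8").name
```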
In an HTML file, place this tag inside <head> ... </head>:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
In an XML document, the encoding is specified in the XML declaration at the start of the prolog:
<?xml version="1.0" encoding="UTF-8" ?>
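To see the declaration in action, the following sketch parses a UTF-8 encoded document with Python's xml.etree.ElementTree. The price element and its content are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical document containing the euro sign (U+20AC), stored as UTF-8 octets.
document = '<?xml version="1.0" encoding="UTF-8" ?><price>\u20ac 10</price>'.encode("utf-8")

# The parser reads the encoding from the XML declaration and decodes the octets.
root = ET.fromstring(document)
price_text = root.text
```

After parsing, price_text holds the original string, with the three UTF-8 octets of the euro sign decoded back into one character.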
In the Apache server configuration or an .htaccess file, this directive adds charset=UTF-8 to the Content-Type HTTP header for text/html and text/plain responses:
AddDefaultCharset UTF-8
