UTF-8 and Unicode

UTF-8 (8-bit Unicode Transformation Format) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the endianness and byte order mark complications of UTF-16 and UTF-32. [1]
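As a rough illustration of the two design goals above, the following Python sketch compares the bytes produced by a few codecs. The sample strings are arbitrary choices made for this example, not taken from any specification.

```python
# Illustrative sketch: ASCII compatibility and byte order, using Python's built-in codecs.
text = "plain ASCII"  # arbitrary sample string

# Pure ASCII text produces identical bytes in ASCII and in UTF-8.
assert text.encode("ascii") == text.encode("utf-8")

# UTF-16 exists in two byte orders, so one character has two possible byte sequences.
utf16_le = "A".encode("utf-16-le")  # little-endian: low byte first
utf16_be = "A".encode("utf-16-be")  # big-endian: high byte first
assert utf16_le != utf16_be

# UTF-8 is defined as a sequence of single bytes, so there is no byte-order variant.
utf8 = "A".encode("utf-8")
assert utf8 == b"A"
```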

The Unicode Standard, Version 5.0

UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets, where the number of octets depends on the integer value assigned to the Unicode character. It is an efficient encoding of Unicode documents that use mostly US-ASCII characters because it represents each character in the range U+0000 through U+007F as a single octet.
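The relationship between a character's integer value and its octet count can be sketched in Python as follows. The helper name utf8_length is hypothetical and exists only for this example; the sketch checks its result against Python's own UTF-8 encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many octets UTF-8 needs for one Unicode code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable in UTF-8")
    if code_point <= 0x7F:      # U+0000..U+007F (the US-ASCII range)
        return 1
    if code_point <= 0x7FF:     # U+0080..U+07FF
        return 2
    if code_point <= 0xFFFF:    # U+0800..U+FFFF
        return 3
    return 4                    # U+10000..U+10FFFF


# Sample characters: A, e with acute accent, euro sign, and an emoji.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    # The helper agrees with the encoder's actual output length.
    assert len(encoded) == utf8_length(ord(ch))
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X}: {hex_bytes} (octets={len(encoded)})")
```

The boundaries 0x7F, 0x7FF, and 0xFFFF follow from the number of payload bits available in 1, 2, and 3 octet sequences (7, 11, and 16 bits respectively).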

UTF-8 is the default encoding for XML and, since 2010, has been the dominant character encoding on the Web.


Articles and background reading

Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard

Character Sets

The MIME charset parameter value for UTF-8 is UTF-8. Charset names are case-insensitive, so utf-8 is equally valid [IANA Character Sets].
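A minimal Python sketch of how a consumer might treat the name case-insensitively is shown below. The function name charset_of and the sample Content-Type values are assumptions made for this example; the parsing relies on the standard library's email.message and codecs modules.

```python
import codecs
from email.message import Message


def charset_of(content_type: str) -> str:
    """Return a normalized codec name for a Content-Type header value.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased name, or None if absent
    if charset is None:
        raise ValueError("no charset parameter present")
    # codecs.lookup also ignores case and returns Python's canonical codec name.
    return codecs.lookup(charset).name


# Both spellings resolve to the same codec.
upper = charset_of("text/html; charset=UTF-8")
lower = charset_of("text/html; charset=utf-8")
assert upper == lower
```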

In an HTML5 page, place this tag inside <head> ... </head>:

<meta charset="UTF-8">

In an XML prolog, the encoding is typically specified in the XML declaration:

<?xml version="1.0" encoding="UTF-8" ?>
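To see the declaration in use, here is a small Python sketch that parses UTF-8 bytes with the standard library's xml.etree.ElementTree. The greeting element and its content are made up for this example.

```python
import xml.etree.ElementTree as ET

# A made-up document whose text contains one non-ASCII character (e with acute accent).
document = '<?xml version="1.0" encoding="UTF-8" ?>\n<greeting>caf\u00e9</greeting>'

# Serialize to bytes using the encoding named in the declaration.
data = document.encode("utf-8")

# The parser reads the declaration and decodes the bytes accordingly.
root = ET.fromstring(data)
assert root.text == "caf\u00e9"

# With no declaration at all, the parser falls back to UTF-8 for these bytes.
undeclared = "<greeting>caf\u00e9</greeting>".encode("utf-8")
assert ET.fromstring(undeclared).text == "caf\u00e9"
```

Passing bytes rather than an already decoded string matters here: the encoding declaration only has an effect when the parser does the decoding itself.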