Select and apply a character encoding

Intended audience: HTML developers (using editors or scripting), script developers (PHP, JSP, etc.), CSS developers, web project managers, and anyone looking for guidance on how to choose and apply a character encoding.


Which character encoding should you choose for your content and how do you apply it to your content?

Content is composed of a sequence of characters. Characters include the letters of the alphabet, punctuation marks, and so on. In a computer, however, content is stored as a sequence of bytes, which are numeric values. Some characters are represented by more than one byte. As with ciphers in espionage, the way a sequence of bytes is converted to characters depends on the key that was used to encode the text. In this context, that key is called a character encoding.
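As a small illustration of this point, the Python sketch below encodes one short string and then decodes the resulting bytes with two different encodings. The sample text and the variable names are chosen only for this example.

```python
# The same text stored as bytes, then read back with two different "keys".
text = "caf\u00e9"  # "café", written with an escape so the source stays ASCII

utf8_bytes = text.encode("utf-8")   # the accented letter becomes two bytes
print(len(text), len(utf8_bytes))   # 4 characters, 5 bytes

# Decoded with the encoding that was used to store the text.
right_key = utf8_bytes.decode("utf-8")
# Decoded with a different encoding: no error, but the wrong characters.
wrong_key = utf8_bytes.decode("windows-1252")

print(right_key)  # café
print(wrong_key)  # cafÃ©
```

The bytes never change between the two calls; only the key used to interpret them does.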

This article gives simple advice on which character encoding to use for your content and how to apply it, that is, how to actually create a document in that character encoding.

If you want to better understand what characters and character encodings are, read the article Character Encoding for Beginners.

Short answer

Use UTF-8 for all your content. Consider converting content in deprecated character encodings to UTF-8.
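A conversion of legacy content could look roughly like the following Python sketch. The function name convert_to_utf8 and the sample data are hypothetical, and the sketch assumes you already know which encoding the old content was saved in.

```python
def convert_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding as UTF-8."""
    # Multi-byte encodings raise UnicodeDecodeError on invalid input.
    # Single-byte encodings such as ISO-8859-1 accept any byte, so a wrong
    # guess for source_encoding can produce garbled text without an error.
    text = data.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# Stand-in for the contents of an old ISO-8859-1 file ("Grüße").
legacy = "Gr\u00fc\u00dfe".encode("iso-8859-1")
converted = convert_to_utf8(legacy, "iso-8859-1")

print(legacy)                     # b'Gr\xfc\xdfe'
print(converted)                  # b'Gr\xc3\xbc\xc3\x9fe'
print(converted.decode("utf-8"))  # Grüße
```

After converting the bytes, any encoding declaration inside the document or on the server needs to be updated to match.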

If you cannot use a Unicode encoding, check that the encoding you have chosen is widely supported by browsers and that it is not on the list of encodings that, according to current specifications, should be avoided.

Check if your choice is overridden by server-side HTTP settings.

In addition to declaring the encoding of the document inside the document and/or on the server, you need to save the text in that encoding to apply it to your content.

Developers must also ensure that the different parts of the system can communicate with each other.


Apply the character encoding to the content

Content authors should specify the character encoding of their pages using one of the methods described in Specifying Character Encoding in HTML.

However, it is important to understand that declaring the character encoding inside the document or on the server is not enough. A declaration does not change the bytes; you must also save the text in that character encoding. (The declaration only tells the browser how to interpret the byte sequence in which the text is stored.)
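The difference between declaring an encoding and saving in it can be seen in a short Python sketch. The file name page.html, the markup, and the helper is_valid_utf8 are made up for this illustration.

```python
import os
import tempfile


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A page that declares UTF-8 and contains one non-ASCII character.
html = '<!DOCTYPE html>\n<meta charset="utf-8">\n<title>Caf\u00e9</title>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")

    # Mismatch: the declaration says UTF-8, the bytes are windows-1252.
    with open(path, "w", encoding="windows-1252", newline="") as f:
        f.write(html)
    with open(path, "rb") as f:
        mismatched = f.read()

    # Consistent: the declaration and the saved bytes agree.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(html)
    with open(path, "rb") as f:
        consistent = f.read()

print(is_valid_utf8(mismatched))  # False
print(is_valid_utf8(consistent))  # True
```

In both cases the markup is identical; only the encoding chosen when saving decides whether the declaration is true.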

It is best to set a character encoding such as UTF-8 in your editor as the default for new documents, if possible. In Dreamweaver, for example, the preferences for new documents let you preset a character encoding.

For information on normalization forms, see Normalization in HTML and CSS. For information on the Unicode signature (BOM), see The BOM (byte-order mark) in HTML.

You should also make sure that your server delivers documents with the correct HTTP declarations, because these override declarations inside the document (see below).

Developers also need to make sure that the different parts of the system can communicate with each other. Web pages need to be able to communicate with back-end scripts, databases, and so on. Of course, this works best if everything is encoded in UTF-8. What developers need to consider is described in the article Migration to Unicode.

Why use UTF-8?

An HTML page can be encoded in only one character encoding. You cannot encode different parts of a document in different character encodings.

A Unicode encoding such as UTF-8 supports many languages and can accommodate pages and forms in any mixture of those languages. If you use a Unicode encoding, you do not need server-side logic to determine the character encoding separately for each page delivered or for each incoming form submission. This significantly reduces the effort of handling a multilingual website or application.

A Unicode encoding also lets you mix many more languages on a single web page than any other character encoding would.
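The following Python sketch illustrates the point with one string that mixes several scripts. The sample text, the helper can_encode, and the selection of legacy encodings are assumptions made for this example only.

```python
# One string that mixes Latin, Greek, Cyrillic, Japanese, and Arabic text.
mixed = "English, Ελληνικά, русский, 日本語, العربية"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 round-trips the whole string without loss.
assert mixed.encode("utf-8").decode("utf-8") == mixed

# Each legacy encoding covers only part of the text.
for name in ("utf-8", "iso-8859-1", "iso-8859-7", "shift_jis"):
    print(name, can_encode(mixed, name))
```

Only the UTF-8 line reports True; each of the legacy encodings lacks at least one of the scripts used in the sample.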

Support for a particular character encoding, even a Unicode encoding, does not necessarily mean that a browser will render the text correctly. A number of scripts, such as the Arabic and Indic scripts, require additional rules to convert a sequence of characters in memory into the appropriate sequence of glyphs for display.

The barriers to using Unicode are very low these days. In January 2012, Google reported that, of several billion web pages examined, over 60% used UTF-8. If you add the ASCII-only web pages (ASCII is a subset of UTF-8), the figure rises to close to 80%.

There are three different character encodings for Unicode: UTF-8, UTF-16, and UTF-32. Of these, only UTF-8 is recommended for web content. The HTML5 specification says: "Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. Authoring tools should default to using UTF-8 for newly-created documents."

All ASCII characters are encoded by exactly the same bytes in UTF-8 as in ASCII encoding, which is often helpful for interoperability and backward compatibility.
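This compatibility is easy to check in Python. The sample strings below are arbitrary; the second check, that non-ASCII characters use only byte values of 0x80 and above in UTF-8, is an additional property of UTF-8 that is not stated in the text above.

```python
# ASCII-only text produces identical bytes in ASCII and in UTF-8.
ascii_text = "Hello, world! <p>42</p>"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True

# Non-ASCII characters are stored using only byte values >= 0x80,
# so their bytes cannot be mistaken for ASCII characters.
non_ascii = "\u00e9\u20ac".encode("utf-8")  # "é" and "€"
high_bytes_only = all(byte >= 0x80 for byte in non_ascii)
print(non_ascii)        # b'\xc3\xa9\xe2\x82\xac'
print(high_bytes_only)  # True
```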

Consideration of the HTTP header

A character encoding declaration in the HTTP header overrides declarations inside the document. If the HTTP header declares a character encoding that does not match the one you want to use for your content, this is a problem unless you can change the server settings.

You may not have control over the information in the HTTP header and may need to ask your server administrators for help. Alternatively, you may be able to change the server settings yourself if you have limited access to server configuration files or if you generate pages with a scripting language. See Setting the HTTP charset parameter for more information on how to change the character encoding declaration for a set of files on the server or for content generated by a scripting language.
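For pages generated by a script, the HTTP declaration is typically set in the script itself. The sketch below shows one way this might look for a minimal Python WSGI application; the function name app and the page content are hypothetical, and other languages and frameworks have their own equivalents.

```python
def app(environ, start_response):
    """Minimal WSGI application that serves a single UTF-8 page."""
    body = (
        "<!DOCTYPE html>\n"
        '<meta charset="utf-8">\n'
        "<title>Gr\u00fc\u00dfe</title>\n"
    ).encode("utf-8")  # the bytes sent really are UTF-8

    headers = [
        # The HTTP declaration; it must match the bytes actually sent.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the application can be passed to make_server from the standard library module wsgiref.simple_server.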

Before doing so, you should check whether the HTTP header contains a character encoding declaration. You can use the W3C Internationalization Checker to find out whether a character encoding is declared in the HTTP header and, if so, which one. The article Checking HTTP headers points to alternative tools for checking the server's character encoding declaration.
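If you prefer to check from a script, the encoding can be read from the value of the Content-Type header. The Python sketch below works on a header value you have already obtained; the function name charset_from_content_type is hypothetical.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Returns the charset in lower case, or None if none is declared.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type('text/html; charset="UTF-8"'))     # utf-8
print(charset_from_content_type("text/html"))                      # None
```

For a live page, the response object returned by urllib.request.urlopen exposes the same method through its headers attribute.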

More information

This section covers subtleties that you don’t necessarily need to know but that are mentioned here for the sake of completeness.

What to do if you can’t use UTF-8?

If you really can’t avoid using an encoding other than UTF-8, you must choose one from a limited set of character encoding identifiers to ensure maximum interoperability and future readability of your content, and to minimize security vulnerabilities.

Until recently, the IANA registry was the reference for character encoding identifiers. The IANA registry often lists multiple identifiers for the same encoding. In these cases you should use the identifier marked as "preferred".

The new Encoding specification includes a list that has been tested against current browser implementations. You can find it in the table in the Encodings section. It is best to use the identifiers in the left column of that table.

Note: The fact that an identifier appears in one of these sources does not automatically mean that it is a good idea to use that encoding. Read the following section to learn which character encodings you should avoid.

Avoid these character encodings

The HTML5 specification lists some character encodings that you should avoid.

Documents must not use JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, or encodings based on EBCDIC. The reason is that in these encodings the byte values used for ASCII characters can also represent non-ASCII characters, which is a security vulnerability.
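The effect can be observed with ISO-2022-JP, one of the ISO-2022-based encodings. The Python sketch below uses an arbitrary Japanese sample string and is meant only to show that the stored bytes all fall in the ASCII range.

```python
text = "日本語"
encoded = text.encode("iso2022_jp")

# Every byte is below 0x80, so it is a legal ASCII byte value ...
all_ascii_range = all(byte < 0x80 for byte in encoded)
print(all_ascii_range)  # True

# ... yet decoded with the right key the bytes are not ASCII text at all.
round_trip = encoded.decode("iso2022_jp")
print(round_trip == text)  # True

# Software that assumes ASCII sees different characters in the same bytes.
ascii_view = encoded.decode("ascii")
print(ascii_view == text)  # False
```

A filter that scans such bytes for ASCII markup characters can therefore reach a different conclusion than the browser that later decodes them.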

Documents must also not use the CESU-8, UTF-7, BOCU-1, or SCSU encodings; these were never intended for web content, and the HTML5 specification prohibits browsers from supporting them.

The specification also advises against the use of UTF-16, and the use of UTF-32 is "particularly discouraged".

Other character encodings listed in the Encoding specification should also be avoided, including Big5 and EUC-JP, which are problematic in terms of interoperability. You should also avoid ISO-8859-8 (the Hebrew encoding for visual letter order) and instead use an encoding that stores text in logical order (UTF-8 or, if that is not possible, ISO-8859-8-i).
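A build or review script could flag these encodings automatically. The Python sketch below is a hypothetical helper, not an official list: the names are taken only from the paragraphs above, and the prefix matching for the ISO-2022, EBCDIC, UTF-16, and UTF-32 families is a simplification (many EBCDIC code pages, for instance, do not have "ebcdic" in their name).

```python
# Encodings named individually in the text above (lower-cased).
AVOID_EXACT = {
    "jis_c6226-1983", "jis_x0212-1990", "hz-gb-2312", "johab",
    "cesu-8", "utf-7", "bocu-1", "scsu",
    "big5", "euc-jp", "iso-8859-8",
}

# Families matched by prefix; a rough approximation only.
AVOID_PREFIXES = ("iso-2022", "ebcdic", "utf-16", "utf-32")


def should_avoid(label: str) -> bool:
    """Return True if the label names an encoding the text advises against."""
    name = label.strip().lower()
    return name in AVOID_EXACT or name.startswith(AVOID_PREFIXES)


for label in ("UTF-8", "utf-7", "ISO-2022-JP", "ISO-8859-8", "ISO-8859-8-i"):
    print(label, should_avoid(label))
```

Note that ISO-8859-8-i is deliberately not flagged, since the text names it as the fallback when UTF-8 is not possible.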

The replacement encoding listed in the Encoding specification is not actually an encoding but a fallback that maps every octet (byte) to the Unicode code point U+FFFD REPLACEMENT CHARACTER. Obviously, it does not make sense to transmit data in this encoding.

The x-user-defined encoding is a single-byte encoding whose lower half is ASCII and whose upper half maps into the Unicode Private Use Area (PUA). Like the Private Use Area in general, this encoding should be avoided on the public Internet because it harms interoperability and long-term usability.
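As a rough sketch of what this mapping looks like, the Python function below decodes x-user-defined bytes by hand. The function name is hypothetical, and the offset U+F780 for the upper half is taken from the decoder described in the Encoding specification, not from the text above.

```python
def decode_x_user_defined(data: bytes) -> str:
    """Decode bytes as x-user-defined: ASCII below 0x80, PUA above."""
    return "".join(
        # Bytes 0x80..0xFF land on U+F780..U+F7FF in the Private Use Area.
        chr(byte) if byte < 0x80 else chr(0xF780 + byte - 0x80)
        for byte in data
    )


decoded = decode_x_user_defined(b"A\x80\xff")
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
print(code_points)  # ['U+0041', 'U+F780', 'U+F7FF']
```

The resulting private-use characters have no agreed meaning, which is why such content does not interoperate.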
