Character Set

Character Sets

1. Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
The purpose for defining these two character sets is to separate out the environment in which the source is translated from the environment in which the translated output is executed. When these two environments are different, the translation process is commonly known as cross compiling. An implementation may add additional characters to either the source or execution character set. There is no requirement that any additional characters exist in either environment.
A collating sequence is defined, but the particular values for the characters are not specified. Only these characters are guaranteed to be supported by a conforming implementation. The characters used in the definition of the C language exist within both the source and execution character sets. It is intended that a C translator be able to successfully translate a C translator written in C. What is needed for C is to determine the necessary repertoire, ignore the collating sequence altogether (it is of no importance to the language), and then find ways of expressing the repertoire in a way that should give no problems with currently popular code sets.
Developers whose native tongue is English tend to be unaware of the distinction between source and execution character sets. Most of these developers do most of their development in environments where they are identical.
Dividing each set into a basic character set and extended characters separates out the two components of the character set used by an implementation: the one that is always required to be provided, and the extended set, which is optional. This explicit subdivision of characters into sets is new in C99. The wording in the C90 Standard specified the minimum contents of the basic source and basic execution character sets. These terms are now defined exactly, with all other characters being called extended characters. However, the handling of such characters is part of the application domain and outside the scope of these coding guidelines. The issues involved in programs written in one locale and targeted at another locale are also largely outside the scope of these coding guidelines.
Using characters from the developer’s native language can make an important contribution to program readability for those developers who share that native language. Some applications are now developed by people whose native languages differ from each other. The issue of using extended characters to improve source code readability is not always clear-cut. What is the best way to handle programs made up of translation units developed by developers from different locales; should all developers working on the same application use the same locale? To a large extent these questions involve predicting the future. Who will be doing the future development and maintenance of the software? It may not be possible to provide a reliable answer to this question.
A developer might be forgiven for thinking that the term extended character set applied to the set of extended characters only. For both the source character set and the execution character set, the following statement is true: extended_character_set = basic_character_set + extended_characters;
This has already been stated for the members of the source character set. Although it might not specify their values, the standard does specify some of the properties of objects that hold them. Developers tend to make several assumptions about the values of the execution character set:
  • They are the same as the source character set.
  • All the uppercase letters, all the lowercase letters, and all the digits are contiguous.
  • They are less than 128.
  • The actual values used by a translator (e.g., space being 32).
Only the assumption about the digits being contiguous is guaranteed to be true. A program may contain implicit dependencies on the representation of members of the execution character set because developers are not aware they are making assumptions about something that is not fixed. Designing programs to accommodate the properties of different character sets is not a trivial matter that can be covered in a few guideline recommendations.
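The difference between the guaranteed and the unguaranteed assumptions can be sketched in C. The function names digit_value and lowercase_index are hypothetical and chosen only for this illustration. The first function relies on the one property the standard does guarantee (consecutive digit values); the second avoids computing c - 'a', which gives the expected answer in ASCII but not in EBCDIC, where the letters are not contiguous.

```c
#include <assert.h>
#include <string.h>

/* Portable: the standard guarantees that '0'..'9' have consecutive values. */
int digit_value(char c)
{
    assert(c >= '0' && c <= '9');
    return c - '0';
}

/* Portable: look the letter up in a table instead of computing c - 'a',
   because the letters need not be contiguous in the execution character set.
   Returns 0..25, or -1 if c is not a lowercase letter of the basic set. */
int lowercase_index(char c)
{
    static const char letters[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    if (c == '\0')              /* strchr would otherwise match the terminator */
        return -1;
    p = strchr(letters, c);
    return p ? (int)(p - letters) : -1;
}
```

For example, digit_value('7') yields 7 on every conforming implementation, and lowercase_index('z') yields 25 whatever the underlying encoding is.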
2. In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
This describes the two methods specified by the standard for representing members of the execution character set in character constants and string literals. Escape sequences are a method of representing execution characters in the source, which may not be representable in the source character set. They make it possible to explicitly specify a particular numeric value, which is known to represent a given character in the execution character set (as defined by the implementation). This is one route through which characters appearing in the source code can appear in the output produced by a program; another is the __func__ reserved identifier, which provides a mechanism for the name of a function to appear in a string.
String literals are not always used to simply represent character sequences. A developer may choose to embed other, numeric, information within a string literal. Relying on characters to have the desired value would create a dependence on a particular character set being used and create literals that are harder to interpret (use of escape sequences makes it explicit that a numeric value is required). The contents of string literals therefore need to be interpreted in the context in which they are used. The value of an escape sequence may, or may not, be the same as that of a member of the basic character set. The extent to which the value of an escape sequence does, or does not, represent a member of the basic character set is one of intent on the part of the developer.
This defines the term null character (an equivalent one for a null wide character is given in the library section). The null character is used to terminate string literals. The C committee once received a request from a communications-related standards committee asking that this requirement be removed from the C Standard. The sending of null bytes was causing problems on some communications links. The C committee pointed out that C’s usage was a long-established practice and that they had no plans to change it. A common beginner’s mistake, using NULL where the null character is intended, is sometimes not diagnosed because an implementation has defined the NULL macro to be 0, rather than (void *) 0. If C++ compatible headers are being used, the problem is not helped by that language’s explicit requirement that the null pointer be represented by 0.
A string literal may contain more than one null character, or none at all. In the former case the literal will contain more than one string (according to the definition of that term given in the library section); for instance, "abc\000xyz" contains two null characters, while in char s[3] = "abc" the array s contains none. Each null character terminates a string even though more than one of them may appear in a string literal. Null characters are different from other escape sequences in a string literal in that they have the special status of acting as a terminator (e.g., the library string searching routines terminate when a null character is encountered, leaving any subsequent characters in the literal unchecked). Any surprising behavior occurring because of this usage is a fault and these coding guidelines are not intended to recommend against the use of constructs that are obviously faults.
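The two cases can be checked with a short C sketch. It assumes the first example is meant to read "abc\000xyz", with an octal escape for the embedded null character; the object and function names are hypothetical. Note that initializing a three-element array with "abc" is valid C but not valid C++.

```c
#include <assert.h>
#include <string.h>

/* One embedded null character (octal escape \000) plus the implicit terminator. */
static const char two_strings[] = "abc\000xyz";

/* Exactly three elements: there is no room for the terminator,
   so this array does not hold a string. */
static const char no_terminator[3] = "abc";

/* Returns a pointer to the string that follows the first null character. */
const char *second_string(void)
{
    return two_strings + strlen(two_strings) + 1;
}

void check_literals(void)
{
    assert(sizeof two_strings == 8);    /* a b c \0 x y z \0 */
    assert(strlen(two_strings) == 3);   /* the library stops at the first null */
    assert(strcmp(second_string(), "xyz") == 0);
    assert(sizeof no_terminator == 3);  /* no null character stored */
}
```

Passing no_terminator to strlen would read past the end of the array, which is the kind of fault the paragraph above refers to.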
3. Both the basic source and basic execution character sets shall have the following members:
  • the 26 uppercase letters of the Latin alphabet: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  • the 26 lowercase letters of the Latin alphabet: a b c d e f g h i j k l m n o p q r s t u v w x y z
  • the 10 decimal digits: 0 1 2 3 4 5 6 7 8 9
  • the following 29 graphic characters: ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
  • the space character ( ), and control characters representing horizontal tab (\t), vertical tab (\v), new line (\n) and form feed (\f).
The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.
The C89 Committee ultimately came to remarkable unanimity on the subject of character set requirements. There was strong sentiment that C should not be tied to ASCII, despite its heritage and despite the precedent of Ada being defined in terms of ASCII. Rather, an implementation is required to provide a unique character code for each of the printable graphics used by C, and for each of the control codes representable by an escape sequence. (No particular graphic representation for any character is prescribed; thus the common Japanese practice of using the glyph “¥” for the C character “\” is perfectly legitimate.) Translation and execution environments may have different character sets, but each must meet this requirement in its own way. The goal is to ensure that a conforming implementation can translate a C translator written in C.
For this reason, and for economy of description, source code is described as if it undergoes the same translation as text that is input by the standard library I/O routines: each line is terminated by some newline character regardless of its external representation. The characters vertical tab, form feed, carriage return, and new-line are sometimes referred to as line break characters. This term describes the most commonly seen visual effect of their appearance in a text file, not how a translator is required to interpret them.
Experience shows that developers usually have access to computers capable of displaying and inputting characters from the ASCII character set. When these characters are intended to appear in the execution character set, it becomes an applications issue. Most source code is displayed using a fixed-width font. Research has shown that people read text faster when it is displayed in a variable-width font than in a fixed-width font. Comparing text has also been found to be quicker, but not searching for specific words. Use of a variable-width font would also enable more characters to be displayed on a line, reducing the need to split statements across more than one line. Horizontal tab is a single white-space character. However, when viewing source code containing such a character, many display devices appear to replace it with more than one white-space character. There is no agreed-on spacing for the horizontal tab character and its use can cause the appearance of the source code to vary between display devices. The standard contains an alternative method of representing horizontal tab in string literals and character constants. With the concept of multibyte characters, “native” characters could be used in string literals and character constants, but this use was very dependent on the implementation and did not usually work in heterogeneous environments. Also, this did not encompass identifiers.
The definition of character already specifies that it fits in a byte. However, a character constant has type int, which could be thought to imply that the value representation of characters need not fit in a byte. This wording clarifies the situation. The representation of members of the basic execution character set is also required to be a non-negative value. A general principle of coding guidelines is to recommend against the use of representation information. In this case the standard is guaranteeing that a character will fit within a given amount of storage. Relying on this requirement might almost be regarded as essential in some cases. The requirement that the decimal digits have consecutive values is a separate matter: the Committee realized that a large number of existing programs depended on this statement being true. It is certainly true for the two major character sets used in the English-speaking world, ASCII and EBCDIC, and for all of the human language digit encodings specified in Unicode.
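The relationship between the type of a character constant and the storage needed for its value can be illustrated as follows. This is a minimal sketch for C only (in C++ a character constant has type char); the function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>

void check_character_constants(void)
{
    char c = 'A';                           /* member of the basic set */

    assert(sizeof 'A' == sizeof(int));      /* the constant has type int in C */
    assert(sizeof c == 1);                  /* a char object is one byte */
    assert('A' >= 0 && 'A' <= UCHAR_MAX);   /* yet its value fits in a byte */
    assert(c == 'A');                       /* no information lost in the char */
    assert(c >= 0);                         /* non-negative when stored in a char */
}
```

The last two checks hold for members of the basic execution character set; for extended characters the sign of the stored value depends on whether plain char is signed.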
The C library makes a distinction between text and binary files. However, there is no requirement that source files exist in either of these forms. The worst-case scenario: In a host environment that did not have a native method of delimiting lines, an implementation would have to provide/define its own convention and supply tools for editing such files. Some integrated development environments do define their own conventions for storing source files and other associated information. Unicode Technical Report #13: “Unicode newline guidelines” discusses the issues associated with representing new-lines in files. The ISO 6429 standard also defines NEL (NExt Line, hexadecimal 0x85) as an end-of-line indicator. The Microsoft Windows convention is to indicate the end-of-line with a carriage return/line feed pair, \r\n; the Unix convention is to use a single line feed character, \n; the Macintosh convention is to use the carriage return character, \r. Some mainframes implement a form of text files that mimic punched cards by having fixed-length lines. Each line contains the same number of characters, often 80. The space after the last user-written character is sometimes padded with spaces, other times it is padded with null characters.
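One way a tool reading source as binary data might map the three conventions above onto a single new-line character is sketched below. The function normalize_newlines is hypothetical, not something the standard specifies, and it assumes an execution character set in which \r and \n are the carriage return and line feed codes used by the file. It does not handle NEL or fixed-length records.

```c
#include <assert.h>
#include <stddef.h>

/* Rewrites buf in place so that "\r\n", a lone '\r', and a lone '\n'
   each become a single '\n'. buf holds len bytes and need not be
   null-terminated. Returns the new length, which is never more than len. */
size_t normalize_newlines(char *buf, size_t len)
{
    size_t in = 0, out = 0;

    assert(buf != NULL || len == 0);
    while (in < len) {
        char c = buf[in++];
        if (c == '\r') {
            if (in < len && buf[in] == '\n')
                in++;           /* swallow the LF of a CR/LF pair */
            c = '\n';
        }
        buf[out++] = c;         /* out <= in, so unread input is never overwritten */
    }
    return out;
}
```

Given the seven bytes a \r \n b \r c \n, the function leaves the six bytes a \n b \n c \n at the start of the buffer and returns 6.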
These characters form part of the set of 96 execution character set members defined by the standard, plus new line which is introduced in translation phase 1. However, these characters are not in the basic source character set, and are represented in it using escape sequences. The standard does not prohibit such characters from occurring in a source file outright. The Committee was aware of implementations that used such characters to extend the language. For instance, the use of the @ character in an object definition to specify its address in storage. The list of exceptions is extensive. The only usage remaining, for such characters, is as a punctuator. Any other character has to be accepted as a pre-processing token. It may subsequently, for instance, be stringized. It is the attempt to convert this pre-processing token into a token where the undefined behavior occurs.
4. A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.
There is a third kind of case that characters can have, titlecase (a term sometimes applied to words where the first letter is in uppercase, or titlecase, and the other letters are in lowercase). In most instances titlecase is the same as uppercase, but there are a few characters where this is not true; for instance, the titlecase of the Unicode character U+01C9, lj, is U+01C8, Lj, and its uppercase is U+01C7, LJ. All implementations are required to support the basic source character set to which this terminology applies. Annex D lists those universal character names that can appear in identifiers. However, they are not referred to as letters (although they may well be regarded as such in their native language). The term letter assumes that the orthography (writing system) of a language has an alphabet. Some orthographies, for instance Japanese, don’t have an alphabet as such (let alone the concept of upper and lowercase letters). Even when the orthography of a language does include characters that are considered to be matching upper and lowercase letters by speakers of that language (e.g., æ and Æ, å and Å), the C Standard does not define these characters to be letters.
5. The universal character name construct provides a way to name other characters.
In theory all characters on planet Earth and beyond. In practice, those defined in ISO 10646.
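A universal character name can be tried out in a wide character constant. The sketch below assumes a C99 or later translator; the function names are hypothetical. The value is only checked when the implementation defines __STDC_ISO_10646__, the macro by which it promises that wchar_t values are ISO 10646 code points; otherwise the value of the constant is implementation-defined.

```c
#include <assert.h>
#include <wchar.h>

/* Returns 1 if the wide character named by \u00E5 (å) has its ISO 10646
   value, 0 if it does not, and -1 if the implementation makes no promise
   that wchar_t holds ISO 10646 values. */
int ucn_matches_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return L'\u00E5' == 0xE5;
#else
    return -1;
#endif
}

void check_ucn(void)
{
    /* Either the promise is absent (-1) or the values agree (1). */
    assert(ucn_matches_iso10646() != 0);
}
```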