Environment

Before you can efficiently institute new programming methodologies, you need to understand the features available in your programming environment. It is often worth asking why a particular feature is present. If you are already using the feature, great, but is there another way you could also be using it? If you are not using the feature, try to work out why it was added: someone needed and requested it, so why did they need it and how are they using it? Try to become an expert in your environment; in the process of learning all of its features, you may eventually become one. And the learning process never stops. With each new version of the tools in your environment, look in the manuals to find out what new features have been added.
Standard C defines two types of implementation: hosted and freestanding. The C Standard formalizes a separation between the language and the library by distinguishing between hosted and freestanding implementations. Informally, a hosted implementation is a C translation and execution environment running under an operating system, with full support for the language and library. A freestanding implementation is a C translation and execution environment with nearly full language support but essentially no support for the standard library’s runtime components – an environment not uncommon among low-end embedded systems.
Here’s what the C99 Standard actually says:
“A conforming hosted implementation shall accept any strictly conforming program. A conforming freestanding implementation shall accept any strictly conforming program that does not use complex types and in which the use of the features specified in the library clause (clause 7) is confined to the contents of the standard headers <float.h>, <iso646.h>, <limits.h>, <stdarg.h>, <stdbool.h>, <stddef.h>, and <stdint.h>.”
An implementation translates C source files and executes C programs in two data processing system environments, which will be called the translation environment and the execution environment in this International Standard.

1. Conceptual Models

1.1 Translation Environment

A C program consists of units called source files. A source file, together with all the headers and source files included via the #include directive, is known as a preprocessing translation unit; once the pre-processor has finished expanding it, it is called a translation unit. From a translation unit, the compiler generates an object file, which can be further processed and linked (possibly with other object files) to form an executable program. Thus, we can divide the Translation Environment into two parts:

1.1.1. Program Structure

The text of the program is kept in units called source files, (or preprocessing files) in this International Standard. A source file together with all the headers and source files included via the preprocessing directive #include is known as a preprocessing translation unit. After preprocessing, a preprocessing translation unit is called a translation unit. Previously translated translation units may be preserved individually or in libraries. The separate translation units of a program communicate by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units may be separately translated and then later linked to produce an executable program.
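As a minimal sketch of this communication (file and identifier names are illustrative), two translation units might share an object and a function whose identifiers have external linkage:

/* counter.c: first translation unit */
int counter = 0;                    /* object with external linkage */

void increment(void)
{
    counter++;
}

/* main.c: second translation unit */
#include <stdio.h>

extern int counter;                 /* refers to the definition in counter.c */
void increment(void);

int main(void)
{
    increment();
    printf("%d\n", counter);        /* prints 1 once the two units are linked */
    return 0;
}

Each file can be translated separately; the linker later matches the extern declaration in main.c with the definition in counter.c.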
The Program Structure part concerns the pre-processing translation unit, a term not generally used by programmers. A pre-processing translation unit is parsed according to the syntax for pre-processing directives. Thus, we can say that a source file together with all the headers and source files included via the preprocessing directive #include, less any source lines skipped by the conditional inclusion pre-processing directives, becomes a translation unit.
The standard committee does not specify what information is preserved in these translated translation units. It could be a high-level representation, even some tokenized form, of the original source. It is most commonly relocatable machine code and an associated symbol table. In the Microsoft Windows environment, translated files are usually given the suffix .obj and libraries the suffix .lib or .dll. In a Unix environment, the suffix .o is used for object files and the suffix .a for library files, or .so for dynamically linked libraries.
The C Standard Committee places no requirements on the linking process, other than that it produce a program image. However, under a hosted implementation, translation phase 8 requires at least one translation unit containing a function called main in order to create a program image.

1.1.2. Translation Phases

The C Standard committee defined translation as a sequence of eight phases; that is, the translation process is completed in 8 steps. These phases were introduced to answer translation-ordering questions whose answers differed among early C implementations. The 8 steps are:
1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
This phase maps the bits held on some storage device onto members of the source character set. C requires that sequences of source file characters be grouped into units called lines. Any source file character not in the basic source character set is replaced by the universal-character-name that designates that character. This phase is where a translator interfaces with the host to read sequences of bytes from a source file. A source file is usually represented as a text file. Some hosts treat text files differently from binary files (lines are terminated and end-of-file may be indicated by a special character or trailing null characters). There is no requirement that the file containing C source code have any particular form. Known forms of source file include the following:
  • Stream of bytes. Both text and binary files are treated as a linear sequence of bytes – the Unix model;
  • Text and binary files. Text files have special end-of-line markers, and end-of-file may be indicated by a special character; binary files are treated as a sequence of bytes;
  • Fixed-length records. These records can be either fixed-line length (a line cannot contain more than a given, usually 72 or 80, number of characters; dating back to when punch cards were the primary form of input to computers), or fixed-block length (i.e., lines do not extend over block boundaries and null characters are used to pad the last line in a block).
The replacement of trigraphs by their corresponding single characters occurs before preprocessing tokens are created. This means that the replacement happens for all character sequences, not just those outside of string literals and character constants. Studies of translator performance have shown that a significant amount of time is consumed by lexing characters to form preprocessing tokens. In order to improve performance for the average case, Borland wrote a special program to handle trigraphs: a source file that contained trigraphs first had to be processed by this program, and the resulting output file was then fed into the program that implemented the rest of the translator.
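For illustration, consider how trigraph replacement interacts with string literals (a sketch; the \? escape in the second literal is the standard way to defeat replacement):

char s[] = "??(??)";   /* after phase 1 trigraph replacement: "[]" */
char t[] = "?\?(";     /* the \? escape stops ??( being seen as a trigraph */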
2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
The second step is known as line splicing. The purpose is to allow multiple physical source lines to be spliced together into a single logical source line, so that, for example, preprocessing directives can span more than one line.
A white-space character is sometimes accidentally placed after a backslash. This can occur when source files are ported, unconverted between environments that use different end-of-line conventions; for instance, reading MS-DOS files under Linux. The effect is to prevent line splicing from occurring and invariably causes a translator diagnostic to be issued (often syntax-related). This is an instance of unintended behavior and no guideline recommendation is made. Existing source sometimes uses line splicing to create a string literal spanning more than one source code line. The reason for this usage often is originally based on having to use a translator that did not support string literal concatenation.
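A sketch of both uses of line splicing, together with the concatenation style that later made the string literal form unnecessary:

#define MAX(a, b) \
        ((a) > (b) ? (a) : (b))   /* the directive spans two physical lines */

const char *old_style = "a literal continued \
onto the next line";              /* relies on line splicing */

const char *new_style = "a literal continued "
                        "onto the next line"; /* string literal concatenation */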
A series of backslash characters at the end of a line does not get consumed by repeated splicing; only the last one is eligible. This is a requirement that causes no code to be written in the translator, as opposed to a requirement that needs code to be written to implement it.
What should the behavior be if the last line of an included file did not end in a new-line? Should the characters at the start of the line following the #include directive be considered to be part of any preceding pre-processing token? Or perhaps source files should be treated as containing an implicit new-line at their end. This requirement simplifies the situation by rendering the behavior undefined. Lines are important in pre-processor directives, although they are not important after translation phase 4. Treating two apparently separate lines, in two different source files, as a single line opens the door to a great deal of confusion for little utility. While undefined behavior will occur for this usage, instances of it occurring are so rare that it is not worth creating a coding guideline recommending against its use.
3. The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
A pre-processing token is the smallest indivisible element of the C language in translation phases 3 to 6. The categories of pre-processing token are: header names, identifiers, pre-processing numbers, character constants, string literals, punctuators, and single non-white-space characters that do not fall into any of the other categories. A token is the smallest indivisible element of the C language in translation phases 7 and 8. The categories of token are: keywords, identifiers, constants, string literals, and punctuators.
If an input stream of characters has been parsed into tokens up to a given character, the next token is the longest sequence of characters that could constitute a token. Preprocessing tokens are created before any macro substitution takes place. The C preprocessor is thus a token preprocessor, not a character preprocessor. The base document was not clear on this subject and some implementors interpreted it as defining a character preprocessor. The difference can be seen in:
#define a(b) printf("b=%d\n", b)
a(var);
The C preprocessor expands the above to:
printf("b=%d\n", var);
while a character preprocessor would expand it to:
printf("var=%d\n", var);
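When the character-preprocessor result is actually wanted, C provides it explicitly through the # (stringize) operator; a minimal sketch:

#include <stdio.h>

#define a(b) printf(#b "=%d\n", b)

int main(void)
{
    int var = 5;
    a(var);   /* expands to printf("var" "=%d\n", var) and prints var=5 */
    return 0;
}

Here #b produces the string literal "var", which phase 6 concatenates with its neighbour to form the final format string.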
Linguists used the term lexical analysis to describe the process of collecting characters to form a word before computers were invented. This term is used to describe the process of building preprocessing tokens and in C’s case would normally be thought to include translation phases 1–3. The part of the translator that performs this role is usually called a lexer. As well as the term lexing, the term tokenizing is also used. The term pre-processing token is rarely used by developers. The term token is often used generically to apply to such entities in all phases of translation.
What is a partial pre-processing token? “Partial pre-processing token” is not itself a technical term; it is merely the English word “partial” modifying the technical term “pre-processing token”. A pre-processing token is defined by the grammar non-terminal preprocessing-token, so a partial pre-processing token is just part of a pre-processing token that is not the whole of it: presumably a sequence of characters that does not form a pre-processing token unless additional characters are appended. However, it is always possible for the individual characters of a multiple-character pre-processing token to be interpreted as some other pre-processing token. For instance, the two characters .. represent two separate pre-processing tokens (two periods), and the character sequence %:% represents the two pre-processing tokens # and % (rather than ##, which it would have been had a : followed). The intent is to make it possible to perform low-level lexical processing on a per-source-file basis. That is, an #included file can be lexically analyzed separately from the file from which it was included. This means that developers only need to look at a single source file to know what pre-processing tokens it contains. It can also simplify the implementation.
The statement that “source files shall not end in a partial pre-processing token or in a partial comment” has two implications. First, a pre-processing token may not begin in one file and end in another file. Second, the last pre-processing token in a source file must be well-formed and complete. For example, the last token may not be a string literal missing the close quote. The requirement that source files end in a new-line character means that the behavior is undefined if a line (physical or logical) starts in one source file and is continued into another source file. In this phase a comment is an indivisible unit. A source file cannot contain part of such a unit, only a whole comment. That is, it is not possible to start a comment in one source file and end it in another source file.
New-line is a token in the preprocessor grammar. It is used to delimit the end of preprocessor directives. In this phase the only remaining white-space characters that need to be considered are those that occur between pre-processing tokens. All other white-space characters will have been subsumed into pre-processing tokens. White-space characters only have significance now when preprocessing tokens are glued together, or as a possible constraint violation.
Sequences of more than one white-space character often occur at the start of a line. They also occur between the tokens forming a declaration when developers are trying to achieve a particular visual layout. However, outside of the contents of a character constant or string literal, white-space can only make a difference to the behavior of a program when it appears in conjunction with the stringize operator.
4. Pre-processing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is produced by token concatenation, the behavior is undefined. A #include pre-processing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
This phase is commonly referred to as pre-processing. Pre-processor directives are lines included in the code of our programs that are not program statements but directives for the pre-processor. These lines are always preceded by a hash sign (#). The pre-processor is executed before the actual compilation of code begins, therefore the pre-processor digests all these directives before any code is generated by the statements. These pre-processor directives extend only across a single line of code. As soon as a newline character is found, the pre-processor directive is considered to end. No semicolon (;) is expected at the end of a pre-processor directive. The only way a pre-processor directive can extend through more than one line is by preceding the newline character at the end of the line by a backslash (\).
A macro is a way of expressing a lot of code or data with a simple shorthand, and it is usually configurable. Traditional macro systems, such as C’s #define mechanism, use textual replacement: a macro is expanded before any evaluation, or even parsing, occurs. “Macro invocations are expanded” means that when the arguments, if any, to a macro call have been collected, the macro is expanded and the replacement text is rescanned for further macro names. The expansion of one macro can therefore result in more macros being expanded, if their names appear, completely or partially, in the first macro’s replacement text. Taking a very simple example:
#define BAR "Hello World!!"
#define FOO BAR
puts(FOO);
Here FOO expands first to BAR, and when that replacement text is rescanned, BAR expands in turn, so the call becomes puts("Hello World!!");.
_Pragma unary operator expressions: A pragma is an implementation-defined instruction to the compiler; its character sequence gives a specific compiler instruction and arguments, if any. A new-line character must terminate a #pragma directive. The token STDC immediately following #pragma indicates a standard pragma, in which case no macro substitution takes place on the directive; otherwise, whether the character sequence on a pragma is subject to macro substitution is implementation-defined. Some implementations allow more than one pragma construct to be specified on a single #pragma directive, and the compiler ignores unrecognized pragmas. The _Pragma operator provides a second route to the same behavior: _Pragma(string-literal) is processed as if the contents of the string literal had appeared in a #pragma directive, which means a pragma can be produced by macro expansion.
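A sketch using a standard C99 pragma; the two forms are equivalent, but only the operator form can be produced by a macro:

#pragma STDC FP_CONTRACT OFF            /* directive form */
_Pragma("STDC FP_CONTRACT OFF")         /* operator form, same effect */

/* Because _Pragma is an operator, it may appear in a replacement list: */
#define NO_CONTRACTION _Pragma("STDC FP_CONTRACT OFF")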
If a character sequence that matches the syntax of a universal character name (UCN) is produced by token concatenation, the behavior is undefined. The C Standard allows UCNs to be interpreted and converted into internal character form either in translation phase 1 or in translation phase 5. If an implementation chooses to convert UCNs in translation phase 1, it makes no sense to require it to perform another conversion in translation phase 4. This behavior is different from that of other forms of preprocessing token. For instance, the behavior of concatenating two integer constants is well defined, as is concatenating the two preprocessing tokens whose character sequences are 0x and 123 to create a hexadecimal constant. Once this phase of translation has been reached, the sequence of characters needed to form a UCN is not intended to be manipulated in units smaller than the UCN itself.
The #include pre-processing directives are processed in depth-first order: once an #include has been fully processed, translation continues in the file that #included it. There is a limit on how deeply #includes can be nested. Just prior to starting to process the #include, the macros __FILE__ and __LINE__ are set, and they are reset when processing resumes. The effect of this processing is that phase 5 sees a continuous sequence of pre-processing tokens; these tokens do not need to maintain any information about the source file they originated from. Also, macro definitions have no significance after translation phase 4. Pre-processing directives have their own syntax, which does not connect to the syntax of the C language proper. The pre-processing directives control the creation of the pre-processing tokens that are handed on to subsequent phases; the directives themselves do not get past phase 4.
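These predefined macros are commonly put to work in diagnostic code; a minimal sketch (the LOG name is illustrative):

#include <stdio.h>

#define LOG(msg) fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, msg)

/* A call such as LOG("opening input file") reports the source file name
   and line number of the call site, which is why the translator must save
   and restore these macros around each #include. */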
5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation defined member other than the null (wide) character.
The execution character set is used by the host on which the translated program will execute. Differences between the values of character set members in the translation and execution environments can become visible when the same relationship between character constants is tested both in a #if pre-processing directive and in a controlling expression evaluated at execution time; C99 makes it implementation-defined whether a character constant has the same value in the two contexts. All source character set members and escape sequences which have no corresponding execution character set member may be converted to the same member, or they may be converted to different members.
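A sketch of where such a difference could surface; it is implementation-defined whether the two tests below agree:

#include <stdio.h>

int main(void)
{
#if 'A' == 65                  /* evaluated in the translation environment */
    puts("translation-time: 'A' is 65");
#endif
    if ('A' == 65)             /* evaluated with execution character set values */
        puts("execution-time: 'A' is 65");
    return 0;
}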
You need to distinguish between the source character set, the execution character set, the wide execution character set, and their basic versions:
The basic source character set consists of 96 characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
The remaining 5 characters are space, horizontal tab, vertical tab, form feed and new-line.
Suppose, purely for illustration, that the source character set uses these invented 7-bit encodings (they are not ASCII):
A => 0000000
B => 0100100
C => 0011101
The basic execution wide character set is used for wide characters (wchar_t). It contains basically the same members as the basic execution character set, but they can have different binary representations. Again with invented encodings:
A => 1011010101010110101010
B => 0000100010110101011111
C => 1010100101101000011011
null => 0000000000000000000000
Backspace => 1111110001100000000001
The only fixed member is the null character, which must be a sequence of 0 bits.
Example:
const char *string0 = "BA\bC";
const wchar_t *string1 = L"BA\bC";
Since string0 is a narrow character string literal it will be converted to the basic execution character set, and string1 will be converted to the basic execution wide character set:
string0 => 00001000101 10110101010 11111100011 10101011111
string1 => 0000100010110101011111 1011010101010110101010 1111110001100000000001 1010100101101000011011
(Each group encodes, in order, B, A, backspace and C; the narrow patterns are a further invented encoding.)
There are several kinds of file encoding. For example, ASCII, which uses 7 bits per character, and Windows-1252 (commonly known as ANSI), which uses 8. ASCII doesn't contain non-English characters; Windows-1252 contains some European characters like ä Ö Õ ø. Newer encodings like UTF-8 or UTF-32 can contain characters of any language: UTF-8 characters are variable in length, while UTF-32 characters are 32 bits long.
Characters not included in the basic (wide) character set belong to the extended execution (wide) character set. Remember that the compiler converts characters from the source character set to the execution character set and to the execution wide character set, so there needs to be a way for these characters to be converted.
6. Adjacent string literal tokens are concatenated.
This concatenation only applies to string literals. It is not a general operation on objects having an array of char. String literal preprocessing tokens do not have a terminating null character. That is added in the next translation phase.
It was a constraint violation to concatenate the two types of string literal in C90; character and wide string literals are treated on the same footing in C99. The introduction of the macros for I/O format specifiers in C99 created the potential need to support the concatenation of character string literals with wide string literals. These macros are required to expand to character string literals, and a program that wanted to use them in a format specifier containing wide character string literals would have been unable to do so without this change of specification.
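The format-specifier macros from <inttypes.h> show why this mattered; a minimal sketch:

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    int64_t big = INT64_C(9000000000);
    /* PRId64 expands to a character string literal (for example "lld");
       translation phase 6 then concatenates the three adjacent literals
       into a single format string. */
    printf("big = %" PRId64 "\n", big);
    return 0;
}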
7. White-space characters separating tokens are no longer significant. Each pre-processing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
White-space is not part of the syntax of the C language. It is only significant in separating characters in the lexical grammar and in some contexts in the preprocessor. This statement could have equally occurred in translation phase 5. These are the tokens seen by the C language parser. It is possible for a preprocessing token to not be convertible to a token. For instance, in:
float f = 1.2.3.4.5;
1.2.3.4.5 is a valid pre-processing token; it is a pp-number. However, it is not a valid token. Pre-processing tokens that are skipped as part of conditional compilation need never be converted to tokens:
#if 0
float f = 1.2.3.4.5;                         /* Never converted. */
#endif
A preprocessing token that cannot be converted to a token is likely to cause a diagnostic to be issued. At the very least, there will be a syntax violation. It is a quality-of-implementation issue as to whether the translator issues a diagnostic for the failed conversion.
This is the phase where the executable code usually gets generated. Coding guidelines are not just about how the semantics phase of translation processes its input. Previous phases of translation are also important, particularly pre-processing. The visible source, the input to translation phase 1, is probably the most important topic for coding guidelines.
8. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
This last phase is normally carried out by a program commonly known as a linker. The code generated for a single translation unit invariably contains unresolved references to external objects and functions. The tool used to resolve these references, the linker, may be provided by the implementation vendor or it may be supplied as part of the host environment.
The term library components is a broad one and can include previously translated translation units. The implementation is responsible for linking in any of its provided C Standard library functions, if needed. The wording of this requirement can be read to imply that all external references must be satisfied; this would require definitions to exist for objects and functions even if they were never referenced during program execution, and it is the behavior enforced by most linkers. Since there is no behavior defined for the case where a reference cannot be resolved, it is implicitly undefined behavior.
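A minimal sketch of an unresolved reference (the function name is illustrative):

extern int missing(void);   /* declared here, but defined in no translation unit */

int main(void)
{
    return missing();       /* typically rejected at link time with an
                               "undefined reference" style diagnostic */
}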
The two commonly used linking mechanisms are static and dynamic linking. Static linking creates a program image that contains all of the function and object definitions needed to execute the program. A statically linked program image has no need for any other library functions at execution time. This method of linking is suitable when distributing programs to host environments where very little can be assumed about the availability of support libraries. If many program images are stored on a host, this method has the disadvantage of keeping a copy of library functions that are common to many programs; also, if any library is updated, all programs need to be re-linked. Dynamic linking does not copy library functions into the program image. Instead, a call to a dynamic linking system function is inserted. When a dynamically linked function is first called, the call is routed via the dynamic link loader, which resolves the reference to the desired function and patches the executing program so that subsequent calls jump directly to that routine. Provided there is support from the host OS, all running programs can access a single copy of the library functions in memory. On hosts with many programs running concurrently this can have a large impact on memory performance. Any library updates are picked up automatically, and the program image lets the dynamic loader decide which version of a library to call.
Several requirements can influence how a program image is built, including the following:
  • Developer convenience: Like all computer users, developers want things to be done as quickly as possible. During the coding and debugging of a program, its image is likely to be built many times per hour. Having a translator that builds this image quickly is usually seen as a high priority by developers.
  • Support for symbolic debugging: Here information on the names of functions, source line number to machine code offset correspondence, the object identifiers corresponding to storage locations, and other kinds of source-related mappings needs to be available in the program image. Requiring that it be possible to map machine code back to source code can have a significant impact on the optimizations that can be performed. Research on how to optimize and include symbolic debugging information has been going on for nearly 20 years. Modern approaches are starting to provide quality optimizations while still being able to map code and data locations.
  • Speed of program execution: Users of applications want them to execute as quickly as possible. The organization of a program image can have a significant impact on execution performance.
  • Hiding information: Information on the internal workings of a program may be of interest to several people. The owner of the program image may want to make it difficult to obtain this information from the program image.
  • Distrust of executable programs: Executing an unknown program carries several risks. It may contain a virus, it may contain a Trojan that attempts to obtain confidential information, or it may consume large amounts of system resources, among a variety of other undesirable actions. The author of a program may want to provide some guarantees about a program, or some mechanism for checking its integrity. There has been some recent research on translated programs including function specification and invariant information about themselves, so-called proof-carrying programs. On the whole the commonly used approach is for the host environment to ring-fence an executing program as best it can, although researchers have started to look at checking program images before loading them, particularly into a trusted environment.

1.1.3. Diagnostics

A conforming implementation shall produce at least one diagnostic message (identified in an implementation – defined manner) if a preprocessing translation unit or translation unit contains a violation of any syntax rule or constraint, even if the behavior is also explicitly specified as undefined or implementation-defined. Diagnostic messages need not be produced in other circumstances.
The first violation may put the translator into a state where there are further, cascading violations. The extent to which a translator can recover from a violation and continue to process the source in a meaningful way is a quality-of-implementation issue. The standard says nothing about what constitutes a diagnostic message. Although each implementation is required to document how they identify such messages, playing a different tune to represent each constraint or syntax violation is one possibility. Such an implementation decision might be considered to have a low quality-of-implementation.
The Rationale uses the term erroneous program in the context of a program containing a syntax error or constraint violation. Developers discussing C often use the term error and erroneous, but the standard does not define these terms. Traditionally C compilers have operated in a single pass over the source, with fairly localized error recovery. Constraint violations during preprocessing can be difficult to localize because of the unstructured nature of what needs to be done. If there is a separate program for preprocessing, it will usually be necessary to remove all constraint violations detected during preprocessing. Once accepted by the preprocessor the resulting token stream can be syntactically and semantically analyzed. Constraint violations that occur because of semantic requirements tend not to result in further, cascading, violations.
The production of diagnostics in other circumstances is a quality-of-implementation issue. Implementations are free to produce any number of diagnostics for any reason, but they are not required to do so. A guideline recommendation of the form “The translator shall be run in a configuration that maximizes the likelihood of it generating useful diagnostic messages.” is outside the scope of these coding guidelines. A guideline recommendation of the form “The source code shall not cause the translator to generate any diagnostics, or there shall be rationale (in a source code comment) for why the code has not been modified to stop them occurring.” is also outside the scope of these coding guidelines.