In Part 1, we discussed the *standard signed integer types* and the *standard unsigned integer types*, which together are called the *standard integer types*. In Part 2, we discussed *extended integer types, conversion rank, and modulo operators (overflow)*.

## Real Floating types

The ISO C Standard, section 6.2.5, paragraph 10, states:

There are three real floating types, designated as float, double, and long double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.

Numbers come in two broad kinds: integers and numbers with a decimal part. Within a computer, both are represented in binary notation. Disregarding complex numbers, quaternions, and so on, the two kinds are:

- Integer – exact, discrete, countable ordinals/cardinals : +3, -2, 0
- Real – continuous variables, which can be represented in two ways:
  - Fixed-point: -2.345, +31415.9265…, 0.0
  - Floating-point: -3.456E-2, +3.14159265…E+4, 0.0E0

We have already discussed integer numbers in the earlier parts. Here I will use “float” as the generic term for fixed-point and floating-point numbers.

Number representations vary across computer types and languages. Float quantities can only be represented to finite accuracy in a computer; few decimal floats can be represented exactly in binary, and vice versa. In decimal text, “E” (or “e”) is taken as meaning “times 10 to the power of”. Conversion between binary float and decimal float is non-trivial; it is a pity that early man began to count on his upper digits, rather than on only his fingers.

Floating point type sizes and mapping vary from one processor to another. Except for the Intel 80x86 architecture, the extended type maps to the IEEE double type if a hardware floating point co-processor is present. Floating point types have a binary storage format divided into three distinct fields: the *mantissa*, the *exponent* and the *sign bit*, which stores the sign of the floating point value.

Single

The single type occupies 4 bytes of storage space, and its memory structure is the same as the IEEE-754 single type. This type is the only type which is guaranteed to be available on all platforms (either emulated via software or directly via hardware).

Double

The double type occupies 8 bytes of storage space, and its memory structure is the same as the IEEE-754 double type.

On processors which do not support co-processor operations (and which have the {$E+} switch), the double type does not exist.

Extended

For Intel 80x86 processors, the extended type takes up 10 bytes of memory space. For more information on the extended type, consult the Intel Programmer’s reference.

For all other processors which support floating point operations, the extended type is an alias for the type which supports the most precision; this is usually the double type. On processors which do not support co-processor operations (and which have the {$E+} switch), the extended type usually maps to the single type.

The organization of the floating types has a similar structure to that commonly seen in the handling of the integer types short, int, and long. The type double is often thought of as the floating-point equivalent of int. On some implementations it has the same size as the type float, on other implementations it has the same size as the type long double, and on a few implementations its size lies between these two types. One difference between integer and floating types is that in the latter case an implementation is given much greater freedom in how operations on operands having these types are handled. The header `<math.h>` defines the typedefs float_t and double_t for developers who want to define objects having types that correspond to how an implementation performs operations.

*The type long double was introduced in C90. It was not in K&R C.*

The simple approach of using as much accuracy as possible, declaring all floating-point objects as type **long double**, does not guarantee that algorithms will be well-behaved. There is no substitute for careful thought, and this is even more important when dealing with floating-point representation. The type **double** tends to be the floating-point type used by default (rather like the type **int**). Execution-time performance is an issue that developers often think about when dealing with floating-point types; sometimes storage capacity (for large arrays) can also be an issue.

The type **double** has traditionally been the recommended floating type for developers to use by default, although in many cases the type float provides sufficient accuracy. Given the problems that many developers have in correctly using floating types, a more worthwhile choice of guideline recommendation might be to recommend against their use completely. It may be possible to trade execution-time performance against accuracy of the final results, but this is not always the case. For instance, some processors perform all floating-point operations to the same accuracy, and the operations needed to convert between types (less/more accurate) can decrease execution performance.

For processors that operate on formats of different sizes, it is likely that operations on the smaller size will be faster. The question is then whether the developer understands enough about the algorithm to know if the use of a floating-point type with less accuracy will deliver acceptable results. In practice few algorithms, let alone applications, require the stored precision available in the types double or long double. However, a minimum amount of accuracy may be required in the intermediate result of expression evaluation. In some cases the additional range supported by the exponents used in wider types is required by an application. Given the degree of uncertainty about the costs and benefits in using any floating types, this coding guideline subsection does not make any recommendations.

For a 128-bit long double, most IEC 60559 implementations use a 1–15–113 bit format for sign, exponent and significand (the 113-bit significand includes a hidden bit, just like single and double, so 112 bits are actually stored). Some implementations (e.g., GCC on Mac OS X) use two contiguous doubles to represent the type long double. This representation has some characteristics that differ from IEEE representations. For instance, near DBL_MIN no extra precision, compared to the type double, is available; the additional range of values only goes as far up as 2*DBL_MAX; and the interpretation and use of LDBL_EPSILON becomes problematic.

Comparing floating point numbers can also give unexpected results when the two values being compared are very close. Consider:

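    #include <iostream>
    using namespace std;

    int main()
    {
        float fValue1 = 1.345f;
        float fValue2 = 1.123f;
        float fTotal = fValue1 + fValue2; // should be 2.468

        // fTotal is converted to double for the comparison with 2.468
        if (fTotal == 2.468)
            cout << "fTotal is 2.468";
        else
            cout << "fTotal is not 2.468";

        return 0;
    }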

This program prints: fTotal is not 2.468

This result is due to rounding error. fTotal is actually being stored as 2.4679999, which is not 2.468!

For the same reason, the comparison operators >, >=, <, and <= may produce the wrong result when comparing two floating point numbers that are very close.

## Conclusion

To summarize, there are two things you should remember about floating point numbers:

- Floating point numbers offer limited precision. Floats typically offer about 7 significant digits worth of precision, and doubles offer about 16 significant digits. Trying to use more significant digits will result in a loss of precision. (Note: placeholder zeros do not count as significant digits, so a number like 22,000,000,000, or 0.00000033 only counts for 2 digits).
- Floating point numbers often have small rounding errors. Many times these go unnoticed because they are so small, and because the numbers are truncated for output before the error propagates into the part that is not truncated. Regardless, comparisons on floating point numbers may not give the expected results when two numbers are close.