An Open Call for more Standardized Floating Point Number Types



Abstract
The current regime of 32-bit, 64-bit and occasionally 128-bit floating point numbers is nearing its end. This inflexible regime carries both software and hardware costs. In many cases, choosing one type to save memory or computation time incurs an unintended cost later on.

The global floating point standardization processes have largely died since the end of the Cold War. The floating point computational infrastructure in use globally today is a Cold War inheritance; the only update has been the 2008 revision of the IEEE 754 standard.

The costs of this floating point computing infrastructure are beginning to show, although they have been well known since the beginning. The most noticeable issues are floating point emulation and the validity of numbers transferred between the current floating point regimes.


A brief history of floating point numbers

Leonardo Torres y Quevedo in 1914 designed an electromechanical version of the Analytical Engine of Charles Babbage which included floating-point arithmetic.

In 1938, Konrad Zuse of Berlin completed the Z1, the first mechanical binary programmable computer. The Z1 was however unreliable in operation.

The floating point math units of the Z1 (but primarily of the Z2 and Z3) worked with 22-bit binary floating-point numbers having a 7-bit signed exponent, a 15-bit significand (including one implicit bit), and a sign bit. The memory used sliding metal parts to store 64 words of such numbers. The relay-based Z3, completed in 1941, had representations for plus and minus infinity. It implemented defined operations with infinity such as 1/∞ = 0 and stopped on undefined operations like 0×∞. The Z3 also implemented the square root operation in hardware.

Konrad Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that would have included ±∞ and NaNs. These innovations anticipated the features of the IEEE Standard floating-point by four decades. By contrast, von Neumann recommended against floating point for the 1951 IAS machine, arguing that fixed point arithmetic was preferable.

The first commercial computer with floating point hardware was Zuse's Z4 computer designed in 1942–1945. The Bell Laboratories Mark V computer implemented decimal floating point in 1946.

The Pilot ACE computer, which became operational at the National Physical Laboratory, UK in 1950, had binary floating point arithmetic. A total of 33 were later sold commercially as the English Electric DEUCE. The arithmetic was actually implemented as subroutines, but with a one megahertz clock rate, both floating point and fixed point operations were initially faster than on many competing computers. Since the arithmetic was only software, every DEUCE had it.

The mass-produced vacuum tube-based IBM 704 followed in 1954; it introduced the use of a biased exponent. For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. It was not until the launch of Intel's i486 and Motorola's 68040 that general-purpose personal computers had floating point capability in hardware available as a standard feature.

The UNIVAC 1100/2200 series, introduced in 1962, supported two floating-point formats. Single precision used 36 bits, organized into a 1-bit sign, an 8-bit exponent, and a 27-bit significand. Double precision used 72 bits organized as a 1-bit sign, an 11-bit exponent, and a 60-bit significand. The IBM 7094, introduced the same year, also supported single and double precision, with slightly different formats.

Prior to the IEEE-754 standard, computers used many different forms of floating-point. These differed in the word sizes, the format of the representations, and the rounding behaviour of operations. These differing systems implemented different parts of the arithmetic in hardware and software, with varying accuracy.

The IEEE-754 standard was created in the early 1980s after word sizes of 32 bits (or 16 or 64) had been generally settled upon. It was based on a proposal from Intel, which was designing the i8087 numerical coprocessor. Prof. W. Kahan was the primary architect behind this proposal, working with his student Jerome Coonen at U.C. Berkeley and visiting Prof. Harold Stone; for this work Kahan was awarded the 1989 Turing Award.

Among the x86 and 68000 innovations are these:

The Current Standardisation Crisis


Basic formats

The standard defines five basic formats that are named for their numeric base and the number of bits used in their interchange encoding. There are three binary floating-point basic formats (encoded with 32, 64 or 128 bits) and two decimal floating-point basic formats (encoded with 64 or 128 bits).

The binary32 and binary64 formats are the single and double formats of IEEE 754-1985. A conforming implementation must fully implement at least one of the basic formats.

The precision of each basic binary format is one bit more than the width of its stored significand field. The extra bit of precision comes from an implied (hidden) leading 1 bit. A typical floating point number is normalized so that its most significant bit is a one. Since the leading bit is then known to be one, it need not be encoded in the interchange format.

Name       Common name          Base  Digits  E min    E max    Decimal digits  Decimal E max  Notes
binary16   Half precision       2     10+1    −14      +15      3.31            4.51           IEEE
binary24   Float24              2                                                              Proposed
binary32   Single precision     2     23+1    −126     +127     7.22            38.23          IEEE
binary40   Float40              2                                                              Proposed
binary64   Double precision     2     52+1    −1022    +1023    15.95           307.95         IEEE
binary80   Float80              2                                                              Proposed
binary128  Quadruple precision  2     112+1   −16382   +16383   34.02           4931.77        IEEE


Decimal digits is digits × log10(base); this gives an approximate precision in decimal.

Decimal E max is E max × log10(base); this gives the maximum exponent in decimal.
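The two conversions above are simple enough to check by hand. The following is a minimal sketch in Python; the function names and the `IEEE_ROWS` list are illustrative and not part of any standard. The digit counts include the hidden bit, as in the "Digits" column.

```python
import math

def decimal_digits(digits, base=2):
    """Approximate decimal precision: digits * log10(base)."""
    return digits * math.log10(base)

def decimal_e_max(e_max, base=2):
    """Approximate largest decimal exponent: e_max * log10(base)."""
    return e_max * math.log10(base)

# (name, significand digits including the hidden bit, E max) for the IEEE rows
IEEE_ROWS = [
    ("binary16", 11, 15),
    ("binary32", 24, 127),
    ("binary64", 53, 1023),
    ("binary128", 113, 16383),
]

for name, digits, e_max in IEEE_ROWS:
    print(f"{name:10s} {decimal_digits(digits):7.2f} {decimal_e_max(e_max):9.2f}")
```

The same two functions can be applied to any candidate format once its significand width and E max are fixed.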



Internal representation

Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right.


For the IEEE 754 binary formats (basic and extended) that have extant hardware implementations, and for the formats proposed here, the bits are apportioned as follows:

Type (common name)    Bytes  Sign bits  Exponent bits  Significand bits  Total bits  Exponent bias       Bits precision  Decimal digits  Official status  Notes
Half (IEEE 754-2008)  2      1          5              10                16          15                  11              ~3.3            IEEE-754         Experimental
Float24               3      1          5              18                24          15                  19              ~               Proposed
Single                4      1          8              23                32          127                 24              ~7.2            IEEE-754
Float40               5      1          10             29                40          511 (provisional)                   ~               Proposed
Double                8      1          11             52                64          1023                53              ~15.9           IEEE-754         Also IBM
Float72               9      1          11             60                72          1023 (provisional)                  ~               Proposed
Double extended       10     1          15             64                80          16383               64              ~19.2           IEEE-754         Also IBM
Quad                  16     1          15             112               128         16383               113             ~34.0           IEEE-754         Also IBM

Total bits is bytes × 8. Decimal digits is the approximate number of expressible decimal digits.


While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. A field of all 0s is reserved for the zeros and subnormal numbers; a field of all 1s is reserved for the infinities and NaNs. The exponent range for normalized numbers is [−126, 127] for single precision, [−1022, 1023] for double, or [−16382, 16383] for quad. Normalized numbers exclude subnormal values, zeros, infinities, and NaNs.

In the IEEE binary interchange formats the leading 1 bit of a normalized significand is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, single precision format actually has a significand with 24 bits of precision, double precision format has 53, and quad has 113.
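To make the biased exponent and the hidden bit concrete, here is a minimal sketch that splits a single precision value into its three fields and rebuilds it. It uses Python's standard `struct` module; the helper names `decode_binary32` and `value_of_normal` are illustrative only.

```python
import struct

def decode_binary32(x):
    """Split the binary32 encoding of x into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
    fraction = bits & 0x7FFFFF              # 23 stored significand bits
    return sign, biased_exponent, fraction

def value_of_normal(sign, biased_exponent, fraction):
    """Rebuild a normalized binary32 value, restoring the hidden leading 1."""
    if not 0 < biased_exponent < 0xFF:
        raise ValueError("all 0s and all 1s are reserved exponent fields")
    significand = 1 + fraction / 2**23      # 24 bits of precision in total
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 127)

sign, e, f = decode_binary32(-6.5)
print(sign, e, e - 127, f"{f:023b}")
print(value_of_normal(sign, e, f))
```

For −6.5 = −1.625 × 2^2 the stored exponent field is 2 + 127 = 129, and the leading 1 of 1.625 never appears in the 23 stored bits.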

Multiple forms of floating point representation are possible, and IEEE 754-2008 permits both the "Exponent + Significand" form and the "Decimal Representation" form of number encoding.


The IEEE 754-2008 standard defines 32-bit, 64-bit and 128-bit decimal floating-point representations.


Like the binary floating-point formats, the number is divided into a sign, an exponent, and a significand. Unlike binary floating-point, numbers are not necessarily normalized; values with few significant digits have multiple possible representations: 1×10^2 = 0.1×10^3 = 0.01×10^4, etc.
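The effect of unnormalized representations can be seen with Python's standard `decimal` module. That module follows the General Decimal Arithmetic specification rather than the IEEE 754-2008 interchange encodings, so this is only an illustration of the idea, with the coefficient held as an integer: the three values below compare equal, yet each keeps its own coefficient and exponent.

```python
from decimal import Decimal

# Three representations of the same number: equal in value,
# different in (coefficient, exponent).
cohort = [Decimal("1E2"), Decimal("10E1"), Decimal("100E0")]

for d in cohort:
    t = d.as_tuple()
    print(f"coefficient digits={t.digits} exponent={t.exponent}")

print(cohort[0] == cohort[1] == cohort[2])
```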

When the significand is zero, the exponent can be any value at all.

IEEE 754-2008 decimal floating-point formats

Format                                        decimal32  decimal64  decimal128  decimal(32k)
Sign field (bits)                             1          1          1           1
Combination field (bits)                      5          5          5           5
Exponent continuation field (bits)            6          8          12          w = 2×k + 4
Coefficient continuation field (bits)         20         50         110         t = 30×k − 10
Total size (bits)                             32         64         128         32×k
Coefficient size (decimal digits)             7          16         34          p = 3×t/10 + 1 = 9×k − 2
Exponent range                                192        768        12288       3×2^w = 48×4^k
Largest value is 9.99...×10^Emax              96         384        6144        Emax = 3×2^(w−1)
Smallest normalized value is 1.00...×10^Emin  −95        −383       −6143       Emin = 1 − Emax
Smallest non-zero value is 1×10^Etiny         −101       −398       −6176       Etiny = 2 − p − Emax
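
The general decimal(32k) column can be checked against the three concrete formats. The sketch below evaluates the table's formulas for a given k, with the exponents written out explicitly as powers; the function name and dictionary keys are illustrative.

```python
def decimal_format_parameters(k):
    """Evaluate the decimal(32k) formulas from the table for a given k."""
    w = 2 * k + 4              # exponent continuation field (bits)
    t = 30 * k - 10            # coefficient continuation field (bits)
    p = 3 * t // 10 + 1        # coefficient size (decimal digits), equals 9k - 2
    e_max = 3 * 2 ** (w - 1)
    return {
        "total_bits": 1 + 5 + w + t,   # sign + combination + continuations
        "digits": p,
        "exponent_range": 3 * 2 ** w,
        "e_max": e_max,
        "e_min": 1 - e_max,
        "e_tiny": 2 - p - e_max,
    }

for k in (1, 2, 4):
    print(f"decimal{32 * k}:", decimal_format_parameters(k))
```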


The exponent ranges were chosen so that the range available to normalized values is approximately symmetrical. Since this cannot be done exactly with an even number of possible exponent values, the extra value was given to Emax.

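Two different representations are defined: one encodes the significand as a binary integer, the other as a compressed sequence of decimal digits (densely packed decimal).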

Both alternatives provide exactly the same range of representable values.

The most significant two bits of the exponent are limited to the range 0 to 2, and the most significant 4 bits of the significand are limited to the range 0 to 9. The 30 possible combinations (3 × 10) are encoded in a 5-bit field, along with special forms for infinity and NaN.
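
As an illustration of how 30 combinations plus the special forms fit into 5 bits, here is a minimal sketch of a decoder. It assumes the combination field layout used with the densely packed decimal encoding, in which a field starting with anything other than 11 holds two exponent bits followed by a digit 0 to 7, and a field starting with 11 holds two exponent bits followed by one bit selecting the digit 8 or 9. The function name and the returned tuples are illustrative.

```python
def decode_combination_field(g):
    """Decode a 5-bit combination field.

    Returns ("finite", exponent_msbs, leading_digit), ("infinity",) or ("nan",).
    """
    if not 0 <= g < 32:
        raise ValueError("the combination field is 5 bits wide")
    if g == 0b11110:
        return ("infinity",)
    if g == 0b11111:
        return ("nan",)
    if g >> 3 != 0b11:
        # Form 'ee ddd': two exponent bits, then a leading digit 0 to 7
        return ("finite", g >> 3, g & 0b111)
    # Form '11 ee d': two exponent bits, leading digit is 8 or 9
    return ("finite", (g >> 1) & 0b11, 8 + (g & 1))

decoded = [decode_combination_field(g) for g in range(32)]
finite = [d for d in decoded if d[0] == "finite"]
print(len(finite), "finite combinations,", len(decoded) - len(finite), "special forms")
```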


Extended and extendable precision formats

The standard specifies extended and extendable precision formats, which are recommended for allowing a greater precision than that provided by the basic formats.

An extended precision format extends a basic format by using more precision and more exponent range. An extendable precision format allows the user to specify the precision and exponent range. An implementation may use whatever internal representation it chooses for such formats; all that needs to be defined are its parameters (b, p, and emax). These parameters uniquely describe the set of finite numbers (combinations of sign, significand, and exponent for the given radix) that it can represent.
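
To show how the three parameters pin down the range of a format, the sketch below derives the largest finite value and the smallest positive normal value from (b, p, emax) using exact rational arithmetic. It assumes emin = 1 − emax, and the function names are illustrative. Python floats are binary64 on essentially all current platforms, so the binary64 results can be compared with `sys.float_info`.

```python
from fractions import Fraction
import sys

def largest_finite(b, p, emax):
    """Largest finite value: (b - b**(1 - p)) * b**emax."""
    return (b - Fraction(1, b ** (p - 1))) * Fraction(b) ** emax

def smallest_normal(b, emax):
    """Smallest positive normal value: b**emin, assuming emin = 1 - emax."""
    return Fraction(b) ** (1 - emax)

# binary64 is (b, p, emax) = (2, 53, 1023)
print(float(largest_finite(2, 53, 1023)) == sys.float_info.max)
print(float(smallest_normal(2, 1023)) == sys.float_info.min)
```

Because the computation depends only on (b, p, emax), the same two functions apply unchanged to a decimal format or to a hypothetical width that no hardware implements.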

The standard does not require an implementation to support extended or extendable precision formats.

The standard recommends that languages provide a method of specifying p and emax for each supported base b.

The standard recommends that languages and implementations support an extended format which has a greater precision than the largest basic format supported for each radix b.

For an extended format with a precision between two basic formats the exponent range must be as great as that of the next wider basic format. So for instance a 64-bit extended precision binary number must have an 'emax' of at least 16383. The x87 80-bit extended format meets this requirement.

Interchange formats

Interchange formats are intended for the exchange of floating-point data using a fixed-length bit-string for a given format.

For the exchange of binary floating-point numbers, interchange formats of length 16 bits, 32 bits, 64 bits, and any multiple of 32 bits ≥128 are defined.
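
For the wider binary interchange formats the field widths follow from a formula rather than a fixed list. The sketch below assumes the rule given in IEEE 754-2008 for a k-bit format, w = round(4 × log2(k)) − 13 exponent bits, with the remaining k − w − 1 bits holding the stored significand. The function name is illustrative.

```python
import math

def binary_interchange_fields(k):
    """Return (exponent bits, stored significand bits) for a k-bit format."""
    if k < 128 or k % 32 != 0:
        raise ValueError("the formula applies to multiples of 32 bits >= 128")
    w = round(4 * math.log2(k)) - 13   # exponent field width
    t = k - w - 1                      # stored significand width (1 sign bit)
    return w, t

for k in (128, 160, 256):
    print(k, binary_interchange_fields(k))
```

Widths such as 24, 40, 72 or 80 bits fall outside this rule, which is why the sketch rejects them.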

The 16-bit format is intended for the exchange or storage of small numbers (e.g., for graphics).
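
The cost of storing a value in the 16-bit format can be observed directly. The following minimal sketch rounds a Python float to binary16 and widens it back, using the `e` format character of the standard `struct` module (available since Python 3.6). The helper name `through_binary16` is illustrative.

```python
import struct

def through_binary16(x):
    """Round x to binary16 and widen it back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

for x in (1.5, 0.1, 1000.3):
    y = through_binary16(x)
    print(f"{x!r} -> {y!r} (error {abs(x - y):.3g})")
```

Values such as 1.5 that fit in 11 significant bits survive the round trip unchanged; most others come back slightly altered.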

For the exchange of decimal floating-point numbers, interchange formats of any multiple of 32 bits are defined.

The encoding scheme for the decimal interchange formats similarly encodes the sign, exponent, and significand, but the scheme uses a more complex approach to allow the significand to be encoded as a compressed sequence of decimal digits (using densely packed decimal) or as a binary integer. In either case the set of numbers (combinations of sign, significand, and exponent) that may be encoded is identical, and signalling NaNs have a unique encoding (and the same set of possible payloads).




Floating Point Data Types Needed



Standardization issues

Some new types of floating point numbers will have standardization issues, while others will be almost completely immune from them.

Float24
Float32 vs Float24


Float40

Float72, Float80

References
Representation

Hardware




Created by: Max Power
Initial idea: 15 August 2010
Initial version: 12 April 2013
Current version: 14 June 2014
Last revision: Remove decimal format table
Revision state: Initial