Topic: real numbers and floating point numbers





A real number can be constructed as the least upper bound of a bounded set, or segment, of rational ratios. The number is irrational if the segment has no boundary among the rationals.

A floating point number approximates a real number with a bounded number of bits. Floating point numbers are widely used for scientific and graphics programming. IEEE Standard 754 is a formal specification of floating point numbers. It is implemented by SANE and supported by Modula-3. The standard carefully defines roundoff, floating point formats, exceptions, infinity, and denormalized numbers. These definitions allow careful analysis of an algorithm.
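A minimal Python sketch of the behaviors the standard defines. It assumes the host's `float` is an IEEE 754 binary64 value, which is common for CPython but is an assumption here, and the variable names are illustrative only.

```python
import math
import sys

# Roundoff: 0.1 and 0.2 have no exact binary representation, so their
# rounded sum differs from the double nearest to 0.3.
roundoff_visible = (0.1 + 0.2 != 0.3)

# Infinity: a multiplication that overflows yields inf instead of wrapping.
overflowed = sys.float_info.max * 2.0

# Unordered comparison: a NaN compares unequal even to itself.
nan = float("nan")
nan_equals_itself = (nan == nan)

# Denormalized (subnormal) numbers lie between zero and the smallest
# normalized double, at reduced precision.
smallest_normal = sys.float_info.min
subnormal = smallest_normal / 4.0

print(roundoff_visible, overflowed, nan_equals_itself,
      0.0 < subnormal < smallest_normal)
```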

Fixed point fractions and rational ratio approximation may be used in place of floating point numbers. The EDSAC represented all reals as fixed point fractions between -1 and 1. EDSAC programmers found scaling difficult and could use a floating point interpreter instead.
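The rational ratio approximation mentioned above can be sketched with integer arithmetic alone. The ratio 355/113 for pi and the helper name `times_pi` are assumptions chosen for illustration. Python integers are unbounded, so the double-length intermediate product that a fixed-width machine would need is implicit here.

```python
import math

# Assumed ratio for illustration: 355/113 approximates pi.
PI_NUM, PI_DEN = 355, 113

def times_pi(i: int) -> int:
    """Scale a non-negative integer i by pi using the form i*j/k."""
    # Multiply first so the intermediate keeps full precision, then divide.
    # Adding half the divisor rounds to nearest instead of truncating.
    return (i * PI_NUM + PI_DEN // 2) // PI_DEN

# Relative error of the ratio itself, measured against math.pi.
relative_error = abs(PI_NUM / PI_DEN - math.pi) / math.pi

print(times_pi(1000), relative_error)
```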

Also mentioned: multi-precision arithmetic, real to string conversion, and arithmetic coding. (cbb 4/98)

Subtopic: real numbers

Quote: construct the real numbers as segments of series of rational ratios in order of magnitude; an irrational number is a segment without boundary [»russB_1919, OK]
Quote: every bounded subset of the reals has a least upper bound [»simmGF_1963]
Quote: every rational and irrational number is a symbol for a cut that divides the real numbers into two
Quote: the computable numbers are the real numbers whose decimal expression is calculable by finite means; because human memory is limited [»turiAM11_1936]
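The construction of a real number as a segment of rationals can be illustrated with a small Python sketch. It uses the square root of 2 as an assumed example; `in_segment` and `narrow` are hypothetical names. The segment has no largest rational member, so bisection can bracket the boundary as tightly as desired but never land on it.

```python
from fractions import Fraction

def in_segment(q: Fraction) -> bool:
    """True if rational q lies in the lower segment that defines sqrt(2)."""
    return q < 0 or q * q < 2

def narrow(steps: int):
    """Bisect between a rational inside the segment and one outside it."""
    lo, hi = Fraction(1), Fraction(2)   # 1 is inside, 2 is outside
    for _ in range(steps):
        mid = (lo + hi) / 2
        if in_segment(mid):
            lo = mid
        else:
            hi = mid
    return lo, hi

lo, hi = narrow(20)
print(float(lo), float(hi))
```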

Subtopic: complex numbers

Quote: instantiating 'complex' instantiates two instances of the form 'real' with the names 'r' and 'i' [»wulfWA4_1974]
Quote: for three numbers to be a vector they must be associated with a coordinate system so that rotating the coordinate system rotates the vector [»feynRP_1963]

Subtopic: floating point

Quote: students preferred drawing in screen pixels until they had written several custom widgets; floating point easier and more abstract [»pausR10_1992]
Quote: real numbers are unrepresentable ideals which are approximated in a computer [»wegnP10_1986, OK]
Quote: interpreter represents numbers as a 24 bit mantissa and a 6 bit exponent; 8 significant decimal digits [»laniJH1_1954]

Subtopic: floating point scale

Quote: floating-point support should include the precision of a number, and conversions between the number and its components [»reidJK6_1980]
Quote: Modula-3 provides three fixed floating-point types for efficiency: real, longreal, and extended
Quote: Modula-3's strict conversions require separate representations for real, longreal, and extended literals; makes it difficult to write generic procedures [»goldD6_1992]
Quote: use one or more 'long' annotations instead of required decimal places; otherwise mismatch between number and its representation [»wirtN6_1966]
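As a sketch of what "the precision of a number, and conversions between the number and its components" can look like in practice, the following uses Python's standard `math.frexp`, `math.ldexp`, and `sys.float_info`. It illustrates the idea and is not the interface proposed in the cited work.

```python
import math
import sys

x = 6.5

# Decompose x into a fraction m in [0.5, 1) and an integer exponent e
# such that x == m * 2**e.
m, e = math.frexp(x)

# Recompose the number from its components.
rebuilt = math.ldexp(m, e)

# Precision of the format: significand bits and the gap just above 1.0.
mantissa_bits = sys.float_info.mant_dig
epsilon = sys.float_info.epsilon

print(m, e, rebuilt, mantissa_bits, epsilon)
```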

Subtopic: floating-point standards

Quote: tutorial on floating point, rounding error, standards, and improved support of floating point [»goldD3_1991]
Quote: programming languages do not fully support IEEE floating-point arithmetic; e.g., rounding direction and floating-point exceptions [»versD3_1997]
Quote: SANE is a thorough implementation of IEEE Standard 754 for binary floating-point arithmetic [»appl_1988]
Quote: SANE supports extended precision, NaNs, Infinities, unordered comparisons, rounding, and floating point exceptions; no signaling NaNs
Quote: floating-point semantics need to allow efficient implementations with strict error bounds for proving algorithms correct [»goldD6_1992]
Quote: floating point in Modula-3 supports forward error analysis with precisely defined rounding operations and exception handling [»goldD6_1992]

Subtopic: floating-point and optimization

Quote: an optimizer should not rearrange the order of floating-point evaluation in any way that changes the computed value or side effects
Quote: in C, all floating arithmetic is carried out in double precision [»ritcDM7_1978c]
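A short Python sketch of why evaluation order matters: floating point addition is not associative, so regrouping can change the computed value. The operands are assumed values picked to make the effect visible with binary64 doubles.

```python
a, b, c = 1e17, -1e17, 1.0

# Evaluated as written: a and b cancel exactly, then c is added.
left_to_right = (a + b) + c

# Regrouped: b + c rounds back to -1e17 because 1.0 is far below the
# spacing between adjacent doubles near 1e17, so c is lost.
regrouped = a + (b + c)

print(left_to_right, regrouped)
```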

Subtopic: conversion

Quote: algorithm for printing floating-point numbers; as free-format, generates shortest string that converts to the same result; multiple rounding modes [»burgRG5_1996]
Quote: efficient algorithm for correctly rounded decimal-to-binary conversion; avoids high-precision arithmetic 99.6% of the time [»clinWD6_1990]
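The free-format idea, printing the shortest string that converts back to the same number, can be observed in Python, whose `repr` of a float is documented to round-trip. This sketch only demonstrates the property; it does not claim that Python uses the algorithms cited above.

```python
x = 0.1

# Shortest digit string that still identifies this double.
shortest = repr(x)

# Fixed format: 17 significant digits always suffice for a binary64
# value, but usually print more digits than necessary.
fixed_17 = format(x, ".17g")

print(shortest, fixed_17)
print(float(shortest) == x, float(fixed_17) == x)
```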

Subtopic: simulating floating point

Quote: use i*j/k for efficient calibration, scaling, and rational approximation; e.g., multiply by pi with an error of 10^-7 [»rathED_1996]
Quote: CORDIC algorithms compute one-bit-at-a-time using small lookup tables, right shifts, and additions; represents numbers by alternating series; good for microcontrollers [»pashM9_2000]
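A minimal CORDIC sketch in rotation mode, assuming a 16-bit fractional fixed-point format and 16 iterations. These parameters and all names are illustrative choices, not taken from the cited article. The lookup table and gain are built once with floating point here for brevity; a microcontroller would store them as constants. The main loop uses only shifts, additions, and table lookups.

```python
import math

FRAC_BITS = 16                  # assumed fixed-point format
ONE = 1 << FRAC_BITS
ITERATIONS = 16

# Small lookup table: atan(2**-i) in fixed point.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# Each micro-rotation stretches the vector slightly; start from 1/gain
# so the final vector has unit length.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
X0 = round(ONE / GAIN)

def cordic_sin_cos(angle: float):
    """Return (sin, cos) for an angle in [-pi/2, pi/2] radians."""
    x, y, z = X0, 0, round(angle * ONE)
    for i in range(ITERATIONS):
        # Rotate toward the residual angle z by +/- atan(2**-i); the
        # angle is consumed as a series of signed table entries.
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y / ONE, x / ONE

s, c = cordic_sin_cos(0.5)
print(s, c)
```

The result should agree with `math.sin` and `math.cos` to roughly three decimal places at this word size; more fractional bits and iterations tighten the agreement.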

Subtopic: fixed point

Quote: Ada has fixed-point numbers since commonly used in peripheral devices such as analog-to-digital converters [»maclBJ_1987]
Quote: an Ada fixed-point constraint gives a range and a maximum delta (the absolute error bound) [»maclBJ_1987]
QuoteRef: rtl2 ;;fixed point fractions ('x' .lt. 1) with double-length intermediates, e.g., big integers, fine integers, and fine fractions
QuoteRef: clouMJ7_1983 ;;analysis of single-precision fixed point arithmetic for computing arithmetic means
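A minimal Python sketch of fixed-point arithmetic with a stated delta, assuming a delta of 2**-8. The helper names are hypothetical. Values are stored as scaled integers; a product carries twice the fractional bits (the double-length intermediate) and is shifted back.

```python
DELTA_BITS = 8                  # assumed: delta = 2**-8
SCALE = 1 << DELTA_BITS

def to_fixed(value: float) -> int:
    """Convert to the nearest multiple of delta, stored as an integer."""
    return round(value * SCALE)

def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values."""
    # a * b has 2 * DELTA_BITS fractional bits; shift back to DELTA_BITS.
    return (a * b) >> DELTA_BITS

def to_float(f: int) -> float:
    """Convert a fixed-point value back to a float for display."""
    return f / SCALE

product = to_float(fixed_mul(to_fixed(1.5), to_fixed(2.25)))
conversion_error = abs(to_float(to_fixed(0.1)) - 0.1)

print(product, conversion_error)
```

Addition needs no rescaling because both operands share the same scale; only multiplication and division change the number of fractional bits.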

Subtopic: multi-precision arithmetic

Quote: gives a fast O(n^2) algorithm for division of multi-precision floating-point numbers; as fast as multiplication; accuracy to machine epsilon [»ozawK3_1991]
Quote: fast arbitrary-precision addition and multiplication up to a thousand bits; uses floating point numbers; adaptive
Quote: represent arbitrary precision floating point numbers with multiple, non-overlapping terms; e.g., 1100 - 10.1 [»shewJR5_1996]
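The building block behind multiple-term representations is an error-free addition that returns both the rounded sum and the roundoff that was discarded. The sketch below is the classic two-sum step in Python; it shows only that step, not the full adaptive algorithms of the cited work, and it assumes round-to-nearest binary64 arithmetic with no overflow.

```python
from fractions import Fraction

def two_sum(a: float, b: float):
    """Return (s, err) where s is the rounded sum and s + err == a + b exactly."""
    s = a + b
    b_virtual = s - a            # the part of b that made it into s
    a_virtual = s - b_virtual    # the part of a that made it into s
    err = (a - a_virtual) + (b - b_virtual)
    return s, err

# 1.0 is too small to register in a double near 1e17, so it survives
# as the second term of a two-term expansion.
s, err = two_sum(1e17, 1.0)
print(s, err)
```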

Subtopic: arithmetic coding

Quote: arithmetic coding represents a message by an interval of real numbers; allows fractional bits for a symbol [»wittIH6_1987]
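A small Python sketch of representing a message by an interval. The three-symbol model and its probabilities are hypothetical, and exact `Fraction` arithmetic stands in for the incremental integer arithmetic a practical coder would use. Each symbol narrows the interval in proportion to its probability, so the final width is the product of the symbol probabilities.

```python
from fractions import Fraction

# Hypothetical model: each symbol owns a slice [low, high) of [0, 1).
MODEL = {
    "a": (Fraction(0), Fraction(1, 2)),
    "b": (Fraction(1, 2), Fraction(3, 4)),
    "c": (Fraction(3, 4), Fraction(1)),
}

def encode(message: str):
    """Return the interval [low, high) that represents the message."""
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        s_low, s_high = MODEL[symbol]
        low, width = low + width * s_low, width * (s_high - s_low)
    return low, low + width

def decode(point: Fraction, length: int) -> str:
    """Recover `length` symbols from any point inside the interval."""
    out = []
    for _ in range(length):
        for symbol, (s_low, s_high) in MODEL.items():
            if s_low <= point < s_high:
                out.append(symbol)
                # Rescale the point to the symbol's slice and continue.
                point = (point - s_low) / (s_high - s_low)
                break
    return "".join(out)

low, high = encode("abac")
print(low, high, decode(low, 4))
```

With these power-of-two probabilities the width is 1/64, or six bits for four symbols; with other probabilities the cost per symbol is generally a fractional number of bits.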

Subtopic: history

Quote: interpreter represents numbers as a 24 bit mantissa and a 6 bit exponent; 8 significant decimal digits [»laniJH1_1954]
Quote: use in-line expansion to extend a machine's order code; use interpretative subroutines to reduce memory; e.g., floating point [»wilkMV_1957]
Quote: the EDSAC used 1024 numbers of ultrasonic memory; 17 or 35 binary digits from -1 to 1 [»wilkMV_1951]
Quote: scaling was the most difficult part of programming the EDSAC [»wilkMV_1951]

Related Topics

Group: algorithms   (6 topics, 94 quotes)
Group: mathematics   (23 topics, 560 quotes)

Topic: continuum in mathematics (7 items)
Topic: discrete vs. continuous (47 items)
Topic: FOCUS number system (8 items)
Topic: geometry (33 items)
Topic: integer values and operations (13 items)
Topic: kinds of numbers (24 items)
Topic: numerical error (19 items)
Topic: science as measurement (36 items)
Topic: type conversion (33 items)
Topic: unbounded precision (9 items)
Topic: units (23 items)
Topic: value as an abstraction (25 items)

Updated barberCB 12/04
Copyright © 2002-2008 by C. Bradford Barber. All rights reserved.
Thesa is a trademark of C. Bradford Barber.