Scientific Programming (in C)

Steve Summit

C was not used very much for floating-point work at first (i.e. back when it was still being developed and refined), so its support for some aspects of floating-point and scientific work is a little weak. (On the other hand, C's support for string processing and other data structures can make up the difference in a program that does some of each.) There are two significant areas of concern. One is specific to C, and concerns the handling of multidimensional arrays. The other is endemic in floating-point work done on computers, and concerns accuracy, precision, and round-off error.

Functions and Multi-dimensional Arrays

We've seen that when a function receives a one-dimensional array as a parameter:

	f(double a[])
the size of the array need not be specified, because the function is not allocating the array and does not need to know how big it is. The array is actually allocated in the caller (and in fact, only a pointer to the beginning of the array is passed to a function like f). If, in the function, we need to know the number of elements in the array, we have to pass it as a second parameter:
	f(double a[], int n)
	{ ... }
If we wish to emphasize that it is really a pointer which is being passed, we can write it like this:
	f(double *a, int n)
	{ ... }
(In fact, if we declare the parameter as an array, the compiler pretends we declared it as a pointer, as in this last form.)

In the one-dimensional case, it is actually a bit of an advantage that we do not need to specify the size of the array. If f does something generic like ``find the average of the numbers in the array,'' it will clearly be useful to be able to call it with arrays of any size. (If we had to write things like

	f(double a[10])
	{ ... }
then we might find it a real nuisance that f could only operate on arrays of size 10; we'd hate to have to write several different versions of f, one for each size.)

Unfortunately, the situation is not nearly so straightforward for multi-dimensional arrays. If you had an array

	double a2[10][20];
which you wanted to pass to a function g, what would g receive? We could declare g as
	g(double a[10][20])
	{ ... }
and let the compiler worry about it--that is, let it ignore the dimensions it doesn't need, and pretend that g really receives a pointer. But what pointer does it receive? It's easy to imagine that g could also be declared as
	g(double **a)		/* WRONG */
	{ ... }
but this is incorrect. Notice that the array a2 that we're trying to pass to g is not really a multidimensional array; it is an array of arrays (more precisely, a one-dimensional array of one-dimensional arrays). So in fact, what g receives is a pointer to the first element of the array, but the first element of the array is itself an array! Therefore, g receives a pointer to an array. Its actual declaration looks like this:
	g(double (*a)[20])
	{ ... }
(and this is what the compiler pretends we'd written if we instead write a[10][20] or a[][20]). This says that the parameter a is a pointer to an array of 20 doubles. We can still use normal-looking subscripts: a[i][j] is the j'th element in the i'th array pointed to by a. The problem is that no matter how we declare g and its array parameter a, we have to specify the second dimension. If we write
	g(double a[][])
the compiler complains that a needed dimension is missing. If we write
	g(double (*a)[])
the compiler complains that a needed dimension is missing. We have to specify the ``width'' of the array, and if we've got various sizes of arrays running around in our program, we're back with the problem of potentially needing different functions to deal with different sizes of arrays.

Why wouldn't

	g(double **a)		/* WRONG */
work? That says that a is a pointer to a pointer, but since we're passing an array of arrays, the function will receive a pointer to an array. A pointer to an array is not compatible with a pointer to a pointer. (More on double **a below.)

A second problem arises if you need any temporary arrays inside a function. Often, a temporary array needs to be the same size as the passed-in arrays being worked on. Unfortunately, there's no way to declare an array and say ``just make it as big as the passed-in array.''

It's easy to imagine a solution to all of the problems we've been discussing. To write a function accepting a multidimensional array with the size specified by the caller, you might want to write

	g(double a[m][n], int m, int n)		/* won't work */
where the array dimensions are taken from other function parameters. (This is, of course, just how you do it in FORTRAN.) To declare a temporary array, you might try things like
	f(double a[n], int n)
		double tmparray[n];		/* won't work */
	g(double a[m][n], int m, int n)
		double tmparray[m][n];		/* won't work */
But you can't do any of these things in standard C. Every array dimension must be a compile-time constant. Array parameters can't take their dimensions from other parameters, and local arrays can't take their dimensions from parameters.

(It is worth noting that at least one popular compiler, namely gcc, does let you do these things: parameter and local arrays can take their dimensions from parameters. If you make use of this extension, just be aware that your program won't work under other compilers.)

How can we work around these limitations? The most powerful and flexible solution involves abandoning multidimensional arrays and using pointers to pointers instead. This approach has the advantage that all dimensions may be specified at run time, though it also has a few disadvantages. (Namely: you may have to remember to free the arrays, especially if they're local; the intermediate pointers take up more space, and accessing ``array'' elements via pointers-to-pointers may be slightly less efficient than true subscript references. On the other hand, on some machines, pointer accesses are more efficient. Another disadvantage is that our treatise on scientific programming in C is about to digress into a treatise on memory allocation.)

To illustrate, here is a pair of functions for allocating simulated one- and two-dimensional arrays of double. (By ``simulated'' I mean that these are not true arrays as the compiler knows them, but because of the way arrays and pointers work in C, we are going to be able to treat these pointers as if they were arrays, by using [] array subscript notation.)

#include <stdio.h>
#include <stdlib.h>

double *alloc1d(int n)
	double *dp = (double *)malloc(n * sizeof(double));
	if(dp == NULL) {
		fprintf(stderr, "out of memory\n");
	return dp;

double **alloc2d(int m, int n)
	int i;
	double **dpp = (double **)malloc(m * sizeof(double *));
	if(dpp == NULL) {
		fprintf(stderr, "out of memory\n");
	for(i = 0; i < m; i++) {
		if((dpp[i] = (double *)malloc(n * sizeof(double))) == NULL) {
			fprintf(stderr, "out of memory\n");
	return dpp;

These routines print an error message and exit if they can't allocate the requested memory. Another approach would be to have them return NULL, and have the caller check the return value and take appropriate action on failure. (In this case, alloc2d would want to ``back out'' by freeing any intermediate pointers it had successfully allocated.)

Note that these routines allocate arrays of type double. They could be trivially rewritten to allocate arrays of other types. (Note, however, that it is impossible in portable, standard C to write a routine which generically allocates a multidimensional array of arbitrary type, or with an arbitrary number of dimensions.)

When a one-dimensional array allocated by alloc1d is no longer needed, it can be freed by passing the pointer to the standard free routine. Freeing two-dimensional arrays allocated by alloc2d is a bit trickier, because all the intermediate pointers need to be freed, too. Here is a function to do it:

void free2d(double **dpp, int m)
	int i;
	for(i = 0; i < m; i++)
		free((void *)dpp[i]);

	free((void *)dpp);
(The casts in the malloc and free calls in these routines are not necessary under ANSI C, but are included to make it easier to port this code to a pre-ANSI compiler, if necessary.)

With these routines in place, we can dynamically allocate one- and two-dimensional arrays of double to our heart's content. Here is an (artificial) example:

	double *a1 = alloc1d(10);
	double **a2 = alloc2d(10, 20);
	a1[0] = 0;
	a1[9] = 9;
	a2[0][0] = 0.0;
	a2[0][19] = 0.19;
	a2[9][0] = 9.0;
	a2[9][19] = 9.19;

Now, when we pass a two-dimensional array allocated by alloc2d to a function, that function will be declared as accepting a pointer-to-a-pointer:

	g(double **a)		/* okay, for alloc2d "arrays" */
When we need a temporary array within a function, we can allocate one with alloc1d or alloc2d; however, we must remember to free it before the function returns. (Remember that malloc'ed memory persists until explicitly freed, or until the program exits. Local variables disappear when the function they're local to returns, but in the case of pointers to malloc'ed memory, that just means that we lose the pointer, and not that the pointed-to memory has been freed. We have to explicitly free it, before we lose the pointer.)

One last note: since

	g1(double a[10][20])
is not compatible with
	g2(double **a)
we will have to maintain a segregation between arrays-of-arrays and pointers-to-pointers. We can't write a single function that operates on either a true multidimensional array or one of our pointer-to-pointer, simulated multidimensional arrays. (Actually, there is one way, but it's more elaborate, and not strictly portable.)

That's about enough (for now, at least) about multidimensional arrays. We now turn to the subject of...

Floating-Point Precision

It's the essence of real numbers that there are infinitely many of them: not only infinitely large (in both the positive and negative direction), but infinitely small and infinitely fine. Between any two real numbers like 1.1 and 1.2 there are infinitely many more numbers: 1.11, 1.12, etc., and also 1.101, 1.102, etc., and also 1.1001, 1.1002, etc.; and etc. There is no way in a finite amount of space to represent the infinite granularity of real numbers. Most computers use an approximation called floating point. In a floating point representation, we keep track of a base number or ``mantissa'' with some degree of precision, and a magnitude or exponent. The parallel here with scientific notation is close: The number actually represented is something like mantissa x 10<sup>exponent</sup>. (Actually, most computers use mantissa x 2<sup>exponent</sup>, and we'll soon see how this difference can be important.)

The advantage of a floating-point representation is that it lets us devote a finite amount of space to each value but still do a decent job of approximating the real numbers we're interested in. If we have six decimal digits of precision, we can represent numbers like 123.456 or 123456 or 0.123456 or 0.000000123456 or 123456000000. However, we can not accurately represent numbers which require more than six significant digits: we can't talk about numbers like 123456.123456 or 0.1234567 or even 1234567. If we try to represent numbers like these (i.e. if they occur in our calculations), they'll be ``rounded off'' somehow. (``Rounded off'' is in quotes because we don't always know whether the unrepresentable digits will be properly rounded up or down, or randomly rounded up or down, or simply discarded. High-quality floating-point processors and subroutine libraries will try to do a good job of rounding when a result cannot be exactly represented, but unfortunately there are some low-quality ones out there.)

In simple circumstances, the finite precision of floating-point numbers is not a problem. Real-world inputs (i.e. anything you measure) never have infinite precision; in fact they often have only two or three significant digits of precision. Almost everyone has a vague understanding that floating-point representations have finite precision, so it's easy to shrug off slight inaccuracies as ``roundoff error.'' If these errors always stayed confined to the least-significant digit, they wouldn't be much of a problem.

The problem is that error can accumulate. In the worst case, you can lose one (or more than one) significant digit in each step of a calculation. If a calculation involves six steps, and you only have six significant digits to play with, by the end of the calculation, you've got nothing of significance left. As a wise programmer once said, ``Floating point numbers are like piles of sand. Every time you move them around, you lose a little sand.''

How can an error grow to involve more than the least-significant digit? Here is one simple example. Suppose we are computing the sum 1+1+1+1+1+1+1+1+1+1+1000000, and we have only six significant digits to play with. Clearly the answer is 1000010, which has six significant digits, so we're okay. But suppose we compute it in the other order, as 1000000+1+1+1+1+1+1+1+1+1+1. In the first step, 1000000+1, the intermediate result is 1000001, which has seven significant digits and so cannot be represented. It will probably be rounded back down to 1000000. So 1000000+1+1+1+1+1+1+1+1+1+1 = 1000000, and 1+1+1+1+1+1+1+1+1+1+1000000 != 1000000+1+1+1+1+1+1+1+1+1+1. In other words, the associative law does not hold for floating-point addition. Elementary school students often wonder whether it makes a difference which order you add the column of numbers up in (and sometimes notice that it does, i.e. they sometimes make mistakes when they try it). But here we see that, on modern digital computers, the folklore and superstition is true: it really can make a difference which order you add the numbers up in.

If you don't believe the above, I invite you to try this little program:

#include <stdio.h>

#define BIG 1e8

	int i;
	float f = 0;

	for(i = 0; i < 20; i++)
		f += 1;
	f += BIG;
	printf("%f\n", f);

	f = BIG;
	for(i = 0; i < 20; i++)
		f += 1;
	printf("%f\n", f);

	return 0;
You may have to play with the value of BIG (other values to try are 1e6, 1e7, or 1e9), but you should probably be able to get a surprising result.

Because of the finite precision of floating point numbers on computers, you also have to be careful when subtracting large numbers, as the result may not be significant at all. If we try to compute the difference 100000789 - 100000123 (again using only six significant digits), we are not likely to get 666, because neither 100000789 nor 100000123 are representable in six digits in the first place. This may not seem surprising, but what if the difference we were really interested in, and thought was significant, was 789 - 123, with that bias of 100000000 being an artifact of the algorithm which we assumed would cancel out? In this case, it does not cancel out, and in general we cannot assume that

(a + b) - (c + b) = a - c

Another thing to know about floating point numbers on computers is that, as mentioned above, they're usually represented internally in base 2. It's well known that in base 10, fractions like 1/3 have infinitely-repeating decimal representations. (In fact, if you truncate them, you can easily find anomalies like the one we discussed in the previous paragraphs, such as that 0.333+0.333+0.333 is 0.999, not 1.) What's not so well known is that in base 2, the fraction one tenth is an infinitely-repeating binary number: it's 0.00011001100110011<sub>2</sub>, where the 0011 part keeps repeating. This means that almost any number that you thought was a ``clean'' decimal fraction like 0.1 or 2.3 is probably represented internally as a messy infinite binary fraction which has been truncated or rounded and hence is not quite the number you thought it was. (In fact, depending on how carefully routines like atof and printf have been written, it's possible to have something like printf("%f", atof("2.3")) print an ugly number like 2.299999.) Where you have to watch out for this is that if you multiply a number by 0.1, it is not quite the same as dividing it by 10. Furthermore, if you multiply a number by 0.1 three times, it may be different than if you'd multiplied it by 0.001 and different than if you'd divided it by 1000.

My purpose in bringing up these anomalies is not to scare you off of doing floating point work forever, but to drive home the point that floating point inaccuracies can be significant, and cannot be brushed of as simple uncertainty about the least-significant digit. With care, the inaccuracies can be confined to the least-significant digit, and the care involves making sure that errors do not proliferate and compound themselves. (Also, I should again point out that none of these problems are specific to C; they arise any time floating-point work is done with finite precision.)

In practice, anomalies crop up when both large and small numbers participate in the same calculation (as in the 1+1+1+1+1+1+1+1+1+1+1000000 example), and when large numbers are subtracted, and when calculations are iterated many times. One thing to beware of is that many common algorithms magnify the effects of these problems. For example, the definition of the standard deviation is the square root of the mean variance, namely

	sqrt(sum(sqr(x[i] - mean)) / n)
where sum() and sqr() are summing and squaring operators (neither of which are standard C library functions, of course) and where mean is
	sum(x[i]) / n
(The statisticians out there know that the denominator in the first expression is not necessarily n but should usually be n-1.)

Computing standard deviations from the definition can be a bit of a nuisance because you have to walk over the numbers twice: once to compute the mean, and a second time to subtract it from each input number, square the difference, and sum the squares. So it's quite popular to rearrange the expression to

	sqrt((sum(sqr(x[i])) - sqr(sum(x))/n) / n)
which involves only the sum of the x's and the sum of the squares of the x's, and hence can be completed in one pass. However, it turns out that this method can in practice be significantly less accurate. By squaring each individual x and accumulating sum(sqr(x[i])), the big numbers get bigger and the small numbers get smaller, so the small numbers are more likely to get lost in the underflow. When you use the definition, it's only the differences that get squared, so less of significance gets lost. (However, even working from the definition, if the individual variances sum to near zero they can sometimes end up summing to a slightly negative number, due to accumulated error, which is a problem when you try to take the square root.)

Other standard mathematical manipulations to watch out for are:

The quadratic formula:

If b<sup>2</sup> and 4ac are both large, there may be very few significant digits left in the determinant b<sup>2</sup> - 4ac.

Solving systems of equations, inverting matrices, computing the determinant of matrices:

The standard mathematical techniques involve large numbers of compounded operations which can lost a lot of accuracy.

Iterated simulations:

Obviously, any iteration is prone to compounded errors.

Trigonometric reductions

The only sane ways of computing trig functions numerically (i.e. the ways likely to be used by library functions such as sin, cos, and tan) work only for the one period of the function near 0. These functions will of course work on arguments outside the interval (-pi, pi), but the function must first reduce the arguments by subtracting some multiple of 2*pi, and as we've seen, taking the difference of two large numbers (whether you do it or the library function does it) can lose all significance.

Compounded interest

The standard formula is something like

principle x (1 + rate)<sup>n</sup>
but raising a number near 1 to a large power can be inaccurate.

Converting strings to numbers:

[okay, so this is a computer science problem, not a mathematical problem] If you handle digits to the right of the decimal by repeatedly multiplying by 0.1, you'll lose accuracy.

For some of these problems, using double instead of single precision makes the difference, but for others, great care and rearrangement of the algorithms may be required for best accuracy.

Returning to C, you might like to know that type float is guaranteed to give you the equivalent of at least 6 decimal digits of significance, and double (which is essentially C's version of the ``double precision'' mentioned in the preceding paragraph) is guaranteed to give you at least 10. Both float and double are guaranteed to handle exponents in at least the range -37 to +37.

One other point relating to numerical work and specific to C is that C does not have a built-in exponentiation operator. It does have the library function pow, which is the right thing to use for general exponents, but don't use it for simple squares or cubes. (Use simple multiplication instead.) pow is often implemented using the identity

x<sup>y</sup> = e<sup>y ln(x)</sup>
and the inevitable slight inaccuracies in the underlying natural log and e<sup>x</sup> routines can foil you. For example, if you compute (int)pow(2., 3.), on some machines the answer comes out 7, because pow returns a number like 7.99999, and C's floating-to-int conversion simply discards all of the fractional part.

The fact that a floating-to-int conversion truncates the fractional part has nothing to do, by the way, with the problem mentioned earlier of a poor-quality floating-point processor not rounding well. (Back there we were talking about rounding an intermediate result which was still trying to be represented as a floating-point number.) When you do want to round a number off to the nearest integer, the usual trick is

	(int)(x + 0.5)
By adding 0.5 before truncating the fractional part, you arrange that numbers with fractions of .5 or above get rounded up. (This trick obviously doesn't handle negative numbers correctly, nor does it do even/odd rounding.)

* * *

I am not a numerical analyst, so this presentation has been somewhat superficial. On the other hand, I've tried to make it practical and accessible and to highlight the real-world problems which do come up in practice but which too few programmers are aware of. You can learn much more (including the traditional notations for describing and analyzing floating point accuracy, and some clever algorithms for doing floating point arithmetic without losing so much accuracy) in the following references.

David Goldberg, ``What Every Computer Scientist Should Know about Floating-Point Arithmetic,'' ACM Computing Surveys, Vol. 23 #1, March, 1991, pp. 5-48.

Brian W. Kernighan and P.J. Plauger, The Elements of Programming Style, Second Edition, McGraw-Hill, 1978, ISBN 0-07-034207-5, pp. 115-118.

Donald E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, second edition, Addison-Wesley, 1981, ISBN 0-201-03822-6, chapter 4.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery, Numerical Recipes in C, 2nd edition, Cambridge University Press, 1992, ISBN 0-521-43108-5.

This page by Steve Summit // Copyright 1995 // mail feedback