Thursday, August 7, 2014

On the Datatype of Matrix Indices

Computer science purists often try to convince me that size_t is the appropriate type for matrix and vector indices in C or C++. This is always asserted without further justification as if it were commonly accepted. When I say “a matrix index isn't the size of anything, so the burden of establishing that size_t is the ‘one true type’ for these indices lies with you”, I'm generally either insulted or berated, but a strong rational justification has yet to be presented.

My position is that any unsigned integral type is unsuitable for use as a general matrix or array index in computational science.  My argument is essentially a utilitarian one: that pragmatism should win over naïve pedantry.

The most succinct expression of my reasoning is that, from an algorithmic perspective, it is, in general, highly desirable that indices approximate a group under addition (as closely as we can with finite size), and that using any unsigned integer type unnecessarily deprives index variables of the ability to store the additive inverse.

The argument that, formally, matrix indices are natural numbers, {1, 2, 3, ...}, holds very little weight in this argument, since C and C++ already base arrays at zero. I maintain that the ultimate reason for this is the utility of having the additive identity, 0 (zero).

Matrix-oriented languages, like Fortran and Matlab, despite basing indices at 1, do not restrict the datatype of variables used as indices strictly to the set of natural numbers. That is to say that, in these languages that are specifically and explicitly matrix- and vector-oriented, provided that an index variable is a valid index when it is used as an index, the values that it takes at other points in a program are unconstrained. I would take some convincing that the exact opposite should be the case in C or C++, yet this is precisely the position that purists ridicule me for opposing. If you were to suggest to a Fortran or Matlab programmer that assigning zero or a negative number to an index variable at any point in a program should cause an error, I would expect them to be so puzzled by the sheer stupidity of it that they would have extreme difficulty articulating why this is a bad idea.

If unsigned indices were genuinely as good an idea as the pedants would have you believe, one might expect that in 57 years of Fortran and 30 years of Matlab, it would have been tried out and been such a great success that it would have caught on, but that is clearly not the case. Fortran, in fact, allows you to define the index range of arrays to accommodate negative indices. If this were something C-specific, one might expect that the C standard would incorporate this idea, yet it does not (§ “Array subscripting” uses int) and x[-5] is perfectly valid.

The reality of doing computational linear algebra is that, although indices are non-negative integers, the arithmetic manipulation of indices in the course of executing an algorithm often, if not usually, involves not just subtraction, but negative numbers even if the eventual index computed is always non-negative. Consider, for example, the diagonal matrices formed from the stencils of finite difference methods. It would add nightmarish complexity to restrict the relative indices of the stencil to non-negative integers. What would a lower boundary check, now a trivial if(i<0) {... }, become if i were not allowed to be negative?

It might be argued that, in these cases, one “ought” to do these index computations with a signed type so as not to “pollute” the index type with negative numbers, but what exactly is the benefit of doing this? I mean, other than the smug self-satisfaction of being an über-pedant?

In practice, if you attempt to use an unsigned integer type for indices, but a signed type for index calculations, you have to add a lot of casts to silence warnings, and this quickly becomes so onerous that one either ends up either with ill-considered casts that may accidentally mask true errors, or such diligent consideration of whether a cast is truly correct that “cast contemplation” becomes a major, if not the dominant, contributor to development effort.

Again, the pedant might argue that you should have to consider these cases; that it is not a bad thing to have your development time doubled if that means it's correct. But is it “correct”? What does this “correctness” buy me? What are the actual benefits?

Well, the only one I've ever heard is this: “because size_t is the canonical type used to represent sizes in C, such as in calls to malloc(), size_t is the correct type for indexes because it guarantees that they can index any array returned by malloc() irrespective of the platform”. Really? That's it?

Leaving aside the fact that this is really an argument for choosing one of size_t and ssize_t, and I hope I've convinced you that ssize_t would be the better option, this “guarantee” is worthless, because flexibility in the representational range of a datatype used as a matrix index is, at best, useless.

The range of values that you need to use as a vector or matrix index isn't defined by what size objects you can malloc(), it's defined by the problem that you're trying to solve. So, there are actually only two scenarios that merit consideration:
  • size_t is big enough to represent all problem indices; and
  • size_t is too small to represent all problem indices.
In the former case, which is usual, choosing size_t over, say, int32_t doesn't matter, and in the latter case, you have an issue so major that, not only does it not help, but it probably exacerbates.

Suppose that the problem of interest is such that indices need to go up to 1,000,000. You develop a solution under a 32-bit memory model where size_t is 4 bytes. Going to a 64-bit memory model where size_t is 8 bytes is essentially trivial and presents no problem (no matter the choice of datatype for indices) other than doubling the memory requirement for indices, which is not a good thing. However, going to a platform where size_t is 2 bytes is going to be very difficult: you're going to have to do quite a bit of memory management and index segmentation and manipulation to be able to solve your problem under this constraint, and direct indexing is more-or-less off the table. If you have chosen size_t for your indices, you now also have to contend with having 2 bytes automatically hacked off all of them, even in places in your code where no array is indexed; in all likelihood, this is going to make solving the issue more difficult instead of less.

So the correct answer to the question “what datatype should I use for matrix indices” is a signed integer type that is sufficiently large to represent all anticipated indices. Moreover, the correct answer to the question “what datatype should I choose for problem X” is “whatever is required by problem X” and not something provided by the compiler or platform for some other purpose.

No comments:

Post a Comment