Almost Always Unsigned (graphitemaster.github.io)
161 points by todsacerdoti on Jan 2, 2022 | 263 comments


I've vacillated between one camp and the other over the years, but in 25 years of professional development, other than pointer arithmetic I can't think of a single time where I've dealt with numbers that exceed a trillion in either direction. Integer overflows are so rare that I can't even remember one happening with 64-bit values. Underflow, on the other hand... I've encountered that plenty.

I'm now back in the signed camp. Signed integers make it a lot easier to sanity check integer subtraction when negative values are disallowed (a common situation), and it's easy to reason about. Also, not converting between signed and unsigned avoids accidental conversion overflows due to different expectations, so keeping it signed bypasses that whole can of worms while maintaining maximum flexibility.
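A quick sketch of the sanity-check pattern described above (Rust syntax, names mine): with a signed type, "went negative" is a single, cheap post-condition.

```rust
// With signed arithmetic the error state is representable, so one
// comparison after the subtraction catches the whole class of bugs.
fn remaining(budget: i64, spent: i64) -> Result<i64, String> {
    let left = budget - spent;
    if left < 0 {
        return Err(format!("overspent by {}", -left));
    }
    Ok(left)
}
```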

Unsigned gives you 2x the positive numeric range, which 99.999% of the time you'll never use (and if you did, underflow detection would become a nightmare).

I can't think of a single time I've heard someone complain "I wish Java's long type supported up to 16 quintillion instead of a paltry 8 quintillion"

At the end of the day, code is for people, so keep it easy for people.


I think types should express intent.

This adds more 'value' to the code for people than allowing them to switch their brains off because something is 'unlikely' to happen. A large class of bugs exists because people assume stuff is unlikely.

I also think one has to distinguish between API and actual code.

If you write a lib that takes inputs that should never be negative you can express this without writing a single line of documentation about it (or if you do, requiring the user to read it) by using unsigned types in these places.

You can still use signed types in your actual library code if you want/must/believe in that but do not pollute the outside API with them if unnecessary.


In most languages that offer this distinction, unsigned integers also don't really express your intent in an effective way.

Imagine you have a bug in your program. If you intend for a value to be non-negative and use a signed integer, you get a negative value. Your code is broken. You didn't want a negative value. But you do have a very easy way of checking for this at runtime, at least.

Now imagine you instead use an unsigned integer. The bug is still there but now it manifests as overflow or underflow. Your program is still broken. Nothing about using an unsigned integer saved you. And worse, you don't have an easy predicate to check that something has gone wrong.
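To make this concrete, here's the same off-by-one bug seen both ways (Rust, illustrative function names mine):

```rust
// Signed: a negative result is representable, so `diff < 0` is a
// trivial predicate for "something has gone wrong".
fn signed_diff(len: u32, used: u32) -> i64 {
    len as i64 - used as i64
}

// Unsigned: the same bug wraps into a plausible-looking huge count,
// indistinguishable from a legitimate large value.
fn unsigned_diff(len: u32, used: u32) -> u32 {
    len.wrapping_sub(used)
}
```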

I think this can only be a reasonable choice if you are working in a language that has dynamic checks for overflow and so using an unsigned integer enforces not only the predicate that the value is nonnegative but also that all operations stay within its expected arithmetic domain.


If my intent is that a certain integer should never be negative, labelling it unsigned does a poor job at conveying this intent because any arithmetic just silently does the wrong thing. 3-4 is some huge number rather than “error”.


In Rust, 3u64 - 4u64 is "error", specifically

thread 'main' panicked at 'attempt to subtract with overflow'


It panics _in debug builds_ but wraps in release builds because including the checks in every arithmetic operation is not zero cost.


Though you can configure your release builds to panic as well. That can be quite useful if you don't expect trivial computations to be a large cost center (and you can always bench to check that assumption).

Even more so now that custom profiles have been added to Cargo (yay), you can enable overflow checking in release, and add a `release-unchecked` or whatever.
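With custom profiles, that setup might look something like this in Cargo.toml (the `release-unchecked` name is just the suggestion above, not a built-in profile):

```toml
# Checked arithmetic by default, even in optimized builds.
[profile.release]
overflow-checks = true

# Opt-out profile for code paths where the checks proved too costly.
[profile.release-unchecked]
inherits = "release"
overflow-checks = false
```

Then build the unchecked variant with `cargo build --profile release-unchecked`.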


Or, better in many cases, you can write what you meant explicitly in Rust.

3u64.checked_sub(4u64).expect("Overflow");

This allows you to spell out that you don't expect this to overflow, and Rust should panic if it does, regardless of your compiler flags.


Of course you can do that, but I would say that doing so defensively for every operation is way more verbose and unwieldy than you'd want.

Much simpler to just toggle on the relevant flag.


I think GP was saying you can do that when you expect that the general case doesn’t need to be checked but a specific case does.


> when you expect that the general case doesn’t need to be checked but a specific case does.

But surely that's almost never the case? Few programs actually desire or care for base 2 modular arithmetics, and when they do it tends to be for very specific tasks (usually cryptographic or cryptography-adjacent e.g. hashing, checksumming, ...).


> But surely that's almost never the case?

Which is why it’s not so impractical to make an occasional exception.


I wish they had "unchecked_sub()" instead, and default "-" would work like checked_sub().


unchecked_sub() exists in Nightly and so may some day come to stable

However, changing - to perform checked_sub() leaves you with an Option, and this makes ordinary arithmetic look pretty clumsy:

let x = (a-b).unwrap()-c;

The intent is to land more types with intrinsic behaviour. Today only the Wrapping type is provided, so Wrapping<i32> is an i32 that definitely has Wrapping arithmetic and won't panic on overflow, but eventually Saturating<i16> will be possible (e.g. for CD audio PCM samples, saturating arithmetic is correct; if you try to make the loudest possible noise louder it just stays the same) and so will Unchecked<i8> if you're really sure that you're doing 8-bit arithmetic that can't overflow and doesn't need debug checks.
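For what exists today, a sketch: Wrapping<T> makes the modular behaviour part of the type, and saturating arithmetic is already available as methods even where a wrapper type isn't.

```rust
use std::num::Wrapping;

// Wrapping<T> opts in to modular arithmetic at the type level.
fn wrap_demo() -> u8 {
    (Wrapping(250u8) + Wrapping(10u8)).0 // 260 mod 256
}

// Saturating arithmetic as a method: "the loudest possible noise
// stays the same" behaviour for an i16 PCM sample.
fn louder(sample: i16) -> i16 {
    sample.saturating_add(100)
}
```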


But then your function should check if the value is within reasonable parameters.

If it was an int at the function signature and you got -1 what would you expect to happen? It's the same thing.


> But then your function should check if the value is within reasonable parameters

How could you possibly check this using types that cannot represent what you are checking for?


> How could you possibly check this using types that cannot represent what you are checking for?

`f(u32 a, u32 b) { assert a > b ... `

(if the hypothetical following operation is `a-b` as discussed higher in the thread)


Better ensure that assert is not turned off in release mode then.


One of the reasons I don't like asserts.

Log the error and act accordingly, instead of believing this kind of issue only happens during development


How do you "act" if you detected your logic is faulty? In many situations aborting is the best choice.

If input data is faulty and the reason for that is not seen as part of your logic, then sure, log an error and skip.

I used to define my own assert macro to "ensure" my asserts aren't disabled, but I don't bother anymore. There's nothing wrong with assert, and you needn't define NDEBUG. It's important to be aware that there are different situations, and not all warrant aborting, as described above. Another distinction is that there can be asserts that must be disabled in release builds for performance reasons, and others that won't affect performance and can stay enabled.
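Rust, mentioned elsewhere in the thread, bakes exactly this distinction into the standard library; a sketch:

```rust
fn sub_strict(a: u32, b: u32) -> u32 {
    // assert! is always compiled in, regardless of build profile:
    // for invariants whose violation means the logic is broken
    // and aborting is the right response.
    assert!(a >= b, "precondition violated: a={a} b={b}");

    // debug_assert! vanishes in release builds: for checks too
    // expensive to keep but useful during development.
    debug_assert!(a < 1_000_000);

    a - b
}
```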


> because any arithmetic just silently does the wrong thing

The same thing, but worse, can happen with signed integers. (-3)-INT_MAX is either some huge positive number, or something crazy because it's undefined behavior and the compiler is allowed to do anything it wants.
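For comparison, the analogous operation in Rust (name mine) is fully defined, and the checked form turns the failure into a testable value rather than UB:

```rust
// C's (-3) - INT_MAX is undefined behavior; in Rust the checked form
// reports the unrepresentable result, and the wrapping form is defined.
fn near_overflow() -> (Option<i32>, i32) {
    let checked = (-3i32).checked_sub(i32::MAX); // None: out of range
    let wrapped = (-3i32).wrapping_sub(i32::MAX); // wraps to i32::MAX - 1
    (checked, wrapped)
}
```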


I guess what the parent is saying is that, while the same kind of bugs can happen with signed integers, they are less common than with unsigned integers because they typically involve very big or very small values such as INT_MAX or INT_MIN. This is due to the fact that the "most common values" (close to 0) are at the beginning of the range of unsigned integers, but at the middle of the range of signed integers, so the risk of underflow is lower for signed integers in practice.

I can understand where they come from, especially regarding error handling: for integers whose value must not be negative, it is easier to check for underflow by checking if the result is not negative rather than by checking that the result is not eg smaller than the previous value (eg for addition).

That being said, "number must not be negative" really ought to be encoded in the type system, so that the user of the type knows that they must actually look for underflow.

I guess that part of the problem is that we don't want to check for underflow after each operation, so checking the negativity is a way to "coalesce" several checks after multiple operations. However this is fragile, because the multiple operations could end up producing a positive value, even if some intermediate values were negative.

For full safety, I don't see how we could do better than checking after each operation right now. If we're doing this, I feel like checked_add and friends from rust is a better fit than cramming an unsigned int into a signed one.

I wonder if we could design an integer type with 63 bits of value, plus one bit of "overflow/underflow poison", such that any operation that would under/overflow would saturate that bit to one, but otherwise still perform the operation on the value part. That would allow to coalesce multiple checks while keeping safety even in the presence of multiple faulty operations. I wonder how it could be implemented efficiently though
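A rough sketch of that idea (hypothetical type, names mine): every operation proceeds on the value, but any overflow sets a sticky flag that can be inspected once at the end of a computation.

```rust
// Sticky-poison integer: overflow doesn't stop the computation, it
// just taints the result until someone checks the flag.
#[derive(Clone, Copy, Debug)]
struct Poisoned {
    value: u64,
    poisoned: bool,
}

impl Poisoned {
    fn new(value: u64) -> Self {
        Poisoned { value, poisoned: false }
    }
    fn add(self, rhs: Poisoned) -> Poisoned {
        let (value, overflow) = self.value.overflowing_add(rhs.value);
        Poisoned { value, poisoned: self.poisoned || rhs.poisoned || overflow }
    }
    fn sub(self, rhs: Poisoned) -> Poisoned {
        let (value, overflow) = self.value.overflowing_sub(rhs.value);
        Poisoned { value, poisoned: self.poisoned || rhs.poisoned || overflow }
    }
}
```

E.g. (3 - 4) + 2 ends up with value 1, back in range, yet stays poisoned — which is exactly the "coalesced check stays safe" property.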


Your idea for a coalescing overflow type has been explored in the document N2868 [0], a proposal for the C programming language. Instead of reserving a bit from the integer, each integer type is associated with a checked integer type, which holds an integer alongside an overflow bit. Checked integers are opaque and can only be constructed and deconstructed via macros; as I understand it, the idea is that the compiler can use the processor's preexisting overflow flag to represent the checked integer. In particular, the operations on checked integers have the property of coalescing overflow checks that you are looking for.

[0] Supplemental Integer Safety, http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2868.pdf


> I wonder if we could design an integer type with 63 bits of value, plus one bit of "overflow/underflow poison", such that any operation that would under/overflow would saturate that bit to one, but otherwise still perform the operation on the value part. That would allow to coalesce multiple checks while keeping safety even in the presence of multiple faulty operations. I wonder how it could be implemented efficiently though

There is something like that already for floating point. Whenever an overflow or underflow happens, it sets a sticky bit in a separate flags register. You can clear these flags, do a sequence of operations, and at the end, see if any of these flags are set. See https://man7.org/linux/man-pages/man3/fenv.3.html for the standard C API for it.


An unsigned integer will never be negative, that intent is expressed clearly and correctly. What the operational semantics should be is unclear, because of the tradeoffs involved in signaling an error.

In terms of the C standard, "3-4" for unsigned ints is modular arithmetic, and the "wrong thing" is assuming that it will do anything other than wrap around. This is very clearly defined, and implied whenever you see an arithmetic expression on unsigned integers.


Conversion from signed to unsigned is implicit and silent. The expression being converted to your "can never be negative" type isn't nonnegative!!! Its value drastically changes; e.g. -3 becomes a huge number.
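Rust makes the same conversion explicit, which shows exactly the value change C performs silently (function names mine):

```rust
// The cast reproduces C's silent signed-to-unsigned conversion.
fn silent(x: i32) -> u32 {
    x as u32 // -3 becomes 2^32 - 3
}

// The lossless alternative makes the failure visible instead.
fn checked(x: i32) -> Option<u32> {
    u32::try_from(x).ok() // None for any negative input
}
```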


Not if you pass -Wconversion to GCC


Clearly defined, but very inconvenient and makes it too easy to write buggy code that looks correct on surface.


Inconvenient to whom? It’s pretty inconvenient not to do that if you eg have strict memory/performance requirements.

C made the right tradeoff IMO. You can protect yourself against overflow if you need to, but if it always signals errors you can’t turn that off.


This sounds like an issue with the language. 3-4 is not a large number and should not be by default. It should be a case handled by the developer.

Maybe you want to do some weird bit banging trickery, and that's fine too, but it should require an out-of-the-way function call.


> This sounds like an issue with the language. 3-4 is not a large number and should not be by default

Many languages optimize numeric operations and DX around numeric types for efficiency rather than correctness, and make the latter high-friction. (Wrap-around subtraction, decimal literals being treated as IEEE floats, etc.)

The exceptions (or cases where it is merely less true) tend to be very high level, dynamic languages.


"unsigned" doesn't express intent. It expresses implementation, maybe it expresses assumption.

Ada's "Positive" or "Natural" or

  subtype X is Integer range Y..Z;
expresses intent (well expresses requirement really), it also enforces that constraint, so it's not merely an assumption.
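For languages without built-in range subtypes, something similar can be approximated with a checked newtype; a hypothetical Rust sketch (names mine, runtime-checked at construction rather than statically like Ada):

```rust
// A rough stand-in for Ada's range subtypes: the bound is part of
// the type, and every construction site enforces it.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Ranged<const MIN: i64, const MAX: i64>(i64);

impl<const MIN: i64, const MAX: i64> Ranged<MIN, MAX> {
    fn new(v: i64) -> Option<Self> {
        (MIN..=MAX).contains(&v).then_some(Ranged(v))
    }
    fn get(self) -> i64 {
        self.0
    }
}

// Roughly Ada's Positive and Natural:
type Positive = Ranged<1, { i64::MAX }>;
type Natural = Ranged<0, { i64::MAX }>;
```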


> I think types should express intent.

> If you write a lib that takes inputs that should never be negative you can express this without writing a single line of documentation

So, let's make an odd_int_less_than_20587 type and a billion of other types? "Not negative" normally helps only one tiny bit towards documenting anything about the accepted values. Making a separate type for that and complicating everything is ridiculous.

> If you write a lib that takes inputs that should never be negative

Have you ever compared the difference between two unsigned integers? Because unsigned integers don't have "natural" subtraction. Did you ever find the need for signed numbers to interact with unsigned numbers? Because if you did, it's a bad idea to separate them without need. Premature isolation is a prime cause of complexity.


> So, let's make an odd_int_less_than_20587 type and a billion of other types?

Yes, what's wrong with that ? Every variable's type should be defined as precisely as possible when meaningful, that's the whole point (and C/C++ not easily allowing to do that is one of their biggest drawbacks imho.)


There's a tradeoff around strictness of nominal types and evolvability, e.g. changing the threshold on such a strict type would be a breaking change.


Changing the threshold without signaling the change in the type system is also a breaking change. It's just that your compiler won't warn you about it when you inevitably forget to update your code. Instead your code is going to break at runtime.


Obviously if you have at most 8/16/32/64 bit types (only making new types because of physical constraints), you'll have way, way fewer breaking changes than if you try to encode bounds extremely tightly and try to update them whenever you use 2 instead of 1 somewhere.

In fact most software is designed such that the physical sizes are chosen first, and then the practical bounds follow from that.


Not necessarily, it depends on what you do with the value.


Are you seriously asking? Because what I suggested rhetorically would be the death of all your programs for the simple cause that line count would quadruple and you would have to touch half your codebase every time you make a tiny change.


When I make a change today like renaming a method I just let my IDE apply it to the codebase, why would that be different ?


Because if you change one type to include the number 235498 it will also affect 377 other types.


why would it ? if you set things such that expected ranges are set where they matter (e.g. where you do arithmetic) you should only have to change the types for which range issues occur, like this (very quick and dirty, likely wrong, but you get the idea): https://gcc.godbolt.org/z/4v6dW9czP


Not to hurt anyone's feelings but no I don't get the idea. I find this irritating and very unconvincing. So much work and boilerplate to prove peanuts. Have you ever seen anyone doing serious software development like that? I mean, not just toy examples of some 10-line bean counting program blown up to 100 lines.

> void foo(ranged_int<1, 5> a, ranged_int<3, 7> b)

Why would I _ever_ write a function like that? Practical proofs need to be about runtime values, not just constants and types. They also need to put multiple values in relation. They need to handle type casts and many more things. Constant integer ranges might be of some use but it's a tiny fraction of the invariants that a programmer juggles in practice, and the source code is already blown up out of proportion just for these ranges. This thing is just not gonna fly.

Just multiply some "good" values a few times and there is no way that you can statically prove that the result is going to be in the range as well, even though it might always be the case due to how the values combine each time (simple example: offset + size that the code checked to be within array bounds). Invariants in practice are relations over the runtime values of multiple variables. The focus on single values is basically already where types fail as a correctness tool.


> Just multiply some "good" values a few times and there is no way that you can statically prove that the result is going to be in the range as well, even though it might always be the case due to how the values combine each time (simple example: offset + size that the code checked to be within array bounds).

I don't understand, there's literally a CVE about that elsewhere in this thread, basically

    struct my_structure { 
      int size;
      int offset;
    };
either you restrict both size and offset to e.g. 2^30 so that you can safely sum both, or you will have an issue because you'll be doing if(size + offset < max_size) and size + offset overflows because absolutely no one can be trusted to write overflow checks, you gotta use a library for that.
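The library-free version of that check, in Rust's checked arithmetic (function name mine), is what the overflow-safe `if (size + offset < max_size)` has to become:

```rust
// checked_add reports overflow as None instead of wrapping past the
// comparison, so the bounds check can't be subverted by huge inputs.
fn in_bounds(size: u32, offset: u32, max_size: u32) -> bool {
    size.checked_add(offset).map_or(false, |end| end < max_size)
}
```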


If you're not making assumptions about the values stored in your variables, then every little arithmetic operation can lead to unexpected wraps (UB for signed types). What follows is that we need to make assumptions. Don't write an if conditional in front of every operation like a forgetful person - but assume that "my_structure" contains good values that you can safely add.

Usually it's very easy - require that size and offset are validly indexing an object, and require that objects are of a sane size (I rarely have in-memory objects larger than a few megabytes, and for most code I posit that it is a mistake to not chunk and stream huge datasets). Done, no real need to worry.

If you cannot make any assumptions about the values contained in a "my_structure" then that means the code is at an API boundary, which is the perfect place where such check (if-statement) must be placed. At internal code places, checks are wasted effort, and at most a compiler switch to detect wraps (logic errors) would often be a much better choice than paranoid code where each line of code can't trust the previous line, and where one can't see actual work being done because it's all covered in mistrust. Unnecessary mistrust (i.e. mistrust at places strictly inside maintenance boundaries which are not strategic checkpoints) only makes it hard to write code that actually does something, and might possibly cause more bugs than it could ever save.


> So, let's make an odd_int_less_than_20587 type and a billion of other types?

Bounded integer types have been extremely useful for me on a few occasions, and I wish more languages had type systems powerful enough to support them. They also address your concerns about subtraction.


> So, let's make an odd_int_less_than_20587 type and a billion of other types?

Yees?

Bounded integers are a relatively common feature of programming languages (and one which is sorely missing from Rust, it's technically possible currently but not super fun, when fleshed out enough const generics should make them much nicer, possibly even built-in)


What is the point of bounded integers? Most quantities don't have natural bounds, and practical bounds only arise from physical limitations (8/16/32/64 bit etc).


Those practical bounds have a tendency to percolate into bounding other things. So you can only have 2**10 foos because they're indexed by a 10-bit bitfield somewhere? Now maybe you can only have 204 bars, because every bar has at least 5 foos. And so on.

If you genuinely wanted to reason solidly about such limits, you do end up needing arbitrarily bounded integers. Of course, most people don't want to do that, for good reasons. And even if you wanted it, there's a reasonable question whether adding them to the language syntax is really the right thing to do.


Yup, that kinda was my point, and where you say "at least" that is also a good point why setting up some complicated static system to maintain limits is not practical. Hence my philosophy has been for a long time to set the physical constraints that just have to be decided (and do that as best as I can). And then to not worry (more than a tiny bit) about statically deciding the rest, or it'll end in a nightmare.


> So, let's make an odd_int_less_than_20587 type and a billion of other types?

Yes, and at least in C++ it isn't that difficult to trivially generate these types as you need them with a modicum of template-fu. You don't need that many integer type templates to capture most of the common cases and provide contextually sane operators. This might be a little more difficult in other languages.

This is pretty standard type safety practice and usually well worth it in terms of writing robust low-defect code.


Show me where this is standard safety practice?


Eh, what? This has been idiomatic for many years for C++. Not doing it, especially today, is just malpractice. Every company I've worked at since I was doing C++ professionally required this. There is no accounting for how random companies do their software engineering but that doesn't mean competent software engineering organizations don't exist.

I do database engines. This has been a standard part of code standards for a really long time, at least as long as the programming languages we used practically supported it. I've done it ever since. The only type that is allowed to flow through the code is size_t, where it makes sense.


I'm really curious what that means and how it looks. I assume it's casually going beyond simple typedefs for types whose size might change? And what does "flow through the code" mean here? That you can't mention "int" or "uint64_t" anywhere? That you can't pass these things as arguments?

Are there any open source examples to study this style? I don't know where to look, but admittedly studying C++ codebases is not what I do for fun.

In my personal experience, putting up layers of abstraction to hide physical representations has never worked. It makes the implementation code really hard to grok, and your hands are bound in the code because you're not allowed to _assume_. What worked for me instead is making sure that certain decisions that are likely to change are contained inside a module and, for the rest of the code, hidden behind an API or simply not accessible. Runtime assertions can be useful too, but maybe that is part of what you are describing by "template-fu". Finally, I try to stay true to my decisions and am not trying to act like I might have to replace floats with doubles at every corner. Some code that I've seen was littered with templates for that single reason, and I'm pretty sure the developers didn't have an intimate understanding of how floating point works.

Interestingly, size_t is one of the things that I tend to avoid, and in my perception a majority of the programmer population agrees it was a mistake to have it be an unsigned type. A type which can represent "the size of an object" is too general for all but maybe OS APIs IMO. Usually I know that the size of my objects is bounded by a certain size, well below what fits in 32 bits.


> So, let's make an odd_int_less_than_20587 type and a billion of other types?

I see your point, but I think that (while not always), it can still be pretty helpful to add your custom types to make invalid states unrepresentable.

This can sound ridiculous on something like C, but if your language has refined types, it's not that bad.

For example, this doesn't look awful to me:

    type WeirdId = Int Refined (Odd And Less[20587])

    val myId: WeirdId = 5 // OK
    val myInt: Int = myId // Can be used as an Int
    val myRuntimeId: Either[String, WeirdId] = refineV[Odd And Less[20587]](myId + 2) // Runtime check
    val myInvalidId: WeirdId = 6 // Compilation error

https://scastie.scala-lang.org/qoSHgL4PQCW6lC2MUHc5YA


I've spent some time pondering things like that, but ultimately concluded that we should keep these things away because there is a duplication between type-level and code level, and I'm completely fine with sprinkling a few runtime assertions in most cases. Normal code syntax is much more suited to expressing invariants. Obviously I'm not into statically proving the correctness of my programs, so if that's your cup of tea things are probably looking different (and I'm sorry for you).


I would love to be able to do more formal proving of programs in practice, but I 100% agree with you. A lot of what the advanced typing crowd seem to be doing is to re-invent programming but at the type level, and it's unclear to me why that should count as progress. It reminds me uncomfortably of C++ template metaprogramming -- and for good reasons, C++ has made steps towards replacing that with statically evaluated expressions written in regular syntax.

My gut feeling is that the right way forward is to have a fairly standard type system visible in the source language augmented by contracts written in regular code. Basically a slight augmentation of assert(). A formal prover could internally reason about an extended type system where those contracts are automatically lifted to the type level if that helps for some reason, but programmers wouldn't be bothered with it.

This seems unlikely to be a genuinely novel idea, so I'd be curious to hear whether systems like that already exist.


Exactly. It pains me when someone comes up with a new language with dependent types, but then to encode a simple constraint they first define integers via peano axioms using some kind of template metaprogramming. (The example I'm remembering was Coq or a Haskell dialect or something.)

There was a dialect of C#, Sing# or Spec#, that allowed you to specify pre- and post-conditions in the same expression language as the main language, and they were checked at compile time. This is what comes closest, I think.


Totally!

This is one of the best features of the pascal family:

http://www.delphibasics.co.uk/Article.asp?Name=Sets

(look in "SubRanges")


Almost useless in my opinion (because not a lot of practical invariants can be expressed with simple ranges) and will only create confusion when using non-0, non-1 based indexing and also create confusion about the representation/the space required for representation. The examples shown there are similar to the Animal-Cat-Dog examples that are popular to "prove" that classes and inheritance are useful.

That, plus the type incompatibility that arises from mixing such types with other integer types.


Have you ever used Pascal?

The ones that used it like it...


I have worked on an enterprise application written in Delphi full-time for 6 months.


Not every type pays its way, but the distinction between N and Z is well worn and useful. N measures the size of things while Z is a little more abstract. I think u64 can quickly convey a powerful intent.


If we're talking about N and Z (which makes sense since most programs are grounded in a mathematical model with some physical constraints added on top), then please acknowledge that N doesn't have subtraction defined, which illustrates why it's not very useful for most code.

You could also argue that subtraction for unsigned numbers is only partially defined, just like for signed numbers. But that would be missing the point, since most numbers in practical programs are small, and for these numbers the signed range is indeed much more useful since subtraction is defined for signed integers (e.g. i32) for small numbers. This is not the case for unsigned, where 3u-4u is usually not what you want.


Even without subtraction defined, N is useful. Partial subtraction on N, as it is, is also useful. When I'm dealing with counts I pretty explicitly either want truncated subtraction or warnings when I overflow.

Perhaps the biggest issue all together is just that many languages will let you do 3u - 4u instead of outright making the syntax more obviously annoying to force you to think about the caveats and pick between {truncated, throwing, overflow} variants.
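Rust's method names are one example of exactly that "pick a variant" syntax (function names mine, wrapping the standard methods):

```rust
// The three deliberate flavours of unsigned subtraction, so that
// 3 - 4 is never an accident:
fn truncated(a: u32, b: u32) -> u32 { a.saturating_sub(b) }     // floor at 0
fn throwing(a: u32, b: u32) -> Option<u32> { a.checked_sub(b) } // caller decides
fn modular(a: u32, b: u32) -> u32 { a.wrapping_sub(b) }         // explicit wrap
```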


I think you assume my comments are regarding C/C++.

They are not. I write mostly Rust these days.


I don't really care, why would Rust be much different in that regard?


It does not express intent. I remember being a junior C programmer and using "unsigned" to express my intent (it's greater than or equal to zero).

In reality, like all C keywords, it's completely misleading and actually really means "weird_arithmetic_mode_that_will_mess_you_up".


I don't know what's misleading about it or how the arithmetic is weird?


Signed and unsigned integer types are both finite rings (arithmetic modulo 2^n), rather than the infinite ring of integers used in most math. That difference makes the arithmetic "weird", but going to signed integers doesn't change anything; anything except bigints (or generic arbitrary-precision arithmetic) is a form of "weird_arithmetic_mode_that_will_mess_you_up" if you pretend you're operating over the real numbers.

Not sure what they mean about it being misleading though. They might prefer it if they were called "modular integers" instead of "unsigned integers" but that would lead to confusion with modules IMO.


>Not sure what they mean about it being misleading though. They might prefer it if they were called "modular integers" instead of "unsigned integers" but that would lead to confusion with modules IMO.

Modular integers to me would seem very confusing. They're integers without the sign, and they have some max value determined by the number of bits, I can't really think of a better name for them except for the fixed-width versions which include that bit information.

Extremely intuitive, simple English words, exactly what it says it is (integer without the sign bit).


Imagine a parallel situation. There is a type `NonEmptySet`, with an `intersection` operation. Now, given `NonEmptySet(1, 2).intersection(NonEmptySet(3, 4))`, and the requirement that `intersection` produces another `NonEmptySet`, what does that return? How about a set containing everything except 1, 2, 3, 4? Does that seem reasonable? That's analogous to what `(unsigned)3 - (unsigned)4` does.

In short, the operations on `unsigned int` do not really behave like "non-negative integer"; as such it's a weird match to use them for that purpose.


I don't really see it, unsigned says there's no sign. Given that and the number of bits it's exactly the expected behavior that it will not be able to produce a sensible result for subtracting something larger from something smaller.

Whether the operation then errors out or wraps doesn't change anything about unsigned being the right name for it, because it's something you wouldn't expect unsigned to do.

The intersection of your two NonEmptySets produces an empty set. If you try to store that result in a NonEmptySet it's the same problem: it doesn't make sense to expect a NonEmptySet as the result of that operation.


In a perfect world, this would be so. And if there were types that were performant AND could enforce such invariants without the downsides, I'd welcome them gladly. But primitive types come with baggage and limitations, so you have to make compromises. And the compromises when using unsigned ints are just too costly in my opinion. They're a good tool for specific use cases, but not as good as signed for general purpose code.


Could you list some of the downsides & compromises in the case I made?


- silent unintended underflow behavior

- dangers of mixed signed/unsigned code (signed values MUST exist, but unsigned values don't have to)

- more cognitive load to read and understand unsigned underflow detection code + issues once you exceed the halfway point

This is why Java offered signed integer types only (I think they should have allowed unsigned as well, but for different reasons)


> silent unintended underflow behavior

Please show me an example in the wild that could not be trivially caught/alleviated. With 'trivially' I mean the code that needs to be written to prevent it, not the thinking you need to do to consider it.

As for the latter: this is what I get paid for. Writing the code out is necessary but not what I really get paid for (unless I were a very slow typist, that is).

I find that most of the time I was simply too lazy. Or rather: I wrote C/C++ code for 20+ years. I used int and didn't think (unintended rhyme).

Now I write (mostly) Rust I find myself considering the edge cases regularly and also covering them because the compiler forces me to. See also this comment [1].

I.e. a - 1 + b where a is always non-negative (i.e. >= 0) and b is always > 1 'just' works with int (and unsigned [overflow] too!) in C/C++.

In Rust you will get a panic when a = 0 which will then make you think what you're doing and simply reorder this to a + b - 1.

I find that this is the 'cognitive load' that needs to be applied commonly and I'm pretty fine with that.
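
A minimal C++ sketch of that example, assuming 32-bit unsigned arithmetic (the helper names are made up; in a checked language like debug-mode Rust the first form would panic when a = 0):

```cpp
#include <cstdint>

// caller invariant: a >= 0 (trivially true for unsigned), b > 1
uint32_t risky(uint32_t a, uint32_t b) {
    // when a == 0, the intermediate a - 1 wraps to UINT32_MAX and the
    // subsequent + b wraps back; the result is right only by modular accident
    return a - 1 + b;
}

uint32_t reordered(uint32_t a, uint32_t b) {
    // no intermediate ever leaves the valid range, so this form also
    // survives a checked build (e.g. Rust's debug-mode overflow panics)
    return a + b - 1;
}
```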

> dangers of mixed signed/unsigned code

Alleviated by simply not mixing. There is a reason languages like Rust simply do not allow this without explicit typecasts. And those, in cases where one type is signed and the other is not and they have equal width translate to "all bets are off".

> more cognitive load to read and understand unsigned underflow detection code [...]

See first point. IMHO this is simply the same point expressed differently. Underflow happens when you subtract.

Ensure you do not subtract a larger value from a smaller. This is as easy as writing max(a, b) - min(a, b) or what author of [1] does.

Same as abs(a - b) but works with unsigned. Cognitive load? I do not see any and I agree with the author of the article that this is easier to read.
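
A sketch of that underflow-proof absolute difference in C++:

```cpp
#include <algorithm>

// absolute difference that never underflows, even for unsigned operands:
// the subtraction is always larger-minus-smaller
unsigned absdiff(unsigned a, unsigned b) {
    return std::max(a, b) - std::min(a, b);
}
```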

[1] https://news.ycombinator.com/item?id=29767877


This is one of those "If everyone would just do X" kinds of situations, which is why I'm against it. Things go wrong all the time - in fact it's a miracle anything works at all. So when I see "code that needs to be written to prevent it", I let out a sardonic laugh; people just don't work that way.

People WILL forget to check, or implement the check wrong (and it won't be discovered until the next black swan event), or some invariant will change in future and nobody will know to update this code. The code doesn't care because the compiler doesn't enforce the invariants, so you're left with bare and likely confusing (and probably eventually undefined) behavior.

I for one would LOVE it if popular languages could be instructed to inject invariant enforcement code (not asserts) at certain boundaries. It would help with a whole host of problems!

When you cross zero with a signed integer calculation, it's easy to find out, and the code has low cognitive overhead. Also, with code generally structured this way, bad input propagation tends to get stopped earlier because of the number of functions that check against negative values.

It's also possible to underflow check with unsigned, but the code is more complicated and can't stop propagation if it fails, and if you are allowing values > the signed positive limit (or unwittingly allowing it), your overflow detection code gets even more complicated (or wrong) due to the reduced overhead space. You COULD do a bunch of checks beforehand, but that's not always feasible due to the combination of operations being done on the operands, and once again people are people.

Even though many modern languages block uncast type conversions, that still doesn't protect you from converting a negative integer to an unsigned integer. Yes, you SHOULD be checking before the cast, buuuuuuut...

What we want to do is reduce the number of ways things can go wrong, reduce the fallout of an errant process, and minimize cognitive load to reduce bugs overall. Making that worse just to double the positive integer space doesn't make sense to me.


Here's an example:

  unsigned int length_a = 10;
  unsigned int length_b = 20;

  if (length_a - length_b < 0)
  {
      printf("negative diff\n");
  }
  else
  {
      printf("positive diff\n");
  }
I would argue that there is plenty of code like that in the wild. It looks innocent and harmless but it gives a completely surprising result.


If you ignore the compiler warning. This is trivial for the compiler to detect.


only in the most trivial cases

https://gcc.godbolt.org/z/YWPP5qdbx


I don't understand why the programming language can't promote the temporary to a larger signed int. Of course you can't change it in C anymore, but it's like this in most languages. But you rarely ever want unsigned modulo arithmetic here (and if you do, it should be special syntax).


-Werror


"This value should never be negative, so I'll use an unsigned type." There are two types of programmers: ones who are reassured by that reasoning, and ones who shudder at it.


Yeah, "should not" and "can not" aren't the same at all.


In C or C++, if you write a function void api(unsigned), you can cheerfully call it as api(-42) or api(b - c) where b and c are ints.

This is a poor, poor substitute for contract checking.
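
A sketch of that silent conversion (the api helper is hypothetical; the wrapped values follow from the modulo-2^N conversion rule for unsigned types):

```cpp
#include <climits>

// hypothetical API whose parameter "should" never be negative
unsigned api(unsigned n) { return n; }

// both of these compile without a cast in C and C++:
//   api(-42);    // -42 silently converts to UINT_MAX - 41
//   api(b - c);  // a negative int difference converts the same way
```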


Types aren't always expressive enough for that. I usually can't specify that a number should always be negative for instance, or that there's no 0.

My preference is to minimize surprise. If everyone is used to using signed types for everything, you better have a good reason for using something else, and leave that justification in a comment


> I think types should express intend.

Yes, and let's start with overflow behavior because it's much more relevant than the type domain.

If overflow isn't encoded by the type, there is just no way to correctly set a strict domain.


What you want is range subtypes of signed integers.


Important: without silent conversion. Languages like Pascal and Ada require explicit conversions in more places to make that sort of thing useful.


It also allows creating type checked "new" numerical types for semantics purposes to help improve interfaces. This helps prevent things like accidentally passing an integer in the wrong units for example. Two of the worst offenders I've seen of this is: "Is this uint64_t (or float) "dt" parameter in ticks, seconds, milliseconds, microseconds or nanoseconds?" and "Is this angle parameter in radians or degrees?"
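
One minimal way to sketch such a unit-checked wrapper in C++ (the Seconds/Milliseconds names are made up for illustration; std::chrono::duration is the production-grade version of this idea for time units):

```cpp
// made-up strong wrapper types: the wrong unit no longer compiles
struct Seconds      { double value; };
struct Milliseconds { double value; };

// the signature now documents and enforces the expected unit;
// speed(10.0, Milliseconds{2000.0}) is a compile error
double speed(double meters, Seconds dt) { return meters / dt.value; }
```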


And why stop there? What if I have a function that accepts values in the range [0, 100]? Make a type to express that? What if I have a function that takes only positive integers (i.e. unsigned integers minus 0)? Chances are that you should validate your input anyway. Also, what if a function takes an unsigned and a signed value is passed? In most situations you only get a warning (or not even that) and the function receives a completely wrong input. Isn't it better to take an int and validate it before doing anything with it?

You see, in the end it doesn't make a lot of sense to distinguish between positive and negative. From a mathematical point of view, negative integers have had the same standing as positive integers for a couple of centuries. The same operations that apply to positive integers apply to negative integers, and not only that: an operation between two positive integers can just as well result in a negative integer!

The unsigned type was created just to have one more bit if you *are sure* you have only positive integers. That may have made sense in the era of 8/16-bit computers. Now that we have 32- and 64-bit computers, does that extra bit really count for much? I don't think so... 2 billion is plenty for most usages.


> Integer overflows are so rare that I can't even remember one happening with 64-bit values. Underflow on the other hand... I've encountered that plenty.

I totally agree!

After some horrible bugs caused by unsigned integers in innocent looking C/C++ code, I'm 100% in the signed camp. The only time I use unsigned is when I need well-defined overflow behavior (which is rare).

As a bonus, using signed loop counters in C/C++ code enables certain kinds of compiler optimizations (based on the fact that signed integer overflow is undefined behavior): http://blog.llvm.org/2011/05/what-every-c-programmer-should-...


The loop optimization you point to can be done for unsigned loop counters even without exploiting undefined behavior. That's in part why the standard C practice was to use unsigned loop counters.


Fully agree with this, although one case where Java’s lack of unsigned integers is a quite bad is in representing binary data because the byte type is signed. I recently had cause to write some Java code that did a lot of modular arithmetic with arrays of bytes and oof, what a mess. There were four or five subtle bugs caused by the unexpected signedness of the byte type. I also wrote the same thing in Julia using UInt8 vectors and had no such bugs.


Java technically doesn't even have non-array bytes (or shorts); all types are just integers, either 4 or 8 bytes. So if you see a declaration of byte/short (and oftentimes char), it's likely a mistake.

The correct way is using ints, e.g. "int x = b[i] & 0xff;". Pretty much you should do math with int and long only. java.util.Integer/Long nowadays have static methods to work with unsigned types if you need them.

Once you remember that you need (the annoying) "0xff", you'll be free from errors. It has been there for 24 or so years. Other than that wrap your arrays into ByteBuffer and use provided utility.
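
The same pitfall exists in C and C++, where a sign-extending byte type plays the role of Java's byte; a sketch of the masking idiom (helper names are made up):

```cpp
#include <cstdint>

// int8_t sign-extends on promotion to int, just like Java's byte
int without_mask(int8_t b) { return b; }

// masking keeps only the low 8 bits, recovering the unsigned value
int with_mask(int8_t b)    { return b & 0xff; }
```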


Citation needed? Java primitive fields are padded for alignment, but that's not unique to java.

https://www.baeldung.com/java-memory-layout#layout


It's pretty much a part of the bytecode spec [0]. The stack model has only 4- and 8-byte operands. Yes, it can convert int to byte, as method signatures actually need the cast. However, an operation involving two bytes (say add, sub, shift) results in an int and requires a downcast back to byte. "b += anotherByte" sums b and anotherByte and then downcasts automatically.

The memory layout of the classes is quite immaterial here, as the operations happen within CPU registers.

[0]: https://docs.oracle.com/javase/specs/jvms/se12/html/index.ht... [1]: https://en.wikipedia.org/wiki/List_of_Java_bytecode_instruct...


This is true, but if your runtime is actually running stack-based interpretation of your Java code (rather than JIT) you have bigger performance problems.

    static int square(int);
       0: iload_0
       1: iload_0
       2: imul
       3: ireturn
is Java; compare to the C++ equivalent (using unsigned to avoid any possibility of undefined behavior):

    unsigned char square(unsigned char num) {
      return num * num;
    }
Which compiles to:

        mov     eax, edi
        imul    eax, edi
        ret
on x86

        mul     w0, w0, w0
        ret
on arm.

By this definition, c and c++ don't have "real" byte types either, because they undergo integer promotion for operations, and the generated code is operating on register sized values, not bytes. Java bytes use 8 bits as part of an object (with a fixed-cost padding); 8 bits as part of an array, etc. The fact that the bytecode has intermediate casts is not relevant to the actual code executed on CPU.


> By this definition, c and c++ don't have "real" byte types either, because they undergo integer promotion for operations, and the generated code is operating on register sized values

They're not actually operating on register-sized values: the code labelled x86 is also the x64 code, meaning GCC operates on 32b when the native registers are 64b. Likewise ARM.

Your assertion is even more debatable when Clang actually operates on 8-bit registers on x64:

        mov     eax, edi
        mul     al
        ret
(it does not on ARM64 as `mul` is only defined for 32 and 64b)


I don't write Java, but a good way to deal with this might be to cast to int (a larger type) immediately after loading a byte and before working with it. So, complexity-wise, the overhead of signed byte vs unsigned byte should be at most two casts, and often casting to a larger type would be required anyway even for operations with unsigned byte values.


> I can't think of a single time I've heard someone complain "I wish Java's long type supported up to 16 quintillion instead of a paltry 8 quintillion"

That Java has no support for unsigned integers is a very common complaint.


It's one of my biggest complaints attempting to do systems programming in the language, that and the lack of 64bit mmap.


I have no issue with the lack of unsigned - that's ok. It's a trivial workaround, as it's rare enough. However the lack of 64bit indices over memory mapped files blows indeed, with no easy (performant) replacement. It's doable but it'd require a shim over an array of ByteBuffer with the double indirection and pretty much inability for the JIT to optimize away all the bound checks.


I've had to resort to rolling my own mapping layer using JNA. Fortunately, 64-bit mmap support is being added as part of the Panama foreign-memory project.


I've hit the 2147483647 limit of indices occasionally in VFX/Graphics with large datasets: memory efficiency was also important, so 32-bit was a necessity (random access is required, so can't dynamically pack), so I tend to use unsigned these days (uint32_t) within arrays, but native 64-bit lengths (normally size_t) as standard local variables, and I normally go with unsigned there as well.

Several APIs in the VFX arena (Pixar OpenSubD, Pixar USD, and Foundry's Katana GeoLib API) all use signed ints for indices which I've occasionally argued against, with the response being "Google's coding standard says don't use them".


Yup, and that's a case of an API not aging well. But even in that case, switching to unsigned is only kicking the can down the road, not solving the problem. You'd get another year or two and then hit the 4294967295 limit.

As for 64-bit values, unless you can foresee requiring > 8 quintillion, unsigned isn't buying you anything.


Resource consumption does matter though, for example CPU caches or the quantity of RAM within a system. If their working set does fit within 32 bits of index, the intentional design choice very likely correlates to a measurable improvement in speed due to cache locality and system resource utilization.


Indeed it does matter! As I said this is an example of an API that hasn't aged well (when they devised it, they probably didn't envision this happening).

There are a number of ways to handle this:

- Use unsigned 32-bit (gives you a little bit more runway, but you'll still hit the end hard)

- Use 64-bit values (ends your problem once and for all, but costs double the memory so now you have a new problem)

- Use 40-bit values (Gives you a lot more runway, but costs CPU)


TBH I think this is pure laziness.

I looked at OpenSubdiv extensively. There are no good reasons to use signed in the cases I opened the ticket for and making sure the code inside the lib does the right thing with unsigned arithmetic is trivial.

But someone still needs to do that and writing that "Google (essentially) says you don't need to" in their style guide is much less work. ;-)


It isn't about squeezing out extra range. A silent overflow mechanism with predictable behavior is useful and can't be cheaply replicated with signed numbers.

Timestamp arithmetic is a common case where unsigned integers with silent overflow is superior. You can avoid bounds checks and special cases by always computing a delta between a newer time and an earlier time. As long as you can guarantee that the sample period is less than the clock rollover period there will never be any ambiguity.

Ada tried imposing the purity of signed integers and had to add modular types to get these semantics because of their utility.
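
The timestamp pattern above can be sketched in C++ (32-bit counters assumed; the helper name is made up):

```cpp
#include <cstdint>

// delta between two samples of a free-running 32-bit counter; correct
// across a single wraparound as long as the true elapsed time fits
uint32_t elapsed(uint32_t later, uint32_t earlier) {
    return later - earlier;   // well-defined modular subtraction
}
```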


Actually I had a colleague that always complained to me that Java didn't have unsigned types. He was of course a C++ programmer.


c'mon java does have an unsigned type, just that it's 16bit only (char)


> For me as a language designer, which I don't really count myself as these days, what "simple" really ended up meaning was could I expect J. Random Developer to hold the spec in his head. That definition says that, for instance, Java isn't -- and in fact a lot of these languages end up with a lot of corner cases, things that nobody really understands. Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple. The libraries you have to look up.

http://www.gotw.ca/publications/c_family_interview.htm

Gosling on why Java only has signed arithmetic.


I’ve seen projects that use a mix of unsigned int, size_t, and off_t variable declarations with implicit conversions between them. This seems like a security hazard if you are handling user input because just checking the original input may be insufficient. size_t in general seems extra ridiculous because max(size_t) is a common sentinel value returned by methods and if you aren’t careful and add to max(size_t) it wraps around to -1.


A bonus is that you can enable UBSAN to automatically and efficiently catch signed integer overflow, without changing language semantics (since, unlike unsigned overflow, signed overflow is undefined in C/C++). Here's an earlier post of mine showing how: https://news.ycombinator.com/item?id=24578534


A lot of processors around you are still 16 and 32 bits. I’ve hit the overflow a lot too.

> At the end of the day, code is for people, so keep it easy for people.

This is a dangerous heuristic, because people here means exclusively developers. It’s easy for us to trade annoying but mostly harmless bugs for rare but harmful ones.


Yes, I hit it a lot during my Z80 days, too. I also had to do manual memory management using a memory-mapped page register. We all have our constraints we must operate under, and a bag of tools to help deal with those constraints.

But "unsigned by default" won't save you here; all it does is give you a little more breathing room for your special case code (and hey, maybe that doubled positive integer space is all you needed, in which case congrats).

But all of the other problems are still there. That's why I'm saying that "unsigned by default" is backwards. It should be "signed by default", and unsigned for special situations that warrant it.


> I can't think of a single time where I've dealt with numbers that exceed a trillion in either direction.

A big one is dealing with currencies like Japanese yen in financial software.


Absolutely! But how about a quintillion? Signed 64-bit integers give you 8 quintillion in the positive integer space, and unsigned gives 16 quintillion.

The question is: Will the difference between 8 and 16 quintillion matter enough to say "unsigned by default"?

But currency types need to be signed anyway, so I guess that's that ;-)


Unsigned does not mean "not negative", it means "modular arithmetic".

The only few cases you want that are e.g. hash functions, crypto, etc. In all the other cases it's a mistake, and the "patterns" shown in this article to circumvent the issues with unsigned just for the sake of using it are extremely ridiculous I think and much less readable than the normal code using signed integers.

If you have things that must never be negative, define a type that does that and gives an error (there are many good safe_int examples in C++) whenever an operation gives a negative number as in that case you've already lost and your business logic / input filtering / .... is wrong and needs to be fixed.

Anecdotally, I've written a few hundred thousand kloc of c++ so far and I've never ever had a bug due to signed overflow. Unsigned underflow OTOH... I'm just thankful for clang's ubsan to warn on it because of how many issues it caught. For the immense majority of programs you'll never have sizes close to 63 bits anyways since the CPUs we use barely have 52-bits of address space at most (and if you have more you're likely already using 128 bit sizes)

> Where unsigned does benefit here is when these are used as indices into an array. The signed behavior will almost certainly produce invalid indices which leads to memory unsafety issues. The unsigned way will never do that, it’ll stay bounded, even if the index it produces is actually wrong.

I'd take a negative index over a silently wrong index any day of the week. The first will be caught hyper quickly; the second will send money to the wrong account silently for a couple of years before anyone notices, and then you're in much deeper trouble.


The suggestion of using an unsigned loop counter starting at size-1 and decrementing until “i<size” is no longer true starts this article off with a “this is bad advice” flavor to it in my book. If I encountered that code I’d stare at it way longer than necessary because I’d be trying to figure out what cleverness is hidden in the bizarre/backwards way the loop continuation test is written. I don’t know why you’d do this to someone else who has to read your code.


It's an idiom - obviously very little known, likely less known than comparing floats to themselves being a NaN check.

The standard idiom for doing a reverse loop would be: "for (int i = size; i-- > 0; ) ...". As a side benefit it has lower register pressure. Some folks have issues reading that one as well.

Saying all that while not being an 'unsigned' fan; the popularity of an idiom comes with its frequency of use, and given that 'unsigned' is not popular at all...
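
The idiom in context, as a C++ sketch (the function name is made up):

```cpp
#include <cstddef>

// "i-- > 0" reverse loop: i starts one past the last index, the
// decrement happens before each test, so the body sees size-1 down to 0
// and the unsigned counter never underflows (the loop exits first)
int reverse_sum(const int* data, std::size_t size) {
    int sum = 0;
    for (std::size_t i = size; i-- > 0; )
        sum += data[i];
    return sum;
}
```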


reverse iteration seems pretty basic …


> Unsigned does not mean "not negative", it means "modular arithmetic".

In which case it has a terrible name. The name clearly indicates that the primary feature is that it is not signed, and the logical conclusion from that is that it is not negative.

That said, I agree; using signed types in C/C++ does generally seem safer and a majority of the techniques listed in the article are unsafe, logically confused, and/or inhibit optimisation. Using signed types in languages like C# is also generally more pleasant, because otherwise you end up with a lot of casts just to interact with the base types (like arrays), language features and third-party libraries.


> The name clearly indicates that the primary feature is that it is not signed, and the logical conclusion from that is that it is not negative.

No, the logical conclusion is that since it has no sign, you don't know whether it's positive or negative. Otherwise the name would be "positive" or "non-negative", not "unsigned".

And this conclusion would also be correct.


Signed also tends to mean modular arithmetic. It's just with a different definition of modular arithmetic than the Euclidean one. See Boute's 1992 paper "The Euclidean Definition of the Functions div and mod" [1] for details on the various definitions.

Both signed and unsigned arithmetic are using finite fields, instead of Abelian rings as most people expect.

[1] https://biblio.ugent.be/publication/314490/file/452146.pdf


> Both signed and unsigned arithmetic are using finite fields,

This is not correct. They are rings with some additional (not mathematically standard) operations thrown in.

'Finite field' means something very specific, and quite different. https://en.wikipedia.org/wiki/Finite_field

Computer integer arithmetic hardware can be used to implement finite field arithmetic, but even the 2^n case takes quite a bit of trickery. https://en.wikipedia.org/wiki/Finite_field_arithmetic


>The name clearly indicates that the primary feature is that it is not signed, and the logical conclusion from that is that it is not negative.

Years ago, when the names were coined, CPUs didn't even have proper signed instructions. There was a sign flag, and that was all. In that regard unsigned was the natural, CPU-sympathetic type.


And yet the default type for variables and function arguments in C was (before it got deprecated) signed int, not unsigned.


signed/unsigned stuff predates C, though. My (lack of) experience with C came much later, when I had used assembly for 3 different CPUs already, so no recollection of the times.


If you mean the default being signed, it's there because B - the predecessor of C - only had a signed integer (machine word) type. But note that this is about the case when the variable has no type at all - e.g. "auto x" is a signed int. They still had the choice of "int x" being equivalent to either "signed int" or "unsigned int" by default, though - since there's no backwards compatibility issue there - and chose signed.

FWIW, signed seems to be the default for most other languages from that time period, and often the only option - e.g. Algol-60 didn't have unsigned integers at all. Pascal kinda sorta did, if you defined an integer subtype with 0 as the lower bound... but the upper would still be that of signed int, and it'd behave as such. And Algol-68 called its unsigned type BITS - a pretty strong hint that they didn't think of it as arithmetic.

I'm actually kinda curious now as to when the concept of unsigned integers first appeared in a high-level programming language first.


Interesting perspective. I don't have the same experience. To me a wrapping 32-bit index on a 64-bit machine is much more likely to segfault, since it reads 4 billion values forward rather than a few bytes backwards, and therefore I find unsigned index wraps easier to debug. In all fairness, wrapping indices are an extremely rare bug for me, so I can't say I have much experience debugging them.


I think when you use a variable that can have values in the range of millions, it's prudent to use 64-bit types and not 32-bit unless you are very much memory-constrained. And I've never seen a 64-bit signed variable overflow.


> In all fairness, wrapping indices are an extremely rare bug for me, so I can't say I have much experience debugging them.

Do you run your code with -fsanitize=undefined -fsanitize=integer ?


No I do not. I run ISO conformant C without options.


You really should try it on your codebase if you have the occasion


Unsigned means "modular arithmetic with a modulus that you do not choose and that is implementation-defined unless you have pinned down the bit width of the unsigned type, and that has implicit conversions from signed types that drastically alter the value".

Note that when mathematicians work in the area of congruences, they do not use unsigned integers.

For instance, when we look at Euler's Theorem:

https://en.wikipedia.org/wiki/Euler%27s_theorem

all the quantities in the formula can be understood as just integers, not unsigned integers.

The exponentiation of a can be understood as regular exponentiation. The modularity plays out in the triple-equal-sign operator and the (mod n) parenthetical part which says that the two sides of the equation are equivalent in a particular way.

The left side of the equation uses ordinary exponentiation and can produce values >= n; it is not wrapped.

It's crystal clear that we cannot replace the triple equal sign with the regular one, and drop the (mod n).


They also don't use signed integers, in the sense that computer languages use the term. Signed integers are also bounded in extent, and for most systems wrap on over/underflow. Euler's Theorem doesn't hold for them any more than it does for unsigned integers.

Don't treat computer "integers" like mathematical integers. It leads to pain. They're members of finite rings.


>Unsigned does not mean "not negative", it means "modular arithmetic".

Signed is modular arithmetic with an offset.

An unsigned char called i can be thought of as "i mod 256"

A signed char called i can be thought of as "((i+128) mod 256) - 128"

All results for over/underflow hold for 'i' cast into a char if you do the above in both cases.

I think the premise of the article is quite reasonable. It's slightly easier to reason about over/underflow for unsigned arithmetic since it's not offset.
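
That offset view can be checked directly; a C++ sketch using a true mathematical (non-negative) mod, with a made-up helper name:

```cpp
// the "((i + 128) mod 256) - 128" view of a signed char;
// the double-% trick makes the mod non-negative for negative inputs
int as_signed_char(int i) {
    return ((i + 128) % 256 + 256) % 256 - 128;
}
```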


> Signed is modular arithmetic with an offset.

it is not, it models integers (Z). The model only works in the bounds of what the platform can offer of course ; when you use "int" you say "this is an integer. In any case I'm supposed to encounter, adding two positive integers will yield a greater positive integer ; if I add numbers so big that my platform cannot represent the sum my program is meaningless anyways".

It is very unlike unsigned which models (much more accurately since it's much easier for our computers with finite memory) Z/pZ.


Signed and unsigned are actually treated internally the same on computers.

The reason you see -2 printed out when you add signed chars -1 and -1 together is because adding 255 and 255 together gives 254 under mod 256. The signed model that allows negative numbers relies on the fact that all numbers on computers are modular arithmetic.

So in that sense any argument that unsigned is modular and signed isn't is incorrect. You're better off choosing the one that forces you to consider how the computer actually operates under the hood.


> Signed and unsigned are actually treated internally the same on computers.

if you are writing code in C or C++, the computer you are programming for is the C/C++ abstract machine. C++ recently sanctified two's complement as the integer representation, but C does not, and supports machines with one's complement representation.


> C++ recently sanctified two's complement as the integer representation

While that's true, signed integer overflow on arithmetic operations remains undefined behavior.


One's complement is on its way out in C too. The next C standard will require two's complement.


> Unsigned does not mean "not negative", it means "modular arithmetic".

It does mean "not negative" in languages with bounds checks. Like Rust.


Is it called "unsigned" though?

In Racket (and in mathematics) we call those natural numbers. The very notion of "unsigned" was invented by video game programmers at Bell Labs. In mathematics positive numbers are still signed (they just have a positive sign).


> Is it called "unsigned" though?

Yes: https://doc.rust-lang.org/std/primitive.usize.html

> The pointer-sized unsigned integer type.

> In Racket (and in mathematics) we call those natural numbers.

Naturals have no upper bound. Most languages don't have naturals, because they do have upper bounds. Languages which do provide naturals (limited by the host machine) usually don't bother with signing segregation, and give you mathematical integers (again limited by the host machine's capabilities).


> The very notion of "unsigned" was invented by video game programmers at Bell Labs.

PL/I had (has) unsigned integers back in the 60s, picking up the idea from earlier languages. I believe the novel datatype in PL/I was the CHARACTER type; into the 80s I was still programming machines with variable length bytes.

In many ways PL/I heralded the current Ordovician stage of programming languages, with the enormous Cambrian flourishing of hardware-specific (and company-specific) languages slowly dying out. Interestingly the only surviving languages older than it, FORTRAN and LISP, are also machine-independent.


Unsigned types are supported in machine language.

Any language that offers access to features of the machine will support unsigned numeric types.

No language supports natural numbers. The best you can get is bignums. A language that calls its numbers natural is just lying. (Likewise, integers, and reals.)

"int" is not a lie, it is a hint.


Though, in idiomatic Rust I did not find the use of unsigned as mentioned in the article, signed for indexes is the norm.


> in idiomatic Rust I did not find the use of unsigned as mentioned in the article, signed for indexes is the norm

Huh? If you write things[n] in Rust, and n isn't usize that's an error, you can't index things with the other primitive types at all out of the box.


Wait, what?

Pretty much everything for indices (at least in the stdlib) seems to be 'usize' in my experience (i.e. Vec), which is unsigned.

Am I missing something?


> Unsigned does not mean "not negative", it means "modular arithmetic".

Unsigned does mean non-negative (or rather, it carries no indication of sign), since the sign of a number is just its positive or negative factor: 1 or -1.


> almost all code that uses signed integers to represent values that will never be negative, tends to have a cacophony of range assertions and other tests which are just as error-prone as bounds checking to remembering to write, but also maintain when refactoring. It’s truly underappreciated how much those tests can be eliminated if your integer can never actually become negative due to the type system itself.

This is something the Ada language gets right: you define your integer type with a range, and the compiler automagically inserts the runtime range checks. (Unless you switch them off with compiler directives.) It makes far more sense to bake the range-checks into the type than to do it the manual way and just hope you don't miss anything.

You can do this in C++ using a library (thanks to templates) [0] but this is very rarely done.

[0] https://www.boost.org/doc/libs/1_78_0/libs/safe_numerics/doc...


Ada provides integers and floats with range checks, and you can specify the underlying size, but intermediate values don't get checked. It also has a separate form of unsigned integers called "modular types" which expressly indicate wrapping, if that's your intent.

    -- Range-checked: only values 0 .. 15 allowed, stored in 64 bits.
    type Nibble is range 0 .. 15 with Size => 64;

    -- Every assignment is as if Value := (Value % 16);
    type Wrapped_Nibble is mod 16;
    type Small_Float is digits 4 range 0.0 .. 20.0 with Size => 32;
I agree that the subexpression issue is problematic, IIRC I thought this was checked in SPARK code, but I could be wrong. The big benefit is describing your intent and you know that the stored value is in range, though whether it went outside of that range in calculation might not be known. Another thing is that these types define 'First, 'Last, 'Pred, 'Succ, and 'Range attributes which you can use.

    for N in Nibble'Range loop
        -- do something
    end loop;


> I thought this was checked in SPARK code, but I could be wrong

I believe that's correct, see my other comment. [0] Does still feel a bit ugly though.

[0] https://news.ycombinator.com/item?id=29768848


Range types are not a panacea, especially combined with other arithmetic operators. There are essentially two options: [lo1, hi1) + [lo2, hi2) can result in a new type [lo1 + lo2, hi1 + hi2) and so on, or any operation reverts back to the base type. Only the former provides a static type guarantee but can be extremely annoying to use and you eventually have to constrain the type back with runtime checks. (Interval arithmetic also shares this problem.) Probably for this reason, to my knowledge, both Pascal and Ada implements the latter, which is more usable but also arguably useless: if your intermediate type is ranged you have to manually calculate correct bounds (a failure to do so will cause a runtime error), and if it's unranged there is zero guarantee even in the run time.


Somewhat related: there's apparently a defect in the Ada language spec which permits reordering of arithmetic operations such that out-of-range behaviour may hinge on the compiler's whim: [0]

> non-parenthesized arithmetic operations could be re-ordered by the compiler, which may result in a failing computation (due to overflow checking) becoming a successful one, and vice-versa. By default, GNATprove evaluates all expressions left-to-right, like GNAT.

Presumably one solution would be to use a three-address-code style [2] to completely tie the compiler's hands regarding ordering, but this seems painfully restrictive even by the standards of SPARK.

Also, an Ada compiler's internal choice of base type (with which to implement a range-based integer type) can impact overflow behaviour: [1]

> The choice of base types influences in which cases intermediate overflows may be raised during computation. The choice made in GNATprove is the strictest one among existing compilers, as far as we know, which ensures that GNATprove’s analysis detects a superset of the overflows that may occur at run time.

I don't know if there's a fully portable robust answer to this second problem. (I believe you can generally use hints to force the Ada compiler to use a particular sized type, along the lines of C's uint32_t, but that this isn't portable.)

edit On second thought, I imagine using a three-address-code style would solve this too. Either the result falls within the permitted range of the destination variable, or it doesn't.

[0] https://docs.adacore.com/spark2014-docs/html/ug/en/appendix/...

[1] https://docs.adacore.com/spark2014-docs/html/ug/en/appendix/...

[2] https://en.wikipedia.org/wiki/Three-address_code


For me, the main reason to prefer unsigned integers whenever possible is: less special cases.

When a signed integer is used as an array index, the value can be in one of three ranges: the <0 range, the >=0 && <size range, and the >=size range. To validate the index, you need two comparisons. When an unsigned integer is used as an array index, there are only two ranges: the <size range, and the >=size range, and you need only a single comparison.

And that's not the worst situation with signed integers. From a more general point of view, there are four "classes" of signed integer: positive (>0), zero, negative (<0), and INT_MIN. Everybody tends to forget about that last one, but it's "special" in that it can break things in unexpected ways. Negating it doesn't work (you'd expect negating any number less than zero to result in a number greater than zero, but for INT_MIN that doesn't happen). Dividing it can trap (see for instance https://kqueue.org/blog/2012/12/31/idiv-dos/ which has a couple of examples) or worse.

With unsigned integers, there are only two "classes", zero and non-zero, and it's common to not even need special treatment for zero, reducing the whole thing to a single "class" of values.
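To make the comparison-count point concrete, here's a minimal C sketch (the function names are mine, not from the comment):

```c
#include <stdbool.h>
#include <stddef.h>

/* Signed index: two comparisons are needed to validate. */
bool valid_index_signed(long i, size_t size) {
    return i >= 0 && (size_t)i < size;
}

/* Unsigned index: a "negative" value has already wrapped to a huge
   number, so a single comparison rejects both failure modes. */
bool valid_index_unsigned(size_t i, size_t size) {
    return i < size;
}
```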


>> To validate the index, you need two comparisons.

If your code doesn't have bugs, no validation of indexes is required.


> If your code doesn't have bugs, no validation of indexes is required.

Your code doesn't have bugs because you validated your indexes. What you said makes some sense when the index comes from within the program itself, but not when it comes from outside the program. Consider for instance the case of reading a data structure from a file or the network, where one field is an index or an offset into another part of the structure; you must validate that this index or offset is not outside the bounds, no matter how perfect your code is.


The number of programs ever written without bugs can be counted in a uint8.


While wrapping parts of OpenSubdiv for Rust [1] I noticed the C++ API was littered with uses of int for indexing/lengths (both are always positive) of geometry buffers.

So I opened a ticket with Pixar [2].

Some of the same strange arguments were brought forward (incl. the Google style guide BS). I wrote a longish reply [3] that mentions two comments [4,5] from an ill-guided article promoting use of signed in C++.

I closed the ticket in the end because I felt that if people really believe there are good reasons for signed in the cases at hand they haven't grokked the problem sufficiently – trying to convince them otherwise would not be a good use of my time.

[1] https://crates.io/crates/opensubdiv-petite

[2] https://github.com/PixarAnimationStudios/OpenSubdiv/issues/1...

[3] https://github.com/PixarAnimationStudios/OpenSubdiv/issues/1...

[4] https://www.learncpp.com/cpp-tutorial/unsigned-integers-and-...

[5] https://www.learncpp.com/cpp-tutorial/unsigned-integers-and-...


Quoting one of these comments:

> The argument with subtracting 5 from 3 is not an argument against using unsigned. When you use signed you can run into a similar problem when the result underflows. This happens much less frequently and causes UB at INT_MIN, which is the worst kind of bug. The bug is often not triggered during testing and you will only notice it when it is too late.

The "When you use signed you can run into a similar problem" part is just not true in our world. 3 - 5 happens much more often than anything that may cause signed .*flow.

    "signed underflow" "bug report"
yields 8 results on google FFS


No such thing as "underflow" for integers. Signed integers overflow in both directions. Underflow is a floating point concept.

Edit: https://en.m.wikipedia.org/wiki/Arithmetic_underflow


So what about an unsigned int going below zero? I think most in the comments here mean that by "underflow", but I could be mistaken (and it should correctly be called "overflow" as well).

I agree with the WP article saying integers more correctly "wrap around" (than overflow).


The C standard only describes integer arithmetic as "overflowing" when it results in undefined behavior. In that sense unsigned integer arithmetic never overflows; it "wraps around", and the behavior there is defined.

I can't actually find a normative definition for "overflow" in the C or C++ standards, but there are sections and notes that describe this interpretation:

C++: https://eel.is/c++draft/full#basic.fundamental-note-2

C: http://web.archive.org/web/20211231011139/https://cigix.me/c...

There can be subtle differences between various definitions of "overflow", and I imagine there could be other subtly different definitions outside C and C++. But "underflow" is exclusively a floating point concept, as far as I know.


API contracts aside, signed indices do have a major benefit: https://godbolt.org/z/q4eTdPdcd


Thanks! If you don't mind explaining, what are we seeing here?


After the initial setup block,

    c1 = block[i1]; c2 = block[i2];
    if (c1 != c2) return (c1 > c2);
    i1++; i2++;
becomes

        mov     cl, byte ptr [rdx + rcx]
        cmp     byte ptr [rdx + rax], cl
        jne     .LBB0_6
        lea     eax, [rdi + 2]
        lea     ecx, [rsi + 2]
if the indices are unsigned, but

        mov     al, byte ptr [rcx + rdx + 1]
        cmp     byte ptr [rdi + rdx + 1], al
        jne     .LBB1_6
if the indices are signed, that is the compiler just removes the indices and traverses the block directly (starting at the input offsets).

That's because in C (and C++) signed overflow is UB, so the compiler can assume it doesn't happen. Clang is apparently unable to make such a determination or work around it for unsigned, so for every increment of the indices it actually goes and computes the indices to fetch the items from the array.

Incidentally, GCC does not care and generates the exact same (rather different) code for both signednesses.


In this case the problem is more specifically the use of uint32_t. I never quite spelled out in the article that if you're going to use unsigned integers for this sort of thing, you should always use the native-word-size unsigned type, e.g. size_t. If you actually change that to uint64_t (or size_t) the code generation is better, as you can see here: https://godbolt.org/z/11nEz6EPT


I see no reason why the same addressing mode could not be used in both versions. This just seems like an optimization that has been made under the assumption that in the addressing mode

    [reg1 + reg2 + disp]
reg2 refers to a signed integer (Which may result in the effective address being < reg1 if reg2 is negative).

Obviously you would not want to make the same assumption if reg2 refers to an unsigned integer, because if the most significant bit is set, it would result in an effective address below reg1, which is definitely not what we would expect from adding an unsigned integer.

But in this case, 64-bit addressing is being used, and the value of reg2 is a 32-bit integer which has been specifically zero-extended. We can make the assumption that [reg1 + reg2 + disp] can never result in an effective address below reg1 (assuming positive disp).

If you take the signed version, and replace the lines

    movsxd  rdi, edi
    movsxd  rcx, esi
with

    mov  edi, edi
    mov  ecx, esi
I believe you will have something functionally equivalent to the unsigned version.

Of course, this same optimization could not apply if we were using uint64, but it could in the case of int64.

Either way, buggy code with no bounds checking is not a reliable method of determining what optimizations can be done in production.


The code is compiled as it is in order to handle the case where the indexes need to wrap around. Suppose i1=0xffffffff on entry - you then want to access block[0x00000000ffffffff], block[0x0000000000000000], block[0x0000000000000001]. I've written out all 64 bits of the offset, in the hope this makes the problem clearer: there's no addressing mode for getting the 32 bit truncation. That's why there's an LEA after each access. (Note that they write to the 32 bit part of each register.)

In the int32_t case, suppose i1=-1. Then you want to access block[0xffffffffffffffff], block[0x0000000000000000], block[0x0000000000000001], and so on. You can do an initial sign extension of the indexes, and then the base64+index64+disp32 addressing mode gives you the right result.

Make the unsigned version take uint64_t indices, and you get the same code for both. Or #include <stddef.h>, and make it take size_t - same thing. This is exactly the sort of thing that size_t is there for.


Author here. I believe many of the complaints here are concerned with what is more common, rather than with what is universally better over all possible inputs, which is the point of view this article is written from. The issue with the "general case" is that exploits are never actually found there; they're always found in the edge cases, such as large or "pathological" values. Signed integer arithmetic has more of these edge cases than unsigned when it comes to sizes, indices, and offsets used in expressions controlling memory (directly or indirectly), which is the most common application of integers in a codebase. The native-word-size unsigned integer type covers the full numeric range for those operations, while signed simply cannot. It also excludes a whole numeric range that is simply incompatible with those operations (negative values), and with it a whole class of issues related to undefined operations. The only real edge case where unsigned is more error-prone is values close to zero under subtraction, and that's relatively easy to account for, which I go into great detail to explain.


Which is why the only sensible option is a proper numeric tower.

Then the common cases are fast and the edge cases slightly slower, but everything works the same.


The reason I've always preferred unsigned is that it's easier to write code that is correct for the full input domain.

Consider for example:

  extern void g(int); 
  void f(int x, int y)
  {
      g(x - y);
  }
The function f obviously doesn't work for all values of x and y: Some values will overflow, some values will underflow.

How do you fix that? Writing an explicit check for whether x-y overflows is highly impractical, no one does that.

Or you could document the limitation using pre- and postconditions that narrow down the range of allowed values to something you know will work:

  extern void g(int); 
  void f(int x, int y)
  /*
      PRECONDITION: -1000000 <= x <= 1000000 and -1000000 <= y <= 1000000
  */
  {
      g(x - y);
  }
No one does that either. I don't know why. There's no principled reason why you couldn't meticulously keep track of signed int input and output ranges. But that's how it is, no one does it, valid int ranges are always unstated.

Now for the unsigned version of the same problem:

  extern void g1(unsigned); 
  extern void g2(unsigned); 
  void f(unsigned x, unsigned y)
  {
      if(x >= y)
          g1(x - y);
      else
          g2(y - x); 
  }
Assuming g1 and g2 are well-defined for all unsigned inputs, so is f. That was easy.

It may look like more code, because there are two g functions now. But chances are you probably need to treat the x<y case differently somewhere anyway; in the larger scheme of things, unsigned is not more code.


What you should be doing for checked arithmetic with GCC is use the builtins for those that aren’t aware;

https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...


It would be nice to see something like that standardised so you could use it in portable code.

It doesn't solve the problem though. You still need to write some code for what to do if there's an overflow, and you need separate handling for underflow. So the signed case is now:

  extern void g1(int); 
  extern void g2(int); 
  extern void g3(int); 
  void f(int x, int y)
  {
      int x_minus_y;
      if(!__builtin_sub_overflow(x, y, &x_minus_y))
         g1(x_minus_y);
      else if(x > y)
         g2(...); /* overflow */
      else
         g3(...); /* underflow */
  }
Work in progress. I've given up on figuring out what arguments to pass to g2 and g3. Since the difference won't fit in an int, you would need to offset-adjust the value somehow in order to fit it in an int. Seems messy. Maybe you can think of something simpler.


Tangentially related:

https://docs.microsoft.com/en-us/answers/questions/680467/ex...

Had Microsoft used an unsigned int to store the date - which can't ever be negative in this notation - Exchange wouldn't have stopped working worldwide on 1/1/2022. They would have been safe for another two decades, in which they would hopefully had either gotten rid of that weird code or switched it over to 64 bits.


Had Microsoft used common sense, they would not store the string representation of a date as an integer, period.


It depends on the use case. If this is an index that's commonly sorted by and enumerated, the performance impact of a string compare versus a simple 64 bit integer compare would be enormous. A single compare instruction versus a series of unicode character comparison operations with subsequent jump instructions would not only be more code, it would also be more likely to wreck the branch predictor of the CPU with all the added conditionals.

I think using 64 bit integers for their natural hardware-accelerated sortability is quite a neat trick. You just have to be sure to use actual 64 bit numbers, and you have to think ahead to check if your data type is wide enough to actually contain your data.


> They would have been safe for another two decades, in which they would hopefully had either gotten rid of that weird code or switched it over to 64 bits.

Why? The code would have kept working all along, there would have been no reason to explore it, and lots of other issues to chase.


This ought to be the canonical example for this sort of thing -- in the sense the unsigned value only* 'helps' if you are already doing something terminally silly.

*to some definition of 'only' that excludes the cases like cryptography, hardware, etc stuff.


This is a misguided and dangerous article.

Suppose that a, b and c are small values close to zero and so can be combined additively/subtractively in any combination without overflow. If they are signed, we can rely on algebra, like:

  a - b > c       // this expression
  a - b - c > 0   // can become this expression
  a - c > b       // or this
This breaks for unsigned; you cannot just change the sign of c and move it to the other side of the inequality.

Even when you're not actually doing this algebra in the code, it's useful to be able to do it outside of the code. Sometimes in the code too.

For instance, suppose now that the values are not trivial. We have no assurance that a - b will not overflow, but we do have assurance from somewhere that a - c will not. The rewrite a - c > b is then helpful in avoiding overflow, and easily provable to be equivalent to the a - b > c expression that naturally comes from the problem specification.
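A small C illustration of the rearrangement breaking near zero (the helper names are illustrative):

```c
/* Both forms of the comparison, with unsigned operands. Algebraically
   equivalent, but near zero the subtraction wraps. */
static int original_form(unsigned a, unsigned b, unsigned c) {
    return a - b > c;   /* for a=1, b=2: 1 - 2 wraps to UINT_MAX */
}

static int rearranged_form(unsigned a, unsigned b, unsigned c) {
    return a - c > b;
}
```

Away from zero the two forms agree, which is exactly what makes this easy to miss in testing.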


> It’s very trivial to check if x * y would overflow though and you should just learn the extremely simple and obvious test.

    if (y && x > (T)-1/y)
I think it's sarcasm, but with C guides like this it's never certain :)


Maybe this is not a problem for calculating the size of a memory allocation, but that division operation is going to be several times slower than the multiplication it is guarding. Is there a faster way to check for overflow, perhaps using compiler builtins?
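For what it's worth, on GCC and Clang the checked-arithmetic builtins avoid the division entirely; a minimal, non-portable sketch (the wrapper name is mine):

```c
#include <stdbool.h>
#include <stddef.h>

/* GCC/Clang builtin (not standard C): reports overflow from the
   multiply itself, so no guarding division is needed. */
bool mul_would_overflow(size_t x, size_t y, size_t *product) {
    return __builtin_mul_overflow(x, y, product);
}
```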


Maybe I'm just a moron but what makes this test "obvious"? What is T?


i imagine it is the type of x and y


I agree that signed is generally inferior and less often what you really need. For me, in C and C++, the main arguments for 'unsigned by default' are:

(1) signed integer overflow/underflow is undefined behaviour. You don't want that, it's evil. Unsigned does not have that problem (often it is still a bug if you over- or underflow -- but at least you get sane code from the compiler).

(2) size_t is unsigned and is returned by sizeof() and is the argument to memcpy(), strcpy(), etc., and with -Wconversion, you do not want to handle all the warnings when mixing this with signed loop counters or size variables.

Unfortunately, one problem with using unsigned by default in C is that ptr1-ptr2 yields ptrdiff_t, which is signed. Which is bad, in my opinion, because I usually want to find an array size or index when I subtract two pointers, so I want size_t, and in these cases, I always know which pointer is larger.

Oh well, it's C.
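A small sketch of the ptrdiff_t wart (the helper name and its precondition are mine):

```c
#include <stddef.h>

/* Pointer subtraction yields the signed ptrdiff_t; when you know which
   pointer is larger, a cast recovers the size_t you usually want. */
size_t element_distance(const int *hi, const int *lo) {
    return (size_t)(hi - lo);  /* precondition: hi >= lo, same array */
}
```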


The one thing I really, really hate Java for is its lack of unsigned integers (there are a lot of other minor inconveniences, like in any language, but nothing comes close to the missing unsigned types).

This of course has to do with me frequently writing and maintaining low-level hardware and network protocol code, talking to devices on the other end that had firmware written in C and therefore expect you to be fluent in bit manipulation. But hey, all that nice software is useless if it can't eventually interface with the real world out there, so hardware will stay relevant and necessary and the accompanying close-to-the-metal code can't be optimized out of the system, hence I consider good tools for doing bit-level work an important quality in any general-purpose programming language. And besides bit-level operators that means support for unsigned integers which are needed to easily combine bit-level with classic arithmetic operations.

You can actually live without either of these in practice and still do low-level stuff, but it is very unintuitive, has poor runtime performance and leads to hard-to-read and bug-prone code.


While Java doesn't have unsigned types, it at least with java8 got functions to operate on ints as if they were unsigned.

But Kotlin (also running on the jvm) got unsigned types with built in operator overloading for them, making it easy to work with. "val a = 100u" gives you an unsigned Int. If you use Java, there are few reasons not to use Kotlin. https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/-u-int/


> has poor runtime performance

Are you referring to performance issues caused by lack of unsigned types? Do you have a specific example? Based on my observations, the conversions necessary for acting on signed types as if they were unsigned generates the correct machine instructions designed for unsigned types.


I really don't buy that it is a "trivial idiom" to check for unsigned underflow by writing 'i > size'. I'd have to check that was right all the time.

Also, in general, saying all we have to do is (I'm paraphrasing a bit) "not make mistakes" is, to me, a bad idea, as all programmers make mistakes.

I sometimes assert after a calculation that a value is >= 0; it's much harder to check that if it has wrapped around because it was unsigned.


> for (size_t i = size - 1; i < size; i--) {

Am I alone here in feeling that this, while working, is anything but intuitive and is asking for trouble down the road? Just imagine having to change the countdown in maintenance, for whatever reason, to stop at 1 instead of zero, and suddenly you're fiddling with the comparison operator.


You are not alone! This is completely unclear: the next guy reading your code will definitely "fix" the halting condition. I'd either handle i == 0 outside of the loop, or use something like:

    for (size_t j = size; j > 0; --j) {
        size_t i = j - 1;
        // etc.
Or even:

    for (size_t j = 1; j <= size; ++j) {
        size_t i = size - j;
        // etc.


I agree with this almost entirely, and I use almost exclusively unsigned ints in C.

However, it's not true that unsigned is faster because the compiler can optimize. Consider:

x = (x * 2) / 2;

If x is unsigned, and overflows, x will not be the same after this operation so the compiler can not optimize away this line. If x is signed and overflow is undefined, the compiler can assume it wont overflow and optimize away the line. UB affords the compiler a lot of optimizations.
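The wrap-around that blocks the unsigned optimization is easy to demonstrate (assuming a typical 32-bit unsigned int; the function name is mine):

```c
#include <stdint.h>

/* (x * 2) / 2 is NOT the identity for unsigned: wrap-around is defined,
   the high bit is simply lost, so the compiler must keep the arithmetic. */
static uint32_t double_then_halve(uint32_t x) {
    return (x * 2) / 2;
}
```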


> it's not true that unsigned is faster because the compiler can optimize.

It's not false either. For instance, casting int32_t to int64_t requires an instruction like movsxd, which sign-extends. Casting uint32_t to uint64_t can often be merged into the instruction that computed the uint32_t value, because an instruction writing a 4-byte register like eax zeroes out the upper 4 bytes of the corresponding 8-byte register rax.


Thanks! I didn't know that!


The article acknowledges this:

> There are some optimizations compilers can make assuming signed integers cannot underflow or overflow that unsigned does not get to participate in.


The argument is really strange. It says that unsigned ints can't become negative and it's a good thing. But they do that by instead having crazy high values. How is that a win?

You could argue that it doesn't matter because an incorrect program is incorrect regardless. True. At least with signed integers you can represent differences between positive values. How do you do that with unsigned?


Signed integers may also have crazy high values so you're already on the hook for dealing with those. The case for unsigned values is that it frees you from having to think about negative inputs. We all know to check for divide by zero; how many remember to check for INT_MIN / -1?

Regarding differences, in practice you usually want those differences to be non-negative. With uints, you compare first. Not:

    int length = end - begin;
    if (length < 0) return error;
but:

    if (begin > end) return error;
    unsigned length = end - begin;
This is also a good habit for signed ints, since it avoids underflow in more (but not all) cases.

An old post with a few fun crashes for negative inputs: https://kqueue.org/blog/2012/12/31/idiv-dos/
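A sketch of the INT_MIN / -1 guard mentioned above (the helper name is mine):

```c
#include <limits.h>
#include <stdbool.h>

/* Guarded signed division: both b == 0 and INT_MIN / -1 are undefined
   (the latter traps with SIGFPE on x86). */
static bool safe_div(int a, int b, int *quot) {
    if (b == 0 || (a == INT_MIN && b == -1))
        return false;
    *quot = a / b;
    return true;
}
```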


Might be of interest: https://www.youtube.com/watch?v=Puio5dly9N8&t=587s IMHO Bjarne's comment sums it up nicely ...

"Use ints until you have a reason not to. Don't use unsigned unless you are fiddling with bit patterns, and never mix signed and unsigned.."


Eh, but C++ std likes to use unsigned size_t for sizes...


Several comments in this thread take the POV that "integer overflows never happen in practice" but of course they do! https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=integer+ove...

Of course uints overflow too, which just means arithmetic is hard, and it's often (not always) made harder when you have to consider negative values. Bounds checking, average, division, etc. INT_MIN is a horrible edge case which is easy to forget.


I opened one at random and it's a uint overflow lol.

https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-1087...


The patch is completely bogus:

https://github.com/teeworlds/teeworlds/commit/4d529dcd2d0102...

Here it tries to avoid overflow by assigning the result of 32 bit arithmetic to a 64 bit type. That's a common mistake.
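The bug pattern described here, sketched in C (function names are illustrative):

```c
#include <stdint.h>

/* Bug pattern: the multiply is done in 32 bits and wraps BEFORE the
   widening assignment to the 64-bit type. */
static uint64_t buggy_product(uint32_t a, uint32_t b) {
    return a * b;             /* 32-bit multiply, then widened */
}

/* Fix: widen an operand first so the multiply itself is 64-bit. */
static uint64_t fixed_product(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}
```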


What's more important than preaching about signed vs unsigned is raising awareness not to use both.

Most signed/unsigned related bugs (outside numerical/math sensitive domain) are due to using both simultaneously and the side effect is of course going to be overflow/underflow.

On the other hand, I'd favor int over unsigned as it's more practical when dealing with memory address arithmetic & array indices due to address difference being signed.


The argument that most signed integers never use the negatives is a bit misleading, as I'm sure most unsigned integers never use the upper half of their range either


I'm in the "almost always signed camp" - at least in languages without robust over-/underflow checking.

But I sometimes wonder if it would have been better to not carry the 'signedness' in the variable type in high level languages, but instead treat integer variables as 'signless' bags of bits, and signed-vs-unsigned are just different views on those bags of bits, just as in assembly.

Most operations on two's-complement numbers result in the exact same bit pattern, no matter if the involved operands were 'signed' or 'unsigned', and the signedness only matters when printing the result, or for explicitly extending the sign bit when casting to a wider type.
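On a two's-complement machine the identical-bit-pattern claim is easy to check for addition (the helper name is mine; the signed inputs are chosen so the signed add itself doesn't overflow):

```c
#include <stdint.h>
#include <string.h>

/* Add -5 and 3 as signed, then add their bit-identical unsigned
   counterparts, and report whether the resulting bits match. */
static int signed_unsigned_add_bits_match(void) {
    int32_t sa = -5, sb = 3;
    int32_t ssum = sa + sb;              /* -2, no overflow */

    uint32_t ua, ub;
    memcpy(&ua, &sa, sizeof ua);         /* 0xFFFFFFFB */
    memcpy(&ub, &sb, sizeof ub);         /* 0x00000003 */
    uint32_t usum = ua + ub;             /* 0xFFFFFFFE */

    return memcmp(&ssum, &usum, sizeof usum) == 0;
}
```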


Because when C was standardized in 1989, there were a few vendors that wanted to support non-two's-complement hardware, and C has been stuck with that ever since. But more to the point, even on two's-complement hardware, the CMP instruction doesn't care about the signedness, but it typically sets flags (for those CPUs that have flags) for both signed and unsigned results, and there are different conditional branches one can take for less-than, less-or-equal, greater-or-equal, and greater-than.


I often wished I had access to the flag bits in C (or at least a somewhat portable subset). Would be very handy for writing CPU emulators :)


Having written a CPU emulator [1], I don't think it would have been of much use. The Intel CPUs typically don't set the flags on a MOV, but Motorola CPUs typically do.

[1] https://github.com/spc476/mc6809


I haven't seen anyone mention the main reason I end up using signed: to allow for an "out-of-bounds" value to indicate "value not set". As long as the N-1 bit version of the integer is large enough, not having to use an additional variable to indicate "var not set" is really quite lovely. One could potentially replace the use of a negative sentinel value with a maximum sentinel value, and I've done this on occasion too (i.e. if (var == 2^N-1) var-is-not-set). But the cognitive distinction between negative == not set, positive == valid value is easier to deal with (for me, at least).


This the proper place for std::optional<T>.


That has quite the overhead for something that can be done using a single bit of the actual value. Roughly half the programming I do is high-performance, real-time constrained; std::optional is not really appropriate in that context.


There is a proper place for everything, and others for everything else.

You can make a zero-overhead analog of optional<int>, for such cases, for clarity and safety. (Unless you are also trying to present a C API for cross-language compatibility...) But it is tricky to determine the exact amount of handholding to provide for such a type, and extra work to remember exactly what it is. We all know how ints work.


> delta = abs(x - y);

> The argument is that unsigned is dangerous here because if y > x then you get underflow. The problem with this argument is it’s not valid because the code itself is simply incorrect regardless of the signedness of x and y. There are values for both x and y which will lead to signed integer underflow. So like before, in languages like C and C++, you just unconditionally invoked undefined behavior since signed integer underflow is undefined.

Yeah because x=1, y=2 is just as likely an input as x=MIN_SIGNED_INT, y=1.


Low-level programming, dealing with hardware registers, is so often unsigned that I use unsigned everywhere I don't need negative numbers. YouTube's view counter going 64-bit was inevitable, but overflowing 2^31 shouldn't have been what forced it. https://arstechnica.com/information-technology/2014/12/gangn...


> // But it also even has an explicit n * m form as well.

> make([]Object, n, m);

make([]T, n, m) does not allocate a slice of length n*m, but of length n and capacity m.


Choose the right tool for the job. Examples:

  - Loop using an iterator: pointer-sized unsigned (size_t)
  - Array index: pointer-sized unsigned (size_t)
  - Enum used to represent a register value that's 8 bits or less: unsigned 8 bit
  - ... used to represent a register value that's up to 16 or 32 bits: unsigned 16 or 32 bit number
  - Integer value that could be negative: Signed, with the appropriate size
  - Doing numerical operations and you have an FPU: Float. Unless you're careful, you know what you're doing, and it's either performance critical or you're using a dedicated fixed-point hardware peripheral... then maybe fixed point with a signed or unsigned integer.
  - Analog value that can only be positive; maybe a voltage: unsigned integer according to precision of the ADC etc
  - Analog value that can be positive or negative; maybe pH, audio, or something using a differential ADC: signed integer
  - Performing mathematical operations including subtraction, don't expect fixed-point errors to come up, and you don't want to use floating point or don't have an FPU: signed integer


I mostly use signed... but one of my primary activities is to use uint8, uint16, and uint32 for indexes and indexes into indexes etc. The workflow was designed around the various uint size limitations so that they are not exceeded. Uint32 is the largest chunk size that I use. The fact that it is twice as big as an Int32 is handy. It allows me to pass bigger chunks to the GPU. Plus I'm already working with uint8 and uint16 so it's consistent. I've not needed Uint64 or Int64 but it would vastly increase memory size for no benefit for my use case. Memory is cheap but I'm bottlenecked by memory bandwidth. I've looked into using variable ints but I haven't needed them yet. For things that I know I'm going to scan over having compressed variable ints of a diffed sorted bag would be great but I would need to write a ton of custom CPU and GPU code for it. You can index into variable ints with a post process provided a maximum number constraint and some padding on the end.

This sort of code is written very carefully and mistakes give completely wrong answers so they are noticed right away.


Signed integer overflow causing undefined behaviour allows the compiler to make significantly better optimisations. Here is Chandler Carruth, a Clang developer, talking about it: https://www.youtube.com/watch?v=yG1OZ69H_-o&t=2356s&ab_chann...


I am surprised to see what I consider to be the most important part of signed integers in C and C++ only barely mentioned:

>use of signed integers is better as it’s the only way you can get trap behavior for integers, as using it on unsigned would trigger trap representations for valid code that relies on that behavior.

For video and audio codec code, and really anything integer math heavy, being able to use UBSan to find overflows is a hugely beneficial tool, and something you can normally only do with signed integers due to the C spec (it's less of an issue in Rust as that language makes both signed and unsigned overflow illegal).

>Trap representations are actually quite insufficient as they can only trigger at runtime when those paths are successfully executed with the correct trap-producing inputs. This coverage is impossible to expect in any non-trivial program even with exhaustive unit testing.

This was true before fuzzers existed, but now that we have several very good fuzzer implementations (plus a few smart unit tests), the coverage is well within reach.
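As a complement to UBSan, GCC and Clang also expose overflow-checking builtins, so the cases you anticipate can be handled explicitly rather than discovered by fuzzing. A sketch (checked_add is just an illustrative wrapper):

```cpp
#include <climits>

// __builtin_add_overflow returns true if the addition overflowed and
// stores the (wrapped) result either way; no UB is invoked. This
// wrapper returns true on success instead.
bool checked_add(int a, int b, int* out) {
    return !__builtin_add_overflow(a, b, out);
}
```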


> this concept applies to all languages, regardless of their choices

No it doesn't.

Neither signed nor unsigned integers in WUFFS permit overflow or underflow, at runtime the integers behave the way you were taught in school. Adding two positive numbers together always results in another positive number, because duh. WUFFS catches all the cases this programmer worries about at compile time and insists you fix them.

One of the first examples WUFFS presents you is what if we take this correct code for a well known operation and we tweak it so that it can overflow. The compiler rejects the modified code, pointing out that it could now overflow and so isn't valid WUFFS.

Now, WUFFS is a special purpose language, you can't rewrite your 5MLOC C++ system in WUFFS. But then again, this shouldn't make you feel glad (because C++ is so powerful) but instead sad (because C++ can't offer you this valuable defence against your own stupidity). It should also make you determined to use WUFFS everywhere you possibly could, although doubtless it won't.


> It turns out that computing differences safely is actually quite hard for signed integers because of underflow, even in languages which support wrapping behavior for them, e.g INT_MAX - INT_MIN is still going to be incorrect even. There just isn’t a trivial way to do this safely

And that right there is a compelling reason not to use C or C++. If the language is so broken that it doesn't even let me subtract signed integers safely without risking shooting myself in the foot it is worse than useless in today's world because it gives me the illusion of programming in a high-level language when in fact I am not.

When I want to subtract x from y, I want to be able to write y-x and have the language worry about the details. That is the whole point of having a high level language, to abstract away the low-level details and let you write high-level abstractions without having to worry about introducing bugs that could imperil your entire enterprise. C and C++ don't do that.
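For the record, a fully safe absolute difference can be written in C/C++, but you have to know to hop into unsigned arithmetic, where wraparound is defined; a sketch:

```cpp
#include <cstdint>

// Correct for all inputs, including INT64_MIN and INT64_MAX: the
// subtraction is performed in uint64_t, where wraparound is well
// defined, and the true difference of two int64_t values always
// fits in uint64_t.
uint64_t abs_diff(int64_t x, int64_t y) {
    return x > y ? (uint64_t)x - (uint64_t)y
                 : (uint64_t)y - (uint64_t)x;
}
```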


This is the argument not to use computers at all. It is not persuasive.

There is never a reason to use C.

But C++ is a language for engineers. If you don't have the patience to do engineering, then you should not. If you have a need for engineering anyway, hire one.


Part of engineering is choosing the right tools, and that means avoiding tools that have booby traps built in to their design unless there is a very compelling reason to accept that risk. C was a good tool when computers were much more resource-constrained than they are now and security was much less of a concern, but those days are long gone.


Part of engineering is understanding your tools, and not imagining they are other than they are. There is no risk when there is no delusion. A professional will always choose the sharpest tool, and respect it.

There are no booby traps, only booby programmers.


The problem is that C/C++ has hijacked a common notation used to denote subtraction and used it to denote something that kinda sorta looks like subtraction but actually isn't, and then it provides no backstop for the situations where it isn't. The specification of the language explicitly allows anything to happen if the programmer screws up this thing that the design of the language actively encourages you to screw up. That is terrible engineering practice. The design of engineering tools, like anything else, needs to take into account the fact that humans are fallible and will make mistakes. Good design guides people away from making mistakes and minimizes the consequences when people make mistakes nonetheless (as they inevitably will). The design of C/C++ lures people into making mistakes by making something look like something it is not, and then it amplifies those mistakes into consequences that are potentially catastrophic, all in the name of efficiency. That might have been a reasonable tradeoff in the 1970s. In the 2020s not so much.


Demands on efficiency are every bit as stringent today as in the '70s. Machines are faster, but we expect them to achieve even more. Clock rates have stagnated, but problem sizes have not.

The failure modes of C and C++ ints are less bad than is typical of construction materials. The common mistake is hiring CS people and expecting them to think and act like engineers.


Engineering is the art of making trade-offs between multiple competing concerns. Efficiency is one such concern but it is not the only one. Safety matters too. You can make aircraft more efficient by eliminating all of the safety equipment and redundant systems. That might have made sense for the Wright Flyer but no sane person would advocate that for a modern passenger airliner.

If the subtraction problem were the only issue that might be manageable but it is not. It is just one example of a very long list of such problems. C and C++ are metaphorical minefields of gotchas, any one of which can result in a literal crash or a security breach, which in today's world can be every bit as damaging as a physical failure of a critical system. Actual examples of this occur on an almost daily basis. It is akin to trying to build a modern passenger aircraft out of wood and muslin. It really is unacceptable.


Yet, you would be happy to release a safety-critical system with only run-time type checking?

Engineering is a discipline that requires years of study and supervised practice, not reducible to "the art of making trade-offs". Non-engineers hired to perform engineers' jobs like to make this sort of mistake. Management might not be able to tell the difference. Thence 737-MAX.


> Yet, you would be happy to release a safety-critical system with only run-time type checking?

https://en.wikipedia.org/wiki/Tu_quoque

To say nothing of the fact that your backhanded assertion is completely baseless. Also, C and C++ provide neither run-time safety nor compile-time safety, so your argument is a non-sequitur as well.


Ah, I see you were just trolling. Well played, I guess?


No. I stand by everything I wrote.


To me signed vs. unsigned by default seems like asking which leg to cut off by default, left or right?

The correct default, IMHO, is a proper numeric tower paired with a tagged-pointer implementation for "small" integers (range ~ ± 2^62 or so). As others have pointed out, this will essentially never overflow in practice, so memory traffic is the same as for 64-bit primitive integers. AFAICT, we are now good enough at optimising tagged pointer/small integer arithmetic that on modern machines, memory traffic should be the limiting factor.

I was vacillating between different options (for Objective-S, http://object.st) for a long time, because while I love C's "cheap and cheerful" use of int, I also love Smalltalk's (and Lisp's etc.) "we do the right thing".

What pushed me over the edge towards the numeric tower as default position was DJB's retrospective paper in Qmail security: https://cr.yp.to/qmail/qmailsec-20071101.pdf

So while "cheap and cheerful" works really well in regular situations, it does not work well at all when you're under attack, because attackers will go 100% for the edge cases, the ones that "never happen". So we really need to, as DJB put it, "simplify integer semantics". Having those overflows just cause a small slowdown seems like a reasonable tradeoff, we don't need attacks to execute at optimal efficiency.

Given the advances we have in software and hardware, there simply is no reasonable case to be made for not having a proper numeric tower as default.

When you need better performance or specialised semantics, the primitive types should be readily available. But not the default.


For languages that don't give you a choice of signed vs. unsigned - giving you signed and sometimes arbitrary-precision integers - this debate is irrelevant.

I'll put it in my same mental bucket of "things that I no longer worry about", along with buffer overflows, wild pointers, etc that are eliminated by memory-safe languages.


Slight nitpick:

> Here’s a somewhat non-exhaustive list of all the undefined behavior of signed integer arithmetic in LLVM which applies to all languages which use LLVM:

> [... division examples ...]

> INT_MAX - INT_MIN

That's not true: 'sub i32 INT_MAX, INT_MIN' is defined perfectly fine in LLVM in the (modular arithmetic) way you'd expect as long as the 'nsw' (no signed wrap) flag isn't set on the instruction. For the most part, LLVM IR by default doesn't care about the signedness of values, integer types are just treated as a bunch of bits with modular twos complement arithmetic. Exceptions to this rule are explicit, e.g. the mentioned nsw flag (and there's also nuw for "no unsigned wrap").

The division examples are correct because if you want a signed divide, you have to explicitly encode it as a signed divide instruction in LLVM IR, which has the listed undefined behaviors baked in.


I wonder, is there any language that automatically promotes ints to more bits to ensure there is no overflow? The sum of two eight-bit ints is a nine-bit int (probably occupying 16 bits in memory, but that is an implementation detail):

    uint8 a = 230;
    uint8 b = 250;
    auto c = a + b;  // typeof(c) == uint9
Maybe it would even be able to detect that "c / 2" fits in a uint8.

I guess this could lead to an explosion of bits required, but only if you chain multiple operations and use type inference. If you explicitly assign to a fixed-width integer, it would truncate.

I know Python uses bignums under the hood, but is there any compiled and/or strictly typed language that does this? Because it seems modulo arithmetic and silent under/overflow is most of the time not what you want.
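The closest you can get in plain C++ is widening by hand; a real "uint9" would need language support (Zig's arbitrary-width integer types are the nearest existing thing I know of). A sketch of the idea:

```cpp
#include <cstdint>

// Hand-written "widening" addition: the result type is chosen large
// enough to hold any possible sum of the operand types, so overflow is
// impossible by construction (255 + 255 = 510 < 65536).
uint16_t add_wide(uint8_t a, uint8_t b) {
    return (uint16_t)a + (uint16_t)b;
}
```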


That is generally not possible in a strongly-typed language unless you just use a bignum type to begin with.

Most numbers are small, and you aren't worried about overflow.

With big numbers, std::int64_t is hard to overflow. If you have numbers from an untrusted source, you can and should range-check them immediately on input.


Why not? Ada, Pascal, Nim and others have ranged integer types. Many modern languages use some kind of type inference. You could easily define that the type of an integer expression has to be large enough, for most expressions. It doesn't have to be general (I wouldn't be surprised if the general case is theoretically impossible). But you could fix the most obvious cases easily with this (for loops that can't go below 0 because of unsigned counters, calculating the average but the temporary overflows, ...).

And if the compiler gives you a warning... "Inferred variable would require a bignum (int128)" then you could either silence it by specifying the type explicitly, or maybe you are missing a manual range-check?

> If you have numbers from an untrusted source, you can and should range-check them immediately on input.

You should, but you are going to forget in some places. The point of a strongly-typed language to me is that the compiler will make sure you do. It is the edge cases and the places where you accidentally don't sanitize properly where you get evil bugs.


And if you do "forget in some places", Ada, Pascal, and Nim will not produce the correct response.

What will they do instead? We don't need to guess, we saw in the first Ariane 5: catastrophic loss of vehicle and payload, as a result of sending a crash dump report to the engine steering gimbals. $500M payload, in that case.


The problem isn't that signed or unsigned is the accepted default. The problem is that bounded integers are the default, and worse yet, that their behavior at the edges is counter-intuitive from a common sense perspective.

This was perfectly reasonable back when RAM was measured in kilobytes, and clock speed in megahertz. But these days, we write apps in HTML and JS, and package them with a browser to run. And, conversely, security is much more of a problem than it was decades ago - and integer overflow is a very common cause of security issues.

So, why aren't unbounded integers becoming the norm? I'm not saying we should throw int32 etc out, but rather treat them as low-level optimization tools - much like we treat, say, raw pointers in C# or Rust. Surely the default should be safety over performance?


Unfortunately if I were to try to use unsigned ints extensively, I'd be casting them back and forth constantly, because nearly every standard library function that I'd want to use expects int inputs or outputs ints.

A small improvement in correctness is washed out by a huge decrease in ergonomics.


You do know that signed and unsigned ints convert implicitly, without casts?


Which standard library?


I've seen plenty of bugs due to misuse of unsigned types, mostly coming from beginners trying to silence signed/unsigned comparison warnings with casts. Then they go on and do some arithmetic with the loop index, which results in an underflow or two. We now have hard rules against using unsigned types in arithmetic operations.

Where things get a bit more tricky is when working with data that has some native unsigned type - e.g. when reading an image from file, it will mostly consist of unsigned chars. There you also need conversion rules and everyone aligned on algorithms, APIs etc.

Mostly though it's a matter of maintainability - it's easier to wrap your head around signed integers (no pun intended), even for beginners.


This is an artifact of typing. Is your 16 bit value all ones? Then it might represent -1 or 65535 depending on what you believe the type to be. Perhaps it represents some sort of colour instead.

When working in an untyped environment, this sort of question doesn't come up. If the value is intrinsically unsigned then that is what you and hence the program assume those bits mean. If the value is intrinsically signed then that is your illusion. Any ideas about overflow are entirely in your mind. You tend to see techniques used in an untyped environment that would never occur to anyone in a signed environment. It's all just bits.
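Concretely, in a typed language the same bit pattern reads differently depending on which "illusion" you declare (sketch):

```cpp
#include <cstdint>

// All sixteen bits set, under two different typed interpretations:
uint16_t u = 0xFFFF;            // as unsigned: 65535
int16_t  s = (int16_t)0xFFFFu;  // as two's-complement signed: -1
// (the narrowing conversion is implementation-defined before C++20,
// but yields -1 on any two's-complement machine, and C++20 guarantees
// the modular result)
```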


One thing with array indices is that using signed means that you need to always check for negative in addition to checking for too large. Unsigned indices means you can just check against too large.

I think a lot of the scenarios where signed is safer are also situations where the application logic would be wrong, so I am a bit skeptical of that argument.

But ultimately, the biggest factor to consider is what your team or prospective team members in your industry are used to; different industries tend to have people who never expect unsigned values vs. those who always assume unsigned will be used.
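To make the first point concrete, here is a hypothetical bounds check written both ways (function names made up):

```cpp
#include <cstddef>

// With a signed index, a correct bounds check needs two comparisons...
bool in_bounds_signed(long i, size_t len) {
    return i >= 0 && (size_t)i < len;
}

// ...while with an unsigned index one suffices: negative values can't
// exist, so "out of range" is simply i >= len.
bool in_bounds_unsigned(size_t i, size_t len) {
    return i < len;
}
```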


> Of course this is dangerous as it’s a narrowing conversion, with a cast which silences a legitimate warning. In C and C++ it invokes undefined behavior when given specific large values and most certainly is exploitable. Most applications would just crash on inputs >= 0x7ffffffffffffffff.

No, it's not. Overflow on arithmetic is undefined, but on conversion it's implementation defined. In C++20 it's defined.


> Here’s a somewhat non-exhaustive list of all the undefined behavior of signed integer arithmetic in LLVM which applies to all languages which use LLVM:

> These are certain to produce invalid results in languages like Go, Rust, and Odin.

This is clearly untrue, since an operation (or indeed a type) at the language level need not translate to the same operation/type within LLVM.

For example, none of these are "undefined behaviour" in Rust.


The author makes good points, but the takeaway I get is that signed under/overflow should always be defined, maybe in a platform specific way. The one good argument for leaving it undefined was for better loop optimisation, but if we're going to use unsigned for loops, then what's the point.


I prefer unsigned in Rust but signed for C/C++. This is mostly because I'm always afraid that the number will be silently converted to a signed type at some point. Rust is much more explicit about when these conversions happen so it worries me less.


What I find interesting is that the implementation of `midpoint` is incorrect (should be `U difference = (U)y - (U)x;`), which is not a good look when you want to argue that unsigned makes everything more correct.


I sometimes use signed ints to indicate an error condition. Then you know if the result is < 0 some error occurred, otherwise you can continue with normal processing.


Wouldn't it be better to write

for (size_t i = size; i > 0; i--) { // ... }

and use i-1 inside the loop instead of using underflow?


I foresee two problems with that:

Problem 1:

If i is used multiple times inside the loop, then you might need to make a temporary variable for it, making it confusing:

    for (size_t i_plus_one = size; i_plus_one > 0; i_plus_one--) {
        size_t i = i_plus_one - 1;
        f(i);
        g(i);
        h(i);
        ...
    }

(The alternative is a bunch of i-1 all over the place, which is also confusing.)

Problem 2:

If the operations inside the loop are short, then the extra "i-1" arithmetic could result in unacceptable performance penalties.


or: "for (size_t i = size; i-- > 0;){...}"


I thought, given the title, that it was going to be tips and a list of services for accountless web use.


What fans of unsigned integers seem to really want is array access in the form of x[i mod LEN]. Signed integers "kind of" give you an error on overflow (except when you're unlucky or somebody is malicious). Is there a way to combine both? As in, give me an error on overflow in "debug builds" and always mod-LEN in "release builds"?


“In many ways signed integers are the null pointers of integers.”

What a great article.


I would have really loved to have different number types in JavaScript, and to be able to define them with TypeScript. Not from the performance perspective (like different sizes), but the integer vs. float and unsigned vs. signed distinctions can be really useful.



