Unsigned Sizes: A Five Year Mistake

64 points • by lerno • yesterday at 6:40 PM • 71 comments

Comments

kevin_thibedeau • yesterday at 7:02 PM

Systems programmers love to hate on unsigned integers. Generations have been infected with the Java world model that integers have to be pretend number lines centered on zero. Guess what, you still have boundary conditions to deal with. There are times when you really really need to use the full word range without negative values. This happens more often with low level programming and machines with small word sizes, something fewer people are engaged in. It doesn't need to be the default. Ada has them sequestered as modular types but they're available to use when needed.

Groxx • yesterday at 7:20 PM

>If sizes are unsigned, like in C, C++, Rust and Zig – then it follows that anything involving indexing into data will need to either be all unsigned or require casts. With C’s loose semantics, the problem is largely swept under the rug, but for Rust it meant that you’d regularly need to cast back and forth when dealing with sizes.

TBH I've had very little struggle with this at all. As long as you keep your values and types separate, the unsigned type that you got a number from originally feeds just fine into the unsigned type that you send it to next. Needing casting then becomes a very clear sign that you're mixing sources and there be dragons, back up and fix the types or stop using the wrong variable. It's a low-cost early bug detector.

Implicitly casting between integer types though... yeah, that's an absolute freaking nightmare.

LegionMammal978 • yesterday at 7:14 PM

> But what about the range? While it’s true that you get twice the range, surprisingly often the code in the range above signed-int max is quite bug-ridden. Any code doing something like (2U * index) / 2U in this range will have quite the surprise coming.

Alas, (2S * signed_index) / 2S will similarly result in surprises the moment the signed_index hits half the signed-int max. There's no free lunch when trying to cheat the integer ranges.
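
A minimal Rust sketch of both cases, assuming 32-bit integers and C-style wrapping for the unsigned one. The two function names are hypothetical, chosen only for illustration.

```rust
/// Hypothetical helper: computes (2 * index) / 2 with wrapping
/// multiplication, which is how C's unsigned arithmetic behaves.
fn double_then_halve_unsigned(index: u32) -> u32 {
    index.wrapping_mul(2) / 2
}

/// Hypothetical helper: the same computation on a signed value.
/// `checked_mul` reports overflow as `None` instead of wrapping.
fn double_then_halve_signed(index: i32) -> Option<i32> {
    index.checked_mul(2).map(|doubled| doubled / 2)
}
```

With these definitions, `double_then_halve_unsigned(0x8000_0001)` yields 1 rather than the original index, and `double_then_halve_signed(i32::MAX / 2 + 1)` yields `None`. The signed version just hits its limit at half the magnitude.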

ok123456 • yesterday at 8:18 PM

Bjarne agrees.

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p14...

deathanatos • yesterday at 8:38 PM

> The former is easier to define, but has the downside of essentially “silencing warnings”. Let’s say the code was originally written to cast an u16 to u32, but later the variable type changes from u16 to u64 and the cast is now actually silently truncating things. Here we have casts becoming a sort of “silence all warnings”.

Well … we even mention Rust in the paragraph right before this. In Rust, you can widen a u16 to a u32 this way:

  let bigger: u32 = x.into();

  let bigger = u32::from(x);

The conversion `from` is infallible, because a u16 always fits in a u32. There is no `from(u64) -> u32`, because as the article notes, that would truncate, so if we did change the type to u64, the code would now fail to compile. (And we'd be forced to figure out what we want to do here.)

(There are fallible conversions, too, in the form of try_from, that can do u64 → u32, but will return an error if the conversion fails.)
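
A minimal sketch of the two kinds of conversion side by side. The helper names `widen` and `narrow` are hypothetical; `u32::from` and `u32::try_from` are the standard library conversions referred to above.

```rust
use std::convert::TryFrom;

/// Infallible: every u16 fits in a u32, so `From` is implemented.
fn widen(x: u16) -> u32 {
    u32::from(x)
}

/// Fallible: a u64 may not fit in a u32, so only `TryFrom` exists.
/// The error is turned into `None` here to keep the sketch short.
fn narrow(x: u64) -> Option<u32> {
    u32::try_from(x).ok()
}
```

If the parameter of `widen` were changed from `u16` to `u64`, the `u32::from(x)` call would stop compiling, which is the early warning described above.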

Similarly, for,

  for (uint x = 10; x >= 0; x--) // Infinite loop!

This is why I think implicit wrapping is a bad idea in language design. Even Rust went down the wrong path (in my mind) there, and I think has worked back towards something safer in recent years. But Rust provides a decent example here too; this is pseudo-code:

  for (uint x = 10; x.is_some(); x = x.checked_sub(1))

Where `checked_sub` returns `None` instead of wrapping, providing us a means to detect the stopping point. So, something like that. (Though you'd probably also want to destructure the option into the uint for use inside the loop.) Of course, higher-level stuff always wins out here, I think, and in Rust you wouldn't write the above; instead something like,

  for x in (0..=10).rev()

(And even then, if we need indexes; usually, one would prefer to iterate through a slice or something like that. The higher-level concept of iterators usually dispenses with most or all uses of indexes, and in the rare cases when needed, most languages provide something like `enumerate` to get them from the iterator.)
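
For reference, a runnable sketch of both countdown styles mentioned above. The function names are hypothetical, and collecting into a `Vec` is only there to make the result easy to inspect.

```rust
/// Hypothetical helper: counts down from `start` to 0 inclusive,
/// using `checked_sub` to detect the stopping point instead of wrapping.
fn countdown_checked(start: u32) -> Vec<u32> {
    let mut out = Vec::new();
    let mut current = Some(start);
    // The `while let` destructures the Option, so `x` is a plain u32 in the body.
    while let Some(x) = current {
        out.push(x);
        current = x.checked_sub(1); // None once x == 0
    }
    out
}

/// Hypothetical helper: the same sequence from a reversed inclusive range.
fn countdown_range(start: u32) -> Vec<u32> {
    (0..=start).rev().collect()
}
```

Both produce the same sequence; the range version simply has no place for an off-by-one or a wrapped counter to hide.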

ximm • yesterday at 7:19 PM

Is the text on this page really #bbbdc3 on #ffffff? How is anyone supposed to be able to read that?

alberto-m • yesterday at 9:00 PM

I might be a contrarian in that I actually like using unsigned integers for sizes and indexes. In my experience, most of their pitfalls can be prevented by treating any subtraction involving them as a `reinterpret_cast`: i.e.

* Do your utmost to rewrite the code in order to avoid doing that (e.g. reordering inequalities to transform subtractions into additions).
* If not possible, think very hard about any possible edge case: you most certainly need an additional `if` to deal with those.
* When analyzing other people's code during troubleshooting or merge reviews, assume any formula involving an unsigned integer and a minus sign is wrong.
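
A small Rust sketch of the first rule, using the classic `i < len - 1` check as an assumed example. Both function names are hypothetical, and `wrapping_sub` is used only to mimic C's unsigned behaviour (a plain `-` would panic in a Rust debug build).

```rust
/// Deliberately buggy: `len - 1` wraps to usize::MAX when `len == 0`,
/// so the check passes for an empty container.
fn has_next_buggy(i: usize, len: usize) -> bool {
    i < len.wrapping_sub(1)
}

/// Rearranged so the subtraction becomes an addition: `i + 1 < len`.
/// `checked_add` covers the remote case where `i` is already usize::MAX.
fn has_next(i: usize, len: usize) -> bool {
    match i.checked_add(1) {
        Some(next) => next < len,
        None => false,
    }
}
```

Here `has_next_buggy(0, 0)` is true while `has_next(0, 0)` is false; for non-empty lengths the two agree.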

EdSchouten • yesterday at 7:32 PM

> If sizes are unsigned, like in C, C++, Rust and Zig – then it follows that anything involving indexing into data will need to either be all unsigned or require casts.

I don’t really get this claim. Indexing should just look up the element corresponding to the value provided. It’s easy to come up with semantics that are intuitive and sound, even if signed integers or ones smaller than size_t are used.

ks2048 • yesterday at 7:14 PM

I know language designers have a lot of trade-offs to consider... But I would say if you know a value will logically always be >= 0, better to have a type that reflects that.

The potential bugs listed would be prevented by, e.g. "x--" won't compile without explicitly supplying a case for x==0 OR by using some more verbose methods like "decrement_with_wrap".

The trade-off is lack of C-like concise code, but more safe and explicit.

Validark • yesterday at 8:25 PM

I am personally moving in the opposite direction. I haven't meaningfully used a signed integer in years, and I see signed integers as being for more niche use-cases. I mainly only use a signed type when I want to do a "signed shift right". If there was a >>> operator in Zig I wouldn't even think of signed integers.

Given your examples, I think you'd have fewer issues if you were working with unsigned integers exclusively. Although I'm curious about what other code you were referencing with this: "But seeing how each change both made the code easier to reason about and more correct, I couldn’t deny the evidence."

With regards to modulo, in Zig if you try to use it with a signed integer it will tell you to specify whether you want `@mod` or `@rem` semantics. In my case, I'd almost never write `x % 2`, I'd write `x & 1`. I do use unsigned division but I'd pretty much never write code that would emit the `div` instruction.

I'm not saying you're wrong though! Everyone has a different mind. If you attain higher correctness and understandability through using signed integers, that's great. I'm just saying I'm in the opposite camp.
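
A Rust sketch of the modulo distinction mentioned above, offered as a rough analogy rather than a statement about Zig's exact semantics: Rust's `%` is a truncating remainder, while `rem_euclid` always returns a non-negative result for a positive divisor. The function names are hypothetical.

```rust
/// Truncating remainder: the sign of the result follows the dividend.
fn parity_rem(x: i32) -> i32 {
    x % 2
}

/// Euclidean remainder: the result is never negative for a positive divisor.
fn parity_mod(x: i32) -> i32 {
    x.rem_euclid(2)
}

/// Bit mask: reads the lowest bit of the two's-complement representation.
fn parity_mask(x: i32) -> i32 {
    x & 1
}
```

For `x = -7` the three return -1, 1 and 1 respectively, so a test like `x % 2 == 1` misses negative odd numbers while `x & 1 == 1` does not.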

larsnystrom • yesterday at 8:49 PM

I don’t understand how dealing with numbers correctly is not a solved problem in computer engineering by now.

cperciva • yesterday at 7:49 PM

I don't get it. Is this a parody of poor design decisions?

Sure, it's possible to write bugs in C. And if you really want to, you can disable the compiler warnings which flag tautologous comparisons and mixed-sign comparisons (a common reason for doing this is to avoid spurious warnings in generic-type code).

But, uhh, "people can deliberately write bugs" has got to be the weakest justification I've ever seen for changing a language feature -- especially one as fundamental as "sizes of objects can't be negative".

IshKebab • yesterday at 7:52 PM

It seems like they've identified common bug patterns in C that would have been ameliorated by using signed, but come to the wrong conclusion that signed is the correct answer rather than that C is poorly designed for making the broken code the easy option.

Fix the language. Don't hack around it by using the wrong type.

jonstewart • yesterday at 7:25 PM

I hate using languages that only have signed integers. Using integers that can’t be negative fits many problems nicely and avoids the edge case of having to check for negative.
