diziet_sma • today at 12:32 AM • 3 replies

> Python essentially bet on UTF-32 (with space-saving optimisations)

How so? Python 3 strings are Unicode, and all the encoding/decoding functions default to utf-8. In practice this means all the Python I write is UTF-8-compatible Unicode and I don't ever have to think about it.
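
A minimal sketch of the default being described, limited to `str.encode()` and `bytes.decode()` (the sample text and variable names are made up for illustration):

```python
text = "naïve café"

# With no argument, str.encode() uses "utf-8".
encoded = text.encode()

# bytes.decode() likewise defaults to "utf-8".
decoded = encoded.decode()

print(encoded)
print(decoded == text)
```

This only covers the two methods shown; other entry points that take an `encoding` argument may pick their default differently.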

Replies

sheept • today at 12:54 AM

UTF-32 allows constant-time access to code points, which means that mystr[i] is O(1) rather than O(n). Most other languages can only provide constant-time access to code units.
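
To make the difference concrete, here is a rough sketch contrasting Python's `s[i]` with what the same lookup costs on raw UTF-8 bytes. The helper `code_point_at` is hypothetical, written only for this illustration, and it assumes valid UTF-8 input:

```python
def code_point_at(data: bytes, index: int) -> str:
    """Return the index-th code point of valid UTF-8 data via a linear scan."""
    if index < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


s = "héllo 🐍!"
utf8 = s.encode("utf-8")

print(len(s), len(utf8))       # code points vs. UTF-8 code units (bytes)
print(s[6])                    # direct index into the str
print(code_point_at(utf8, 6))  # same answer, but found by scanning the bytes
print(utf8[1])                 # indexing bytes yields a code unit, not a character
```

Indexing `utf8` directly is constant time too, but it returns a single byte, which is the "code units" case described above.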

pansa2 • today at 1:07 AM

> all the encoding/decoding functions default to utf-8

Languages that use UTF-8 natively don't need those functions at all. And the ones in Python aren't trivial - see, for example, `surrogateescape`.

As the sibling comment says, the only benefit of all this encoding/decoding is that it allows strings to support constant-time indexing of code points, which isn't something that's commonly needed.
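
For readers who haven't met `surrogateescape`, a small sketch of what it does (the byte string is an invented example of a filename that isn't valid UTF-8):

```python
# 0xFF can never appear in valid UTF-8, so a strict decode would fail here.
raw = b"report-\xff.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")
print(repr(name))

# Encoding with the same handler restores the original bytes exactly.
round_trip = name.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)

# A strict encode rejects the lone surrogate, so the str is not plain "UTF-8 text".
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```

The resulting `str` round-trips losslessly, but it contains a code point that strict UTF-8 refuses to encode, which is the kind of non-trivial behaviour the comment is pointing at.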

cloudbonsai • today at 3:43 AM

Internally Python holds a string as an array of uint32. A UTF-8 representation is created on demand from it (and cached). So pansa2 is basically correct [1].

IMO, while this may not be optimal, it's far better than the more arcane choices made by other systems. For example, for reasons only Microsoft can understand, Windows is stuck with UTF-16.

[1] Actually it's smarter than that: CPython picks the narrowest element width that fits the string's largest code point. For example, it automatically uses uint8 instead of uint32 for ASCII strings.
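
One way to observe this from Python code is to compare `sys.getsizeof` for strings of different lengths. This is a sketch that assumes CPython 3.3 or later, where the flexible string representation applies; the helper name `bytes_per_code_point` is made up for this example:

```python
import sys


def bytes_per_code_point(ch: str, n: int = 1000) -> int:
    """Estimate per-code-point storage by differencing two string sizes.

    Subtracting cancels the fixed object header, leaving only the
    element storage for n extra code points. CPython-specific.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for label, ch in [
    ("ASCII", "a"),
    ("Latin-1", "é"),
    ("BMP", "Ā"),        # U+0100, just outside Latin-1
    ("astral", "🐍"),    # U+1F40D, outside the BMP
]:
    print(label, bytes_per_code_point(ch))
```

On CPython this reports 1, 1, 2, and 4 bytes respectively, so a single astral character is enough to widen every element of the string to 4 bytes.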
