Nice writeup. I suspect you're measuring the cost of abstraction. Specifically, routines that can handle lots of things (like locale-based strings and UTF-8 characters) have more work to do before they can produce results. This was something I ran into head-on at Sun when we did the I18N[1] project.
In my experience there was a direct correlation between the number of different environments where a program would "just work" and its speed. The original UNIX ls(1), which had filenames with a fixed maximum length, no pesky characters allowed, everything representable in 7-bit ASCII, and only the 12 bits of metadata that God intended[2], was really quite fast. Add a VFS that maps the source file system onto the parameters of the "expected" file system, and that adds delay. Mapping different character sets? Adds delay. Colors for the display? Adds delay. Small costs that add up.
1: The first time I saw a long word like 'internationalization' reduced to its first and last letters plus a count of the letters in between :-).
2: Those being Read, Write, and eXecute for user, group, and other, plus setuid, setgid, and the 'sticky' bit :-)
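If anyone wants to feel that difference directly, here is a rough sketch (not the original Sun measurements, just an illustration) comparing plain byte comparison (strcmp) against locale-aware collation (strcoll), which is the kind of thing a locale-aware ls has to do when sorting names. The filenames and iteration count are made up.

    /* Micro-benchmark sketch: plain strcmp vs. locale-aware strcoll.
     * Strings and iteration count are arbitrary, for illustration only. */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    static double bench(int (*cmp)(const char *, const char *),
                        const char *a, const char *b, long iters)
    {
        clock_t start = clock();
        volatile int sink = 0;      /* keep the loop from being optimized away */
        for (long i = 0; i < iters; i++)
            sink += cmp(a, b);
        (void)sink;
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        setlocale(LC_COLLATE, "");  /* honor the user's locale, as a locale-aware ls must */
        const char *a = "Makefile";       /* hypothetical filenames */
        const char *b = "makefile.old";
        long iters = 10 * 1000 * 1000;

        printf("strcmp : %.3fs\n", bench(strcmp,  a, b, iters));
        printf("strcoll: %.3fs\n", bench(strcoll, a, b, iters));
        return 0;
    }

On most systems the strcoll run is noticeably slower once you're in a non-C locale; each individual comparison is still cheap, but over a large directory those small costs are exactly what adds up.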
Other people attribute i18n to DEC circa 1985.
> the 12 bits of meta data that God intended
Naw, that's Dennis Ritchie. You're thinking of the other white-bearded guy that hangs out in heaven.