
Why does C have the best file API

122 points by maurycyz yesterday at 7:25 PM | 100 comments

Comments

amluto today at 5:37 AM

I can’t entirely tell what the article’s point is. It seems to be trying to say that many languages can mmap bytes, but:

> (as far as I'm aware) C is the only language that lets you specify a binary format and just use it.

I assume they mean:

    struct foo { fields; };
    struct foo *data = mmap(…);

And yes, C is one of relatively few languages that let you do this without complaint, because it’s a terrible idea. And C doesn’t even let you specify a binary format — it lets you write a struct that will correspond to a binary format in accordance with the C ABI on your particular system.

If you want to access a file containing a bunch of records using mmap, and you want a well defined format and good performance, then use something actually intended for the purpose. Cap’n Proto and FlatBuffers are fast but often produce rather large output; protobuf and its ilk are more space efficient and very widely supported; Parquet and Feather can have excellent performance and space efficiency if you use them for their intended purposes. And everything needs to deal with the fact that, if you carelessly access mmapped data that is modified while you read it in any C-like language, you get UB.
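
As a rough illustration of the difference between "a struct that matches the local ABI" and "a specified binary format", here is a minimal sketch that decodes a record field by field from raw bytes. The 16-byte little-endian layout and the names `struct record`, `load_le32`, `load_le64` and `decode_record` are assumptions made up for this example; they come from neither the article nor any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk record, 16 bytes, little-endian, no padding:
   offset 0: uint32 id, offset 4: uint32 flags, offset 8: uint64 timestamp. */
enum { RECORD_SIZE = 16 };

/* In-memory form; its layout is free to differ from the on-disk one. */
struct record {
    uint32_t id;
    uint32_t flags;
    uint64_t timestamp;
};

/* Assemble a 32-bit little-endian value byte by byte, independent of host order. */
uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Assemble a 64-bit little-endian value from two 32-bit halves. */
uint64_t load_le64(const unsigned char *p) {
    return (uint64_t)load_le32(p) | ((uint64_t)load_le32(p + 4) << 32);
}

/* Decode record number `index` from a byte buffer (for example, an mmapped file). */
struct record decode_record(const unsigned char *base, size_t index) {
    assert(base != NULL);
    const unsigned char *p = base + index * RECORD_SIZE;
    struct record r;
    r.id = load_le32(p);
    r.flags = load_le32(p + 4);
    r.timestamp = load_le64(p + 8);
    return r;
}
```

The byte-wise loads cost a little more than a pointer cast, but the result does not depend on the compiler's padding rules or the machine's endianness.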

seba_dos1 yesterday at 9:32 PM

mmap is not a C feature but a POSIX one. There are C platforms that don't provide mmap, and on those that do you can use mmap from other languages (there's an mmap module in Python's standard library, for example).

Dwedit yesterday at 9:57 PM

Using mmap means that you need to be able to handle memory access exceptions when a disk read or write fails. Examples of disk access that can fail include reading from a file on a Wi-Fi network drive, a USB device whose cable loses its connection when jiggled, or even a removable USB drive where all disk reads fail after it sees one bad sector. If you're not prepared to handle a memory access exception when you access the mapped file, don't use mmap.
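
On POSIX systems that exception typically arrives as SIGBUS. Below is a minimal sketch of one way to turn it into an ordinary error return, assuming a single-threaded program: copy out of the mapping inside a region guarded by `sigsetjmp`/`siglongjmp`. The name `guarded_copy` is hypothetical, and a real program would likely need a more careful, thread-aware design.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Jump target used by the handler; one global means this is not thread-safe. */
static sigjmp_buf fault_jmp;

static void on_sigbus(int sig) {
    (void)sig;
    siglongjmp(fault_jmp, 1); /* unwind back into guarded_copy */
}

/* Copy n bytes from src (e.g. inside an mmapped file) to dst.
   Returns 0 on success, -1 if the copy raised SIGBUS or the handler
   could not be installed. */
int guarded_copy(void *dst, const void *src, size_t n) {
    assert(dst != NULL && src != NULL);

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -1;

    int rc;
    if (sigsetjmp(fault_jmp, 1) == 0) { /* 1: save and later restore the signal mask */
        memcpy(dst, src, n);
        rc = 0;
    } else {
        rc = -1; /* reached via siglongjmp from the handler */
    }

    sigaction(SIGBUS, &old, NULL); /* put the previous handler back */
    return rc;
}
```

Copying into a private buffer first and working on that buffer keeps the guarded region small, at the price of giving up some of the zero-copy appeal of mmap.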

Const-me yesterday at 9:53 PM

I think the C# standard library is better. You can write the same unsafe code as in C: call the SafeBuffer.AcquirePointer method, then access the memory directly. Or you can take the safer and slightly slower route by calling the Read or Write methods of MemoryMappedViewAccessor.

All these methods are in the standard library, i.e. they work on all platforms. The C code is specific to POSIX; Windows supports memory mapped files too but the APIs are quite different.

zahlman yesterday at 10:00 PM

Aside from what https://news.ycombinator.com/item?id=47210893 said, mmap() is a low-level design that makes it easier to work with files that don't fit in memory and fundamentally represent a single homogeneous array of some structure. But it turns out that files commonly do fit in memory (nowadays you commonly have on the order of ~100x as much disk as memory, but millions of files); and you very often want to read them in order, because that's the easiest way to make sense of them (and tape is not at all the only storage medium historically that had a much easier time with linear access than random access); and you need to parse them because they don't represent any such array.

When I was first taught C formally, they definitely walked us through all the standard FILE* manipulators and didn't mention mmap() at all. And when I first heard about mmap() I couldn't imagine personally having a reason to use it.

teo_zero today at 6:27 AM

What a bizarre conclusion to draw! It's like saying that cars are the best means of transportation because you can travel to the Grand Canyon in them and the Grand Canyon is the best landscape in the world, and yes you could use other means to get there, but cars are what everybody's using.

If the real goal of TFA was to praise C's ability to reinterpret a chunk of memory (possibly mapped to a file) as another datatype, it would have been more effective to do so using C functions and not OS-specific system calls. For example:

  FILE *f = fopen(...);
  uint32_t *numbers;
  fread(numbers, ..., f);

  access numbers[...]

  fwrite(numbers, ..., f);
  fclose(f);
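
For reference, a filled-in version of that outline might look like the sketch below. It is only an illustration: the function name `double_all`, the choice to double each value, and the assumption that the file starts with `count` native-endian `uint32_t` values are all invented here. Unlike the outline, it allocates the buffer before reading and repositions the stream before writing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read `count` uint32_t values from the start of `f`, double each one,
   and write them back in place. `f` must be open for update ("r+b" or "w+b").
   Returns 0 on success, -1 on failure. */
int double_all(FILE *f, size_t count) {
    assert(f != NULL);
    if (count == 0)
        return 0;

    uint32_t *numbers = malloc(count * sizeof *numbers);
    if (numbers == NULL)
        return -1;

    int rc = -1;
    rewind(f);
    if (fread(numbers, sizeof *numbers, count, f) == count) {
        for (size_t i = 0; i < count; i++)
            numbers[i] *= 2;

        rewind(f); /* a reposition is required between reading and writing */
        if (fwrite(numbers, sizeof *numbers, count, f) == count && fflush(f) == 0)
            rc = 0;
    }

    free(numbers);
    return rc;
}
```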

alanfranz yesterday at 10:09 PM

Well...

I'm not sure what the author really wants to say. mmap is available in many languages (e.g. Python) on Linux (and many other *nix I suppose). C provides you with raw memory access, so using mmap is sort-of-convenient for this use case.

But if you use Python then, yes, you'll need a bytearray, because Python doesn't give you raw access to such memory - and I'm not sure you'd want to mmap a PyObject anyway?

Then, writing and reading this kind of raw memory can be kind of dangerous and non-portable - I'm not really sure that the pickle analogy even makes sense. I very much suppose (I've never tried) that if you mmap-read malicious data in C, a vulnerability would be _quite_ easy to exploit.

okanat yesterday at 10:47 PM

I guess the author didn't use that many other programming languages or OSes. You can do the same even in garbage collected languages like Java and C# and on Windows too.

https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByt...

https://learn.microsoft.com/en-us/dotnet/api/system.io.memor...

https://learn.microsoft.com/en-us/windows/win32/memory/creat...

Memory mapping is very common.

nickelpro yesterday at 10:01 PM

> Why does C have the best file API

> Look inside

> Platform APIs

Ok.

I agree platform APIs are better than most generic language APIs at least. I disagree on mmap being the "best".

castral yesterday at 9:56 PM

I think OP and I have very divergent opinions on what makes a file API "best". This may have been the best 30 years ago. The world has moved on.

Surac today at 10:01 AM

After reading the comments here, it boils down to: "But my language is better than yours." mmap is not a feature of C. Some more modern languages try to prevent people from shooting themselves in the foot and only allow byte-wise access to such mmapped regions. They have a point in doing this, but on the other hand the C users also have a valid point. Safety and speed are two factors you have to consider when choosing your tools. From a hardware point of view C might be more direct, but it also enables you to make "stupid" errors fast. More modern languages prevent you from making the "stupid" errors but make you copy or transform the data more. Scotty from the Enterprise once said: always use the fitting tool.

ibejoeb yesterday at 10:32 PM

The article only touches on `open` and `close` and doesn't deal with any of the realities of file access. Not a particularly compelling write-up.

srean yesterday at 9:38 PM

> However, in other most languages, you have to read() in tiny chunks, parse, process, serialize and finally write() back to the disk. This works, but is verbose and needlessly limited

C has those too, and I am glad that it does. This is what allows one to do other things while the buffer gets filled, without the need for multithreading.

Yes, easier standardized portable async interfaces would have been nice; I'm not sure how well supported they are.

chuckadams yesterday at 10:18 PM

It has the best API for the author, that's for sure. One size does not fit all: believe it or not, different files have different uses. One does not mmap a pipe or /dev/urandom.

FrankWilhoit yesterday at 9:08 PM

A file API is not the same thing as a filesystem API. The holy grail is still a universal but high(-enough)-level filesystem API.

pmontra today at 6:57 AM

I think that I open files in very few cases in my job. I read and write PDF, xlsx, csv, yaml and I write docx. Those have their own formats and we use them to communicate with other apps or with users. Everything else goes in a PostgreSQL database or in sqlite3 because of many reasons and among them because of interoperability with other apps and ease of human inspection. A custom file format could be OK for data that only that app must use or for performance reasons, if you know how to make data retrieval performant.

mmastrac yesterday at 9:38 PM

"best file API" and the man page for the O_ flags disagree.

kelnos today at 1:44 AM

mmap() is useful for some narrow use-cases, I think, but error-handling is a huge pain. I don't want to have to deal with SIGBUS in my application.

I agree that the model of mmap() is amazing, though: being able to treat a file as just a range of bytes in memory, while letting the OS handle the fussy details, is incredibly useful. (It's just that the OS doesn't handle all of the fussy details, and that's a pain.)

raincole today at 7:00 AM

At first glance, it's quite a weird article. But at the bottom:

> This simply isn't true on memory constrained systems — and with 100 GB files — every system is memory constrained.

I suppose the author might have a point in the context of making apps that constantly need to process 100GB files? I personally never have to deal with 100GB files so I am no one to judge if the rest of the article makes sense.

karel-3d today at 11:20 AM

In Go, you can do mmap with the help of an external library :) You can mmap a file - https://github.com/edsrzf/mmap-go - and then unsafe-retype that to a slice of objects, and then read/write to it. It's very handy sometimes!

It's unsafe though.

You also need to be careful not to have any pointers in the struct (so also no slices or maps); or if you have pointers, they must be nil. Once you start unsafe-retyping random bytes to pointers, things explode very quickly.

So maybe this article has a point.

Perenti today at 5:51 AM

It may have a tidy mmap api, but Smalltalk has a much better file api through its Streams hierarchy IMHO. You can create a stream on a diskfile, you can create a stream on a byteArray, you can create a stream on standard Unix streams, you can create a stream on anything where "next" makes sense.

swaminarayan today at 7:48 AM

If mmap-style file access is this powerful, why do most higher-level languages avoid exposing typed, struct-level mappings directly instead of just byte buffers?

qalmakka today at 8:27 AM

> C has the best API

POSIX has the best API. C has `fopen` which, while not terrible, isn't what I'd call "great"

jim33442 today at 7:03 AM

You can't do dynamic memory with that, right? Not without a custom malloc implementation. So it's not all that comparable to pickle.

charcircuit yesterday at 10:01 PM

C's API does not include mmap, nor does it contain any API to deal with file paths, nor does it contain any support for opening up a file picker. This, paired with C's bad string support, results in it being one of the worst file APIs.

Also, using mmap is not as simple as the article lays out. For example, what happens when another process modifies the file and your process's mapped memory now consists of parts of two different versions of the file at the same time? You also need to build a way to grow the mapping if you run out of room. You also want to be able to handle failures to read or write. This means you will pretty much need to reimplement fread and fwrite, going back to the approach the author didn't like: "This works, but is verbose and needlessly limited to sequential access." So it turns out "It ends up being just a nicer way to call read() and write()" is only true if you ignore the edge cases.

andersmurphy yesterday at 10:35 PM

mmap is nice. But, I find sqlite is a better filesystem API [1]. If you are going to use mmap why not take it further and use LMDB? Both have bindings for most languages.

[1] - https://sqlite.org/fasterthanfs.html

teunispeters yesterday at 11:13 PM

technically yes, because there's a failure path for every single failure that an OS knows about. And most others aren't so resilient. However, mmap bypasses a lot of that....

koakuma-chan yesterday at 9:53 PM

How do you handle read/write errors with mmap?

TZubiri yesterday at 11:50 PM

Here's a negative signal I'm seeing often:

When a developer that usually consumes the language starts critiquing the language.

I could go on as to why it's a bad signal, psychologically, but let's just say that empirically it usually doesn't come from a good place, it's more like a developer raising the stakes of their application failing and blaming the language.

Sure one out of a thousand devs might be the next Linus Torvalds and develop the next Rust. But the odds aren't great.

nice_byte yesterday at 11:14 PM

mmap is not a language feature. it is also full of its own pitfalls that you need to be aware of. recommended reading: https://db.cs.cmu.edu/mmap-cidr2022/

jccx70 yesterday at 10:51 PM

This is like: I discovered the wheel and want to let you know!

userbinator yesterday at 10:05 PM

> It still works if the file doesn't fit in RAM

No it doesn't. If you have a file that's 2^36 bytes and your address space is only 2^32, it won't work.
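
The usual workaround in that situation is to map a window of the file at a time rather than the whole thing. A minimal sketch, assuming POSIX `mmap` and a 64-bit `off_t` (on 32-bit systems that generally means building with `_FILE_OFFSET_BITS=64`); the names `align_down`, `struct window`, `map_window` and `unmap_window` are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Round offset down to a multiple of page_size (which must be a power of two). */
uint64_t align_down(uint64_t offset, uint64_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return offset & ~(page_size - 1);
}

struct window {
    void *map;           /* address returned by mmap */
    size_t map_len;      /* length passed to mmap */
    unsigned char *data; /* first byte the caller actually asked for */
};

/* Map len bytes of fd starting at an arbitrary file offset, read-only.
   mmap needs a page-aligned offset, so map from the enclosing page
   boundary and skip the leading bytes. Returns 0 on success, -1 on failure. */
int map_window(int fd, uint64_t offset, size_t len, struct window *w) {
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;

    uint64_t base = align_down(offset, (uint64_t)ps);
    size_t lead = (size_t)(offset - base);

    void *m = mmap(NULL, len + lead, PROT_READ, MAP_PRIVATE, fd, (off_t)base);
    if (m == MAP_FAILED)
        return -1;

    w->map = m;
    w->map_len = len + lead;
    w->data = (unsigned char *)m + lead;
    return 0;
}

/* Release a window obtained from map_window. */
void unmap_window(struct window *w) {
    munmap(w->map, w->map_len);
}
```

This keeps the address-space cost bounded by the window size, though the program then has to manage windows itself, which gives back some of the simplicity the article praises.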

On a related digression, I've seen so many cases of programs that could've handled infinitely long input in constant space instead implemented as some form of "read the whole input into memory", which unnecessarily puts a limit on the input length.
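
As a small illustration of the constant-space style, here is a sketch that counts newlines with a fixed 4 KiB buffer, so memory use does not grow with input length. The function name `count_lines` and the buffer size are arbitrary choices for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Count '\n' bytes in a stream using a fixed-size buffer.
   Returns the count, or -1 if a read error occurred. */
long count_lines(FILE *f) {
    assert(f != NULL);
    char buf[4096];
    long lines = 0;
    size_t n;

    /* Each iteration reuses the same buffer; nothing accumulates. */
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == '\n')
                lines++;
        }
    }
    return ferror(f) ? -1 : lines;
}
```

The same loop works on pipes and other non-seekable inputs, which a whole-file read or mapping cannot handle.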
