In an ideal world you could just write endian-independent code (i.e. read byte by byte) and leave the compiler optimizer to sort it out. This has the benefit of also not tripping over any alignment restrictions.
You're right that when your code iterates through data byte by byte, you can write it in an endian-agnostic way and let the optimizer take care of recognizing that your shifts and ORs can be replaced with a memcpy on little-endian systems. But it's not always that simple.
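For reference, the simple pattern looks something like this (a minimal sketch; `load_le16` is just an illustrative name, not from any particular library):

```c
#include <stdint.h>

/* Read a little-endian 16-bit value one byte at a time.
   No endianness or alignment assumptions; on little-endian
   targets, modern optimizers typically collapse this into a
   single plain 16-bit load. */
static uint16_t load_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
```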
Here's my most recent use case:
I have a relatively large array of uint16_t with highly repetitive (low-entropy) data. I want to serialize it to disk without wasting a lot of space, so I run compress2 from zlib on the data when serializing it and decompress it when deserializing. However, it makes sense to share these files between machines, so I have defined the file format to use compressed little-endian 16-bit unsigned ints. Therefore, if you ever want to run this code on a big-endian machine, you need to add some code that flips the bytes around before compressing, then flips them back after decompressing.
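A sketch of what the serialize path could look like (my assumptions, not part of the actual file format: the `serialize_u16le` and `swap16_array` names are made up, compile-time endianness detection uses the `__BYTE_ORDER__` macros that GCC and Clang define, and the compression level is arbitrary):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Byte-swap an array of uint16_t in place (only needed on big-endian hosts). */
static void swap16_array(uint16_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = (uint16_t)((v[i] >> 8) | (v[i] << 8));
}

/* Compress n uint16_t values into a freshly malloc'd buffer, with the
   uncompressed payload defined as little-endian on disk. Returns NULL
   on error; *out_len receives the compressed size. */
static unsigned char *serialize_u16le(const uint16_t *data, size_t n,
                                      size_t *out_len)
{
    uLong src_len = (uLong)(n * sizeof(uint16_t));

    /* Work on a copy so the caller's array is left untouched. */
    uint16_t *tmp = malloc(src_len);
    if (!tmp)
        return NULL;
    memcpy(tmp, data, src_len);

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    swap16_array(tmp, n);  /* the file format is little-endian */
#endif

    uLongf dst_len = compressBound(src_len);
    unsigned char *dst = malloc(dst_len);
    if (dst && compress2(dst, &dst_len, (const Bytef *)tmp, src_len,
                         Z_DEFAULT_COMPRESSION) != Z_OK) {
        free(dst);
        dst = NULL;
    }
    free(tmp);
    if (dst)
        *out_len = (size_t)dst_len;
    return dst;
}
```

Deserializing is the mirror image: uncompress into the uint16_t buffer, then apply the same swap on big-endian hosts.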