I was explaining why e2ee has important upsides, not how e2ee works. With Ente (and I think Immich as well), facial recognition and CLIP embedding generation are done on-device[0], usually right when a photo is taken or before it's uploaded to the server.
Immich does it on the server.
What happens if there’s a new, better model? You’d need to re-download, decrypt, and re-run inference on all your past media, which runs to terabytes for many people.
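For a rough sense of scale, here is a back-of-envelope sketch of the re-download step alone. The library size and link speed are made-up example numbers, not measurements, and the function name is just for illustration; decryption and inference time would come on top.

```python
def redownload_hours(library_tb: float, link_mbit_s: float) -> float:
    """Hours needed just to pull the whole library back down once."""
    library_bits = library_tb * 1e12 * 8          # decimal terabytes -> bits
    seconds = library_bits / (link_mbit_s * 1e6)  # Mbit/s -> bit/s
    return seconds / 3600


# Hypothetical inputs: a 2 TB library over a sustained 100 Mbit/s link.
print(f"{redownload_hours(2, 100):.1f} h")  # prints "44.4 h"
```

So even with an idealized, fully saturated link, that is close to two days of transfer before any client-side inference starts.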
I understand the benefit of e2ee in a situation where there is no trust between user and admin. In personal self-hosting, user and admin are the same person (or family), so the upsides are less relevant. The downsides (possible data loss for, e.g., kids who are not very good with passwords/keys; difficulty updating models and thumbnails; …) remain important and outweigh the benefits, even assuming the e2ee is implemented well.