It's a pretty deep rabbit hole. For semantic search, CLIP and cosine similarity are just fine. SmolVLM(2), mentioned by spacecadet, looks interesting though. I haven't integrated face recognition myself, but [deepface] seemed pretty complete.
I focused more on fast rendering in [photofield] (quick [explainer] if you're interested), but even the hacked-up basic semantic search with CLIP works better than it has any right to. Vector DBs are cool, but what's cooler is writing float arrays to sqlite :)
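
To make the "float arrays in sqlite" idea concrete, here's a rough sketch in stdlib-only Python. It is not photofield's actual code: the table name, schema, and helper names are all made up for illustration, and the toy 3-dimensional vectors stand in for real CLIP embeddings, which you'd get from whatever model you run.

```python
import math
import sqlite3
from array import array


def to_blob(vec):
    """Pack a sequence of floats into raw float32 bytes."""
    return array("f", vec).tobytes()


def from_blob(blob):
    """Unpack raw float32 bytes back into an array of floats."""
    vec = array("f")
    vec.frombytes(blob)
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(db, query, k=3):
    """Brute-force scan: score every stored vector against the query."""
    rows = db.execute("SELECT path, embedding FROM embeddings")
    scored = [(cosine(query, from_blob(blob)), path) for path, blob in rows]
    return sorted(scored, reverse=True)[:k]


# Hypothetical schema: one row per image, embedding stored as a BLOB.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE embeddings (path TEXT PRIMARY KEY, embedding BLOB NOT NULL)")

# Toy vectors standing in for real CLIP image embeddings.
toy = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.7, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
db.executemany(
    "INSERT INTO embeddings VALUES (?, ?)",
    [(path, to_blob(vec)) for path, vec in toy.items()],
)

# In a real setup the query vector would be the CLIP text embedding of the search phrase.
results = search(db, [1.0, 0.0, 0.0], k=2)
for score, path in results:
    print(f"{score:.3f}  {path}")
```

A linear scan like this is obviously O(n) per query, but for a personal photo library that tends to be plenty fast, which is kind of the point. If you normalize embeddings before storing them, the cosine collapses to a plain dot product.
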
[deepface]: https://github.com/serengil/deepface
[photofield]: https://github.com/SmilyOrg/photofield
[explainer]: https://lnar.dev/blog/photofield-origins/