This may not interest you, but Ente checks most of these boxes for me. It has face recognition and AI-based object search out of the box, and you can self-host their open-source server without any restrictions. The models they used might be useful for your project.
Their pricing page doesn't say anything as far as I can find, but do you still pay Ente if you self-host the server as well as the photos ("S3-compatible object storage")?
The Ente self-hosting proposition seems strange. Why would I want to end-to-end encrypt photos that I self-host? Sounds like it will only make life more difficult.
Ente is a tremendous option, and I don't know why I hadn't heard of it before. I don't think it meets what I'm looking for, but the fact that the software is completely open is impressive.