If I understand correctly, companies like OpenAI could run LLMs without having access to users' new inputs. It seems to me that new user data is really useful for further training of the models. Can they still train the models over encrypted data? If this new data is not usable, why would the companies still want it?
Let's assume they can train the LLMs over encrypted data: what if a large number of users inject crappy data (as was seen with the Tay chatbot story)? How can the companies still keep a way to clean it?
> Can they still train the models over encrypted data?
Yes, but then the model itself comes out encrypted, so only whoever holds the decryption key can actually use it.
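A toy sketch of why that matters, using the python-paillier library (additively homomorphic only, so this is just an illustration, not real FHE training); the one-weight linear model and the numbers are made up:

```python
from phe import paillier

# User side: keep the key pair, send only an encrypted label to the server.
pub, priv = paillier.generate_paillier_keypair()
x, y = 3.0, 7.0                      # one training example for a linear model y ~ w*x
enc_y = pub.encrypt(y)

# Server side: one SGD step where the label is a ciphertext.
w, lr = 0.5, 0.01
enc_residual = enc_y * (-1) + w * x  # encrypted (prediction - label)
enc_grad = enc_residual * x          # encrypted gradient for w
enc_w_new = enc_grad * (-lr) + w     # the updated weight is itself a ciphertext

# Only the user (key holder) can see the trained weight.
print(priv.decrypt(enc_w_new))       # ~0.665
```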
IMO ML training is not a realistic application for FHE; something like federated learning would be the way to do that with a reasonable level of privacy.
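For reference, a minimal sketch of the federated-learning alternative (plain FedAvg over a toy linear model; the client data and training setup here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, steps=20):
    """Plain SGD on one user's private data; only the resulting weights leave the device."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Three users, each with private data drawn from the same underlying model.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    clients.append((X, y))

# Server: start from a shared model, average the clients' locally trained weights.
w_global = np.zeros(2)
for _round in range(5):
    local_ws = [local_update(w_global.copy(), X, y) for X, y in clients]
    w_global = np.mean(local_ws, axis=0)   # FedAvg step

print(w_global)   # close to true_w, without the server ever seeing raw data
```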