RAG is broken when you have too much data.
Specifically, once the corpus reaches around 10k+ documents, a phenomenon called "Semantic Collapse" occurs.
https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...
Gemini with Google search is RAG using all public data, and it isn't broken.
Can't you make the thresholds higher?
Hmm... I guess not, you might want all that data.
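To make the threshold trade-off concrete, here is a minimal sketch, assuming "thresholds" means a minimum cosine similarity on retrieved documents. The `retrieve` helper, the document names, and the 2-D vectors are all made up for illustration; a real system would use an embedding model and a vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, docs, threshold):
    """Hypothetical helper: keep docs scoring at or above threshold, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy 2-D "embeddings", invented for this example only.
docs = {
    "exact_match": [1.0, 0.0],
    "related": [0.8, 0.6],
    "unrelated": [0.0, 1.0],
}
query = [1.0, 0.0]

loose = retrieve(query, docs, threshold=0.5)   # keeps the related doc too
strict = retrieve(query, docs, threshold=0.9)  # drops the related doc

print([doc_id for doc_id, _ in loose])
print([doc_id for doc_id, _ in strict])
```

Raising the threshold cuts noise, but it also drops the "related" document, which is the catch: you might want all that data.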
Super interesting topic. Learning a lot.