Your retrieval works on test data. Breaks on real user queries.
The chunking strategy that scored well in evaluation falls apart when real users ask questions in their own language — not the language of your documents. You debug the embedding model. The problem is the retrieval design.
