This visualization represents the CommonSenseQA dataset.
To clarify the performance disparity, we have grouped the questions based on model outcomes. The Blue Blocks represent the questions LLaMA-3 answers correctly in English.
This establishes a strong baseline: LLaMA-3 solves 78.0% of the questions.
When prompted in Hindi, the model struggles: it fails on a significant portion of the questions it previously solved in English.
The Red Band represents this "Performance Gap." These are not necessarily "harder" questions; they are simply questions where the model lacks the multilingual representation to map the concept from English to Hindi.
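The grouping above reduces to simple set arithmetic. The sketch below illustrates it with placeholder question IDs and outcomes; these are not the actual CommonSenseQA annotations.

```python
# Illustrative sketch: partition questions by per-language outcome.
# The two sets below are hypothetical placeholders.
correct_en = {"q1", "q2", "q3", "q4"}   # answered correctly in English (Blue)
correct_hi = {"q1", "q3"}               # answered correctly in Hindi

# The "Performance Gap" (Red Band): solved in English but failed in Hindi.
gap = correct_en - correct_hi

print(sorted(gap))  # → ['q2', 'q4']
```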
We generate synthetic "Hinglish" data using the CoCoa Model (Mondal et al., 2022).
Unlike standard translation, CoCoa allows us to enforce a specific Code-Mixing Index (CMI), ensuring a precise ratio of Hindi vocabulary within an English grammatical frame. This acts as a semantic bridge during fine-tuning.
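For concreteness, a commonly used per-sentence CMI definition (following Das and Gambäck, 2014) can be sketched as below; the tag scheme (`en`, `hi`, `other`) is an illustrative choice, not the paper's exact pipeline.

```python
from collections import Counter

def cmi(tags):
    """Per-sentence Code-Mixing Index.

    tags: one language tag per token, e.g. 'en', 'hi', or 'other'
    (language-independent). CMI = 100 * (1 - max_lang / (n - u)),
    where max_lang counts the dominant language's tokens, n is the
    total token count, and u counts language-independent tokens.
    """
    n = len(tags)
    lang_counts = Counter(t for t in tags if t != "other")
    u = n - sum(lang_counts.values())
    if n == u:  # no language-tagged tokens at all
        return 0.0
    return 100 * (1 - max(lang_counts.values()) / (n - u))

print(cmi(["en", "hi", "en", "hi"]))  # → 50.0 (maximally mixed)
print(cmi(["en", "en", "en"]))        # → 0.0  (monolingual)
```

A monolingual sentence scores 0, and an evenly mixed sentence scores the maximum of 50.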
We found CMI 2 optimal: it maintains the English sentence structure (Blue) while injecting dense Hindi tokens (Red).
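To make the "English frame, Hindi tokens" idea concrete, here is a deliberately naive code-mixing baseline (this is NOT the CoCoa model): keep the English sentence structure and swap a target fraction of words for Hindi translations from a toy dictionary. All dictionary entries and function names are illustrative.

```python
import random

# Toy English→Hindi lexicon for illustration only.
EN_TO_HI = {"water": "paani", "food": "khaana", "book": "kitaab"}

def mix(tokens, target_ratio, seed=0):
    """Replace up to target_ratio of the tokens with Hindi equivalents,
    leaving the English word order (the grammatical frame) intact."""
    rng = random.Random(seed)
    swappable = [i for i, t in enumerate(tokens) if t in EN_TO_HI]
    k = round(target_ratio * len(tokens))
    out = list(tokens)
    for i in rng.sample(swappable, min(k, len(swappable))):
        out[i] = EN_TO_HI[out[i]]
    return out

print(mix(["the", "book", "needs", "water"], target_ratio=0.5))
# → ['the', 'kitaab', 'needs', 'paani']
```

A real code-mixing model additionally handles morphology and word order; this sketch only shows what "a precise ratio of Hindi tokens in an English frame" means mechanically.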
Fine-tuning on this synthetic data forces the model to align its internal manifolds.
The results are visualized in Green. We recover the vast majority of the performance gap: Hindi accuracy jumps significantly. This indicates that we don't need massive native datasets to achieve equity; we need better geometric alignment.
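The gap-recovery claim is simple arithmetic. The sketch below uses the 78.0% English baseline from the text; the two Hindi accuracies are HYPOTHETICAL placeholders, since the text reports the recovery only qualitatively.

```python
# Gap-recovery arithmetic. Only acc_en comes from the text;
# the Hindi numbers are hypothetical, for illustration.
acc_en = 78.0            # English baseline (from the text)
acc_hi_before = 55.0     # hypothetical Hindi accuracy before fine-tuning
acc_hi_after = 73.0      # hypothetical Hindi accuracy after fine-tuning

gap = acc_en - acc_hi_before                      # the Red Band
recovered = (acc_hi_after - acc_hi_before) / gap  # the Green share
print(f"{recovered:.0%} of the gap recovered")
```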