LLMs: new landmark article from the MICS laboratory
A team from the MICS laboratory has just announced the publication of a landmark paper on the evolution of large language models (LLMs).
This study is entitled: 'When Does Reasoning Matter? Unpacking the Contribution of Reasoning to LLM Performance.'
In recent years, reasoning ability has become one of the central themes of debates around large language models (LLMs). These models, capable of explicitly generating Chains of Thought (CoT), regularly demonstrate cutting-edge performance, particularly in complex areas such as mathematics and programming.
However, despite their empirical success, several crucial questions remain largely unexplored:
- What tasks actually benefit from reasoning?
- At what model scale, and at what cost compared to classic Instruction Fine-Tuning (IFT), does reasoning pay off?
To address this, the MICS team designed a controlled environment that isolates the reasoning signal using synthetic data. The team then analyzed the effect of reasoning on five models of varying sizes and rigorously evaluated these trained models on 12 diverse benchmarks, covering both math-centric and general tasks.
This controlled environment allowed for direct comparison of the performance of IFT models and reasoning models across different scales and task types.
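To make the distinction concrete, the sketch below illustrates, in simplified and hypothetical form, how paired training examples for the two regimes might look: one formats the target as the answer alone (IFT style), the other prepends an explicit chain of thought. The field names and example record are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: building paired IFT vs. reasoning training examples.
# Field names ("input", "target") and the sample record are hypothetical.

def make_ift_example(prompt: str, answer: str) -> dict:
    """IFT-style target: the model is trained to emit the answer directly."""
    return {"input": prompt, "target": answer}

def make_reasoning_example(prompt: str, reasoning: str, answer: str) -> dict:
    """Reasoning-style target: the model emits a chain of thought before the answer."""
    return {"input": prompt, "target": f"{reasoning}\nAnswer: {answer}"}

if __name__ == "__main__":
    prompt = "A train travels 60 km in 1.5 hours. What is its average speed?"
    reasoning = "Average speed = distance / time = 60 km / 1.5 h = 40 km/h."
    answer = "40 km/h"

    print(make_ift_example(prompt, answer))
    print(make_reasoning_example(prompt, reasoning, answer))
```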
Key points to remember:
- Reasoning improves performance: Reasoning models can match the performance of much larger IFT models.
- Scale matters: Reasoning excels in models of 7B parameters and above, surpassing what IFT alone can achieve. Below this threshold, however, simply increasing the size of IFT models yields similar performance.
- Task sensitivity: Open-ended and mathematical tasks benefit most from reasoning, while general multiple-choice tasks remain less sensitive.
Read also: the largest paired reasoning–IFT datasets.
A big congratulations to the whole team behind this project: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El Haddad, Céline Hudelot and Pierre Colombo. A big thank you also to the partners Diabolocom and Artifact.