

Poster

Revisit the Essence of Distilling Knowledge through Calibration

Wen-Shu Fan · Su Lu · Xin-Chun Li · De-Chuan Zhan · Le Gan


Abstract:

Knowledge Distillation (KD) has evolved into a practical technique for transferring knowledge from a well-performing model (teacher) to a weaker model (student). A counter-intuitive phenomenon known as capacity mismatch has been identified, wherein KD performance may degrade when a stronger teacher instructs the student. Various preliminary methods have been proposed to alleviate capacity mismatch, but a unifying explanation of its cause remains lacking. In this paper, we propose a unifying analytical framework that pinpoints the core of capacity mismatch based on calibration. Through extensive analytical experiments, we observe a positive correlation between the calibration of the teacher model and the KD performance of original KD methods. As this correlation arises from the sensitivity of the distillation measurement (e.g., KL divergence) to calibration, we recommend employing measurements insensitive to calibration, such as ranking-based losses. Our experiments demonstrate that a ranking-based loss can effectively replace KL divergence, enabling large models with poor calibration to teach better.
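The abstract contrasts the calibration-sensitive KL-divergence objective of standard KD with calibration-insensitive ranking-based losses. Below is a minimal PyTorch sketch of that contrast; the pairwise margin ranking loss shown here is an illustrative stand-in, since the specific ranking-based loss used by the authors is not given in this abstract.

```python
# Sketch only: kd_kl_loss follows the standard temperature-scaled KD objective;
# kd_ranking_loss is an assumed pairwise-margin formulation, not the paper's exact loss.
import torch
import torch.nn.functional as F


def kd_kl_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard KD loss: KL divergence between temperature-softened teacher
    and student distributions. Depends on the teacher's absolute probability
    values, hence sensitive to its calibration."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=1)
    p_teacher = F.softmax(teacher_logits / t, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)


def kd_ranking_loss(student_logits, teacher_logits, margin=0.1, num_pairs=64):
    """Illustrative ranking-based KD loss: encourage the student to preserve
    the teacher's pairwise ordering of classes, ignoring absolute probability
    values (and thus the teacher's calibration)."""
    b, c = teacher_logits.shape
    # Sample random class pairs for each example in the batch.
    i = torch.randint(0, c, (b, num_pairs), device=teacher_logits.device)
    j = torch.randint(0, c, (b, num_pairs), device=teacher_logits.device)
    # Target ordering comes from the teacher: +1 if it ranks class i above class j.
    sign = torch.sign(teacher_logits.gather(1, i) - teacher_logits.gather(1, j))
    diff = student_logits.gather(1, i) - student_logits.gather(1, j)
    # Hinge penalty on pairs the student orders against the teacher's ranking.
    return F.relu(margin - sign * diff).mean()


if __name__ == "__main__":
    student_logits = torch.randn(8, 100)
    teacher_logits = torch.randn(8, 100)
    print("KL-based KD loss:     ", kd_kl_loss(student_logits, teacher_logits).item())
    print("Ranking-based KD loss:", kd_ranking_loss(student_logits, teacher_logits).item())
```

Because the ranking loss only compares relative logit orderings, rescaling or mis-calibrating the teacher's output probabilities leaves its value essentially unchanged, which is the property the abstract argues makes such losses robust to poorly calibrated large teachers.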
