Knowledge distillation is the process of transferring knowledge from a large, complex model (the teacher) to a smaller, more efficient one (the student) while retaining most of the teacher's accuracy. This makes it practical to deploy models on resource-constrained devices.
Key Components
Teacher Model: A large, accurate model that is computationally expensive to run at inference time
Student Model: A smaller, faster model trained to mimic the teacher's outputs
Distillation Loss: A weighted combination of the standard cross-entropy on hard labels and a term matching the teacher's softened output probabilities (see the sketch after this list)
Temperature Parameter: Scales the logits before the softmax; higher temperatures yield softer probability distributions that expose more of the teacher's inter-class similarities
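As a rough illustration, here is a minimal PyTorch-style sketch of a distillation loss in the spirit of Hinton et al. (2015), combining cross-entropy on hard labels with a temperature-scaled KL divergence toward the teacher's soft probabilities. The function name, the mixing weight `alpha`, and the default temperature `T` are illustrative assumptions, not values taken from this text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence.

    T (temperature) and alpha (mixing weight) are illustrative defaults.
    """
    # Hard-target term: standard cross-entropy against ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft-target term: KL divergence between temperature-softened student
    # and teacher distributions. The T**2 factor keeps gradient magnitudes
    # comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Example usage with illustrative shapes (batch of 8, 10 classes):
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In a typical training loop, the teacher's logits would be computed under torch.no_grad() and only the student's parameters are updated against this combined loss.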