The code borrows heavily from:

1. Distillation for Faster R-CNN at the classification, regression, and feature levels: http://papers.nips.cc/paper/6676-learning-efficient-object-detection-models-with-knowledge-distillation.pdf
2. Distillation for Faster R-CNN at the feature level, with a mask: http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Distilling_Object_Detectors_With_Fine-Grained_Feature_Imitation_CVPR_2019_paper.pdf