Compressing Features for Learning with Noisy Labels

IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2022

Yingyi Chen 1, Shell Xu Hu 2, Xi Shen 3, Chunrong Ai 4, Johan A.K. Suykens 1

1 ESAT-STADIUS, KU Leuven, Leuven   2 Samsung AI Center, Cambridge
3 Tencent AI, Shenzhen   4 CUHK, Shenzhen

Comparisons of regression between standard MLP and MLP trained with Nested Dropout and Dropout on a synthetic noisy label dataset. (a) MLP with standard training; (b-d) predictions of MLP+Nested using only the first k\in\{1,10,100\} channels; (e-h) predictions of MLP+Dropout with drop ratio p_{\text{drop}}\in\{0.9,0.7,0.5,0.3\}.

Paper Code


Supervised learning can be viewed as distilling relevant information from input data into feature representations. This process becomes difficult when supervision is noisy, as the distilled information might not be relevant. In fact, recent research shows that networks can easily overfit all labels, including those that are corrupted, and hence can hardly generalize to clean datasets. In this paper, we focus on the problem of learning with noisy labels and introduce a compression inductive bias to network architectures to alleviate this over-fitting problem. More precisely, we revisit a classical regularization technique, Dropout, and its variant, Nested Dropout. Dropout can serve as a compression constraint through its feature-dropping mechanism, while Nested Dropout further learns feature representations ordered by feature importance. Moreover, the models trained with compression regularization are further combined with Co-teaching for a performance boost.
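To make the difference between the two masking schemes concrete, here is a minimal NumPy sketch of how the masks could be sampled. The function names are hypothetical, and drawing the Nested Dropout cut-off index from a truncated geometric distribution with parameter `p_geom` is an assumption borrowed from the original Nested Dropout formulation; it is not necessarily the sampling scheme used in the paper.

```python
import numpy as np

def dropout_mask(num_channels, p_drop, rng):
    """Standard Dropout: drop each channel independently with probability p_drop."""
    return (rng.random(num_channels) >= p_drop).astype(np.float64)

def nested_dropout_mask(num_channels, rng, p_geom=0.1):
    """Nested Dropout: sample a cut-off k and keep only the first k channels.

    Assumption: k is geometric, truncated to [1, num_channels].
    """
    k = min(int(rng.geometric(p_geom)), num_channels)
    mask = np.zeros(num_channels)
    mask[:k] = 1.0  # channels k, k+1, ... are all dropped together
    return mask, k

rng = np.random.default_rng(0)
m_drop = dropout_mask(8, p_drop=0.5, rng=rng)   # zeros scattered at random
m_nested, k = nested_dropout_mask(8, rng)       # ones first, then zeros
```

Because a Nested Dropout mask always keeps a prefix of the channels, earlier channels are retained more often than later ones during training, which is what encourages an ordering by importance.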

Theoretically, we conduct a bias-variance decomposition of the objective function under compression regularization. We analyze it for both a single model and Co-teaching. This decomposition provides three insights: (i) it shows that over-fitting is indeed an issue in learning with noisy labels; (ii) through an information bottleneck formulation, it explains why the proposed feature compression helps in combating label noise; (iii) it explains the performance boost brought by incorporating compression regularization into Co-teaching. Experiments show that our simple approach achieves performance comparable to, or even better than, state-of-the-art methods on benchmarks with real-world label noise, including Clothing1M and ANIMAL-10N.



In stage one, the hidden activation \tilde{Z} is computed by a feature extractor f. Dropout/Nested Dropout is applied to \tilde{Z} by masking some of the features to zeros, i.e., Z=M\odot \tilde{Z}. The compressed feature Z is then fed into the network structure d, which can simply be a fully connected layer (FC), to perform the final prediction. In stage two, the two base networks are fine-tuned with Co-teaching.
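The two stages can be illustrated with a small NumPy sketch. Everything here is a stand-in chosen for illustration: the feature extractor f is a single ReLU layer, d is one fully connected layer, the mask keeps the first 10 of 32 channels, and stage two is represented only by the small-loss selection rule commonly associated with Co-teaching, with a hypothetical `forget_rate`. It is not the authors' implementation.

```python
import numpy as np

def forward(x, W_f, W_d, mask):
    z_tilde = np.maximum(x @ W_f, 0.0)  # stand-in feature extractor f
    z = mask * z_tilde                  # Z = M ⊙ Z~ (masked channels become zero)
    return z @ W_d                      # d: a fully connected layer

def per_sample_ce(logits, labels):
    """Cross-entropy loss of each sample, computed with a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def small_loss_indices(losses, forget_rate):
    """Indices of the (1 - forget_rate) fraction of samples with smallest loss."""
    num_keep = int(round((1.0 - forget_rate) * len(losses)))
    return np.argsort(losses)[:num_keep]

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 5))            # toy mini-batch
labels = rng.integers(0, 3, size=16)    # toy (possibly noisy) labels
mask = np.zeros(32)
mask[:10] = 1.0                         # keep only the first 10 channels

# Two base networks, each a pair (W_f, W_d) of random toy weights.
nets = [(rng.normal(size=(5, 32)), rng.normal(size=(32, 3))) for _ in range(2)]
losses = [per_sample_ce(forward(x, W_f, W_d, mask), labels) for W_f, W_d in nets]

# Each network selects its small-loss samples; the peer network trains on them.
idx_for_net_b = small_loss_indices(losses[0], forget_rate=0.25)
idx_for_net_a = small_loss_indices(losses[1], forget_rate=0.25)
```

The cross-exchange in the last two lines is the design point: a sample that one network has fit with a small loss is passed to the other network, so that the two networks do not simply reinforce their own mistakes on corrupted labels.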


Please refer to our paper for more experiments.

Clothing1M with real-world label noise

ANIMAL-10N with real-world label noise


Test accuracy (%) of state-of-the-art methods on Clothing1M (noise ratio ~38%). All approaches are implemented with the ResNet-50 architecture. Results marked with "*" use a balanced subset or a balanced loss.

Average test accuracy (%) with standard deviation (3 runs) of state-of-the-art methods on ANIMAL-10N (noise ratio ~8%). All approaches are implemented with the VGG-19 architecture. Results marked with "*" use two networks for training.






Workshop Paper



If you find this work useful for your research, please cite:
              author={Chen, Yingyi and Hu, Shell Xu and Shen, Xi and Ai, Chunrong and Suykens, Johan A. K.},
              journal={IEEE Transactions on Neural Networks and Learning Systems}, 
              title={Compressing Features for Learning with Noisy Labels}, 


We thank Qinghua Tao for helpful comments and discussions.
This work is jointly supported by the following. EU: The research leading to these results has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation program / ERC Advanced Grant E-DUALITY (787960). This paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information. Research Council KU Leuven: Optimization frameworks for deep kernel machines C14/18/068. Flemish Government: FWO projects: GOA4917N (Deep Restricted Kernel Machines: Methods and Foundations), PhD/Postdoc grant. This research received funding from the Flemish Government (AI Research Program). EU H2020 ICT-48 Network TAILOR (Foundations of Trustworthy AI - Integrating Reasoning, Learning and Optimization). Leuven.AI Institute.
