
SimCLIC: A Simple Framework for Contrastive Learning of Image Classification
Han YANG, Jun LI
Journal of Systems Science and Information, 2023, Vol. 11, Issue 2: 204-218.
Contrastive learning, a self-supervised learning method, is widely used in image representation learning. The core idea is to pull positive sample pairs closer together and push negative sample pairs farther apart in the representation space. Siamese networks are the most common structure among current contrastive learning models. However, contrastive learning with positive and negative sample pairs on large datasets is computationally expensive, and positive samples are sometimes mislabeled as negative samples. Contrastive learning without negative sample pairs can still learn good representations. In this paper, we propose a simple framework for contrastive learning of image classification (SimCLIC). SimCLIC simplifies the Siamese network and can learn the representation of an image without negative sample pairs or a momentum encoder, mainly by perturbing the image representation generated by the encoder to produce different contrastive views. We apply three representation perturbation methods: history representation, representation dropout, and representation noise. We conducted experiments on several benchmark datasets, comparing against current popular models with image classification accuracy as the measure, and the results show that SimCLIC is competitive. Finally, we performed ablation experiments to verify the effect of different hyperparameters and structures on model effectiveness.
Keywords: contrastive learning / representation learning / image classification
Algorithm 1: SimCLIC Pseudocode, PyTorch-like (input: an image dataset)
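Below is a minimal PyTorch-style sketch of what such a training step could look like, assembled only from details stated in the abstract and the ablation tables: a single encoder, a predictor h (the 2-layer MLP of Table 4), no negative pairs and no momentum encoder, a perturbed copy of the encoder output as the second contrastive view (history, dropout, or noise), a stop-gradient on that view (Table 5), and a negative cosine loss (Table 6). The layer sizes, noise scale, dropout rate, handling of the history buffer, and the exact form of the loss are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    """Predictor h; a 2-layer MLP, the variant Table 4 reports as best."""
    def __init__(self, dim=2048, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, z):
        return self.net(z)

def perturb(z, mode, z_hist=None, p=0.1, sigma=0.1):
    """Create the contrastive view by perturbing the representation z.
    The mode names follow the paper; p, sigma and the per-sample history
    buffer are illustrative assumptions."""
    if mode == "noise":        # representation noise: add Gaussian noise
        return z + sigma * torch.randn_like(z)
    if mode == "dropout":      # representation dropout: randomly zero dimensions
        return F.dropout(z, p=p, training=True)
    if mode == "history":      # history representation: features of the same
        assert z_hist is not None, "history mode needs stored features"
        return z_hist          # images saved at an earlier training step
    raise ValueError(f"unknown perturbation mode: {mode}")

def simclic_loss(p_out, z_view):
    """Negative cosine similarity with stop-gradient on the perturbed view
    (the working combination in Tables 5 and 6)."""
    z_view = z_view.detach()   # stop-gradient
    return -F.cosine_similarity(p_out, z_view, dim=-1).mean()

def train_step(encoder, predictor, images, optimizer, mode="noise", z_hist=None):
    z = encoder(images)                        # representation from the single encoder
    z_tilde = perturb(z, mode, z_hist=z_hist)  # perturbed contrastive view
    loss = simclic_loss(predictor(z), z_tilde)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # the detached z can be cached and reused as z_hist in the "history" mode
    return loss.item(), z.detach()
```

In the history mode, z_hist would hold detached representations of the same images saved at an earlier epoch; whether SimCLIC symmetrizes the loss over the two branches is not stated above, so the sketch keeps a single asymmetric term.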
Table 1 The statistics of datasets |
Aircraft | Caltech101 | Cars | CIFAR10 | CIFAR100 | DTD | Flowers | Food | Pets | SUN397 | VOC2007 | |
Classes | 100 | 101 | 196 | 10 | 100 | 47 | 102 | 101 | 37 | 397 | 20 |
Total number | 10000 | ≈9000 | 16185 | 60000 | 60000 | 5640 | 8189 | 101000 | ≈5000 | 108000 | 9963 |
Table 2 Linear evaluation and fine-tuning error rates (%) |
Method | ImageNet | Aircraft | Caltech101 | Cars | CIFAR10 | CIFAR100 | DTD | Flowers | Food | Pets | SUN397 | VOC2007 |
Linear evaluation: |
InsDis | 40.50 | 63.13 | 28.88 | 71.02 | 19.72 | 40.03 | 31.54 | 16.56 | 36.61 | 31.22 | 50.53 | 25.63 |
MoCo v1 | 39.40 | 64.45 | 24.67 | 72.01 | 19.84 | 42.29 | 31.17 | 17.90 | 37.90 | 30.16 | 48.98 | 24.07 |
PCL | 38.50 | 78.39 | 23.10 | 87.07 | 18.16 | 44.26 | 37.13 | 35.27 | 51.98 | 24.66 | 54.30 | 21.69 |
PIRL | 38.30 | 62.92 | 25.52 | 71.28 | 17.47 | 38.74 | 31.01 | 16.40 | 35.35 | 28.64 | 46.11 | 23.39 |
PCL v2 | 32.40 | 62.97 | 13.58 | 69.49 | 8.09 | 26.46 | 29.41 | 14.66 | 35.12 | 17.21 | 43.75 | 18.86 |
SimCLR v1 | 30.70 | 55.10 | 9.95 | 56.27 | 8.82 | 27.27 | 25.80 | 9.13 | 32.53 | 16.67 | 40.79 | 19.23 |
MoCo v2 | 28.90 | 58.21 | 12.08 | 60.69 | 7.72 | 25.10 | 26.12 | 9.93 | 31.05 | 16.70 | 39.68 | 17.31 |
SimCLR v2 | 28.30 | 53.62 | 10.37 | 49.63 | 7.47 | 23.22 | 23.62 | 7.10 | 26.92 | 15.28 | 38.53 | 18.43 |
SeLa v2 | 28.20 | 62.71 | 12.80 | 63.14 | 7.27 | 25.19 | 25.85 | 9.78 | 28.92 | 16.78 | 37.29 | 17.27 |
InfoMin | 27.00 | 61.42 | 12.16 | 58.96 | 8.51 | 26.57 | 25.27 | 12.82 | 30.47 | 13.76 | 39.00 | 16.76 |
BYOL | 25.70 | 46.13 | 8.54 | 43.60 | 6.74 | 22.14 | 23.09 | 5.5 | 26.99 | 10.90 | 40.01 | 18.86 |
DeepCluster v2 | 24.80 | 45.51 | 8.67 | 41.40 | 5.98 | 20.39 | 21.38 | 5.28 | 22.06 | 10.64 | 34.52 | 16.06 |
SwAV | 24.70 | 45.96 | 9.16 | 45.94 | 6.01 | 20.42 | 22.98 | 5.38 | 23.38 | 12.40 | 34.42 | 16.32 |
SimCLIC (ours), history | – | 56.15 | 12.12 | 54.26 | 5.97 | 25.96 | 26.40 | 9.52 | 28.76 | 18.78 | 40.18 | 18.56 |
SimCLIC (ours), dropout | – | 57.09 | 12.35 | 54.52 | 6.00 | 26.18 | 27.02 | 9.47 | 29.82 | 18.93 | 40.61 | 18.89 |
SimCLIC (ours), noise | – | 55.86 | 11.92 | 53.89 | 5.92 | 25.65 | 26.04 | 9.50 | 29.51 | 18.75 | 40.12 | 18.55 |
Fine-tuned (no ImageNet column; values start at Aircraft): |
InsDis | 26.62 | 27.96 | 38.44 | 6.68 | 31.74 | 36.01 | 10.49 | 23.22 | 23.78 | 48.16 | 28.10 | |
MoCo v1 | 24.39 | 25.05 | 34.88 | 6.11 | 28.48 | 34.63 | 10.55 | 22.72 | 23.04 | 46.65 | 25.09 | |
PCL | 25.03 | 12.38 | 26.76 | 3.65 | 20.38 | 30.00 | 9.17 | 21.70 | 13.02 | 41.60 | 17.92 | |
PIRL | 27.32 | 29.17 | 38.98 | 7.77 | 33.52 | 35.74 | 10.19 | 25.04 | 23.74 | 49.62 | 30.10 | |
PCL v2 | 20.63 | 11.96 | 28.32 | 3.50 | 19.74 | 28.24 | 7.05 | 19.66 | 14.61 | 41.18 | 17.80 | |
SimCLR v1 | 18.94 | 9.65 | 16.22 | 2.93 | 15.47 | 28.46 | 6.25 | 17.60 | 15.90 | 36.69 | 17.42 | |
MoCo v2 | 20.13 | 15.62 | 24.80 | 3.55 | 28.67 | 30.53 | 5.65 | 23.22 | 20.20 | 44.23 | 28.29 | |
SimCLR v2 | 21.29 | 17.06 | 20.16 | 3.78 | 20.95 | 29.84 | 5.68 | 17.78 | 16.80 | 38.88 | 21.81 | |
SeLa v2 | 18.01 | 11.01 | 14.38 | 3.20 | 15.63 | 25.64 | 4.20 | 13.76 | 11.45 | 34.16 | 15.15 | |
InfoMin | 19.76 | 16.08 | 21.24 | 3.06 | 28.85 | 28.88 | 4.76 | 21.07 | 14.72 | 42.34 | 23.37 | |
BYOL | 20.55 | 10.60 | 15.40 | 2.99 | 16.05 | 26.38 | 5.52 | 14.45 | 10.38 | 36.04 | 17.30 | |
DeepCluster v2 | 17.48 | 9.25 | 12.73 | 2.94 | 14.85 | 25.16 | 4.69 | 12.49 | 10.57 | 33.58 | 15.10 | |
SwAV | 16.92 | 10.15 | 13.24 | 3.22 | 15.63 | 24.84 | 4.54 | 12.78 | 10.95 | 33.76 | 15.34 | |
SimCLIC (ours), history | 22.86 | 11.75 | 19.56 | 3.12 | 19.65 | 29.58 | 4.86 | 20.84 | 17.46 | 42.15 | 21.21 |
SimCLIC (ours), dropout | 22.24 | 12.27 | 20.12 | 3.11 | 20.48 | 29.46 | 5.68 | 20.97 | 17.79 | 42.68 | 21.86 |
SimCLIC (ours), noise | 22.46 | 11.72 | 19.89 | 2.94 | 19.62 | 28.92 | 4.51 | 21.18 | 17.28 | 42.38 | 21.37 |
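For context on how the two halves of Table 2 are obtained: under linear evaluation the pretrained encoder is frozen and only a linear classifier is trained on each target dataset, while fine-tuning also updates the encoder weights. A minimal sketch of the frozen-encoder protocol follows; the feature dimension, optimizer, and schedule are generic assumptions, not the paper's exact evaluation settings.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, train_loader, num_classes, feat_dim=2048,
                      epochs=90, lr=0.1, device="cuda"):
    """Freeze the encoder and train only a linear classifier on its features."""
    encoder.to(device).eval()
    for param in encoder.parameters():
        param.requires_grad = False          # frozen backbone

    classifier = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():            # features only, no encoder gradients
                feats = encoder(images)
            loss = criterion(classifier(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```

Fine-tuning would differ only in leaving requires_grad enabled and passing the encoder parameters to the optimizer as well.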
Table 3 Training details reported in the original papers |
Method | Epochs | Batch size | Momentum encoder | Momentum bank | Projection head | Data augmentation |
InsDis* | 200 | 256 | √ | √ | ||
MoCo v1 | 200 | 256 | √ | √ | ||
PCL | 200 | 256 | √ | √ | ||
PIRL* | 200 | 1024 | √ | √ | ||
PCL v2 | 200 | 256 | √ | √ | √ | |
SimCLR v1 | 1000 | 4096 | √ | √ | ||
MoCo v2 | 800 | 256 | √ | √ | √ | |
SimCLR v2 | 800 | 4096 | √ | √ | √ | |
SeLa v2 | 400 | 4096 | √ | √ | √ | |
InfoMin | 800 | 256 | √ | √ | √ | |
BYOL | 1000 | 4096 | √ | √ | ||
DeepCluster v2 | 800 | 4096 | √ | √ | √ | |
SwAV | 800 | 4096 | √ | √ | ||
SimCLIC (ours) | 1000 | 128 | √ |
Table 4 Effect of predictor h |
No predictor | Fixed random init | 1-layer MLP | 2-layer MLP | |
Error rate (%) | 90.17 | 89.16 | 8.65 | 5.92 |
Table 5 Effect of stop-gradient operator |
With stop-gradient | Without stop-gradient | |
Error rate (%) | 5.92 | 91.04 |
Table 6 Effect of loss function |
Cosine | Cross-entropy | |
Error rate (%) | 5.92 | 10.69 |
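To make the Table 6 comparison concrete, the two loss variants can be written as below. The cosine form is the one used in the training-step sketch above; the cross-entropy form shown here treats the channels of the stop-gradient branch as soft targets, which is one common formulation in negative-free Siamese methods and is an assumption about the exact variant evaluated.

```python
import torch.nn.functional as F

def cosine_loss(p, z):
    """Negative cosine similarity against the stop-gradient branch (Table 6: 5.92%)."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def cross_entropy_loss(p, z):
    """Channel-wise softmax cross-entropy variant (Table 6: 10.69%); one plausible form."""
    z = z.detach()
    return -(F.softmax(z, dim=-1) * F.log_softmax(p, dim=-1)).sum(dim=-1).mean()
```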