Abstract: The resurgence of unsupervised learning can be attributed to the remarkable progress of self-supervised learning, which includes generative $(\mathcal{G})$ and discriminative $(\mathcal{D})$ models. In computer vision, the mainstream self-supervised learning algorithms are $\mathcal{D}$ models. However, designing a $\mathcal{D}$ model could be over-complicated; also, some studies hinted that a $\mathcal{D}$ model might not be as general and interpretable as a $\mathcal{G}$ model. In this paper, we switch from $\mathcal{D}$ models to $\mathcal{G}$ models using the classical auto-encoder $(AE)$. Note that a vanilla $\mathcal{G}$ model is far less efficient than a $\mathcal{D}$ model in self-supervised computer vision tasks, as it wastes model capability on overfitting semantic-agnostic high-frequency details. Inspired by perceptual learning, which could use cross-view learning to perceive concepts and semantics, we propose a novel $AE$ that could learn semantic-aware representations via cross-view image reconstruction. (Following [26], we refer to semantics as visual concepts; e.g., a semantic-aware model is one that can perceive visual concepts and whose learned features are efficient in object recognition, detection, etc.) We use one view of an image as the input and another view of the same image as the reconstruction target. This kind of $AE$ has rarely been studied before, and the optimization is very difficult. To enhance learning ability and find a feasible solution, we propose a semantic aligner that uses geometric transformation knowledge to align the hidden code of the $AE$ to help optimization. These techniques significantly improve the representation learning ability of the $AE$ and make self-supervised learning with $\mathcal{G}$ models possible. Extensive experiments on many large-scale benchmarks (e.g., ImageNet, COCO 2017, and SYSU-30k) demonstrate the effectiveness of our methods.
Code is available at https://github.com/wanggrun/Semantic-Aware-AE.
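
To make the cross-view objective concrete, the toy sketch below reconstructs one view of an image from another and shows why aligning the hidden code with the known geometric transformation can help. It is an illustration under strong assumptions, not the method in the repository above: `encode` and `decode` are hypothetical identity stand-ins for learned networks, the only transformation considered is a horizontal flip, and `align` is a hypothetical aligner that applies that flip to the hidden code.

```python
import random

def hflip(img):
    """Horizontally flip a 2-D image stored as a list of rows."""
    return [row[::-1] for row in img]

def crop(img, top, left, size):
    """Cut a size x size window out of a 2-D image."""
    return [row[left:left + size] for row in img[top:top + size]]

def mse(a, b):
    """Mean squared error between two equally sized 2-D images."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def encode(view):
    """Hypothetical stand-in for a learned encoder (identity here)."""
    return [row[:] for row in view]

def decode(code):
    """Hypothetical stand-in for a learned decoder (identity here)."""
    return [row[:] for row in code]

def align(code, flipped):
    """Hypothetical aligner: apply the known view-to-view transform to the code."""
    return hflip(code) if flipped else code

def cross_view_loss(input_view, target_view, flipped, use_aligner=True):
    """Reconstruct target_view from input_view, optionally aligning the code."""
    code = encode(input_view)
    if use_aligner:
        code = align(code, flipped)
    return mse(decode(code), target_view)

rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]

# Two views of the same image: a crop, and the same crop flipped horizontally.
patch = crop(image, 2, 2, 4)
input_view, target_view = patch, hflip(patch)

aligned = cross_view_loss(input_view, target_view, flipped=True)
unaligned = cross_view_loss(input_view, target_view, flipped=True, use_aligner=False)
print(f"loss with aligner: {aligned:.4f}, without aligner: {unaligned:.4f}")
```

In this toy setting the aligned reconstruction matches the target exactly, while the unaligned one does not. A real model would replace the identity stand-ins with trained networks and would have to handle richer transformations such as crops and rescaling.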

Semi-Supervised Seq2seq Joint-Stochastic-Approximation Autoencoders with Applications to Semantic Parsing

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Joint Stochastic Approximation and Its Application to Learning Discrete Latent Variable Models

Semantic-Aware Auto-Encoders for Self-supervised Representation Learning

SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression

Variational Autoencoders for Semi-supervised Text Classification

E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation

Supervising the Decoder of Variational Autoencoders to Improve Scientific Utility

Dual-View Variational Autoencoders for Semi-Supervised Text Matching

Unleashing the True Potential of Sequence-to-Sequence Models for Sequence Tagging and Structure Parsing

Audio Word2vec: Sequence-to-Sequence Autoencoding for Unsupervised Learning of Audio Segmentation and Representation

Disentangled Variational Auto-Encoder for Semi-supervised Learning

Variational Autoencoder for Semi-Supervised Text Classification

Advancing Semi-Supervised Task Oriented Dialog Systems by JSA Learning of Discrete Latent Variable Models

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder

Joint Parsing and Generation for Abstractive Summarization

Syntax-Directed Variational Autoencoder for Structured Data

Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

Language as a Latent Sequence: deep latent variable models for semi-supervised paraphrase generation

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders