Taming Transformers for High-Resolution Image Synthesis
Patrick Esser, Robin Rombach, Björn Ommer
23 Jun 2021
Transformer: Exploiting Its Highly Promising Learning Capabilities
Since the paper "Attention Is All You Need" was published, many tasks in fields such as natural language processing and computer vision have come to rely on the transformer architecture. However, because self-attention models pairwise relationships among all inputs, transformers are computationally infeasible for long sequences such as high-resolution images. As a result, autoregressive generative models built on convolutions were predominant in image synthesis tasks that require long sequences.
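To make this concrete, here is a back-of-the-envelope comparison (my own illustrative numbers, not from the paper) of attending over raw pixels versus attending over a small grid of codebook indices:

```python
# Quadratic cost of self-attention: every token attends to every other token.
seq_len = 256 * 256              # a 256x256 image unrolled pixel by pixel
pixel_entries = seq_len ** 2     # ~4.3 billion attention weights per head

latent_len = 16 * 16             # a 16x16 grid of discrete codebook indices
latent_entries = latent_len ** 2 # 65,536 attention weights per head

print(f"pixel-level attention matrix:  {pixel_entries:,} entries")
print(f"latent-level attention matrix: {latent_entries:,} entries")
```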
To make transformers usable for such tasks, this paper argues that the combination of a transformer architecture with a convolutional approach outperforms current state-of-the-art models: the convolutional part provides contextually rich visual parts and global compositions, while the transformer models long-range interactions within these compositions.
Integrating the Effectiveness of Convolutional Architectures
The paper proposes a two-stage approach that pairs a transformer with a convolutional model. The first stage is the convolutional part, which captures the global composition of images: instead of representing images as raw pixels, images are represented as compositions of perceptually rich image constituents drawn from a learned codebook.
The paper uses an encoder E and a decoder D that learn to represent images with codes from the discrete codebook, and the convolutional model is trained end-to-end with a suitable reconstruction loss.
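A minimal sketch of this quantization step, written in PyTorch; the function name `quantize` and variable names are mine, and the straight-through gradient trick follows the original VQVAE recipe:

```python
import torch

def quantize(z, codebook):
    """Nearest-neighbor lookup of encoder outputs in the codebook.

    z:        encoder output, shape (B, C, H, W)
    codebook: learnable embedding matrix, shape (K, C)
    Returns the quantized latents and the index grid fed to the transformer.
    """
    B, C, H, W = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
    dist = torch.cdist(flat, codebook)            # distances to all K codes
    idx = dist.argmin(dim=1)                      # nearest code per vector
    z_q = codebook[idx].reshape(B, H, W, C).permute(0, 3, 1, 2)
    # Straight-through estimator: gradients flow from z_q back to z.
    z_q = z + (z_q - z).detach()
    return z_q, idx.reshape(B, H, W)
```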
To learn a perceptually richer codebook, the paper introduces VQGAN, a variant of the original VQVAE that adds an adversarial training procedure on top of a perceptual loss. This allows a much stronger compression of images, significantly reducing the sequence length when unrolling the latent code and thereby enabling the application of powerful transformer models.
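The sketch below illustrates how the generator-side objective composes these terms; the function signature and the fixed `lam` are my simplifications (the paper computes the GAN weight adaptively and uses an LPIPS perceptual loss):

```python
import torch

def vqgan_generator_loss(x, x_rec, z, z_q, logits_fake, lam=1.0, beta=0.25):
    """Illustrative composition of the VQGAN generator-side objective.

    x, x_rec:    input image and its reconstruction
    z, z_q:      encoder output and its quantized counterpart
    logits_fake: discriminator scores for the reconstruction
    """
    rec = torch.abs(x - x_rec).mean()                    # L1 reconstruction term
    # (VQGAN additionally adds an LPIPS perceptual term to `rec`.)
    codebook = torch.mean((z.detach() - z_q) ** 2)       # moves codes toward encodings
    commit = beta * torch.mean((z - z_q.detach()) ** 2)  # keeps the encoder committed
    gan = -logits_fake.mean()                            # generator part of the GAN loss
    return rec + codebook + commit + lam * gan
```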
Lastly, high-resolution images are generated patch-wise: images are cropped so that the sequence length stays within a maximally feasible size, and a sliding-window approach is used to sample from the transformer.
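A hedged sketch of what sliding-window sampling over the latent grid could look like; the `model(ctx)` interface returning next-token logits is an assumption for illustration, not the paper's actual API:

```python
import torch

@torch.no_grad()
def sample_sliding_window(model, h, w, window=16, k=100):
    """Autoregressively sample an (h, w) grid of codebook indices.

    For each position, the transformer only attends to a window-sized crop
    of previously generated indices around it, so the sequence length stays
    bounded regardless of the target resolution.
    """
    grid = torch.zeros(h, w, dtype=torch.long)
    for i in range(h):
        for j in range(w):
            # Crop a window that keeps position (i, j) inside it.
            top = max(0, min(i - window // 2, h - window))
            left = max(0, min(j - window // 2, w - window))
            ctx = grid[top:top + window, left:left + window].reshape(1, -1)
            logits = model(ctx)                          # (1, vocab_size), assumed
            probs = torch.softmax(logits, dim=-1)
            topk = probs.topk(k)                         # top-k filtering
            grid[i, j] = topk.indices[0, torch.multinomial(topk.values[0], 1)]
    return grid
```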
Outperforming State-Of-The-Art Convolutional Approaches
Four experiments support this approach: (1) showing that the advantages of transformers are retained, (2) integrating the effectiveness of convolutional architectures to enable high-resolution image synthesis, (3) analyzing how codebook quality affects performance, and (4) a quantitative comparison to a wide range of existing approaches for generative image synthesis. Across these experiments, the paper demonstrates that the approach retains the advantages of transformers by outperforming previous codebook-based state-of-the-art approaches built on convolutional architectures.