Attention Is All You Need
Ashish Vaswani, Noam Shazeer, et al.
Dec 6, 2017
Abstract
The paper proposes a new machine learning architecture for natural language processing called the Transformer. This architecture is based solely on the attention mechanism, dispensing with earlier network architectures such as RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks). Despite this difference, the Transformer achieves state-of-the-art (SOTA) performance, surpassing traditional models on tasks such as machine translation.
What Is the Attention Mechanism?
Let's say we want to build an artificial intelligence model that classifies the name of an animal from a picture. Naturally, the input is the RGB value of each pixel in the picture. The model assigns a weight, called an attention weight, to each pixel. Based on these attention weights, we compute attention scores and ultimately classify which animal it is. Remarkably, when we visualize the attention weights for a picture of a bird, the features unique to birds, such as wings, beaks, and claws, receive the largest attention scores.
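As a rough sketch of this idea (not the exact method of any particular vision model), the snippet below computes attention weights over image-patch features with a softmax and pools them into a single vector for classification. The patch features, the learned `query` vector, and the shapes are all illustrative assumptions.

```python
import numpy as np

def attention_pool(patch_features, query):
    """Toy attention pooling over image patch features.

    patch_features: (num_patches, dim) array of per-patch features
    query: (dim,) vector standing in for learned model parameters (assumed)
    """
    # Unnormalized relevance of each patch to the query
    scores = patch_features @ query                      # (num_patches,)
    # Softmax turns scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum: patches with high weight (e.g. wings, beak) dominate
    pooled = weights @ patch_features                    # (dim,)
    return pooled, weights

# Example: 16 patches with 8-dimensional features (random placeholders)
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))
query = rng.normal(size=8)
pooled, weights = attention_pool(features, query)
print(weights.round(3))   # a visualized "attention map" would show these weights
```

The printed weights are what an attention-map visualization would highlight on the image.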
For a sequence-to-sequence model, given a stream of words as input, the model returns another stream of words as output. For example, a translation task requires a model that translates a sequence of words in one language into another. During training, the model learns to assign attention weights to each word on the encoder side and computes an attention score for each word on the decoder side. As a simple example, let us translate the following English sentence into Korean.
"I go to school." -> "나는 학교에 간다."
For '나는', the first word on the decoder side, the attention weight of 'I', the first word on the encoder side, is the highest. Meanwhile, for '학교에', the second word on the decoder side, the attention weights of 'to' and 'school', the second and third words on the encoder side, are the highest. This is how attention works in a sequence-to-sequence model.
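The formula behind these weights is the scaled dot-product attention the paper defines, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, where the decoder supplies the queries and the encoder supplies the keys and values. In the minimal sketch below, the 4-dimensional encoder and decoder vectors are random placeholders, so the printed weights only illustrate the mechanics, not a learned alignment between 'school'/'to' and '학교에'.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (num_queries, num_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy cross-attention: each decoder word attends to all encoder words.
encoder_words = ["I", "go", "to", "school"]
decoder_words = ["나는", "학교에", "간다"]
rng = np.random.default_rng(1)
K = V = rng.normal(size=(len(encoder_words), 4))   # stand-ins for encoder states
Q = rng.normal(size=(len(decoder_words), 4))       # stand-ins for decoder states
_, weights = scaled_dot_product_attention(Q, K, V)
for dec, row in zip(decoder_words, weights):
    print(dec, dict(zip(encoder_words, row.round(2))))
```

In a trained model, the row for '학교에' would put most of its weight on 'to' and 'school'.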
Why Attention? Why Self-Attention?
Before the Transformer emerged, recurrent networks and encoder-decoder architectures achieved huge success on most transduction problems. These architectures train the model through hidden states, which carry information from earlier steps. The problem with these networks is that as the sequence grows longer, the hidden state gradually loses information from earlier in the sentence. The Transformer, which relies solely on attention, improves task performance by drawing global dependencies between the encoder and the decoder.
Every component of the encoder and decoder is delicately designed. The key part of the architecture is self-attention. The Transformer takes advantage of self-attention in two respects: computational complexity and long-range dependencies. For the sentence representations popular in SOTA models, such as word-piece and byte-pair encodings, self-attention can be computed faster than recurrent layers in terms of per-layer complexity. Self-attention also helps the model learn the long-range (long-sentence) dependencies that sequence transduction tasks have long struggled with, as sketched below.
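A minimal single-head sketch of self-attention follows: queries, keys, and values all come from the same sequence, so every position can reach every other position in one step. The projection matrices and shapes are illustrative assumptions; a real Transformer layer uses multiple heads, masking in the decoder, and learned parameters.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over one sequence X.

    Every position attends to every other position in a single step
    (constant path length). Cost is O(n^2 * d) per layer for sequence
    length n and dimension d, versus O(n) sequential steps for an RNN.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Illustrative shapes: a sequence of 6 tokens with model dimension 8
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (6, 8)
```

Because no position has to pass through a chain of hidden states to reach another, long-range dependencies are easier to learn than in a recurrent model.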
How the Transformer Affected NLP
The paper reports the performance of the Transformer on machine translation, model variations, and English constituency parsing. Even though the Transformer itself achieved SOTA on numerous NLP tasks, substituting small components of the model or applying task-aware representations can enhance performance even further.
NLP is one of the hottest themes in machine learning these days. The bright side of NLP is that it will never disappear. In other words, there will always be huge demand for natural language processing ability, since all kinds of language data keep piling up endlessly. However, the dark side of NLP is that it reached more-than-sufficient SOTA so early, mainly because of the Transformer, and this has made new NLP architectures feel somewhat meaningless.
Despite this critical flaw, the reason I want to be engaged in NLP is that solving tasks by applying a suitable machine learning model and data structure is fun. The field also involves linguistic understanding such as semantics, syntax, and language analysis. Understanding and constructing this whole complex structure is one of the greatest charms of programming.
(Image source: https://www.ulatus.com/translation-blog/the-dangers-of-machine-translation/)