
Representation Learning Basic (BERT)

by wlqmfl 2023. 4. 25.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Google AI Language

24 May 2019

 

A New Language Representation Model

 

"A good representation is one that makes a subsequent learning task easier."

 

  The paper presents BERT (Bidirectional Encoder Representations from Transformers), which is designed to learn deep representations from unlabeled text by jointly conditioning on both left and right context. This conceptually simple but empirically powerful model was devised to overcome the unidirectionality constraint of previous approaches, especially fine-tuning approaches. BERT consists of two major stages: (1) pre-training, where the deep representation learning happens, and (2) fine-tuning, where the pre-trained model is adapted to a variety of downstream tasks.

 BERT advances the state of the art on eleven NLP tasks, exceeding the performance of heavily engineered task-specific architectures. Its model architecture is a multi-layer bidirectional Transformer encoder, released in two sizes, BERT-Base and BERT-Large. BERT uses WordPiece embeddings with a 30,000-token vocabulary and can handle diverse downstream tasks because the input sequence may be either a single sentence or a sentence pair (e.g. <Question, Answer>).
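 To make the input format concrete, here is a minimal sketch of how a <Question, Answer> pair is packed into a single sequence with [CLS] and [SEP] tokens and segment ids. It uses the Hugging Face transformers tokenizer for illustration; that library is an assumption of mine, not part of the original paper's codebase.

```python
# A minimal sketch of BERT's input format for a <Question, Answer> pair,
# using the Hugging Face `transformers` tokenizer (assumed installed).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # ~30k WordPiece vocab

question = "Where was BERT developed?"
answer = "BERT was developed by Google AI Language."

# A single call packs both sentences into one sequence:
# [CLS] question tokens [SEP] answer tokens [SEP]
encoded = tokenizer(question, answer)

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# token_type_ids (segment embeddings) are 0 for sentence A, 1 for sentence B
print(encoded["token_type_ids"])
```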

 

Pre-training BERT

 "A good representation" is learnt through a well-made pre-training process. BERT is pre-trained using two unsupervised tasks. First task is called Masked LM, which is focused on training deep bidirectional representation. It is referred to as Cloze task, that masks some percentage(in this case, 15%) of the tokens in the input sequence at random, and predicts those masked tokens. Unlike the prior works, Masked LM allows BERT to learn both the left and the right context.

 The second task, Next Sentence Prediction (NSP), teaches BERT the relationship between two sentences (i.e. text-pair representations). This information is essential for downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI). The task is a simple binary classification: is sentence A actually followed by sentence B or not? Despite its simplicity, the paper demonstrates that Next Sentence Prediction is beneficial to both QA and NLI.
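 The sketch below shows one simple way such NSP training pairs could be built from a corpus of documents: half the time sentence B really is the next sentence (IsNext), half the time it is drawn from another document (NotNext). The corpus and sampling details are illustrative, not the paper's exact pipeline.

```python
import random

# Toy sketch of NSP pair construction from a list of documents,
# each document being a list of sentences.
def make_nsp_pair(documents):
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], "IsNext"      # the true next sentence
    other = random.choice(documents)             # may be any document
    return sent_a, random.choice(other), "NotNext"

docs = [
    ["the man went to the store .", "he bought a gallon of milk ."],
    ["penguins are flightless birds .", "they live in the southern hemisphere ."],
]
print(make_nsp_pair(docs))
```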

 

Fine-tuning BERT

 Compared to pre-training, which takes four days on 4 to 16 Cloud TPUs, fine-tuning takes at most one hour on a single Cloud TPU. Aside from a few unusual tasks, Google maintains that NLP researchers would rarely need to pre-train BERT from scratch, but would instead fine-tune it for a specific task. For each task, the paper simply plugs the task-specific inputs and outputs into BERT and fine-tunes all the parameters end-to-end.
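 Below is a minimal fine-tuning sketch for a sentence-pair classification task, written with the Hugging Face transformers library (my assumption for illustration; the paper used Google's own code). The point is simply that a small task-specific head sits on top of pre-trained BERT and all parameters are updated end-to-end.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Plug task-specific inputs/outputs into pre-trained BERT and
# fine-tune *all* parameters end-to-end (one illustrative step).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("A man is playing a guitar.", "A person plays an instrument.",
                   return_tensors="pt")
labels = torch.tensor([1])  # e.g. "entailment" for an NLI-style task

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr within the paper's recommended range
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```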

 BERT, built on the Transformer encoder, is fundamentally a fine-tuning approach on top of a model that has already learnt representations. However, some tasks either require a task-specific model architecture or gain computational benefits from pre-computing an expensive representation of the training data once and then running cheaper models on top of it. Such tasks are better served by a feature-based approach than by fine-tuning. Through an ablation study, the paper shows that BERT is effective for both the fine-tuning and the feature-based approach.
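 For contrast, here is a feature-based sketch in the spirit of the paper's NER ablation: BERT is frozen, its hidden states are pre-computed, and a cheaper model would then be trained on those features. The layer choice follows the paper's observation that concatenating the top four hidden layers works well; the rest of the code is my own illustration.

```python
import torch
from transformers import BertModel, BertTokenizer

# Feature-based sketch: freeze BERT, pre-compute contextual features once,
# then train a lightweight task model on top of them.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()

inputs = tokenizer("Jim Henson was a puppeteer.", return_tensors="pt")
with torch.no_grad():                                 # no gradients flow into BERT
    hidden_states = bert(**inputs).hidden_states      # tuple: embeddings + 12 layers

features = torch.cat(hidden_states[-4:], dim=-1)      # (1, seq_len, 4 * 768)
# `features` can now feed a cheaper model (e.g. a BiLSTM tagger) that is
# trained while BERT itself stays fixed.
print(features.shape)
```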

 

Influence on a Broad Set of NLP Tasks

 In the years since this paper was published, many BERT-style models have achieved state of the art in various areas. I will further introduce sensational papers that build on BERT to pull off strong performance, such as BEiT and SpanBERT. The paper has also opened up many possibilities for fine-tuning and feature-based approaches on top of a pre-trained model with representation learning.
