BERT

Robustly Optimized BERT Approach

by wlqmfl 2023. 1. 26.
RoBERTa: A Robustly Optimized BERT Pretraining Approach

Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA

26 Jul 2019


Abstract
RoBERTa (A Robustly Optimized BERT Approach) is a replication study of BERT that improves its performance by modifying hyperparameters and the amount of training data. The paper examines the original training objectives (MLM, NSP), pointing out which of them hurt performance or leave room for improvement; it also trains on considerably more data than BERT did, and tunes hyperparameters such as the learning rate of the Adam optimizer (Kingma and Ba, 2015). Aggregating these improvements, RoBERTa achieves higher performance on three benchmarks: GLUE, SQuAD, and RACE. The paper additionally investigates how the data used for pre-training and the number of passes through that data affect performance. The results show that the masked language modeling objective remains competitive under the right design decisions, while RoBERTa's results on GLUE (where it outperforms both BERT_Large and XLNet_Large) suggest that dataset size and training time may matter more than model architecture and changes to the pre-training objective.
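Since the abstract highlights hyperparameters tuned for the Adam optimizer, here is a minimal PyTorch sketch of what that optimizer setup looks like; the learning rate, betas, epsilon, warmup length, and total steps are illustrative placeholders, not the paper's exact configuration.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Stand-in module; a real setup would use a BERT-style Transformer encoder.
model = torch.nn.Linear(768, 768)

# Illustrative placeholder values: RoBERTa reports tuning Adam's settings
# (notably the second-moment decay and epsilon) for stability with large
# batches, but the exact numbers below are not taken from the paper.
optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98), eps=1e-6)

warmup_steps, total_steps = 10_000, 500_000

def lr_lambda(step: int) -> float:
    # Linear warmup followed by linear decay, the usual BERT-style schedule.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Inside the training loop, optimizer.step() is followed by scheduler.step().
```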

Robust Optimization
Analyzing the training procedure and re-implementing the architecture are the key steps in robustly optimizing BERT. First, while BERT relies on a masking pattern generated once during data preprocessing (static masking), RoBERTa uses dynamic masking, which generates a new masking pattern every time a sequence is fed to the model. This choice is based on comparable or slightly better task performance relative to the original BERT, and on the fact that dynamic masking becomes advantageous when pre-training for more steps on larger datasets. Second, the paper found that removing the NSP (Next Sentence Prediction) loss and adopting the FULL-SENTENCES input format, in which each input is packed with full sentences sampled contiguously from one or more documents up to the maximum sequence length, matches or slightly improves downstream performance while making comparisons simpler. Moreover, the paper prefers training with large batches, which improves perplexity on the masked language modeling objective as well as end-task accuracy. Lastly, RoBERTa adopts a byte-level BPE vocabulary instead of a character-level BPE vocabulary.
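Conceptually, dynamic masking just means drawing the masking decisions anew at data-loading time rather than once during preprocessing. Below is a minimal Python sketch of that idea; the special-token ids, vocabulary size, and the 80/10/10 replacement split are assumptions borrowed from BERT's original masking recipe, not RoBERTa's actual implementation.

```python
import random

# Hypothetical special-token ids for illustration; real values depend on the vocabulary.
MASK_ID = 103
VOCAB_SIZE = 30522
SPECIAL_IDS = {0, 101, 102}  # [PAD], [CLS], [SEP] in a BERT-style vocab

def dynamic_mask(token_ids, mask_prob=0.15):
    """Apply BERT-style masking (80% [MASK], 10% random token, 10% unchanged).

    Called every time a sequence is fed to the model, so each epoch sees a
    different masking pattern (dynamic masking), unlike static masking where
    the pattern is fixed once during preprocessing.
    """
    inputs, labels = [], []
    for tok in token_ids:
        if tok not in SPECIAL_IDS and random.random() < mask_prob:
            labels.append(tok)            # the MLM loss predicts the original token
            r = random.random()
            if r < 0.8:
                inputs.append(MASK_ID)    # replace with [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(VOCAB_SIZE))  # replace with a random token
            else:
                inputs.append(tok)        # keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(-100)           # position ignored by the MLM loss
    return inputs, labels
```

Calling dynamic_mask on the same token sequence in every epoch yields a different pattern each time, which is exactly the property that distinguishes dynamic from static masking.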

Optimizing the Machine Learning Architecture
It sounds like a contradiction when someone claims to literally optimize a machine learning model. A model is "perfectly optimized" for a task when it has memorized all the data, or all the answers, which we call over-fitting, and that kind of optimization is meaningless. RoBERTa, however, is an optimization of BERT as a general model of natural language. In other words, RoBERTa optimizes how the network captures natural language in general, not the answers to specific tasks.
Then, can we push the optimization of natural language knowledge to its limit with purely mathematical methods, without over-fitting to a single task? Suppose that someday it becomes possible to optimize every hyperparameter value, so that the resulting model achieves the highest scores on the main NLP tasks. Could we confidently say that such a model is not over-fitted? Could we say that it does not undermine the foundations of Artificial Intelligence? I cannot tell yet, but perhaps a whole new architecture will appear in the future and give a clear answer to this question.
