
STILTs: Supplementary Training on Pretrained Sentence Encoders

by wlqmfl 2023. 8. 26.
Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Jason Phang, Thibault Févry, Samuel R. Bowman

27 Feb 2019

 

A Second Stage of Pretraining

 Transfer learning lets us improve performance on a wide range of tasks by starting from a pretrained model that has already learned useful representations from a different dataset, rather than training from scratch. However, this recipe can be brittle, particularly when the target task has little training data: fine-tuning must learn enough about the new task while avoiding catastrophic forgetting and overfitting. This paper proposes an approach called STILTs, which investigates whether an additional phase of pretraining on data-rich supervised tasks can address this fragility.


 This approach inserts an extra stage (ii) between (i) the pretraining stage and (iii) the fine-tuning stage; a minimal code sketch follows the list:

(i) A model is first trained on an unlabeled-data task like language modeling that can teach it to reason about the target language.
(ii) The model is then further trained on an intermediate, labeled-data task for which ample data is available.
(iii) The model is finally fine-tuned further on the target task and evaluated.
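
 To make these three stages concrete, here is a self-contained PyTorch sketch (not code from the paper): a toy encoder and random tensors stand in for a real pretrained encoder such as BERT and for real intermediate and target datasets, but the flow is the same, and in particular the encoder updated in stage (ii) is reused, under a fresh classification head, in stage (iii).

```python
# Minimal STILTs-style pipeline sketch; toy data and encoder are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

HIDDEN = 64

class Classifier(nn.Module):
    """A shared sentence encoder plus a task-specific classification head."""
    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder                     # carried over across stages
        self.head = nn.Linear(HIDDEN, num_labels)  # freshly initialized per task

    def forward(self, x):
        return self.head(self.encoder(x))

def train(model, loader, epochs=1, lr=1e-3):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)  # all parameters learnable
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

def toy_loader(n, num_labels):
    x = torch.randn(n, HIDDEN)                     # stand-in for encoded sentences
    y = torch.randint(0, num_labels, (n,))
    return DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

# (i) stand-in for a pretrained sentence encoder (in practice BERT, GPT, or ELMo)
encoder = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.Tanh())

# (ii) supplementary training on a data-rich intermediate task (e.g. 3-way entailment)
train(Classifier(encoder, num_labels=3), toy_loader(5000, 3))

# (iii) fine-tune the *same* encoder on the target task with a new head
train(Classifier(encoder, num_labels=2), toy_loader(1000, 2))
```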

 

Methods

(i) Pretrained Sentence Encoders: The paper selects three sentence encoders: BERT, GPT, and ELMo. BERT and GPT follow the inductive transfer learning recipe in which all pretrained parameters remain learnable during fine-tuning. ELMo's standard setup instead freezes the pretrained weights and trains an additional encoder on top during fine-tuning, so the paper adapts its procedure to that setup.

(ii) Intermediate Task Training: The paper uses four intermediate tasks, all data-rich, sentence-level tasks similar to those in GLUE: textual entailment on several datasets and a tailored fake-sentence-detection task.

(iii) Target Tasks and Evaluation: The study evaluated each model on nine target tasks within the GLUE benchmark.

 

Limited Target-Task Data & Fine-Tuning Stability

 The paper also investigates the impact of the approach when target-task data is limited. When the fine-tuning training sets are downsampled to 5k and 1k examples for the same models, they observe, somewhat surprisingly, that the benefits of supplementary training become notably more pronounced.
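
 Concretely, this experiment amounts to subsampling the target task's training set before fine-tuning. A small sketch follows, where target_examples is an assumed placeholder list rather than anything from the paper.

```python
# Downsample the target task's training set to 5k or 1k examples before fine-tuning.
# `target_examples` is a hypothetical placeholder for the full training set.
import random

def downsample(examples, k, seed=0):
    rng = random.Random(seed)
    return rng.sample(examples, k) if len(examples) > k else list(examples)

small_train = downsample(target_examples, k=5000)   # or k=1000
```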

 Furthermore, the paper demonstrates that STILTs improves fine-tuning stability for the 24-layer version of BERT, BERT-LARGE. Because BERT-LARGE fine-tunes unstably on small datasets, the study runs multiple random restarts and keeps the best model: each restart starts from the same pretrained checkpoint, reshuffles the fine-tuning data, and re-initializes the classification layer. With STILTs, performance variance across these random restarts drops markedly, underscoring its role in improving fine-tuning stability.
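
 The restart procedure itself is simple to express. Below is a hedged sketch of the selection loop described above; the helpers build_model, fine_tune, and dev_score are hypothetical placeholders rather than code from the paper.

```python
# Pick the best fine-tuning run across random restarts (sketch, assumed helpers).
import random

def best_of_restarts(checkpoint, train_examples, num_restarts=20):
    best_model, best_score = None, float("-inf")
    for seed in range(num_restarts):
        rng = random.Random(seed)
        shuffled = train_examples[:]
        rng.shuffle(shuffled)                       # new data order per restart
        model = build_model(checkpoint, seed=seed)  # same weights, new classifier init
        fine_tune(model, shuffled)
        score = dev_score(model)                    # select on the dev set
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```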

 


BERT on STILTs set the state of the art on the GLUE benchmark at the time of publication.

 

Reference

https://arxiv.org/pdf/1811.01088.pdf