
A Paradigm Shift for Non-English (Korean) Language Processing

by wlqmfl 2023. 1. 17.
KR-BERT: A Small-Scale Korean-Specific Language Model

Sangah Lee, Hansol Jang, et al.

11 Aug 2020


Abstract
Bidirectional Encoder Representations from Transformers (BERT), the world's dominant machine learning model for language processing, has also spawned various task-specific models over the years. This paper presents another BERT-derived language model, KR-BERT, a Korean-specific model that uses a smaller vocabulary and a smaller dataset. Several design choices, such as the tokenizer and the minimal span of tokens (character vs. sub-character), were tested in order to raise performance.
The paper introduces three fundamental reasons for scaling the model down for Korean-specific *downstream tasks. First, unlike English, Korean is an *agglutinative language and is therefore morphologically richer than English. Second, its writing system is composed of more than 10,000 syllable characters. Lastly, BERT is too large to be applied to a single-language task.
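As a quick sanity check on the "more than 10,000 syllable characters" figure (the arithmetic below is mine, not the paper's), the number of composable modern Hangul syllable blocks follows directly from how a block is built:

```python
# Modern Hangul syllable blocks combine one initial consonant (choseong),
# one medial vowel (jungseong), and an optional final consonant (jongseong).
NUM_INITIALS = 19
NUM_MEDIALS = 21
NUM_FINALS = 28   # 27 final consonants + the "no final" case

print(NUM_INITIALS * NUM_MEDIALS * NUM_FINALS)  # 11172, the size of Unicode's Hangul Syllables block
```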
Some Korean-specific BERT models that return prominent results already exist, such as KorBERT and KoBERT. However, by utilizing a sub-character-based Korean *BPE vocabulary, a BidirectionalWordPiece tokenizer, and high-quality pre-training, KR-BERT achieves better performance on several downstream tasks.

*Downstream task: a natural language processing task that requires a pre-trained model and fine-tuning in order to be solved.
*Some features of agglutinative languages:
1) Morphemes tend to remain unchanged after they are combined.
2) Word meanings tend to be more deducible from their parts.
3) There is one grammatical category per affix.
*BPE (Byte Pair Encoding): a subword tokenization algorithm used in models such as GPT; BERT uses the closely related WordPiece algorithm. A toy sketch of the BPE merge loop is shown below.
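The sketch below is only an illustration of the core BPE idea (repeatedly merging the most frequent adjacent symbol pair); the toy corpus, the function name learn_bpe_merges, and the number of merges are my assumptions, not the paper's actual vocabulary-building code.

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Toy BPE: word_freqs maps word -> frequency; each word starts as a
    sequence of characters, and the most frequent adjacent pair is merged."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# On this toy corpus the most frequent adjacent pair is ('w', 'e'),
# so it becomes the first merge rule.
print(learn_bpe_merges({"low": 5, "lower": 2, "newest": 6}, 3))
# e.g. [('w', 'e'), ('l', 'o'), ('n', 'e')]
```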

Small-scale Language Model
We cannot rely on multilingual BERT for every single NLP task. Its unsatisfying performance on language-specific tasks shows that BERT itself is not omnipotent, and language-specific BERT models have turned out to perform better. The paper lists some disadvantages of the multilingual BERT model compared to language-specific ones:
1) The multilingual BERT model is pre-trained on a limited corpus domain, while language-specific models, such as the French and German ones, draw on broader domains including legal data, news articles, and so forth.
2) It lacks language-specific properties (rare characters, morphologically rich structure, meaningful tokens).

As written above, this paper suggests two main methods for dealing with Korean text and the morphological properties of Korean: a language-specific vocabulary and corresponding tokenizers. By utilizing these methods, KR-BERT delivers comparable performance with a smaller vocabulary, fewer parameters, and less training data.

Sub-characters and Tokenizers
Hangul (the Korean writing system) can be decomposed not only into characters but also into sub-characters. Thus, unlike previous Korean-specific models, KR-BERT utilizes a sub-character representation in addition to the BPE algorithm to obtain a Korean-specific vocabulary set. This way, the vocabulary is able to capture Korean verb/adjective conjugation forms without a morphological analyzer.
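To make the sub-character idea concrete: each modern Hangul syllable block is a fixed arithmetic combination of jamo, so it can be decomposed with plain Unicode arithmetic. The following is a minimal sketch of such a decomposition step, not the paper's actual preprocessing code (the function name to_subcharacters is made up here).

```python
# Unicode Hangul: syllables occupy U+AC00..U+D7A3 and are composed as
# code = 0xAC00 + (initial_index * 21 + medial_index) * 28 + final_index
INITIALS = [chr(0x1100 + i) for i in range(19)]         # choseong (leading consonants)
MEDIALS  = [chr(0x1161 + i) for i in range(21)]         # jungseong (vowels)
FINALS   = [""] + [chr(0x11A8 + i) for i in range(27)]  # jongseong (optional finals)

def to_subcharacters(text):
    """Decompose Hangul syllable blocks into jamo; leave other characters as-is."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 19 * 21 * 28:
            initial, rest = divmod(code, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(INITIALS[initial] + MEDIALS[medial] + FINALS[final])
        else:
            out.append(ch)
    return "".join(out)

# The conjugated verb form "했다" ("did") exposes its stem and ending
# only at the sub-character (jamo) level, which is what the vocabulary exploits.
print(to_subcharacters("했다"))
```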
Secondly, KR-BERT utilizes two different tokenizers: the baseline (the original WordPiece tokenizer used in multilingual BERT) and the BidirectionalWordPiece tokenizer. The latter's strength shows when both a forward match and a backward match are viable: it is advantageous to match the longer subword unit from the left in the case of a noun phrase, whereas a backward match is better in the case of a verb phrase.
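To illustrate the direction difference, here is a toy sketch of greedy longest-match segmentation in both directions. The vocab set, the function greedy_wordpiece, and the English example word are my assumptions (the Korean noun/verb examples from the paper are harder to verify in a toy setting), and the "##" continuation-marker detail of real WordPiece is omitted.

```python
def greedy_wordpiece(word, vocab, direction="forward"):
    """Greedy longest-match subword segmentation.
    forward: take the longest prefix in vocab, then continue to the right.
    backward: take the longest suffix in vocab, then continue to the left."""
    tokens = []
    remaining = word
    while remaining:
        for length in range(len(remaining), 0, -1):
            piece = remaining[:length] if direction == "forward" else remaining[-length:]
            if piece in vocab:
                if direction == "forward":
                    tokens.append(piece)
                    remaining = remaining[length:]
                else:
                    tokens.insert(0, piece)
                    remaining = remaining[:-length]
                break
        else:
            return ["[UNK]"]  # not even a single character matched
    return tokens

# Hypothetical toy vocabulary: forward match grabs the long prefix,
# backward match grabs the long suffix (e.g. a verb-ending-like unit).
vocab = {"play", "ing", "playi", "ng", "p", "l", "a", "y", "i", "n", "g"}
print(greedy_wordpiece("playing", vocab, "forward"))   # ['playi', 'ng']
print(greedy_wordpiece("playing", vocab, "backward"))  # ['play', 'ing']
```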
To briefly analyze the performance of each model (BERT, KorBERT, KoBERT, KR-BERT character WordPiece, KR-BERT character BidirectionalWordPiece, KR-BERT sub-character WordPiece, KR-BERT sub-character BidirectionalWordPiece): BidirectionalWordPiece seemed to perform better on noisy data such as NSMC (a sentiment classification dataset), while WordPiece showed better performance on relatively formal data such as NER (Named Entity Recognition), KorQuAD, and paraphrase detection. Also, the sub-character representation efficiently tackled OOV (out-of-vocabulary) problems.

The Limits Are Still Clear, But...
Although there have been many creative papers on how to improve non-English language processing, the ultimate solution is still the amount of data. These days, researchers have pushed the modeling side of NLP to a "fully sufficient" level, so the key is the quality and quantity of data, and this is why non-English language processing still lags far behind English. The quantity of data is something Korean can never catch up with: English has historically spread out and settled itself in countless cultures. However, in my opinion, since Korean is morphologically richer, it is much more beautiful than English or any other language. That means it is possible to capture higher-quality features of Korean, both syntactically and semantically, and push Korean processing to the next level. KR-BERT is a step towards achieving that goal.