
LUKE: Language Understanding with Knowledge-Based Embeddings

by wlqmfl 2023. 2. 4.
LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

Ikuya Yamada, Akari Asai, et al.

2 Oct 2020


Abstract
As the title suggests, this paper proposes LUKE, deeply contextualized entity representations based on a bidirectional transformer. LUKE achieves state-of-the-art results on five well-known entity-related tasks, including NER (Named Entity Recognition) and entity typing. Taking RoBERTa as its base model, LUKE is trained with a new pretraining task that extends BERT's MLM (Masked Language Model). Moreover, LUKE learns richer entity-centric information by making use of an entity-aware self-attention mechanism and an entity-annotated corpus retrieved from Wikipedia.
In the past, entity-related tasks were handled with a KB (Knowledge Base) or CWR (Contextualized Word Representation). However, KB-based approaches fall short because they cannot link entities that do not exist in the KB. Also, although transformer-based CWR has been dominating general-purpose word representation tasks, it is not well suited for entity representation tasks because CWR: (1) does not output span-level representations of entities, (2) makes it difficult to reason about relationships between entities, and (3) relies on word-based pretraining tasks that are not suitable for entity-based predictions. Unlike KB and CWR approaches, LUKE treats entities as independent tokens and can therefore reason about them directly.

LUKE
The architecture of LUKE is a multi-layer bidirectional transformer. The main structure of the model comprises (1) the input representation (token embedding, position embedding, and entity type embedding), (2) entity-aware self-attention (which uses four different query matrices, one for each pair of token types), and (3) the pretraining task. The model configuration follows RoBERTa_Large, with two special entities, [MASK] and [UNK], added to the entity vocabulary. The model is pretrained over randomly ordered Wikipedia pages for 200K steps.
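To make the second point concrete, here is a minimal single-head PyTorch sketch of the entity-aware self-attention idea: the query projection is chosen from four matrices depending on whether the attending token and the attended-to token are words or entities, while keys and values are shared. The class and variable names are my own illustration, not the authors' implementation, which is multi-headed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityAwareSelfAttention(nn.Module):
    """Single-head sketch of LUKE-style entity-aware self-attention.

    Four query projections cover the (word/entity) x (word/entity) cases;
    key and value projections are shared across all token types.
    """
    def __init__(self, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # one query projection per (query type, key type) pair
        self.q_w2w = nn.Linear(hidden_size, hidden_size)
        self.q_w2e = nn.Linear(hidden_size, hidden_size)
        self.q_e2w = nn.Linear(hidden_size, hidden_size)
        self.q_e2e = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden, is_entity):
        # hidden: (seq_len, hidden_size); is_entity: (seq_len,) bool mask
        k = self.key(hidden)      # shared keys
        v = self.value(hidden)    # shared values

        # attention scores under each of the four query matrices
        scores_w2w = self.q_w2w(hidden) @ k.T
        scores_w2e = self.q_w2e(hidden) @ k.T
        scores_e2w = self.q_e2w(hidden) @ k.T
        scores_e2e = self.q_e2e(hidden) @ k.T

        # pick the score matching the (attending, attended-to) token types
        is_e_row = is_entity.unsqueeze(1)   # type of the attending token i
        is_e_col = is_entity.unsqueeze(0)   # type of the attended-to token j
        scores = torch.where(
            is_e_row,
            torch.where(is_e_col, scores_e2e, scores_e2w),
            torch.where(is_e_col, scores_w2e, scores_w2w),
        ) / self.hidden_size ** 0.5

        attn = F.softmax(scores, dim=-1)
        return attn @ v

# illustrative usage: 4 word tokens followed by 2 entity tokens
hidden = torch.randn(6, 768)
is_entity = torch.tensor([False, False, False, False, True, True])
out = EntityAwareSelfAttention(768)(hidden, is_entity)   # (6, 768)
```

Here `is_entity` marks which positions of the concatenated word-plus-entity sequence are entity tokens; the design choice of sharing keys and values while splitting only the queries keeps the parameter overhead small.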

Performance
To evaluate the performance of LUKE, the paper conducts extensive experiments on five entity-related tasks: entity typing, relation classification, NER, cloze-style QA, and extractive QA. While the specific model architecture varies across tasks, they share a similar design based on a simple linear classifier on top of the representations of words, entities, or both. The input word sequence is created by inserting [CLS] and [SEP] tokens as the first and last tokens, respectively. Furthermore, the paper uses the [MASK] entity for the input entity sequence. By applying these methods together with the entity-aware self-attention mechanism, LUKE outperforms the baseline models that were previously state-of-the-art on each task.
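As a rough illustration of how small these task heads are, the sketch below shows a linear classifier over the output representation of a [MASK] entity, in the spirit of the entity typing setup. The class name, dropout rate, and the assumption that the encoder's [MASK]-entity vector is passed in directly are mine, not LUKE's API.

```python
import torch
import torch.nn as nn

class EntityTypingHead(nn.Module):
    """Hypothetical sketch: a simple linear classifier placed on top of the
    encoder's output vector for the [MASK] entity that marks the target span."""
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.dropout = nn.Dropout(0.1)          # assumed regularization
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, mask_entity_repr):
        # mask_entity_repr: (batch, hidden_size) output of the [MASK] entity
        return self.classifier(self.dropout(mask_entity_repr))

# illustrative usage: batch of 8 [MASK]-entity vectors -> logits over 9 types
logits = EntityTypingHead(1024, 9)(torch.randn(8, 1024))   # (8, 9)
```

The other tasks follow the same pattern, swapping in the relevant word or entity representations (for example, span representations for extractive QA) as input to the linear layer.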
Additionally, the paper performs ablation experiments on three parts to provide a detailed analysis: the effects of (1) the entity representations, (2) the entity-aware self-attention, and (3) the extra pretraining. Based on these experiments, the paper demonstrates the effectiveness of the entity representations and the entity-aware self-attention, while showing that the superior performance of LUKE is not simply owing to longer pretraining.

Entity Representations in Expert Knowledge
By adopting the entity-aware self-attention mechanism and the other methods above, LUKE detects hidden features that entity representations in context have in common. That ability is what makes LUKE superior to a knowledge base, CWR, or other architectures. The question is, can LUKE justify its entity selections in expert-knowledge settings? The answer is yes. Since the deeply contextualized entity representations that LUKE provides are well structured, most expert-knowledge tasks, even those not in English, could arguably be handled. Especially with the help of language-specific models and an additional knowledge base, LUKE can be applied to a wide range of tasks.