Albert

๊ฐœ์ธ study๋ฅผ ์œ„ํ•œ ์ž๋ฃŒ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‹ค๋ณด๋‹ˆ ๋‚ด์šฉ์— ์ž˜๋ชป๋œ ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

1 INTRODUCTION

๋…ผ๋ฌธ์˜ ์‹œ์ž‘์€ 2018๋…„ Bert(Bidirectional Encoder Representations from Transformers)์˜ ๋“ฑ์žฅ์œผ๋กœ ์‹œ์ž‘๋œ, Larger Dataset, Lager Model, Pre-trained, Fine-tuning์— ๋Œ€ํ•œ ์ด์•ผ๊ธฐ๋กœ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

Evidence from these improvements reveals that a large network is of crucial importance for achieving state-of-the-art performance (Devlin et al., 2019; Radford et al., 2019). It has become common practice to pre-train large models and distill them down to smaller ones (Sun et al., 2019; Turc et al., 2019) for real applications. Given the importance of model size, we ask: Is having better NLP models as easy as having larger models?

Large models come with two drawbacks: 1) memory limitations and 2) training speed. These are exactly the problems you face when training NLP models on a personal machine.

์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ’€๊ธฐ์œ„ํ•ด์„œ๋Š” 1) model parallelization, 2) clever memory management ๊ณผ ๊ฐ™์€ ํ•œ์ •๋œ ์ž์›์„ ์ž˜ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ๋ณธ ๋…ผ๋ฌธ ์ฒ˜๋Ÿผ, model architecture๋ฅผ ์ƒˆ๋กœ ๋””์ž์ธํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

A Lite BERT (ALBERT) relies on the following two main techniques.

The first one is a factorized embedding parameterization.

The second technique is cross-layer parameter sharing.

์œ„ ๋‘ ๋ฌธ์žฅ์„ ์„ค๋ช… ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ฐ„๋‹จํžˆ BERT์˜ paramter์ˆ˜๋ฅผ ์‚ดํŽด ๋ณผ ํ•„์š”๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.[1]

  • Embedding layer = 23M

  • Transformer layer = 7M * 12 = 85M

  • Pooler layer = 0.6M

  • Total = 110M

๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ, ์•„๋ž˜์™€ ๊ฐ™์ด parameter์˜ ์ˆ˜๋ฅผ ์•ฝ 90% ๊ฐ์†Œ์‹œํ‚ต๋‹ˆ๋‹ค.~!

  • Embedding layer = 4M (factorized embedding)

  • Transformer layer = 7M (cross-layer parameter sharing)

  • Pooler layer = 0.6M

  • Total = 12M
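The breakdown above can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope count, assuming BERT-base sizes (V ≈ 30k, H = 768, 12 layers) and E = 128 for ALBERT; biases, LayerNorm, and position/segment embeddings are ignored for simplicity.

```python
# Back-of-the-envelope parameter counts for BERT-base vs. ALBERT-base.
# Biases, LayerNorm, and position/segment embeddings are ignored for simplicity.
V, H, E, L = 30_000, 768, 128, 12          # vocab, hidden, factorized embedding, layers

# One Transformer layer: Q, K, V, output projections plus the two feed-forward matrices (H <-> 4H).
per_layer = 4 * H * H + 2 * H * (4 * H)    # ~7.1M

bert_embed   = V * H                       # ~23M  (E tied to H)
albert_embed = V * E + E * H               # ~3.9M (factorized: V -> E -> H)
pooler       = H * H                       # ~0.6M

bert_total   = bert_embed + L * per_layer + pooler    # ~110M
albert_total = albert_embed + per_layer + pooler      # ~12M (one shared layer)

print(f"BERT-base   ~ {bert_total / 1e6:.0f}M parameters")
print(f"ALBERT-base ~ {albert_total / 1e6:.0f}M parameters "
      f"({100 * (1 - albert_total / bert_total):.0f}% fewer)")
```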

์ด๋Ÿฌํ•œ model ์••์ถ•์— ๋Œ€ํ•ด์„œ๋Š” [3]์— ์ž˜ ์ •๋ฆฌ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

2.1 SCALING UP REPRESENTATION LEARNING FOR NATURAL LANGUAGE

This section reiterates that, in the larger-dataset / larger-model / pre-train / fine-tune paradigm, large models are important for performance but run into memory limitations and slow training speed.

2.2 CROSS-LAYER PARAMETER SHARING

This section reviews earlier work that applied parameter sharing.

2.3 SENTENCE ORDERING OBJECTIVES

๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•˜๋Š” 3๋ฒˆ์งธ element ์ค‘ ํ•˜๋‚˜์ธ SENTENCE ORDER PREDICTION (SOP) ์— ๋Œ€ํ•ด ์„ค๋ช… ํ•ฉ๋‹ˆ๋‹ค. 2 sentence pair ๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์—์„œ, false ์˜ˆ์ œ๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” ๋ฐฉ์‹์— ์ฐจ์ด๋ฅผ ๋‘ก๋‹ˆ๋‹ค. BERT ๋Š” ๋‹ค๋ฅธ ๋ฌธ์„œ์˜ ๋ฌธ์žฅ์„ 2nd sentence๋กœ ์‚ฌ์šฉ(NSP)ํ–ˆ์ง€๋งŒ, ALBERT์—์„œ๋Š” ์ธ์ ‘ํ•œ ๋‘ ๋ฌธ์žฅ์˜ order๋ฅผ swap(SOP)ํ•ฉ๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ 3์žฅ์— ๋‹ค์‹œ ์–ธ๊ธ‰๋ฉ๋‹ˆ๋‹ค.

ALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text.

BERT (Devlin et al.,2019) uses a loss based on predicting whether the second segment in a pair has been swapped with a segment from another document.

3 THE ELEMENTS OF ALBERT

ALBERT์˜ main idea๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

3.1 MODEL ARCHITECTURE CHOICES

Factorized embedding parameterization

BERT์™€ ๊ทธ ํ›„์† model๋“ค์—์„œ, WordPices embedding size $E$์™€ hidden layer size $H$๋Š” ๊ฐ™์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์ด๋Š” modeling์ด๋‚˜ ์‹ค์ œ ๊ตฌํ˜„ ์ธก๋ฉด์—์„œ ์ข‹์€ ์„ ํƒ์ด ์•„๋‹™๋‹ˆ๋‹ค.

WordPiece embeddings are context-independent representations, whereas the Transformer layers stacked on top of them are context-dependent, and this is where BERT's strong performance comes from. It is therefore more sensible to untie $E$ from $H$ and set $H \gg E$. Moreover, because the vocabulary size $V$ is very large, the constraint $E = H$ means that increasing $H$ to improve performance blows up the embedding matrix ($V \times H$), making the model hard to implement in practice.

ALBERT์—์„œ๋Š” ์ด๋ฅผ ์œ„ํ•ด, one hot vector์ธ vocabulary $V$๋ฅผ embedding size $E$๋กœ mapping ์‹œํ‚จํ›„, ์ด๋ฅผ ๋‹ค์‹œ hidden size $H$๋กœ mappingํ•˜๋Š” ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด, embedding layer parameter๋Š” 83% ๊ฐ์†Œํ•ฉ๋‹ˆ๋‹ค.

  • BERT: $V \times H = 23M$

  • ALBERT: $V \times E + E \times H = 4M$ (with $E = 128$)
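The projection can be written as a small PyTorch module. This is a minimal sketch with illustrative sizes (V = 30k, E = 128, H = 768), not ALBERT's actual implementation.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Two-step lookup: one-hot vocab (V) -> small embedding (E) -> hidden size (H)."""
    def __init__(self, vocab_size=30_000, embed_size=128, hidden_size=768):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_size)  # V x E (~3.8M params)
        self.project = nn.Linear(embed_size, hidden_size)       # E x H (~0.1M params)

    def forward(self, token_ids):
        return self.project(self.word_embed(token_ids))         # (batch, seq, H)

emb = FactorizedEmbedding()
tokens = torch.randint(0, 30_000, (2, 16))          # dummy batch of token ids
print(emb(tokens).shape)                            # torch.Size([2, 16, 768])
print(sum(p.numel() for p in emb.parameters()))     # ~3.9M, vs. ~23M for a full V x H table
```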

Cross-layer parameter sharing

Cross-layer parameter sharing reduces the Transformer-layer parameters by roughly 92%, since a single layer's weights are reused across all 12 layers (a minimal sketch follows the comparison below).

  • BERT: $7M \times 12\ \text{layers} = 85M$

  • ALBERT: $7M \times 1\ \text{(shared across all layers)} = 7M$
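Sharing is easy to express in code: instantiate one encoder layer and apply it repeatedly. The sketch below is an illustration using PyTorch's generic nn.TransformerEncoderLayer, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SharedTransformerEncoder(nn.Module):
    """Applies a single Transformer layer num_layers times, sharing its weights across depth."""
    def __init__(self, hidden_size=768, num_heads=12, ffn_size=3072, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=ffn_size, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):   # the same weights are reused at every depth
            x = self.layer(x)
        return x

enc = SharedTransformerEncoder()
x = torch.randn(2, 16, 768)                              # (batch, seq, hidden)
print(enc(x).shape)                                      # torch.Size([2, 16, 768])
print(sum(p.numel() for p in enc.parameters()) / 1e6)    # ~7M, independent of num_layers
```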

Inter-sentence coherence loss

In addition to the masked language modeling (MLM) loss, BERT uses a next-sentence prediction (NSP) loss. NSP was designed to improve performance on downstream tasks such as NLI.

ํ•˜์ง€๋งŒ, ์ดํ›„ ์—ฐ๊ตฌ์—์„œ, NSP์˜ ํšจ๊ณผ๊ฐ€ ๋ณ„๋กœ ์—†๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋‚˜์˜ค๋Š” ์—ฐ๊ตฌ๊ฐ€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ NSP ๋น„ ํšจ์œจ์— ๋Œ€ํ•œ ์›์ธ์œผ๋กœ ๋ณธ๋…ผ๋ฌธ์—์„œ๋Š” lack of difficulty ๋ฅผ ๋“ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋ฅธ document์˜ sentence๋ฅผ ์‚ฌ์šฉํ•œ NSP๋Š”, โ€œtopic prediction"๊ณผ โ€œcoherence predictionโ€ 2 task์˜ ํ•ฉ์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. โ€œtopic prediction"์€ โ€œcoherence prediction"์— ๋น„ํ•ด ์‰ฌ์šฐ๋ฉฐ, ์ด ๋‘˜๋‹ค ๋ชจ๋‘ MLM loss์— ํฌํ•จ๋œ๋‹ค๊ณ  ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn compared to coherence prediction, and also overlaps more with what is learned using the MLM loss.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” inter-sentence modeling์„ ์œ„ํ•ด์„œ, Sentence-Order prediction(SOP)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. SOP๋ฅผ ํ†ตํ•ด, topic prediction ๋ณด๋‹ค๋Š” inter-sentence coherence๋ฅผ ํฌ์ปค์Šค๋ฅผ ๋งž์ถœ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์‹คํ—˜์ ์œผ๋กœ, NSP๋กœ SOP ๊ด€๋ จ task๋ฅผ ์ž˜ ํ’€์ˆ˜ ์—†์ง€๋งŒ, SOP๋กœ๋Š” NSP ๊ด€๋ จ task๋ฅผ ์ž˜ ํ’€์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

3.2 MODEL SETUP

4 EXPERIMENTAL RESULTS

4.1 EXPERIMENTAL SETUP

BERT์—์„œ ์‚ฌ์šฉํ•œ ํ™˜๊ฒฝ์„ ์ตœ๋Œ€ํ•œ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ๊ฐœ์ธ์ ์œผ๋กœ tokenizer๋กœ SentencePiece๋ฅผ ์‚ฌ์šฉํ•œ ์ ์ž…๋‹ˆ๋‹ค. ๊ฐ„๋‹จํžˆ tokenizer์— ๋Œ€ํ•ด์„œ ์ •๋ฆฌํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.[2]

  • BPE (Byte-Pair Encoding): starts from individual characters and keeps merging pairs until the desired vocabulary size is reached; the merge criterion is frequency of occurrence.

  • WordPiece: similar to BPE, but the merge criterion is maximizing the likelihood of the training data. Used by BERT and DistilBERT.

  • Unigram: unlike BPE and WordPiece, it starts from the corpus with a large initial vocabulary and progressively splits/prunes it; the criterion is minimizing the loss on the corpus.

  • SentencePiece: the methods above all require pre-tokenization, which is difficult for languages written without spaces. SentencePiece treats the space as just another character, which also makes decoding straightforward. ALBERT and XLNet use SentencePiece with the unigram model.
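A minimal usage sketch of the sentencepiece Python package with a unigram model follows; the corpus path and vocabulary size are placeholders, not ALBERT's actual training configuration.

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on a plain-text corpus (one sentence per line).
# "corpus.txt" and vocab_size=30000 are placeholder values.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="albert_spm",
    vocab_size=30000,
    model_type="unigram",
)

# Load the model and tokenize: the space is kept as a normal symbol ("▁"),
# so decoding back to the original text is simple and lossless.
sp = spm.SentencePieceProcessor(model_file="albert_spm.model")
pieces = sp.encode("ALBERT uses SentencePiece with a unigram model.", out_type=str)
print(pieces)
print(sp.decode(pieces))
```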

4.2 EVALUATION BENCHMARKS

4.3 OVERALL COMPARISON BETWEEN BERT AND ALBERT

4.4 FACTORIZED EMBEDDING PARAMETERIZATION

4.5 CROSS-LAYER PARAMETER SHARING

4.6 SENTENCE ORDER PREDICTION (SOP)

5 DISCUSSION

ALBERT-xxlarge๊ฐ€ less parameter๋กœ BERT-large ๋ณด๋‹ค ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์ง€๋งŒ, structure์˜ ํฌ๊ธฐ๊ฐ€ ํฌ๊ธฐ ๋•Œ๋ฌธ์— computationally ๋” ๋น„์‹ผ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์ด์— ๋”ฐ๋ผ, training๊ณผ imference speed up์€ ๋‹ค์Œ ๊ฐœ์„ ์˜ ์ค‘์š” ํฌ์ธํŠธ ์ง€์ ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ SOP๊ฐ€ language representation์— ์žˆ์–ด์„œ ์กฐ๊ธˆ ๋” ์ข‹์€ ๋ฐฉ๋ฒ•์ด๋ผ๋Š” ๊ฒƒ์„ ์ฆ๋ช…ํ–ˆ์ง€๋งŒ, ๋” ์ข‹์€ ๋ฐฉ๋ฒ•์ด ์žˆ์„ ๊ฑฐ๋ผ๊ณ  ์ด์•ผ๊ธฐํ•˜๋ฉด์„œ ๋…ผ๋ฌธ์„ ๋๋ƒˆ์Šต๋‹ˆ๋‹ค

Reference
