NLP - Chapter 11
Transformer decoder - answer takes its own previous output and the encoder's output
as input
BERT - answer Bidirectional Encoder Representations from Transformers
BERT Paper - answer Pre-training of Deep Bidirectional Transformers for Language
Understanding
How are input embeddings calculated in BERT? - answer the sum of the token
embeddings, the segment embeddings, and the position embeddings
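The sum of the three lookups can be sketched in a few lines of NumPy. This is a minimal illustration with toy sizes; the random tables stand in for the learned embedding matrices and are not BERT's real weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (not BERT_BASE's real dimensions); random tables stand in
# for the three learned embedding matrices.
vocab_size, n_segments, max_len, hidden = 100, 2, 16, 8
token_emb = rng.normal(size=(vocab_size, hidden))
segment_emb = rng.normal(size=(n_segments, hidden))
position_emb = rng.normal(size=(max_len, hidden))

# Example sentence pair: first 3 tokens in segment A (0), last 2 in segment B (1).
token_ids = np.array([7, 42, 3, 19, 55])
segment_ids = np.array([0, 0, 0, 1, 1])
positions = np.arange(len(token_ids))

# BERT's input embedding is the element-wise sum of the three lookups.
input_emb = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(input_emb.shape)  # (5, 8): one hidden-size vector per input token
```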
BERT_BASE and BERT_LARGE – answer BERT_BASE has 12 layers, hidden size of
768, and 110M total parameters. BERT_LARGE has 24 layers, hidden size of 1024,
and 340M total parameters
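The quoted totals can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an approximation of my own (it ignores biases, LayerNorm, and the pooler, and assumes a ~30K WordPiece vocabulary and 4x FFN expansion); it lands near the quoted 110M and 340M:

```python
def approx_params(n_layers, hidden, vocab=30522, max_pos=512):
    # Per layer: attention projections Q, K, V, O (4 * H^2)
    # plus a feed-forward block with 4x expansion (2 * H * 4H = 8 * H^2).
    per_layer = 4 * hidden**2 + 8 * hidden**2
    # Embedding tables: token + position + segment, each of width H.
    embeddings = (vocab + max_pos + 2) * hidden
    return n_layers * per_layer + embeddings

print(f"BERT_BASE  ~{approx_params(12, 768) / 1e6:.0f}M")   # close to the quoted 110M
print(f"BERT_LARGE ~{approx_params(24, 1024) / 1e6:.0f}M")  # close to the quoted 340M
```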
difference between BERT and DistilBERT - answer DistilBERT is a distilled version of
BERT, with fewer layers and parameters. It aims to retain 97% of BERT's performance
while reducing its size by 40%. Both use a vocabulary of 30K subwords.
Bidirectional encoders - answer have access to tokens both before and after the current
one, allowing for greater context understanding
masked language model (MLM) - answer Instead of next-word prediction, BERT is
trained with a cloze task: a word is masked and the model predicts which word fits best.
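The corruption step of MLM training can be sketched as follows. The rates (15% of positions as targets; of those, 80% replaced with [MASK], 10% with a random token, 10% left unchanged) follow the BERT paper; `mask_for_mlm` is a hypothetical helper name:

```python
import random

random.seed(0)

def mask_for_mlm(tokens, vocab, mask_rate=0.15):
    # Pick ~15% of positions as prediction targets; of those, replace
    # 80% with [MASK], 10% with a random token, and leave 10% unchanged.
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok            # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token (it is still a prediction target)
    return corrupted, targets

toks = "the cat sat on the mat and looked at the dog".split()
print(mask_for_mlm(toks, vocab=toks))
```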
What is the purpose of the next sentence prediction (NSP) task in training BERT? -
answer The NSP task trains BERT to determine whether a given sentence naturally
follows another sentence.
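Constructing NSP training pairs can be sketched like this. The 50/50 split between the true successor (IsNext) and a random sentence (NotNext) follows the BERT paper; `make_nsp_pairs` and the toy corpus are my own illustration:

```python
import random

random.seed(1)

def make_nsp_pairs(sentences):
    # 50% of the time pair a sentence with its true successor (IsNext),
    # otherwise with a random sentence (NotNext). A real pipeline would
    # exclude the true successor when sampling the NotNext case.
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            pairs.append((sentences[i], random.choice(sentences), "NotNext"))
    return pairs

corpus = [
    "He went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live mainly in the Southern Hemisphere.",
]
for a, b, label in make_nsp_pairs(corpus):
    print(f"[CLS] {a} [SEP] {b} [SEP] -> {label}")
```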
Explain how contextual embeddings are extracted from BERT. - answer We feed the
input tokens into a bidirectional self-attention model such as BERT; the output layer
provides a contextual vector for each token.
pretraining - answer spend a long time training the LM on massive corpora; roughly 40
epochs (passes over the full training set) at a minimum
domain-adaptive pretraining - answer after pretraining, continue pretraining on a
domain-specific corpus
keep pretraining - answer many studies show that continuing pretraining beyond
standard limits still yields downstream benefits