Natural Language Generation (NLG)
- NLG is a subcomponent of:
- Machine Translation
- Summarization
- Dialogue
- Freeform question answering (answers are generated, not just extracted from a given context)
- Image captioning
Recap
Language modeling: the task of predicting the next word: \[P(y_t|y_1, \ldots, y_{t-1})\]
Language model: a system that computes this probability distribution
RNN-LM: a language model based on an RNN
Conditional Language Modeling: \[P(y_t|y_1, \ldots, y_{t-1}, x)\]
- x is the conditioning input
- Examples:
- Machine Translation (x = source sentence, y = target sentence)
- Summarization (x = input text, y = summary)
- Dialogue (x = dialogue history, y = next utterance)
Training an RNN-LM: minimize \[J = \dfrac{1}{T}\sum\limits_{t=1}^T J_t\] where \(J_t\) is the cross-entropy loss on step \(t\)
- "Teacher forcing": on every step, feed the gold (reference) target word into the decoder, regardless of what the decoder predicted on the previous step (see the sketch below)
Decoding algorithms
- Greedy decoding: take the argmax word on each step
- Beam search: aims to find a high-probability sequence by tracking multiple hypotheses (see the sketch after the summary list below)
- on each step, keep the k most probable partial sequences (hypotheses)
- k is the beam size (e.g. 2)
- when some stopping criterion is reached, output the highest-scoring hypothesis
- what's the effect of changing k?
- k=1: greedy decoding
- larger k: considers more hypotheses, but is more computationally expensive
- for NMT, increasing k too much decreases BLEU, mainly because it produces overly short translations
- for open-ended tasks like chit-chat dialogue, a large k produces overly generic responses
- sampling-based decoding
- pure sampling: on each step, randomly sample from the distribution \(P_t\) instead of taking the argmax as in greedy decoding
- top-n sampling: truncate \(P_t\) to the n most probable words and sample from those; n is another hyperparameter (see the sketch below)
- increasing n: more diverse and risky output
- decreasing n: more generic and safe output
- Softmax temperature -- not actually a decoding algorithm, but a technique applied at test time in conjunction with a decoding algorithm
- apply a temperature hyperparameter \(\tau\) to the softmax over scores \(s_w\): \[P_t(w) = \dfrac{\exp(s_w/\tau)}{\sum_{w' \in V}\exp(s_{w'}/\tau)}\]
- larger \(\tau\): \(P_t\) becomes more uniform, so more diverse output (probability mass is spread across the vocabulary)
- smaller \(\tau\): \(P_t\) becomes more peaked, so less diverse output
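A minimal sketch (PyTorch; the function and argument names are illustrative) of top-n sampling combined with softmax temperature: scores are divided by \(\tau\), truncated to the n most probable words, renormalized, and sampled from.

```python
import torch

def sample_next(logits, n=10, temperature=1.0):
    """Top-n sampling with softmax temperature (sketch).

    logits:      (|V|,) unnormalized scores for the next word.
    n:           keep only the n most probable words (n = |V| -> pure sampling).
    temperature: tau > 1 flattens P_t (more diverse), tau < 1 sharpens it.
    """
    scaled = logits / temperature                      # apply temperature
    top_vals, top_idx = torch.topk(scaled, n)          # truncate to top-n
    probs = torch.softmax(top_vals, dim=-1)            # renormalize over top-n
    choice = torch.multinomial(probs, num_samples=1)   # sample one word
    return top_idx[choice].item()
```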
Decoding algorithms: summary
- Greedy
- Beam search
- Sampling methods
- Softmax temperature
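A simplified beam search sketch, assuming a hypothetical `step_fn(token_id, state)` decoder interface that returns log-probabilities over the vocabulary plus a new decoder state. It keeps the k highest-scoring partial hypotheses per step, moves hypotheses that emit the end token to a finished pool, and returns the best length-normalized one; real NMT decoders batch the hypotheses and use more careful stopping criteria.

```python
import torch

def beam_search(step_fn, h0, bos_id, eos_id, k=2, max_len=50):
    """Simplified beam search (sketch). Scores are summed log-probabilities."""
    # Each hypothesis: (score, token list, decoder state)
    beams = [(0.0, [bos_id], h0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beams:
            log_probs, new_state = step_fn(tokens[-1], state)
            top_lp, top_ids = torch.topk(log_probs, k)   # expand each hypothesis
            for lp, wid in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((score + lp, tokens + [wid], new_state))
        # Keep only the k most probable partial sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:k]:
            if cand[1][-1] == eos_id:
                finished.append(cand)    # stopping criterion: hypothesis ended
            else:
                beams.append(cand)
        if not beams:
            break
    finished = finished or beams
    # Return the best hypothesis, length-normalized to avoid favoring short outputs
    return max(finished, key=lambda c: c[0] / len(c[1]))[1]
```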
Section 2: NLG tasks and neural approaches
Summarization
- definition: given input text x, write a summary y that is shorter and contains the main information of x
- examples:
- Gigaword: first sentence(s) of a news article -> headline (i.e. sentence compression)
- LCSTS (Chinese microblogging): paragraph -> sentence summary
- ...
- Sentence simplification:
- a different but related task
- rewrite the source text in simpler (and often shorter) language
- examples:
- Simple Wikipedia
- Newsela: news articles rewritten for children
- summarization: two main strategies
- extractive summarization: select parts (e.g. sentences) of the original text to form the summary, like using a highlighter
- abstractive summarization: generate new text, like writing a summary with a pen
- summarization evaluation: ROUGE
- like BLEU, it is based on n-gram overlap
- but it has no brevity penalty
- ROUGE is based on recall, while BLEU is based on precision (see the recall vs. precision sketch below)
- BLEU is reported as a single number combining the precisions for n = 1, 2, 3, 4 n-grams
- ROUGE scores are usually reported separately: ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), ROUGE-L (longest common subsequence overlap)
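A toy sketch contrasting n-gram recall (ROUGE-style) with n-gram precision (BLEU-style). Real ROUGE and BLEU implementations do more (multiple n, ROUGE-L's LCS, BLEU's brevity penalty and geometric mean); this is only to illustrate the recall-vs-precision difference, and the function names are made up for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(reference, candidate, n=1):
    """ROUGE-n style: overlapping n-grams / n-grams in the REFERENCE."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    return sum((ref & cand).values()) / max(sum(ref.values()), 1)

def ngram_precision(reference, candidate, n=1):
    """BLEU-n style: overlapping n-grams / n-grams in the CANDIDATE."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    return sum((ref & cand).values()) / max(sum(cand.values()), 1)

ref = "the cat sat on the mat".split()
cand = "the cat".split()
print(ngram_precision(ref, cand))  # 1.0: every candidate unigram is in the reference
print(ngram_recall(ref, cand))     # ~0.33: the short candidate misses most reference unigrams
```

This is why a short output can score well on precision-based BLEU (hence its brevity penalty) but poorly on recall-based ROUGE.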
- Neural summarization:
- seq2seq + attention, as in standard NMT
- Reinforcement learning
- neural approach: copy mechanisms
- on each step, mix a probability of generating a word from the vocabulary with a probability of copying a word from the input (see the sketch after this block)
- \(p_{gen}\): should the copy/generate decision be hard (0/1) or soft? (in pointer-generator networks it is a soft gate in [0, 1])
- Problem:
- they copy too much: a system that should be abstractive collapses into a mostly extractive one
- they are bad at overall content selection, especially if the input document is long
- there is no overall strategy for selecting content
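A minimal sketch of the copy-mechanism output distribution in the pointer-generator style: the generation distribution and the attention-based copy distribution are mixed by a soft gate \(p_{gen}\). Tensor names here are illustrative, and real pointer-generator networks additionally extend the vocabulary so that out-of-vocabulary source words can be copied.

```python
import torch

def copy_mixture(p_vocab, attention, src_ids, p_gen, vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on w (sketch).

    p_vocab:   (|V|,) generation distribution from the decoder softmax.
    attention: (S,)   attention weights over the S source tokens.
    src_ids:   (S,)   vocabulary ids of the source tokens (LongTensor).
    p_gen:     scalar gate in [0, 1], predicted (softly) at every decoder step.
    """
    p_copy = torch.zeros(vocab_size)
    # Scatter-add attention mass onto the vocab ids it points at
    p_copy.scatter_add_(0, src_ids, attention)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```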
- better content selection
- two stages: content selection & surface realization
- in standard seq2seq + attention, the two stages are mixed; attention provides only word-level content selection
- but no global content selection strategy
- One solution: bottom-up summarization
- Bottom-up summarization
- content selection stage: a neural sequence-tagging model labels each source word as include / don't include
- bottom-up attention stage: standard seq2seq + attention, but attention is masked so the decoder cannot attend to words that were not selected (see the sketch below)
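A minimal sketch of the bottom-up masking idea, assuming the content selector outputs a 0/1 keep mask per source token (and keeps at least one token): attention scores for unselected tokens are set to \(-\infty\) before the softmax, so the abstractive decoder can only attend to selected content.

```python
import torch

def masked_attention(scores, keep_mask):
    """Restrict attention to tokens the content selector marked as 'keep' (sketch).

    scores:    (S,) raw attention scores from the seq2seq + attention decoder.
    keep_mask: (S,) 0/1 output of the sequence-tagging content selector.
    """
    scores = scores.masked_fill(keep_mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1)   # renormalize over selected tokens only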
- Neural summarization via RL
- main idea: use RL to directly optimize ROUGE-L (which, unlike the cross-entropy training loss, is not differentiable)
- better in practice (for both ROUGE and human judgement): combine the ML (cross-entropy) and RL objectives (see the sketch below)
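A rough sketch of the RL idea, with illustrative names: sample a summary, score it with ROUGE-L as the reward, and apply a REINFORCE-style loss against a baseline (e.g. the reward of the greedy decode, as in self-critical training); the hybrid objective then mixes this with the usual teacher-forced cross-entropy loss. The weighting `gamma` is just a placeholder, not a recommended value.

```python
import torch

def rl_loss(sample_log_probs, sampled_reward, baseline_reward):
    """REINFORCE-style loss for directly optimizing ROUGE-L (sketch).

    sample_log_probs: (T,) log-probabilities of the sampled summary's tokens.
    sampled_reward:   ROUGE-L of the sampled summary vs. the reference.
    baseline_reward:  e.g. ROUGE-L of the greedy summary (self-critical baseline).
    Minimizing this pushes probability mass toward samples whose reward
    beats the baseline.
    """
    advantage = sampled_reward - baseline_reward
    return -advantage * sample_log_probs.sum()

def mixed_loss(ml_loss, rl_term, gamma=0.95):
    """Hybrid ML + RL objective: weighted mix of cross-entropy and the RL term."""
    return gamma * rl_term + (1.0 - gamma) * ml_loss
```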