Decoding

The task of choosing a word to generate on the basis of the probabilities that the model assigns to the possible words is called decoding. Decoding from a language model by repeatedly choosing the next word conditioned on the previous choices is called autoregressive generation or causal LM generation. Some of the common methods for decoding are described below:

  1. Greedy Decoding
  2. Beam Search
  3. Random Sampling
  4. Top-k Sampling
  5. Nucleus or Top-p Sampling
  6. Temperature Sampling
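All of these strategies plug into the same autoregressive loop: compute the next-word distribution from the context, pick one word by some rule, append it, and repeat. Below is a minimal sketch of that loop; `model.next_word_probs` and `choose` are hypothetical placeholders for the model's next-word distribution and for whichever decoding rule is in use.

```python
def generate(model, vocab, prompt, choose, eos_token="</s>", max_len=50):
    """Autoregressive generation: repeatedly choose the next word
    conditioned on the previous choices until EOS or max_len is reached.

    `model.next_word_probs` and `choose` are illustrative placeholders:
    the first returns P(w | w_<t) over the vocabulary, the second
    implements a decoding rule (greedy, top-k, top-p, ...).
    `prompt` is a list of words."""
    words = list(prompt)
    for _ in range(max_len):
        probs = model.next_word_probs(words)  # distribution over the vocabulary
        next_word = choose(probs, vocab)      # pick one word by the chosen rule
        if next_word == eos_token:
            break
        words.append(next_word)
    return words
```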

Greedy Decoding

In greedy decoding, the word to which the model assigns the highest probability is chosen as the next word. It is a greedy algorithm because it makes the locally optimal choice at each step, without considering whether that choice will turn out to have been the best one overall.

\begin{align} \hat{w_t} = \text{argmax}_{w \in V} P(w | \textbf{w}_{<t}) \end{align}

In practice, greedy decoding is not used with large language models because the words it chooses are extremely predictable, which makes the resulting text generic and often quite repetitive.
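As a sketch, greedy decoding is just an argmax over the next-word distribution. The `probs`/`vocab` names are illustrative: a probability array aligned with a vocabulary list.

```python
import numpy as np

def greedy_choose(probs, vocab):
    """Greedy decoding: return the word with the largest probability."""
    return vocab[int(np.argmax(probs))]
```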

Random Sampling

In random sampling, the next word $w$ is chosen by sampling from the distribution $p(w)$. That is, to generate a sequence of words $W = w_1, w_2, \ldots, w_N$, we repeatedly perform $w \sim p(w)$ until the end-of-sequence token is reached. Random sampling also doesn't work well: even though it mostly generates sensible, high-probability words, there are many odd, low-probability words in the tail of the distribution. Although each of these is individually unlikely, together they make up a large enough portion of the probability mass that they get chosen often enough to produce weird sentences.
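A minimal sketch of random sampling under the same illustrative assumptions (a `probs` array aligned with a `vocab` list):

```python
import numpy as np

def random_sample_choose(probs, vocab, rng=None):
    """Random (ancestral) sampling: draw the next word in proportion
    to its probability under the full distribution p(w)."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(vocab), p=probs)
    return vocab[idx]
```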

Top-k Sampling

Top-$k$ sampling is a generalization of greedy decoding. It first selects the $k$ words with the highest probabilities, renormalizes their distribution, and performs random sampling on them. More formally (a code sketch follows the list):

  1. Choose a number of words $k$.
  2. For each word in the vocabulary $V$, use the language model to compute the likelihood of this word given the context, $p(w_t | \textbf{w}_{<t})$.
  3. Sort the words by their likelihood, and select the top-$k$ most probable words.
  4. Renormalize the scores of the selected $k$ words to be a legitimate probability distribution.
  5. Randomly sample a word from the renormalized distribution.
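A minimal sketch of these steps, assuming as before that `probs` is the full next-word distribution aligned with `vocab`:

```python
import numpy as np

def top_k_choose(probs, vocab, k=50, rng=None):
    """Top-k sampling: keep the k most probable words, renormalize,
    and sample from the truncated distribution."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs)
    top_idx = np.argsort(probs)[-k:]          # indices of the k most probable words
    top_probs = probs[top_idx]
    top_probs = top_probs / top_probs.sum()   # renormalize to a proper distribution
    return vocab[rng.choice(top_idx, p=top_probs)]
```

With $k = 1$ this reduces to greedy decoding; larger values of $k$ admit more randomness.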

Nucleus or Top-p Sampling

A problem with top-$k$ sampling is that $k$ is fixed, but the shape of the probability distribution over words differs in different contexts. An alternative that solves this problem is top-$p$ sampling or nucleus sampling, which selects the top $p$ percent of the probability mass instead of the top $k$ words. The hope is that this measure will be more robust across very different contexts, dynamically increasing and decreasing the pool of word candidates.

Given a distribution $P(w_t | \textbf{w}_{<t})$, the words are sorted by their likelihoods, and then the top-$p$ vocabulary $V^{(p)}$ is the smallest set of words such that

\begin{align} \sum_{w \in V^{(p)}} P(w | \textbf{w}_{<t}) \ge p \end{align}
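A corresponding sketch of nucleus sampling, under the same illustrative assumptions:

```python
import numpy as np

def top_p_choose(probs, vocab, p=0.9, rng=None):
    """Nucleus (top-p) sampling: keep the smallest set of words whose
    cumulative probability is at least p, renormalize, and sample."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs)
    order = np.argsort(probs)[::-1]                    # most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return vocab[rng.choice(nucleus, p=nucleus_probs)]
```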

Temperature Sampling

In temperature sampling, we reshape the distribution by simply dividing the logits by a temperature parameter $\tau$ before we normalize them by passing them through a softmax. Thus, the probability vector $\textbf{y}$ is calculated as

\begin{align} \textbf{y} = \text{softmax}(\textbf{u} / \tau) \end{align}
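A minimal sketch of temperature sampling, this time starting from the unnormalized logit vector $\textbf{u}$ rather than the probabilities (`logits` and `vocab` are illustrative names):

```python
import numpy as np

def temperature_choose(logits, vocab, tau=0.7, rng=None):
    """Temperature sampling: divide the logits by tau, apply the softmax,
    and sample from the reshaped distribution y."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits) / tau
    scaled -= scaled.max()                     # stabilize the softmax numerically
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return vocab[rng.choice(len(vocab), p=probs)]
```

With $\tau = 1$ the distribution is unchanged, while $\tau < 1$ sharpens it toward the most probable words, approaching greedy decoding as $\tau \to 0$, and $\tau > 1$ flattens it, admitting more of the tail.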