Isaac's Beam Search learning notes
Contents
LinksEdit
MotivationEdit
Beam Search is used in the context of AI as a final decision making layer for many NLP and speech recognition models.
Example: Sequence to Sequence (Seq2Seq) Modeling
Task: Translate a sentence from Arabic to English.
Overview:
- 1. Tokenization/Normalization: Arabic sentence is split into tokens designed for the LLM and normalized
- 2. Encoding: LLM encodes the tokens into numerical tokens and generates a sequence of hidden states to represent the input sentence
- 3. Initialize Beam Search: Determine the parameters of Beam Search and the decoder's initial states
- 4. Decoding: Begin with start-of-sequence token. Model generates probabilities for the next token in each sequence and passes them through an output layer, including a softmax function that normalizes probabilities. Beam Search chooses which paths to continue following and prunes the rest.
- 5. Finalize Output: Select Beam with highest probability as final translation, convert tokenized output back into a readable sentence with punctuation, capitalization, etc.
Conditional Probability Notes:
- Sequence to Sequence search algorithms are based on Conditional Probability -> describe the likelihood of an event happening given that another event or sequence of events has already occurred
- i.e the probability of a new token given the existing sequence is NOT independent and can be calculated where is the probability of occurring given that have already occurred
- As a result, the graph we run the algorithm on is NOT a Markov Chain Graph, so states are dependent.
Naive Approach #1: GreedyEdit
- Greedy Search takes the best solution at each state in the graph, regardless of previous leaves or future leaves in the sequence. In the context of sequence to sequence, greedy search takes the highest probability word at each position in the sequence and takes it as part of the output.
Let's take as possible words for our sequence with initial probabilities to be
The greedy search will take B, the word with highest probability, to be the first word in our sequence. It will continue as such until the end of the sentence is reached.
This greedy strategy may be optimal at the current spot of the sequence, but as one might predict, the algorithm struggles with larger outputs where such a path is not the optimal.
If we have at most words in our final sentence and each state has word options, this algorithm runs with a time complexity of
Naive Approach #2: Breadth First SearchEdit
The BFS approach considers every possible sequence of words and outputs the highest probability sequence among all.
For example, let's take as possible words for our sequence. (Assuming the length of the sequence is fixed at 3) BFS will find the maximum of and output that respective sequence
BFS is guaranteed to find the optimal sequence, but its runtime is far too large to be feasible for use.
If we have at most words in our final sentence and each state has word options, this algorithm runs with a time complexity of
Beam SearchEdit
Beam Search is a heuristic algorithm which combines the ideas of the Greedy and BFS algorithms.
- Maintains a 'beam width' amount of the most promising sequences
- At each iteration add all possible continuations of candidate sequences, and takes the 'beam width' best from those to repeat the algorithm
- After the algorithm is finished we take the best option from the candidate sequences as our output
If we have at most words in our final sentence, each state has word options, and we maintain a maximum bin width size of , this algorithm runs with a time complexity of
Top-K Sampling for Beam Search:
- -> maximum width of beams during search
- all other probabilities are set to 0
- allows for a more consistent runtime
- lower value => faster runtime, less diversity, less optimal results
Top-P Sampling for Beam Search (Nucleus Sampling):
- -> maximum sum of probabilities among all sequences
- usually dynamically determined based on the cumulative probabilities of the tokens in the probability distribution -> ensures a certain proportion of the probability mass is considered
- top-p sampling allows for more diversity in generated sequences
- lower value => faster runtime, less diversity, less optimal results
Other NotesEdit
Softmax Function:
- ensures that all probabilities at each state are in the range (0,1) and sum to 1
Log semiring trick
- Using log semiring to deal with very small probability values trick: Given as the cumulative probability score of all words in the sequence so far,
- Since the value of can become very small, we can run into computational rounding errors. One strategy around this is to take the natural log of the summation of all p values and use this to compare values of . This works because
- if , so values can still be compared in this form