Isaac's Algorithm Notes

Isaac's Beam Search learning notes

1. Tokenization/Normalization: Arabic sentence is split into tokens designed for the LLM and normalized
2. Encoding: LLM encodes the tokens into numerical tokens and generates a sequence of hidden states to represent the input sentence
3. Initialize Beam Search: Determine the parameters of Beam Search and the decoder's initial states
4. Decoding: Begin with start-of-sequence token. Model generates probabilities for the next token in each sequence and passes them through an output layer, including a softmax function that normalizes probabilities. Beam Search chooses which paths to continue following and prunes the rest.
5. Finalize Output: Select Beam with highest probability as final translation, convert tokenized output back into a readable sentence with punctuation, capitalization, etc.

Now, to break down the steps of Beam Search

Conditional Probability Notes:

Sequence to Sequence search algorithms are based on Conditional Probability -> describe the likelihood of an event happening given that another event or sequence of events has already occurred
i.e the probability of a new token given the existing sequence is NOT independent and can be calculated $Prob(ABC)=Prob(AB)*Prob(C|AB)$ where $Prob(C|AB)$ is the probability of $C$ occurring given that $AB$ have already occurred
As a result, the graph we run the algorithm on is NOT a Markov Chain Graph, so states are dependent.

Naive Approach #1: Greedy

Greedy Search takes the best solution at each state in the graph, regardless of previous leaves or future leaves in the sequence. In the context of sequence to sequence, greedy search takes the highest probability word at each position in the sequence and takes it as part of the output.

Let's take $A,B,C$ as possible words for our sequence with initial probabilities to be $A_{0}=0.2,B_{0}=0.5,C_{0}=0.3$

The greedy search will take B, the word with highest probability, to be the first word in our sequence. It will continue as such until the end of the sentence is reached.

This greedy strategy may be optimal at the current spot of the sequence, but as one might predict, the algorithm struggles with larger outputs where such a path is not the optimal.

If we have at most $m$ words in our final sentence and each state has $n$ word options, this algorithm runs with a time complexity of $O(nm)$

Naive Approach #2: Breadth First Search

The BFS approach considers every possible sequence of words and outputs the highest probability sequence among all.

For example, let's take $A,B,C$ as possible words for our sequence. (Assuming the length of the sequence is fixed at 3) BFS will find the maximum of $Prob(AAA),Prob(AAB),...Prob(CCC)$ and output that respective sequence

BFS is guaranteed to find the optimal sequence, but its runtime is far too large to be feasible for use.

If we have at most $m$ words in our final sentence and each state has $n$ word options, this algorithm runs with a time complexity of $O(n^{m})$

Beam Search

Beam Search is a heuristic algorithm which combines the ideas of the Greedy and BFS algorithms.

Maintains a 'beam width' amount of the most promising sequences
At each iteration add all possible continuations of candidate sequences, and takes the 'beam width' best from those to repeat the algorithm
After the algorithm is finished we take the best option from the candidate sequences as our output

If we have at most $m$ words in our final sentence, each state has $n$ word options, and we maintain a maximum bin width size of $k$ , this algorithm runs with a time complexity of $O(nmklog(k)$

Top-K Sampling for Beam Search:

$k$ -> maximum width of beams during search
all other probabilities are set to 0
allows for a more consistent runtime
lower value => faster runtime, less diversity, less optimal results

Top-P Sampling for Beam Search (Nucleus Sampling):

$p$ -> maximum sum of probabilities among all sequences
usually dynamically determined based on the cumulative probabilities of the tokens in the probability distribution -> ensures a certain proportion of the probability mass is considered
top-p sampling allows for more diversity in generated sequences
lower value => faster runtime, less diversity, less optimal results

Other Notes =

Softmax Function:

ensures that all probabilities at each state are in the range (0,1) and sum to 1

Using semi-log plot to deal with very small probability values trick: Given $P$ as the cumulative probability score of all words in the sequence so far, $P=p_{0}*p_{1}*p_{2}...p_{n}$
Since the value of $P$ can become very small, we can run into computational rounding errors. One strategy around this is to take the natural log of the summation of all p values and use this to compare values of $P$ . This works because
$P=p_{0}*p_{1}*p_{2}...p_{n}$
$log(P)=log(p_{0})+log(p_{1})+log(p_{2})...+log(p_{n})$
if $(P_{1}>P_{2})=>log(P_{1})>log(P_{2})$ , so values can still be compared in this form

Isaac's Algorithm Notes

Contents

Links

Motivation

Naive Approach #1: Greedy

Naive Approach #2: Breadth First Search

Beam Search

Other Notes =

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools