Isaac's algorithm Beam Search learning notes

=== Links ===
* [https://www.width.ai/post/what-is-beam-search#:~:text=Beam%20search%20is%20an%20algorithm,probability%20or%20next%20output%20character. Width.ai Beam Search]
* [https://towardsdatascience.com/foundations-of-nlp-explained-visually-beam-search-how-it-works-1586b9849a24 towardsdatascience.com visuals]
* [https://en.wikipedia.org/wiki/Log_probability Log probability wiki]
* [https://en.wikipedia.org/wiki/Softmax_function Softmax function wiki]

[[Category:Beam Search]]

=== Motivation ===
Beam Search is used in AI as a final decision-making layer for many NLP and speech recognition models.

Example: a Sequence-to-Sequence (Seq2Seq) modeling task: translate a sentence from Arabic to English.

Overview:
* '''1. Tokenization/Normalization:''' The Arabic sentence is split into tokens designed for the LLM and normalized.
* '''2. Encoding:''' The LLM encodes the tokens into numerical representations and generates a sequence of hidden states representing the input sentence.
* '''3. Initialize Beam Search:''' Determine the parameters of Beam Search and the decoder's initial states.
* '''4. Decoding:''' Begin with the start-of-sequence token. The model generates probabilities for the next token in each sequence and passes them through an output layer, including a softmax function that normalizes the probabilities. Beam Search chooses which paths to continue following and prunes the rest.
* '''5. Finalize Output:''' Select the beam with the highest probability as the final translation, then convert the tokenized output back into a readable sentence with punctuation, capitalization, etc.

Now, to break down the steps of Beam Search.

Conditional probability notes:
* Sequence-to-sequence search algorithms are based on '''conditional probability''', which describes the likelihood of an event happening given that another event or sequence of events has already occurred.
* i.e. the probability of a new token given the existing sequence is NOT independent and can be calculated: <math>Prob(ABC) = Prob(AB) * Prob(C|AB)</math>, where <math>Prob(C|AB)</math> is the probability of <math>C</math> occurring given that <math>AB</math> has already occurred.
* As a result, the graph we run the algorithm on is NOT a Markov chain graph: each state depends on the sequence that precedes it.

=== Naive Approach #1: Greedy ===
Greedy search takes the best option at each state in the graph, regardless of previous or future choices in the sequence. In the context of sequence-to-sequence decoding, greedy search takes the highest-probability word at each position and adds it to the output.

Let's take <math>A, B, C</math> as the possible words for our sequence, with initial probabilities <math>A_0=0.2, B_0=0.5, C_0=0.3</math>. Greedy search will take B, the word with the highest probability, as the first word of the sequence, and it will continue in this way until the end of the sentence is reached.

This greedy choice may be optimal at the current position in the sequence, but as one might predict, the algorithm struggles with longer outputs where such a path is not globally optimal. If we have at most <math>m</math> words in the final sentence and each state has <math>n</math> word options, this algorithm runs with a time complexity of <math>O(nm)</math>.
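To make the greedy strategy concrete, here is a minimal Python sketch. The <code>next_token_probs</code> callback is a hypothetical stand-in for the model's output layer (it is not part of any particular library): given the sequence so far, it returns a softmax-normalized dictionary of next-token probabilities.

<syntaxhighlight lang="python">
# Minimal greedy-decoding sketch. `next_token_probs` is a hypothetical
# stand-in for the model's output layer: given the sequence so far, it
# returns a softmax-normalized {token: probability} dict for the next token.

def greedy_decode(next_token_probs, start_token="<s>", end_token="</s>", max_len=10):
    sequence = [start_token]
    for _ in range(max_len):
        probs = next_token_probs(sequence)   # Prob(token | sequence so far)
        best = max(probs, key=probs.get)     # keep only the single best token
        sequence.append(best)
        if best == end_token:
            break
    return sequence

# With the A/B/C example above, a distribution like {"A": 0.2, "B": 0.5, "C": 0.3}
# would make greedy_decode pick "B" at every step.
</syntaxhighlight>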
=== Naive Approach #2: Breadth First Search ===
The BFS approach considers every possible sequence of words and outputs the highest-probability sequence among them. For example, let's take <math>A, B, C</math> as the possible words for our sequence. Assuming the length of the sequence is fixed at 3, BFS will find the maximum of <math>Prob(AAA), Prob(AAB), \ldots, Prob(CCC)</math> and output that sequence.

BFS is guaranteed to find the optimal sequence, but its runtime is far too large to be feasible in practice. If we have at most <math>m</math> words in the final sentence and each state has <math>n</math> word options, this algorithm runs with a time complexity of <math>O(n^m)</math>.

=== Beam Search ===
Beam Search is a heuristic algorithm that combines the ideas of the greedy and BFS approaches.
* It maintains a 'beam width' number of the most promising sequences.
* At each iteration it adds all possible continuations of the candidate sequences, keeps the 'beam width' best of those, and repeats.
* After the algorithm is finished, the best option among the candidate sequences is taken as the output.

If we have at most <math>m</math> words in the final sentence, each state has <math>n</math> word options, and we maintain a maximum beam width of <math>k</math>, this algorithm runs with a time complexity of <math>O(nmk \log k)</math>. A code sketch of this loop follows the notes below.

[[File:Screenshot 2024-05-21 at 7.50.48 PM.png|thumb]]

'''Top-K Sampling for Beam Search''':
* <math>k</math> -> maximum width of the beam during search
* all other probabilities are set to 0
* allows for a more consistent runtime
* lower value => faster runtime, less diversity, less optimal results

'''Top-P Sampling for Beam Search (Nucleus Sampling)''':
* <math>p</math> -> maximum sum of probabilities among the candidate tokens
* the cutoff is usually determined dynamically from the cumulative probabilities of the tokens in the probability distribution -> ensures a certain proportion of the probability mass is considered
* top-p sampling allows for more diversity in generated sequences
* lower value => faster runtime, less diversity, less optimal results

=== Other Notes ===
* Softmax function: [[File:Screenshot 2024-05-21 at 6.56.45 PM.png|thumb]] ensures that all probabilities at each state are in the range (0,1) and sum to 1.
* Log-probability trick for dealing with very small probability values: given <math>P</math> as the cumulative probability score of all words in the sequence so far, <math>P=p_0 * p_1 * p_2 \cdots p_n</math>.
* Since the value of <math>P</math> can become very small, we can run into numerical rounding (underflow) errors. One strategy around this is to sum the natural logs of the individual probabilities and use that sum to compare values of <math>P</math>. This works because:
* <math>P=p_0 * p_1 * p_2 \cdots p_n</math>
* <math>\log(P)=\log(p_0)+\log(p_1)+\log(p_2)+\cdots+\log(p_n)</math>
* if <math>P_1>P_2</math> then <math>\log(P_1)>\log(P_2)</math>, so values can still be compared in this form.
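As a rough illustration of both the beam-search loop and the log-probability trick above, here is a minimal Python sketch. It assumes the same hypothetical <code>next_token_probs(sequence)</code> interface as the greedy sketch earlier (with all returned probabilities strictly greater than zero) and is not based on any specific library; <code>beam_width</code> plays the role of <math>k</math> in the notes.

<syntaxhighlight lang="python">
import math

# Minimal beam-search sketch scoring candidates by summed log probabilities.
# `next_token_probs` is the same hypothetical callback as in the greedy
# sketch: it returns a {token: probability} dict for the next position.

def beam_search(next_token_probs, beam_width=3, start_token="<s>",
                end_token="</s>", max_len=10):
    # Each candidate beam is (sequence, cumulative log probability).
    beams = [([start_token], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                # Finished sequences are carried over unchanged.
                candidates.append((seq, score))
                continue
            for token, p in next_token_probs(seq).items():
                # log(P * p) = log(P) + log(p): summing logs avoids underflow.
                candidates.append((seq + [token], score + math.log(p)))
        # Keep only the `beam_width` highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    # Return the best surviving sequence.
    return max(beams, key=lambda c: c[1])[0]
</syntaxhighlight>

Setting <code>beam_width=1</code> reduces this sketch to the greedy search above, while letting it grow toward the full vocabulary size approaches the exhaustive BFS behaviour.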