Text preprocessing To make text machine readable, it has to be converted into numbers. Normalization: the preprocessing of text data to make it usable for NLP. Tokenization: separates a sentence into word fragments. During tokenization, we can also remove unwanted tokens, such as punctuation, digits, symbols, stop words, etc. Stemming: chops or replaces word tails with the goal of approximating the word's original form (e.g. consumers ⇒ consumer). Lemmatization: uses dictionaries and full morphological analysis to correctly identify the base form of each word. To perform lemmatization appropriately, use POS: part-of-speech (POS) tagging is the process of assigning each word in a text corpus a specific part-of-speech tag based on its context and definition. The tags typically include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, interjections, and more. POS tagging can help other NLP tasks disambiguate a token somewhat due to the added context.
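As a concrete illustration, here is a minimal preprocessing sketch using NLTK; the example sentence and the coarse tag-to-WordNet mapping are illustrative choices, and the exact names of the NLTK resources to download may differ between NLTK versions.

import nltk
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (names may vary by NLTK version), e.g.:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

text = "The consumers are buying fewer products."
tokens = nltk.word_tokenize(text)                    # tokenization
tokens = [t.lower() for t in tokens if t.isalpha()]  # drop punctuation/digits
tagged = nltk.pos_tag(tokens)                        # POS tagging

def to_wordnet_pos(tag):
    # map Penn Treebank tags to WordNet POS so the lemmatizer picks the right base form
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[0], "n")

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w, to_wordnet_pos(t)) for w, t in tagged]
print(lemmas)  # e.g. ['the', 'consumer', 'be', 'buy', 'few', 'product']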
Bag of words and TF-IDF Then we need to transform the cleaned tokens into data points in some high-dimensional space. Bag of Words: a representation that turns arbitrary text into fixed-length vectors by counting how many times each word appears. The Bag of Words approach can be problematic since it weights all words equally, even after removing stop words; for example, "play" can appear many times in sports news. TF-IDF (term frequency-inverse document frequency): a weighted Bag of Words that transforms sentences into vectors. Term Frequency (TF) measures how frequently a term (word) appears in a document; there are different implementations, such as using a log function to scale it down. Inverse Document Frequency (IDF) weights each word by considering how frequently it appears across different documents. IDF is higher when the term appears in fewer documents.
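A minimal sketch of one common TF-IDF variant (raw counts for TF and log(N / document frequency) for IDF); the tiny corpus is made up, and libraries such as scikit-learn use slightly different smoothing.

import math
from collections import Counter

docs = [["play", "match", "goal"],
        ["play", "play", "news"],
        ["election", "news", "vote"]]

def tf_idf(docs):
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

for vec in tf_idf(docs):
    print(vec)  # rarer terms get larger weights; a term in every document would get log(1) = 0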
Topic modeling Topic modeling is a machine learning technique that automatically analyzes text data to determine cluster words for a set of documents. We can use topic modeling to encode a sentence/document into a distribution of topics. After transforming text into vectors, we can use these vectors for natural language processing tasks, such as sentence/document classification (or clustering). Latent Dirichlet Allocation (LDA) assumes that each document in a collection is a mixture of different topics, and each topic is a probability distribution over a set of words. The model then infers the underlying topic distribution for each document in the collection and the word distribution for each topic. Output of LDA: a set of topic vectors, where each vector is represented as a probability distribution over the words in the text corpus. LDA is trained using an iterative algorithm that maximizes the likelihood of observing the given documents.

Using deep learning to automate feature engineering One-hot encoding: we convert each categorical value into a new categorical column and assign a binary value of 1 or 0 to those columns. This approach is inefficient (in terms of computation) because it creates long vectors with many zeros, which uses a lot of computer memory. Another problem of one-hot encoding is that it does not encode similarity. The dot product of two vectors can be used to measure similarity; it considers both the angle and the vector lengths. Cosine similarity is a normalized dot product. Both cosine similarity and the dot product can be used to measure how close vectors are to each other.
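A small NumPy sketch contrasting the dot product with cosine similarity (the two example vectors are arbitrary):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

dot = a @ b                                          # depends on the angle AND the lengths
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # normalized: depends on the angle only
print(dot, cos)  # 28.0 and ~1.0 -> cosine similarity is 1 for parallel vectors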
Word embeddings and Word2Vec We can use word embeddings to efficiently represent text as vectors, in which similar words have a similar encoding in a high-dimensional space. Word embeddings also encode semantics, which means similar words are close to each other. Word embeddings can be trained by iteratively using one word in a sentence to predict nearby words as accurately as possible. Position (e.g., distance and direction) in the word embedding vector space can encode semantic relations, such as the relation between a country and its capital. One way to train word embeddings is to use the context (e.g., nearby words) to represent a word. Training word embeddings: we can represent words by their context (i.e., the nearby words within a fixed-size window). Word2Vec: a method to train word embeddings by context. The goal is to use the center word to predict nearby words as accurately as possible, based on probabilities. How is probability related to word vectors? We use the dot-product similarity of word vectors to calculate probabilities, with the help of the softmax function. Softmax is a commonly used function in deep learning that maps arbitrary real values to probabilities. For each word position 𝑡 = 1, …, 𝑇 with window size 𝑚, we adjust the word vectors (𝜃) to maximize the likelihood function based on the probabilities we calculated. Training procedure: we have a large text corpus (i.e., text body) with a long list of words, and every word is represented by a vector 𝑤. For each position 𝑡 in the text, determine the center word w_t and the context words w_o (i.e., the words that are nearby w_t). For each center word w_t, compute the probability P(w_o | w_t) using the dot-product similarity of the word vectors w_o and w_t, and keep adjusting the word vectors to maximize this probability. The primary goal of the dot product in word embeddings is to measure the similarity between two words.
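The likelihood mentioned here is the standard skip-gram objective, L(𝜃) = ∏ over t = 1…T and −m ≤ j ≤ m (j ≠ 0) of P(w_{t+j} | w_t; 𝜃), with P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c). A toy NumPy sketch of that probability computation; the vocabulary and the random vectors are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["paris", "france", "berlin", "germany", "banana"]
dim = 8
V = {w: rng.normal(size=dim) for w in vocab}  # center-word vectors v_c
U = {w: rng.normal(size=dim) for w in vocab}  # context-word vectors u_o

def p_context_given_center(context, center):
    # softmax over dot-product similarities, as in skip-gram Word2Vec
    scores = np.array([U[w] @ V[center] for w in vocab])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[vocab.index(context)]

print(p_context_given_center("france", "paris"))
# Training would adjust U and V (e.g. by gradient ascent) so that such probabilities
# become high for words that actually co-occur in the corpus.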
Sentence/document representations We can stack all the word vectors into a matrix, where each column is a dimension of the word vector and the number of rows is the sentence length. For a convolutional neural network, all inputs need to have the same size, but sentences can have different lengths. Solution: we can drop the parts that are too long and pad the parts that are too short with zeros. After we make sure that all input data have the same size, we can feed them into deep neural networks for different tasks.

RNN We can also use a recurrent neural network (RNN) to take inputs with various lengths. Recurrent connections are shown in the red cyclic edges (and unfolded into red arrows). A recurrent neural network can take inputs with various lengths (e.g., sentences). Typically, we feed features to a deep neural net, but we feed observations (one per time step) to a recurrent neural net. Notice that the input 𝑋 below is transposed. RNNs contain hidden layers that can capture information from previous time steps. We can combine RNNs into a sequence-to-sequence (Seq2Seq) model for sentence classification or sentiment analysis; in this case, the output sequence has only one label. Seq2Seq models are flexible in the input and output sizes. The rectangles in the graph below represent vectors: red rectangles are inputs and blue rectangles are outputs. We can generalize the Seq2Seq model further to the encoder-decoder structure, where the encoder produces an encoded representation of the entire input sequence. The problem with using only the final encoder output is that it is hard for the model to remember previous information. Instead, we can have the model consider all encoder outputs.
Attention mechanism Using the same weights for every input may be insufficient, as we may want the weights to change according to different inputs. Attention: a component of a neural network that assigns a level of importance, or "attention," to different parts of the input data. Attention helps the model learn information from the past and focus on a certain part of the source. In attention, the query matches all keys softly, each with a weight between 0 and 1; the keys' values are multiplied by the weights and summed. Order of the attention mechanism: 1. get the encoder output values (from the RNN); 2. transform the encoder outputs (dimension reduction); 3. compute attention scores (dot-product similarity); 4. compute the attention distribution using softmax; 5. compute the attention-weighted sum of the encoder outputs.
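A NumPy sketch of these five steps for a single decoder query over toy encoder outputs; the shapes and random values are made up, and the random projection stands in for a learned transformation.

import numpy as np

rng = np.random.default_rng(1)
enc_out = rng.normal(size=(6, 16))  # 1. encoder outputs: 6 time steps, dimension 16
W = rng.normal(size=(16, 8))        # 2. projection for dimension reduction
keys = enc_out @ W                  #    transformed encoder outputs
query = rng.normal(size=8)          #    decoder state in the same reduced space

scores = keys @ query               # 3. dot-product attention scores
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # 4. softmax -> attention distribution
context = weights @ enc_out         # 5. attention-weighted sum of encoder outputs
print(weights.round(2), context.shape)  # weights sum to 1; context has dimension 16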
Image/video classification Computers read only numbers. Typically, computers store images as pixels with RGB channels whose values range from 0 to 255. Images can be represented as a 3D tensor (width*height*channels); for RGB images we have 3 channels corresponding to the pixel intensity of red, green, and blue. Before deep learning, computer vision used hand-crafted features: researchers developed different image filters/kernels to extract features using convolution.

Convolution Discrete convolution: the sum of an element-wise multiplication. Image coordinates: the origin is at the top-left pixel. The notation 𝐹[𝑥, 𝑦] means the value at the center of the pixel of image 𝐹 at location [𝑥, 𝑦] in the 2D array.

Filters Box filter: creates a weighted average of all the values around each pixel, which produces a blurred image; the box filter is a kernel of all ones, which serves to calculate a local average of the image. Decrease size: a kernel with a 1 in the center and zeros around it produces the same, identical image (only smaller at the borders when no padding is used). Shifting: a kernel with a 1 on the middle-left pixel and the rest zeros will shift the image left by 1 pixel. Intensity: a kernel with a 2 in the middle and the rest all zeros will multiply the pixel values by 2. The filters can be combined. Sharpening: combining an intensity increase with the removal of the blurred signal from a box filter creates a sharpened image. Detecting edges: we can find edges in images by applying a Sobel filter to compute the derivative of the signal, weighting pixel values positively on one side and negatively on the other; if there is a sharp change, the output values will reflect it. Since edges in images often represent significant changes in brightness, the Sobel filter is really good at finding spots where there is a sudden change in intensity. After calculating these rates of change (or gradients) for every pixel in the image, the Sobel filter assigns higher values to pixels where the change is more significant. A Sobel filter can also tell you the orientation of the edge; it does this by calculating gradients separately in the horizontal and vertical directions.
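A small sketch of the discrete convolution and Sobel filtering described above: each output value is the sum of an element-wise product between the kernel and an image patch (the sliding is correlation-style, as used in CNNs). The toy image, a dark-to-bright vertical edge, is made up.

import numpy as np

def convolve2d(image, kernel):
    # slide the kernel over the image; each output value is the sum of an
    # element-wise multiplication (equivalently, a dot product of two tensors)
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 6x6 image: dark on the left, bright on the right (one vertical edge)
image = np.zeros((6, 6))
image[:, 3:] = 255.0

# Sobel kernel for the horizontal gradient (responds to vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

print(convolve2d(image, sobel_x))  # large values only where the brightness jumps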
We can use various hand-crafted convolutional kernels (i.e., filter banks) to extract different kinds of information from the image, for example using Gabor filters. Gabor filters: specialized tools for detecting specific patterns in images, especially textures, by convolving the image with wave-like filters tuned to different frequencies and orientations. They are highly effective for feature extraction and texture analysis in computer vision tasks. To construct features, we can perform convolution on image patches using a set of kernels (i.e., filter banks) and then aggregate the responses into a feature vector. There are many ways of constructing feature vectors, such as the Histogram of Gradients: take image patches, apply every filter to every patch to get gradients, and then aggregate these gradients into histograms. This is useful for identifying people or certain patterns.

Convolutional Neural Network Deep learning allows us to train a model end-to-end, which means the inputs are raw pixel values and the outputs are categories or heatmaps. The convolutional parts of the architecture are used to learn the kernels/filters that extract features. The last layer(s) of most CNNs are just linear classifiers, so the data should be linearly separable by the time it reaches the end of the network. Many learned CNN kernels in the first layer look similar to the hand-crafted kernels. The components of a CNN typically involve: Convolutional layers: filters that try to find patterns in an image. Activation functions: make decisions in a neuron; they decide whether a neuron should be activated or not based on the input it receives. Pooling layers (e.g., max pooling): pooling layers help reduce the spatial dimensions of the feature map produced by the convolutional layers; max pooling, for example, looks at small sections of the feature map and picks the maximum value, effectively downsizing the information while keeping the most important features. Fully connected layers: these layers take the high-level features from the previous layers and use them to classify the input into various categories; think of them as layers where every neuron is connected to every neuron in the previous and next layers, allowing for complex decision-making based on the extracted features. Normalization: normalization techniques are used to ensure that the inputs to the neural network are standardized, which helps achieve faster convergence during training; normalization also helps stabilize and speed up the learning process by keeping the inputs within a certain range or distribution, which prevents any particular input from dominating the learning process.
In more detail, a fully connected layer flattens a feature map (image) into a 1-dimensional vector, which is then passed to an activation function and on to the next layer. Convolutional layers perform convolution operations such as the box filter or edge detection. Each step of the convolution operation produces one number, which is the sum of an element-wise multiplication; it can also be seen as a dot product of two tensors. We then repeatedly slide the convolution kernel over the input feature map (or image), and the result is a new matrix of numbers. The input and output image sizes can differ because of how we slide the convolutional kernel. For each kernel, after convolution, we get a feature map (or activation map). Activation map: a visual representation of the output of a particular layer in a convolutional neural network (CNN). If we use another convolutional kernel and slide it over the input image again, we obtain another feature (activation) map. We can repeat this process many times and get a stack of feature (activation) maps; the depth of the output feature map depends on the number of filters/kernels. Depending on the size of a mini-batch, we obtain multiple batches of feature maps. Notice that we use a lot of data to train the bias vector and the kernels/filters. Batch size: the number of images that are fed into the CNN at once. The number of trainable parameters is determined by the number (and size) of the filters. Without convolution, a fully connected layer would need far more parameters, so convolutional layers can also be seen as a way to reduce the number of trainable parameters (compared to fully connected layers) by only looking at a local region. In fully connected layers, every neuron is connected to each neuron in the next layer; in convolutional layers, a neuron is only connected to a part of the previous layer. Convolution operations also consider stride and padding. Stride is the number of steps taken when moving the filter (moving one step at a time means stride = 1). Padding means adding zeros around the input feature map, which preserves the input spatial dimensions in the output activations.
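For input width W, kernel size K, padding P, and stride S, the standard output width is (W − K + 2P) / S + 1; a tiny sketch with illustrative numbers:

def conv_output_size(w, k, p, s):
    # output width of a convolution with input width w, kernel k, padding p, stride s
    return (w - k + 2 * p) // s + 1

print(conv_output_size(32, 3, 0, 1))  # 30: no padding shrinks the map
print(conv_output_size(32, 3, 1, 1))  # 32: padding of 1 preserves the size
print(conv_output_size(32, 3, 1, 2))  # 16: stride 2 roughly halves the size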
When you initialize all the weights in a neural network to zero, all neurons in each layer compute the same output, and thus, during training, each neuron learns the same features. This results in the network not being able to distinguish between different inputs, leading to poor performance. As a consequence, the loss and model performance metrics barely change, as the network fails to learn meaningful representations from the data.

Max pooling layer The max pooling layer is designed to have the neural network pay attention to the most important information by taking the maximum value in each pooling window (like sliding a convolution window, but keeping only the maximum instead of a weighted sum). The max pooling layer reduces the size of each feature (activation) map independently. Notice that there are no learnable/trainable parameters in the max pooling layers.
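A minimal NumPy sketch of 2×2 max pooling with stride 2 on a single feature map; the input values are arbitrary.

import numpy as np

def max_pool_2x2(fmap):
    # keep the maximum of each non-overlapping 2x2 window (stride 2)
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 6, 5, 2],
                 [1, 1, 2, 8]])
print(max_pool_2x2(fmap))  # [[4 2], [6 8]] -- no trainable parameters involved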
Activation function The activation function in a neural network is designed to introduce non-linearity; a classic example is the sigmoid activation function. Sigmoid: squashes numbers to the range [0, 1]. Problems: saturated neurons 'kill' the gradients, sigmoid outputs are not zero-centered, and exp() is a bit expensive to compute. Tanh: squashes numbers to the range [-1, 1]. Pro: zero-centered. Con: still kills gradients when saturated. ReLU: the saturating-gradient problem can be fixed by using the ReLU activation, but the gradient for 𝑥 < 0 is zero, which leads to the dying ReLU problem. Dying ReLU: the issue where ReLU neurons in a neural network may become inactive and permanently output zero due to large negative inputs, causing gradient updates to cease during training. Pros: does not saturate (in the positive region), is very computationally efficient, and converges much faster than sigmoid/tanh in practice. Con: the output is not zero-centered. Leaky ReLU: one way to mitigate the dying ReLU problem is to use a leaky ReLU instead, where the negative region still has a slight slope. Pros: does not saturate, is computationally efficient, converges much faster than sigmoid/tanh, and will not die. In practice, however, people often still use the plain ReLU.
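The definitions behind these descriptions, as a NumPy sketch (the leaky-ReLU slope of 0.01 is a common but arbitrary choice):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes to (0, 1); saturates for large |x|

def tanh(x):
    return np.tanh(x)                    # squashes to (-1, 1); zero-centered

def relu(x):
    return np.maximum(0.0, x)            # zero gradient for x < 0 (units can "die")

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x) # small negative slope keeps units alive

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(x).round(3))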
Normalization layer The normalization layer normalizes a certain region (e.g., the blue region below) of the feature maps to zero mean and unit variance. A typical example is Batch Normalization. Batch Normalization is usually inserted after convolutional (or fully connected) layers and before the activation function (the non-linearity). Its advantages: it makes deep networks much easier to train, allows higher learning rates and faster convergence, makes networks more robust to initialization, and acts as regularization during training.
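A sketch of the core batch-normalization computation for one channel at training time; the learnable scale (gamma) and shift (beta) are shown but kept fixed, and the small epsilon is only for numerical stability.

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize a mini-batch of activations to zero mean and unit variance,
    # then apply the learnable scale (gamma) and shift (beta)
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
    return gamma * x_hat + beta

x = np.array([2.0, 4.0, 6.0, 8.0])          # toy activations across a mini-batch
y = batch_norm(x)
print(y.mean().round(6), y.var().round(4))  # ~0 mean, ~1 variance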
Vanishing gradient Deep learning models can suffer from vanishing gradients, where the gradient becomes too small during backpropagation and thus the model weights are hard to update. For example, with the sigmoid as activation function, once a neuron is saturated (its output is close to 0 or 1) there is almost no slope left, so computing the gradient gives nearly nothing and the model barely gets updated. This can be fixed with the ReLU function, but deep models remain harder to train: in fact, plain deep models can underfit the data and perform worse in both training and testing than shallower models. Intuitively, a deep model should perform at least as well as a shallow model, because it could copy the layers from the shallower model and set the extra layers to the identity (i.e., in a residual block computing 𝑓(𝑥) + 𝑥, set 𝑓(𝑥) = 0). By stacking many residual blocks, we can build the residual network architecture (i.e., ResNet), which is a reasonable baseline for image classification.
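A minimal residual block sketch in PyTorch for the 𝑓(𝑥) + 𝑥 idea above; the channel count and kernel size are illustrative, and real ResNet blocks also add batch normalization and a projection shortcut when the shape changes.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # f(x) is the residual branch; if training drives f(x) to 0,
        # the block reduces to the identity mapping
        f_x = self.conv2(self.relu(self.conv1(x)))
        return self.relu(f_x + x)

x = torch.randn(1, 16, 32, 32)     # (batch, channels, height, width)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 32, 32])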
We often use pre-trained weights from a similar or other task as a starting point (rather than training from scratch). This idea is called Transfer Learning, where we reuse prior knowledge.

Video classification Instead of using 2D convolutional kernels, we can use 3D kernels to learn information from videos. Videos can be represented as a 4D tensor (channel*time*height*width). We can also combine a CNN and an RNN for video classification. Alternatively, we can consider appearance and motion separately: the two-stream network below applies a CNN to individual video frames, once for the original images and once for the optical flow. Optical flow is a computer vision technique that calculates and highlights local motions (of objects, surfaces, edges, etc.) in consecutive video frames.

Explainability Apart from performance, another feature to consider when developing an AI system is its explainability, i.e., the ability to show why the model makes decisions in the way it does. Having explanations for a model's decisions not only helps build trust in them but also increases productivity when debugging the model or tackling areas of improvement.
Multimodal Data Processing A modality is a way in which a natural phenomenon is perceived or expressed, e.g. language, vision, hearing. Multimodal means having multiple modalities. Different modalities can have different characteristics. Element representations: discrete, continuous, granularity. Element distributions: density, frequency (an image with 1 million pixels vs. a word with a couple of tokens). Structure: temporal, spatial, latent, explicit (structured pixels in an image vs. a network). Information: abstraction, entropy (different scales: an image gives more information than a pixel). Noise: uncertainty, noise, missing data (e.g. faulty sensors). Relevance: the type of task we use the modality for. Different modalities can share information with different levels of connection, and the shared information can be connected in different ways. Association: paired items, e.g. correlation, co-occurrence (such as image-text pairs collected by web scraping). Dependency: causal relationships, e.g. causal, temporal (factories producing CO2 and sensors picking that up). Multiple modalities can exist in different parts of the machine learning pipeline. Language: sentences, words, text. Vision: video, photos. Audio: audio data. In multimodal ML, we take all the modalities, apply an ML technique, and create a representation; this can also be a prediction, or another modality.
Modality transfers Image captioning takes images as input and outputs sentences that describe the input images (vision ⇒ language). We can also take text as input and generate images that match the input text (language ⇒ vision).

Combining modalities to create a prediction or a different modality Visual Question Answering takes both images and sentences as input and outputs a label, i.e. a text-based multiple-choice answer (vision + language ⇒ label), e.g. "How many horses are in this image?" (image, question, and answer). When the input has multiple modalities, we can fuse the modalities or explicitly learn their connections in the model architecture. Early fusion: concatenate the input modalities at an early stage of the network. Late fusion: run each modality through to the end, get two predictions, and combine them with a weighted average.
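A toy NumPy sketch contrasting early and late fusion for two modalities; the feature vectors and the random linear classifiers are made-up stand-ins for what a real network would learn.

import numpy as np

rng = np.random.default_rng(2)
img_feat = rng.normal(size=64)  # image features from some vision encoder
txt_feat = rng.normal(size=32)  # text features from some language encoder
n_classes = 5

# Early fusion: concatenate the modalities first, then classify jointly
W_early = rng.normal(size=(n_classes, 64 + 32))
early_logits = W_early @ np.concatenate([img_feat, txt_feat])

# Late fusion: classify each modality separately, then take a weighted average
W_img = rng.normal(size=(n_classes, 64))
W_txt = rng.normal(size=(n_classes, 32))
late_logits = 0.5 * (W_img @ img_feat) + 0.5 * (W_txt @ txt_feat)

print(early_logits.round(2))
print(late_logits.round(2))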
Video classification can use both video and audio signals to predict output categories (vision + audio ⇒ labels): transform the inputs into numbers, put all of them into a transformer, and output labels.

Self-attention Transformers use self-attention, which is a way of encoding sequences that tells how much attention each input should pay to the other inputs (including itself). Convolution layers use fixed weights (kernels) to filter information; self-attention layers dynamically compute attention filters that show how well a pixel matches its neighbors. Self-attention allows each word to consider the importance of other words in the sentence when trying to understand its context. This process helps transformers capture complex relationships and dependencies in the data, making them powerful models for tasks like language translation, text generation, and more.

Multi-head attention Transformers use multi-head attention to look at different aspects of the inputs. Each head encodes or pays attention to some information from some perspective, e.g. head 1 attends to entities while head 2 attends to syntactically relevant words. Transformers are built from two self-attention blocks (one for the encoder, one for the decoder) and an encoder-decoder attention block (similar to the original attention). Transformers can work for multiple vision-language tasks (vision + language ⇒ language).
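A NumPy sketch of the single-head scaled dot-product self-attention described above, over a toy sequence; the random projection matrices stand in for learned weights, and multi-head attention would run several of these in parallel and concatenate the results.

import numpy as np

rng = np.random.default_rng(3)
seq_len, d_model, d_k = 4, 16, 8
X = rng.normal(size=(seq_len, d_model))  # one embedding per token

W_q = rng.normal(size=(d_model, d_k))    # learned projections (random stand-ins)
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)          # how well each token matches every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax per row (the attention filter)
output = weights @ V                     # each token becomes a weighted mix of all tokens

print(weights.round(2))                  # each row sums to 1 and includes the token itself
print(output.shape)                      # (4, 8)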
Representation learning Instead of completing the task directly, we can also think about how to learn a good representation (i.e., embedding) so that a linear classifier can separate the data easily. CLIP: a method that generates high-dimensional vectors representing both images and the corresponding text, allowing a model to understand and compare the relationships between visual and textual content. The CLIP model learns a joint text-image representation using a large number of image-text pairs (vision + language ⇒ representation). Zero-shot learning: take the embedding space and use it for any downstream task, such as classification. Zero-shot prediction is a machine learning technique where a model makes predictions for classes it has not been explicitly trained on, by leveraging its understanding of the relationships between different classes. We can use the learned CLIP embedding to perform zero-shot prediction by taking the label with the largest similarity score between the label text and the image. Contrastive learning brings positive pairs (data that match well) closer and pushes negative pairs (data that do not match) far apart, e.g. words/sentences from the same article vs. from two different articles. Foundation models: use one model for multiple types of tasks (e.g. vision-based tasks, NLP, and combinations of them for multimodal tasks). They work for both unimodal (e.g., image/text classification) and multimodal tasks (e.g., visual question answering).
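A sketch of the zero-shot prediction step described above, assuming the image and label-text embeddings have already been produced by some joint encoder (the random vectors are placeholders for those encoder outputs; no real CLIP model is loaded here).

import numpy as np

rng = np.random.default_rng(4)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a horse"]

# Placeholder embeddings standing in for encoder outputs (e.g., from a CLIP-style model)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(len(labels), 512))

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Cosine similarity between the image and every label prompt,
# then predict the label with the largest similarity score
sims = normalize(text_embs) @ normalize(image_emb)
print(labels[int(np.argmax(sims))], sims.round(3))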