# Machine Learning Notes

These are my notes taken while watching Andrej Karpathy’s tutorials on YouTube. I also have more theoretical (overview) notes about NLP and ML.

The tutorials are these:

## Terminology

• finding local (global) maxima/minima of a function
• backpropagation
  • computing the gradients for all elements of the model’s functions (see the sketch below)
  • the gradient says how much each element contributes to the final value (the objective function) of the model
  • it is the slope of the function, i.e. the partial derivative
  • all functions and their elements must be differentiable
• forward pass
  • applying the model to the data with the current parameter values
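
A minimal autograd sketch of both passes (the tiny tanh model and its values are made up for illustration): the forward pass evaluates the model with the current parameters, and `backward()` backpropagates to fill in the gradients:

```python
import torch

x = torch.tensor(0.5)                       # input (data)
w = torch.tensor(-1.2, requires_grad=True)  # weight (parameter)
b = torch.tensor(0.3, requires_grad=True)   # bias (parameter)

y = torch.tanh(w * x + b)  # forward pass with the current parameter values
y.backward()               # backpropagation: fills in w.grad and b.grad

print(w.grad, b.grad)      # the slope of y with respect to each parameter
```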
• weights, bias, non-linearity
  • a neuron model: N inputs, N weights for them, plus a bias; a non-linear function on top (see the sketch below)
• non-linearity
  • e.g. sigmoid, tanh, softmax
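
A sketch of a layer of such neurons (all shapes and values are arbitrary): each neuron computes a weighted sum of its inputs plus a bias, and the non-linearity is applied to the result:

```python
import torch

x = torch.randn(4, 10)  # a batch of 4 examples with 10 inputs each
W = torch.randn(10, 5)  # 10 weights per neuron, 5 neurons (one layer)
b = torch.zeros(5)      # one bias per neuron

pre = x @ W + b         # weighted sum of the inputs plus the bias
act = torch.tanh(pre)   # the non-linearity squashes values into (-1, 1)
```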
• initialization
  • it’s advisable not to initialize weights to zero, even though the initial loss is close to optimal in that case
  • biases, however, can be zero
  • some neurons might be initialized as dead neurons incapable of learning (tanh in its flat regions, ReLU for negative inputs, …)
  • Kaiming initialization: `torch.nn.init.kaiming_normal_()` (see the sketch below)
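
A sketch of Kaiming initialization (shapes are arbitrary; note that torch’s `nn.Linear` convention stores weights as `(fan_out, fan_in)`):

```python
import torch

W = torch.empty(5, 10)  # nn.Linear convention: (fan_out, fan_in)
torch.nn.init.kaiming_normal_(W, nonlinearity='tanh')
b = torch.zeros(5)      # biases can start at zero

# roughly the same thing by hand: std = gain / sqrt(fan_in), gain(tanh) = 5/3
W2 = torch.randn(5, 10) * (5 / 3) / 10 ** 0.5
```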
• batch normalization
  • a normalization layer
  • normalizes values to be more Gaussian, since too low/high values fed into tanh may produce dead neurons
  • introduces noise into batches (batches are random, and every example is somehow affected by the other examples in the same batch)
  • it’s implemented in Torch (see the sketch below)
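
A sketch of the Torch layer (feature and batch sizes are made up); in eval mode it switches from batch statistics to running statistics collected during training:

```python
import torch

bn = torch.nn.BatchNorm1d(5)    # one learnable gain and bias per feature
x = torch.randn(32, 5) * 4 + 2  # toy pre-activations at the "wrong" scale
y = bn(x)                       # normalized per feature across the batch
print(y.mean(0), y.std(0))      # roughly zero mean and unit std
```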
• regularization
  • adds a penalty that grows as model complexity increases
  • L1 and L2
• loss function
  • a function which takes the model’s predictions and the desired outputs and computes a single number
  • the lower, the better
• (py)torch
  • L2 regularization is available as the `weight_decay` parameter of the optimizers (see the sketch below)
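
A sketch of both routes (the penalty strength and shapes are made up): an explicit L2 term added to the loss, or the optimizer applying it as weight decay:

```python
import torch

W = torch.randn(10, 5, requires_grad=True)

# an explicit L2 penalty added to a stand-in data loss
data_loss = torch.tensor(1.0)             # placeholder for e.g. cross entropy
loss = data_loss + 0.01 * (W ** 2).sum()  # the penalty grows with the weights

# alternatively, the optimizer applies L2 as weight decay during the update
opt = torch.optim.SGD([W], lr=0.1, weight_decay=1e-4)
```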
• learning rate
  • how much to change the weights on each pass
  • too large: the learning is unstable
  • too small: the learning takes ages
  • learning rate decay:
    • the learning rate can be dynamic (with regard to the step number within the optimization)
    • a good rate can be determined by tracking the losses while varying the learning rate (see the sketch below)
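
A toy sketch of stepped decay (the objective, rates and step counts are made up):

```python
import torch

p = torch.randn(10, requires_grad=True)  # a toy parameter vector

for step in range(2000):
    loss = (p ** 2).sum()                # toy objective to minimize
    loss.backward()
    lr = 0.1 if step < 1000 else 0.01    # decay the rate late in training
    with torch.no_grad():
        p -= lr * p.grad                 # gradient descent update
    p.grad = None                        # reset for the next pass
```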
• hyperparameter
  • a parameter of the architecture or the training setup (e.g. hidden size, learning rate) that is not learned; to be chosen by evaluation
  • hyperparameter tuning:
    • grid search cross validation: trying all combinations; suffers from exponential growth of the number of combinations
    • random search: random combinations of hyperparameters are tried to find the best solution (see the sketch below)
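
A sketch of random search (the search space is made up and `train_and_eval` is a placeholder; a real one would train a model and return the dev loss):

```python
import random

space = {'lr': [0.3, 0.1, 0.03, 0.01], 'hidden_size': [64, 128, 256]}

def train_and_eval(lr, hidden_size):
    # placeholder objective standing in for a full training run
    return abs(lr - 0.1) + abs(hidden_size - 128) / 1000

candidates = [{k: random.choice(v) for k, v in space.items()}
              for _ in range(10)]
best = min(candidates, key=lambda c: train_and_eval(**c))
print(best)
```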
• batch(ing)
  • a random portion of the training data used for a forward pass (different/random every time)
  • saves compute time
  • too small batches may introduce noise into the loss values over time (see the sketch below)
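
A sketch of sampling one minibatch (the dataset and batch size are made up):

```python
import torch

X = torch.randn(1000, 10)                 # toy training inputs
Y = torch.randint(0, 5, (1000,))          # toy training labels

ix = torch.randint(0, X.shape[0], (32,))  # fresh random indices every step
xb, yb = X[ix], Y[ix]                     # the batch for one forward pass
```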
• logit = log-count
• softmax = exponentiate the numbers and normalize them to sum to 1.0
  • makes the output a probability distribution
• cross entropy: `F.cross_entropy` is better (more efficient, numerically stable) than manually taking the negative mean of the log-probabilities (see the sketch below)
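
A sketch showing the equivalence (logits and targets are made up): softmax followed by the negative mean log-probability of the targets gives the same number as the built-in:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)             # "log-counts" for 5 classes
targets = torch.tensor([0, 3, 1, 4])

counts = logits.exp()                  # softmax, step 1: exponentiate
probs = counts / counts.sum(1, keepdim=True)  # step 2: rows sum to 1.0

manual = -probs[torch.arange(4), targets].log().mean()
print(manual, F.cross_entropy(logits, targets))  # the same value
```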
• overfitting
  • the model memorizes the training data and fails to generalize (low training loss, high validation loss)
• underfitting:
  • usually when a model is very small
• training, development/validation and test splits; 80, 10 and 10% (see the sketch below)
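
A sketch of such a split (the dataset is made up):

```python
import torch

X = torch.randn(1000, 10)          # toy dataset
n1, n2 = int(0.8 * len(X)), int(0.9 * len(X))
perm = torch.randperm(len(X))      # shuffle before splitting

Xtr = X[perm[:n1]]                 # 80%: fit the parameters
Xdev = X[perm[n1:n2]]              # 10%: tune the hyperparameters
Xte = X[perm[n2:]]                 # 10%: only for the final evaluation
```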

## (Py)Torch

• Tensor
  • `.view(<shape>)` takes the internal linear (memory) representation of the tensor and lays it out according to the wanted shape
    • very efficient (better than e.g. `torch.cat(torch.unbind(tensor, 1), 1)`)
  • `arange` is similar to `range` in Python
  • `randn(<shape>)` fills a tensor with numbers from the standard normal distribution
  • `-1` in a shape (e.g. in `.view`) tells torch to infer that dimension
  • `squeeze` removes the dimensions of size 1
  • `torch.linspace(from, to, steps)` is like `range` in Python but works for floats; `steps` is the number of values, not the step size
  • sum per row: `P.sum(1, keepdim=False)`; `keepdim=True` keeps the summed-out dimension as size 1, which matters for broadcasting
  • `torch.from_numpy(ndarray)` creates a tensor from a NumPy array, sharing its memory (several of these in the sketch below)
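
Several of these in one runnable sketch (all values are made up):

```python
import numpy as np
import torch

t = torch.arange(12)                # like Python's range, as a tensor
m = t.view(3, 4)                    # relays out the same memory, no copy
m2 = t.view(-1, 4)                  # -1: torch infers the first dimension

r = torch.randn(3, 4)               # standard normal samples
row_sums = r.sum(1, keepdim=True)   # shape (3, 1): one sum per row

xs = torch.linspace(0.0, 1.0, 5)    # 5 evenly spaced floats from 0 to 1
sq = torch.ones(1, 3, 1).squeeze()  # drops size-1 dims, shape becomes (3,)

a = torch.from_numpy(np.array([1.0, 2.0]))  # shares memory with the array
```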
• broadcasting: a binary operation is defined for two tensors when, aligning the shapes from the trailing dimensions, for each pair of dimensions:
  • both dimensions are equal, or
  • one of them is 1, or
  • one of them doesn’t exist (see the sketch below)
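
A sketch of these rules (shapes are made up); the second example is the per-row normalization where keeping the summed dimension matters:

```python
import torch

a = torch.randn(4, 3)
b = torch.randn(3)        # the missing leading dimension is treated as 1
c = a + b                 # (4, 3) + (3,): b is stretched to (4, 3)

P = torch.randn(4, 3).exp()
P = P / P.sum(1, keepdim=True)  # (4, 3) / (4, 1): the 1 broadcasts per row
```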
• `torch.nn.functional.one_hot`
  • common way of importing functional: `import torch.nn.functional as F`
• `@` is matrix multiplication
• indexing with a range: `x[torch.arange(10), y]` picks element `y[i]` from row `i`
• `with torch.no_grad(): ...` tells torch not to track what follows for backpropagation
• `torch.zeros_like(tensor)` will create a new all-zero tensor with the shape of `tensor`
• `torch.allclose(t1, t2)` will compare tensors with some tolerance (several of these in the sketch below)
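
A runnable sketch tying the remaining items together (all values are made up):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([0, 2, 1])
oh = F.one_hot(y, num_classes=4).float()  # rows of zeros with a single 1

W = torch.randn(4, 4)
logits = oh @ W                           # @ is matrix multiplication

probs = F.softmax(logits, dim=1)
picked = probs[torch.arange(3), y]        # element y[i] from row i

with torch.no_grad():
    z = torch.zeros_like(W)               # all zeros, same shape as W

print(torch.allclose(oh @ W, logits))     # equal within a small tolerance
```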
published: 2022-11-24
modified: 2023-08-28