# Machine Learning Notes

These are my notes taken while watching Andrej Karpathy’s tutorials on YouTube. I also have more theoretical (overview) notes on NLP and ML.

The tutorials are these:

## Terminology

• finding local (global) maxima/minima of a function
• backpropagation
• computing gradients for all parameters of the model
• the gradient tells how much each parameter contributes to the final value (the objective function) of the model
• it is the slope of the function, i.e. the partial derivative
• all functions and their elements must be differentiable
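A tiny sketch of the gradient idea above using PyTorch autograd (the values here are just illustrative):

```python
import torch

# a parameter we want a gradient for
w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0)

# forward pass: y = w * x, so the partial derivative dy/dw = x = 2.0
y = w * x
y.backward()  # backpropagation fills in w.grad

print(w.grad)  # → tensor(2.)
```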
• forward pass
• applying the model to the data based on the current parameter values
• weights, bias, non-linearity
• a neuron model: N inputs, N weights for them, plus a bias; then a non-linear function
• non-linearity
• sigmoid, tanh, softmax
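The neuron model above can be sketched in a few lines (sizes and values are made up):

```python
import torch

torch.manual_seed(0)
n_inputs = 4
x = torch.randn(n_inputs)   # N inputs
w = torch.randn(n_inputs)   # N weights for them
b = torch.zeros(1)          # bias

# a single neuron: weighted sum of inputs plus bias, squashed by tanh
out = torch.tanh(x @ w + b)
```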
• initialization
• it’s advisable not to initialize weights to zero even though the initial loss is close to optimal in this case
• but biases can be zero
• some neurons might be initialized as dead neurons, unable to learn (tanh in its flat regions, ReLU for negative inputs, …)
• Kaiming initialization `torch.nn.init.kaiming_normal_()`
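A quick sketch of Kaiming initialization; by default it scales the standard deviation by roughly sqrt(2 / fan_in) so activations don’t blow up or vanish:

```python
import torch

w = torch.empty(100, 200)              # fan_in = 200
torch.nn.init.kaiming_normal_(w)       # std ≈ sqrt(2 / 200) = 0.1
```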
• batch normalization
• normalization layer
• normalizing values to be more Gaussian, since too low/high values with tanh may result in dead neurons
• introduces noise into batches (batches are random, so every example is somehow affected by the other examples in the same batch)
• it’s implemented in Torch
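A minimal sketch of the Torch implementation mentioned above (`nn.BatchNorm1d`; the input scale here is artificial):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(50)           # one scale/shift pair per feature
x = torch.randn(32, 50) * 5 + 3   # a batch with non-Gaussian scale
y = bn(x)                         # roughly zero mean, unit std per feature
```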
• regularization
• add a penalty as model complexity increases
• L1 and L2
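A sketch of an L2 penalty added to the loss (the data loss and the strength `lam` are stand-in values):

```python
import torch

w = torch.randn(10, requires_grad=True)
pred_loss = torch.tensor(0.5)   # stand-in for the data loss
lam = 0.01                      # regularization strength (a hyperparameter)

# L2 regularization: penalize large weights, pushing the model to stay simple
loss = pred_loss + lam * (w ** 2).sum()
```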
• loss function
• a function which takes the model’s predictions and the desired outputs and computes a single number
• the lower the better
• (py)torch
• L2 regularization
• learning rate
• how much to change the weights each pass
• too large: the learning is unstable
• too small: the learning takes ages
• learning rate decay:
• learning rate can be dynamic (with regards to the step number within optimization)
• can be determined by tracking losses while altering learning rates
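A toy sketch of gradient descent with a manual learning rate decay (the objective and step counts are made up for illustration):

```python
import torch

w = torch.tensor(5.0, requires_grad=True)
lr = 0.1
for step in range(200):
    loss = (w - 2.0) ** 2       # toy objective, minimum at w = 2
    w.grad = None
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad        # gradient descent update
    if step == 100:
        lr = 0.01               # learning rate decay: smaller steps later on
```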
• hyperparameter
• a parameter of the model set before training rather than learned from data; chosen by evaluation
• hyperparameter tuning:
• grid search cross validation: trying all combinations; suffers from exponential growth of combinations
• random search: random combination of hyperparameters are used to find the best solution
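A sketch of random search; the hyperparameters and the “score” are hypothetical stand-ins for a real validation run:

```python
import random

random.seed(0)
best = None
# random search: sample hyperparameter combinations instead of a full grid
for _ in range(10):
    lr = 10 ** random.uniform(-4, -1)      # sample learning rate on a log scale
    hidden = random.choice([64, 128, 256]) # hypothetical layer size choices
    score = -abs(lr - 0.01)                # stand-in for validation performance
    if best is None or score > best[0]:
        best = (score, lr, hidden)
```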
• batch(ing)
• random portion of training data used for forward pass (every time different/random)
• to save compute time
• too small batches may introduce noise in loss values over time
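The batching idea above in a sketch (the dataset here is random dummy data):

```python
import torch

torch.manual_seed(0)
X, Y = torch.randn(1000, 10), torch.randn(1000)

# a minibatch: a fresh random subset of the training data each step
ix = torch.randint(0, X.shape[0], (32,))
xb, yb = X[ix], Y[ix]
```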
• logit = log-count
• softmax = exponentiate the numbers and normalize them to sum to 1.0
• makes the output a probability distribution
• cross entropy: better (more efficient and numerically stable) than manually computing the negative mean of log-probabilities
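The manual computation vs. `F.cross_entropy`, side by side (the logits are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])

# manual: softmax, then negative mean log-probability of the correct class
probs = logits.exp() / logits.exp().sum(1, keepdim=True)
manual = -probs[0, target].log().mean()

# F.cross_entropy fuses both steps, more efficiently and numerically safely
builtin = F.cross_entropy(logits, target)
```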
• overfitting
• underfitting:
• usually when a model is very small
• training, development/validation and test splits; 80, 10 and 10%
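A sketch of the 80/10/10 split above, shuffling indices before slicing:

```python
import torch

torch.manual_seed(0)
n = 1000
ix = torch.randperm(n)                  # shuffle before splitting
n_train, n_dev = int(0.8 * n), int(0.9 * n)
train_ix = ix[:n_train]                 # 80% for training
dev_ix = ix[n_train:n_dev]              # 10% for development/validation
test_ix = ix[n_dev:]                    # 10% for the final test
```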

## (Py)Torch

• Tensor
• `.view(<shape>)` will take the internal linear (memory) representation of a tensor and lay it out according to the requested shape
• very efficient (better than e.g. `torch.cat(torch.unbind(tensor, 1), 1)`)
• `arange` similar to `range` in Python
• `randn(<shape>)` will fill with numbers from the normal distribution
• `-1` in shape tells torch to infer the dimension
• squeeze
• `torch.linspace(from, to, steps)` is like `range` in Python but works for floats
• sum per row: `P.sum(1, keepdim=False)`
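A few of the tensor basics above in one sketch:

```python
import torch

t = torch.arange(12)      # 0..11, like Python's range
m = t.view(3, 4)          # reinterpret the same memory as 3x4
m2 = t.view(-1, 6)        # -1: let torch infer this dimension (here 2)
row_sums = m.sum(1)       # sum per row, shape (3,)
```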
• from numpy
• broadcasting: a binary operation is defined for two tensors when, comparing their dimensions from the right, each pair satisfies one of:
• both dimensions are equal
• one of them is 1
• one of them doesn’t exist
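The three broadcasting cases above in one expression:

```python
import torch

a = torch.ones(3, 4)
b = torch.arange(4.0)   # shape (4,): the leading dim doesn't exist
c = torch.ones(3, 1)    # second dim is 1, so it is stretched to 4

# dimensions are aligned from the right; each pair must be equal,
# be 1 on one side, or be missing on one side
out = a + b + c
```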
• `torch.nn.functional.one_hot`
• common way of importing functional: `import torch.nn.functional as F`
• `@` is matrix multiplication
• indexing with a range: `x[torch.arange(10), y]`
• `with torch.no_grad(): ...` tells torch to not include what follows in backpropagation
• `torch.zeros_like(tensor)` will create a new tensor with the shape of `tensor` with all zeros
• `torch.allclose(t1, t2)` will compare tensors with some tolerance
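The last few utilities in one sketch (the update here is a no-op, just for illustration):

```python
import torch

w = torch.randn(5, requires_grad=True)
g = torch.zeros_like(w)     # same shape as w, all zeros

with torch.no_grad():       # updates here are excluded from backpropagation
    w -= 0.1 * g

same = torch.allclose(g, torch.zeros(5))  # comparison with tolerance
```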
published: 2022-11-24
modified: 2023-08-28