Machine Learning Notes
These are my notes taken while watching Andrej Karpathy's tutorials on YouTube. I also have more theoretical (overview) notes on NLP and ML.
The tutorials are these:
Terminology
- gradient descent
  - iteratively finding a local (ideally global) minimum of a function by moving the parameters against the gradient
- backpropagation
  - computing the gradients for all elements (parameters) of the functions of the model
  - the gradient says how much each element contributes to the final value (objective function) of the model
  - it is the slope of the function, i.e. the partial derivative
  - all functions and their elements must be differentiable
- forward pass
  - applying the model to the data with the current parameter values (these three terms come together in the training-loop sketch after this list)
- weights, bias, non-linearity
  - a neuron model: N inputs, N weights for them, plus a bias, followed by a non-linear activation function
- non-linearity
  - e.g. sigmoid, tanh, ReLU; softmax (usually at the output layer)
- initialization
  - it's advisable not to initialize the weights to zero, even though the initial loss would be close to the expected value in that case; small random values are used instead
  - but biases can be zero
  - some neurons might be initialized as dead neurons unable to learn (tanh saturated in its flat regions, ReLU in the negative range, …)
  - Kaiming initialization: `torch.nn.init.kaiming_normal_()` (sketched after this list)
- batch normalization
  - 2015 paper from Google (Ioffe & Szegedy)
  - a normalization layer
  - normalizes values to be more Gaussian, since too low/high pre-activations with tanh may result in dead neurons
  - introduces noise into the batches (batches are random, so every example is somewhat affected by the other examples in the same batch)
  - implemented in (Py)Torch, e.g. `torch.nn.BatchNorm1d` (sketched after this list)
- regularization
  - adds a penalty that grows as model complexity increases
  - L1 and L2 (an L2 penalty is sketched after this list)
- loss function
  - a function which takes the model's predictions and the desired outputs and computes a single number
  - the lower the better
  - in (py)torch, L2 regularization can be added as a penalty term to the loss (or via the optimizer's weight_decay argument)
- learning rate
  - how much the weights are changed in each pass
  - too large: the learning is unstable
  - too small: the learning takes ages
  - learning rate decay (sketched after this list):
    - the learning rate can be dynamic (with regard to the step number within the optimization)
    - a good value can be determined by tracking the loss while trying different learning rates
- hyperparameter
  - a setting of the model or the training procedure (e.g. learning rate, hidden layer size) that is not learned during training and has to be chosen and evaluated separately
- hyperparameter tuning:
  - grid search cross validation: trying all combinations; suffers from the exponential growth of the number of combinations
  - random search: random combinations of hyperparameters are tried to find the best solution (sketched after this list)
- batch(ing)
  - a random portion of the training data used for the forward pass (a different random portion every time; sketched after this list)
  - to save compute time
  - too small batches may introduce noise into the loss values over time
- logit = log-count (a raw, unnormalized score that can be read as the log of a count)
- softmax = exponentiate the numbers and normalize them to sum to 1.0
  - makes the output a probability distribution
- cross entropy: preferable to manually taking the negative mean of the log-probabilities; it is more efficient and numerically stable (compared in a sketch after this list)
- overfitting
  - usually when a model is too large or trained for too long; it memorizes the training data and generalizes poorly
- underfitting:
  - usually when a model is very small
- training, development/validation and test splits; 80, 10 and 10% (a splitting sketch follows after this list)
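
To tie the first few terms together, here is a minimal sketch of a training loop for a single tanh neuron: a forward pass with the current weights and bias, backpropagation of the loss, and a gradient-descent update. The data, sizes and learning rate are made up for illustration.

```python
import torch

# made-up toy data: 4 examples with 3 features each, and a target per example
x = torch.randn(4, 3)
y = torch.tensor([1.0, -1.0, 1.0, -1.0])

# one neuron: 3 weights (one per input) plus a bias, tanh as the non-linearity
W = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.1  # learning rate (arbitrary)
for step in range(100):
    # forward pass: apply the model with the current parameter values
    out = torch.tanh(x @ W + b).squeeze(1)
    loss = ((out - y) ** 2).mean()      # loss: a single number, the lower the better

    # backpropagation: compute the gradient of the loss w.r.t. each parameter
    W.grad = None
    b.grad = None
    loss.backward()

    # gradient descent: move the parameters against their gradients
    with torch.no_grad():
        W -= lr * W.grad
        b -= lr * b.grad
```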
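A minimal sketch of the initialization advice: Kaiming-normal weights and zero biases. The layer sizes and the choice of non-linearity are assumptions made for the example.

```python
import torch

fan_in, fan_out = 30, 200                               # hypothetical layer sizes
W = torch.empty(fan_out, fan_in)
torch.nn.init.kaiming_normal_(W, nonlinearity='tanh')   # std scaled by gain / sqrt(fan_in)
b = torch.zeros(fan_out)                                # biases can start at zero
```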
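Batch normalization is available as a layer in (Py)Torch; a small sketch with made-up sizes, placing `BatchNorm1d` between a linear layer and a tanh:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 100)        # hypothetical batch: 32 examples, 100 features
layer = nn.Linear(100, 200)
bn = nn.BatchNorm1d(200)        # normalizes each of the 200 pre-activations over the batch

h = torch.tanh(bn(layer(x)))    # pre-activations become roughly unit Gaussian -> fewer dead tanh units
bn.eval()                       # at inference time the running mean/std are used instead
```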
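A sketch of L2 regularization added directly to the loss; the weight matrix, batch and regularization strength are made up. An L1 penalty would use `W.abs()` instead of `W ** 2`.

```python
import torch
import torch.nn.functional as F

W = torch.randn(100, 27, requires_grad=True)    # hypothetical weights: 100 features -> 27 classes
x = torch.randn(32, 100)
y = torch.randint(0, 27, (32,))

lam = 0.01                                      # regularization strength (assumed)
data_loss = F.cross_entropy(x @ W, y)
reg_loss = lam * (W ** 2).mean()                # penalty grows with the magnitude of the weights
loss = data_loss + reg_loss
loss.backward()
```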
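A sketch of a simple step decay of the learning rate inside a hand-written update loop; the model (plain linear regression), the step counts and the two rates are arbitrary assumptions.

```python
import torch

W = torch.randn(10, 1, requires_grad=True)      # hypothetical linear model
x, y = torch.randn(64, 10), torch.randn(64, 1)

for step in range(1000):
    loss = ((x @ W - y) ** 2).mean()            # forward pass + loss
    W.grad = None
    loss.backward()

    lr = 0.1 if step < 500 else 0.01            # decay: a smaller learning rate late in training
    with torch.no_grad():
        W -= lr * W.grad
```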
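A sketch of random search over a small hyperparameter space; the search space, the number of trials and the `train_and_evaluate` function are placeholders, not anything from the tutorials.

```python
import random

# hypothetical search space: 4 * 3 * 3 = 36 possible combinations
space = {
    "learning_rate": [0.3, 0.1, 0.03, 0.01],
    "hidden_size": [64, 128, 256],
    "batch_size": [32, 64, 128],
}

def train_and_evaluate(cfg):
    # stand-in for real training; returns a fake validation loss
    return random.random()

best_cfg, best_loss = None, float("inf")
for trial in range(20):                              # 20 random combinations instead of all 36
    cfg = {k: random.choice(v) for k, v in space.items()}
    val_loss = train_and_evaluate(cfg)
    if val_loss < best_loss:
        best_cfg, best_loss = cfg, val_loss
print(best_cfg, best_loss)
```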
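A sketch of minibatching: drawing a fresh random subset of the training data for each forward pass. The dataset shapes and the batch size are made up.

```python
import torch

X = torch.randn(10000, 30)                          # hypothetical training set
Y = torch.randint(0, 27, (10000,))

batch_size = 32
ix = torch.randint(0, X.shape[0], (batch_size,))    # a different random subset every step
xb, yb = X[ix], Y[ix]                               # the forward pass only sees this batch
```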
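A sketch showing that softmax followed by the negative mean log-probability of the correct class gives the same value as `F.cross_entropy` on the raw logits; the shapes and targets are arbitrary.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 27)                     # raw scores ("log-counts"), 27 classes
y = torch.tensor([0, 5, 12, 26])

counts = logits.exp()                           # softmax by hand: exponentiate...
probs = counts / counts.sum(1, keepdim=True)    # ...and normalize rows to sum to 1.0
manual_loss = -probs[torch.arange(4), y].log().mean()

loss = F.cross_entropy(logits, y)               # same value, more efficient and numerically stable
print(manual_loss.item(), loss.item())
```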
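A sketch of an 80/10/10 split using a random permutation; the dataset is a random placeholder.

```python
import torch

data = torch.randn(1000, 10)            # hypothetical dataset of 1000 examples
perm = torch.randperm(data.shape[0])    # shuffle before splitting

n1 = int(0.8 * len(perm))               # 80% training
n2 = int(0.9 * len(perm))               # 10% development/validation, 10% test
train = data[perm[:n1]]
dev = data[perm[n1:n2]]
test = data[perm[n2:]]
```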
(Py)Torch
- Tensor
  - `.view(<shape>)` will take the internal linear (memory) representation of the tensor and lay it out according to the wanted shape; very efficient (better than e.g. `torch.cat(torch.unbind(tensor, 1), 1)`)
  - `arange` is similar to `range` in Python
  - `randn(<shape>)` will fill a tensor of the given shape with numbers from a normal distribution
  - `-1` in a shape tells torch to infer that dimension
  - `squeeze` removes dimensions of size 1
  - `torch.linspace(from, to, steps)` is like `range` in Python but works for floats
  - sum per row: `P.sum(1, keepdim=False)`
- broadcasting semantics
  - taken from numpy
  - a binary operation is defined for two tensors when, aligning the shapes from the trailing dimension, every pair of dimensions satisfies one of:
    - both dimensions are equal
    - one of them is 1
    - one of them doesn't exist
- `torch.nn.functional.one_hot`
  - common way of importing functional: `import torch.nn.functional as F`
- `@` is matrix multiplication
- indexing with a range: `x[torch.arange(10), y]`
- `with torch.no_grad(): ...` tells torch not to track what follows for backpropagation
- `torch.zeros_like(tensor)` will create a new tensor with the shape of `tensor`, filled with zeros
- `torch.allclose(t1, t2)` will compare two tensors with some tolerance (several of these operations are combined in the demo below)
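
To make the bullets above concrete, here is a combined toy demo of `view`, broadcasting, `one_hot`, `@`, range indexing, `torch.no_grad`, `zeros_like` and `allclose`. All shapes and values are made up.

```python
import torch
import torch.nn.functional as F

t = torch.arange(12)                        # like Python's range, but a tensor
m = t.view(3, 4)                            # lays out the same memory as a 3x4 matrix

# broadcasting: (3, 4) divided by (3, 1) -- the size-1 dimension is stretched
row_sums = m.sum(1, keepdim=True)
normalized = m / row_sums

# one-hot encoding and matrix multiplication with @
x = torch.tensor([0, 2, 1])
xenc = F.one_hot(x, num_classes=4).float()
W = torch.randn(4, 5)
logits = xenc @ W

# indexing with a range: pick one element per row
y = torch.tensor([1, 0, 3])
picked = logits[torch.arange(3), y]

with torch.no_grad():                       # no gradient tracking inside this block
    z = torch.zeros_like(W)                 # zeros with the same shape as W

print(torch.allclose(normalized.sum(1), torch.ones(3)))   # rows sum to 1.0 within tolerance
```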
published: 2022-11-24
modified: 2023-08-28