# Machine Learning Notes

These are notes I made while watching Andrej Karpathy’s tutorials on YouTube. I also have more theoretical (overview) notes about NLP and ML.

The tutorials are these:

## Terminology

- gradient descent
  - iteratively searching for a local (ideally global) minimum of a function by stepping against its gradient (or for a maximum, by stepping along it)
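
A minimal sketch of gradient descent in plain Python. The function `f`, the starting point and the learning rate are made-up values for illustration, not taken from the tutorials.

```python
def f(x):
    # Toy function with a single minimum at x = 3.
    return (x - 3.0) ** 2


def df(x):
    # Analytic derivative (slope) of f.
    return 2.0 * (x - 3.0)


x = 0.0              # arbitrary starting point
learning_rate = 0.1  # assumed step size

for step in range(100):
    # Step against the gradient to decrease f(x).
    x -= learning_rate * df(x)

print(x, f(x))
```

Flipping the sign of the update (`x += ...`) would climb towards a maximum instead.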

- backpropagation
  - computing gradients for all elements (parameters and intermediate values) of the model by applying the chain rule backwards from the output
  - a gradient says how much each element contributes to the final value (objective function) of the model
  - it is the slope of the function, i.e. the partial derivative with respect to that element
  - all functions and their elements must be differentiable

- forward pass
  - applying the model to the data based on the current parameter values

- weights, bias, non-linearity
  - a neuron model: N inputs, N weights for them, plus a bias, followed by a non-linear function
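
To make the forward pass, the neuron model and backpropagation concrete, here is a small sketch of a single tanh neuron in plain Python. The input, weight and bias values are illustrative assumptions; the gradients are derived by hand with the chain rule and one of them is checked numerically.

```python
import math

# Assumed example values: 2 inputs, 2 weights, 1 bias.
x = [2.0, 0.0]
w = [-3.0, 1.0]
b = 6.8813735870195432


def forward(x, w, b):
    # Weighted sum of the inputs plus the bias, squashed by tanh.
    n = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.tanh(n)


# Forward pass with the current parameter values.
out = forward(x, w, b)

# Backward pass via the chain rule:
# d tanh(n)/dn = 1 - tanh(n)^2, dn/dw_i = x_i, dn/db = 1.
dn = 1.0 - out ** 2
grad_w = [dn * xi for xi in x]
grad_b = dn

# Numerical check of the first weight gradient (central difference).
eps = 1e-6
num_grad_w0 = (
    forward(x, [w[0] + eps, w[1]], b) - forward(x, [w[0] - eps, w[1]], b)
) / (2 * eps)

print(out, grad_w, grad_b, num_grad_w0)
```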

- non-linearity
  - sigmoid, tanh, softmax

- initialization
  - it’s advisable not to initialize weights to exactly zero even though the initial loss is close to optimal in this case; small random values are used instead
  - but biases can be zero
  - some neurons might be initialized as dead neurons incapable of learning (tanh in its flat regions, ReLU for negative inputs, …)
  - Kaiming initialization: `torch.nn.init.kaiming_normal_()`
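
A short sketch of Kaiming initialization for one linear layer followed by tanh. The layer sizes are assumptions for illustration. With `nonlinearity="tanh"` the gain is 5/3, so the standard deviation of the weights should come out close to `(5/3) / sqrt(fan_in)`.

```python
import torch

torch.manual_seed(0)

fan_in, fan_out = 200, 100  # assumed layer sizes

# Weight matrix in the (out_features, in_features) layout used by torch.nn.Linear.
W = torch.empty(fan_out, fan_in)
torch.nn.init.kaiming_normal_(W, nonlinearity="tanh")

# Biases can simply start at zero.
b = torch.zeros(fan_out)

expected_std = (5 / 3) / fan_in ** 0.5
print(W.std().item(), expected_std)
```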

- batch normalization
  - 2015 paper, Google
  - a normalization layer
  - normalizes values within a batch to zero mean and unit variance (roughly unit Gaussian), as too low/high values fed into tanh may result in dead neurons
  - introduces noise through batching (batches are random, so every example is somewhat affected by the other examples in the same batch)
  - it’s implemented in Torch
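
A sketch of what the normalization step does, compared against `torch.nn.BatchNorm1d`. The batch size, feature count and input scale are made up. The comparison relies on a freshly created layer being in training mode with gain 1 and bias 0.

```python
import torch

torch.manual_seed(0)

# Assumed pre-activations: batch of 32 examples, 10 features, far from unit Gaussian.
x = torch.randn(32, 10) * 5 + 3

# Manual normalization per feature, over the batch dimension.
mean = x.mean(0, keepdim=True)
var = ((x - mean) ** 2).mean(0, keepdim=True)  # biased variance
x_hat = (x - mean) / torch.sqrt(var + 1e-5)

# The built-in layer; a new module starts in training mode.
bn = torch.nn.BatchNorm1d(10)
x_bn = bn(x)

print(x_hat.mean(0), x_hat.std(0))
print(torch.allclose(x_hat, x_bn, atol=1e-5))
```

Because `mean` and `var` depend on which examples landed in the batch, each example's normalized value shifts slightly from batch to batch, which is the noise mentioned above.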

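- regularization
  - adds a penalty to the loss that grows with model complexity (e.g. with the size of the weights)
  - L1 (sum of absolute weights) and L2 (sum of squared weights)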
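
A tiny sketch of how the two penalties enter the loss. The weights, the data loss and the strength `lam` are made-up numbers.

```python
weights = [0.5, -1.5, 2.0]  # assumed model weights
data_loss = 0.8             # assumed loss from the predictions alone
lam = 0.01                  # assumed regularization strength

# L1 penalizes absolute values, L2 penalizes squares.
l1_penalty = sum(abs(w) for w in weights)
l2_penalty = sum(w * w for w in weights)

loss_l1 = data_loss + lam * l1_penalty
loss_l2 = data_loss + lam * l2_penalty

print(loss_l1, loss_l2)
```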

- loss function
  - a function which takes the model’s predictions and the desired output and computes one number
  - the lower the better

- (py)torch
- L2 regularization
- learning rate
  - how much to change the weights on each pass
  - too large: learning is unstable
  - too small: learning takes ages
  - learning rate decay:
    - the learning rate can be dynamic (depending on the step number within the optimization)
  - a good value can be determined by tracking losses while altering learning rates
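
A sketch of both ideas in plain Python: a simple step decay schedule, and a list of exponentially spaced candidate learning rates to try while tracking the loss. The function name `learning_rate` and all the numbers are assumptions for illustration.

```python
def learning_rate(step, base_lr=0.1, decay_after=100_000, decay_factor=0.1):
    # Step decay: use base_lr first, then a smaller rate for the remaining steps.
    if step < decay_after:
        return base_lr
    return base_lr * decay_factor


# Candidate rates from 0.001 to 1.0, evenly spaced in the exponent.
candidates = [10 ** (-3 + 3 * i / 999) for i in range(1000)]

print(learning_rate(0), learning_rate(150_000))
print(candidates[0], candidates[-1])
```

In a sweep, one optimization step would be run per candidate rate and the loss recorded; the useful range is roughly where the loss drops fastest before it becomes unstable.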

- hyperparameter
  - a setting of the model or of the training that is not learned from the data (e.g. learning rate, batch size); candidate values have to be evaluated
  - hyperparameter tuning:
    - grid search with cross-validation: trying all combinations; suffers from exponential growth of the number of combinations
    - random search: random combinations of hyperparameters are tried to find the best solution
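
A sketch contrasting the two strategies in plain Python. The search space and the `evaluate` function are hypothetical; in practice `evaluate` would train a model and return its validation loss.

```python
import itertools
import random

# Assumed search space.
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_size": [64, 128, 256],
    "batch_size": [32, 64],
}


def evaluate(config):
    # Stand-in for "train with this config and return the validation loss".
    return abs(config["learning_rate"] - 0.01) + abs(config["hidden_size"] - 128) / 1000


# Grid search: every combination (3 * 3 * 2 = 18 here).
grid = [dict(zip(space, values)) for values in itertools.product(*space.values())]
best_grid = min(grid, key=evaluate)

# Random search: a fixed budget of randomly drawn combinations.
rng = random.Random(0)
samples = [{name: rng.choice(options) for name, options in space.items()} for _ in range(5)]
best_random = min(samples, key=evaluate)

print(len(grid), best_grid)
print(len(samples), best_random)
```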

- batch(ing)
  - a random portion of the training data used for a forward pass (different/random every time)
  - saves compute time
  - too small batches may introduce noise in the loss values over time
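
A minimal sketch of drawing a random minibatch with torch. The dataset tensors `X` and `Y` and the batch size are made up.

```python
import torch

torch.manual_seed(0)

# Assumed dataset: 1000 examples with 3 features each, 27 possible labels.
X = torch.randn(1000, 3)
Y = torch.randint(0, 27, (1000,))
batch_size = 32

# Fresh random row indices for every training step.
ix = torch.randint(0, X.shape[0], (batch_size,))
Xb, Yb = X[ix], Y[ix]

print(Xb.shape, Yb.shape)
```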

- logit = log-count
- softmax = exponentiate the numbers and normalize them to sum to 1.0
  - makes the output a probability distribution
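
A sketch of softmax written out by hand, together with the negative mean log-probability of the correct classes, compared against `F.cross_entropy`. The logits and targets are made-up values.

```python
import torch
import torch.nn.functional as F

# Assumed logits for 2 examples and 3 classes, plus the correct class per example.
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
targets = torch.tensor([0, 1])

# Softmax by hand: exponentiate (giving "counts"), then normalize each row.
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)

# Negative mean log-probability assigned to the correct classes.
manual_loss = -probs[torch.arange(2), targets].log().mean()

# The built-in computes the same quantity directly from the logits.
builtin_loss = F.cross_entropy(logits, targets)

print(probs.sum(1), manual_loss.item(), builtin_loss.item())
```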

- cross entropy: better than computing the negative mean of the log-probabilities by hand
- overfitting
- underfitting:
  - usually when a model is very small

- training, development/validation and test splits: typically 80%, 10% and 10% of the data
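
A sketch of an 80/10/10 split in plain Python. The dataset is a placeholder list of integers and the seed is arbitrary.

```python
import random

examples = list(range(1000))  # assumed stand-in for the real dataset

# Shuffle first so each split is a random sample.
rng = random.Random(42)
rng.shuffle(examples)

n1 = int(0.8 * len(examples))
n2 = int(0.9 * len(examples))

train = examples[:n1]   # 80%: used to fit the parameters
dev = examples[n1:n2]   # 10%: used to choose hyperparameters
test = examples[n2:]    # 10%: used sparingly for the final evaluation

print(len(train), len(dev), len(test))
```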

## (Py)Torch

- Tensor
  - `.view(<shape>)` will take the internal linear (memory) representation of the tensor and lay it out according to the wanted shape
    - very efficient (better than e.g. `torch.cat(torch.unbind(tensor, 1), 1)`)
  - `arange` is similar to `range` in Python
  - `randn(<shape>)` will fill the tensor with numbers from a normal distribution
  - `-1` in a shape tells torch to infer the dimension
  - squeeze
  - `torch.linspace(from, to, steps)` is like `range` in Python but works for floats
  - sum per row: `P.sum(1, keepdim=False)`
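
A quick sketch exercising the tensor helpers listed above. All shapes and values are arbitrary.

```python
import torch

t = torch.arange(12)   # 0, 1, ..., 11 (like range in Python)
m = t.view(3, 4)       # same memory, viewed as 3 rows of 4
n = t.view(-1, 6)      # -1 lets torch infer the first dimension (2 here)

r = torch.randn(3, 4)  # samples from a standard normal distribution

xs = torch.linspace(0.0, 1.0, 5)     # 5 evenly spaced floats from 0 to 1

col = torch.zeros(3, 1).squeeze()    # squeeze drops dimensions of size 1

row_sums = m.sum(1, keepdim=False)      # shape (3,)
row_sums_kept = m.sum(1, keepdim=True)  # shape (3, 1)

print(m.shape, n.shape, r.shape, col.shape)
print(xs, row_sums, row_sums_kept.shape)
```

`view` does not copy anything: `m` and `t` share the same underlying storage, which is why it is cheap.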

- broadcasting semantics
  - from numpy
  - a binary operation is defined for two tensors when, aligning their shapes from the right, every pair of dimensions satisfies one of:
    - both dimensions are equal
    - one of them is 1
    - one of them doesn’t exist
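
A sketch of the three broadcasting cases, plus one shape combination that fails. The matrix `P` and the vector `bias` are made-up examples.

```python
import torch

P = torch.arange(1.0, 7.0).view(2, 3)  # [[1, 2, 3], [4, 5, 6]]

# (2, 3) / (2, 1): the size-1 dimension is stretched, so each row is
# divided by its own sum.
row_sums = P.sum(1, keepdim=True)
P_rows = P / row_sums

# (2, 3) + (3,): the missing leading dimension is treated as size 1.
bias = torch.tensor([10.0, 20.0, 30.0])
shifted = P + bias

# (2, 3) / (2,): aligned from the right, 3 vs 2 are neither equal nor 1.
flat_sums = P.sum(1)
try:
    P / flat_sums
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True

print(P_rows.sum(1), shifted, broadcast_failed)
```

With a square matrix the last case would not raise an error but would silently divide along the wrong dimension, so keeping the dimension with `keepdim=True` is the safer habit when normalizing rows.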

- `torch.nn.functional.one_hot`
  - common way of importing functional: `import torch.nn.functional as F`

- `@` is matrix multiplication
- indexing with a range: `x[torch.arange(10), y]` (picks element `y[i]` from row `i`)
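
A small sketch combining one-hot encoding, `@` and indexing with a range. The labels `y` and the matrix `W` are arbitrary.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([2, 0, 1])  # assumed class indices

# One-hot rows; cast to float so they can be multiplied with W.
x_enc = F.one_hot(y, num_classes=3).float()

W = torch.arange(9.0).view(3, 3)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Multiplying a one-hot row by W selects the matching row of W.
out = x_enc @ W

# Row i paired with column y[i]: W[0, 2], W[1, 0], W[2, 1].
picked = W[torch.arange(3), y]

print(out)
print(picked)
```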

- `with torch.no_grad(): ...` tells torch not to track what follows for backpropagation
- `torch.zeros_like(tensor)` will create a new tensor with the shape of `tensor`, filled with zeros
- `torch.allclose(t1, t2)` will compare tensors with some tolerance
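
A sketch of these three helpers in one manual parameter update. The tensor values and the step size of 0.1 are made up.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w ** 2).sum()
loss.backward()  # w.grad is now 2 * w = [2, 4]

# The update itself should not become part of the computation graph.
with torch.no_grad():
    w -= 0.1 * w.grad

# Compare with a tolerance rather than exact equality, since these are floats.
expected = torch.tensor([0.8, 1.6])
same = torch.allclose(w, expected)

# A fresh all-zero tensor with the same shape as w.
fresh_grad = torch.zeros_like(w)

print(w, same, fresh_grad)
```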

published: 2022-11-24

modified: 2023-08-28
