Machine learning and neural networks in NLP

NLP tasks

Lemmatization, morphological / word / sentence segmentation, part-of-speech tagging, language modelling, parsing, terminology extraction, named entity recognition, natural language generation, optical character recognition, question answering, textual entailment, relation extraction, sentiment analysis, topic recognition, word sense disambiguation, summarization, coreference resolution, speech recognition, authorship analysis, language identification, information extraction, information retrieval, machine translation

Machine learning typology

  • supervised (labelled data)
  • unsupervised (unlabelled data)
    • clustering (k-means, hierarchical; sentence positions), data mining
    • Enron email corpus (2001): about 500 000 emails from about 150 people, used for network analysis
    • PCA: finding the directions of highest variance in multidimensional space
    • language modelling (Penn Treebank, One Billion Word Benchmark, Google N-grams)
    • word embeddings (word2vec: skip-gram, CBOW)
  • semi-supervised (only a part of training data is labelled)
  • reinforcement learning
  • genetic algorithms
    • fitness (cost) function, mutation, generations
    • not efficient but with a potential to discover novel solutions
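
The genetic bullets above (fitness function, mutation, generations) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy "count the 1-bits" fitness, the parameter values, and the names fitness, mutate and evolve. It uses mutation and selection only (no crossover), so it is one simple variant rather than a reference implementation.

```python
import random


def fitness(bits):
    # Toy fitness (an assumption): the number of 1-bits; higher is better.
    return sum(bits)


def mutate(bits, rate, rng):
    # Flip each bit independently with probability `rate`.
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(n_bits=20, pop_size=30, generations=40, rate=0.05, seed=0):
    rng = random.Random(seed)
    # Random initial population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Selection: the fitter half survives unchanged.
        survivors = population[: pop_size // 2]
        # The rest of the next generation are mutated copies of survivors.
        children = [mutate(rng.choice(survivors), rate, rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    best = max(population, key=fitness)
    return best, history


best, history = evolve()
print("best fitness:", fitness(best), "of", len(best))
```

Because the fitter half is carried over unchanged, the best fitness never decreases from one generation to the next; the search is slow compared with gradient-based methods, which matches the "not efficient" remark above.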

Models of ML

  • decision trees
  • neural networks
    • need a lot of training data (ImageNet: about 14 M pictures in total; the ILSVRC 2012 training set has about 1.2 M)
    • computing power, software and hardware optimization (numpy, GPU, dedicated hardware)
    • neuron: activation(inputs * weights + bias); sigmoid, ReLU, perceptron, random initialization, cost function
    • Neural Network Zoo (overview chart of architectures)
    • deep neural networks (more hidden layers)
    • CNN: image recognition, translational invariance
    • RNN: audio, sequences (RNNLM toolkit by Tomáš Mikolov)
    • LSTM (sentiment analysis from unsupervised data: sentiment neuron from OpenAI)
    • autoencoder: compress/decompress through lower-dimensional layers (encoder-decoder architecture), used for MT and language modelling
    • sequence-to-sequence learning; autoregressive transformer language models: GPT-2
    • bi-directional transformers: BERT
    • libraries: TensorFlow, PyTorch, Keras
  • support vector machines (SVM), linear and non-linear
  • (linear) regression & classification
  • kNN, random forests, naive Bayes
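
The neuron bullet above, activation(inputs * weights + bias), can be sketched in a few lines of plain Python. The function names (neuron, step, train_perceptron), the AND training data and the learning rate are assumptions chosen for illustration. Weights start at zero here to keep the run reproducible, whereas the outline mentions random initialization.

```python
import math


def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    # Rectified linear unit: passes positive values, zeroes out the rest.
    return max(0.0, z)


def step(z):
    # Threshold activation used by the classic perceptron.
    return 1 if z >= 0 else 0


def neuron(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus bias, passed through the activation.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


def train_perceptron(samples, epochs=10, lr=1.0):
    # Perceptron learning rule: nudge weights by the prediction error.
    # Assumption: zero initialization instead of random, for determinism.
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - neuron(inputs, weights, bias, step)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Logical AND is linearly separable, so a single perceptron can learn it.
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_data)
predictions = [neuron(x, weights, bias, step) for x, _ in and_data]
print(weights, bias, predictions)
```

A single neuron of this kind can only separate classes with a straight line (or hyperplane); stacking such units into hidden layers and replacing the step function with a differentiable activation such as sigmoid or ReLU is what makes the deeper networks in the list trainable by gradient descent on a cost function.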

Further resources

December 29, 2019