Introduction
David Mareček
September 30, 2021
NPFL097: Unsupervised Machine Learning in NLP
Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
About
Webpage: http://ufal.mff.cuni.cz/courses/npfl097
E-credits: 3
Examination: 1/1 C
Form:
• lectures, exercises, discussions
• 3 programming homeworks
• test
Recorded lectures:
Videos of lectures from the previous year are available in SIS.
Course passing requirements
Three programming assignments
• For each one you can obtain at most 10 points.
• You will always have at least three weeks to complete it.
• You will obtain only half of the points for assignments (or parts of them) delivered after the deadline.
Test
You can obtain 15 points from the theoretical test (on 4th January).
You pass the course by obtaining at least 30 points.
Schedule
Sep 30: Introduction to Unsupervised ML
Oct 7: Beta-Bernoulli probabilistic model
Oct 14: Dirichlet-Categorical probabilistic model, Categorical Mixture Model, EM
Oct 21: Gibbs Sampling, Latent Dirichlet Allocation
Oct 28: no lecture (national holiday)
Nov 4: Chinese Restaurant Process, Pitman-Yor Process, Text Segmentation
Nov 11: Unsupervised Tagging, Alignment, Parsing
Nov 18: K-Means, Mixture of Gaussians
Nov 25: Hierarchical Clustering, Clustering Evaluation, other clustering methods
Dec 2: Principal Component Analysis, Independent Component Analysis, CCA
Dec 9: (reserve)
Dec 16: Unsupervised Interpretation of Neural Networks
Dec 23: no lecture (Christmas holidays)
Dec 30: no lecture (Christmas holidays)
Jan 6: Test
Related courses
Basic probabilistic and ML concepts:
• NPFL067 – Statistical Methods in NLP I
• NPFL129 – Introduction to Machine Learning with Python
Basic deep-learning concepts:
• NPFL114 – Deep Learning
Other related courses:
• NPFL087 – Statistical Machine Translation
• NPFL103 – Information Retrieval
• NPFL120 – Multilingual Natural Language Processing
Reading
Books:
• Christopher Bishop: Pattern Recognition and Machine Learning, Springer-Verlag New York, 2006
• Kevin P. Murphy: Machine Learning: A Probabilistic Perspective, The MIT Press, Cambridge, Massachusetts, 2012
Tutorials, papers:
• Kevin Knight: Bayesian Inference with Tears, 2009
(https://www.isi.edu/natural-language/people/bayes-with-tears.pdf)
Unsupervised Machine Learning
Supervised Machine Learning
• We know what the output values for our samples should be.
• We have labelled (ground truth) data.
• Goal: to find a function that best approximates the relationship between input and output observable in the data.
Unsupervised Machine Learning
• We do NOT have any labelled data.
• Goal: to infer the natural structure or previously unknown patterns present within a given dataset.
Unsupervised Machine Learning
We use unsupervised learning if we want to answer the following questions:
• How do you find the underlying structure of a dataset?
→ e.g. Latent Variable Models
• How do you summarize it and group it most usefully?
→ e.g. Clustering
• How do you effectively represent data in a compressed format?
→ e.g. Dimensionality reduction
These are the goals of unsupervised learning, which is called “unsupervised” because you start with unlabeled data.
Machine learning methods that are optimized on one task, but whose internal learned representations are then used as outputs, may also be called unsupervised.
Exercises
Supervised or unsupervised learning?
• tokenization
• part-of-speech tagging
• dependency parsing
• word sense disambiguation
• word embeddings
• named entity recognition
• word alignment
• machine translation
• sentiment analysis
• document classification
• text summarization
Problem 1
Imagine you get thousands of text documents.
How would you separate documents into several groups according to their topics?
You need to do it in an unsupervised way: you do not have any annotated documents, nor do you know in advance which topics occur in the given set.
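One straightforward baseline (a sketch of my own, not the method the course will develop; scikit-learn and the toy documents are my assumptions) is to represent each document as a TF-IDF vector and cluster the vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# tiny toy corpus: two obvious topics (animals vs. finance)
docs = [
    "cat dog mouse pet",
    "dog cat pet animal",
    "stock market price trade",
    "market price trade money",
]

# bag-of-words vectors with TF-IDF weighting
X = TfidfVectorizer().fit_transform(docs)

# hard assignment of each document to one of two groups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Topic models such as LDA, covered later in the course, replace this hard cluster assignment with a distribution over topics per document.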
Problem 2
Imagine you get a very long text in an unknown language that does not use spaces between words.
How would you guess the word boundaries?
Unsupervised ML Problems
Modelling Document Collections
We want to find an underlying structure of a given set of documents.
Goal: assign a class to each document.
Better goal: find a set of topics and assign several relevant topics to each document.
• Each topic is represented by a distribution over words.
• Each document has a distribution over its topics (typically 1 to 5 main topics).
• The total number of topics is the only constant chosen by the user.
related course:
NPFL103 – Information Retrieval
Modelling Document Collections
Latent Dirichlet Allocation
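As a preview, Latent Dirichlet Allocation can be run in a few lines with scikit-learn (the library choice and the toy corpus are mine; the course will derive the model itself later):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cat dog mouse pet animal",
    "dog cat pet animal zoo",
    "stock market price trade money",
    "market price trade money bank",
]

# LDA works on raw word counts, not TF-IDF
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
# each row is a document's distribution over the 2 topics
doc_topics = lda.fit_transform(counts)
```

Unlike hard clustering, each document gets a probability for every topic, matching the "several relevant topics per document" goal above.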
Language Clustering
• What is the underlying structure of all world’s languages?
• Does their clustering match their linguistic categorization into language families?
Language vectors from multilingual MT visualized by t-SNE
Text Segmentation
We need some language units for processing text.
• Paragraphs or sentences are too long, characters are too short.
• Words? Some languages do not have words.
• Byte-Pair Encoding, Bayesian inference
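The Byte-Pair Encoding merge-learning loop fits in a few lines (a minimal illustration on a whitespace-tokenized toy corpus; real implementations also handle end-of-word markers and frequency files):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # represent each word as a tuple of symbols, with its corpus frequency
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the merge everywhere in the vocabulary
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "lower", "lowest", "low"], num_merges=2)
```

On this toy corpus the first merges fuse `l+o` and then `lo+w`, so the frequent stem "low" quickly becomes a single unit.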
Embeddings of Linguistic Units
• Word Embeddings
• Contextual Embeddings
• Sentence Embeddings
We have a huge number of texts.
• We want to find a vector of real numbers representing each word (or sentence).
• Similar words (or sentences) should be represented by similar vectors.
Are the methods like Skipgram, ELMo, and BERT unsupervised?
related course:
NPFL114 – Deep learning
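Whatever method produces the vectors, "similar vectors" is usually measured by cosine similarity. A toy illustration with made-up 3-dimensional vectors (the words and numbers are mine, purely for illustration, not real embeddings):

```python
import numpy as np

# toy "embeddings"; real ones come from Skipgram, ELMo, BERT, etc.
emb = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def most_similar(word):
    """Return the nearest other word by cosine similarity."""
    v = emb[word]
    def cos(u, w):
        return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))
    return max((w for w in emb if w != word), key=lambda w: cos(v, emb[w]))
```

Here `most_similar("cat")` returns `"dog"`, since their vectors point in nearly the same direction.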
Unsupervised Machine Translation
• Let’s suppose we have a huge number of comparable texts in two languages, but only very little or no parallel data.
• We want to infer a dictionary or a translation system.
related courses:
NPFL087 – Statistical Machine Translation
NPFL120 – Multilingual Natural Language Processing
Principal Component Analysis
• We want to describe a high-dimensional vector space.
• E.g. 512-dimensional vector space of word embeddings.
• What are the most important features of the space?
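PCA itself is only a few lines of linear algebra: center the data and project onto the top right-singular vectors (a NumPy sketch; the synthetic data and dimensions are my choice):

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # coordinates in the top-k components

rng = np.random.default_rng(0)
# 100 points in 5-d space that mostly vary along a single direction
X = rng.normal(size=(100, 1)) @ rng.normal(size=(1, 5)) \
    + 0.01 * rng.normal(size=(100, 5))
Z = pca(X, 2)
```

Because singular values are sorted, the first projected coordinate captures the dominant direction, so its variance dwarfs that of the second.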
Problems to Solve
Word Clustering, Language Clustering
• We can generate many features for each word or language.
• Goal: categorize words (into part-of-speech tags) or languages (into language families)
methods: K-Means, Mixture of Gaussians, Hierarchical Clustering
related course:
NPFL129 – Machine Learning for Greenhorns
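As a preview of the clustering lectures, hierarchical (agglomerative) clustering on toy feature vectors, using SciPy (the library and the data are my assumptions, not prescribed by the course):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy feature vectors: two well-separated groups in 2-d
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# build the merge tree bottom-up with average linkage
Z = linkage(X, method="average")

# cut the tree so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
```

The same feature vectors could equally be fed to K-Means or a Mixture of Gaussians; the hierarchical variant additionally yields the full merge tree (dendrogram).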