Introduction
David Mareček
September 30, 2021
NPFL097: Unsupervised Machine Learning in NLP
Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
About
Webpage: http://ufal.mff.cuni.cz/courses/npfl097
E-credits: 3
Examination: 1/1 C
Form:
• lectures, exercises, discussions
• 3 programming homeworks
• test
Recorded lectures:
Videos of lectures from the previous year are available in SIS.
Course passing requirements
Three programming assignments
• For each one you can obtain at most 10 points.
• You will always have at least three weeks to complete it.
• You will obtain only half of the points for assignments (or parts of them) delivered after the deadline.
Test
You can obtain 15 points from the theoretical test (on 4th January).
You pass the course by obtaining at least 30 points.
Schedule
Sep 30: Introduction to Unsupervised ML
Oct 7: Beta-Bernoulli probabilistic model
Oct 14: Dirichlet-Categorical probabilistic model, Categorical Mixture Model, EM
Oct 21: Gibbs Sampling, Latent Dirichlet Allocation
Oct 28: no lecture (national holiday)
Nov 4: Chinese Restaurant Process, Pitman-Yor Process, Text Segmentation
Nov 11: Unsupervised Tagging, Alignment, Parsing
Nov 18: K-Means, Mixture of Gaussians
Nov 25: Hierarchical Clustering, Clustering Evaluation, other clustering methods
Dec 2: Principal Component Analysis, Independent Component Analysis, CCA
Dec 9: (reserve)
Dec 16: Unsupervised Interpretation of Neural Networks
Dec 23: no lecture (Christmas holidays)
Dec 30: no lecture (Christmas holidays)
Jan 6: Test
Related courses
Basic probabilistic and ML concepts:
• NPFL067 – Statistical Methods in NLP I
• NPFL129 – Introduction to Machine Learning with Python
Basic deep-learning concepts:
• NPFL114 – Deep Learning
Other related courses:
• NPFL087 – Statistical Machine Translation
• NPFL103 – Information Retrieval
• NPFL120 – Multilingual Natural Language Processing
Reading
Books:
• Christopher Bishop: Pattern Recognition and Machine Learning, Springer-Verlag New York, 2006
• Kevin P. Murphy: Machine Learning: A Probabilistic Perspective, The MIT Press, Cambridge, Massachusetts, 2012
Tutorials, papers:
• Kevin Knight: Bayesian Inference with Tears, 2009
(https://www.isi.edu/natural-language/people/bayes-with-tears.pdf)
Unsupervised Machine Learning
Supervised Machine Learning
• We know what the output values for our samples should be.
• We have labelled (ground truth) data.
• Goal: to find a function that best approximates the relationship between input and output observable in the data.
Unsupervised Machine Learning
• We do NOT have any labelled data.
• Goal: to infer the natural structure or previously unknown patterns present within a given dataset.
Unsupervised Machine Learning
We use unsupervised learning if we want to answer the following questions:
• How do you find the underlying structure of a dataset?
→ e.g. Latent Variable Models
• How do you summarize it and group it most usefully?
→ e.g. Clustering
• How do you effectively represent data in a compressed format?
→ e.g. Dimensionality reduction
These are the goals of unsupervised learning, which is called “unsupervised” because you start with unlabeled data.
Machine learning methods that are optimized on one task, but whose internal learned representations are then used as outputs, may also be called unsupervised.
Exercises
Supervised or unsupervised learning?
• tokenization
• part-of-speech tagging
• dependency parsing
• word sense disambiguation
• word embeddings
• named entity recognition
• word alignment
• machine translation
• sentiment analysis
• document classification
• text summarization
Problem 1
Imagine you get thousands of text documents.
How would you separate documents into several groups according to their topics?
You need to do it in an unsupervised way: you do not have any annotated documents, nor do you know in advance which topics occur in the given set.
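One straightforward baseline (a sketch of my own, not the method the course will develop; scikit-learn and the toy documents are my assumptions) is to represent each document as a TF-IDF vector and cluster the vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# tiny toy corpus: two obvious topics (animals vs. finance)
docs = [
    "cat dog mouse pet",
    "dog cat pet animal",
    "stock market price trade",
    "market price trade money",
]

# bag-of-words vectors with TF-IDF weighting
X = TfidfVectorizer().fit_transform(docs)

# hard assignment of each document to one of two groups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Topic models such as LDA, covered later in the course, replace this hard cluster assignment with a distribution over topics per document.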
Problem 2
Imagine you get a very long text in an unknown language that does not use spaces between words.
How would you guess the word boundaries?
Unsupervised ML Problems
Modelling Document Collections
We want to find an underlying structure of a given set of documents.
Goal: assign a class to each document.
Better goal: find a set of topics and assign several relevant topics to each document.
• Each topic is represented by a distribution over words.
• Each document has a distribution over its topics (typically 1 to 5 main topics).
• The total number of topics is the only constant chosen by the user.
related course:
NPFL103 – Information Retrieval
Modelling Document Collections
Latent Dirichlet Allocation
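As a preview, Latent Dirichlet Allocation can be run in a few lines with scikit-learn (the library choice and the toy corpus are mine; the course will derive the model itself later):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cat dog mouse pet animal",
    "dog cat pet animal zoo",
    "stock market price trade money",
    "market price trade money bank",
]

# LDA works on raw word counts, not TF-IDF
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
# each row is a document's distribution over the 2 topics
doc_topics = lda.fit_transform(counts)
```

Unlike hard clustering, each document gets a probability for every topic, matching the "several relevant topics per document" goal above.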
Language Clustering
• What is the underlying structure of all world’s languages?
• Does their clustering match their linguistic categorization into language families?
Language vectors from multilingual MT visualized by t-SNE
Text Segmentation
We need some language units for processing text.
• Paragraphs or sentences are too long, characters are too short.
• Words? Some languages do not have words.
• Byte-Pair Encoding, Bayesian inference
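The Byte-Pair Encoding merge-learning loop fits in a few lines (a minimal illustration on a whitespace-tokenized toy corpus; real implementations also handle end-of-word markers and frequency files):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # represent each word as a tuple of symbols, with its corpus frequency
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the merge everywhere in the vocabulary
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "lower", "lowest", "low"], num_merges=2)
```

On this toy corpus the first merges fuse `l+o` and then `lo+w`, so the frequent stem "low" quickly becomes a single unit.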
Embeddings of Linguistic Units
• Word Embeddings
• Contextual Embeddings
• Sentence Embeddings
We have a huge number of texts.
• We want to find a vector of real numbers representing each word (or sentence).
• Similar words (or sentences) should be represented by similar vectors.
Are the methods like Skipgram, ELMo, and BERT unsupervised?
related course:
NPFL114 – Deep learning
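Whatever method produces the vectors, "similar vectors" is usually measured by cosine similarity. A toy illustration with made-up 3-dimensional vectors (the words and numbers are mine, purely for illustration, not real embeddings):

```python
import numpy as np

# toy "embeddings"; real ones come from Skipgram, ELMo, BERT, etc.
emb = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def most_similar(word):
    """Return the nearest other word by cosine similarity."""
    v = emb[word]
    def cos(u, w):
        return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))
    return max((w for w in emb if w != word), key=lambda w: cos(v, emb[w]))
```

Here `most_similar("cat")` returns `"dog"`, since their vectors point in nearly the same direction.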
Unsupervised Machine Translation
• Let’s suppose we have a huge number of comparable texts in two languages, but only very little or no parallel data.
• We want to infer a dictionary or a translation system.
related courses:
NPFL087 – Statistical Machine Translation
NPFL120 – Multilingual Natural Language Processing
Principal Component Analysis
• We want to describe a high-dimensional vector space.
• E.g. 512-dimensional vector space of word embeddings.
• What are the most important features of the space?
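PCA itself is only a few lines of linear algebra: center the data and project onto the top right-singular vectors (a NumPy sketch; the synthetic data and dimensions are my choice):

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # coordinates in the top-k components

rng = np.random.default_rng(0)
# 100 points in 5-d space that mostly vary along a single direction
X = rng.normal(size=(100, 1)) @ rng.normal(size=(1, 5)) \
    + 0.01 * rng.normal(size=(100, 5))
Z = pca(X, 2)
```

Because singular values are sorted, the first projected coordinate captures the dominant direction, so its variance dwarfs that of the second.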
Problems to Solve
Word Clustering, Language Clustering
• We can generate many features for each word or language.
• Goal: categorize words (into part-of-speech tags) or languages (into language families)
methods: K-Means, Mixture of Gaussians, Hierarchical Clustering
related course:
NPFL129 – Machine Learning for Greenhorns
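As a preview of the clustering lectures, hierarchical (agglomerative) clustering on toy feature vectors, using SciPy (the library and the data are my assumptions, not prescribed by the course):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy feature vectors: two well-separated groups in 2-d
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# build the merge tree bottom-up with average linkage
Z = linkage(X, method="average")

# cut the tree so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
```

The same feature vectors could equally be fed to K-Means or a Mixture of Gaussians; the hierarchical variant additionally yields the full merge tree (dendrogram).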