Machine Translation: Evaluation
Motivation for MT evaluation
- fluency: is the translation fluent, in a natural word order?
- adequacy: does the translation preserve meaning?
- intelligibility: do we understand the translation?
Evaluation scale
adequacy | | fluency | |
---|---|---|---|
5 | all meaning | 5 | flawless English |
4 | most meaning | 4 | good |
3 | much meaning | 3 | non-native |
2 | little meaning | 2 | disfluent |
1 | no meaning | 1 | incomprehensible |
Disadvantages of manual evaluation
- slow, expensive, subjective
- inter-annotator agreement (IAA) shows people agree more on fluency than on adequacy
- another option: is X better than Y? → higher IAA
- or time spent on post-editing
- or how much the cost of translation is reduced
Automatic translation evaluation
- advantages: speed, cost
- disadvantages: do we really measure quality of translation?
- gold standard: manually prepared reference translations
- candidate $c$ is compared with $n$ reference translations $r_i$
- the paradox of automatic evaluation: the task corresponds to a situation where students are asked to assess their own exam: how do they know where they made a mistake?
- various approaches: n-grams shared between $c$ and $r_i$, edit distance, …
Recall and precision on words
$$\text{precision} = \frac{\text{correct}}{\text{output-length}} = \frac{3}{6} = 50\%$$
$$\text{recall} = \frac{\text{correct}}{\text{reference-length}} = \frac{3}{7} \approx 43\%$$
$$\text{f-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = 2 \times \frac{.5 \times .43}{.5 + .43} \approx 46\%$$
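These numbers correspond to a 6-word candidate of which 3 words also occur in a 7-word reference. A minimal sketch of the word-overlap computation in Python; the example sentences below are made up only to reproduce the counts from the slide:

```python
from collections import Counter

def word_prf(candidate: str, reference: str):
    """Precision, recall and f-score on word overlap (counts clipped by the reference)."""
    cand, ref = candidate.split(), reference.split()
    correct = sum((Counter(cand) & Counter(ref)).values())  # words shared with the reference
    precision = correct / len(cand)
    recall = correct / len(ref)
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score

# hypothetical sentences: 3 of the 6 candidate words occur in the 7-word reference
p, r, f = word_prf("a cat was sat at mat", "the cat sat on the soft mat")
print(f"precision={p:.0%}  recall={r:.0%}  f-score={f:.0%}")  # 50%, 43%, 46%
```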
Recall and precision: shortcomings
metrics | system A | system B |
---|---|---|
precision | 50% | 100% |
recall | 43% | 100% |
f-score | 46% | 100% |
Neither metric penalizes wrong word order: system B contains every reference word and scores 100% even if the words come out in a scrambled order.
BLEU
- the standard metric (2001)
- IBM, Papineni et al.
- n-gram match between reference and candidate translations
- precision is calculated for 1-, 2- ,3- and 4-grams
- brevity penalty
$$\hbox{BLEU} = \min \left( 1, \frac{\text{output-length}}{\text{reference-length}} \right) \times \left( \prod_{i=1}^4 \text{precision}_i \right)^\frac{1}{4}$$
BLEU: an example
metrics | system A | system B |
---|---|---|
precision (1-gram) | 3/6 | 6/6 |
precision (2-gram) | 1/5 | 4/5 |
precision (3-gram) | 0/4 | 2/4 |
precision (4-gram) | 0/3 | 1/3 |
brevity penalty | 6/7 | 6/7 |
BLEU | 0% | 52% |
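A quick check of the table against the simplified formula above; the fractions are taken straight from the table, and the zero 3- and 4-gram precisions are what drive system A to 0%. (Real BLEU is computed at corpus level with an exponential brevity penalty; this follows the slide's min() form.)

```python
import math

def bleu(precisions, brevity_penalty):
    """Simplified BLEU from the slide: brevity penalty times the geometric mean of n-gram precisions."""
    return brevity_penalty * math.prod(precisions) ** (1 / len(precisions))

system_a = bleu([3/6, 1/5, 0/4, 0/3], brevity_penalty=6/7)
system_b = bleu([6/6, 4/5, 2/4, 1/3], brevity_penalty=6/7)
print(f"system A: {system_a:.0%}, system B: {system_b:.0%}")  # system A: 0%, system B: 52%
```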
NIST
- NIST: National Institute of Standards and Technology
- weighted matches of n-grams (information value)
- results very similar to those of BLEU (NIST is a BLEU variant)
NEVA
- Ngram EVAluation
- BLEU score adapted for short sentences
- it takes into account synonyms (stylistic richness)
WAFT
- Word Accuracy for Translation
- edit distance between $c$ and $r$
$$\hbox{WAFT} = 1 - \frac{d + s + i}{\max(l_r, l_c)}$$
where $d$, $s$, $i$ are the numbers of deleted, substituted and inserted words, and $l_r$, $l_c$ are the reference and candidate lengths.
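A minimal sketch, assuming a plain word-level Levenshtein distance (which equals $d + s + i$ for the optimal alignment); the example sentences are made up:

```python
def levenshtein(a, b):
    """Word-level edit distance (deletions + substitutions + insertions)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (wa != wb)))  # substitution (0 if the words match)
        prev = curr
    return prev[-1]

def waft(candidate, reference):
    c, r = candidate.split(), reference.split()
    return 1 - levenshtein(c, r) / max(len(c), len(r))

print(waft("the house is small", "the house is very small"))  # 1 - 1/5 = 0.8
```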
TER
- Translation Edit Rate
- the minimum number of edit steps (deletion, insertion, shift of a word block, substitution)
- $r =$ dnes jsem si při fotbalu zlomil kotník
- $c =$ při fotbalu jsem si dnes zlomil kotník
- TER = ?
$$\hbox{TER} = \frac{\hbox{number of edits}}{\hbox{avg. number of ref. words}}$$
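One way to work out the example above: two block shifts turn the candidate into the reference (move *dnes* to the front, then *jsem si* right after it), with no other edits needed; TER counts each shift of a contiguous block as a single edit, and the reference has 7 words, so:

$$\hbox{TER} = \frac{2}{7} \approx 29\%$$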
HTER
- Human TER
- the reference $r$ is prepared manually by post-editing the candidate (a targeted reference), then TER is applied
METEOR
- aligns hypotheses to one or more references
- exact, stem (morphology), synonym (WordNet), paraphrase matches
- various scores including WMT ranking and NIST adequacy
- extended support for English, Czech, German, French, Spanish, and Arabic.
- high correlation with human judgments
Evaluation of evaluation metrics
Correlation of automatic evaluation with manual evaluation.
EuroMatrix
EuroMatrix II
Round-trip translation
- a rough kind of evaluation: translate to another language and back, then compare the result with the original
- DEMO
Factored translation models I
- SMT models do not use linguistic knowledge
- lemmas, PoS tags and stems help with data sparsity
- translation of factor vectors instead of plain words (tokens)
- in standard SMT: dům and domy are independent tokens
- in FTM they share a lemma, PoS and morphological information (see the sketch below)
- lemma and morphological information are translated separately
- in the target language, the appropriate word form is then generated
- FTM in Moses
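A minimal sketch of what a factored token stream can look like; Moses separates factors on each token with `|`, but the factor order and tag names below are made up for illustration:

```python
# Each token carries surface|lemma|PoS|morphology; "domy" shares its lemma "dům" with "dům".
sentence = "domy|dům|N|pl jsou|být|V|pres-3pl velké|velký|A|pl"

for token in sentence.split():
    surface, lemma, pos, morph = token.split("|")
    print(f"{surface:6} lemma={lemma:6} PoS={pos} morph={morph}")
```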
Tree-based translation models
- SMT translates word sequences
- many situations are better explained with syntax: verb movement within a sentence, long-distance grammatical agreement, …
- → translation models based on syntactic trees
- for some language pairs it gives the best results
Synchronous phrase grammar
- EN rule: NP → DET JJ NN
- FR rule: NP → DET NN JJ
- synchronous rule NP → DET$_1$ NN$_2$ JJ$_3$ | DET$_1$ JJ$_3$ NN$_2$
- final rule N → dům | house
- mixed rule N → la maison JJ$_1$ | the JJ$_1$ house
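A minimal sketch of applying the synchronous NP rule above, going from the English side to the French side: the indices link the non-terminals of the two sides, so translating the children and emitting them in the other side's order produces the reordered NP (the toy lexicon is made up):

```python
# Synchronous rule NP -> DET_1 NN_2 JJ_3 | DET_1 JJ_3 NN_2 (French side | English side).
rule_fr = ["DET_1", "NN_2", "JJ_3"]   # French order:  la maison bleue
rule_en = ["DET_1", "JJ_3", "NN_2"]   # English order: the blue house

def apply_rule(src_words, src_rhs, tgt_rhs, lexicon):
    """Translate each child with a toy lexicon, then emit the children in target-side order."""
    translated = dict(zip(src_rhs, (lexicon[w] for w in src_words)))
    return " ".join(translated[nt] for nt in tgt_rhs)

lexicon = {"the": "la", "blue": "bleue", "house": "maison"}  # hypothetical word translations
print(apply_rule(["the", "blue", "house"], rule_en, rule_fr, lexicon))
# la maison bleue  -- the adjective moves after the noun, as the French side of the rule requires
```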
Parallel tree-bank
Syntactic rules extraction
Hybrid systems of machine translation
- combination of rule-based and statistical systems
- rule-based translation with post-editing by SMT (e.g. smoothing with a LM)
- data preparation for SMT based on rules, changing the output of SMT based on rules
Hybrid SMT+RBMT
- Chimera, UFAL
- TectoMT + Moses
- better than Google Translate (En-Cz)
Computer-aided Translation
- CAT – computer-assisted (aided) translation
- out of scope of pure MT
- tools belonging to CAT realm:
- spell checkers (typos): hunspell
- grammar checkers: Lingea Grammaticon
- terminology management: Trados TermBase
- electronic translation dictionaries: Metatrans
- corpus managers: Manatee/Bonito
- translation memories: MemoQ, Trados
Translation memory
- database of segments: titles, phrases, sentences, terms, paragraphs
- translated manually → translation units
- advantages:
- everything is translated only once
- reduced cost (repeated segments are not translated again)
- disadvantages:
- majority is commercial
- translation units are know-how
- bad translation is repeated
- CAT tools suggest translations based on an exact match
- vs. exact context match and fuzzy match (see the sketch below)
- combining with MT
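A minimal sketch of fuzzy matching against a translation memory, using a simple character-similarity ratio as the match score; real CAT tools use their own (usually proprietary) scoring, and the memory entries below are made up:

```python
import difflib

# hypothetical translation memory: source segment -> stored translation
memory = {
    "Click the Save button.": "Klikněte na tlačítko Uložit.",
    "Click the Cancel button.": "Klikněte na tlačítko Zrušit.",
}

def fuzzy_match(segment, memory, threshold=0.75):
    """Return the best (score, source, translation) above the threshold, or None."""
    best = max(
        ((difflib.SequenceMatcher(None, segment, src).ratio(), src, tgt)
         for src, tgt in memory.items()),
        key=lambda x: x[0],
    )
    return best if best[0] >= threshold else None

print(fuzzy_match("Click the Save button", memory))
# (~0.98, 'Click the Save button.', 'Klikněte na tlačítko Uložit.')
```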
Questions: examples
- Enumerate at least 3 rule-based MT systems.
- What does abbreviation FAHQMT mean?
- What does the IBM-2 model add to IBM-1?
- Explain noisy channel principle with its formula.
- State at least 3 metrics for MT quality evaluation.
- State types of translation according to R. Jakobson.
- What does Sapir-Whorf hypothesis claim?
- Describe Georgetown experiment (facts).
- State at least 3 examples of morphologically rich languages (different language families).
- What is the advantage of systems with interlingua over transfer systems? Draw a scheme of translations between 5 languages for these two types of systems.
- Give an example of a problematic string for tokenization (English, Czech).
- What is tagset, treebank, PoS tagging, WSD, FrameNet, gisting, sense granularity?
- What advantages does space-based meaning representation have?
- Which classes of WSD methods do we distinguish?
- Draw Vauquois’ triangle with SMT IBM-1 in it.
- Explain garden path phenomenon and come up with an example for Czech (or English) not used in slides.
- Draw the dependency structure for the sentence Máma vidí malou Emu.
- Draw the scheme of SMT.
- Give at least 3 sources of parallel data.
- Explain Zipf’s law.
- Explain (using an example) Bayes’ rule (state its formula).
- What is the purpose of decoding algorithms?
- Write down the formula or describe with words Markov’s assumption.
- Examples of frequent 3-, 4-grams (Cz, En).
- Do we aim at low or high perplexity in language models?
- Describe IBM models (1–5) briefly.
- Draw the word alignment matrix for the sentences "I am very hungry." and "Jsem velmi hladový."
last modified: 2023-11-20
https://vit.baisa.cz/notes/learn/mt-eval/