Machine Translation – Rule-based Systems
Rule-based Machine Translation
- RBMT
- linguistic knowledge in form of rules
- rules for analysis of SL
- rules for transfer between languages
- rules for generation/rendering/synthesis of TL
Knowledge-based Machine Translation
- systems using linguistic knowledge about languages
- older types, more general notion
- analysis of meaning of SL is crucial
- no total meaning (connotations, common sense)
- to translate vrána na stromě ("a crow on a tree") it is not necessary to know that vrána is a bird and that it can fly
- the term KBMT is used rather for systems with an interlingua
- for us KBMT = RBMT
KBMT classification
- direct translation
- systems with interlingua
- transfer systems
These were the only types of MT until the 1990s.
Direct translation
- the oldest systems
- one step process: transfer
- Georgetown experiment, METEO
- interest dropped quickly
Direct translation
- focus on correspondences between SL and TL elements
- first experiments on En-Ru pair
- all components are bound to a language pair (and one direction)
- typically consists of:
- translation dictionary
- monolithic program dealing with analysis and generation
- necessarily one-directional and bilingual
- efficiency: how many systems do we need for N languages?
- N x (N-1) direct bilingual systems / modules
MT with interlingua
- we suppose it is possible to convert SL to a language-independent representation
- interlingua (IL) must be unambiguous
- two steps: analysis & synthesis (generation)
- from IL, TL is generated
- analysis is SL-dependent but TL-independent
- and vice versa for synthesis
- for translation among N languages
- we need 2 x N systems / modules (a worked comparison follows below)
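A worked comparison of the module counts above (the choice N = 10 is just an illustration, not from the slides):

```latex
% one-directional modules needed to cover N languages
\text{direct: } N(N-1) \qquad \text{interlingua: } 2N
\qquad N = 10:\; 10 \cdot 9 = 90 \ \text{vs.}\ 2 \cdot 10 = 20
```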
Rosetta
It should be stressed that the isomorphy and not the interlinguality is the primary characteristic of our approach.
Two sentences are considered translations of each other if they have the same semantic derivation trees, i.e. corresponding syntactic derivation trees.
- grammar: 14,000 lines in Pascal
- all three versions: interlingua-based
- analysis: semantic derivation trees
- research, not commercial
| | Rosetta2 | Rosetta3 | Rosetta4 |
---|---|---|---|
release | 1985 | 1988 | 1991 |
speed | 1-3 words/sec | ? | ? |
dictionary | 5,000 | 90,000 | ? |
SL | NL, EN | NL, EN | NL, EN |
TL | NL, EN | NL, EN, ES | NL, EN, ES |
KBMT-89
Nirenburg, Sergei. Knowledge-based machine translation. Machine Translation 4.1 (1989): 5-24.
Transfer translation
- analysis up to a certain level
- transfer rules: SL forms → TL forms
- not necessarily between the same levels
- usually on the syntactic level
→ context constraints (not available in direct translation)
- the distinction between IL and transfer systems is sometimes blurred
- three-step translation: analysis, transfer, synthesis
PC translator (LangSoft)
Systran
Interlingua vs. transfer
Source language analysis
Tokenization
- first level in Vauquois’ △
- input text to tokens (words, numbers, punctuation)
- token = sequence of non-white characters
- output = list of tokens
- input for further processing
Obstacles of tokenization
- don’t: do n’t, do n ’t, don ’t, ?
- červeno-černý: červeno - černý, červeno-černý, červeno- černý
Scriptio continua
What is a word?
Tokenization
- in most cases a heuristic is used
- alphabetic writing systems: split on spaces and on other punctuation marks ?!.,-()/:; (a minimal sketch follows below)
- demo: unitok (alba)
- paper on tokenization
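A minimal tokenizer sketch in Python, illustrating the heuristic above (it is not the unitok tool itself); the clitic rule and the example sentence are my own additions:

```python
import re

# numbers first, then words (incl. hyphenated compounds and apostrophes),
# then any other single non-space character (punctuation)
TOKEN_RE = re.compile(r"""
    \d+(?:[.,]\d+)*     # numbers like 3.14 or 1,000
  | \w+(?:[-']\w+)*     # words such as červeno-černý or don't
  | \S                  # anything else, one character at a time
""", re.VERBOSE)

CLITIC_RE = re.compile(r"(?i)^(\w+)(n't)$")   # naive English clitic split

def tokenize(text):
    tokens = []
    for tok in TOKEN_RE.findall(text):
        m = CLITIC_RE.match(tok)
        tokens.extend(m.groups() if m else [tok])   # don't -> do + n't
    return tokens

print(tokenize("Don't split červeno-černý, but do split punctuation!"))
# ['Do', "n't", 'split', 'červeno-černý', ',', 'but', 'do', 'split', 'punctuation', '!']
```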
Sentence segmentation
- MT almost always uses sentences
- 90% of periods are sentence boundary indicators (Riley 1989)
- using a list of punctuation marks (!?.<>)
Měl jsem 5 (sic!) poznámek. (the inner "!" is not a boundary)
- exceptions:
- abbreviations (aj. atd. etc. e.g.)
- degrees (RNDr., prof.)
- HTML elements might be used (p, div, td, li)
- demo: tag_sentences
- a simple splitter sketch follows below
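A toy sentence splitter along these lines; the abbreviation list and the regular expression are illustrative only and not the tag_sentences tool:

```python
import re

# a tiny, illustrative abbreviation list; a real tool uses a much longer one
ABBREVIATIONS = {"aj", "atd", "apod", "etc", "e.g", "prof", "RNDr", "Mgr", "F"}

CANDIDATE_RE = re.compile(r"(?<=[.!?])\s+")   # whitespace right after . ! ?

def split_sentences(text):
    sentences, start = [], 0
    for m in CANDIDATE_RE.finditer(text):
        before = text[start:m.start()]
        words = before.rstrip(".!?").split()
        last_word = words[-1] if words else ""
        next_char = text[m.end():m.end() + 1]
        # do not split after an abbreviation or when the next "sentence"
        # starts with a lower-case letter
        if last_word.rstrip(".") in ABBREVIATIONS or next_char.islower():
            continue
        sentences.append(before.strip())
        start = m.end()
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

print(split_sentences("Měl jsem 5 (sic!) poznámek. Zkratky jako atd. nevadí."))
# ['Měl jsem 5 (sic!) poznámek.', 'Zkratky jako atd. nevadí.']
```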
Obstacles of sentence segmentation
- Zeleninu jako rajče, mrkev atd. Petr nemá rád.
- Složil zkoušku a získal titul Mgr. Petr mu dost záviděl.
- John F. Kennedy = one token? (John F. Kennedy’s)
- related to named entity recognition
- neglected step in the processing (DCEP, EUR-Lex)
Morphological level
Morphology
- morpheme: the smallest unit carrying meaning
- pří-lež-it-ost-n-ým-i
- prefix-root-infix-suffix-suffix-suffix-affix
- case, number, gender, lemma
Morphological level
- second level in Vauquois’ △
- reducing the immense amount of wordforms
- demo: lexicon sizes of various corpora
- conversion from wordforms to lemmata
give, gives, gave, given, giving → give
dělá, dělám, dělal, dělaje, dělejme, … → dělat
- analysis of grammatical categories of wordforms
dělali → dělat + past t. + continuous + plural + 3rd p.
did → do + past t. + perfective + person ? + number ?
Robertovým → Robert + case ? + adjective + number ?
- demo: wwwajka
- demo: majka (aurora)
Morphological analysis
- for each token we get a base form, grammar categories, segmentation to morphemes
- What is a base form? Lemma.
- nouns: singular, nominative, positive, masculine
- bycha → bych?, nejpomalejšími → pomalý, neschopný → schopný?
- verbs: infinitive
- neraď → radit?, bojím se → bát (se)
- Why the infinitive? It is the most frequent form of verbs.
- example
Morphological tags, tagset
- language-dependent (various morphological categories)
- attribute system: pairs category–value (a small decoder sketch follows below)
maminkou → k1gFnSc7
udělány → k5eAaPmNgFnP
- positional system: 16 fixed positions
kontury → NNFP1-----A----
zdají → VB-P---3P-AA---
- Penn Treebank tagset (English): limited set of tags
faster → RBR
doing → VBG
- CLAWS tagset (English)
- and others (German)
gigantische → ADJA.ADJA.Pos.Acc.Sg.Fem
erreicht → VVPP.VPP.Full.Psp
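A toy decoder for attributive tags such as k1gFnSc7; the attribute-to-name mapping below is a simplified, partly guessed subset of the real attribute system and is meant only to show the idea of category-value pairs:

```python
import re

# attribute letter -> (approximate) category name; illustrative subset only
CATEGORIES = {
    "k": "part of speech",   # k1 noun, k2 adjective, k5 verb, ...
    "g": "gender",           # M, I, F, N
    "n": "number",           # S singular, P plural
    "c": "case",             # 1-7
    "e": "negation",         # A affirmative, N negative
    "d": "degree",           # 1 positive, 2 comparative, 3 superlative
    "p": "person",
    "a": "aspect",
    "m": "verb form",
}

# a lower-case attribute letter followed by its upper-case or numeric value(s)
PAIR_RE = re.compile(r"([a-z])([A-Z0-9]+?)(?=[a-z]|$)")

def decode(tag):
    return {CATEGORIES.get(attr, attr): value for attr, value in PAIR_RE.findall(tag)}

print(decode("k1gFnSc7"))
# {'part of speech': '1', 'gender': 'F', 'number': 'S', 'case': '7'}
print(decode("k5eAaPmNgFnP"))
```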
DEMO: tag list
head -n 10000 VERT |  # first 10,000 lines of a vertical file (word, lemma, tag per line)
grep -v "^<" |        # drop structural lines such as <s> or <doc ...>
cut -f3 |             # keep the third column: the morphological tag
sort |                # group identical tags together
uniq -c |             # count them
sort -rn              # most frequent first
Morphological polysemy
- in many cases: words have more than one tag
- PoS polysemy (>1 lemma), in Czech
jednou → k4gFnSc7, k6eAd1, k9
ženu → k1gFnSc4, k5eAaImIp1nS
k1 + k2, k3 + k5?
- what about English?
- demo: SkELL auto PoS
- polysemy within a PoS
- in Czech: nominative = accusative
víno: k1gNnSc1, k1gNnSc4, …
odhalení: 10 tags
Morphological disambiguation
- for each word: one tag and one lemma
- morphological disambiguation
- a tool: tagger
- translational polysemy is another issue
pubblico – Öffentlichkeit, Publikum, Zuschauer
- most methods use context
- i.e. surrounding words, lemmas, tags
Statistical disambiguation
- the most probable sequence of tags
Ženu je domů. ("I am driving them home")
k5|k1, k3|k5, k6|k1
Mladé muže
gF|gM, nS|nP
- there are tough situations: dítě škádlí lvíče
- machine learning trained on manually tagged/disambiguated data
- Brill’s tagger (see later), TreeTagger, Freeling, RFTagger
- demo for Czech: Desamb (hybrid, DESAM)
- a toy Viterbi sketch follows below
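A toy bigram-HMM disambiguator decoded with Viterbi, to make "the most probable sequence of tags" concrete. All probabilities below are invented for illustration; a real tagger estimates them from manually annotated data such as DESAM:

```python
from math import log

# toy transition probabilities P(tag | previous tag); "<s>" is the sentence start
TRANS = {
    ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.4, ("<s>", "PRON"): 0.2, ("<s>", "ADV"): 0.1,
    ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.2, ("NOUN", "PRON"): 0.1, ("NOUN", "ADV"): 0.3,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.1, ("VERB", "PRON"): 0.4, ("VERB", "ADV"): 0.2,
    ("PRON", "VERB"): 0.4, ("PRON", "NOUN"): 0.2, ("PRON", "ADV"): 0.4,
    ("ADV", "VERB"): 0.4, ("ADV", "NOUN"): 0.3, ("ADV", "ADV"): 0.3,
}
# toy emission probabilities P(word | tag), covering only the ambiguous readings
EMIT = {
    ("ženu", "NOUN"): 0.5, ("ženu", "VERB"): 0.5,   # woman (acc.) / I chase
    ("je", "VERB"): 0.5, ("je", "PRON"): 0.5,       # is / them
    ("domů", "ADV"): 0.9, ("domů", "NOUN"): 0.1,    # home(wards) / of houses
}

def viterbi(words, tags=("NOUN", "VERB", "PRON", "ADV")):
    best = {"<s>": (0.0, [])}        # tag -> (log prob of the best path, the path)
    for w in words:
        best = {
            t: max(
                (p + log(TRANS.get((prev, t), 1e-6)) + log(EMIT.get((w, t), 1e-6)),
                 path + [t])
                for prev, (p, path) in best.items())
            for t in tags
        }
    return max(best.values())[1]

print(viterbi(["ženu", "je", "domů"]))
# ['VERB', 'PRON', 'ADV']  i.e. the reading "I am driving them home"
```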
Rule-based disambiguation
- the only option if an annotated corpus is not available
- also used as a filter before a statistical method
- rules help to capture wider context
- case, number and gender agreement in noun phrases
malému (c3, gIMN) chlapci (nPc157, nSc36, gM) (see the agreement-filter sketch below)
- a more mature approach: the valency structure of sentences
valency: vidět koho/co, to give OBJ to DIROBJ
vidím dům → c4
I gave the present to her → DIROBJ
- VerbaLex, PDEV
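A tiny illustration of the agreement rule as a filter over candidate analyses; the candidate feature sets mirror the malému chlapci example above, and the data structures are made up for this sketch:

```python
# each candidate analysis: category -> set of possible values
adj_malemu   = [{"g": {"I", "M", "N"}, "n": {"S"}, "c": {"3"}}]            # malému
noun_chlapci = [{"g": {"M"}, "n": {"P"}, "c": {"1", "5", "7"}},            # chlapci (pl.)
                {"g": {"M"}, "n": {"S"}, "c": {"3", "6"}}]                 # chlapci (sg.)

def agree(a, b, categories=("g", "n", "c")):
    """Two analyses agree if their value sets intersect in every category."""
    return all(a[cat] & b[cat] for cat in categories)

for a in adj_malemu:
    for n in noun_chlapci:
        if agree(a, n):
            print("compatible pair:", a, n)
# only the singular reading of "chlapci" (nSc36) survives the agreement check
```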
Morphological segmentation
- instead of lemma → root/stem?
- wordlist-based segmentation: Morfessor
- mít, měj, mám, měl, mívá, …: various wordforms of the same lemma
- -i, -ové, -a, -y: the same function, various morphemes
- morphological polysemy: -s (plural, 3rd person, possession)
- grammatical categories have particular forms: gramemes
nad-měr-ný, ne-patr-ně, vid-ím, ne-chci, čtyř-i-cet, po-po-sun-out, u-děl-al-i
pre-de-fine, co-oper-ate, think-ing, lion-s
- stemming is necessary if a morphological analyser is not available
- stemming is a good baseline (IR)
- unionize: union-ize or un-ionize?
Guesser
- we aim at high coverage: as many words as possible
- for out-of-vocabulary (OOV) tokens
- new, borrowed, compound words
- stemming, guessing PoSes from word suffix
- vygooglit, olajkovat, zaxzovat
- sedm dunhillek
- třitisícedvěstědevadesátpět znaků
- funny errors: Matka božit, topit box, lido na oko, jíst berma, mimochod
Morphological disambiguation—example
word | analyses | disambiguation |
---|---|---|
Pravidelné | k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1, k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1, k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1, k2eAgNnSc5d1, ... (+ 5) | k2eAgNnSc1d1 |
krmení | k2eAgMnPc1d1, k2eAgMnPc5d1, k1gNnSc1, k1gNnSc4, k1gNnSc5, k1gNnSc6, k1gNnSc3, k1gNnSc2, k1gNnPc2, k1gNnPc1, k1gNnPc4, k1gNnPc5 | k1gNnSc1 |
je | k5eAaImIp3nS, k3p3gMnPc4, k3p3gInPc4, k3p3gNnSc4, k3p3gNnPc4, k3p3gFnPc4, k0 | k5eAaImIp3nS |
pro | k7c4 | k7c4 |
správný | k2eAgMnSc1d1, k2eAgMnSc5d1, k2eAgInSc1d1, k2eAgInSc4d1, k2eAgInSc5d1, ... (+ 18) | k2eAgInSc4d1 |
růst | k5eAaImF, k1gInSc1, k1gInSc4 | k1gInSc4 |
důležité | k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1, k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1, k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1, k2eAgNnSc5d1, ... (+ 5) | k2eAgNnSc1d1 |
Universal POS tags
- the number of PoS tags differs across languages, hence the unification
TAG | Meaning |
---|---|
VERB | verbs (all tenses and modes) |
NOUN | nouns (common and proper) |
PRON | pronouns |
ADJ | adjectives |
ADV | adverbs |
ADP | adpositions (prepositions and postpositions) |
CONJ | conjunctions |
DET | determiners |
NUM | cardinal numbers |
PRT | particles or other functional words |
X | other: foreign words, typos, abbreviations |
. | punctuation |
Mappings exist for ca. 25 languages (those with treebanks); a toy Penn-to-universal fragment follows below.
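A toy subset of such a mapping for the Penn Treebank tagset; the published mappings cover all tags, this fragment only illustrates the idea:

```python
# an illustrative fragment of a Penn Treebank -> universal POS tag mapping
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "PRP": "PRON", "PRP$": "PRON",
    "IN": "ADP", "DT": "DET", "CC": "CONJ", "CD": "NUM",
    "RP": "PRT", "TO": "PRT", "POS": "PRT",
    "FW": "X", ".": ".", ",": ".", ":": ".",
}

def to_universal(tagged):
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged]

print(to_universal([("faster", "RBR"), ("doing", "VBG"), ("the", "DT")]))
# [('faster', 'ADV'), ('doing', 'VERB'), ('the', 'DET')]
```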
Guessing POSes from gramemes
EN | CZ | meaning |
---|---|---|
-s | -á | 3rd person, sing., present simple |
-ed | -al, -l, -en. | past tense |
-ing | -(ov)ání | present continuous |
-en | -en(.) | past participle |
-s | -y, -i, -ové, -a | plural |
-’s | ov(o, a, y) | possession |
-er | -ší | comparative |
-est | nej-, -ší | superlative |
you | -’s | pronoun |
A problem: myší, west, fotbal, … → myšám, wer, fotbala, božit (a naive guesser sketch follows below)
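A naive suffix-based guesser sketch in the spirit of the table above; the suffix rules and lemma heuristics are invented for illustration and, as the slide warns, such guessing over-generates easily:

```python
GUESS_RULES = [            # (suffix, guessed PoS, lemma heuristic)
    ("ovat", "verb",      lambda w: w),               # olajkovat, zaxzovat
    ("it",   "verb",      lambda w: w),               # vygooglit
    ("ější", "adjective", lambda w: w[:-4] + "ý"),    # zajímavější -> zajímavý
    ("ami",  "noun",      lambda w: w[:-3] + "a"),    # instr. pl. -ami -> nom. sg. -a
    ("ové",  "noun",      lambda w: w[:-3]),          # pánové-style plural
]

def guess(word):
    for suffix, pos, lemma in GUESS_RULES:
        if word.endswith(suffix):
            return {"word": word, "pos": pos, "lemma": lemma(word)}
    return {"word": word, "pos": "noun", "lemma": word}   # blind fallback

for w in ["olajkovat", "vygooglit", "dunhillek", "zajímavější"]:
    print(guess(w))
# "dunhillek" falls through to the fallback and gets a wrong lemma guess
```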
Brill’s tagger
- PhD thesis, 1995
- transformation-based, error-driven
- supervised learning
- accuracy over 90%
- algorithm (a condensed sketch of the learning loop follows below):
1. initialize the tagging (assign each word its most frequent tag)
2. compare with the training data
3. generate a set of candidate rules correcting the current tagging
4. score these rules
5. apply the best-scoring rule and go to 2
6. repeat until a threshold is reached
- example of a rule: IN NN WDPREVTAG DT while (change IN to NN when the previous tag is DT and the word is "while")
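A condensed sketch of the learning loop, restricted to a single rule template ("change tag A to B when the previous tag is P"); Brill's tagger uses many more templates, and the toy data below are invented:

```python
from itertools import product

def baseline(words, lexicon):
    """Assign every word its most frequent tag (NOUN for unknown words)."""
    return [lexicon.get(w, "NOUN") for w in words]

def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, previous_tag)."""
    frm, to, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == frm and tags[i - 1] == prev:
            out[i] = to
    return out

def learn(words, gold, lexicon, max_rules=5):
    tags = baseline(words, lexicon)
    tagset = sorted(set(gold))
    rules = []
    for _ in range(max_rules):
        errors = sum(t != g for t, g in zip(tags, gold))
        # score every candidate rule by how many errors it removes
        gain, rule = max(
            (errors - sum(t != g for t, g in zip(apply_rule(tags, r), gold)), r)
            for r in product(tagset, repeat=3))
        if gain <= 0:          # no transformation helps any more
            break
        rules.append(rule)
        tags = apply_rule(tags, rule)
    return rules, tags

# toy data: the baseline lexicon mistags "can" (the noun) as a verb
words   = ["the", "can", "rusts", "the", "can", "smells"]
gold    = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB"]
lexicon = {"the": "DET", "can": "VERB", "rusts": "VERB", "smells": "VERB"}

rules, tags = learn(words, gold, lexicon)
print(rules)   # [('VERB', 'NOUN', 'DET')]: change VERB to NOUN after a determiner
print(tags)    # matches the gold tags
```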
Problems of MA, POSes
- quality of MA affects all further levels of analysis
- precision depends on the language (English vs. Hungarian) and on the size of the tagset
- chončaam: my small house (Tajik)
- kahramoni: you are hero (Tajik)
- legeslegmagasabb: the very highest (Hungarian)
- raněný: SUBS / ADJ
- the big red fire truck: SUBS / ADJ?
- The Duchess was entertaining last night.
- Pokojem se neslo tiché pšššš.
Morphology—summary
- MA introduces critical errors into the analysis
- the goal is to limit the immense amount of wordforms
- wordform → lemma + tag
- much simpler for English (ca. 30 tags)
- PoS tagging accuracy depends on a language
- usually around 95%
Lexical level
Dictionaries in MT
- connection between languages
- transfer systems: syntactic level
- dictionaries crucial for KBMT systems
- GNU-FDL slovník
- Wiktionary
- BabelNet, OmegaWiki, …
- how many items in a dictionary do we need / want?
→ named entities, slang, MWEs
- listeme: a lexical item whose meaning cannot be deduced by the principle of compositionality (slaměný vdovec, "grass widower")
- which form goes into the dictionary? → lemmatization
- how many different senses is it reasonable to distinguish? → granularity
Polysemy in dictionaries
- words relate to senses
- what is the meaning of meaning?
- we need a formal definition for computers
- data is discrete, meaning is continuous
- man: an adult male person
what about a 17-year-old male person?
Smooth sense transitions
log → log chair → chair (image sequence)
Polysemy on several levels
- morphology: -s
- word level: key
- multiword expressions: bílá vrána
- sentence level: I saw a man with a telescope.
- homonymy: accidental
- full homonymy: líčit, kolej
- partial homonymy: los, stát
- polysemy is natural and ubiquitous
Meaning representation
- list: a common dictionary
- graph: senses as vertices, semantic relations as edges
- space: senses as points, similarity as distance
Semantic network—WordNet
- literal dát:8
- synset (louže:1, kaluž:1, tratoliště:1)
- semantic relations: hypernymy, hyponymy, holonymy, meronymy
- 150k items, 117k synsets: nouns, adjectives, verbs, adverbs
- WN is used as a reference inventory of senses (a small exploration sketch follows below)
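The same graph structure can be explored for the English (Princeton) WordNet through NLTK, assuming nltk and its wordnet data package are installed; the Czech WordNet referred to above is not distributed with NLTK:

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet') once

for synset in wn.synsets("cat")[:3]:
    print(synset.name(), "-", synset.definition())
    print("   lemmas:   ", synset.lemma_names())
    print("   hypernyms:", [h.name() for h in synset.hypernyms()])
```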
VerbaLex
- WordNet lacks syntax, morpho-syntax
- 6,256 synsets: atakovat:1, útočit:2, dorážet:3, napadnout:6
- valency frames (mačkat:1) and slots (19,247)
AG(person:1,kdo1)+VERB+OBJ(obj:1,co4)+[PART(hand:1,včem6)]
- semantic roles
I: ABS, ISUB, AG, KNOW, PAT, VERB, … (29)
II: abstraction:1, person:1, artifact:1, … (1,000)
- further constraints: prepositional cases, animacy, part of speech, obligation
- synsets linked to WordNet → dictionary
Word sense disambiguation
- finding a proper sense of a word in a given context
- trivial for human, very hard for computers
- classification task
- we need a finite inventory of senses
- when using WN: determine a proper synset
- hard to evaluate (SensEval, SemEval)
- accuracy about 90%
- crucial task for KBMT:
Ludvig dodávka Beethoven (a machine mistranslation of "Ludwig van Beethoven", with van rendered as the Czech noun dodávka), kiss me honey, …
box in the pen (Bar-Hillel)
WSD: deep methods
- knowledge about world (common sense)
- not feasible for general, unrestricted language
- knowledge representation: apples are red or green
- Lesk’s algorithm: overlap between the context and the sense definitions (a simplified sketch follows below)
- algorithms using valency dictionaries
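A simplified Lesk sketch over a made-up two-sense inventory, using Bar-Hillel's pen example; real implementations use WordNet glosses and proper preprocessing:

```python
# a made-up two-sense inventory; real implementations use WordNet glosses
SENSES = {
    "pen": {
        "writing instrument": "an instrument for writing with ink",
        "enclosure":          "a small enclosure in which animals or children are kept",
    },
}

def lesk(word, context):
    """Pick the sense whose definition shares the most words with the context."""
    context_words = set(context.lower().split())
    return max(SENSES[word],
               key=lambda sense: len(context_words & set(SENSES[word][sense].split())))

print(lesk("pen", "the box was in the pen where the animals are kept"))   # enclosure
print(lesk("pen", "he signed it with his favourite pen and blue ink"))    # writing instrument
```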
WSD: shallow methods
- words from context
- cheaper, faster implementation
- various methods of machine learning (classification)
- supervised, unsupervised
- variants of Brill’s algorithm might be used
- similar to morph. disambiguation
Granularity: cat
- feline mammal usually having thick soft fur and no ability to roar
- an informal term for a youth or man
- a spiteful woman gossip
- the leaves of the shrub Catha edulis (tea)
- a whip with nine knotted cords (British sailors feared the cat)
- a large tracked vehicle propelled by two endless metal belts
- any of several large cats living in the wild
- a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis
- CAT
Granularity: oko ("eye")
- zrakový orgán (the organ of sight)
- klička, smyčka, kroužek z různého materiálu (a small loop, noose or ring of various materials)
- věc připomínající tvarem oko (mořské oko) (a thing resembling an eye in shape, e.g. a mountain tarn)
- jednotka v kartách, loterii (a point in cards or in a lottery)
- druh karetní hry (a card game)
Granularity: dát ("to give")
- odevzdat do vlastnictví, darovat, prodat (hand over into someone's ownership, donate, sell)
- vyžádat, způsobit (require, cause: dá to mnoho práce, "it takes a lot of work")
- umístění něčeho (placing something somewhere)
- dopřát, dovolit, připustit (grant, allow, admit: nedej pane)
- projevit nedostatek odporu (show a lack of resistance: dát se ošidit, "let oneself be cheated")
- přikázat (order: dát něco udělat, "have something done")
VerbaLex lists 32 (!) senses (irreflexive variants).
Granularity: malý ("small")
- neveliký rozměry, počtem, časovým rozsahem (not large in size, number or time span)
- nedospělý (not grown up)
- slabý, nevydatný (weak, meagre: malý rozhled, "limited outlook")
- nevýznamný (insignificant: malý pán)
- téměř (almost: malý zázrak, "a minor miracle")
- děvčátko (a little girl: malá)
- přihrávka vlastnímu brankáři (a back-pass to one's own goalkeeper: malá domů)
Granularity for MT
The granularity of translation dictionaries may be enough: a word $w$ has exactly as many senses as it has equivalents in the dictionary.
What is the most polysemous word in English?
Answers from
Lexica: summary
- sense mainly on word level (dictionaries)
- compositionality reduces size of dictionaries
- WSD crucial for rule-based systems
- accuracy of WSD affected by a chosen granularity
- lexical polysemy is a bottleneck of (RB)MT
Kilgarriff, Adam. I don’t believe in word senses. Computers and the Humanities 31.2 (1997): 91–113.
Syntactic level
Syntactic analysis
- next level in MT triangle
- goal: finite structures for infinite number of sentences
- finite representation ~ finite grammar (rules)
- input (usually): morphologically annotated data
- output: syntactic tree, bush, forest, (multi)graph
- tools: parsers
- the task: for a given grammar and input sentence, generate all possible derivation trees
- potentially millions of different analyses
- demo: wwwSynt
- what is needed:
- a formalism
- a grammar in that formalism
- implementation of parsing algorithm
- currently, many parsers use statistics (rules can be learnt)
Context-free grammar
- rules of the form N -> α, where N is a non-terminal and α is a string of terminals and non-terminals
Grammars
- regular grammar
rules only of the form N -> epsilon | a | bB
- tree-adjoining grammar
the elementary units are trees (combined by substitution and adjunction)
Types of analyses
- top-down analysis
the left-most derivation of a sentence
- bottom-up analysis
rules rewriting the sentence up to the root non-terminal
Why syntactic analysis?
- semantic interpretation of source code (computer science)
- intermediate step to a semantic representation
- transfer systems: finite number of transfer rules for infinite number of phrases/sentences
- WSD: distant dependencies (wider context)
- what words belong together and what words do not
Syntactic ambiguity
- I saw a man with a telescope.
Viděl jsem muže (s) dalekohledem.
- I’m glad I’m a man, and so is Lola.
Jsem rád, že jsem muž a Lola také.
- Someone ate every tomato.
Někdo snědl všechna rajčata.
Každé rajče bylo někým snězeno.
- Lvíče škádlí dítě.
A child teases a lion cub.
A lion cub teases a child.
- Flying planes can be dangerous.
- Letadlo spadlo do pole za lesem. (PP attachment)
- Ženu holí stroj.
- Zabít ne propustit.
- Ibis, redibis nunquam per bella peribis.
- Rodiče by mu mohli závidět.
- Eat the pizza with a fork.
- Eat the pizza with the anchovies.
- Neboť každý, kdo prosí, dostává a kdo hledá, nalézá a tomu, kdo tluče, bude otevřeno. (Lk 11,10)
Partial syntactic polysemy—garden path
- The man returned to his house was happy.
- The man whistling tunes pianos.
- Time flies like an arrow; fruit flies like a banana.
- Ženu krávy nezajímají.
- The complex houses married and single soldiers and their families.
…cognitive plausibility of parsing.
Phrase structure
- one of the oldest formalisms (Chomsky)
- a grammar contains rewriting rules
- usually a context-free grammar
- captures hierarchy of constituents
Example
S -> NP VP
VP -> ADV V | V ADV
NP -> DET N
DET -> the | a | an
N -> cat | dog
...
Analyse: the dog runs fast (bottom-up and top-down); a small CYK sketch follows below.
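A minimal chart (CYK) parser sketch for the toy grammar above, which is already in the binary form CYK needs; the rules for V ("runs") and ADV ("fast") are not in the fragment on the slide and are added here as assumptions:

```python
from collections import defaultdict

BINARY = [            # A -> B C
    ("S", "NP", "VP"),
    ("VP", "ADV", "V"), ("VP", "V", "ADV"),
    ("NP", "DET", "N"),
]
LEXICAL = {           # A -> terminal
    "the": {"DET"}, "a": {"DET"}, "an": {"DET"},
    "cat": {"N"}, "dog": {"N"},
    "runs": {"V"},    # assumed, not on the slide
    "fast": {"ADV"},  # assumed, not on the slide
}

def cyk(words):
    n = len(words)
    chart = defaultdict(set)   # chart[(i, j)] = non-terminals deriving words[i:j]
    back = {}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(LEXICAL.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, b, c in BINARY:
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        chart[(i, j)].add(lhs)
                        back[(i, j, lhs)] = (k, b, c)
    return chart, back

def tree(back, words, i, j, sym):
    if j == i + 1:
        return (sym, words[i])
    k, b, c = back[(i, j, sym)]
    return (sym, tree(back, words, i, k, b), tree(back, words, k, j, c))

words = "the dog runs fast".split()
chart, back = cyk(words)
if "S" in chart[(0, len(words))]:
    print(tree(back, words, 0, len(words), "S"))
# ('S', ('NP', ('DET', 'the'), ('N', 'dog')), ('VP', ('V', 'runs'), ('ADV', 'fast')))
```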
Phrasal tree
Constituency (phrasal structure)
- suitable for: fixed word order, coordinations
- drawback: inability to express non-projectivity with an unbroken phrasal tree
nonprojective dependency = a dependency between two words separated by another word that depends on neither of them
- I saw a man with a dog yesterday which was a Yorkshire terrier.
Dependency structure
- captures dependencies between words
- dependency tree does not contain nonterminals
- a root and dependent (subordinate) words
- suitable for free word order languages
Dependency tree
Dependency
- suitable for: free word order, morphosyntactic agreement
- drawback: inability to express the complement (double dependency)
Babička seděla u stolu shrbená. (complement)
Babička seděla u stolu shrbeně. (adverbial)
- solution: hybrid trees
Hybrid trees
Evaluation of parsing quality
- which analysis is the best?
- quality evaluation is tough and hard to interpret
- usually a tree similarity to a gold standard (PDT)
- the best tools achieve accuracy around 85%
Transfer translation
From Arturo Trujillo, Translation Engines: Techniques for Machine Translation.
Writing rules
You like her. x Ella te gusta.
Classes of rules
- Head switching
The baby just ate × El bebé acaba de comer
- Structural
Peter entered the house × Petr vešel do domu
- Lexical gap
Mary (has) jumped up a little × Marie povyskočila
- Categorial (different syntactic categories)
A little bread × trochu (kousek) chleba
- Collocational, idiomatic
I am hungry × mám hlad
- Generalization / morphological
I am at school × Jsem ve škole.
A toy transfer-rule sketch follows below.
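A toy lexical-structural transfer rule table in the spirit of the classes above; the rule format, the pattern labels and the helper function are all made up for this sketch:

```python
TRANSFER_RULES = {
    # (EN lemma, EN pattern)         -> (CZ lemma, CZ pattern)
    ("enter", ("SUBJ", "OBJ")):        ("vejít", ("SUBJ", "do", "OBJ:c2")),   # structural
    ("be hungry", ("SUBJ",)):          ("mít hlad", ("SUBJ",)),               # collocational
    ("jump up a little", ("SUBJ",)):   ("povyskočit", ("SUBJ",)),             # lexical gap
}

def transfer(en_lemma, en_pattern):
    """Return the Czech lemma and pattern for a source-side pattern, or None."""
    return TRANSFER_RULES.get((en_lemma, tuple(en_pattern)))

print(transfer("enter", ["SUBJ", "OBJ"]))
# ('vejít', ('SUBJ', 'do', 'OBJ:c2'))  -- "Peter entered the house" -> "Petr vešel do domu"
```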
Semantic level / analysis
- representation of total meaning impossible: common sense, sensory reception, interpersonal relationships, nonverbal communication, …
- some transfer systems do not require sem. analysis
- boundary between syntax and semantics sometimes blurred (deep analysis)
- another level of language: pragmatics (speech acts)
- and logic: how large is its intersection with language? Is logic necessary for MT?
- arguments against IL: meaning is subjective, often language-, culture- and history-dependent
Semantic roles
- syntax allows to uncover semantic relations
- constituents of sentences correspond to semantic roles
- relation of predicate and other sentence constituents
- also referred to as semantic case, thematic role, theta role
- agent, causer, instrument, manner, patient, result, time, source
- various sets of roles, see e.g. VerbaLex (29 roles)
Dítě škádlí lvíče.
AG/SUBJ V PAT/OBJ
A child (SUBJ) teases (V) a lion cub (OBJ).
A lion cub (SUBJ) teases (V) a child (OBJ).
Errors propagated from below
zatímco trhal prsty svého pstruha
(George R. R. Martin, Hostina pro vrány, the Czech translation of A Feast for Crows)
FrameNet
- Fillmore, Berkeley
- electronic dictionary of semantic frames
- a frame describes a thing, a state or an action and their participants
- a situation: process of cooking comprises a cook, a meal, a pot, a heater, etc.
- frame Apply_heat, roles Cook, Food, Heating_instrument, …
- 800 frames, 10,000 lexical items, 120,000 annotated sentences
- training data for semantic role analysis
- demo
FrameNet: Closure
An Agent manipulates a Fastener to open or close a Containing_object (e.g. coat, jar). Sometimes an Enclosed_region or a Container_portal may be expressed. Since the Manipulator is syntactically omissible, many verbs in this frame incorporate the Fastener.
Mary closed her coat with a belt.
Prague Dependency TreeBank 2.0
- an application of theories of Prague linguistic circle
- functional generative description of language
- levels: phonological, phonetic, morphonological, morphematic, surface-syntactic and
- tectogrammatical: a level of language meaning
- a lower level is a form of a higher level and a higher level is a function of lower level
- 2 M morphologically, 1.5 M syntactically and 800 k semantically annotated words from newspapers in the Czech National Library
- contains also coreference and topic-focus structure
Petr gave Mary a bouquet. Then he took it and put it into a vase.
- nodes in the structure for unexpressed words
- relations among nodes on different levels
TectoMT
- MT system using PDT formalism, high modularity
- splitting tasks to a sequence of blocks—scenarios
- blocks are Perl scripts communicating via API
- structure of the system corresponds with PDT
- internal representation of language: trees in tmt format derived from PML format for PDT
- blocks allow massive data processing, parallelisation
- rule-based, statistical, hybrid methods
- processing: conversion to tmt format → application of a scenario → conversion to output format
TectoMT: a simple block
English negative particles → verb attributes
sub process_document {
    my ($self, $document) = @_;
    # a bundle holds all the trees for one sentence
    foreach my $bundle ($document->get_bundles()) {
        my $a_root = $bundle->get_tree('SEnglishA');    # English analytical tree
        foreach my $a_node ($a_root->get_descendants) {
            my ($eff_parent) = $a_node->get_eff_parents;
            # mark "not"/"n't" depending on a verb as an auxiliary of that verb
            if ($a_node->get_attr('m/lemma') =~ /^(not|n't)$/
                and $eff_parent->get_attr('m/tag') =~ /^V/) {
                $a_node->set_attr('is_aux_to_parent', 1);
            }
        }
    }
}
Analysis in RBMT
- morphology: basic word forms (lemmata)
- syntax: sentence level, parser, a chosen formalism
- semantics: representing meaning of lexical items, relations between words, usually on sentence level; usually limited to a domain (ontology)
- pragmatics, discourse analysis: above sentence level; anaphora, intention
Synthesis in RBMT, issues
- content exclusion: what goes into the output and what should be deduced by the recipient
Koupil jsem si nový mobil. Nový mobil má velký display. Nový mobil má velká tlačítka.
- proposition order
Nový mobil má velký display. Koupil jsem si nový mobil.
- lexical choice (related to WSD), but also
for my father: pro můj otec
- syntactic choice
Uvařil jsem guláš. Guláš byl mnou uvařen.
- constituent order
Uvařil jsem guláš. Guláš jsem uvařil.
- coreference: e.g. anaphora insertion / deletion
Koupil jsem nový mobil. Má velký display.
- generating surface structures (character sequences)
lemma + tag → wordform (a minimal generation sketch follows below)
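A minimal sketch of this last step (lemma + tag → wordform), using a hand-made inflection table built from the disambiguation example earlier in these notes (the lemmas are the obvious ones for that sentence); a real system queries a morphological generator such as majka instead:

```python
# illustrative lemma + tag -> wordform table for one sentence
INFLECTIONS = {
    ("pravidelný", "k2eAgNnSc1d1"): "pravidelné",
    ("krmení",     "k1gNnSc1"):     "krmení",
    ("být",        "k5eAaImIp3nS"): "je",
    ("pro",        "k7c4"):         "pro",
    ("správný",    "k2eAgInSc4d1"): "správný",
    ("růst",       "k1gInSc4"):     "růst",
    ("důležitý",   "k2eAgNnSc1d1"): "důležité",
}

def generate(lemma, tag):
    # fall back to the bare lemma when the table has no entry
    return INFLECTIONS.get((lemma, tag), lemma)

analysis = [("pravidelný", "k2eAgNnSc1d1"), ("krmení", "k1gNnSc1"),
            ("být", "k5eAaImIp3nS"), ("pro", "k7c4"),
            ("správný", "k2eAgInSc4d1"), ("růst", "k1gInSc4"),
            ("důležitý", "k2eAgNnSc1d1")]
print(" ".join(generate(l, t) for l, t in analysis).capitalize() + ".")
# Pravidelné krmení je pro správný růst důležité.
```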
Rule-based systems: conclusion
- (purely) rule-based systems have been superseded
- statistical systems achieve better results
- many linguistic phenomena are hard to distinguish even for humans
- interannotator agreement
- but still, some methods from RBMT may improve SMT
- RBMT development rather sluggish
last modified: 2023-11-20
https://vit.baisa.cz/notes/learn/mt-rules/