Machine Translation – Rule-based Systems

Rule-based Machine Translation

  • RBMT
  • linguistic knowledge in the form of rules
  • rules for analysis of SL
  • rules for transfer between languages
  • rules for generation/rendering/synthesis of TL

Knowledge-based Machine Translation

  • systems using linguistic knowledge about languages
  • older types, more general notion
  • analysis of meaning of SL is crucial
  • no total meaning (connotations, common sense)
  • to translate vrána na stromě (a crow on a tree),
    it is not necessary to know that a crow is a bird and can fly
  • the term KBMT is used rather for systems with an interlingua
  • for us KBMT $=$ RBMT

KBMT classification

  • direct translation
  • systems with interlingua
  • transfer systems

The only types of MT until the 1990s.

Direct translation

  • focus on correspondences between S and T elements
  • first experiments on En-Ru pair
  • all components are bound to a language pair (and one direction)
  • typically consists of:
    • translation dictionary
    • monolithic program dealing with analysis and generation
  • necessarily one-directional and bilingual
  • scalability: for N languages we need
    N × (N−1) direct bilingual systems / modules

MT with interlingua

  • we suppose it is possible to convert SL to a language-independent representation
  • interlingua (IL) must be unambiguous
  • two steps: analysis & synthesis (generation)
  • from IL, TL is generated
  • analysis is SL-dependent but TL-independent
  • and vice versa for synthesis
  • for translation among N languages
  • we need 2 x N systems / modules
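
The two module counts (N × (N−1) for direct translation, 2 × N for an interlingua) can be compared with a few lines of arithmetic. This is only an illustrative sketch; the function names are made up for this example.

```python
def direct_modules(n):
    """One-directional bilingual systems needed for n languages."""
    return n * (n - 1)


def interlingua_modules(n):
    """One analysis and one synthesis module per language."""
    return 2 * n


# Print the two counts side by side for a few values of N.
for n in (2, 3, 5, 10, 24):
    print(n, direct_modules(n), interlingua_modules(n))
```

For N = 3 both designs need 6 modules; above that, the direct approach grows quadratically while the interlingua approach grows linearly.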

Rosetta

It should be stressed that the isomorphy and not the interlinguality is the primary characteristic of our approach.

Two sentences are considered translations of each other if they have the same semantic derivation trees, i.e. corresponding syntactic derivation trees.

  • grammar: 14,000 lines in Pascal
  • all three: interlingua-based
  • analysis: semantic derivational tree
  • research, not commercial
            Rosetta2        Rosetta3      Rosetta4
release     1985            1988          1991
speed       1-3 words/sec   ?             ?
dictionary  5,000           90,000        ?
SL          NL, EN          NL, EN        NL, EN
TL          NL, EN          NL, EN, ES    NL, EN, ES

KBMT-89

[Figure: KBMT-89 scheme]

Nirenburg, Sergei. Knowledge-based machine translation. Machine Translation 4.1 (1989): 5-24.

Transfer translation

  • analysis up to a certain level
  • transfer rules S forms → T forms
  • not necessarily between same levels
  • usually on syntactic level
    → context constraints
    (not available in direct translation)
  • distinction IL vs. transfer blurred
  • three-step translation

PC translator (LangSoft)

[Figure: PC Translator (LangSoft)]

Systran

[Figure: Systran]

Interlingua vs. transfer

[Figure: interlingua vs. transfer]

Source language analysis

Tokenization

  • first level in Vauquois’ △
  • input text to tokens (words, numbers, punctuation)
  • token = sequence of non-whitespace characters
  • output = list of tokens
  • input for further processing

Obstacles of tokenization

  • don’t: do n’t, do n ’t, don ’t, ?
  • červeno-černý: červeno - černý, červeno-černý, červeno- černý

Scriptio continua

Thai

What is a word?

Tokenization

  • in most cases a heuristic is used
  • alphabetic writing systems: split on spaces
    and on other punctuation marks ?!.,-()/:;
  • demo: unitok (alba)
  • paper on tokenization
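
The split-on-spaces-and-punctuation heuristic can be sketched as a single regular expression. This is a minimal illustration, not the behaviour of unitok; the pattern and the function name are assumptions made for this example.

```python
import re

# Assumption: a token is either a run of word characters (letters, digits,
# underscore; Unicode-aware in Python 3) or one single punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


print(tokenize("Don't split červeno-černý, please!"))
```

The heuristic splits don't into three tokens and červeno-černý at the hyphen, which is exactly the kind of decision the obstacles slide questions.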

Sentence segmentation

  • MT almost always uses sentences
  • 90% of periods are sentence boundary indicators (Riley 1989)
  • using list of punctuation (!?.<>)
    Měl jsem 5 (sic!) poznámek.
  • exceptions:
    • abbreviations (aj. atd. etc. e.g.)
    • degrees (RNDr., prof.)
  • HTML elements might be used (p, div, td, li)
  • demo: tag_sentences
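
A punctuation list combined with an exception list can be sketched as follows. This is a toy illustration, not the tag_sentences tool; the abbreviation list is deliberately tiny and hypothetical.

```python
# Hypothetical exception list: tokens that end in a period
# but are not treated as sentence boundaries.
ABBREVIATIONS = {"aj.", "atd.", "etc.", "e.g.", "prof.", "RNDr.", "Mgr."}


def split_sentences(text):
    """Split on tokens ending in . ! or ? unless they are known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Přišel prof. Novák. Odešel brzy."))
print(split_sentences("Měl jsem 5 (sic!) poznámek."))
```

Because the token (sic!) ends in a bracket, it does not trigger a split. The exception list is also the weak point: an abbreviation that really does end a sentence (for example titul Mgr. followed by a new sentence) is never split.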

Obstacles of sentence segmentation

  • Zeleninu jako rajče, mrkev atd. Petr nemá rád.
  • Složil zkoušku a získal titul Mgr. Petr mu dost záviděl.
  • John F. Kennedy = one token?
    John F. Kennedy’s
  • related to named entity recognition
  • neglected step in the processing (DCEP, EUR-Lex)

Morphological level

Morphology

  • morpheme: the smallest item carrying a meaning
  • pří-lež-it-ost-n-ým-i
  • prefix-root-infix-suffix-suffix-suffix-affix
  • case, number, gender, lemma

Morphologic level

  • second level in Vauquois’ △
  • reducing the immense amount of wordforms
  • demo: lexicon sizes of various corpora
  • conversion from wordforms to lemmata
    give, gives, gave, given, giving → give
    dělá, dělám, dělal, dělaje, dělejme, … → dělat
  • analysis of grammatical categories of wordforms
    dělali → dělat + past t. + continuous + plural + 3rd p.
    did → do + past t. + perfective + person ? + number ?
    Robertovým → Robert + case ? + adjective + number ?
  • demo: wwwajka
  • demo: majka (aurora)

Morphologic analysis

  • for each token we get a base form, grammar categories, segmentation to morphemes
  • What is a base form? Lemma.
  • nouns: singular, nominative, positive, masculine
  • bycha → bych?, nejpomalejšími → pomalý
    neschopný → schopný?
  • verbs: infinitive
  • neraď → radit?, bojím se → bát (se)
  • Why infinitive? the most frequent form of verbs
  • example

Morphological tags, tagset

  • language-dependent (various morphological categories)
  • attribute system: pairs category–value
    maminkou k1gFnSc7
    udělány k5eAaPmNgFnP
  • positional system: 15 fixed positions
    kontury NNFP1-----A----
    zdají VB-P---3P-AA---
  • Penn Treebank tagset (English): limited set of tags
    faster RBR
    doing VBG
  • CLAWS tagset (English)
  • and others (German)
    gigantische ADJA.ADJA.Pos.Acc.Sg.Fem
    erreicht VVPP.VPP.Full.Psp
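
An attribute-system tag such as k1gFnSc7 can be decoded into category-value pairs mechanically. The sketch below assumes that every attribute is one lowercase letter followed by a one-character value, which holds for the examples shown here but is not guaranteed for the whole tagset.

```python
import re

# Assumption: attribute = one lowercase letter, value = one uppercase
# letter or digit (true for the example tags, possibly not for all tags).
PAIR_RE = re.compile(r"([a-z])([A-Z0-9])")


def parse_attribute_tag(tag):
    """Turn 'k1gFnSc7' into {'k': '1', 'g': 'F', 'n': 'S', 'c': '7'}."""
    pairs = PAIR_RE.findall(tag)
    # Reject tags that are not fully covered by attribute-value pairs.
    if "".join(key + value for key, value in pairs) != tag:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return dict(pairs)


print(parse_attribute_tag("k1gFnSc7"))
print(parse_attribute_tag("k5eAaPmNgFnP"))
```

In the first example, k is the part of speech, g the gender, n the number and c the case, so maminkou is read as a feminine singular noun in case 7.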

DEMO: tag list

BNC tags

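# tag frequency list from the first 10,000 lines of a vertical-format corpus (VERT):
# skip structure lines starting with "<", keep the 3rd column (assumed to hold the tag),
# then count the distinct values and rank them by frequency
head -n 10000 VERT |\
grep -v "^<" |\
cut -f3 |\
sort |\
uniq -c |\
sort -rn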

Morphological polysemy

  • in many cases: words have more than one tag
  • PoS polysemy (>1 lemma), in Czech
    jednou k4gFnSc7, k6eAd1, k9
    ženu k1gFnSc4, k5eAaImIp1nS
    k1 + k2, k3 + k5?
  • what about English?
  • demo: SkELL auto PoS
  • polysemy within a PoS
  • in Czech: nominative = accusative
    víno: k1gNnSc1, k1gNnSc4, …
    odhalení: 10 tags

Morphological disambiguation

  • for each word: one tag and one lemma
  • morphological disambiguation
  • a tool: tagger
  • translational polysemy is another issue
    pubblico – Öffentlichkeit, Publikum, Zuschauer
  • most methods use context
  • i.e. surrounding words, lemmas, tags

Statistical disambiguation

  • the most probable sequence of tags
    Ženu je domů.
    k5|k1, k3|k5, k6|k1
    Mladé muže
    gF|gM, nS|nP
  • there are tough situations: dítě škádlí lvíče
  • machine learning trained on manually tagged/disambiguated data
  • Brill’s tagger (later), TreeTagger, Freeling, RFTagger
  • demo for Czech: Desamb (hybrid, DESAM)
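
Finding the most probable tag sequence is commonly done with the Viterbi algorithm. The sketch below runs it on Ženu je domů with the candidate tags listed above. All scores are invented for illustration; a real tagger would estimate them from manually disambiguated data, and the taggers named on this slide need not work this way.

```python
# Invented lexical scores: how likely each candidate tag is for the word.
CANDIDATES = [
    ("Ženu", {"k5": 0.5, "k1": 0.5}),
    ("je", {"k3": 0.4, "k5": 0.6}),
    ("domů", {"k6": 0.9, "k1": 0.1}),
]

# Invented transition scores between neighbouring tags ("<s>" = sentence start).
TRANSITION = {
    ("<s>", "k5"): 0.4, ("<s>", "k1"): 0.6,
    ("k5", "k3"): 0.6, ("k5", "k5"): 0.05,
    ("k1", "k3"): 0.1, ("k1", "k5"): 0.3,
    ("k3", "k6"): 0.5, ("k3", "k1"): 0.2,
    ("k5", "k6"): 0.2, ("k5", "k1"): 0.4,
}


def viterbi(candidates, transition, default=1e-6):
    """Return (score, tags) of the highest-scoring tag sequence."""
    best = {"<s>": (1.0, [])}  # tag -> (score of best path ending in tag, path)
    for _word, lexical in candidates:
        step = {}
        for tag, lex_score in lexical.items():
            # Pick the best predecessor for this tag.
            score, path = max(
                ((prev_score * transition.get((prev, tag), default) * lex_score, prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])


score, tags = viterbi(CANDIDATES, TRANSITION)
print(tags, score)
```

With these toy numbers the locally best first tag is k1, yet the best complete sequence starts with k5, which is why the whole sequence is scored rather than each word separately.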

Rule-based disambiguation

  • the only option if an annotated corpus is not available
  • also used as a filter before a statistical method
  • rules help to capture wider context
  • case, number and gender agreement in noun phrases
    malému (c3, gIMN) chlapci (nPc157, nSc36, gM)
  • more advanced: valency structure of sentences
    valency: vidět koho/co, to give OBJ to DIROBJ
    vidím dům → c4
    I gave the present to her → DIROBJ
  • VerbaLex, PDEV
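
The agreement rule for malému chlapci amounts to intersecting the readings of the two words. In the sketch below each reading is a (gender, number, case) triple; treating malému as singular is an assumption added for this example, since the slide lists only its case and genders.

```python
# Readings expanded from the compact notation on the slide.
# malému: c3, gIMN (singular is assumed here)
MALEMU = {(gender, "S", 3) for gender in "IMN"}
# chlapci: nPc157, nSc36, gM
CHLAPCI = ({("M", "P", case) for case in (1, 5, 7)}
           | {("M", "S", case) for case in (3, 6)})


def agree(adjective_readings, noun_readings):
    """Keep only the readings shared by the adjective and the noun."""
    return adjective_readings & noun_readings


print(agree(MALEMU, CHLAPCI))
```

Only one of the three adjective readings and one of the five noun readings survive, so the rule disambiguates both words at once.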

Morphologic segmentation

  • instead of lemma → root/stem?
  • wordlist-based segmentation: Morfessor
  • mít, měj, mám, měl, mívá,
    various wordforms of the same lemma
  • -i, -ové, -a, -y
    the same func., various morphemes
  • morphological polysemy: -s (pl., 3rd p., possession)
  • grammar categories have particular forms: grammemes
    nad-měr-ný, ne-patr-ně, vid-ím, ne-chci, čtyř-i-cet, po-po-sun-out, u-děl-al-i
    pre-de-fine, co-oper-ate, think-ing, lion-s
  • stemming necessary if morph. analyser not available
  • stemming is a good baseline (IR)
  • unionize: union-ize or un-ionize?

Guesser

  • we aim at high coverage: as many words as possible
  • for out-of-vocabulary (OOV) tokens
  • new, borrowed, compound words
  • stemming, guessing PoSes from word suffix
  • vygooglit, olajkovat, zaxzovat
  • sedm dunhillek
  • třitisícedvěstědevadesátpět znaků
  • funny errors: Matka božit, topit box, lido na oko, jíst berma, mimochod

Morphological disambiguation—example

word | analyses | disambiguation
Pravidelné | k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1, k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1, k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1, k2eAgNnSc5d1, ... (+ 5) | k2eAgNnSc1d1
krmení | k2eAgMnPc1d1, k2eAgMnPc5d1, k1gNnSc1, k1gNnSc4, k1gNnSc5, k1gNnSc6, k1gNnSc3, k1gNnSc2, k1gNnPc2, k1gNnPc1, k1gNnPc4, k1gNnPc5 | k1gNnSc1
je | k5eAaImIp3nS, k3p3gMnPc4, k3p3gInPc4, k3p3gNnSc4, k3p3gNnPc4, k3p3gFnPc4, k0 | k5eAaImIp3nS
pro | k7c4 | k7c4
správný | k2eAgMnSc1d1, k2eAgMnSc5d1, k2eAgInSc1d1, k2eAgInSc4d1, k2eAgInSc5d1, ... (+ 18) | k2eAgInSc4d1
růst | k5eAaImF, k1gInSc1, k1gInSc4 | k1gInSc4
důležité | k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1, k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1, k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1, k2eAgNnSc5d1, ... (+ 5) | k2eAgNnSc1d1

Universal POS tags

  • the number of PoS tags differs across languages: unification
TAG   Meaning
VERB  verbs (all tenses and modes)
NOUN  nouns (common and proper)
PRON  pronouns
ADJ   adjectives
ADV   adverbs
ADP   adpositions (prepositions and postpositions)
CONJ  conjunctions
DET   determiners
NUM   cardinal numbers
PRT   particles or other functional words
X     other: foreign words, typos, abbreviations
.     punctuation

Mappings exist for about 25 languages (with treebanks)
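
Such a mapping is essentially a lookup table from a language-specific tagset to the universal tags. The excerpt below covers a few Penn Treebank tags and is illustrative only; the complete mapping should be taken from the published mapping files, and the fallback to X for unknown tags is an assumption of this sketch.

```python
# Illustrative excerpt of a Penn Treebank -> universal tag mapping.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBG": "VERB",
    "JJ": "ADJ", "RBR": "ADV",
    "IN": "ADP", "DT": "DET",
    "CD": "NUM", "CC": "CONJ",
    "PRP": "PRON", "RP": "PRT",
    ".": ".",
}


def to_universal(tag):
    """Map a Penn tag to a universal tag; unknown tags fall back to X."""
    return PTB_TO_UNIVERSAL.get(tag, "X")


print([to_universal(t) for t in ("RBR", "VBG", "DT", "??")])
```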

Guessing POSes from grammemes

EN     CZ                  meaning
-s                         3rd person, sing., present simple
-ed    -al, -l, -en.       past tense
-ing   -(ov)ání            present continuous
-en    -en(.)              past participle
-s     -y, -i, -ové, -a    plural
-’s    ov(o, a, y)         possession
-er    -ší                 comparative
-est   nej-, -ší           superlative
you    -’s                 pronoun

A problem: myší, west, fotbal, … → myšám, wer, fotbala, božit
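
A suffix-based guesser and its typical failure can be sketched in a few lines. The rule list below is hypothetical and uses only some of the English suffixes from the table; it is not the guesser of any particular system.

```python
# Hypothetical suffix rules, longer suffixes first.
SUFFIX_RULES = [
    ("est", "superlative"),
    ("ing", "present continuous"),
    ("ed", "past tense"),
    ("er", "comparative"),
    ("s", "plural or 3rd person"),
]


def guess(word):
    """Return (stem, guessed meaning); meaning is None if no rule applies."""
    for suffix, meaning in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], meaning
    return word, None


print(guess("fastest"))  # a correct guess
print(guess("west"))     # a false positive: "west" is not a superlative
```

The second call shows how west ends up being analysed as a superlative, the kind of error listed in the problem line above.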

Brill’s tagger

  • PhD thesis, 1995
  • transformation-based, error-driven
  • supervised learning
  • accuracy over 90%
  • algorithm:
    1. initialize tagging (the most frequent tag)
    2. compare with training data
    3. make a set of rules to correct current tagging
    4. rate these rules
    5. apply the most salient rule and go to 2.
    6. repeat it until a threshold is reached
  • example of a rule: IN NN WDPREVTAG DT while
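
The example rule is read here as: change the tag IN to NN if the current word is while and the previous tag is DT. The sketch below applies one such rule to an initial tagging and counts the errors against gold data before and after, which is the quantity used to rate a rule. The sentence, the initial tags and the Rule type are made up for illustration; rule learning itself is not shown.

```python
from collections import namedtuple

# A rule of the WDPREVTAG type: from_tag -> to_tag if the current word
# equals `word` and the previous tag equals `prev_tag`.
Rule = namedtuple("Rule", "from_tag to_tag prev_tag word")


def apply_rule(rule, words, tags):
    """Return a new tag list with the rule applied at every matching position."""
    new_tags = list(tags)
    for i in range(1, len(words)):
        # The context is checked against the tags before this rule was applied.
        if (tags[i] == rule.from_tag and words[i] == rule.word
                and tags[i - 1] == rule.prev_tag):
            new_tags[i] = rule.to_tag
    return new_tags


def errors(tags, gold):
    """Number of positions where the tagging differs from the gold standard."""
    return sum(t != g for t, g in zip(tags, gold))


words = ["after", "a", "while", "he", "left"]
initial = ["IN", "DT", "IN", "PRP", "VBD"]  # assumed most frequent tags
gold = ["IN", "DT", "NN", "PRP", "VBD"]

rule = Rule(from_tag="IN", to_tag="NN", prev_tag="DT", word="while")
corrected = apply_rule(rule, words, initial)
print(errors(initial, gold), "->", errors(corrected, gold))
```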

Problems of MA, POSes

  • quality of MA affects all further levels of analysis
  • precision depends on a language (English vs. Hungarian) and the size of tagsets
  • chončaam: my small house (Tajik)
  • kahramoni: you are a hero (Tajik)
  • legeslegmagasabb: the very highest (Hungarian)
  • raněný: SUBS / ADJ
  • the big red fire truck: SUBS / ADJ?
  • The Duchess was entertaining last night.
  • Pokojem se neslo tiché pšššš.

Morphology—summary

  • MA introduces critical errors into the analysis
  • the goal is to limit the immense amount of wordforms
  • wordform → lemma + tag
  • much simpler for English (ca. 30 tags)
  • PoS tagging accuracy depends on a language
  • usually around 95%

Lexical level

Dictionaries in MT

  • connection between languages
  • transfer systems: syntactic level
  • dictionaries crucial for KBMT systems
  • GNU-FDL slovník
  • Wiktionary
  • BabelNet, OmegaWiki, …
  • how many items in a dict do we need / want?
    → named entities, slang, MWE
  • listeme: a lexical item whose meaning cannot be derived by the principle of compositionality (slaměný vdovec, literally "straw widower", i.e. grass widower)
  • which form in a dict? → lemmatization
  • how many different senses is it reasonable to distinguish? → granularity

Polysemy in dictionaries

  • words relate to senses
  • what is the meaning of meaning?
  • we need a formal definition for computers
  • data is discrete, meaning is continuous
  • man: an adult male person
    what about a 17-year-old male person?

Smooth sense transitions


[Figure: a gradual transition from a log, through a log chair, to a chair]

Polysemy on several levels

  • morphology: -s
  • word level: key
  • multiword expressions: bílá vrána
  • sentence level: I saw a man with a telescope.
  • homonymy: accidental
    • full homonymy: líčit, kolej
    • partial homonymy: los, stát
  • polysemy is natural and ubiquitous

Meaning representation

  • list: a common dictionary
  • graph: senses:vertices, semantic relations:edges
  • space: senses:dots, similarity:distance

[Figure: types of meaning representation]

Semantic network—WordNet

  • literal dát:8
  • synset (louže:1, kaluž:1, tratoliště:1)
  • semantic relations: hypernymy, hyponymy, holonymy, meronymy
  • 150k items, 117k synsets: n, adj, v, adv
  • WN used as a referential bank of senses

wordnet

VerbaLex

  • WordNet lacks syntax, morpho-syntax
  • 6,256 synsets: atakovat:1, útočit:2, dorážet:3, napadnout:6
  • valency frames (mačkat:1) and slots (19,247)
    AG(person:1,kdo1)+VERB+OBJ(obj:1,co4)+[PART(hand:1,včem6)]
  • semantic roles
    I: ABS, ISUB, AG, KNOW, PAT, VERB, … (29)
    II: abstraction:1, person:1, artifact:1, … (1,000)
  • further constraints:
    prepositional cases, animacy, part of speech, obligation
  • synsets linked to WordNet → dictionary

Word sense disambiguation

  • finding a proper sense of a word in a given context
  • trivial for humans, very hard for computers
  • classification task
  • we need a finite inventory of senses
  • when using WN: determine a proper synset
  • hard to evaluate (SensEval, SemEval)
  • accuracy about 90%
  • crucial task for KBMT:
    Ludvig dodávka Beethoven, kiss me honey, …
    box in the pen (Bar-Hillel)

WSD: deep methods

  • knowledge about world (common sense)
  • not suitable for general language
  • knowledge representation: apples are red or green
  • Lesk’s algorithm: context vs. definition
  • algorithms using valency dictionaries
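
The idea behind Lesk's algorithm (context vs. definition) is to pick the sense whose dictionary definition shares the most words with the context. Below is a simplified sketch for two senses of pen; the glosses, the sense labels and the stopword list are invented for this example and do not come from any particular dictionary.

```python
# Invented glosses for two senses of "pen".
GLOSSES = {
    "pen/writing": "an instrument for writing or drawing with ink",
    "pen/enclosure": "a small enclosure in which children play or animals are kept",
}
STOPWORDS = {"a", "an", "the", "in", "or", "with", "for", "are", "is", "which", "of", "and"}


def content_words(text):
    """Lowercased words of text without punctuation and stopwords."""
    return {w.strip(".,;").lower() for w in text.split()} - STOPWORDS


def lesk(context, glosses):
    """Return the sense whose gloss overlaps most with the context."""
    ctx = content_words(context)
    return max(glosses, key=lambda sense: len(ctx & content_words(glosses[sense])))


print(lesk("She signed the letter with a pen and ink.", GLOSSES))
print(lesk("The children left their toys in the pen.", GLOSSES))
```

When the context shares no word with any gloss, as in Bar-Hillel's the box is in the pen, the overlap is zero for every sense and the choice is arbitrary, which illustrates why such cases call for world knowledge.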

WSD: shallow methods

  • words from context
  • cheaper, faster implementation
  • various methods of machine learning (classification)
  • supervised, unsupervised
  • variants of Brill’s algorithm might be used
  • similar to morph. disambiguation

Granularity: cat

WordNet

  • feline mammal usually having thick soft fur and no ability to roar
  • an informal term for a youth or man
  • a spiteful woman gossip
  • the leaves of the shrub Catha edulis (tea)
  • a whip with nine knotted cords (British sailors feared the cat)
  • a large tracked vehicle propelled by two endless metal belts
  • any of several large cats living in the wild
  • a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis
  • CAT

Granularity: oko

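  • zrakový orgán [organ of sight]
  • klička, smyčka, kroužek z různého materiálu [a loop, noose or ring made of various materials]
  • věc připomínající tvarem oko (mořské oko) [a thing shaped like an eye (a mountain lake)]
  • jednotka v kartách, loterii [a unit in cards or a lottery]
  • druh karetní hry [a kind of card game]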

Granularity: dát

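  • odevzdat do vlastnictví, darovat, prodat [to transfer ownership, to donate, to sell]
  • vyžádat, způsobit (dá to mnoho práce) [to require, to cause (it takes a lot of work)]
  • umístění něčeho [placing something]
  • dopřát, dovolit, připustit (nedej pane) [to grant, to allow, to admit (Lord forbid)]
  • projevit nedostatek odporu (dát se ošidit) [to show a lack of resistance (to let oneself be cheated)]
  • přikázat (dát něco udělat) [to order (to have something done)]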

VerbaLex lists 32 (!) senses (irreflexive variants).

Granularity: malý

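  • neveliký rozměry, počtem, časovým rozsahem [not large in size, number or duration]
  • nedospělý [not grown up]
  • slabý, nevydatný (malý rozhled) [weak, meagre (a narrow outlook)]
  • nevýznamný (malý pán) [insignificant (a man of little importance)]
  • téměř (malý zázrak) [almost (a small miracle)]
  • děvčátko (malá) [a little girl]
  • přihrávka vlastnímu brankáři (malá domů) [a back pass to one's own goalkeeper]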

Granularity for MT

The granularity of translation dictionaries may be enough: a word $w$ has exactly as many senses as it has equivalents in the dictionary.

What is the most polysemous word in English?

Answers from wordnik, PDEV

Lexica: summary

  • sense mainly on word level (dictionaries)
  • compositionality reduces size of dictionaries
  • WSD crucial for rule-based systems
  • accuracy of WSD affected by a chosen granularity
  • lexical polysemy is a bottleneck of (RB)MT

Kilgarriff, Adam. I don’t believe in word senses. Computers and the Humanities 31.2 (1997): 91–113.

Syntactic level

Syntactic analysis

  • next level in MT triangle
  • goal: finite structures for infinite number of sentences
  • finite representation ~ finite grammar (rules)
  • input (usually): morphologically annotated data
  • output: syntactic tree, bush, forest, (multi)graph
  • tools: parsers
  • the task: for a given grammar and input sentence, generate all possible derivation trees
  • potentially millions of different analyses
  • demo: wwwSynt
  • what is needed:
    • a formalism
    • a grammar in that formalism
    • implementation of parsing algorithm
  • currently, many parsers use statistics (rules can be learnt)

Context-free grammar

Grammars

  • regular grammar
    rules only N -> epsilon | a | bB
  • tree-adjoining grammar
    nonterminals can be trees

Types of analyses

  • top-down analysis
    the left-most derivation of a sentence
  • bottom-up analysis
    rules rewriting a sentence to root non-terminal

Why syntactic analysis?

  • semantic interpretation of source code (computer science)
  • intermediate step to a semantic representation
  • transfer systems: finite number of transfer rules for infinite number of phrases/sentences
  • WSD: distant dependencies (wider context)
  • what words belong together and what words do not

Syntactic ambiguity

  • I saw a man with a telescope.
    Viděl jsem muže (s) dalekohledem.
  • I’m glad I’m a man, and so is Lola.
    Jsem rád, že jsem muž a Lola také.
  • Someone ate every tomato.
    Někdo snědl všechna rajčata.
    Každé rajče bylo někým snězeno.
  • Lvíče škádlí dítě.
    A child teases a lion cub.
    A lion cub teases a child.
  • Flying planes can be dangerous.
  • Letadlo spadlo do pole za lesem. (PP attachment)
  • Ženu holí stroj.
  • Zabít ne propustit.
  • Ibis, redibis nunquam per bella peribis.
  • Rodiče by mu mohli závidět.
  • Eat the pizza with a fork.
  • Eat the pizza with the anchovies.
  • Neboť každý, kdo prosí, dostává a kdo hledá, nalézá a tomu, kdo tluče, bude otevřeno. (Lk: 11,10)

Partial syntactic polysemy—garden path

  • The man returned to his house was happy.
  • The man whistling tunes pianos.
  • Time flies like an arrow; fruit flies like a banana.
  • Ženu krávy nezajímají.
  • The complex houses married and single soldiers and their families

…cognitive plausibility of parsing.

Phrase structure

  • one of the oldest formalisms (Chomsky)
  • a grammar contains rewriting rules
  • usually a context-free grammar
  • captures hierarchy of constituents

Example

S   -> NP VP
VP  -> ADV V | V ADV
NP  -> DET N
DET -> the | a | an
N   -> cat | dog
...

Analyse: the dog runs fast (bottom-up and top-down)
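
A top-down analysis of the dog runs fast can be sketched as a small recursive parser that returns every derivation tree. The rules V -> runs and ADV -> fast are assumptions filling in for the elided part of the grammar, and the naive strategy below would loop forever on a left-recursive grammar.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["ADV", "V"], ["V", "ADV"]],
    "NP": [["DET", "N"]],
    "DET": [["the"], ["a"], ["an"]],
    "N": [["cat"], ["dog"]],
    "V": [["runs"]],    # assumed, not in the example grammar
    "ADV": [["fast"]],  # assumed, not in the example grammar
}


def parse(symbol, tokens, start):
    """Yield (tree, next_position) for each way symbol derives tokens from start."""
    if symbol not in GRAMMAR:  # terminal: must match the next token
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(rhs, tokens, start, [symbol])


def expand(rhs, tokens, pos, tree):
    """Derive the remaining right-hand-side symbols one by one, left to right."""
    if not rhs:
        yield tuple(tree), pos
        return
    for subtree, next_pos in parse(rhs[0], tokens, pos):
        yield from expand(rhs[1:], tokens, next_pos, tree + [subtree])


def parse_sentence(sentence):
    """Return all trees rooted in S that cover the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


for tree in parse_sentence("the dog runs fast"):
    print(tree)
```

Trees are nested tuples whose first element is the nonterminal. An ambiguous grammar would simply yield more than one tree for the same sentence.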

Phrasal tree

Constituency (phrasal structure)

  • suitable for: fixed word order, coordinations
  • drawback: inability to express non-projectivity by unbroken phrasal tree
    nonprojective dependency = a dependency between two words separated by another word depending on neither of them
  • I saw a man with a dog yesterday which was a Yorkshire terrier.

Dependency structure

  • captures dependencies between words
  • dependency tree does not contain nonterminals
  • a root and dependent (subordinate) words
  • suitable for free word order languages

Dependency tree

Dependency

  • suitable for: free word order, morphosyntactic agreement
  • drawback: inability to express complement (double dependency)
    Babička seděla u stolu shrbená. (complement)
    Babička seděla u stolu shrbeně. (ADV)
  • solution: hybrid trees

Hybrid trees

Hybrid tree I

Hybrid tree (SET)

Evaluation of parsing quality

  • which analysis is the best?
  • quality evaluation is tough and hard to interpret
  • usually a tree similarity to a gold standard (PDT)
  • the best tools achieve accuracy around 85%
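
For dependency trees, one common similarity measure is the unlabeled attachment score: the share of words whose head was predicted correctly. The sketch below assumes that each tree is given as a list of head positions (0 for the root); the predicted heads are a made-up parser output, and this is not necessarily the measure behind the figure quoted above.

```python
def attachment_score(predicted_heads, gold_heads):
    """Fraction of words whose predicted head equals the gold head."""
    if len(predicted_heads) != len(gold_heads) or not gold_heads:
        raise ValueError("both trees must cover the same non-empty sentence")
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)


# Dítě škádlí lvíče: words 1 and 3 depend on the verb (word 2), the verb is the root.
gold = [2, 0, 2]
predicted = [2, 0, 1]  # hypothetical parser output with one wrong head
print(attachment_score(predicted, gold))
```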

Transfer translation

Transfer scheme

Example of transfer rules I

Example of transfer rules II

From Arturo Trujillo, Translation Engines: Techniques for Machine Translation.

Writing rules

You like her. x Ella te gusta.
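
In this pair the grammatical functions are swapped: the English subject becomes the Spanish indirect object and the English object becomes the Spanish subject. A transfer rule of this kind can be sketched on a simple predicate-argument structure. The dictionary layout, the tiny lexicon and the function name are assumptions for illustration; generating the surface form Ella te gusta is left to the synthesis step.

```python
# Hypothetical bilingual lexicon on the lemma level.
LEXICON = {"you": "tú", "she": "ella"}


def transfer_like(source):
    """EN like(SUBJ=x, OBJ=y) -> ES gustar(SUBJ=y, IOBJ=x)."""
    if source["pred"] != "like":
        return None  # the rule does not apply
    return {
        "pred": "gustar",
        "SUBJ": LEXICON[source["OBJ"]],   # the liked thing becomes the subject
        "IOBJ": LEXICON[source["SUBJ"]],  # the experiencer becomes the indirect object
    }


english = {"pred": "like", "SUBJ": "you", "OBJ": "she"}  # You like her.
print(transfer_like(english))
```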

Transfer syntax

Classes of rules

  • Head switching
    The baby just ate X El bebé acaba de comer
  • Structural:
    Peter entered the house X Petr vešel do domu
  • Lexical Gap
    Mary (has) jumped up a little X Marie povyskočila
  • Categorial (different synt. categories)
    A little bread X trochu (kousek) chleba
  • Collocational, idiomatic
    I am hungry X mám hlad
  • Generalization/morphological
    I am at school X Jsem ve škole.

Semantic level / analysis

  • representation of total meaning impossible: common sense, sensory reception, interpersonal relationships, nonverbal communication, …
  • some transfer systems do not require sem. analysis
  • boundary between syntax and semantics sometimes blurred (deep analysis)
  • another level of language: pragmatics (speech acts)
  • and logic: how large is its intersection with language? Is logic necessary for MT?
  • arguments against IL: meaning is subjective, often linguistically, culturally and historically dependent

Semantic roles

  • syntax allows us to uncover semantic relations
  • constituents of sentences correspond to semantic roles
  • relation of predicate and other sentence constituents
  • also referred to as semantic case, thematic role, theta role
  • agent, causer, instrument, manner, patient, result, time, source
  • various sets of roles, see e.g. VerbaLex (29 roles)
  Dítě     škádlí  lvíče.
  AG/SUBJ  V       PAT/OBJ

  A child (SUBJ)    teases (V) a lion cub (OBJ).
  A lion cub (SUBJ) teases (V) a child (OBJ).

Errors propagated from below

zatímco trhal prsty svého pstruha

(George R. R. Martin, Hostina pro vrány)

FrameNet

  • Fillmore, Berkeley
  • electronic dictionary of semantic frames
  • a frame describes a thing, a state or an action and their participants
  • a situation: process of cooking comprises a cook, a meal, a pot, a heater, etc.
  • frame Apply_heat, roles Cook, Food, Heating_instrument, …
  • 800 frames, 10,000 lex. items, 120,000 annotated sentences
  • training data for semantic role analysis
  • demo

FrameNet: Closure

Framenet

An Agent manipulates a Fastener to open or close a Containing_object (e.g. coat, jar). Sometimes an Enclosed_region or a Container_portal may be expressed. Since the Manipulator is syntactically omissible, many verbs in this frame incorporate the Fastener.

Mary closed her coat with a belt.

Prague Dependency TreeBank 2.0

  • an application of the theories of the Prague Linguistic Circle
  • functional generative description of language
  • levels: phonological, phonetic, morphonological, morphematic, surface syntax and the tectogrammatical level (the level of language meaning)
  • a lower level is a form of a higher level and a higher level is a function of a lower level
  • 2 M morphologically, 1.5 M syntactically and 800 k semantically annotated words from newspapers in the Czech National Corpus
  • contains also coreference and topic-focus structure
    Petr gave Mary a bouquet. Then he took it and put into a vase.
  • nodes in the structure for unexpressed words
  • relations among nodes on different levels

PDT layers

TectoMT

  • MT system using PDT formalism, high modularity
  • splitting tasks to a sequence of blocks—scenarios
  • blocks are Perl scripts communicating via API
  • structure of the system corresponds with PDT
  • internal representation of language: trees in tmt format derived from PML format for PDT
  • blocks allow massive data processing, parallelisation
  • rule-based, statistical, hybrid methods
  • processing: conversion to tmt format → application of a scenario → conversion to output format

TectoMT: a simple block

English negative particles → verb attributes

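sub process_document {
  my ($self,$document) = @_;

  # a bundle groups the trees that belong to one sentence
  foreach my $bundle ($document->get_bundles()) {
    # analytical (surface-syntax) tree of the English source sentence
    my $a_root = $bundle->get_tree('SEnglishA');

    foreach my $a_node ($a_root->get_descendants) {
      my ($eff_parent) = $a_node->get_eff_parents;
      # nodes without an effective parent cannot be attached to a verb
      next unless defined $eff_parent;
      # a negative particle governed by a verb is marked as auxiliary to it
      if ($a_node->get_attr('m/lemma')=~/^(not|n't)$/
          and $eff_parent->get_attr('m/tag')=~/^V/ ) {
        $a_node->set_attr('is_aux_to_parent',1);
      }
    }
  }
}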

Tecto-align

Analysis in RBMT

  • morphology: basic word forms (lemmata)
  • syntax: sentence level, parser, a chosen formalism
  • semantics: representing meaning of lexical items, relations between words, usually on sentence level; usually limited to a domain (ontology)
  • pragmatics, discourse analysis: above sentence level; anaphora, intention

Synthesis in RBMT, issues

  • content exclusion: what is an output and what should be deduced by a recipient
    Koupil jsem si nový mobil. Nový mobil má velký display. Nový mobil má velká tlačítka.
  • proposition order
    Nový mobil má velký display. Koupil jsem si nový mobil.
  • lexical choice (related to WSD), but also
    for my father: pro můj otec
  • syntactic choice
    Uvařil jsem guláš. Guláš byl mnou uvařen.
  • constituents order
    Uvařil jsem guláš. Guláš jsem uvařil.
  • coreference: e.g. anaphora insertion / deletion
    Koupil jsem nový mobil. Má velký display.
  • generating surface structures (character sequences)
    lemma+tag → wordform

Rule-based systems: conclusion

  • (purely) rule-based systems have been superseded
  • statistical systems achieve better results
  • many linguistic phenomena are hard to distinguish even for humans
  • interannotator agreement
  • but still, some methods from RBMT may improve SMT
  • RBMT development rather sluggish
March 1, 2017