Machine Translation – Introduction

Which one is human translation?

Moses is an implementation of the statistical (or data-driven) approach to machine translation (MT). This is the dominant approach in the field at the moment, and is employed by the online translation systems deployed by the likes of Google and Microsoft.

Mojžíš je implementace statistické (nebo řízené daty) přístupu k strojového překladu (MT). To je převládajícím přístupem v oblasti v současné době, a je zaměstnán pro on-line překladatelských systémů nasazených likes Google a Microsoft.

Moses je implementace statistického (nebo daty řízeného) přístupu k strojovému překladu (MT). V současné době jde o převažující přístup v rámci strojového překladu, který je použit online překladovými systémy nasazenými Googlem a Microsoftem.

Mojžíš je provádění statistické (nebo aktivovaný) přístup na strojový překlad (mt). To je dominantní přístup v oblasti v tuto chvíli, a zaměstnává on - line překlad systémů uskutečněné takové, Google a Microsoft.

Bibliography

  • John Hutchins: Machine translation: past, present, future
  • John Hutchins: An introduction to machine translation
  • Philipp Koehn: Statistical Machine Translation
  • Sergei Nirenburg et al.: Readings in Machine Translation
  • Jiří Levý: Umění překladu
  • Jiří Levý: České theorie překladu
  • Jeff Hawkins: On intelligence
  • Douglas Hofstadter: Le Ton beau de Marot

Translation

Translation is a transfer of a text from a source language to a target language. Interpreting is oral translation of spoken language.

  • technical translation × literary translation
  • exact reproduction × loose translational rephrasing

Context is crucial for translation. —Maimonides (12th century)

Each word is an element pulled out from a complex language system and its relations to other segments of the system differ in different languages. Each meaning (sense) is an element from a complex system of segments which a speaker divides reality into. —Werner Winter (1923–2010)

Which properties of a source should be preserved?

[Image: Jiří Levý]

Translation (Levý)

  • must reproduce 1) the words of the original, 2) the ideas of the original
  • should read like an original work
  • should read like a translation
  • should
    1. reflect the style of the original
    2. show the translator’s style
  • should read as a text from the period of
    1. the original
    2. the translator
  • may add to or omit from the original
  • should never add to or omit from the original

Translatology

  • deals with translation of texts between languages and semiotic systems
  • questions of accuracy (fidelity), translatability
  • translation between cultural areas, various periods
  • descriptive branch (critics, history) × applied (practice)
  • formed in the 1960s–70s, with a linguistic orientation
  • 1980s: moved close to literary theory
  • 1990s: focus turned to the translator himself or herself

Translator

What should a good translator know (Levý):

  • source language
  • target language
  • factual content of the text: facts of a period, the field (domain, for technical translation)

Levý on machine translation and artistic translation

“Machine translation’s goal is to fragment a sentence into the simplest comparable elements; artistic translation’s goal is the opposite: transferring the highest units.”

Types of translations according to Roman Jakobson

  • interlingual: transfer between different languages
  • intralingual: transfer within a language, e.g. to a different dialect, to a standard language etc.
  • intersemiotic: transfer between different semiotic systems (e.g. into sign language)

Questions

  • Is accurate translation possible at all?
  • Which is easier: translating from or into your mother tongue?
  • How do we know that $w_1$ is equivalent to $w_2$?
  • English wind types: airstream, breeze, crosswind, dust devil, easterly, gale, gust, headwind, jet stream, mistral, monsoon, prevailing wind, sandstorm, sea breeze, sirocco, southwester, tailwind, tornado, trade wind, turbulence, twister, typhoon, whirlwind, wind, windstorm, zephyr

Example of hard words

  • alkáč, večerníček, telka, čoklbuřt, knížečka, ČSSD … ?
  • matka, macecha, mamka, máma, maminka, matička, máti, mama, mamča, mamina
  • scvrnkls, nejneobhospo….nějšími
  • Navajo Code: language as a cipher
  • Leacock: Nonsense novels (Literární poklesky)

Palindrome

“Dammit I’m Mad” by Demetri Martin

Dammit I’m mad. Evil is a deed as I live. God, am I reviled? I rise, my bed on a sun, I melt. To be not one man emanating is sad. I piss. Alas, it is so late. Who stops to help? Man, it is hot. I’m in it. I tell. I am not a devil. I level “Mad Dog”. Ah, say burning is, as a deified gulp, In my halo of a mired rum tin. I erase many men. Oh, to be man, a sin. Is evil in a clam? In a trap? No. It is open. On it I was stuck. Rats peed on hope. Elsewhere dips a web. Be still if I fill its ebb. Ew, a spider… eh? We sleep. Oh no! Deep, stark cuts saw it in one position. Part animal, can I live? Sin is a name. Both, one… my names are in it. Murder? I’m a fool. A hymn I plug, deified as a sign in ruby ash, A Goddam level I lived at. On mail let it in. I’m it. Oh, sit in ample hot spots. Oh wet! A loss it is alas (sip). I’d assign it a name. Name not one bottle minus an ode by me: “Sir, I deliver. I’m a dog” Evil is a deed as I live. Dammit I’m mad.

Linguistic relativity

  • language properties substantially affect our view of the world
  • properties of different languages differ significantly
  • → their speakers live in different, incompatible worlds

The limits of my language mean the limits of my world. —Ludwig Wittgenstein

If Aristotle had spoken Chinese or Dakota he would have arrived at a totally different logic. —Fritz Mauthner

Wikipedia on Linguistic relativity

Dualism

  • mould theories: language and thinking are the same; we think in our language
  • cloak theories: language is only the surface; behind it lies a complex maze of thoughts

Where does linguistic relativity belong?

Sapir-Whorf hypothesis

  • important theory of psycholinguistics
  • language determines thought
  • 1930s, Edward Sapir, building on linguistic relativity
  • comparison of concepts in American-Indian and European languages
  • elaborated by Benjamin Lee Whorf
  • later criticized: when a falsifiable form of the hypothesis (colour concepts) was tested, the opposite turned out to be true

Machine Translation—definition

A discipline of computational linguistics dealing with the design, implementation and application of automatic systems (software) for translating texts with minimal human intervention.

E.g. a translation with an electronic dictionary does not belong to the field of machine translation.

Machine translation

We consider only technical / specialized texts:

  • web pages,
  • technical manuals,
  • scientific documents and papers,
  • leaflets and catalogues,
  • law texts and
  • in general, texts from specific domains.

Nuances at different language levels in literary texts are beyond the scope of current MT systems.

Machine translation: issues

In practice, MT output is always revised. We distinguish pre-editing (of the input text) and post-editing (of the MT output).

MT systems make different types of errors.

These mistakes are characteristic of human translators:

  • wrong prepositions (I am in school)
  • missing determiners (I saw man)
  • wrong tense (Viděl jsem: I was seeing), …

For computers, errors in meaning are characteristic:

  • Kiss me honey. → Polib mi med. (“honey” is translated as the food, not as a term of endearment)
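
The “Kiss me honey” error can be reproduced with a minimal sketch of naive word-for-word dictionary lookup. The three-entry dictionary and the function name `word_for_word` are illustrative assumptions, not part of any real MT system; the point is only that a lookup with a single sense per word has no way to choose between the “food” and “darling” readings of “honey”.

```python
# Toy English→Czech dictionary (hypothetical, for illustration only).
# Each word has exactly one listed sense, so no disambiguation is possible.
TOY_DICT = {
    "kiss": "polib",
    "me": "mi",
    "honey": "med",  # only the "food" sense is listed
}


def word_for_word(sentence, dictionary):
    """Translate by replacing each word independently; unknown words are kept."""
    words = sentence.lower().strip(".!?").replace(",", "").split()
    translated = [dictionary.get(word, word) for word in words]
    return " ".join(translated).capitalize() + "."


print(word_for_word("Kiss me honey.", TOY_DICT))  # prints "Polib mi med."
```

Choosing the right sense requires context beyond the single word, which is exactly what this lookup ignores.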

Taxonomy of MT errors

Costa, Ângela, et al. “A linguistically motivated taxonomy for Machine Translation error analysis.” Machine Translation 29.2 (2015): 127-161.

Lexical choice

A choice of a proper translational equivalent:

  • homonymy: pila, baby, ženu; byte, ate
  • polysemy: take, run, line; klíč, kohout, mít
  • synonymy: kluk, chlapec, hoch; dívka, holka, děvče

Word order

Free word order

Word order rule:

The more morphologically rich a language is, the freer its word order.


Katka snědla kousek koláče. (Katka ate a piece of cake.)

  • Kati megevett egy szelet tortát → Katie eating a piece of cake
  • Egy szelet tortát Kati evett meg → Katie ate a piece of cake
  • Kati egy szelet tortát evett meg → Katie ate a piece of cake
  • Egy szelet tortát evett meg Kati → Katie ate a piece of cake
  • Megevett egy szelet tortát Kati → Katie eating a piece of cake
  • Megevett Kati egy szelet tortát → Katie ate a piece of cake

Free word order in Czech

  • Víš, že se z kávy vyrábí mouka?
  • Víš, že se z kávy mouka vyrábí?
  • Víš, že se mouka vyrábí z kávy?
  • Víš, že se mouka z kávy vyrábí?
  • Víš, že se vyrábí mouka z kávy?
  • Víš, že se vyrábí z kávy mouka?

How do their meanings differ?
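
The six Czech variants above are exactly the 3! = 6 orderings of three constituents (“z kávy”, “vyrábí”, “mouka”) after the fixed prefix “Víš, že se”. A small sketch that enumerates them follows; the split into constituents and the helper name `word_order_variants` are assumptions made for illustration.

```python
from itertools import permutations

# Assumed constituent split of the clause after the fixed prefix.
CONSTITUENTS = ["z kávy", "vyrábí", "mouka"]


def word_order_variants(prefix, parts):
    """Return one sentence per ordering of the given constituents."""
    return [f"{prefix} {' '.join(order)}?" for order in permutations(parts)]


variants = word_order_variants("Víš, že se", CONSTITUENTS)
for sentence in variants:
    print(sentence)
```

All six strings are grammatical Czech, yet an MT system still has to decide which emphasis each order carries, which enumeration alone cannot tell it.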

Direct methods for improving MT quality

  • limit input to a:
    • sublanguage (indicative sentences)
    • domain (informatics)
    • document type (patents)
  • text pre-processing (e.g. manual syntactic analysis)

Basic terms

  • accuracy (precision)
  • intelligibility
  • fluency
  • source language, SL, L1
  • target language, TL, L2
  • corpus (plural corpora)
  • ambiguity, polysemy

Classification based on approach

  • rule-based, knowledge-based (RBMT, KBMT)
    • transfer
    • with interlingua
  • statistical machine translation (SMT)
  • hybrid machine translation (HMT, HyTran)

Vauquois’s triangle

[Figure: Vauquois’s triangle]

Interaction with a user

  • (human, manual translation)
  • machine-aided human translation (MAHT)
  • human-aided machine translation (HAMT)
  • fully automated high-quality machine translation (FAHQMT)
  • HAMT and MAHT together are known as CAT: computer-aided translation

Direction and arity

Arity:

  • bilingual systems
  • multilingual systems

Direction:

  • unidirectional
  • bidirectional

Systems of Machine Translation

Conferences, workshops, institutions

  • ACL: Annual meetings of the Association for Computational Linguistics
  • NIST: National Institute of Standards and Technology
  • Translating and the Computer (London)
  • RANLP: Recent Advances in Natural Language Processing
  • Workshop on Machine Translation (WMT)
  • The Conference of the Association for Machine Translation in the Americas
  • LREC: Language Resources and Evaluation Conferences
  • wiki-call-for-papers
  • List of labs

E-resources

Institutions

  • IAMT: International Association for Machine Translation
    • EAMT: European Association for Machine Translation
    • AMTA: The Association for MT in the Americas
    • AAMT: The Asia-Pacific Association for MT
  • META-NET: unites European MT departments
  • British Computer Society Natural Language Translation Group
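  • ÚFAL, MFF UK (Institute of Formal and Applied Linguistics, Charles University)
  • Obec překladatelů (Czech Literary Translators’ Guild)
  • Jednota tlumočníků a překladatelů (Union of Interpreters and Translators)
  • Ústav translatologie, FF UK (Institute of Translation Studies, Charles University)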

History of MT

Motivations for MT

  • information explosion
    • 1922: regular BBC radio broadcasts
    • 1923: radio broadcasts in Czechoslovakia
    • 1936: regular BBC TV broadcasts
    • 1953: TV broadcasts in Czechoslovakia
  • computer development
    • generation zero: Z1–Z3, Colossus, ABC, Mark I and II
    • first generation: ENIAC (Electronic Numerical Integrator And Computer, 1945), MANIAC

In 1947 RAM could store 100 numbers and $a + b$ took 1/8 s!

Early MT beliefs

  • translation is a repetitive activity, so computers can take it over
  • computers were successful in deciphering war codes: would they also be useful for MT?

Warren Weaver: When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.

First impulses

In 1949 Weaver sent a memorandum to 200 addressees in which he outlined some problems of MT.

  • polysemy (ambiguity) is a common phenomenon
  • intersection of logic and language
  • connections with cryptography
  • universal properties of languages

Early interest in MT arose at several departments: first at the University of London (Andrew D. Booth), and soon after at MIT, the University of Washington, the University of California, Harvard, …

Topics and first exchanges of experience

  • morphologic and syntactic analysis
  • meaning and knowledge representation
  • creating and working with electronic dictionaries
  • 1952—first public conference at MIT
  • 1954—first showcase of a working MT

Turing’s Test: using language as humans do is a sufficient operational test for intelligence.

Georgetown experiment

The first working prototype of MT.

  • IBM, New York
  • 7 January 1954: first public demonstration of MT
  • a computer applied to a non-numerical task
  • over 60 sentences (probably carefully selected)
  • a dictionary with 250 words
  • from Russian to English
  • grammar for Russian contained 6 rules
  • provoked enthusiasm: MT was obviously possible (despite being fraudulently presented)
  • many new projects, mainly in the USA and the USSR

Progress in 50’s

  • MT provoked development in these fields:
    • theoretical linguistics (Chomsky)
    • computational linguistics
    • artificial intelligence
  • with higher coverage, the quality of MT decreased
  • even the best systems (GAT, Georgetown, Ru→En) provided unsatisfying results
  • generating random love poems (1952)

Progress in 50’s

  • the first PhD thesis on MT defended (1954)
  • the journal Mechanical Translation launched (1954)
  • First international MT conference held in London (1956)
  • Noam Chomsky: Syntactic Structures (1957)
  • MT research in USSR, Japan
  • first book about MT (Introduction), Paris (1959)

60’s: disappointment with poor results

  • despite rather poor results, optimism prevailed
  • Yehoshua Bar-Hillel wrote a critique of the state of MT in 1959
  • he claimed computers are not capable of lexical disambiguation
  • fully automated high-quality translation (FAHQT) is unreachable
  • his example: Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.
  • funding for MT projects began to decrease

Progress in 60’s

  • MT in USSR focused on En scientific paper abstracts
  • Association for MT in USA (1962)
  • Peter Toma leaves Georgetown MT, develops AUTOTRAN, later Systran

ALPAC report, 1966

  • Automatic Language Processing Advisory Committee
  • a body under the U.S. National Academy of Sciences
  • it analysed and evaluated MT quality and usability
  • recommended reducing expenditure on MT support
  • negative impact on MT as a scientific field
  • the root problem was a strong underestimation of the complexity of natural language understanding
  • MT development continued without interruption in Europe, the USSR and Japan
  • it took MT in USA another 15 years to regain its previous respect and status

TAUM, METEO

TAUM

  • Traduction Automatique à l’Université de Montréal
  • Université de Montréal, 1965
  • prototypes of MT systems: TAUM-73, TAUM-METEO
  • first MT systems incorporating analysis of SL and synthesis of TL
  • EN → FR
  • TAUM Aviation (cancelled)

METEO

  • 1981–2001 used for weather forecast translation
  • author John Chandioux, Canada

Systran

  • one of the oldest MT companies (1968)
  • very popular translation system
  • basis for Yahoo Babelfish
  • until 2007 used even by Google
  • RBMT; hybrid translation since 2010
  • official MT system of the EEC from 1976

Renaissance—70’s

  • First Soviet MT program: AMPAR (En→Ru)
  • Systran installed at EC
  • Xerox uses Systran
  • a project proposes using Esperanto as interlingua (refused)

Renaissance—80’s

  • development of rule-based systems with interlingua
  • Rosetta project started (1980, logical interlingua)
  • first data-driven systems (Example-based MT)
  • boom of commercial MT systems
  • EUROTRA project (EU funded) began
  • IBM introduces 8-bit ASCII (1983)
  • Trados—the first company to develop CAT, Stuttgart, 1984
  • Unicode project (1987)
  • World Wide Web proposal (1989)

Renaissance—90’s

  • research on statistical MT (IBM)
  • SDL (CAT market leader) founded in UK (1992)
  • Verbmobil project (1992–99)
  • rule-based systems still dominating the field
  • AltaVista Babelfish (1997), 500,000 requests/day
  • first commercial online MT service, iTranslator

Renaissance—00’s

  • statistical systems dominate the field
  • quality of rule-based systems improved by statistical methods (hybrid systems)
  • new translation pairs
  • NIST launches first round of MT system benchmarking (2001)
  • EuroMatrix—a large scale EC funded project (2006)
  • Moses (open source statistical MT engine, 2007)

Too optimistic prognosis

So-called “hype”. The situation is similar today with artificial intelligence (Watson, Go, robotics).

[Figure: Evolution of MT]

Machine Translation nowadays

  • unprecedented computational power and data structures
  • these make it possible to work with billions of words almost instantly
  • Google 1 PB sort (2008)
    • 10 trillion 100-byte records
    • 6 hours; 4,000 PCs; 48,000 discs
    • MapReduce technique
  • Google Ngrams
  • development of MT systems for everyone
  • number of parallel corpora steadily increasing
  • focus on under-resourced languages (LREC)
  • MT quality improves slowly but steadily
  • SMT is slowly being replaced by neural network techniques
  • intensive acquisition of parallel (and comparable) data
  • development of MT systems based on evaluation metric outputs
  • USA: interest mainly in English as TL
  • EU: translation between the 24 official languages of the EU (EuroMatrix): English, Bulgarian, Czech, Croatian, Danish, Estonian, Finnish, French, Irish, Italian, Lithuanian, Latvian, Hungarian, Maltese, German, Dutch, Polish, Portuguese, Romanian, Greek, Slovak, Slovene, Spanish and Swedish.
  • big companies (Microsoft) focused on English as SL
  • large pairs (En:Sp, En:Fr): very good translation quality
  • SMT enriched with syntax, end-to-end NNMT
  • Google Translate as a gold standard
  • morphologically rich languages neglected (hard)
  • En:Xx and Xx:En pairs prevail

Motivation in 21st century

  • translation of web pages for gisting (getting the main message)
  • methods for substantially speeding up human translation (translation memories)
  • cross-language extraction of facts and search for information
  • instant translation of e-communication
  • translation on mobile devices

Conclusion

  • MT is an AI-complete problem
  • immense computational power at our disposal
  • commercial (market) potential is bigger than ever
  • there is always something to improve in MT
  • statistical methods seem to be more convenient (fast, cheap)
March 1, 2017