Machine Translation: Introduction
Which one is human translation?
Moses is an implementation of the statistical (or data-driven) approach to machine translation (MT). This is the dominant approach in the field at the moment, and is employed by the online translation systems deployed by the likes of Google and Microsoft.
Mojžíš je implementace statistické (nebo řízené daty) přístupu k strojového překladu (MT). To je převládajícím přístupem v oblasti v současné době, a je zaměstnán pro on-line překladatelských systémů nasazených likes Google a Microsoft.
Moses je implementace statistického (nebo daty řízeného) přístupu k strojovému překladu (MT). V současné době jde o převažující přístup v rámci strojového překladu, který je použit online překladovými systémy nasazenými Googlem a Microsoftem.
Mojžíš je provádění statistické (nebo aktivovaný) přístup na strojový překlad (mt). To je dominantní přístup v oblasti v tuto chvíli, a zaměstnává on - line překlad systémů uskutečněné takové, Google a Microsoft.
- John Hutchins: Machine translation: past, present, future
- John Hutchins: An introduction to machine translation
- Philipp Koehn: Statistical Machine Translation
- Sergei Nirenburg et al.: Readings in Machine Translation
- Jiří Levý: Umění překladu
- Jiří Levý: České theorie překladu
- Jeff Hawkins: On intelligence
- Douglas Hofstadter: Le Ton beau de Marot
Translation is a transfer of a text from a source language to a target language. Interpreting is oral translation of spoken language.
- technical translation × literary translation
- exact reproduction × loose translational rephrasing
The context is crucial for translation. —Maimonidés (12th century)
Each word is an element pulled out from a complex language system and its relations to other segments of the system differ in different languages. Each meaning (sense) is an element from a complex system of segments which a speaker divides reality into. —Werner Winter (1923–2010)
Which properties of a source should be preserved?
- must reproduce 1) words of the original 2) ideas of the original
- should read like an original work
- should read as a translation
- reflect style of the original
- show translator’s style
- should be read as a text falling into the period of
- the original
- the translator
- can add or skip something from the original
- should never add or skip anything from the original
- deals with translation of texts between languages and semiotic systems
- questions of accuracy (fidelity), translatability
- translation between cultural areas, various periods
- descriptive branch (critics, history) × applied (practice)
- formed 60’s–70’s, linguistic orientation
- 80’s—close to theory of literature
- 90’s—turned to a translator him/her-self
What should a good translator know (Levý):
- source language
- target language
- factual content of the text: facts of a period, the field (domain, for technical translation)
Levý on machine translation and artistic translation
“Machine translation’s goal is to fragment a sentence into the simplest comparable elements; artistic translation’s goal is the opposite: transferring the highest units.”
Types of translations according to Roman Jakobson
- interlingual: transfer between different languages
- intralingual: transfer within a language, e.g. to a different dialect, to a standard language etc.
- intersemiotic: transfer between different semiotic systems (sign language)
- Is accurate translation possible at all?
- What is easier: to translate from / to your mother tongue?
- How do we know that $w_1$ is equivalent to $w_2$?
- English wind types: airstream, breeze, crosswind, dust devil, easterly, gale, gust, headwind, jet stream, mistral, monsoon, prevailing wind, sandstorm, sea breeze, sirocco, southwester, tailwind, tornado, trade wind, turbulence, twister, typhoon, whirlwind, wind, windstorm, zephyr
Example of hard words
- alkáč, večerníček, telka, čoklbuřt, knížečka, ČSSD … ?
- matka, macecha, mamka, máma, maminka, matička, máti, mama, mamča, mamina
- scvrnkls, nejneobhospo….nějšími
- Navajo Code: language as a cipher
- Leacock: Nonsense novels (Literární poklesky)
“Dammit I’m Mad” by Demetri Martin
Dammit I’m mad. Evil is a deed as I live. God, am I reviled? I rise, my bed on a sun, I melt. To be not one man emanating is sad. I piss. Alas, it is so late. Who stops to help? Man, it is hot. I’m in it. I tell. I am not a devil. I level “Mad Dog”. Ah, say burning is, as a deified gulp, In my halo of a mired rum tin. I erase many men. Oh, to be man, a sin. Is evil in a clam? In a trap? No. It is open. On it I was stuck. Rats peed on hope. Elsewhere dips a web. Be still if I fill its ebb. Ew, a spider… eh? We sleep. Oh no! Deep, stark cuts saw it in one position. Part animal, can I live? Sin is a name. Both, one… my names are in it. Murder? I’m a fool. A hymn I plug, deified as a sign in ruby ash, A Goddam level I lived at. On mail let it in. I’m it. Oh, sit in ample hot spots. Oh wet! A loss it is alas (sip). I’d assign it a name. Name not one bottle minus an ode by me: “Sir, I deliver. I’m a dog” Evil is a deed as I live. Dammit I’m mad.
- language properties substantially affect our view of the world
- properties of different languages differ significantly
- → their speakers live in different, incompatible worlds
The limits of my language mean the limits of my world. —Ludwig Wittgenstein
If Aristotle had spoken Chinese or Dakota he would have arrived at a totally different logic. —Fritz Mauthner
Wikipedia on Linguistic relativity
- mould theories: language and thinking are the same, we think in our language
- cloak theories: language is on surface, behind is a complex maze of thoughts
Where does linguistic relativity belong?
- important theory of psycholinguistics
- language determines thought
- 1930s, Edward Sapir; derived from linguistic relativity
- comparison of concepts in American-Indian and European languages
- elaborated by Benjamin Lee Whorf
- later criticized: falsifiable form of the hypothesis (concepts for colours) showed the opposite to be true
A discipline of computational linguistics dealing with the design, implementation and application of automatic systems (software) for translating texts with minimal human intervention.
E.g. a translation with an electronic dictionary does not belong to the field of machine translation.
We consider only technical / specialized texts:
- web pages,
- technical manuals,
- scientific documents and papers,
- leaflets and catalogues,
- law texts and
- in general, texts from specific domains.
Nuances at different language levels in literary texts are beyond the scope of current MT systems.
Machine translation: issues
In practice, MT output is always revised. We distinguish pre-editing (of the input) and post-editing (of the output).
MT systems make different types of errors.
These mistakes are characteristic of human translators:
- wrong prepositions: (I am in school)
- missing determiners (I saw man)
- wrong tense (Viděl jsem: I was seeing), …
For computers, errors in meaning are characteristic:
- Kiss me honey. → Polib mi med.
Costa, Ângela, et al. “A linguistically motivated taxonomy for Machine Translation error analysis.” Machine Translation 29.2 (2015): 127-161.
A choice of a proper translational equivalent:
- homonymy: pila, baby, ženu; byte, ate
- polysemy: take, run, line; klíč, kohout, mít
- synonymy: kluk, chlapec, hoch; dívka, holka, děvče
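The meaning error shown earlier ("Kiss me honey." → "Polib mi med.") and the ambiguity classes in this list can be illustrated with a minimal sketch of naive word-by-word dictionary lookup. The lexicon `TOY_LEXICON` and the functions `translate_word_by_word` and `ambiguous_words` are hypothetical names made up for this illustration; they are not taken from any real MT system, and the ordering of senses is an assumption.

```python
# Toy English->Czech lexicon (an assumption for illustration only).
# Each word maps to a list of candidate equivalents; the first entry
# plays the role of the "default" sense chosen without any context.
TOY_LEXICON = {
    "kiss": ["polib"],
    "me": ["mi", "mě"],            # dative vs. accusative form
    "honey": ["med", "miláčku"],   # food sense vs. term of endearment
}


def tokenize(sentence: str) -> list[str]:
    """Lowercase, drop final punctuation and split on whitespace."""
    return sentence.lower().rstrip(".!?").split()


def translate_word_by_word(sentence: str) -> str:
    """Pick the first listed equivalent for every word, ignoring context.

    Words missing from the lexicon are copied unchanged.
    """
    return " ".join(TOY_LEXICON.get(w, [w])[0] for w in tokenize(sentence))


def ambiguous_words(sentence: str) -> dict[str, list[str]]:
    """Return the words that have more than one candidate equivalent."""
    return {
        w: TOY_LEXICON[w]
        for w in tokenize(sentence)
        if len(TOY_LEXICON.get(w, [])) > 1
    }


if __name__ == "__main__":
    print(translate_word_by_word("Kiss me honey."))
    print(ambiguous_words("Kiss me honey."))
```

Because the lookup never consults context, it always returns the food sense of "honey"; choosing the proper equivalent requires information beyond the single word.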
Free word order
Word order rule:
The more morphologically rich a language is, the freer its word order.
Katka snědla kousek koláče.
- Kati megevett egy szelet tortát → Katie eating a piece of cake
- Egy szelet tortát Kati evett meg → Katie ate a piece of cake
- Kati egy szelet tortát evett meg → Katie ate a piece of cake
- Egy szelet tortát evett meg Kati → Katie ate a piece of cake
- Megevett egy szelet tortát Kati → Katie eating a piece of cake
- Megevett Kati egy szelet tortát → Katie ate a piece of cake
Free word order in Czech
- Víš, že se z kávy vyrábí mouka?
- Víš, že se z kávy mouka vyrábí?
- Víš, že se mouka vyrábí z kávy?
- Víš, že se mouka z kávy vyrábí?
- Víš, že se vyrábí mouka z kávy?
- Víš, že se vyrábí z kávy mouka?
How do their meanings differ?
Direct methods for improving MT quality
- limit input to a:
- sublanguage (indicative sentences)
- domain (informatics)
- document type (patents)
- text pre-processing (e.g. manual syntactic analysis)
- accuracy (precision)
- source language, SL, L1
- target language, TL, L2
- corpus (plural corpora)
- ambiguity, polysemy
Classification based on approach
- rule-based, knowledge-based (RBMT, KBMT)
- with interlingua
- statistical machine translation (SMT)
- hybrid machine translation (HMT, HyTran)
Interaction with a user
- (human, manual translation)
- machine-aided human translation (MAHT)
- human-aided machine translation (HAMT)
- fully automated high-quality (FAHQMT)
- HAMT and MAHT:
- CAT: computer-aided translation.
Direction and arity
- bilingual systems
- multilingual systems
Systems of Machine Translation
- Apertium (RBMT, open-source)
- Babelfish (Yahoo)
- Caitra (CAT system)
- Cunei (data-driven MT)
- ČESILKO (Czech-Slovak translation)
- EuroTra (ambitious project of the EC)
- Google Translate
- Logos, OpenLogos (one of the oldest MT systems)
- METEO (weather forecasts, En:Fr)
- Moses (open-source SMT)
- Pangloss (example-based MT)
- Rosetta (contains a logic analysis of propositions)
- Systran (one of the oldest commercial MT systems)
- SDL Trados (translation memory, CAT system),
(in 2011 SDL acquired Language Weaver)
- Verbmobil (speech2speech for Ge, En, Jp)
- matecat (online CAT system), …
Conferences, workshops, institutions
- ACL: Annual meetings of the Association for Computational Linguistics
- NIST: National Institute of Standards and Technology
- Translating and the Computer (London)
- RANLP: Recent Advances in Natural Language Processing
- Workshop on Machine Translation (WMT)
- The Conference of the Association for Machine Translation in the Americas
- LREC: Language Resources and Evaluation Conferences
- List of labs
- IAMT: International Association for Machine Translation:
- EAMT: European Association for Machine Translation
- AMTA: The Association for MT in the Americas
- AAMT: The Asian-Pacific Association for MT
- META-NET: unites European MT departments
- British Computer Society Natural Language Translation Group
- UK MFF ÚFAL
- Obec překladatelů
- Jednota tlumočníků a překladatelů
- Ústav translatologie, FF UK
History of MT
Motivations for MT
- information explosion
- 1922: regular BBC radio broadcast
- 1923: radio broadcast in Czechoslovakia
- 1936: regular BBC TV broadcast
- 1953: TV broadcast in Czechoslovakia
- computer development
- generation zero—Z1–3, Colossus, ABC, Mark I,II
- first generation—ENIAC (Electronic Numerical Integrator And Computer, 1945), MANIAC
In 1947 RAM could store 100 numbers and $a + b$ took 1/8 s!
Early MT beliefs
- translation is a repetitive activity that can be taken over by computers
- computers were successful in deciphering war codes: would they be useful also for MT?
Warren Weaver: When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.
In 1949 Weaver sent a memorandum to 200 addressees in which he outlined some problems of MT.
- polysemy (ambiguity) is a common phenomenon
- intersection of logic and language
- connections with cryptography
- universal properties of languages
Early interest in MT arose at several departments: first at the University of London (Andrew D. Booth), soon after at MIT, University of Washington, University of California, Harvard, …
Topics and first exchanges of experience
- morphologic and syntactic analysis
- meaning and knowledge representation
- creating and working with electronic dictionaries
- 1952—first public conference at MIT
- 1954—first showcase of a working MT
Turing’s Test: using language as humans do is a sufficient operational test for intelligence.
The first working prototype of MT.
- IBM, New York
- 7 January 1954, first public demonstration of MT
- a computer applied to a non numerical task
- over 60 sentences (probably carefully selected)
- a dictionary with 250 words
- from Russian to English
- grammar for Russian contained 6 rules
- provoked enthusiasm: MT was obviously possible (despite being fraudulently presented)
- many new projects, mainly in the USA and the USSR
Progress in 50’s
- MT provoked development in these fields:
- theoretical linguistics (Chomsky)
- computational linguistics
- artificial intelligence
- with higher coverage, the quality of MT decreased
- even the best systems (GAT, Georgetown, Ru→En) provided unsatisfactory results
- generating random love poems (1952)
Progress in 50’s
- a first PhD thesis on MT defended (1954)
- Journal of Machine Translation (1954)
- First international MT conference held in London (1956)
- Noam Chomsky: Syntactic Structures (1957)
- MT research in USSR, Japan
- first book about MT (Introduction), Paris (1959)
60’s, Disappointments from poor results
- despite rather poor results, optimism prevailed
- Yehoshua Bar-Hillel wrote a critique of the state of MT in 1959
- he claimed computers are not capable of lexical disambiguation
- fully automated high-quality translation (FAHQT) unreachable
- Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.
- funding for MT projects began to decrease
Progress in 60’s
- MT in USSR focused on En scientific paper abstracts
- Association for MT in USA (1962)
- Peter Toma leaves Georgetown MT, develops AUTOTRAN, later Systran
ALPAC report, 1966
- Automatic Language Processing Advisory Committee
- a body under the U.S. National Academy of Sciences
- it carried out analyses and evaluations of MT quality and usability
- recommended to reduce expenditures for MT support
- negative impact on MT as a scientific field
- the underlying problem was a strong underestimation of the complexity of natural language understanding
- MT development continued without interruption in Europe, the USSR and Japan
- it took MT in USA another 15 years to regain its previous respect and status
- Traduction Automatique à l’Université de Montréal
- Université de Montréal, 1965
- prototypes of MT systems: TAUM-73, TAUM-METEO
- first MT systems incorporating analysis of SL and synthesis of TL
- EN → FR
- TAUM Aviation (cancelled)
- 1981–2001 used for weather forecast translation
- author John Chandioux, Canada
- one of the oldest MT companies (1968)
- very popular translation system
- basis for Yahoo Babelfish
- until 2007 used even by Google
- RBMT, since 2010, hybrid translation
- from 1976 the official MT system used by the EEC
- First Soviet MT program: AMPAR (En→Ru)
- Systran installed at EC
- Xerox uses Systran
- a project proposes using Esperanto as interlingua (refused)
- development of rule-based systems with interlingua
- Rosetta project started (1980, logical interlingua)
- first data-driven systems (Example-based MT)
- boom of commercial MT systems
- EUROTRA project (EU funded) began
- IBM introduces 8-bit ASCII (1983)
- Trados—the first company to develop CAT, Stuttgart, 1984
- Unicode project (1987)
- World Wide Web proposal (1989)
- research on statistical MT (IBM)
- SDL (CAT market leader) founded in UK (1992)
- Verbmobil project (1992–99)
- rule-based systems still dominating the field
- AltaVista Babelfish (1997), 500,000 requests/day
- first commercial online MT service, iTranslator
- statistical systems dominate the field
- quality of rule-based systems improved by statistical methods (hybrid systems)
- new translation pairs
- NIST launches first round of MT system benchmarking (2001)
- EuroMatrix—a large scale EC funded project (2006)
- Moses (open source statistical MT engine, 2007)
Too optimistic prognosis
So-called “hype”. The situation is similar today with artificial intelligence (Watson, Go, robotics).
Machine Translation nowadays
- unprecedented computational power, data structures
- makes it possible to work with billions of words almost instantly
- Google 1 PB sort (2008)
- 10 trillion 100-byte records ($10^{13} \times 100\,\text{B} = 10^{15}\,\text{B} = 1\,\text{PB}$)
- 6 hours; 4,000 PCs; 48,000 discs
- MapReduce technique
- Google Ngrams
- development of MT systems for everyone
- number of parallel corpora steadily increasing
- focus on under-resourced languages (LREC)
- MT quality improves slowly but steadily
- SMT slowly replaced by neural network techniques
- intensive acquisition of parallel (and comparable) data
- development of MT systems based on evaluation metric outputs
- USA: interest mainly in English as TL
- EU: translation between the 24 official languages of the EU (EuroMatrix): English, Bulgarian, Czech, Croatian, Danish, Estonian, Finnish, French, Irish, Italian, Lithuanian, Latvian, Hungarian, Maltese, German, Dutch, Polish, Portuguese, Romanian, Greek, Slovak, Slovene, Spanish and Swedish.
- big companies (Microsoft) focused on English as SL
- large pairs (En:Sp, En:Fr): very good translation quality
- SMT enriched with syntax, end-to-end NNMT
- Google Translate as a gold standard
- morphologically rich languages neglected (hard)
- En:Xx and Xx:En pairs prevail
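The MapReduce technique and the n-gram counts mentioned in the list above (Google Ngrams) can be sketched on a single machine. The following is only a minimal illustration of the map and reduce steps, not Google's implementation: a real system distributes the map step over many machines, and the function names `map_ngrams`, `reduce_counts` and `count_ngrams` are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Iterable


def map_ngrams(line: str, n: int = 2) -> Counter:
    """Map step: count the n-grams of one line independently of all others."""
    tokens = line.lower().split()
    # A line shorter than n tokens yields an empty range, hence no n-grams.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial count tables by summing the counts."""
    return left + right


def count_ngrams(lines: Iterable[str], n: int = 2) -> Counter:
    """Run the map step on every line and fold the partial results together."""
    return reduce(reduce_counts, (map_ngrams(line, n) for line in lines), Counter())


if __name__ == "__main__":
    corpus = ["the box was in the pen", "the pen was in the box"]
    for ngram, count in count_ngrams(corpus).most_common():
        print(" ".join(ngram), count)
```

Since each line is processed independently and the merge is associative, the map step can run in parallel on separate chunks of a corpus, which is what makes the approach suitable for very large data.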
Motivation in 21st century
- translation of web pages for gisting (getting the main message)
- methods for speeding-up human translation substantially (translation memories)
- cross-language extraction of facts and search for information
- instant translation of e-communication
- translation on mobile devices
- MT is among the AI-complete problems
- immense computational power at our disposal
- commercial (market) potential is bigger than ever
- there is always something to improve in MT
- statistical methods seem to be more convenient (fast, cheap)