Status of the Czech/English Corpus

Reader's Digest

The Reader's Digest corpus is a parallel text of articles from Reader's Digest, years 1993-1996. The Czech part is a translation of the English one.

Some Statistics:

# of articles: 450
# of parallel sentences: 53,117
# of tokens in English part: 1,010,346 (after tokenization and normalization)
# of tokens in Czech part: 877,658 (after tokenization and normalization)
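
A few derived figures follow directly from the counts above, such as the average sentence length on each side and the Czech/English token ratio. The short Python sketch below only does that arithmetic; the variable names are illustrative and are not part of the corpus tools.

```python
# Counts copied from the statistics above.
articles = 450
sentence_pairs = 53_117
english_tokens = 1_010_346
czech_tokens = 877_658

# Simple ratios derived from those counts (names are illustrative).
derived = {
    "sentence pairs per article": sentence_pairs / articles,
    "English tokens per sentence": english_tokens / sentence_pairs,
    "Czech tokens per sentence": czech_tokens / sentence_pairs,
    "Czech/English token ratio": czech_tokens / english_tokens,
}

for name, value in derived.items():
    print(f"{name}: {value:.2f}")
```

On these counts the English side averages about 19 tokens per sentence and the Czech side about 16.5, so the Czech text is roughly 13% shorter in tokens.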

See the corpus analysis of the word-form version of the Czech-English Reader's Digest corpus.
 

See the corpus analysis of the lemmatized version of the Czech-English Reader's Digest corpus.

See the proposal for lemmatization, "How to get English-like Czech".
 

See the comparison of sentence alignments between the previous and the current version of this corpus.
 
 

Statistics on Different Corpus Sizes:

See the analysis of the 10k training corpus.

See the analysis of the 20k training corpus.

See the analysis of the 30k training corpus.

See the analysis of the 40k training corpus.

See the analysis of the 50k training corpus.
 
 

Comparison of Different Czech/English Translation Systems

CZ_or ... Czech tokenized original (from test data)
EN_or ... English tokenized original (from test data)
F_tok ... AlTemp for tokenized Czech input
F_lem ... AlTemp for lemmatized Czech input
PCtr .... PC Translator 98
Skik .... SKIK v4.0
 

Some Nice Examples:

CZ_or: jak je to možné ?
EN_or: how could that be ?
F_tok: how is that possible ?
F_lem: how could it be ?
PCtr:  as possible?
Skik:  how is it possible?

CZ_or: její hlava nehnutě trčela nad okrajem ledu ve vzdálenosti asi půlky fotbalového hřiště od nás .
EN_or: her head , motionless , showed above the rim of ice half a football field away .
F_tok: her head they 're motionless was still stuck over chip ice about this football field .
F_lem: his head motionless , protruding over the ice in the distance about half soccer - football field from me .
PCtr:  her head motionless stick supramarginal ice in distance perhaps half football playground from us.
Skik:  her head motionless rise up above the brimmed ice in the distance perhaps the half pitch from off us.

CZ_or: ohromilo nás , s jakou trpělivostí se nechala vykoupat v šamponu .
EN_or: the patience she showed as we shampooed her astounded us .
F_tok: see him ? us , with patience , she had left victims of shampoos .
F_lem: I was stunned . with patience track down to bath shampoos .
PCtr:  #Ohromilo# us, with what kind patience with leave take a batle in shampoo.
Skik:  dismay us, with the what patience leave #VYKOUPAT# in #ŠAMPONU#.

CZ_or: " letadlo je plné . "
EN_or: " the plane 's full . "
F_tok: " the plane filled with . " .
F_lem: " the plane 's full . " .
PCtr:  "aircraft abound with."
Skik:  "The aircraft of the them full "
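
The examples above are compared by eye. As a rough illustration only, and not a measure used on this page, the sketch below computes a clipped word-overlap precision of each system output against EN_or for the last example. The function names and the word-only tokenization are assumptions made for this sketch.

```python
import re
from collections import Counter

def words(text):
    # Lowercased word tokens only; punctuation and quotes are dropped.
    return re.findall(r"\w+", text.lower())

def overlap_precision(candidate, reference):
    # Share of candidate words that also occur in the reference,
    # counting each reference word at most as often as it appears there.
    cand = Counter(words(candidate))
    ref = Counter(words(reference))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total

# Last example above: EN_or and the four system outputs.
reference = "\" the plane 's full . \""
outputs = {
    "F_tok": "\" the plane filled with . \" .",
    "F_lem": "\" the plane 's full . \" .",
    "PCtr": "\"aircraft abound with.\"",
    "Skik": "\"The aircraft of the them full \"",
}

scores = {name: overlap_precision(text, reference) for name, text in outputs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Word overlap ignores word order and synonyms ("aircraft" versus "plane" counts as a miss), so it is at best a coarse complement to reading the outputs side by side.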
 

The rest of the translations.


Jan Curin

Last modified: Thu Jul 22 19:11:47 EDT 1999