This document describes how to use the word-aligner/fast_align tool to create a word alignment for a parallel corpus.
fast_align implements several simple lexical translation models (slightly improved variants of IBM Models 1 and 2). Under these models, the (conditional) likelihood of a parallel corpus has the general form $\prod_{(\mathbf{e},\mathbf{f})} \sum_{\mathbf{a}} \prod_{i} p(a_i \mid \mathbf{f}) \times p(e_i \mid f_{a_i})$, which is very efficient to evaluate. The EM algorithm is used to find model parameters that maximize this likelihood. Then, the single most probable alignment according to the learned parameters is inferred for each sentence pair.
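One way to see why this likelihood is cheap to evaluate: each alignment variable is chosen independently, so the sum over all alignment vectors can be pushed inside the product over target positions. The sketch below checks this numerically on a single toy sentence pair. The tables `align_prob` and `trans_prob` are made-up values chosen for illustration, not parameters learned by fast_align, and the null word is left out for brevity.

```python
from itertools import product
from math import isclose, prod

# Hypothetical toy sentence pair and parameters (not learned by fast_align).
f = ["das", "Haus"]    # source sentence
e = ["the", "house"]   # target sentence

# align_prob[i][j] stands in for p(a_i = j | f); each row sums to 1.
align_prob = [[0.7, 0.3],
              [0.2, 0.8]]

# trans_prob[(target_word, source_word)] stands in for p(e_i | f_j).
trans_prob = {
    ("the", "das"): 0.9, ("the", "Haus"): 0.1,
    ("house", "das"): 0.05, ("house", "Haus"): 0.8,
}


def score(i, j):
    """Contribution of aligning target position i to source position j."""
    return align_prob[i][j] * trans_prob[(e[i], f[j])]


def brute_force_likelihood():
    """Sum over every alignment vector a: len(f) ** len(e) terms."""
    total = 0.0
    for a in product(range(len(f)), repeat=len(e)):
        total += prod(score(i, a[i]) for i in range(len(e)))
    return total


def factorized_likelihood():
    """Product over target positions of a sum over source positions."""
    return prod(sum(score(i, j) for j in range(len(f)))
                for i in range(len(e)))


def most_probable_alignment():
    """Pick the best source position independently for each target word."""
    return [max(range(len(f)), key=lambda j: score(i, j))
            for i in range(len(e))]


print(brute_force_likelihood(), factorized_likelihood())
print(most_probable_alignment())
```

Both functions return the same value, but the factorized version needs only `len(e) * len(f)` terms per sentence pair. The same independence makes the most probable alignment a per-word argmax.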
Format of the parallel corpus
cdec tools, including fast_align, use a simple text format to represent parallel corpora. In this format, each parallel sentence is a single line of text with the two parts separated by a triple pipe (|||). Here is an example parallel corpus consisting of three sentences:
doch jetzt ist der Held gefallen . ||| but now the hero has fallen .
neue Modelle werden erprobt . ||| new models are being tested .
doch fehlen uns neue Ressourcen . ||| but we lack new resources .
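A minimal sketch of reading this format in Python, useful for sanity-checking a corpus before alignment. It assumes tokens are separated by whitespace and that the triple pipe is surrounded by single spaces, as in the example. The function name `parse_parallel_line` is hypothetical and not part of cdec.

```python
def parse_parallel_line(line):
    """Split one 'left ||| right' line into two lists of tokens."""
    parts = line.rstrip("\n").split(" ||| ")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one ' ||| ' separator: {line!r}")
    source, target = (side.split() for side in parts)
    if not source or not target:
        raise ValueError(f"empty side in line: {line!r}")
    return source, target


corpus_text = """doch jetzt ist der Held gefallen . ||| but now the hero has fallen .
neue Modelle werden erprobt . ||| new models are being tested .
doch fehlen uns neue Ressourcen . ||| but we lack new resources .
"""

pairs = [parse_parallel_line(line) for line in corpus_text.splitlines()]
for source, target in pairs:
    print(len(source), len(target))
```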
Running the aligner
The aligner can be run with the following command:
./word-aligner/fast_align -i corpus.de-en -d -v -o > corpus.de-en.fwd_align
The following options are used:
-i corpus.de-en specifies the input file.
-d uses a prior on the alignment points that favors alignments close to the “diagonal”.
-v infers parameters assuming a symmetric Dirichlet prior on the lexical translation distributions.
-o optimizes how strongly diagonal alignments are favored.
The output file has as many lines as the input file. Each line gives the alignment of the corresponding sentence pair as a sequence of source-target pairs, in the same format used for word alignment files by tools like Moses and Joshua. Here is an example alignment of the above parallel corpus:
0-0 1-1 2-4 3-2 4-3 5-5 6-6
0-0 1-1 2-2 2-3 3-4 4-5
0-0 1-2 2-1 3-3 4-4 5-5
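To make the alignment format concrete, the following sketch maps the index pairs of the first sentence back to words. It assumes, based on the example, that indices are zero-based and that the first index refers to the left (source) sentence and the second to the right (target) sentence. The helper names are hypothetical.

```python
def parse_alignment_line(line):
    """Turn '0-0 1-1 2-4' into a list of (source_index, target_index) pairs."""
    points = []
    for token in line.split():
        src, tgt = token.split("-")
        points.append((int(src), int(tgt)))
    return points


def aligned_words(source, target, points):
    """Look up the word pair behind each alignment point."""
    return [(source[i], target[j]) for i, j in points]


source = "doch jetzt ist der Held gefallen .".split()
target = "but now the hero has fallen .".split()
points = parse_alignment_line("0-0 1-1 2-4 3-2 4-3 5-5 6-6")

for src_word, tgt_word in aligned_words(source, target, points):
    print(f"{src_word} -> {tgt_word}")
```

For this sentence the pair 2-4 links "ist" to "has", and 3-2 links "der" to "the", which reflects the different word order in the two languages.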
Running the aligner in reverse mode
You can also run the aligner in reverse mode with the -r option, which tells the aligner to use the right sentence as the source and the left as the target.
./word-aligner/fast_align -r -d -v -o -i corpus.de-en > corpus.de-en.rev_align
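If both directions are needed, the two invocations can be scripted. The sketch below only assembles and runs the command lines shown in this document; it uses no flags beyond -i, -d, -v, -o and -r. The function names and the default binary path are assumptions that depend on where cdec was built.

```python
import shlex
import subprocess


def fast_align_command(corpus_path, reverse=False,
                       binary="./word-aligner/fast_align"):
    """Build the argument list for a forward or reverse alignment run."""
    args = [binary, "-i", corpus_path, "-d", "-v", "-o"]
    if reverse:
        args.append("-r")
    return args


def run_fast_align(corpus_path, output_path, reverse=False):
    """Run the aligner and redirect its standard output to output_path."""
    with open(output_path, "w", encoding="utf-8") as out:
        subprocess.run(fast_align_command(corpus_path, reverse),
                       stdout=out, check=True)


# Print the commands without running them.
print(shlex.join(fast_align_command("corpus.de-en")))
print(shlex.join(fast_align_command("corpus.de-en", reverse=True)))
```

Calling `run_fast_align("corpus.de-en", "corpus.de-en.fwd_align")` and then `run_fast_align("corpus.de-en", "corpus.de-en.rev_align", reverse=True)` would produce the two files used in the next step, provided the binary exists at the assumed path.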
Symmetrizing forward and reverse alignments
The “forward” and “reverse” alignments can be symmetrized using the atools command:
./utils/atools -i corpus.de-en.fwd_align -j corpus.de-en.rev_align -c grow-diag-final-and > corpus.de-en.gdfa
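To illustrate what symmetrization does, the sketch below computes the two simplest combinations of a forward and a reverse alignment: their intersection and their union. Heuristics such as grow-diag-final-and fall between these extremes, starting from the intersection and adding selected points from the union. This is not the atools implementation. It assumes that both files list points in the same left-right index order, and the `rev` line is a made-up example rather than real aligner output.

```python
def parse_points(line):
    """Parse '0-0 1-2' into a set of (left_index, right_index) tuples."""
    return {tuple(int(x) for x in token.split("-")) for token in line.split()}


def format_points(points):
    """Serialize alignment points back into the 'i-j i-j ...' format."""
    return " ".join(f"{i}-{j}" for i, j in sorted(points))


def intersect(fwd_line, rev_line):
    """Keep only points proposed by both directions (high precision)."""
    return format_points(parse_points(fwd_line) & parse_points(rev_line))


def union(fwd_line, rev_line):
    """Keep points proposed by either direction (high recall)."""
    return format_points(parse_points(fwd_line) | parse_points(rev_line))


fwd = "0-0 1-1 2-2 2-3 3-4 4-5"      # second sentence of the example above
rev = "0-0 1-1 2-3 3-3 3-4 4-5"      # hypothetical reverse-direction line

print(intersect(fwd, rev))
print(union(fwd, rev))
```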
Other symmetrization options are available through the -c option.
Alignment files can also be visualized with the display command of atools:
./utils/atools -i corpus.de-en.gdfa -c display
 0123456
0*......0
1.*.....1
2....*..2
3..*....3
4...*...4
5.....*.5
6......*6
 0123456
...
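The grid can be read as follows: rows are positions in the left sentence, columns are positions in the right sentence, and a star marks an alignment point. The sketch below reproduces this layout from a list of points. It mimics the output shown here and is not the atools code. The leading space in the header line and the use of the last digit for indices of 10 and above are assumptions.

```python
def render_grid(points, n_source, n_target):
    """Draw alignment points as a grid of '*' and '.' characters."""
    header = " " + "".join(str(j % 10) for j in range(n_target))
    marked = set(points)
    rows = [header]
    for i in range(n_source):
        cells = "".join("*" if (i, j) in marked else "."
                        for j in range(n_target))
        rows.append(f"{i % 10}{cells}{i % 10}")
    rows.append(header)
    return "\n".join(rows)


# Alignment of the first example sentence: 0-0 1-1 2-4 3-2 4-3 5-5 6-6
points = [(0, 0), (1, 1), (2, 4), (3, 2), (4, 3), (5, 5), (6, 6)]
print(render_grid(points, n_source=7, n_target=7))
```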