Patterns in Translation: Using Colibri Core for the Syriac Bible
To what extent can linguistically uninformed features help us in tracing divergent patterns in an ancient Syriac Bible translation and its Hebrew source text? To answer this question, we need a language-independent tool that allows for a fine-grained comparison of both texts.
About the project
This Fellowship project aims to use Colibri Core for an analysis of the Hebrew and Syriac corpora of the ETCBC. The richly annotated linguistic text of the Hebrew Bible has been created over a period of almost four decades (1977–2017). Thanks to a CLARIN-NL project (2013–2014), it has been made available through the SHEBANQ website, besides its presence on GitHub as the BHSA. An electronic representation of the ancient Syriac translation of this text, the Peshitta, is also produced and maintained by the ETCBC. Modifications of this corpus were made in the CLARIAH research pilot Linking Syriac Data (2017–2018; see the final report and GitHub repository). Some encoded texts (Kings, Psalms 1–30 and others) are linguistically annotated in a way similar to the BHSA.
In Van Gompel's dissertation (2020), Colibri Core is used for the frequentist extraction of patterns, which then plays an important role in machine translation and word sense disambiguation based on context-sensitive suggestions for translations from one language into another. This is an interesting case in relation to the Bible, because Bible Translation and Machine Translation have long been allies that need and reinforce each other: the Bible provides a huge parallel corpus in a few thousand languages for Machine Translation, while Machine Translation is the most advanced means to support and speed up Bible translation projects (cf. Hurskainen 2020).
In the CLARIAH Fellowship, however, we want to experiment from the opposite direction: starting from existing translations rather than focusing on tools to create new ones. The ancient Syriac Bible translation is an interesting case, because even though Hebrew and Syriac are cognate languages, they each have their own structure. We use the sentence alignment for the Hebrew and Syriac Bible provided by the verse indexation of the ETCBC data. Next, we compute word alignment according to Och and Ney (2003) and perform phrase alignment following Koehn (2009). This yields a phrase-translation table for finding corresponding patterns in the parallel corpus, which forms an ideal basis for n-gram, skipgram and flexgram (henceforth simply “n-gram”) analysis.
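Word alignment in the tradition of Och and Ney builds on the IBM translation models. As a rough illustration of the idea, not the actual ETCBC pipeline (the function name and the toy data below are our own), the simplest of these, IBM Model 1, can be estimated with expectation maximization in a few lines:

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate IBM Model 1 lexical translation probabilities t(f|e)
    from a list of (source_tokens, target_tokens) sentence pairs,
    using expectation maximization."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Initialize t(f|e) uniformly over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for fs, es in pairs:
            es = ["NULL"] + list(es)  # allow alignment to NULL
            for f in fs:
                # E-step: distribute one count for f over all e,
                # proportionally to the current t(f|e).
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t
```

Run on a handful of toy verse pairs, this already pushes probability mass towards recurring word pairs; Koehn-style phrase alignment then extracts phrase pairs consistent with such word alignments, which is what fills the phrase-translation table.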
The computation of n-gram pattern models provides a basis for comparative corpus analysis. Since n-grams are typically distributed in a Zipfian fashion, there are only a few high-frequency patterns, with common function words in the lead, leaving a long tail of patterns that occur only sparsely. N-grams that are not subsumed by higher-order n-grams, i.e. which do not occur as part of a higher-order n-gram in the data/model, can be pruned from the model. This pruning allows us to focus on the most salient n-gram features. We use the Hebrew data as the baseline and compare to what extent these features persist in the Syriac translation. An important metric for corpus comparison is log-likelihood. It expresses how much more likely any given pattern is under either of the two models, and thus allows us to identify how indicative a pattern is of a particular corpus, and which patterns are most interesting for closer investigation.
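The log-likelihood measure we have in mind here is, presumably, the G² statistic familiar from corpus linguistics (Dunning-style). A minimal sketch, our own illustration rather than Colibri Core's implementation, taking a pattern's frequency and the token count in each corpus as input:

```python
import math

def log_likelihood(freq1, total1, freq2, total2):
    """G2 log-likelihood for a pattern observed freq1 times in a corpus
    of total1 tokens and freq2 times in a corpus of total2 tokens.
    High values mean the pattern's frequency differs strongly between
    the two corpora."""
    # Expected frequencies under the null hypothesis that the pattern
    # is equally likely in both corpora.
    expected1 = total1 * (freq1 + freq2) / (total1 + total2)
    expected2 = total2 * (freq1 + freq2) / (total1 + total2)
    g2 = 0.0
    if freq1:  # convention: 0 * log(0) = 0
        g2 += freq1 * math.log(freq1 / expected1)
    if freq2:
        g2 += freq2 * math.log(freq2 / expected2)
    return 2.0 * g2
```

A pattern occurring with the same relative frequency in both corpora scores zero; the higher the score, the more indicative the pattern is of one of the two corpora (the sign of freq1 − expected1 shows which one), making it a natural ranking criterion for patterns worth closer investigation.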