English to Persian Transliteration
Sarvnaz Karimi
Andrew Turpin
Falk Scholer
School of Computer Science and Information Technology,
RMIT University,
Melbourne, Australia.
Status
Proc. 13th Symposium on String Processing and
Information Retrieval (SPIRE 2006),
Glasgow,
to appear October 2006.
Abstract
Persian is an Indo-European language written using Arabic script, and is an
official language of Iran, Afghanistan, and Tajikistan.
Transliteration of Persian to English---that is, the character-by-character
mapping of a Persian word that is not readily available in a bilingual
dictionary---is an unstudied problem.
In this paper we make three novel contributions.
First, we present a performance comparison of existing grapheme-based
transliteration methods on English to Persian.
Second, we discuss the difficulties in establishing a corpus
for studying transliteration.
Finally, we introduce a new model of Persian that takes into account the habit
of shortening, or even omitting, runs of English vowels.
This trait makes transliteration of Persian particularly difficult for
phonetic-based methods.
This new model outperforms the existing grapheme-based methods on Persian,
exhibiting a 38\% relative increase in the number of words transliterated correctly.
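
As a rough illustration of what a "run of English vowels" means in practice, the sketch below splits an English word into maximal runs of vowels and consonants, so that a whole vowel run can be treated as one unit when it is shortened or omitted. This is not the model introduced in the paper: the vowel set, the function name segment_runs, and the V/C labels are assumptions made only for this example.

```python
from itertools import groupby

# Assumption: only these five letters count as vowels in this sketch.
VOWELS = set("aeiou")


def segment_runs(word):
    """Split a word into maximal runs of vowels ("V") and consonants ("C").

    Hypothetical helper for illustration; returns a list of (label, run) pairs.
    """
    runs = []
    # groupby collects adjacent characters that share the same vowel/consonant class.
    for is_vowel, group in groupby(word.lower(), key=lambda ch: ch in VOWELS):
        runs.append(("V" if is_vowel else "C", "".join(group)))
    return runs


if __name__ == "__main__":
    # The vowel run "uee" comes out as a single unit.
    print(segment_runs("queen"))
```

Segmenting this way makes each vowel run a single source unit, which is one simple way to let a mapping shorten or drop the run as a whole rather than character by character.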