Major Tasks in Dialectal Arabic Processing

There have been recent advances in Dialectal Arabic processing across various NLP tasks. Rather than exploring any single task in depth, this article surveys as many tasks as possible and gives an overview of the best available systems, datasets, and methodologies for each of them.

Properties of Dialectal Arabic

The Arabic language is a well-known example of diglossia: the formal variety of the language, which is taught in schools and used in written communication and formal speech (religion, politics, etc.), differs significantly from the informal varieties that are acquired at home.

These informal varieties are used in all other types of communication, whether spoken or on social media. The spoken varieties of Arabic (which we refer to collectively as Dialectal Arabic) differ widely depending on the geographic location and the socio-economic conditions of the speakers, and they can be quite different from the formal variety known as Modern Standard Arabic (MSA) فصحى.

While NLP applications for MSA have had some success, they cannot be directly applied to the Arabic dialects because of significant differences both among the dialects themselves and between the dialects and MSA. These differences appear in:

  • Phonology: the way letters are pronounced in different dialects
  • Morphology: the way words are composed from smaller parts. For example, in the Palestinian dialect the suffix ش is commonly used to negate an action: عملت (I did) can be negated by appending ش, yielding عملتش (I did not). There is no similar mechanism in MSA or in most of the other dialects; each dialect tends to have its own word-formation devices
  • Lexicon: the words commonly used to describe an entity. For example, the word for knife in MSA is سكين, but in the Syrian dialect it is موس and in Palestinian it is خوصة
  • And even syntax: the way words are combined to build a sentence. MSA basically follows a Verb-Subject-Object (VSO) order, as in رمى الولد التفاحة, which translates word-for-word to (threw the boy the apple). In most dialects, however, both VSO and SVO are permitted, so both الولد زت التفاحة (the boy threw the apple) and زت الولد التفاحة (threw the boy the apple) are acceptable, with the former (SVO) being the more common

These differences render some dialects incomprehensible to speakers of other dialects and make NLP systems built on MSA unusable for handling the dialectal content found in everyday speech and social media.

Textual Dialect Identification

In this task, given an input text, the system should identify the dialect in which it is written.

The applications of such a system include:

  • Geotagging of reviews/tweets, giving a coarse view of the user's location and cultural background
  • Serving as a pre-processing step that enables specialized dialect-specific models for the other tasks
  • Acting as a post-processing step in systems such as ASR, autocorrect, and auto-complete, and to a lesser extent in OCR, since identifying the dialect allows the use of specialized error-correction language models

The task has 2 main categories based on the scope of the identification:

  • Coarse classification: identifying the main Arabic dialect groups (Levantine, Gulf, Egyptian, Maghrebi, Iraqi, other [Sudan, Somalia, Yemen, …], and MSA)
  • Fine-grained classification: identifying more specific tags, such as the country or the city/region

Datasets

This task has a relatively large number of sizable datasets, mainly because they are simple to build in an automatic fashion. The following table lists the datasets we found for dialect classification.

| Name | Aligned (usable for translation) | Size | Notes |
| --- | --- | --- | --- |
| ADD | No | 10K | Covers the 5 main dialect groups (Egyptian, Gulf, Levantine, North African, and MSA) |
| Shami | No | 66k tweets | Automatically scraped and annotated based on geo-location only; covers the 4 Levantine dialects |
| PADIC | Yes | 6.4k | Includes four dialects from the Maghreb (two from Algeria, one from Tunisia, one from Morocco) and two dialects from the Middle East (Syria and Palestine) |
| Comparable Wikipedia | No | 10k | Wikipedia articles from the Arabic and Egyptian Wikipedias aligned using a Wikipedia aligner. Although the aligned documents represent a “translation” of the same Wikipedia entry, they rarely have the exact same content, so the corpus cannot be used for translation |
| DART | No | 27.5k | Covers the 5 main dialect groups; composed of 25k automatically annotated tweets and phrases plus 2.5k human-annotated examples |
| VarDial2016 | No | ? | Automatic speech recognition transcripts covering Egyptian, Gulf, Levantine, North African, and MSA |
| AOC | No | Huge | Covers the same 5 main dialect groups; created by scraping comments from online news sites. Contains 2 subsets: a huge 46M automatically annotated set and a smaller 100k set manually annotated through MTurk. Each comment also carries the original article URL and the article subtitle, so the dataset can be reused for other tasks like topic modelling and classification, keyphrase extraction, … |
| MADAR | Yes | Varying | Offers 2 datasets: a parallel one usable for translation between dialects and a non-parallel one for dialect classification. The parallel dataset has 2 subsets: a 6-cities set with 12k tuples of parallel sentences from 6 Arab cities across the MENA region, and a 26-cities set with 2k tuples from 26 Arab cities. The non-parallel dataset covers 2 subtasks: predicting the city tag of a sentence (split into a 6-cities task with 54k examples and a 26-cities task with 41.6k examples) and the simpler task of predicting the country, whose training set alone contains 217k tweets annotated by country |
| LDC2012T09 | Yes | 350M words | Parallel text for English, MSA, Levantine, and Egyptian; note that this dataset is paid ($2,250) on LDC |

Furthermore, if the goal is coarse classification, it is relatively easy to collect more data from social media or blogs by utilizing features like Geo-location, localized groups, …

Approaches

Both the coarse and the fine-grained tasks can be tackled using basic text classification methods. For the coarse variant, the state of the art is around 82% accuracy [1]. Honestly, with this much data, it is possible to train almost any model you want.
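As a minimal sketch of such a baseline, the following trains a character n-gram classifier with scikit-learn. The example sentences and label names are illustrative placeholders, not from any of the datasets above; in practice the training data would come from a corpus like AOC.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data; real training would use thousands of labeled sentences.
train_texts = ["ايه الاخبار يا باشا", "شلونك اليوم", "شو عم تعمل هلق", "انت فين دلوقتي"]
train_labels = ["EGY", "GLF", "LEV", "EGY"]

clf = Pipeline([
    # Character n-grams are robust to the noisy, non-standard orthography
    # of dialectal text.
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(train_texts, train_labels)
print(clf.predict(["شو هالحكي"]))  # should lean towards the Levantine label
```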

On the other hand, the fine-grained case is a bit harder due to the confusion between related dialects and the limited data size. The MADAR project's identification system, called ADIDA [2], can identify the dialect of 26 cities across the Arab world. The authors claim that they can identify the exact city of a speaker with an accuracy of 67.9% for sentences with an average length of 7 words, reaching more than 90% when the text is longer than 16 words.

Conclusion

We believe that the task of dialect identification can be considered solved, with several systems achieving relatively high performance and a wealth of available data.

Cross-dialect translation

The goal of this task is to translate text from one dialect to another or into MSA.

Datasets

Some of the dialect identification datasets above also include bi-text that can be used to train machine translation systems, namely LDC2012T09, PADIC, Shami, and MADAR. PADIC, for example, provides sentence-aligned tuples across its dialects.

However, all of the available datasets are far too small to enable training a machine translation system from scratch.

Other available resources

Multilingual word embeddings can be used to create a word-to-word translation system and can also serve as features in a larger translation system. MUSE by Facebook AI is a tool for building these embeddings, and the MADAR team has trained such embeddings for the case of Arabic dialects.
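A minimal sketch of word-to-word translation with such embeddings follows. It assumes two embedding files already aligned into a shared space in the standard fastText text format (MUSE ships MSA vectors as wiki.multi.ar.vec; the dialect-side file name here is hypothetical).

```python
import numpy as np

def load_vectors(path, limit=50000):
    """Load fastText-style text vectors and L2-normalize them for cosine similarity."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "count dim" header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.array(parts[1:], dtype=np.float32))
    mat = np.vstack(vecs)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    return words, mat

src_words, src_mat = load_vectors("wiki.multi.apc.vec")  # hypothetical dialect side
tgt_words, tgt_mat = load_vectors("wiki.multi.ar.vec")   # MSA side (real MUSE file)

def translate(word, k=5):
    """Return the k nearest MSA words to a dialect word in the shared space."""
    sims = tgt_mat @ src_mat[src_words.index(word)]
    return [tgt_words[i] for i in np.argsort(-sims)[:k]]
```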

Ways to Collect Data

In [3] the authors suggest a method for generating synthetic data for machine translation of under-resourced languages using word-embedding mappings; the algorithm is rather delicate and we refrain from describing it here. The authors test the method on the Levantine-English translation pair, generating only 50K synthetic examples and adding them to the available 160k manual examples. NMT and SMT models were trained on both the manual data and the manual+synthetic data, with the generated data increasing the performance of the baseline translator by around 2 BLEU points; their best model scored 17.33 BLEU. It is not clear how good the generated data is, or whether the 50k size was chosen so as not to overwhelm the manual set.

Approaches

The translation of dialects falls into the category of under-resourced machine translation since, as we saw above, there is not a lot of data to train such models. There is an entire research field around this problem, and in this section we will explore the approaches that we believe can be applied to Dialectal Arabic translation. Please note that most of the following approaches can be applied to translating any under-resourced language.

Note about evaluation metrics

The main metric used in the automatic evaluation of translation systems is BLEU, which scores the n-gram overlap between a system's output and one or more reference translations.
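For reference, BLEU combines modified n-gram precisions p_n (typically up to N = 4, with uniform weights w_n = 1/N) with a brevity penalty BP that punishes candidates shorter than the reference:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

where c is the candidate length and r is the reference length.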

Rule-based

  • In [4] an algorithm was proposed that normalizes the Sanaani dialect to MSA based on morphological rules. The input text is tokenized and segmented; the stem and the affixes can be dialect-specific, MSA-specific, or both, and a rule-based system is built on top of the segmenter output.
  • In [5], a rule-based approach for machine translation from Arabic dialects to MSA was presented. The approach relies on morphological analysis, morphological transfer rules, and dictionaries, in addition to language models, to produce MSA paraphrases of dialectal sentences. The treated dialects are Levantine, Egyptian, Iraqi, and Gulf Arabic.

However, the performance of these systems is questionable at best.

Supervised SMT

Most of the research on dialectal translation, and on Arabic machine translation in general, has relied on statistical machine translation (SMT) using toolkits like Moses.

  • In [6] the authors manually annotate a small dataset of approximately 150k pairs covering MSA and the Levantine and Egyptian dialects. Their reported BLEU scores are 16.7 and 18.5 for the Levantine and Egyptian dialects respectively.
  • In [7] the first real cross-dialect translation experiments were done using the SMT methodology. The reported results are modest, although some pairs show surprisingly high performance, which can be partially attributed to the limited size of the test data.

Overall, the results of SMT on Dialectal Arabic remain very low.

Supervised NMT

Neural machine translation (NMT) dominates the current state of the art in machine translation for many languages. However, the main obstacle to its widespread adoption here is its reliance on large quantities of data. The only result we found is reported in [8]. The authors used the transformer architecture and trained a model on the LDC2012T09 dataset mentioned above to translate from the dialects to English. They tested whether a pipelined model (given an input text, the system first detects the dialect and then routes the text to a model tuned specifically for that dialect) is better than a multilingual model (where a single model is used for all inputs). The results were 22.79 and 23.78 BLEU for the Egyptian and Levantine dialects respectively.
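The contrast between the two setups tested in [8] can be sketched as follows; the model objects and the dialect_id function are hypothetical placeholders, not the authors' actual code.

```python
def translate_pipelined(text, dialect_id, per_dialect_models):
    """Pipelined setup: detect the dialect, then route to a dialect-specific model."""
    dialect = dialect_id(text)                 # e.g. "EGY" or "LEV"
    return per_dialect_models[dialect].translate(text)

def translate_multilingual(text, shared_model):
    """Multilingual setup: a single shared model handles all dialects."""
    return shared_model.translate(text)
```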

Unsupervised NMT

Unsupervised neural machine translation trains seq2seq models to translate from a source language to a target language without any parallel data (sentences in the source language paired with their translations in the target language), using only two independent monolingual corpora.

There has been some recent work on full-scale unsupervised neural machine translation [9]–[11]. The unsupervised SOTA for translation between English and French [12] (on the WMT dataset) is 35 BLEU, which is considerably less than the supervised SOTA of 46 on the same translation pair and test set.

This same technique is language-agnostic and can be applied to the dialects of Arabic.

Model adaptation in NMT

Since the major successes in NMT were seen in Western languages where large quantities of parallel data are available, many researchers [13]–[15] have considered adapting models trained on Western languages to work on under-resourced languages.

The process of using the parameters of a parent model (trained on the large dataset) to initialize a child model (to be trained on a small dataset) usually goes like this [14]:

  1. Learn monolingual embeddings of the child language using word2vec
  2. Extract the source embeddings from a pre-trained parent NMT model
  3. Learn a cross-lingual linear mapping between the embeddings from steps 1 and 2
  4. Replace the parent's source embeddings with the linearly mapped ones

A minimal code sketch of these four steps is given below.
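This sketch assumes gensim for word2vec and numpy for the mapping; the file names, the tiny seed dictionary (which would contain thousands of pairs in practice), and the embedding dimensionality are all illustrative assumptions.

```python
import numpy as np
from gensim.models import Word2Vec

# 1. Learn monolingual child (dialect) embeddings with word2vec.
child = Word2Vec(corpus_file="dialect_corpus.txt", vector_size=512)

# 2. Source embeddings extracted from a pre-trained parent NMT model.
parent_emb = np.load("parent_source_embeddings.npy")          # shape (vocab, 512)
parent_vocab = {w: i for i, w in
                enumerate(open("parent_vocab.txt", encoding="utf-8").read().split())}

# 3. Fit a linear map W minimizing ||XW - Y|| over a seed dictionary
#    of (dialect word, MSA word) pairs.
seed_pairs = [("هلق", "الآن"), ("زت", "رمى")]                 # toy seed dictionary
X = np.vstack([child.wv[s] for s, _ in seed_pairs])
Y = np.vstack([parent_emb[parent_vocab[t]] for _, t in seed_pairs])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# 4. Replace the parent's source embeddings with the mapped child embeddings
#    before fine-tuning the child model.
child_mapped = child.wv.vectors @ W
```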

This approach can increase the performance of under-resourced NMT models, in some cases by over 100%. Although the results reported in [12] might not seem very good, the pairs in question are extremely under-resourced: the top pair (Slovenian to English) has 17k parallel sentences, while other pairs can have as few as 1,000.

We have not come across any work tackling the issue of NMT model adaptation from Arabic MSA to Arabic dialects.

Cross-lingual embedding mapping

This method depends on the closeness between the dialects of Arabic and MSA. As mentioned above, it is possible to build cross-lingual word embeddings in which the representation of a word in one language lies very close to the representation of its translation in another language (as visualized, for example, by Facebook's MUSE).

In [16] the author describes a similar approach to carry out word-by-word translation of dialects. They train two separate embedding models for the source language (Egyptian) and the target language (MSA, Arabic Fusha), and then use a bi-text dictionary to train a linear mapping that takes the representation of a word in the source language and returns its representation in the target language. The resulting model can predict several possible translations for each word.

This approach stops at this point. However, it is possible to extend the model to translate full sentences by treating the suggested words as a lattice and disambiguating it with a language model, in a manner similar to speech recognition; tools like SRILM are usually used for this. A beam-search sketch over such a lattice follows.
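This is a minimal sketch of the lattice-disambiguation idea; lm_score is a hypothetical stand-in for a bigram language model (e.g. one trained with SRILM) returning log-probabilities.

```python
def decode(candidate_lattice, lm_score, beam=5):
    """Pick a fluent path through per-word translation candidates.

    candidate_lattice: a list of candidate lists, one per source word.
    lm_score(prev, word): log-probability of `word` following `prev`.
    """
    beams = [([], 0.0)]  # (partial sentence, cumulative LM score)
    for candidates in candidate_lattice:
        expanded = []
        for seq, score in beams:
            for word in candidates:
                prev = seq[-1] if seq else "<s>"
                expanded.append((seq + [word], score + lm_score(prev, word)))
        # keep only the highest-scoring partial hypotheses
        beams = sorted(expanded, key=lambda x: -x[1])[:beam]
    return beams[0][0]
```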

While this method is relatively simple and language-agnostic, such a simplistic approach has several issues, including the handling of named entities, out-of-vocabulary words, and the fact that although the system considers the context in the target language, it ignores the context in the source language.

Other Approaches

  • In [17] the authors describe a dialectal translation system using a hybrid rule-based and SMT approach. The paper does not describe any evaluation results on test data, but the system is available to researchers upon contact. This system was used in [18] to improve the translation of Arabic to English; however, the improvement over the plain MSA baseline is negligible.
  • In [19] the authors follow a similar pattern of first translating Egyptian dialectal speech to MSA and then translating the MSA output to English. The results are not impressive, at 16.8 BLEU.

Conclusion

  • First, I would take these conclusions with a grain of salt, because we need more research before deciding on the best path to follow. The best source to start with is [20].
  • Supervised SMT is rather worrisome since we have so little data.
  • Rule-based and ad-hoc approaches have been used previously. I am not sure of their potential, but we should try to acquire the ELISSA system just to measure the performance of such a system.
  • We need to do more research on cross-lingual word embeddings, as well as on ways to collect more data.
  • What we should start with:
    • Supervised NMT is a possible solution if approaches like model adaptation and synthetic data generation are followed.
    • Unsupervised NMT is worth a shot since codebases are available, and it can serve as an initialization for a supervised NMT.

Arabizi Transliteration and Vowelization

Arabizi refers to Arabic-language text written in the Roman script, e.g. “Sho Hal7ki”. This type of text is used heavily on social media, and it varies with the variations of the Arabic dialects. The goal here is to convert such text into the Arabic script.

This task is important for other applications related to Dialectal Arabic (basically anything involving social media analysis).

The task can be classified into several levels:

  • Phonological: doing a char-to-char translation with a simple letter-based mapping between Arabizi symbols and Arabic letters. Such an approach would be suitable for a very simple input editor; see the toy mapper after this list.
  • Lexical: a somewhat higher level in which the translation is done on a word-by-word basis. This task can be declared solved, mainly because plenty of open software already provides this service, like Google's Taarib and Yamli.
  • Syntactic: a fully-fledged machine translation between Arabizi and Arabic, which is harder since context plays a role at this level.
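A toy sketch of the phonological level follows. The mapping table is a small illustrative subset (the digit conventions 2→ء, 3→ع, 5→خ, 7→ح, 9→ص are common, but real Arabizi is far more ambiguous than a fixed table can capture).

```python
# Partial, illustrative Arabizi-to-Arabic character map.
ARABIZI_MAP = {
    "2": "ء", "3": "ع", "5": "خ", "7": "ح", "9": "ص",
    "sh": "ش", "a": "ا", "b": "ب", "t": "ت", "s": "س",
    "h": "ه", "k": "ك", "l": "ل", "m": "م", "n": "ن",
    "w": "و", "y": "ي", "d": "د", "r": "ر", "o": "و", "i": "ي",
}

def transliterate(text):
    out, i = [], 0
    while i < len(text):
        # prefer two-character units like "sh" before single characters
        if text[i:i + 2] in ARABIZI_MAP:
            out.append(ARABIZI_MAP[text[i:i + 2]]); i += 2
        elif text[i].lower() in ARABIZI_MAP:
            out.append(ARABIZI_MAP[text[i].lower()]); i += 1
        else:
            out.append(text[i]); i += 1
    return "".join(out)

print(transliterate("sho hal7ki"))  # rough output: شو هالحكي
```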

Datasets

There are corpora of decent size for this task; the following table lists the ones we found.

| Name | Free or paid | Size in words | Notes |
| --- | --- | --- | --- |
| ILPS | Free | 10K | From [21], but the details of data collection in their paper are rather vague and the quality of the dataset needs to be checked |
| ELRA-W0126 | Paid | ≈4.8k | Contains 3,452 Arabizi tokens manually transliterated into Arabic, plus a set of 127 Arabizi tweets (1,385 words) also manually transliterated into Arabic. The commercial license costs 650 euros, but there is a free license for research purposes |
| Camel | Free | 10k | Arabic words transcribed in Roman script (romanization) and in local Arabizi |
| BOLT | Paid | ? | Might be outdated |
| QCRI | Free | ? | I remember they had a dataset for this task; I just can't find it anywhere |

Methods to collect more data

In one paper by QCRI (I really couldn't find it), the authors generated a pronunciation table using first names from student records in several schools; these records included each name in Arabic along with a romanized version. The goal was to improve transliteration performance on OOV words.

Approaches

  • The authors in [21] collect the ILPS dataset and then build a simple translation system using a method similar to the lattice-disambiguation approach from the machine translation section. Their software is open.
  • In [22] the QCRI team achieves an F1 measure of 0.93 on the task of normalizing person-name mentions across multiple documents.
  • [23] tackles both code-switching detection (deciding whether the writer is writing in English or in Arabizi) and the transliteration itself. They report 93% accuracy on the code-switching task and an 88.7% conversion accuracy, with roughly a third of the errors being spelling and morphological variants of the ground-truth forms. Their method uses the same idea of lattice generation and disambiguation with a language model.

Conclusion

  • We believe that this task can be successfully implemented at least at the phonological and lexical levels, for example by training a character-level decoder, since some data exists for this task.
  • For full syntactic transliteration, we should at least try the code from ILPS as an off-the-shelf component.

Sentiment analysis

In this task, given a piece of text, the system should extract its overall polarity (positive, negative, or neutral).

This is usually done in 2 steps: first, the subjectivity of the text is determined (subjective vs. objective), and if it is subjective, the second step determines the polarity, as sketched below.
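A minimal sketch of the two-step design; both classifiers are assumed to be pre-trained models (e.g. scikit-learn pipelines) and are placeholders here.

```python
def analyze_sentiment(text, subjectivity_clf, polarity_clf):
    # Step 1: objective text is neutral by definition.
    if subjectivity_clf.predict([text])[0] == "objective":
        return "neutral"
    # Step 2: only subjective text is scored for polarity.
    return polarity_clf.predict([text])[0]  # "positive" or "negative"
```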

There is a wealth of datasets, lexicons, and pre-trained embeddings, and quite a lot of work in the literature. The best line of work known to me is by Saif Mohammad at the NRC. The datasets are all in the corpora list document.

Overall, however, the performance of these models is relatively low (when results are reported on shared challenges rather than on homemade corpora) and depends heavily on domain specificity, as over-fitting is a real threat [22]. This is due to issues like the heavy use of sarcasm and contextual language, code-switching, and the dependence of the system on the dialect or the genre. The best approach is to build a universal model and then adapt it using data from a specific domain, genre, and dialect to ensure the highest performance. However, in a real industrial application, the size and type of data needed for adaptation become crucial, and further investigation is needed.

There are some established social media analysis services for Arabic, including crowdAnalyzer and trend25.

Summarization

This task is also listed under dialectal language processing, mainly because of applications like social media summarization and customer review summarization, which deal closely with dialectal content. Such applications can be helpful in social media analysis or in services like Google Alerts, where the customer would like to see a gist of the user reviews related to a specific brand or product in, say, the last month, and of why their sentiment is negative.

Datasets

We have not come across any summarization datasets for Dialectal Arabic. Furthermore, none of the available Arabic summarization datasets include dialectal content (they are taken from Wikipedia, news outlets, …).

Approaches

In [24], a machine-learning-based microblog summarization technique for the Egyptian dialect was presented. The results were compared against several well-known algorithms, such as SumBasic, TF-IDF, and TextRank, as well as human summaries.

Conclusion

  • Simple language-agnostic summarization methods like TextRank can work easily on dialectal content; a minimal sketch follows this list.
  • There is no work on abstractive summarization in Dialectal Arabic due to the lack of datasets (data is scarce for MSA itself, let alone the dialects), so this task poses a challenge.
  • Most of the aforementioned applications of dialectal summarization focus on the multi-document variant (finding the gist of multiple user reviews/tweets).
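A minimal TextRank-style extractive sketch, one of many possible variants: sentences are scored with PageRank over a TF-IDF cosine-similarity graph, which requires no language-specific resources and so works on dialectal text unchanged.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, top_k=3):
    # Character n-grams sidestep the lack of dialectal tokenizers/stemmers.
    tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(sentences)
    sim = cosine_similarity(tfidf)          # sentence-sentence similarity matrix
    scores = nx.pagerank(nx.from_numpy_array(sim))
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])[:top_k]
    return [sentences[i] for i in sorted(ranked)]  # preserve original order
```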

Information Retrieval

This addresses the issue of building search services that can deal with dialectal language. The main complexity in applying the usual IR methods comes from the orthographic variation found in dialectal content, for example due to the usage of Arabizi. This issue can be resolved either by transliteration or by dialect-specific normalization, as in the sketch below.
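A sketch of common Arabic orthographic normalizations applied before indexing; the exact rule set varies by system, and this is an illustrative subset.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan .. sukun

def normalize(text):
    text = DIACRITICS.sub("", text)       # strip short-vowel diacritics
    text = re.sub("[إأآا]", "ا", text)     # unify alef variants
    text = text.replace("ى", "ي")         # alef maqsura -> yeh
    text = text.replace("ة", "ه")         # ta marbuta -> heh
    text = text.replace("ـ", "")          # remove tatweel (kashida)
    return text
```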

Approaches

  • In [25] the authors address the issue of linguistic differences in IR. The presented tool automatically generates dialect search terms with relevant morphological variations from English or Standard Arabic query terms.
  • [26] is a very good survey of IR in Arabic (we could not go through it in full because it is quite long).

Preprocessing Tasks

In this section we will explore the various tasks related to text pre-processing:

  • Normalization: different normalization schemes are needed to handle dialectal text, mainly because, in contrast to MSA, Dialectal Arabic has no orthographic standard: the same word can be written in different forms, which poses difficulties for NLP tools. This task is closely related to Arabizi transliteration.
  • Segmentation: splitting words into their constituents, e.g. playing → play+ing
  • Part-of-speech (POS) tagging: identifying the POS tag of each word in a sentence, e.g. “boy plays the violin” → [(boy, Noun), (plays, Verb), (the, Determiner), (violin, Noun)]
  • Named entity recognition (NER): recognizing named entities such as person names, organizations, or places.

Multi-task Datasets

  • The people at QCRI have built a sizeable dataset for segmentation and POS tagging of dialectal content.
  • The Curras dataset is similar and contains lemmas and fine-grained POS tags for text in the Palestinian dialect.

Normalization

  • In [27] the first steps towards normalizing the orthography of the Levantine and Egyptian dialects were made. Different similarity measures, exploiting string similarity and contextual semantic similarity, were employed to unify different spellings of the same word.
  • Some researchers [28] suggested creating a specialized conventional orthography (way of writing) that is comparable between the dialects and MSA; however, this never gained traction.

Named Entity Recognition (NER)

In [29], the authors address the issue of named entity recognition in microblogs and describe the various complexities of NER in tweets. They utilize a weakly supervised, language-agnostic method and report a rather low result of 0.65 F1 on dialectal tweets.

Other approaches we did not have time to investigate are [30] and [31].

Morphological Analysis

  • [32] is a rule-based morphological analyzer for Dialectal Arabic; it did not gain traction and is quite old, so we are not sure of its importance.
  • In [33], two morphological analyzers for the Gulf, Levantine, Egyptian, North African, Sudanese, and Iraqi dialects were presented. The first relies on an MSA morphological analyzer; the second applies word segmentation and uses web data as a corpus to produce statistical information about the frequency of different segment combinations.
  • [34] is a morphological analyzer for the Egyptian dialect.
  • [35] trains a supervised POS tagger on a manually annotated set of Egyptian dialect text.
  • [36] is a full morphological analyzer for Egyptian that supports part-of-speech tagging, diacritization, lemmatization, and tokenization.

Speech Recognition

In speech recognition, the system takes as input the speech waveform and is tasked with transcribing the content into words.

Datasets

  • A series of DARPA challenges produced a set of paid datasets, the most famous of which is the CALLHOME dataset of Egyptian Arabic speech.
  • However, the largest free dataset for MSA (1,200 hours) was provided by QCRI through their MGB-2 challenge, which focused on speech recognition in the wild (i.e. under varying channel, noise, and speaker conditions) using transcribed TV broadcasts from Aljazeera.
  • MGB-3 introduced a small dataset of 16 hours of Egyptian-dialect YouTube videos for adapting systems developed for MSA on MGB-2 to the Egyptian dialect.
  • Similarly, MGB-5 added another small dialectal dataset of 18 hours of Moroccan speech, also for adapting MGB-2 systems. All the MGB datasets are available upon contact.
  • A collection of datasets by QCRI includes several other useful resources for speech recognition.

Methods to collect more data

In some cases, speech and transcripts exist independently. For example, all the articles of the Syrian Researchers are read aloud by some of the team members and published on SoundCloud. In such cases it is possible to use forced aligners like aeneas to align the articles with their spoken versions, and thus create large datasets for ASR and TTS; a sketch follows. The only pain in this process is normalizing the text to match the speech. However, I don't know of any such resources in the dialects. Other examples include the audio versions of the Bible and the Quran (which are not dialectal).
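A sketch of forced alignment with the aeneas Python API, following its documented usage; the file paths are placeholders, and the "ara" language code for Arabic is our assumption.

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Task configuration: input language, plain-text transcript, JSON output.
config = u"task_language=ara|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/data/episode.mp3"     # e.g. SoundCloud audio
task.text_file_path_absolute = "/data/article.txt"      # one fragment per line
task.sync_map_file_path_absolute = "/data/alignment.json"

ExecuteTask(task).execute()   # compute the audio/text synchronization map
task.output_sync_map_file()   # write fragment timings to the JSON file
```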

Approaches

  • The dissertation of Ahmad Ali [37] is by far the most important work in the field currently; it sums up five years of collaboration between Edinburgh University, QCRI, and Aljazeera on Arabic speech. It even provides a recipe for an Arabic ASR system using Kaldi.
  • The main approach used by most participants in the MGB-3 challenge is to train an ASR system on MSA using the MGB-2 data and then adapt it to the dialect [38]. The best result on the Egyptian dialect is 29% WER (word error rate). Please note that the results of MGB-5 are not public yet.
  • Another approach, proposed for English by the Google AI team [39], [40], investigates multi-dialect speech recognition with a single end-to-end model, in a similar fashion to Google's multilingual machine translation system. The method has been applied successfully to multi-dialect English ASR and the results seem promising, especially for zero-shot ASR (recognizing speech from dialects on which the system was never trained). There have even been some attempts at multilingual ASR [41].

Conclusion

  • Speech recognition is one of the most important tasks in Dialectal Arabic, mainly because the dialects are mostly expressed through speech and to a lesser degree through text (social media and texting).
  • While it is possible to build an ASR system for MSA, dialectal ASR is limited to dialects that have adaptation data; the MGB challenges provide data for the Egyptian and Moroccan dialects only. We need to do more research to find out whether more dialectal data is available or whether there are other ways to collect it.

Speech Dialect Identification

Arabic dialect identification from speech is a special case of the more general language identification task. It is a vital component of most multi-dialect speech recognition systems and is also used in some speaker recognition systems.

In very simple terms, the input of such a model is a speech waveform and the output is the dialect being spoken. In contrast to language identification, however, dialect identification is harder due to the similarity between the dialects.

Datasets

  • The MGB-3 dataset has been utilized for dialectal speech recognition (by adapting systems developed for MSA on the larger MGB-2 data to the Egyptian dialect) as well as for dialect identification. The dataset is rather small: it contains only 16 hours covering the adaptation, development, and evaluation sets, and it only includes Egyptian YouTube programs, centred on podcasts.
  • The MGB-5 ADI data contains 3,000 hours of YouTube videos covering 17 different dialects and many genres.
  • The VarDial 2017 and 2018 campaigns featured a harder shared task on Arabic speech dialect identification.
  • Finally, there is also a pretty large dataset of over 1,200 hours of speech from Aljazeera programs, annotated with the 5 main dialect groups.

Approaches

  • [42]–[45] have utilized the MGB-3 dataset to train their systems, with results reaching up to 80% F1 (a live demo is available). The best model, by QCRI and MIT, utilized i-vectors for the detection, a method commonly used in speaker identification.
  • On VarDial the results are much worse, at around 60%, mainly because the data is much smaller; see [46]–[48].

Conclusion

We believe that this task, too, can be implemented, since there is a wealth of data and a sizable literature in the field.

Text to Speech

While TTS is available for MSA, I have come across no work on TTS for the dialects, and I hardly see any point in developing such a system (though Mo thinks that maybe I'm wrong).


References

[1] M. Elaraby and M. Abdul-Mageed, “Deep Models for Arabic Dialect Identification on Benchmarked Data,” in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), 2018, pp. 263–274.

[2] M. Salameh and H. Bouamor, “Fine-grained arabic dialect identification,” in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1332–1344.

[3] H. Hassan, M. Elaraby, and A. Tawfik, “Synthetic data for neural machine translation of spoken-dialects,” arXiv preprint arXiv:1707.00079, 2017.

[4] G. H. Al-Gaphari and M. Al-Yadoumi, “A method to convert Sana’ani accent to Modern Standard Arabic,” Int. J. Inf. Sci. Manag., vol. 8, no. 1, 2010.

[5] W. Salloum and N. Habash, “Dialectal to standard Arabic paraphrasing to improve Arabic-English statistical machine translation,” in Proceedings of the first workshop on algorithms and resources for modelling of dialects and language varieties, 2011, pp. 10–21.

[6] R. Zbib et al., “Machine translation of Arabic dialects,” in Proceedings of the 2012 conference of the north american chapter of the association for computational linguistics: Human language technologies, 2012, pp. 49–59.

[7] K. Meftouh, S. Harrat, S. Jamoussi, M. Abbas, and K. Smaili, “Machine translation experiments on padic: A parallel arabic dialect corpus,” in The 29th Pacific Asia conference on language, information and computation, 2015.

[8] P. Shapiro and K. Duh, “Comparing Pipelined and Integrated Approaches to Dialectal Arabic Neural Machine Translation,” in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, 2019, pp. 214–222.

[9] M. Artetxe, G. Labaka, E. Agirre, and K. Cho, “Unsupervised neural machine translation,” arXiv preprint arXiv:1710.11041, 2017.

[10] G. Lample, L. Denoyer, and M. Ranzato, “Unsupervised Machine Translation Using Monolingual Corpora Only,” arXiv preprint arXiv:1711.00043, 2017.

[11] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato, “Phrase-Based & Neural Unsupervised Machine Translation,” arXiv preprint arXiv:1804.07755, 2018.

[12] M. Artetxe, G. Labaka, and E. Agirre, “An effective approach to unsupervised machine translation,” arXiv preprint arXiv:1902.01313, 2019.

[13] M. Gheini and J. May, “A Universal Parent Model for Low-Resource Neural Machine Translation Transfer,” arXiv preprint arXiv:1909.06516, 2019.

[14] Y. Kim, Y. Gao, and H. Ney, “Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies,” arXiv preprint arXiv:1905.05475, 2019.

[15] T. Kocmi and O. Bojar, “Trivial transfer learning for low-resource neural machine translation,” arXiv preprint arXiv:1809.00357, 2018.

[16] E. H. Almansor, “Translating Arabic as low resource language using distribution representation and neural machine translation models,” 2018.

[17] W. Salloum and N. Habash, “Elissa: A dialectal to standard Arabic machine translation system,” in Proceedings of COLING 2012: Demonstration Papers, 2012, pp. 385–392.

[18] W. Salloum and N. Habash, “Dialectal arabic to english machine translation: Pivoting through modern standard arabic,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 348–358.

[19] H. Sajjad, K. Darwish, and Y. Belinkov, “Translating dialectal arabic to english,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2013, pp. 1–6.

[20] S. Harrat, K. Meftouh, and K. Smaili, “Machine translation for Arabic dialects (survey),” Inf. Process. Manag., 2017.

[21] M. van der Wees, A. Bisazza, and C. Monz, “A simple but effective approach to improve arabizi-to-english statistical machine translation,” in Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), 2016, pp. 43–50.

[22] W. Magdy, K. Darwish, O. Emam, and H. Hassan, “Arabic cross-document person name normalization,” in Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, 2007, pp. 25–32.

[23] K. Darwish, “Arabizi detection and conversion to Arabic,” arXiv preprint arXiv:1306.6755, 2013.

[24] N. El-Fishawy, A. Hamouda, G. M. Attiya, and M. Atef, “Arabic summarization in twitter social network,” Ain Shams Eng. J., vol. 5, no. 2, pp. 411–420, 2014.

[25] A. Pasha et al., “Dira: Dialectal arabic information retrieval assistant,” in The Companion Volume of the Proceedings of IJCNLP 2013: System Demonstrations, 2013, pp. 13–16.

[26] K. Darwish and W. Magdy, “Arabic information retrieval,” Found. Trends® Inf. Retr., vol. 7, no. 4, pp. 239–342, 2014.

[27] P. Dasigi and M. Diab, “Codact: Towards identifying orthographic variants in dialectal arabic,” 2011.

[28] N. Habash, M. T. Diab, and O. Rambow, “Conventional Orthography for Dialectal Arabic.,” in LREC, 2012, pp. 711–718.

[29] K. Darwish and W. Gao, “Simple Effective Microblog Named Entity Recognition: Arabic as an Example.,” in LREC, 2014, pp. 2513–2517.

[30] A. Zirikly and M. Diab, “Named entity recognition for dialectal arabic,” ANLP 2014, p. 78, 2014.

[31] A. Zirikly and M. Diab, “Named entity recognition for arabic social media,” in Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015, pp. 176–185.

[32] N. Habash and O. Rambow, “Morphophonemic and orthographic rules in a multi-dialectal morphological analyzer and generator for arabic verbs,” in International symposium on computer and arabic language (iscal), riyadh, saudi arabia, 2007, vol. 2006.

[33] K. Almeman and M. Lee, “Towards developing a multi-dialect morphological analyser for arabic,” in 4th international conference on arabic language processing, rabat, morocco, 2012.

[34] W. Salloum and N. Habash, “ADAM: Analyzer for dialectal Arabic morphology,” J. King Saud Univ.-Comput. Inf. Sci., vol. 26, no. 4, pp. 372–378, 2014.

[35] R. Al-Sabbagh and R. Girju, “A supervised POS tagger for written Arabic social networking corpora.,” in KONVENS, 2012, pp. 39–52.

[36] N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh, “Morphological analysis and disambiguation for dialectal Arabic,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 426–432.

[37] A. M. A. M. Ali, “Multi-dialect Arabic broadcast speech recognition,” 2018.

[38] S. Khurana, A. Ali, and J. Glass, “DARTS: Dialectal Arabic Transcription System,” arXiv preprint arXiv:1909.12163, 2019.

[39] M. Johnson et al., “Google’s multilingual neural machine translation system: Enabling zero-shot translation,” Trans. Assoc. Comput. Linguist., vol. 5, pp. 339–351, 2017.

[40] B. Li et al., “Multi-dialect speech recognition with a single sequence-to-sequence model,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4749–4753.

[41] S. Toshniwal et al., “Multilingual speech recognition with a single end-to-end model,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4904–4908.

[42] C. Zhang, Q. Zhang, and J. H. Hansen, “Semi-supervised Learning with Generative Adversarial Networks for Arabic Dialect Identification,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5986–5990.

[43] S. Khurana, M. Najafian, A. M. Ali, T. Al Hanai, Y. Belinkov, and J. R. Glass, “QMDIS: QCRI-MIT Advanced Dialect Identification System.,” in Interspeech, 2017, pp. 2591–2595.

[44] A. E. Bulut, Q. Zhang, C. Zhang, F. Bahmaninezhad, and J. H. Hansen, “UTD-CRSS submission for MGB-3 Arabic dialect identification: Front-end and back-end advancements on broadcast speech,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 360–367.

[45] S. Shon, A. Ali, and J. Glass, “Convolutional neural networks and language embeddings for end-to-end dialect recognition,” arXiv preprint arXiv:1803.04567, 2018.

[46] A. M. Butnaru and R. T. Ionescu, “UnibucKernel Reloaded: First place in Arabic dialect identification for the second year in a row,” arXiv preprint arXiv:1805.04876, 2018.

[47] M. Zampieri et al., “Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign,” in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects, 2018.

[48] P. Nakov, M. Zampieri, N. Ljubešić, J. Tiedemann, S. Malmasi, and A. Ali, “Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial),” in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 2017.
