Biggest Challenges in Arabic Natural Language Processing

In previous posts we discussed the significance of NLP and the wide range of applications where it is used. Since the basic goal of NLP is to simplify communication between humans and machines, it is worth looking at how it affects the people who speak, communicate and work in Arabic, the 6th most spoken language in the world. Arabic is a Semitic language spoken by approximately 420 million people. It is an official language in 26 countries and one of the six official languages of the United Nations.

Arabic is morphologically rich and has many varieties. Classical Arabic is the language of the Quran (the holy book of Muslims) and is traditionally considered the most perfect form of Arabic. Modern Standard Arabic is the official language today and is used in literature, education, books, media and other formal settings. Finally, the Arabic dialects are the language of everyday speech, and they differ from country to country. With this short introduction in place, this article discusses three of the major challenges in Arabic NLP.

1. Arabic orthography

The Arabic alphabet consists of 28 letters, of which only three serve as long vowels: (ا) pronounced Alef, (و) pronounced Waw and (ي) pronounced Ya'a. Short vowels, nunation, consonant doubling and the absence of a vowel are written as diacritical marks placed above or below the letters (ـَ ـُ ـِ ـً ـٌ ـٍ ـّ ـْ), and these marks are usually omitted in everyday writing. Arabic is also one of the languages where the shape of a letter changes according to how it connects with the neighbouring letters. For example, the letter (ت) (the letter 'T' in English) has three written forms: (ت) at the end of a word, (ـتـ) in the middle of a word and (تـ) at the beginning of a word. Arabic orthography must therefore be taken into account in NLP tasks and applications such as tokenization and text-to-speech.
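As a rough illustration of why orthography matters before tokenization, the sketch below normalizes Arabic text in two ways using only Python's standard library. The function names `strip_diacritics` and `to_base_letters` are made up for this example, and real pipelines typically apply more normalization steps than these two.

```python
import unicodedata

# Arabic diacritics: tanween (U+064B-U+064D), short vowels (U+064E-U+0650),
# shadda (U+0651) and sukun (U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Drop diacritic marks so vocalized and unvocalized spellings compare equal."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def to_base_letters(text: str) -> str:
    """Fold positional glyph code points back into the plain letter.

    Unicode has separate "presentation form" code points for the isolated,
    final, initial and medial shapes of each letter; NFKC normalization maps
    them to the base letter.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # The root k-t-b written with a fatha after every consonant ("kataba").
    vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(strip_diacritics(vocalized))

    # Isolated, final, initial and medial forms of the letter teh.
    for form in "\uFE95\uFE96\uFE97\uFE98":
        print(hex(ord(form)), "->", hex(ord(to_base_letters(form))))
```

Text typed on a keyboard normally stores only the base letters and leaves the shaping to the font, so the second function mostly matters for text extracted from PDFs or legacy encodings.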

2. Arabic morphology

Every Arabic verb is built on a root of three or four letters, which makes Arabic a highly derivational language. Verb derivation usually follows a template, which we can write as Verb = Root + Pattern. The following table shows some examples of verbs in their past, present/future and imperative forms derived from three- and four-letter roots.

Root | Pattern | Verb | Transliteration | Meaning
كتب | ي | ي + كتب = يكتب | yaktub | present/future form of "write"
كتب | ا | ا + كتب = اكتب | uktub | imperative form of "write"
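The Verb = Root + Pattern idea can be sketched in a few lines of Python. This is only an illustration: the template notation, in which the digits 1 to 4 stand for the root consonants, and the names `apply_template`, `PRESENT` and `IMPERATIVE` are assumptions made for this example, and real Arabic patterns also involve vowel changes that are ignored here.

```python
def apply_template(root: str, template: str) -> str:
    """Fill a template whose digits 1-4 stand for the consonants of the root."""
    out = []
    for ch in template:
        if ch in "1234":
            index = int(ch) - 1
            if index >= len(root):
                raise ValueError(
                    f"template needs consonant {ch}, but the root has {len(root)} letters"
                )
            out.append(root[index])
        else:
            # Any other character belongs to the pattern itself.
            out.append(ch)
    return "".join(out)


# Hypothetical templates matching the two rows of the table above:
# the letter yeh or alef followed by the three root consonants.
PRESENT = "\u064A" + "123"
IMPERATIVE = "\u0627" + "123"

if __name__ == "__main__":
    root = "\u0643\u062A\u0628"  # the root k-t-b ("write")
    print(apply_template(root, PRESENT))
    print(apply_template(root, IMPERATIVE))
```

Keeping the pattern as data rather than code means new patterns can be added without touching the function.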

It is also very common in Arabic to attach prefixes and suffixes to verbs, which we can express as New_Verb = Prefix(es) + Verb + Suffix(es). The following table shows an example of inflection in Arabic.

Verb | New verb | Meaning
يكتب | س + يكتب = سيكتب | He will write
يكتب | س + يكتب + ه = سيكتبه | He will write it
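Going in the other direction, a morphological analyzer has to split an inflected word back into Prefix(es) + Verb + Suffix(es). The sketch below is a deliberately naive version that only knows the one prefix and one suffix from the table. The names `PREFIXES`, `SUFFIXES` and `segment`, and the idea of checking against a small set of known verbs, are assumptions for illustration, not how any particular analyzer works.

```python
from typing import Optional, Set, Tuple

PREFIXES = ("س",)  # future marker, as in the table
SUFFIXES = ("ه",)  # attached object pronoun ("it")


def segment(word: str, known_verbs: Set[str]) -> Optional[Tuple[str, str, str]]:
    """Return (prefix, verb, suffix) if stripping affixes yields a known verb."""
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)]
            if stem in known_verbs:
                return prefix, stem, suffix
    return None  # no analysis found


if __name__ == "__main__":
    verbs = {"يكتب"}
    for word in ("سيكتب", "سيكتبه"):
        print(word, "->", segment(word, verbs))
```

The lookup against known verbs is what stops the function from stripping a letter that merely looks like an affix but is really part of the word, which is one reason Arabic segmentation is harder than simple prefix removal.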

Studying Arabic morphology is essential for NLP tasks such as morphological analysis and POS tagging.

3. Complex syntax

Arabic is rich in vocabulary, and a single word can have several meanings. For example, in “البيت كبير” (the house is big) the word كبير means "big", but in the expression “كبير القوم” the same word refers to the chief, the man who is responsible for a group of people. Multiword expressions of this kind affect applications such as text summarization and translation.
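One common way to handle such expressions is to look up the longest matching phrase before falling back to single words. The sketch below shows that idea with a toy lexicon. The lexicon entries, the English glosses and the function name `gloss` are assumptions for this example, and splitting on whitespace is a simplification that ignores the attached prefixes and suffixes discussed earlier.

```python
from typing import Dict, List, Optional, Tuple

KABIR = "كبير"     # "big" on its own
AL_QAWM = "القوم"  # "the people"
AL_BAYT = "البيت"  # "the house"

# Toy lexicon keyed by token sequences, so phrases and single words coexist.
LEXICON: Dict[Tuple[str, ...], str] = {
    (KABIR,): "big",
    (AL_BAYT,): "the house",
    (KABIR, AL_QAWM): "chief of the people",
}
MAX_LEN = max(len(key) for key in LEXICON)


def gloss(sentence: str) -> List[Tuple[str, Optional[str]]]:
    """Greedy longest-match lookup: prefer a phrase entry over word-by-word entries."""
    tokens = sentence.split()
    result: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i : i + n])
            if key in LEXICON:
                result.append((" ".join(key), LEXICON[key]))
                i += n
                break
        else:
            # Unknown token: keep it, but without a gloss.
            result.append((tokens[i], None))
            i += 1
    return result


if __name__ == "__main__":
    print(gloss(f"{AL_BAYT} {KABIR}"))
    print(gloss(f"{KABIR} {AL_QAWM}"))
```

With this lookup the word كبير is glossed as "big" in the first sentence and as part of "chief of the people" in the second, which is the distinction a translation system needs to make.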

References:

Challenges in Arabic Natural Language Processing
