A Hybrid Algorithm for Recognizing the Position of Ezafe Constructions in Persian Texts

Abstract: In the Persian language, an Ezafe construction is a linking element which joins the head of a phrase to its modifiers. The Ezafe in its simplest form is pronounced as -e, but is generally not indicated in writing. Determining the position of an Ezafe is advantageous for disambiguating the boundaries of syntactic phrases, which is a fundamental task in most natural language processing applications. This paper introduces a framework for combining genetic algorithms with rule-based models that brings the advantages of both approaches and overcomes their problems. This framework was used for recognizing the position of Ezafe constructions in Persian written texts. In the first stage, the rule-based model was applied to tag some tokens of an input sentence. Then, in the second stage, the search capabilities of the genetic algorithm were used to assign the Ezafe tag to untagged tokens using the previously captured training information. The proposed framework was evaluated on the Peykareh corpus and achieved 95.26 percent accuracy. Test results show that the proposed approach outperformed other approaches for recognizing the position of Ezafe constructions.


I. INTRODUCTION
The "Ezafe" 1 is a Persian grammatical construct which links two words together. Ezafe means "addition" and is an unstressed vowel -e which marks genitive cases. The constructs linked by the Ezafe particle are known as "Ezafe constructions". Some common uses of the Persian Ezafe are [1]:
• a noun before an adjective: e.g. توپ قرمز 2 (tu:p-Ezafe Germez) "red ball"
• a noun before a possessor: e.g. کتاب علی (ketã:b-Ezafe Ali:) "Ali's book"
• some prepositions before nouns: e.g. زیر میز (zi:r-Ezafe mi:z) "under the table"
The Ezafe in its simplest form is pronounced as -e, but is generally not indicated in writing. In some cases the Ezafe has an explicit sign in writing. For example, with nouns ending in ا (ã:) or و (ou), the Ezafe appears as a ی (j) at the end; with nouns ending in a silent ه (h) (short e followed by a mute h), the Ezafe may appear as a superscript ء (hamze) or a ی (j).
1 It is also known as Kasreh.
2 For each Persian word or phrase we wrote its transliteration within parentheses and its English meaning within double quotes. The International Phonetic Alphabet (IPA) was used to represent Persian pronunciations.
The Ezafe is also found in Urdu [2], Kurdish [3] and Turkish [4]. The Persian Ezafe has been discussed extensively [5]-[8]. This construction raises several issues in syntax and morphology. There are three views on the function of the Ezafe in the literature: (1) the Ezafe is a case marker [9], (2) the Ezafe is inserted at PF to identify constituenthood [10], and (3) the Ezafe is a phrasal affix [11].
Determining the position of an Ezafe construct may facilitate text processing activities in natural language processing (NLP) applications, such as segmenting a phrase or detecting the head word of a phrase [12]. Moreover, recognizing words which need an Ezafe is advantageous for tokenization [13], morphological analysis, and syntax parsing [14], and it is essential for speech synthesis [15].
Some NLP tasks in Persian, such as machine translation [16], construction of morphological lexicons [14], and grammar construction [17], have benefited from the availability of an Ezafe construction. However, they have determined the position of the Ezafe manually, exploited cases in which the Ezafe is visually represented, or extracted some insertion rules which are not general; therefore, they could not determine the Ezafe tags for all tokens in a text.
The Persian Ezafe has been discussed extensively in theory [5], [18], but there are few works on the automatic detection of this construction in Persian texts. The most completely reported works on this subject are one work based on probabilistic context free grammar (PCFG) [19] and another based on classification and regression tree (CART) [20]. The former uses a bank including trees of noun groups in Persian for training PCFG. Then, a bottom-up parser extracts the most probable noun groups of the input. Finally, using lexical analysis, the system determines which words need an Ezafe. The disadvantage of this method is that writing a PCFG requires a large amount of linguistic knowledge. In addition, it is not sensitive to lexical information. The latter uses morphosyntactic features of words to train and construct binary classification trees to predict the presence or absence of an Ezafe between two adjacent words. In fact, there are two kinds of rules: rules which predict Ezafe words, and rules which predict non-Ezafe words. Although this method can predict the absence of an Ezafe with high accuracy, it is not sufficient in detecting words which need an Ezafe. In other words, the rules which predict the non-Ezafe words act more precisely.
The main contributions of this paper are (1) introducing a framework for combining genetic algorithms with rule-based models, and (2) using the proposed framework to develop an Ezafe tagger.
Combining genetic algorithms with rule-based models brings the advantages of both approaches and overcomes their problems. Genetic algorithms can detect general patterns in text, but sometimes they cannot handle exceptions and special cases. In such cases, rule-based models can provide significant improvements by defining rules for handling special cases and exceptions. In our proposed framework, the linguistic rules of the rule-based model were extended by analyzing the errors of the genetic algorithm and defining new rules to handle these errors (called correction rules). A purely rule-based model, in contrast, needs a great deal of knowledge external to the corpus that only linguistic experts can provide. Acquiring rules through interviews with experts is cumbersome and time-consuming. Furthermore, certain application domains are very complicated and may require a large number of rules, so the acquired rules may be incomplete or only partially correct. To overcome these problems, we can handle general patterns with the genetic algorithm and define correction rules only for special cases, which reduces the time required from human experts. We can also define a set of general rules besides the correction rules in order to reduce the run time of the genetic algorithm.
There is a remarkable amount of ongoing research on applying machine learning approaches to different NLP tasks in the English language. Most machine learning approaches, such as methods based on hidden Markov models (HMM), use information extracted from a tagged corpus to assign a suitable tag to each word according to the preceding tags. Since these approaches are purely statistical, they are most suitable when the corpus is large enough to contain all possible combinations of n-grams. In contrast, evolutionary algorithms offer a more general method that can be applied to any statistical model. For example, they can perform tagging according to the Markov model (tag prediction for the current word based on preceding tags) or improve the Markov results by using more contextual information (for example, the tags of preceding or following words). In other words, an HMM or another model can be used as part of the fitness function of a genetic algorithm. Therefore, a genetic algorithm provides more flexibility than classical approaches such as HMM-based methods. In addition, the effectiveness of hybrid approaches has been demonstrated in different NLP tasks [21]. Thus a hybrid approach for determining the position of Ezafe constructions was chosen for this study.
Results of the tests in this study show that our proposed algorithm outperformed other algorithms for Persian Ezafe tagging as well as the classical HMM-based method.
The rest of the paper is organized as follows. Section 2 introduces the annotated corpus of Persian texts. Section 3 explains our proposed model. Experiment results are discussed in section 4. Finally, section 5 concludes the paper.

II. THE CORPUS
An annotated corpus of Persian text is needed in order to train and evaluate the Ezafe tagger. This corpus must be annotated with POS and Ezafe tags. For the current work, a subset of the Persian POS-tagged corpus known as Peykareh 3 [21] was used. This collection was gathered from daily news and common texts and contained about 2.6 million manually tagged tokens. The main corpus was tagged with a rich set of 550 different POS tags, from which 20 tags were selected for the system used in this study. The selected tags were those that can be detected by present Persian POS taggers.
The tagged corpus was divided into three sets: (1) a training set including 423,721 tokens, (2) a held-out data set containing 1,010,375 tokens, and (3) a test set containing 39,850 tokens.
A big portion of the Peykareh corpus was set aside as a held-out dataset. The held-out dataset was used to find exceptions to general rules in the rule-based model. Since the exceptions occur only rarely, much more data was needed to determine the exceptions. Furthermore, to determine the classes of conjunctions and prepositions which never take an Ezafe or always require an Ezafe, the held-out data set was searched. Thus, a sufficiently large data set was needed to find as many words as possible.

III. THE PROPOSED ALGORITHM
This paper proposes a hybrid approach to determine the position of Ezafe constructions in Persian texts. The Ezafe tagger contained two phases. The first phase used the rule-based model to tag as many words as possible. The second phase then ran the genetic algorithm to assign tags to the tokens which had not been assigned an Ezafe tag in the first phase. Because the rule-based model tagged many tokens in advance, fewer tokens were left for the genetic algorithm, which therefore ran faster.
The Ezafe tagger assigned each word of an input sentence with one of two tags: true or false. Tag true for a word meant that it requires an Ezafe, and the tag false meant it does not require one.

A. The rule-based model
Initially, some general rules such as "verbs do not take an Ezafe" were defined using linguistic knowledge. Although the genetic algorithm could detect these tags correctly by training on annotated examples, we preferred to define such rules in order to reduce the run time of the algorithm. The more tokens the rule-based model tagged, the shorter the chromosomes and the fewer the generations that were needed.
Next, the exceptions to each rule were explored on the held-out data, and new rules were defined to handle them. This process was repeated: if the newly generated rules had exceptions, further rules were defined. In some cases we could not find suitable rules to fix errors. In these cases, probabilistic rules were used.
The genetic algorithm tagged tokens according to the context; however, experiments showed that words which appeared in an infrequent context usually took an incorrect tag from the genetic algorithm. We tried to handle these cases through hand-crafted rules. Thus, the initial rule-based model and the genetic algorithm were run on the held-out data, and errors were analyzed to introduce rules that would fix them.
In this way, a set of 53 hand-crafted rules was developed. Then, the most suitable sequence of rules was determined in terms of avoiding bleeding and creeping; in fact, a tree was constructed. The first level of the tree contained some general rules. Level 2 consisted of rules for handling exceptions to the first level, and so on. Each node in level i handled an exception to the rule of its parent node in level i-1 (if that rule had exceptions).
In the first stage of the proposed algorithm, the rules were taken from the rule set one at a time, and a rule's action was performed only if the rule was applicable to the input word.
Rules were categorized according to various dimensions:
• Deterministic vs. probabilistic rules: Deterministic rules are those which are always valid and correct; probabilistic rules may have exceptions. In other words, probabilistic rules are valid most of the time, but as they may have exceptions we apply them with a probability of less than 100%. This probability is extracted from the corpus.
• Negative vs. positive rules: Negative rules find and tag negative examples, which are structures that never take an Ezafe, while positive rules determine structures which need an Ezafe.
• Syntactic, morphological and lexical rules: Syntactic rules use the part-of-speech tags of words in a sequence (either the target word or a neighboring word) to determine the Ezafe tag, morphological rules consider the internal and morphological structure of a word, and lexical rules consider the actual word forms.
In the rest of this section these categories are discussed in more detail, and some examples are given from each category.

1) Syntactic Rules
Some POS categories enforced a special tag on words or on neighboring words. The accusative case marker را (rã:) and verbs were among this set.
• Verbs
o In the Persian language the Ezafe is not used with verbs. The following rule dictated that verbs, which were shown in the corpus with the POS tag V, never take an Ezafe.
If POS(X) = V Then EZ-Tag(X) = false
o If a verb appears as a stand-alone word, the word before it does not take an Ezafe. We represented this by the following rule:
If POS(X) = V Then EZ-Tag(X-1) = false
o If the verb is not a stand-alone word and appears as an attachment (enclitic) to another word, then the previous word (before the combination) may take an Ezafe. This is the case for some of the enclitics representing the copula verb 'to be', such as ی (i:) "to be, second person singular" and م (aem) "to be, first person singular". These enclitics are ambiguous and, in addition to copula verbs, can be interpreted as an indefinite marker or as a first person singular possessive pronoun, respectively. For example, the word شاعری (∫ã:?eri:) may mean شاعر هستی (∫ã:?er haesti:) "you are a poet" or یک شاعر (jek ∫ã:?er) "a poet". In the first case, even though the whole word was tagged as a verb in the corpus, it is actually a combination of a noun and a verb. Even though its verb part and the noun part before it do not take an Ezafe, the word before the noun part may take one. As another example, in the following sentence the word دولتم (dolaetaem) "I'm government" is an abbreviation of دولت هستم (dolaet haestaem) "I am government". In the Peykareh corpus, this word was tagged as a verb with POS tag V,AJCC. However, the previous word takes an Ezafe.
من در استخدام دولتم. "I am a government employee."
Thus, we used POS tag V, ACJJ for this kind of verb to prevent the previous rule from being applied to it.
• The accusative case marker را (rã:)
o The Persian language has an accusative case marker را (rã:) that follows the direct object, adverb or prepositional object. The following rule dictated that the accusative case marker, which was shown in the corpus with the POS tag POSTP, never takes an Ezafe.
If POS(X) = POSTP Then EZ-Tag(X) = false
o The word before را (rã:) does not take an Ezafe either.
If POS(X) = POSTP Then EZ-Tag(X-1) = false
o In some cases the accusative case marker is attached to the previous noun or pronoun. For example, the word مرا 4 (maerã:) is an abbreviation of من را (maen rã:), in which را (rã:) is an object marker and من (maen) "me" is a pronoun. This rule was written as follows:
If postfix(X) = accusative-case-marker Then EZ-Tag(X) = false
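The POS-based rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function name `apply_syntactic_rules`, the `(word, POS)` token layout and the tag string `ENCLITIC_VERB_TAG` are assumptions introduced here, and the example words are Latin placeholders. Tokens for which no rule fires are left as `None`, so that a later stage can tag them.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag); this layout is an assumption

# Hypothetical tag for verbs that are enclitics attached to a noun.
ENCLITIC_VERB_TAG = "V,ENCLITIC"


def apply_syntactic_rules(tokens: List[Token]) -> List[Optional[bool]]:
    """Return one Ezafe tag per token: True, False, or None if no rule fired."""
    tags: List[Optional[bool]] = [None] * len(tokens)
    for i, (_word, pos) in enumerate(tokens):
        if pos in ("V", "POSTP"):
            tags[i] = False          # EZ-Tag(X) = false
            if i > 0:
                tags[i - 1] = False  # EZ-Tag(X-1) = false
        elif pos == ENCLITIC_VERB_TAG:
            # The combined word takes no Ezafe, but the word before it may,
            # so the previous token is deliberately left untouched.
            tags[i] = False
    return tags


# Placeholder sentences: object + accusative marker + verb, and a sentence
# ending in a noun with an enclitic copula.
plain = apply_syntactic_rules([("ketab", "N"), ("ra", "POSTP"), ("xand", "V")])
enclitic = apply_syntactic_rules([("estexdam", "N"), ("dolatam", ENCLITIC_VERB_TAG)])
```

In the second sentence the token before the enclitic verb stays untagged, which mirrors the exception described above.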

2) Morphological rules
The following rules are examples of morphological rules that determine the Ezafe tag from the form of a word.
• When a word ending in the plural suffix ها (hã:) needs the Ezafe, the letter ی (j) must be attached to the end of the word in writing. Thus, if a plural word ends in ها (hã:), this word does not take an Ezafe unless it is followed by a ی (j) clitic.
If postfix(X) = ها Then EZ-Tag(X) = false
Consider the following example. The word نامه ها (nã:me hã:) "letters" is the plural form of نامه (nã:me) "letter". When this word requires an Ezafe, we add ی (j) at the end of it.
• If the last character of a word is اً (Tanvin) 5, then it does not take an Ezafe.
4 Sometimes it means 'my' and other times it means 'me'.

3) Lexical rules
Lexical rules consider the actual form of words, as shown in the following examples:
• Prepositions
Reference [23] showed that the class of prepositions in the Persian language is not uniform with respect to the Ezafe. Some prepositions reject the Ezafe (these prepositions were called Class P1), while others either permit or require it. We divided the latter group into two classes. The first class, which always requires an Ezafe, was called Class P2. The other class, which permits an Ezafe but does not necessarily require one, was called Class P3. Table I shows some examples of each class. We applied the following rules to handle prepositions:
If POS(X) = P and WORD(X) ∈ ClassP1 Then EZ-Tag(X) = false
If POS(X) = P and WORD(X) ∈ ClassP2 Then EZ-Tag(X) = true

TABLE I. Examples of preposition classes
Class P1: به (be) "to", از (aez) "from", با (bã:) "with", در (daer) "in, on"
Class P2: وسط (vaesaet) "in the middle", دور (du:r) "around", بیرون (bi:ru:n) "outside", داخل (dã:xel) "inside"
Class P3: زیر (zi:r) "under", رو (ru:) "on", بالا (bã:lã:) "up", جلو (dʒoulou) "in front of"

• Conjunctions
As with prepositions, we divided conjunctions into two classes. Some conjunctions never take an Ezafe (these were called Class C1), while others always take an Ezafe (these were called Class C2). In order to determine these classes we searched 300 files of the Peykareh corpus which were selected as held-out data. Some examples of Class C1 and Class C2 are presented in Table II. The following rules were applied to conjunctions:
If POS(X) = CONJ and WORD(X) ∈ ClassC1 Then EZ-Tag(X) = false
If POS(X) = CONJ and WORD(X) ∈ ClassC2 Then EZ-Tag(X) = true

TABLE II. Examples of conjunctions
و (vae) "and", زیرا (zi:rã:) "because", یا (jã:) "or", که (ke) "that", یعنی (jaeni:) "means"

4) Probabilistic rules
We also defined 5 probabilistic rules, which were correct and valid in most cases but had exceptions in a few cases. When each rule was defined, its probability was calculated from the corpus. The lowest probability among these rules was 0.95. Here, we discuss some of the probabilistic rules.
• Long vowels
There are three long vowels in Persian: ا (ã:), ی (i:) and و (u:). Generally, when a word ending in ا (ã:) or و (u:) needs an Ezafe, the letter ی (j) is added to the end of it. However, this rule has some exceptions.
In the case of ا (ã:), these exceptions happen when أ (Alef Hamze) is replaced by ا (ã:) (single alef). Alef Hamze is a single Arabic character that represents the two-character combination of Alef plus Hamze, and in Persian writing it is sometimes replaced by the letter ا (ã:). Consider the following example:
آقای احمدی منشا فساد را فقر می داند. "Mr. Ahmadi believes that the source of evil is poverty."
The word آقا (ã:Gã:) "Mr." takes a ی (j) at the end because it requires the Ezafe; however, the word منشا (maen∫ae) "source" also ends in ا (ã:) and needs the Ezafe, but it does not get a ی (j). In fact, the last character of this word is أ (Alef Hamze), which is written the same as ا (ã:).
To compute the probability of the rule, the algorithm searched the held-out data set and computed the percentage of words ending with ا (ã:) and taking the Ezafe which had the letter ی (j) added. In other words, this probability was computed by the following formula:
$$P = \frac{\text{count}(\text{words ending in ã: that take the Ezafe and have an added j})}{\text{count}(\text{words ending in ã: that take the Ezafe})} \quad (1)$$
Thus, the following rule was defined with a 95% probability:
If LastChars(X) = ای Then with a 0.95 probability EZ-Tag(X) = true
In the same way, the following rules were defined:
If LastChars(X) = ا Then with a 0.9978 probability EZ-Tag(X) = false
If LastChars(X) = وی Then with a 0.96 probability EZ-Tag(X) = true

• Tanvin
Tanvin is a sign which is derived from Arabic. The following rule says that, with a probability of 96.24%, the word preceding a word that has اً (Tanvin) as its final character does not take an Ezafe:
If LastChars(X) = اً Then with a 0.9624 probability EZ-Tag(X-1) = false
Frequency counts for the rule categories are shown in Table IV. After running the rule-based model, some of the tokens remained untagged. Thus, a genetic algorithm was used to tag the remaining words.
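As a rough illustration of how a probability can be attached to a probabilistic rule, the sketch below counts, over a held-out list of gold-tagged words, how often a rule's conclusion agrees with the gold Ezafe tag among the words matching its condition. This is an assumed reading, not the exact procedure used here: the function `rule_probability`, the `(word, gold_tag)` layout and the Latin toy words are all hypothetical.

```python
from typing import Callable, Iterable, Tuple

TaggedWord = Tuple[str, bool]  # (word, gold Ezafe tag); assumed layout


def rule_probability(corpus: Iterable[TaggedWord],
                     condition: Callable[[str], bool],
                     predicted_tag: bool) -> float:
    """Fraction of words matching `condition` whose gold tag equals `predicted_tag`."""
    matched = 0
    correct = 0
    for word, gold in corpus:
        if condition(word):
            matched += 1
            if gold == predicted_tag:
                correct += 1
    if matched == 0:
        raise ValueError("the rule condition never matched in the corpus")
    return correct / matched


# Toy held-out data in Latin transliteration, for illustration only.
held_out = [("aghaye", True), ("xodaye", True), ("jaye", False), ("ketab", False)]
p = rule_probability(held_out, lambda w: w.endswith("aye"), predicted_tag=True)
```

A rule whose estimated probability falls below a chosen threshold could then be discarded or refined with an exception rule; the threshold itself would be a design choice.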

B. Genetic tagging algorithm
The proposed genetic algorithm receives a natural language sentence and assigns a corresponding tag according to previously computed training information from the annotated corpus. Formally, given a sequence of n words and corresponding POS tags, the aim is to find the most probable Ezafe tag sequence.
In our implementation, each gene can take one of two values: true or false. Each individual assigns an Ezafe tag to every token of a given sentence: some tokens get their Ezafe tag from the rule-based model, and the others get it from the genetic algorithm.
The initial population was created randomly by assigning a random value to each untagged gene (the remaining genes had already been assigned Ezafe tags by the rule-based model). The individuals were then sorted by fitness value from high to low.
Three genetic operations were used for producing the next generation.
• Selection: All individuals in the population were sorted according to fitness, so the first individual was the fittest in the generation. To perform crossover, the ith and (i+1)th individuals of the current generation were selected, where i=1,2,...,[(p+1)/2] and p was the population size. The aim of selection was to choose the fitter individuals.
• Crossover: Given two selected chromosomes, crossover exchanged portions of the pair at a randomly chosen point called the crossover point.
• Mutation: An untagged gene was selected randomly and its value was toggled; for example, if its value was true, it was reset to false, and vice versa.
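The three operations can be put together in a compact sketch. It is a simplified illustration under several assumptions, not the system evaluated in this paper: the names `evolve`, `new_individual`, `crossover` and `mutate` are hypothetical, the fitness function is supplied by the caller, keeping the best individual of each generation (elitism) is an added assumption, and the default number of generations is kept small for readability. Genes already tagged by the rule-based model are passed in as fixed values and are never changed.

```python
import random
from typing import Callable, List, Optional, Tuple

Individual = List[bool]


def new_individual(fixed: List[Optional[bool]], rng: random.Random) -> Individual:
    # Genes fixed by the rule-based phase are copied; the rest are random.
    return [rng.random() < 0.5 if g is None else g for g in fixed]


def crossover(a: Individual, b: Individual,
              rng: random.Random) -> Tuple[Individual, Individual]:
    # One-point crossover. Parents agree on fixed genes, so children do too.
    point = rng.randrange(1, len(a)) if len(a) > 1 else 0
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(ind: Individual, free: List[int], rng: random.Random) -> None:
    # Toggle one randomly chosen gene among those left untagged by the rules.
    if free:
        i = rng.choice(free)
        ind[i] = not ind[i]


def evolve(fixed: List[Optional[bool]],
           fitness: Callable[[Individual], float],
           pop_size: int = 50, generations: int = 100,
           p_cross: float = 0.6, p_mut: float = 0.05,
           seed: int = 0) -> Individual:
    rng = random.Random(seed)
    free = [i for i, g in enumerate(fixed) if g is None]
    pop = [new_individual(fixed, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)      # fittest first
        children: List[Individual] = []
        for i in range((pop_size + 1) // 2):     # pair i-th with (i+1)-th
            a, b = pop[i], pop[min(i + 1, pop_size - 1)]
            if rng.random() < p_cross:
                a, b = crossover(a, b, rng)
            else:
                a, b = a[:], b[:]                # copy so parents stay intact
            for child in (a, b):
                if rng.random() < p_mut:
                    mutate(child, free, rng)
                children.append(child)
        # Elitism (an assumption of this sketch): keep the current best.
        pop = sorted(pop[:1] + children, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)


# Toy run: None marks tokens left untagged by the rules; the "target" tags
# exist only to give the toy fitness something to reward.
fixed = [False, None, None, False, None]
target = [False, True, False, False, True]
toy_fitness = lambda ind: float(sum(x == y for x, y in zip(ind, target)))
best = evolve(fixed, toy_fitness, pop_size=20, generations=30)
```

Because both parents carry the same rule-assigned values, one-point crossover cannot disturb them, and mutation is restricted to the free positions; this is one simple way to keep the output of the first phase intact.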

1) Fitness Functions
To evaluate the quality of the Ezafe tags generated for an individual, four functions were used: F1, F2, F3 and F4. These functions considered the context in which a word appeared. The context consisted of the current word, one tag to the left and another to the right, and the previous and next words.
F1 considered the sequence of POS tags of a sentence. The probability of the sequence of POS tags of a sequence of n words was defined by Eq. (2), where $P(Ez \mid POS_i, POS_{i+1})$ represents the probability that the current word with the tag $POS_i$ gets the Ezafe when the next word has the tag $POS_{i+1}$. The probability of assigning the Ezafe to a word given the next POS tag was computed as:
$$P(Ez \mid POS_i, POS_{i+1}) = \frac{count(POS_i^{Ez}, POS_{i+1})}{count(POS_i, POS_{i+1})} \quad (3)$$
where $count(POS_i, POS_{i+1})$ was the number of occurrences of the sequence $POS_i\,POS_{i+1}$ within the training corpus, and $count(POS_i^{Ez}, POS_{i+1})$ was the number of those occurrences in which the first token has an Ezafe within the same corpus. In order to compute the F1 function, the HMM model can be used with the Viterbi algorithm [24].
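The count-based estimate of the probability that a word takes the Ezafe given its own POS tag and the next POS tag can be sketched as follows. The function name `estimate_pos_bigram_probs` and the `(POS tag, gold Ezafe tag)` sentence layout are assumptions made for this illustration, and the three sentences are toy data rather than corpus excerpts.

```python
from collections import Counter
from typing import Dict, List, Tuple

TaggedToken = Tuple[str, bool]  # (POS tag, gold Ezafe tag); assumed layout


def estimate_pos_bigram_probs(
        sentences: List[List[TaggedToken]]) -> Dict[Tuple[str, str], float]:
    """Relative frequency of an Ezafe on the first token of each POS bigram."""
    pair_counts: Counter = Counter()
    ezafe_counts: Counter = Counter()
    for sent in sentences:
        # Walk over adjacent token pairs within one sentence.
        for (pos, has_ezafe), (next_pos, _) in zip(sent, sent[1:]):
            pair_counts[(pos, next_pos)] += 1
            if has_ezafe:
                ezafe_counts[(pos, next_pos)] += 1
    return {pair: ezafe_counts[pair] / n for pair, n in pair_counts.items()}


# Toy training data, for illustration only.
train = [
    [("N", True), ("ADJ", False), ("V", False)],
    [("N", False), ("ADJ", False)],
    [("N", True), ("N", False)],
]
probs = estimate_pos_bigram_probs(train)
```

On this toy data the pair (N, ADJ) occurs twice and carries an Ezafe once, so its estimate is 0.5. Unseen POS pairs are simply absent from the dictionary; how to smooth them is left open here.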
For computing the F2 function, a data-driven approach was applied to calculate the probability that a specific word has the Ezafe (Eq. 4). F2 was defined because some words in Persian are mostly assigned a particular tag. For example, the word تقریبا (taeGri:baen) "approximately" never takes an Ezafe.
The F3 function was the probability that a token gets an Ezafe when it occurs before a specific word in the training corpus (Eq. 5). This function was defined to handle compound words such as اختلاف نظر (extelã:f-Ezafe naezaer) "difference in opinion".
In Persian, some words such as سایر (sã:jer) "other" get the Ezafe most of the time. Therefore, we defined the F4 function to consider these words. The F4 function was the probability that a specific word occurs after a word with the Ezafe in the training corpus.
The fitness function used to evaluate individuals in the genetic algorithm combined F1, F2, F3 and F4 through the parameters w1, w2, w3 and w4. These are constant parameters chosen from [0,1) which show the relative importance of syntactic and lexical information. The value 0 was treated as legal so that the effect of removing one or more functions from the formula could be shown. To adjust the parameters of the fitness function, variable structure learning automata were applied to chunked held-out data; see [25] for more information. Finally, the values of w1, w2, w3 and w4 were set to 0.8, 0.5, 0.1 and 0.1 respectively.
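One plausible way of combining the four functions with the reported weights is a weighted sum, sketched below. The exact form of the fitness formula is not reproduced in this text, so the weighted sum, the function name `combined_fitness` and the example scores are all assumptions made for illustration; only the weight values 0.8, 0.5, 0.1 and 0.1 come from the text above.

```python
from typing import Sequence

# Weights w1..w4 as reported in the text.
REPORTED_WEIGHTS = (0.8, 0.5, 0.1, 0.1)


def combined_fitness(scores: Sequence[float],
                     weights: Sequence[float] = REPORTED_WEIGHTS) -> float:
    """Weighted sum of the component scores (F1, F2, F3, F4).

    The additive form is an assumption of this sketch.
    """
    if len(scores) != len(weights):
        raise ValueError("one weight is needed per component score")
    return sum(w * f for w, f in zip(weights, scores))


# Hypothetical component scores for one individual.
full = combined_fitness([0.5, 0.2, 0.9, 0.9])
# Setting a weight to 0 removes the corresponding function from the formula.
without_f3_f4 = combined_fitness([0.5, 0.2, 0.9, 0.9], (0.8, 0.5, 0.0, 0.0))
```

Under this additive reading, a weight of 0 drops a component without affecting the others, which matches the stated purpose of allowing 0 as a legal value.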

IV. EXPERIMENTS
The proposed algorithm was implemented in Java and was run on a Pentium IV processor. First, the rule-based model was run, followed by the genetic algorithm, and the best solution was selected. Approximately 78% of the tokens were tagged by the rule-based model, because about 80% of the tokens selected as test data did not require an Ezafe, and most of them were tagged by the rule-based system.
To evaluate the performance of our proposed algorithm, three measures were taken: accuracy (the percentage of correctly tagged tokens), precision (the percentage of predicted tags that were correct) and recall (the percentage of predictable tags that were found).
Since performance was related to both precision and recall, the F-measure was given as the final evaluation.
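The measures above can be computed from gold and predicted Ezafe tags as in the following sketch. The function name `evaluate` is hypothetical, and the F-measure is assumed to be the balanced harmonic mean of precision and recall, since no other weighting is stated. Precision and recall are computed for the tag true (the word takes an Ezafe).

```python
from typing import Sequence, Tuple


def evaluate(gold: Sequence[bool],
             predicted: Sequence[bool]) -> Tuple[float, float, float, float]:
    """Return (accuracy, precision, recall, F-measure) for the tag True."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    pairs = list(zip(gold, predicted))
    tp = sum(g and p for g, p in pairs)            # Ezafe predicted and present
    fp = sum((not g) and p for g, p in pairs)      # Ezafe predicted but absent
    fn = sum(g and (not p) for g, p in pairs)      # Ezafe present but missed
    correct = sum(g == p for g, p in pairs)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure


# Toy data in which 15.79% of the tokens take an Ezafe, tagged by a
# baseline that assigns the Ezafe to every token.
gold_tags = [True] * 1579 + [False] * 8421
acc, prec, rec, f1 = evaluate(gold_tags, [True] * 10000)
```

With every token tagged true, recall is 1 while precision equals the proportion of Ezafe tokens, which is why accuracy alone is not a sufficient measure for such a skewed tag distribution.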

A. Tuning Parameters of the Genetic Algorithm
The efficiency of a genetic algorithm greatly depends on how its parameters are tuned. To adjust the genetic parameters, a subset of 34,832 tokens from held-out data set was selected. Then, the proposed algorithm was run on this set.
Beginning with a baseline configuration based on De Jong's settings [26], with 1000 generations, 50 chromosomes in each generation and a crossover probability of 0.6, the algorithm was run for different mutation probabilities (Pm) from 0.01 to 0.3. Fig. 1 shows that the best results were obtained using a mutation probability of 0.05. In the same way, the crossover probability was set to 0.6. Fig. 2 shows the results of running the genetic algorithm with mutation probability 0.05, crossover probability 0.6, population size 50 and different numbers of generations.

B. Effectiveness of the Proposed Algorithm
The experiment applied 423,721 annotated tokens as the training set and 39,850 tokens as the test set. Parameter settings shown in Table V were used for the genetic algorithm.
Table VI compares our approach with a baseline method and with other available methods based on PCFG [19] and CART [20]. We also implemented the binary Markov model with Viterbi decoding (a typical algorithm widely used for stochastic tagging). As can be seen, our proposed algorithm outperformed these algorithms in terms of F-measure. The baseline assumed that all words have an Ezafe, resulting in 100% recall but very low precision (15.79%). We could define another baseline where no word has the Ezafe tag. In this case, we would achieve 84.21% precision, but the recall would be 0.
Since we had no access to the corpus that was used for training in the CART method [20] or to a description of the exact features used, we could not reproduce the exact results. For this reason, we used two approaches for comparison. In the first approach, we compared the results of our proposed method with the best results reported in [20]; in the second, we implemented the CART method using the same features as our proposed method and tested it on our test corpus. Since the performance of the second approach was much lower than what was reported in [20], we only present the results of the first approach in Table VI.
In the above-mentioned experiments, correct POS tags were used, because the results of the proposed algorithm were compared to those of other available Ezafe taggers, and these taggers had used correct tags. With POS tags produced by the TnT tagger, the proposed algorithm achieved 95.08% accuracy, while with correct tags it achieved an accuracy of 95.49%. This indicates that POS tagging errors decreased the Ezafe detection accuracy by only about 0.41 percentage points. The reason is that both the rule-based model and the genetic algorithm consider other features besides POS tags, and these features can, to some extent, compensate for the errors of the POS tagger.
Considering accuracy as the percentage of correctly assigned tags, we evaluated the performance of the proposed algorithm from two different aspects: (1) the overall accuracy, taking all tokens in the test corpus into account, and (2) the accuracy for words with an Ezafe and without an Ezafe, respectively. Table VII shows that the overall accuracy of the proposed algorithm was around 95.26%. Additionally, the accuracy for detecting words without an Ezafe was significantly higher than that for words with an Ezafe (96.89% versus 88.81%).
Table IX compares the accuracy of the rule-based model with that of the genetic algorithm. In RBM1, we ran the rule-based model and assigned the false tag to tokens which did not get an Ezafe tag after applying the rules. In contrast, the untagged tokens got true tags in RBM2. We also ran the genetic algorithm alone (without the rule-based model). Results show that the combination of the rule-based model and the genetic algorithm outperformed both individual algorithms. As might be expected, the main problem of the RBM models was missing rules, which caused some tokens to remain untagged, and the main problem of the genetic algorithm was special cases that could not be handled by general patterns.
Since the ratio of words with an Ezafe to words without an Ezafe was low, the Kappa coefficient was used to evaluate the proposed algorithm. This measure was first suggested for linguistic classification tasks [27] and has since been used to avoid dependency of the score on the proportion of non-breaks in the text. The Kappa coefficient (K) was calculated as:
$$K = \frac{\Pr(A) - \Pr(E)}{1 - \Pr(E)} \quad (9)$$
where Pr(A) was the accuracy, and Pr(E) was the ratio of words without an Ezafe to total words.
In the final experiment, we assessed the impact of training corpus size on the performance of the proposed algorithm. The corpus size was reduced gradually until it reached 32% of the initial training corpus size. The results are presented in Fig. 3.
As can be seen, the proposed algorithm's accuracy did not show a significant drop when the training corpus size was reduced from 100% to 60%.

V. CONCLUSION
This paper proposes a framework for recognizing the position of Ezafe constructions in Persian written texts that combines genetic algorithms with rule-based models. Genetic algorithms provide a search strategy for learning general Ezafe patterns in text by optimizing a measure of probability that is effective globally, while the rule-based model handles special cases and exceptions to the general patterns. Results of the tests reported in this study show that the proposed algorithm outperformed other algorithms for Persian Ezafe tagging as well as classical HMM-based methods.
Although this paper presents an algorithm for Persian Ezafe tagging, the principles can be applied to other NLP tasks such as POS tagging or chunking in any language. A genetic algorithm can be used for any language to find common statistical patterns for tagging. Obviously, there may be exceptions to these patterns, so some rules are defined to handle exceptions in the rule-based model that serve to improve performance of the genetic algorithm. In fact, combining a genetic algorithm with the rule-based model improves performance of the tagging process.
In addition, we showed that the accuracy of the proposed algorithm does not depend highly on the training corpus size. This feature is advantageous for practical applications, because annotating training corpora for text analysis purposes is an extremely demanding task.
In future work, the linguistic rules may be extended by analyzing errors on the test data. It was also observed that the input Ezafe tags have a major influence on accuracy. Errors in the training data have caused some problems, and these can be reduced by correcting the training data.
In addition, we intend to add new attributes to the fitness function of the genetic algorithm. One advantage of the genetic algorithm over classical approaches such as HMM-based methods is that new attributes can be added to the system, which makes it possible to examine the effect of different attributes on tagging without altering the system's basic structure. Thus, new attributes will be added to the fitness function of the genetic algorithm and their effects on tagging accuracy will be evaluated.
It was also observed that accuracy is strongly influenced by the input POS tag set. A richer tag set with more POS information produces more accurate results. For example, we can attach additional information to the POS tag of a noun, such as time, location, and so on. In addition, [5] describes a class of lexical words called eventive adjectives which, in contrast with other lexical words, cannot co-occur with an Ezafe. We are going to enrich the tag set with more POS tags, such as eventive adjectives, in order to define more accurate rules.