Segmentation-free Word Spotting for Handwritten Arabic Documents

Abstract: In this paper, we present an unsupervised, segmentation-free method for spotting and searching queries in document images, especially handwritten Arabic documents. Histograms of Oriented Gradients (HOGs) are used as feature vectors to represent the query and the document images. We then compress the descriptors with the product quantization method. Finally, a better representation of the query is obtained by using Support Vector Machines (SVM).


I. Introduction
The search for information in Arabic manuscripts is not a simple process. For this reason, the design of recognition systems is expanding rapidly and appears necessary in order to exploit the wealth of information contained in ancient manuscripts. Repeated manual handling of fragile documents can destroy them, which is why many digitization projects have been developed, such as DEBORA (Digital access to Books Of Renaissance) [1], EAMMS (Electronic Access to Medieval Manuscripts) [2], BAMBI (Better Access to Manuscripts and Browsing of Images) [3] and MASTER (Manuscript Access Through Standards for Electronic Records) [4], among others; these projects treat Latin scripts. For this reason, and with the aim of developing a complete recognition system for Arabic handwriting, this article presents the first step towards such a system. The literature shows that most keyword spotting methods fall into one of the following two categories:
• Learning-based methods use supervised machine learning techniques to train models of the words that the user wants to spot.
• Example-based methods receive as input an instance of the word that the user wants to retrieve.
Most authors prefer learning-based approaches for applications where the keywords to spot are known a priori and fixed. In some handwritten Arabic documents, the segmentation step is not easy (Fig. 1), and any segmentation error affects the subsequent word representation and matching steps. This dependence has motivated researchers to move towards completely segmentation-free methods.
From our survey of handwritten word spotting techniques, we found that some researchers have worked on handwritten Arabic documents, of which a million were written in various disciplines between the seventh and fourteenth centuries, whereas many works address Latin manuscripts. For a given query image, Leydier et al. [5] extract interest points and encode them with a simple descriptor based on gradient information. Word spotting is then performed by locating zones of the document images with similar interest points; only the zones sharing the same spatial configuration as the query model are returned. Kamble … characters using Rectangle Histogram Oriented Gradient; in this work, a feed-forward artificial neural network is used for the classification step. For a query image, Rath et al. [7] extract discrete feature vectors that describe the query, which are then used to estimate similarity after the training step of a probabilistic classifier. Jon et al. [8] represent the document with a grid of HOG descriptors and use a sliding-window approach for word spotting in document images. A similar approach is used in [9], in which the retrieval step is performed with an exemplar support vector machine framework.
[Figure: Process of the proposed system]
The approach used in [10] combines three windows with different slants: vertical, left and right. This method was proposed to remedy the problems of writing inclination, overlapping and diacritical marks. An HMM-based classifier is used at the decision step. Kessentini et al. propose to use multi-stream hidden Markov models for off-line handwritten Arabic word recognition [11]; their approach combines density-based and contour-based features extracted from sliding windows. In [12], each character is a state of a variable-duration hidden Markov model, and its variable duration allows a character to be modeled by multiple segments.
The remainder of this paper is organized as follows. Section 2 presents the indexation system process. Section 3 describes the HOG feature extraction approach for word spotting and the product quantization method. Section 4 then presents classification using the SVM classifier. Section 5 discusses the experimental results. Finally, conclusions are summarized in Section 6.

II. Present Work
In the present work, the document images are preprocessed after scanning the collected datasheets in order to enhance them. For this purpose, a model for restoring degradations [13], which uses a series of multi-level classifiers [14], is applied to the document images. Then, a window is slid over the document image, and Histograms of Oriented Gradients (HOG) features are extracted from each word image to represent the query and to compare it with the regions of the document. Applying Product Quantization [15] to encode the HOG descriptors reduces their size and improves the descriptor computation time. Finally, Support Vector Machines are used to produce a better representation of the query and to classify the feature vectors. A positive set of near-identical samples is produced by slightly shifting the window around the query, and a negative set is obtained by sampling random regions. The regions with high similarity are then used in the reranking step. The general process of the proposed system is shown in Fig. 2. The proposed indexation system is achieved in the following steps:
• Image preprocessing

III. HOG Feature Extraction
Histogram of Oriented Gradients feature descriptors are used in computer vision and image processing for object recognition. The main idea behind HOG descriptors is that local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions. These descriptors are computed by dividing the image into small connected regions, called cells, and computing a histogram of gradient directions for each cell; the histograms are then normalized based on their energy (regularized L2 norm). The combination of these histograms forms the descriptor.
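As a rough illustration of the cell-histogram idea described above, the following Python sketch computes a HOG-like descriptor for a small grayscale image stored as a list of rows. The function name `hog_descriptor`, the cell size, the number of orientation bins, the simple central-difference gradients and the per-cell normalization are illustrative assumptions and not the settings used in this paper.

```python
import math

def hog_descriptor(image, cell_size=4, num_bins=9, eps=1e-6):
    """HOG-like descriptor of a grayscale image given as a list of rows.

    Assumptions (not from the paper): central-difference gradients,
    unsigned orientations in [0, 180), per-cell regularized L2 norm.
    """
    height, width = len(image), len(image[0])
    cells_y, cells_x = height // cell_size, width // cell_size
    bin_width = 180.0 / num_bins
    descriptor = []
    for cy in range(cells_y):
        for cx in range(cells_x):
            hist = [0.0] * num_bins
            for y in range(cy * cell_size, (cy + 1) * cell_size):
                for x in range(cx * cell_size, (cx + 1) * cell_size):
                    # Central differences; border pixels fall back to a
                    # one-sided difference by reusing the edge value.
                    gx = image[y][min(x + 1, width - 1)] - image[y][max(x - 1, 0)]
                    gy = image[min(y + 1, height - 1)][x] - image[max(y - 1, 0)][x]
                    magnitude = math.hypot(gx, gy)
                    angle = math.degrees(math.atan2(gy, gx)) % 180.0
                    # Each pixel votes for its orientation bin, weighted by magnitude.
                    hist[int(angle / bin_width) % num_bins] += magnitude
            # Regularized L2 normalization of the cell histogram.
            norm = math.sqrt(sum(v * v for v in hist) + eps ** 2)
            descriptor.extend(v / norm for v in hist)
    return descriptor

# Toy 8x8 image with a vertical edge: left half dark, right half bright.
toy_image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
toy_descriptor = hog_descriptor(toy_image)
print(len(toy_descriptor))  # prints 36 (2 x 2 cells x 9 bins)
```

In this toy case every cell touches the vertical edge, so all of its gradient energy falls into the first orientation bin. A full HOG implementation would typically add block-level normalization and vote interpolation, which are omitted here for brevity.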

A. Feature extraction
The main contribution of this paper is the application of HOG feature descriptors and Product Quantization to word spotting in handwritten Arabic documents. Feature extraction, that is, transforming the input query into a set of features, is one of the most important parts of a word spotting system. The Histogram of Oriented Gradients is used to detect and extract the features of handwritten Arabic documents (Fig. 3). Initially, we remove noise from the sliding window and the document regions with a Gaussian filter. After smoothing, the Sobel kernel is used to calculate the horizontal and vertical components of the gradients. Let … be the smoothed image and … the horizontal and vertical …

B. Product Quantization
The main drawback of sliding-window based methods is the cost of recomputing the descriptors of every image for every new query. It is possible to precompute and store the HOG descriptors, but this requires a large amount of memory. For this reason, we propose to encode the HOG descriptors by means of Product Quantization (PQ), proposed by Jégou et al. in [15]. This method reduces both the amount of memory needed to store the descriptors and the computational cost of searching them. The technique has shown excellent results on approximate nearest neighbor tasks. The idea is to decompose the descriptor space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately.
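The decomposition into subspaces can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the configuration used in this paper: the helper names (`train_codebooks`, `encode`, `decode`), the plain Lloyd's k-means, the number of subspaces, the codebook size and the random toy descriptors are all hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(vector, num_subspaces):
    """Cut a vector into num_subspaces contiguous sub-vectors of equal length."""
    step = len(vector) // num_subspaces
    return [vector[i * step:(i + 1) * step] for i in range(num_subspaces)]

def train_codebooks(vectors, num_subspaces, k):
    """Learn one k-means codebook per subspace."""
    return [kmeans([split(v, num_subspaces)[s] for v in vectors], k, seed=s)
            for s in range(num_subspaces)]

def encode(vector, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [min(range(len(cb)), key=lambda j: squared_distance(sub, cb[j]))
            for sub, cb in zip(split(vector, len(codebooks)), codebooks)]

def decode(code, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    return [value for j, cb in zip(code, codebooks) for value in cb[j]]

# Toy data: 200 random 8-dimensional vectors standing in for HOG descriptors.
rng = random.Random(42)
descriptors = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(descriptors, num_subspaces=4, k=16)
code = encode(descriptors[0], codebooks)
print(code)  # four centroid indices instead of eight floating-point values
```

Each descriptor is stored as a short list of centroid indices, which is where the memory saving comes from; distances to a query can then be approximated from the centroids instead of the original vectors.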

IV. Classification
The main objective of a query recognition system is to identify the query robustly. In our case, we use an SVM classifier with a linear kernel for recognition. The major optimization criterion is the width of the margin between the classes, that is, the empty area around the decision boundary, defined by the distance to the nearest training patterns. These patterns, called support vectors, ultimately define the classification function.
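The linear kernel function used in SVM classification is:
$$K(x, x') = x^\top x'$$
The SVM attempts to find the hyperplane $\langle w, b \rangle$ that maximizes the margin.
The positive and negative support vectors respectively satisfy:
$$w^\top x_+ + b = +1, \qquad w^\top x_- + b = -1$$
The margin between them is $2 / \lVert w \rVert$, so maximizing the margin amounts to minimizing $\lVert w \rVert$ (equivalently $\frac{1}{2} \lVert w \rVert^2$). This can be cast as the optimization problem:
$$\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 + c \sum_{(x_+, y_+) \in P} L\big(y_+ (w^\top x_+ + b)\big) + c \sum_{(x_-, y_-) \in N} L\big(y_- (w^\top x_- + b)\big)$$
where $L$ is the loss function, $c$ is a regularization parameter, $y_+ = +1$ and $y_- = -1$. The positive set $P$, with $x_+ \in P$, is constructed by deforming the query. To produce the negative set $N$, we sample random regions over all the documents.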
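To make the margin-based training concrete, the following Python sketch fits a linear classifier on a positive and a negative set by subgradient descent. It is a minimal sketch under assumptions that do not come from this paper: the loss is taken to be the hinge loss, the solver, learning rate and number of epochs are arbitrary, and the two-dimensional points merely stand in for descriptors of deformed queries and of random document regions.

```python
def train_linear_svm(positives, negatives, c=1.0, learning_rate=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + c * sum(hinge(y * (w.x + b))) by subgradient descent."""
    samples = [(x, 1.0) for x in positives] + [(x, -1.0) for x in negatives]
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the regularizer 0.5*||w||^2
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # the hinge loss is active for this sample
                for i in range(dim):
                    grad_w[i] -= c * y * x[i]
                grad_b -= c * y
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b

def score(w, b, x):
    """Signed score of a descriptor; positive values fall on the query side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical toy sets: stand-ins for deformed-query and random-region descriptors.
positive_set = [[2.0, 2.0], [2.5, 1.5], [1.5, 2.5]]
negative_set = [[-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.5], [-2.0, -1.0]]
w, b = train_linear_svm(positive_set, negative_set)
print([score(w, b, x) > 0 for x in positive_set + negative_set])
```

Once trained, the weight vector can replace the raw query descriptor when scoring document regions, which is the sense in which the SVM yields a better representation of the query.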

V. Experiments
In this section, we present the results of the proposed approach for query search on the Ibn Sina handwritten manuscript dataset. MATLAB is used to measure all scores and the running times of the different stages (computing the HOG descriptors, calculating the scores against the query, and training the SVM). Table 1 shows the mean average precision of the proposed approach and the descriptor computation time per document image.
As shown in Fig. 5 and Fig. 6, the precision and mean recall results depend on the size of the query.
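For readers unfamiliar with the metric, the mean average precision can be computed along the following lines. This Python sketch uses one common definition of average precision (the mean of the precision values at the ranks of the relevant results); the exact variant used for Table 1 is not stated here, and the function names and the two toy ranked lists are hypothetical and unrelated to the reported results.

```python
def average_precision(ranked_relevance, num_relevant):
    """ranked_relevance: list of booleans, best-scoring region first."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / num_relevant if num_relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranked_relevance, num_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

# Two toy queries: relevance of the returned regions, in ranked order.
toy_queries = [([True, False, True], 2), ([False, True], 1)]
toy_map = mean_average_precision(toy_queries)
print(round(toy_map, 4))  # prints 0.6667
```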

VI. Conclusion
In this paper, we have presented a segmentation-free word spotting method for handwritten Arabic documents using the HOG descriptor. The method has been evaluated on a large collection of handwritten Ibn Sina manuscripts. The experimental results (Table 1 and Fig. 5) show that HOG-based feature extraction combined with a linear SVM classifier provides good results in terms of mAP and descriptor computation time. In future work, this approach can be extended to improve performance and reduce the descriptor computation time by adding more relevant features.