Comparative Study of Clustering Algorithms in Text Mining Context

The spectacular growth of data is due to the spread of networks and smartphones. About 42% of the world population uses the Internet [1], which has created a problem in processing the exchanged data: its volume is rising exponentially and it must be treated automatically. This paper presents a classical process of knowledge discovery in databases applied to textual data. This process is divided into three parts: preprocessing, processing and post-processing. In the processing step, we present a comparative study of several clustering algorithms: KMeans, Global KMeans, Fast Global KMeans, Two-Level KMeans and FW-KMeans. The comparison is made on real textual data collected from the web using RSS feeds. The experimental results identify two problems. The first is the quality of the results, which remains limited for the algorithms that converge rapidly. The second is the execution time, which needs to be reduced for some algorithms.

Text mining is a set of techniques that aim to process these huge amounts of data and gain value from them. It was introduced by Ronen and Dagan as KDT [2]; its main branches include text extraction, summarization, categorization, etc.
In contrast to data mining, KDT aims to process unstructured texts, which are complex and high-dimensional data. Generally, KDT is based on an automatic process that analyzes the entire content.
The paper is organized as follows. In the first section, we present a text clustering system for KDT. In the second section, a number of clustering algorithms are described. The third section presents a comparative study of clustering algorithms in a KDT context. In the last section, some conclusions are drawn.

II. EXISTING CLUSTERING SYSTEM
Such a system is generally based on an automatic method that analyzes and processes the whole text. Among these methods is the process shown in [10], which is spread over three main stages:

A. Preprocessing
The aim of this step is to clean the data and reduce noise [11]:
1. Removal of empty words (definite articles, punctuation marks, etc.).
b) Probabilistic model: the content of a document is represented in the context of a probabilistic method [13]. Given a query, this model gives a probability estimate of the relevance of the document.
c) Vector model: sometimes called the semantic vector model [14]. It represents the content of a document through an algebraic method that takes semantics into account. This widely used model represents documents as vectors, which makes the application of clustering straightforward.

B. Processing
In this step, we use one of the clustering algorithms to create the corpus [15]. In other words, we distribute the documents over several clusters, such that the elements of each cluster share a common characteristic. The clustering algorithms are presented in the next section.

C. Post Processing
Some works use ontologies to give meaning to the clusters. They use labeling [8] and visualization of the clustering results to present the hierarchical relations between the resulting clusters, together with evaluation. They generally look for the most frequent terms in each cluster and present them using the semantic relations of ontologies.

III. KMEANS CLUSTERING ALGORITHMS
Partitioning algorithms put the data into a predetermined number of clusters. These clustering approaches can be divided into two categories according to their input parameters: KMeans is the best-known algorithm in the first category. The second category includes, for example, Two-Level-KMeans.

All partitioning algorithms use the following concepts.
We note: $K$ is the number of desired clusters and $n$ is the cardinality of the data.
The aim is to minimize the distance between the points belonging to each cluster (the score).
$\mu_i$ is the center of cluster $i$, and $X_j$ is an element of the data set.
In other words, we try to find clusters with minimum intra-cluster distance and maximum inter-cluster distance.
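
With the notation above, the score minimized by partitioning algorithms is commonly written as the within-cluster sum of squared distances. The formula below is a sketch under two assumptions that the text does not state explicitly: the distance is the squared Euclidean distance, and $C_i$ (a symbol introduced here for illustration) denotes the set of elements assigned to cluster $i$.

```latex
\[
  \mathrm{score} = \sum_{i=1}^{K} \; \sum_{X_j \in C_i} \lVert X_j - \mu_i \rVert^{2}
\]
```

A small score means that the elements of each cluster lie close to their center $\mu_i$.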

A. KMeans Algorithms
Based on the previous concepts, five partitioning algorithms are presented below.

a) KMeans
The classic and most widely used algorithm is KMeans [4]. It takes two input parameters: the number of clusters K and a set of data. The algorithm initializes the centers of these clusters arbitrarily. Then, over several iterations, KMeans assigns each data item to the nearest cluster, and the centers are recalculated at the end of each iteration. The algorithm stops once there are no more new assignments.
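
The KMeans procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes NumPy, the squared Euclidean distance, and an arbitrary initialization that picks K data points at random. The function name `kmeans` and the parameters `max_iter` and `seed` are hypothetical.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Sketch of classic KMeans; returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Arbitrary initialization (assumption): k distinct data points as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign every element to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once there are no new assignments.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recalculate each center at the end of the iteration.
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return centers, labels

centers, labels = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
```

On this toy input the two nearby pairs end up in separate clusters. On real data the result depends on the initial centers, which is the local-optimum behavior discussed in the results.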

b) Global KMeans
Most previously proposed solutions cannot ensure convergence to a global optimum. Global KMeans [3] is a new version that overcomes this problem.
The main idea of the algorithm is to start with a single cluster that contains the whole data set.
Then each iteration creates a new cluster whose center minimizes the squared error. The algorithm stops once the number of clusters specified by the user is reached.
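
The incremental idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact procedure of [3]: each data point is tried as the starting position of the new center, KMeans iterations are run from that configuration, and the candidate with the smallest squared error is kept. The helper names `lloyd` and `global_kmeans` are hypothetical.

```python
import numpy as np

def lloyd(X, centers, max_iter=100):
    """Run KMeans iterations from given centers; return (centers, squared error)."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for i in range(len(centers)):
            members = X[labels == i]
            if len(members) > 0:
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, float(dists.min(axis=1).sum())

def global_kmeans(X, k):
    X = np.asarray(X, dtype=float)
    # Start with a single cluster containing the whole data set.
    centers = X.mean(axis=0, keepdims=True)
    sse = float(((X - centers) ** 2).sum())
    for _ in range(1, k):
        best_centers, best_sse = None, np.inf
        # Try every data point as the initial position of the new center.
        for x in X:
            cand, cand_sse = lloyd(X, np.vstack([centers, x]))
            if cand_sse < best_sse:
                best_centers, best_sse = cand, cand_sse
        centers, sse = best_centers, best_sse
    return centers, sse

centers, sse = global_kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
```

Because every added cluster requires one KMeans run per data point, the cost of this sketch grows quickly with the size of the data set.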

c) Fast Global KMeans
This algorithm was proposed by Jim Z.C. et al. [6] to improve Global KMeans. The fast global k-means algorithm constitutes a straightforward method to accelerate the global k-means algorithm.

e) FW-KMeans
FW-KMeans applies to the vector space model (VSM). Unlike KMeans, which treats all elements of the set equally, FW-KMeans uses the notion of weight to highlight the most common elements. In addition, FW-KMeans introduces a constant (with a very low value) to avoid the noise problem.

Suppose we are in
$[\mu_1, \mu_2, \ldots, \mu_k]$ are the $k$ cluster centers. The FW-KMeans proposal [8] aims to reduce noise. This reduction is achieved by a weighting method that decreases the influence of noise by concentrating the comparisons on the main axes.
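
The weighting idea can be illustrated with the sketch below. The exact update rule of [8] is not reproduced in this text, so the formula used here is an assumption in the style of feature-weighting k-means: a feature whose values are tightly grouped around the cluster center receives a large weight, and a small constant `sigma` is added to every per-feature distance. The function names and the value of `sigma` are hypothetical; `beta=1.5` is the value reported in the experiments.

```python
import numpy as np

def feature_weights(X_cluster, center, beta=1.5, sigma=1e-3):
    """Per-feature weights for one cluster; the weights sum to 1."""
    X_cluster = np.asarray(X_cluster, dtype=float)
    center = np.asarray(center, dtype=float)
    # Dispersion of each feature around the center; sigma keeps it non-zero.
    dispersion = ((X_cluster - center) ** 2 + sigma).sum(axis=0)
    # Features with small dispersion receive large weights (requires beta > 1).
    inv = dispersion ** (-1.0 / (beta - 1.0))
    return inv / inv.sum()

def weighted_distance(x, center, weights, beta=1.5, sigma=1e-3):
    """Weighted squared distance between a document vector and a center."""
    x = np.asarray(x, dtype=float)
    center = np.asarray(center, dtype=float)
    return float((weights ** beta * ((x - center) ** 2 + sigma)).sum())

docs = np.array([[1.0, 0.0], [1.0, 4.0], [1.0, 8.0]])
center = docs.mean(axis=0)
w = feature_weights(docs, center)
```

In this toy cluster the first feature is constant and the second is widely spread, so almost all of the weight goes to the first feature.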

IV. WORKS AND RESULTS
First, we retrieve items from RSS feeds in XML format. Then, we transform them into flat files. After removing empty words, we proceed to the lemmatization stage: for each word, we look for its root using the TreeTagger tool [9]. The last part of the preprocessing converts the results into vectors using the TF-IDF formula.
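
The conversion into vectors can be sketched as follows. This is a minimal illustration that assumes one common TF-IDF variant: the term frequency is the number of occurrences of a word divided by the total number of words in the document, and the inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the word. The function name `tf_idf` is hypothetical, and the input is assumed to be already tokenized and lemmatized.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors

vocab, vectors = tf_idf([["cat", "sat"], ["cat", "ran"]])
```

A word that appears in every document, such as "cat" in this toy corpus, receives a weight of zero, so it does not influence the clustering.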

A. Runtime Environment
The data used are taken from the real web and processed on a machine with a 2.5 GHz CPU (Core i5) and 8 GB of RAM, running Windows 7.

B. Used Data
The data used are retrieved from several news websites (Le Monde, 01net, JDN, etc.). The documents do not have the same size (number of words) or the same category (politics, computing, etc.). In total, 727 documents were retrieved.
These documents went through a preprocessing cycle before the clustering step was applied.

C. Results
Table 1 presents the results obtained using the five partitioning algorithms described previously. The execution time is also reported in this table. All algorithms were launched five times (except Global KMeans, which was run once), and the average value of each index is reported in Table 1. For the Two-Level-KMeans algorithm, we fix the threshold as the average value of all vectors. The centers of all algorithms (except Global KMeans) are initialized arbitrarily. The β parameter of FW-KMeans was set to 1.5.

By analyzing the results in Table 1, we find that the classic KMeans is the fastest in terms of execution time. This rapid convergence to a local optimum means that its results are only heuristic.
Global KMeans has the longest execution time. This is expected, because it treats all possible cases. Applying this method in a real-time context is not possible. Fast Global KMeans is very quick in comparison with Global KMeans and gives an interesting result relative to its execution time. Two-Level KMeans clustering is very useful when it is used on large volumes of data.
A reduction in the processing time is confirmed. According to the literature [7], this reduction maintains the quality of the clustering, but it was tested on uniform data. In our case, we tested the algorithm on varied data, and this application showed that Two-Level KMeans is inefficient in this case.
FW-KMeans presents good results compared with the majority of the tested algorithms. It takes the concept of weighting into account in its partitioning, which increases the quality of the results, but the execution time is a little high.
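
The two evaluation indices used in this study, the squared error and DBIndex, can be sketched as follows. The sketch assumes that DBIndex refers to the Davies-Bouldin index, which the text does not state explicitly, and that every cluster is non-empty with distinct centers. The function names are hypothetical. Lower values are better for both indices.

```python
import numpy as np

def squared_error(X, labels, centers):
    """Sum of squared distances between each element and its cluster center."""
    X, labels, centers = np.asarray(X, float), np.asarray(labels), np.asarray(centers, float)
    return float(((X - centers[labels]) ** 2).sum())

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index (assumed meaning of DBIndex)."""
    X, labels, centers = np.asarray(X, float), np.asarray(labels), np.asarray(centers, float)
    k = len(centers)
    # Average distance of the members of each cluster to their own center.
    scatter = np.array([
        np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
        for i in range(k)
    ])
    worst = []
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centers[i] - centers[j])
            for j in range(k) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

X = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]
centers = [[0, 1], [10, 1]]
sse = squared_error(X, labels, centers)
dbi = davies_bouldin(X, labels, centers)
```

The squared error measures only the compactness of the clusters, while the Davies-Bouldin index also takes the separation between cluster centers into account.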

V. CONCLUSION
In this paper, we have implemented a text mining process. First, we performed preprocessing on data from the web, and then we applied five clustering algorithms (KMeans, Global KMeans, Fast Global KMeans, Two-Level KMeans and FW-KMeans) to the data. The clustering results are evaluated with the squared error and DBIndex.
On our data, we found results similar to those in the literature: rapid convergence of KMeans clustering, and lower performance of Two-Level KMeans clustering, without good quality. Global KMeans, Fast Global KMeans and FW-KMeans give medium or high quality, but with a long execution time.
It has been found that Two-Level KMeans is ineffective in a KDT context, and its results are close to those of KMeans. In addition, the choice of the threshold value is still a problem. FW-KMeans improves the k-means algorithm by adding a new step in which the weights of the features for the different groups are calculated. The experiments show that FW-KMeans can deal with large and sparse data. The results of our own experiment on real-world text data show that FW-KMeans is better than Two-Level KMeans.
The indices used for clustering evaluation are not very specific and depend on the treated domain. In future research, we will work on two axes:
• Integration of ontologies in the different stages of the process to improve our results in terms of quality and execution time.
• Parallelization of the algorithms for the different stages.
Taking a threshold τ as an input parameter, Two-Level KMeans [7] is executed in two steps. The classic KMeans is run first. Afterwards, the clusters that do not satisfy the threshold condition are selected for subdivision into y radius of the cluster, and n is the data cardinality.

is the set of weight vectors for all clusters in which each or dissimilarity measure between the $j$-th document and the center of the $l$-th cluster on the $i$-th feature. In text clustering, the Euclidean distance is used because the term frequencies are numeric values.
where $n_{ij}$ is the number of occurrences of word $i$ in document $j$, $n$ is the total number of words in the document, and $D$ is the total number of documents in the corpus. Part of the resulting vector file is shown in Fig. 1.

Fig. 1: Part of the vector file