Topic Models and Fusion Methods: a Union to Improve Text Clustering and Cluster Labeling

Topic modeling algorithms are statistical methods that aim to discover the topics running through text documents. Topic models are popular in machine learning and text mining because they can infer the latent topic structure of a corpus. In this paper, we present a document-enriching approach that uses state-of-the-art topic models and data fusion methods to enrich the documents of a collection, with the aim of improving the quality of text clustering and cluster labeling. We propose a bi-vector space model in which every document of the corpus is represented by two vectors: one generated by the fusion-based topic modeling approach, and one that is simply the traditional vector model. Our experiments on various datasets show that using a combination of topic modeling and fusion methods to create the documents' vectors can significantly improve the quality of the clustering results.


I. Introduction
While we are overwhelmed by the increasing amount of available text, we simply do not have the human power to read and study it all in order to browse and organize such a huge volume. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that find the themes (topics) running through text documents by analyzing their words. Topic models are popular in machine learning and text mining because they can infer the latent topic structure of a corpus. In document clustering, a topic model can be used directly to map the original high-dimensional representation of documents (word features) to a low-dimensional representation (topic features), after which a standard clustering algorithm such as k-means is applied in the new feature space. Alternatively, each topic can be considered a feature of a document, so that documents with the highest proportion of the same topic (same feature) are placed in the same cluster [1]. In the classification problem specifically, topic models can be interpreted as a soft (or fuzzy) classification of the collection of documents into latent classes, which means a document does not belong fully to one class but has different degrees of membership in several classes. In addition, the results of topic models can be used to produce a hard classifier in which a document has one and only one category.
In this work, we present a novel approach to improve the quality of clustering using topic models [2] and fusion methods [3]. The core idea of our approach is to enrich the vectors of the documents in order to improve the quality of clustering. To this end, we apply a statistical approach to discover and annotate a corpus with thematic information, represented in the form of different proportions over different topics for each document. Our approach is unsupervised, and the topics used for enriching are produced by an unsupervised learning method. The final enriched vectors representing the documents are clustered through k-means clustering, and the produced classes are hard classification classes extracted from the Latent Dirichlet Allocation (LDA) results.
We first run topic modeling several times over the collection with different parameters. We then specify, in each iteration, a set of topics as the special topics for each document. Finally, we combine all the special topics of each iteration to generate a single topic for every document. These generated topics are the vectors that are later used in clustering the collection. Furthermore, we use these topics to generate labels for each cluster.

II. Related Works
In this section, we briefly summarize related works on text representation models for vector-word based text clustering.
The basic text representation model, i.e., the Bag of Words (BOW) model, is widely used for text clustering and classification. In this model each term is weighted by schemes such as TF, TF-IDF [4], and its variants [5]. The BOW representation is popular, but for short texts it generates a sparse vector for each document.
To overcome data sparseness, several works exploit external knowledge (e.g., Wikipedia, WordNet, etc.) to extend the content of the documents. Banerjee et al. [6] use the Wikipedia knowledge base to enrich the document representation vector with additional features, and Hotho et al. [7] use the WordNet knowledge base to enrich the representation vectors. Other works use feature selection approaches to reduce the high dimensionality. Revanasiddappa

Lu et al. [1] investigated the performance of two probabilistic topic models, Probabilistic Latent Semantic Analysis (PLSA) and LDA, in document clustering. The authors used the topic models to generate a number of topics, which are treated as specific features of documents. For clustering, documents that have the highest probability in the same feature (same topic) are therefore placed in the same cluster. In a similar way, Yau et al. [9] examine the ability of further topic modeling algorithms, namely the Correlated Topic Model (CTM), Hierarchical LDA, and the Hierarchical Dirichlet Process (HDP), to cluster documents. We highlight two main problems here. First, we do not know the exact number of topics running through the corpus. Second, because of the frequency-based nature of topic models, we cannot claim that the topic with the highest probability for a document is the main topic by which the document must be clustered. We take these two problems as our working hypotheses in dealing with the topics running through the corpus.
The supervised approaches in the text classification domain [10,11,12] exploit topic models to enrich document representation. Vo and Ock [11] used the LDA model for topic analysis but presented new methods for enhancing features by combining external texts modeled from various types of universal datasets. The authors of [10,12] propose approaches that learn word vectors together with topics.
There are also neural embedding methods, such as word2vec [13] and doc2vec [14], that produce vector representations of words and documents by processing a corpus. Word2vec is a two-layer network built on the assumption that words with similar contexts have similar meanings. According to this assumption, word2vec describes semantic correlations between words in the corpus. Doc2vec (or Paragraph Vectors) is an extension of word2vec that requires labels in order to associate arbitrary documents with them. Indeed, doc2vec learns to correlate labels and words rather than words with other words. These algorithms aim to describe the real semantic information embedded in words, sentences and documents rather than the statistical relationships of term occurrences.
In this paper, we propose a method to enrich document representation vectors for use in partitional text clustering and cluster labeling. Our method is an unsupervised approach that requires no external knowledge and aims to overcome the two main problems explained above: sparse vectors and the traditional LDA representation. Since the main goal of our method is to enrich document vectors according to the statistical relationships of term occurrences rather than the real semantic information embedded in terms, we compare our results with two strong unsupervised baselines in this domain: the first baseline is the BOW text representation with TF-IDF term weighting, and the second baseline is the unsupervised use of LDA in document representation [1,9].
To the best of our knowledge, our work is the first to suggest a topic modeling solution to improve the quality of clustering and to perform cluster labeling based on the fusion methods.

III. Preliminary
Before we explain the main approach proposed in this paper, we briefly describe topic models and explain LDA as the topic model that we apply in our approach. We also explain two well-known data fusion methods which are used in this paper.

A. Topic Models
Topic models are based on the idea that documents are created by a mixture of topics, where a topic is a probability distribution over words. Specifically, a topic model is a statistical model by which we can create all the documents of a collection. Assume that we want to fill every document of a corpus with words. The topic model says that each document contains multiple topics and exhibits them in different proportions. Thus, for each document there is a distribution over topics; according to this distribution, a topic is chosen for every word of that document, and then a word is drawn from that topic (i.e. from its distribution over the vocabulary) [2].

B. Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a topic model widely used in the information retrieval field. Specifically, LDA is a probabilistic model that says each document of a corpus is generated by a distribution over topics, and each topic is characterized by a distribution over words. The process of generating a document defines a joint probability distribution over both observed (i.e. words of the corpus) and hidden (i.e. topics) random variables. The data analysis is performed by using that joint distribution to compute the conditional distribution of the hidden variables given the observed variables. Formally, LDA is described as follows:
$$ p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right) \tag{1} $$
where $\beta_{1:K}$ are the topics, each $\beta_k$ being a distribution over the words of the corpus (i.e. the vocabulary); $\theta_d$ are the topic proportions for the $d$-th document; $z_d$ are the topic assignments for the $d$-th document, where $z_{d,n}$ is the topic assignment for the $n$-th word in document $d$, which specifies the topic that the $n$-th word in $d$ belongs to; and $w_d$ are the observed words for document $d$, where $w_{d,n}$ is the $n$-th word in document $d$.
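The generative process described above can be illustrated with a short simulation. The following is only a sketch under stated assumptions: the function names, the symmetric Dirichlet hyperparameters (0.5), and the fixed document length are illustrative choices that do not come from the paper, and the code samples from the model rather than performing the inference that LDA tools carry out.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size, seed=0):
    """Sample topics, topic proportions, and word ids from the LDA generative story."""
    rng = random.Random(seed)
    # Each topic is a distribution over the vocabulary (hyperparameter 0.5 is an assumption).
    beta = [sample_dirichlet([0.5] * vocab_size, rng) for _ in range(num_topics)]
    thetas, corpus = [], []
    for _ in range(num_docs):
        # Topic proportions of this document (hyperparameter 0.5 is an assumption).
        theta = sample_dirichlet([0.5] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)    # topic assignment of this word
            w = sample_categorical(beta[z], rng)  # word id drawn from the chosen topic
            words.append(w)
        thetas.append(theta)
        corpus.append(words)
    return beta, thetas, corpus
```

Calling `generate_corpus(3, 10, 2, 5)` returns two topic distributions, three vectors of topic proportions, and three documents of ten word ids each. Inference runs in the opposite direction: given only the words, it estimates the topics and the proportions.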

C. Fusion Methods
We now introduce two baseline state-of-the-art data fusion methods, frequently used for various information retrieval tasks, namely the CombSUM and CombMNZ fusion methods [3].
Suppose there are $n$ ranked lists created by $n$ different systems over a collection of items $D$. Each system $S_i$ provides a ranked list of items $L_i = \langle d_{i1}, d_{i2}, \ldots, d_{im} \rangle$, and a relevance score $s_i(d_{ij})$ is assigned to each of the items in the list. Data fusion techniques use some algorithm to merge these $n$ ranked lists into one [3].
CombSUM uses the following equation:
$$ g(d) = \sum_{i=1}^{n} s_i(d) \tag{2} $$
If $d$ does not appear in any $L_i$, a default score (e.g., 0) is assigned to it for that list. According to the global score $g(d)$, the items can be ranked as a new list.
Another method, CombMNZ, uses the equation:
$$ g(d) = m \times \sum_{i=1}^{n} s_i(d) \tag{3} $$
where $m$ is the number of lists in which item $d$ appears.
The linear combination (i.e. the general form of CombSUM) uses the equation:
$$ g(d) = \sum_{i=1}^{n} w_i \times s_i(d) \tag{4} $$
where $w_i$ is the weight assigned to system $S_i$.
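The three fusion rules can be sketched in a few lines of Python. This is a minimal illustration, assuming each system's output is given as a dictionary from item to score and that a missing item scores 0; the function names and the tie-breaking rule in `rank` are assumptions made for the example.

```python
from collections import defaultdict


def comb_sum(score_lists):
    """CombSUM: sum each item's scores over all lists (absent items contribute 0)."""
    fused = defaultdict(float)
    for scores in score_lists:
        for item, score in scores.items():
            fused[item] += score
    return dict(fused)


def comb_mnz(score_lists):
    """CombMNZ: CombSUM score multiplied by the number of lists containing the item."""
    summed = comb_sum(score_lists)
    hits = defaultdict(int)
    for scores in score_lists:
        for item in scores:
            hits[item] += 1
    return {item: hits[item] * score for item, score in summed.items()}


def linear_combination(score_lists, weights):
    """Linear combination: like CombSUM, but each list has its own weight."""
    fused = defaultdict(float)
    for weight, scores in zip(weights, score_lists):
        for item, score in scores.items():
            fused[item] += weight * score
    return dict(fused)


def rank(fused):
    """Order items by decreasing fused score; ties are broken alphabetically (an assumption)."""
    return sorted(fused, key=lambda item: (-fused[item], item))


lists = [{"a": 0.75, "b": 0.5}, {"a": 0.5, "c": 0.75}, {"b": 0.5, "c": 0.25}]
print(rank(comb_sum(lists)))
print(rank(comb_mnz(lists)))
```

With all weights equal to 1, `linear_combination` reduces to `comb_sum`, which is why the text calls it the general form of CombSUM.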

IV. Our Method
To create an enriched vectorial representation for the documents of a corpus, we propose an unsupervised technique called Fusion- and Topic-based Enriching (FT-Enrich). Let $D$ be the collection of documents that we wish to cluster. We run the LDA algorithm several times over the collection, each time with a different specified number of topics. We use LDA because we want to specify and change the number of topics manually. The intuition behind using a different number of topics in each iteration is to bring in, with an ensemble approach, the variety of topics discussed in the documents. We start with a number of topics close to the number of clusters: assuming $K$ is the number of clusters we wish to have, the starting number of topics is $K - \kappa$, where $\kappa$ is a small integer. The reason for starting with $K - \kappa$ is to emphasize the topics of the iterations whose number of topics is close to the number of clusters. Finally, for every document $d$ of $D$ there is a set $\{\mathcal{B}_1, \mathcal{B}_2, \ldots, \mathcal{B}_n\}$, where $\mathcal{B}_i$ contains the topics belonging to iteration $i$, and $n$ indicates the number of iterations. The number of topics starts at $K - \kappa$ and is increased by one in each iteration. The number of iterations depends on the maximum number of topics (larger than the number of clusters) involved in determining the special topics; this maximum can be seen as an expectation of the number of different topics in the corpus. The clustering results in our experiments are obtained with 25 iterations. Therefore, for clustering a corpus into 4 clusters, the sequence of topic numbers over the 25 iterations with $\kappa = 1$ is 3, 4, 5, …, 27.
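As a small illustration of the iteration schedule, the sketch below lists the number of topics used in each LDA run. It assumes the starting value is the number of clusters minus κ, which is consistent with the 3, 4, 5, …, 27 example for 4 clusters; the function name and default arguments are hypothetical.

```python
def topic_schedule(num_clusters, kappa=1, iterations=25):
    """Return the number of topics for each LDA run, increasing by one per iteration."""
    start = num_clusters - kappa  # assumed starting point, close to the number of clusters
    if start < 1:
        raise ValueError("num_clusters - kappa must be at least 1")
    return [start + i for i in range(iterations)]


schedule = topic_schedule(num_clusters=4)
print(schedule[0], schedule[-1], len(schedule))
```

Each value in the schedule would be passed as the number of topics to one LDA run over the whole collection.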
In every iteration $i$, we generate for each document a set of topics, namely its special topics, which are selected from the topics produced in iteration $i$. To generate these topics, we construct a graph $G_i$ comprising the documents of $D$ and the topics generated in iteration $i$. Fig. 1 shows three examples of such a graph in different iterations. Every circular node corresponds to a document of the collection, and the square nodes correspond to the topics generated in that iteration. The connection between a circular node and a square node indicates the proportion of the corresponding topic in the document. Therefore, $\mathbb{P}_i$ indicates the topic proportions of the documents in iteration $i$, where $p_d$ shows the topic proportions for document $d$ in graph $G_i$ and $\sum_t p_d^t = 1$. The elements of the special topics for document $d$ within iteration $i$ include:
• the topic with the highest proportion for document $d$,
• the topic by which document $d$ finds its best couple,
• the topic by which $d$ is selected as the best couple for another document.
Given the topics of the $i$-th iteration, the best couple for document $d$ is the document $d'$ for which Equation (5) returns the highest value, where the denominator, when it would otherwise be zero, is set to 0.1. Specifically, Equation (5) finds documents that are similar to each other in a specific topic, considering their proportions in that topic. Therefore, for each document in a specific iteration there is a special-topics set $\mathcal{B}_i$ containing between one and three topics, since the three criteria above may select the same topic. We take into account the effect of the special topics of each document by combining the elements of its special-topics sets. Our goal is to generate a representing vector for each document to be used in clustering, where this vector is a combination of some special topics. We use the data fusion method CombSUM in two phases to generate a single topic (vector) for each document in the corpus.
In the first phase, all the topics within $\mathcal{B}_i$ are combined to generate a single vector $v_i$ for each document $d$ in iteration $i$. Formally, let $s_t(w)$ denote the normalized score of term $w$ in distribution (topic) $t$. The general form of the CombSUM fusion method then simply sums over the normalized scores of $w$ given by the various topics in $\mathcal{B}_i$:
$$ v_i(w) = \sum_{t \in \mathcal{B}_i} p_d^t \times s_t(w) \tag{6} $$
where $p_d^t$ is the proportion of document $d$ in topic $t$.
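In the second phase, all the single vectors generated in the $n$ iterations are combined to generate a unique vector $V_d$ for document $d$. Formally, given $\mathcal{V}_d = \{v_1, v_2, \ldots, v_n\}$, let $s_{v_i}(w)$ denote the normalized score of term $w$ in vector $v_i$. The CombSUM fusion method then sums over the normalized scores of $w$ given by the various vectors in $\mathcal{V}_d$:
$$ V_d(w) = \sum_{i=1}^{n} s_{v_i}(w) \tag{7} $$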
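Finally, a trade-off between the enriched vector $V_d$ obtained in the second phase and the traditional vector, i.e., the vector generated based on TF-IDF for document $d$, is used to generate the final vector $F_d$, which is the representing vector of the $d$-th document in clustering. Formally:
$$ F_d = \alpha \times V_d + (1 - \alpha) \times T_d \tag{8} $$
where $T_d$ indicates the traditional vector for the $d$-th document, and $0 \le \alpha \le 1$.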
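The two fusion phases and the final trade-off can be sketched as follows. This is an illustrative sketch, not the authors' implementation: vectors are represented as dictionaries from term to score, normalization is assumed to be the L2 norm, the first phase is assumed to weight each special topic by the document's proportion in it, and all function names are hypothetical. The selection of special topics is taken as given.

```python
import math


def l2_normalize(vec):
    """Scale a term->score dictionary to unit length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {t: x / norm for t, x in vec.items()} if norm > 0 else dict(vec)


def fuse_iteration(special_topics):
    """Phase 1: combine one iteration's special topics, given as (proportion, topic_vector) pairs."""
    fused = {}
    for proportion, topic in special_topics:
        for term, score in l2_normalize(topic).items():
            fused[term] = fused.get(term, 0.0) + proportion * score
    return fused


def fuse_iterations(iteration_vectors):
    """Phase 2: CombSUM over the normalized per-iteration vectors of one document."""
    fused = {}
    for vec in iteration_vectors:
        for term, score in l2_normalize(vec).items():
            fused[term] = fused.get(term, 0.0) + score
    return fused


def final_vector(enriched, tfidf, alpha):
    """Trade-off between the enriched vector and the TF-IDF vector, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    e, t = l2_normalize(enriched), l2_normalize(tfidf)  # normalizing both sides is an assumption
    return {w: alpha * e.get(w, 0.0) + (1 - alpha) * t.get(w, 0.0) for w in set(e) | set(t)}
```

Setting `alpha=1.0` uses only the enriched vector and `alpha=0.0` falls back to plain TF-IDF, so a single parameter moves between the two representations of the bi-vector model.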

V. Cluster Labeling
To label a cluster $C$, we use the CombMNZ data fusion method, which provides good results in combining several ranked lists [3] [15]. First, we create $\mathcal{L}_C = \{L_1, L_2, \ldots, L_{|C|}\}$, where $L_i$ is the list of terms corresponding to the vector of the $i$-th document within $C$. We then rank the terms of each $L_i$ based on the scores/probabilities obtained for its corresponding vector in Equation (8), so that $\mathcal{L}_C$ is updated with the new ranked lists. We then create candidate labels, which are the Top-M terms within each list $L_i$. Let $\mathcal{L} = \bigcup_{i=1}^{|C|} \text{Top-M}(L_i)$ denote the overall candidate-label pool, generated as the union of all Top-M scored labels selected from the lists of cluster $C$. CombMNZ boosts a label $l$ based on the number of times that $l$ appears in the various lists. Formally:
$$ \text{CombMNZ}(l) = m_l \times \sum_{i=1}^{|C|} s_i(l) \tag{9} $$
where $s_i(l)$ is the score of label $l$ in list $L_i$ and $m_l$ is the number of lists in which $l$ appears.
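A minimal sketch of this labeling step is given below. It assumes each document of the cluster is available as a dictionary from term to score, that only each document's Top-M terms enter the fusion, and that ties are broken alphabetically; the function name and the toy data are hypothetical.

```python
def label_cluster(doc_vectors, top_m, top_n):
    """Rank candidate labels of one cluster with CombMNZ over per-document Top-M lists."""
    sums, hits = {}, {}
    for vec in doc_vectors:
        # Top-M terms of this document, by decreasing score (ties alphabetical, an assumption).
        candidates = sorted(vec, key=lambda t: (-vec[t], t))[:top_m]
        for term in candidates:
            sums[term] = sums.get(term, 0.0) + vec[term]
            hits[term] = hits.get(term, 0) + 1
    # CombMNZ: summed score multiplied by the number of lists containing the term.
    fused = {term: hits[term] * sums[term] for term in sums}
    return sorted(fused, key=lambda t: (-fused[t], t))[:top_n]


docs = [
    {"god": 0.5, "atheism": 0.25, "car": 0.125},
    {"god": 0.5, "belief": 0.25, "car": 0.125},
    {"atheism": 0.75, "god": 0.25, "bike": 0.125},
]
print(label_cluster(docs, top_m=2, top_n=2))
```

Terms that appear in many documents' top lists are boosted by the multiplier, while a term that scores highly in only one document is pushed down.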

VI. Experimental Setup
The principal idea of the experiments is to show the efficacy of an ensemble approach of topic modeling on clustering results through a manually predefined categorization of the corpus.

A. Datasets
We explore the utility of the document representation vectors generated by our method for clustering, in addition to labeling the clusters. To this end, we used three different datasets:
Classic4: This dataset is often used as a benchmark for clustering and co-clustering. It consists of 7095 documents classified into four classes denoted MED, CISI, CRAN and CACM. For our experiments, we randomly extract 500 documents from each class.

B. Preprocessing
Preprocessing is an essential step in text mining. The first, classical preprocessing step consists of stop-word removal and lower-case conversion. In addition, we used the L2-norm to normalize the topics/vectors generated by MALLET. The normalized vector of $v$ is a vector $\hat{v}$ with the same direction but with length one. It is denoted by $\hat{v} = v / |v|$, where $|v| = \sqrt{\sum_i v_i^2}$.

C. Vectors Similarity Measure
To evaluate the similarity of two represented vectors, we used the traditional Cosine Similarity measure, which measures the cosine of the angle between two non-zero vectors of an inner product space. Given two vectors of attributes, $A$ and $B$, the cosine similarity, $\cos(\theta)$, is represented as follows:
$$ \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \tag{10} $$
Cosine similarity is a judgment of the orientation, not the magnitude, of two vectors and is commonly used with text data represented by word counts. Its results range from -1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating orthogonality and in-between values indicating intermediate similarity or dissimilarity. In Information Retrieval (IR), since the term frequencies (TF-IDF weights) cannot be negative, the cosine similarity of the represented vectors of two documents ranges from 0 to 1.
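For concreteness, the cosine similarity of two dense vectors can be computed as in the sketch below; the function name is illustrative, and the vectors are assumed to be equal-length lists of numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Because the measure depends only on direction, vectors that have already been normalized to unit length can be compared with a plain dot product.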

D. Clustering Evaluation Measures
We used two external criteria, Purity and F1-measure, for evaluating the clustering results.

Purity:
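Purity is a simple and transparent evaluation measure that is related to the concept of entropy [19]. To compute the purity criterion, each cluster is assigned to its majority class. Then we consider the percentage of correctly assigned documents, given the set of documents $L_k$ in the majority class of cluster $\omega_k$:
$$ P(\omega_k, L_k) = \frac{|\omega_k \cap L_k|}{|\omega_k|} \tag{11} $$
$$ \text{Purity}(\Omega, \mathbb{C}) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j| \tag{12} $$
where $N$ is the number of all documents, $\Omega = \{\omega_1, \omega_2, \ldots\}$ is the set of clusters and $\mathbb{C} = \{c_1, c_2, \ldots\}$ is the set of classes.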

F1-measure:
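The F1-measure is defined as the harmonic mean of precision and recall [20]. Formally, for a cluster $\omega_k$ the F1-measure is defined as follows:
$$ F_1 = \frac{2 \times P \times R}{P + R} \tag{13} $$
where $P$ (Precision) is defined in Equation (11), and $R$ (Recall) is formally defined as follows:
$$ R(\omega_k, L_k) = \frac{|\omega_k \cap L_k|}{|L_k|} \tag{14} $$
where $L_k$ is the majority class.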
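The following sketch shows one way to compute both criteria from a clustering result. It assumes the input is two parallel lists giving the cluster id and the true class of each document, that class labels are mutually comparable (for example, all strings), and that ties for the majority class are broken by label order; the function name and the returned layout are illustrative.

```python
from collections import Counter


def evaluate_clustering(cluster_ids, class_labels):
    """Return total purity and, per cluster, its majority class, precision, recall, and F1."""
    if len(cluster_ids) != len(class_labels) or not class_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    class_sizes = Counter(class_labels)
    contingency = {}
    for cluster, label in zip(cluster_ids, class_labels):
        contingency.setdefault(cluster, Counter())[label] += 1
    correct = 0
    per_cluster = {}
    for cluster, counts in contingency.items():
        # Majority class of the cluster; sorting first makes tie-breaking deterministic.
        majority, overlap = max(sorted(counts.items()), key=lambda kv: kv[1])
        precision = overlap / sum(counts.values())
        recall = overlap / class_sizes[majority]
        f1 = 2 * precision * recall / (precision + recall)
        per_cluster[cluster] = {"majority": majority, "precision": precision,
                                "recall": recall, "f1": f1}
        correct += overlap
    return correct / len(class_labels), per_cluster


purity, details = evaluate_clustering([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
print(purity, details[0]["f1"])
```

Purity alone can be inflated by producing many small clusters, which is one reason to report the per-cluster F1-measure alongside it.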

E. Labeling Evaluation Measures
For evaluating the quality of cluster labeling, we use the framework presented in [21]. For each given cluster, its ground truth labels were obtained by manual (human) labeling and are used for the evaluation.
We use the Match@N (Match at top N results) and MRR@N (Mean Reciprocal Rank) measures proposed in [21] to evaluate the quality of the labels. They consider the categories of the Open Directory Project (ODP) as the correct labels and then evaluate a ranked list of proposed labels using the following criteria:
• Match@N: a binary indicator that returns 1 if the top N proposed labels contain at least one correct label, and zero otherwise.
• MRR@N: returns the inverse of the rank of the first correct label in the top-N list, and zero if the top-N list contains no correct label.
A proposed label for a given cluster is considered correct if it is identical, an inflection, or a WordNet synonym of the cluster's correct label [16].
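Both measures are straightforward to compute for a single cluster, as the sketch below shows. It assumes that correctness is decided by exact membership in a set of accepted labels, so inflections and WordNet synonyms would have to be added to that set by the caller; the function names are illustrative. Averaging the per-cluster values over all clusters gives the reported scores.

```python
def match_at_n(proposed, correct, n):
    """1 if any of the top-n proposed labels is a correct label, else 0."""
    return 1 if any(label in correct for label in proposed[:n]) else 0


def mrr_at_n(proposed, correct, n):
    """Inverse rank of the first correct label within the top n, or 0.0 if there is none."""
    for position, label in enumerate(proposed[:n], start=1):
        if label in correct:
            return 1.0 / position
    return 0.0


proposed = ["god", "religion", "atheism", "people"]
correct = {"atheism", "atheist"}
print(match_at_n(proposed, correct, 2), match_at_n(proposed, correct, 3))
print(mrr_at_n(proposed, correct, 3))
```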

A. Evaluating Results of Clustering
In our experiments, we use the software package CLUTO, which is used for clustering low- and high-dimensional datasets. The clustering algorithm adopted is partitional, and the measure of similarity between two vectors is the cosine similarity. Every document of the corpus is represented by two vectors: one generated by the FT-Enrich method, and one that is simply the traditional vector model (BOW with classical TF-IDF weighting of terms).
We tested and evaluated clustering with and without applying FT-Enrich, to show the improvement in clustering purity due to the combination of fusion and topic modeling approaches. The results are shown in Table II and Table IV for the two datasets BBC and Classic4. The results in Table II on BBC indicate that representing documents using only FT-Enrich ($\alpha = 1$) considerably improves the quality of clustering compared to the traditional TF-IDF method (first baseline) shown in Table I. Table II shows that the best improvements in total purity (22%) and in the average of F1-measures (23%) are obtained by using the FT-Enrich method exclusively ($\alpha = 1$). Furthermore, comparing Table I and Table II, cluster 4 shows about a 50% improvement in purity. We investigated the variation of $\alpha$ by considering the dispersion of the documents' sizes. Our experiments show that the contribution of the FT-Enrich method to the representation vectors is larger for a corpus with a low Standard Deviation (SD) of document size relative to its mean (ME) than for one with a high SD. Table IV shows the clustering result on Classic4 and Table II the result on BBC, each obtained with its own value of $\alpha$ and each corpus having its own ratio of SD to ME.
We also compared our method with the second baseline, i.e., the unsupervised LDA document representation. To this end, we set the number of topics for each dataset equal to the number of classes manually specified for the corpus. For example, for the BBC dataset with 5 manually specified classes, we ran LDA topic modeling with 5 topics over the corpus. Therefore, each document of the BBC news corpus is represented by its proportions over 5 different topics. Finally, documents that have the highest probability/proportion in the same topic are placed in the same cluster. The results of the clustering are shown in Table V and Table VI. As can be observed, clustering using the LDA representation alone returns a worse result on Classic4 than the traditional TF-IDF method (first baseline) shown in Table III.
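The assignment rule of this baseline amounts to taking, for each document, the index of its largest topic proportion. A minimal sketch follows, assuming the document-topic proportions are available as a list of rows (one per document) and that ties go to the lowest topic index; the function name and the sample values are illustrative.

```python
def cluster_by_dominant_topic(doc_topic):
    """Assign each document to the topic in which it has the highest proportion."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]
print(cluster_by_dominant_topic(doc_topic))
```

This hard assignment discards all but the dominant topic of each document, which is the limitation the fusion-based representation is meant to address.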

B. Evaluating Results of Cluster Labeling
We use the 20NG benchmark for our experiments in cluster labeling. We first show the result of clustering on this dataset using the representation vectors generated by our method, which are the vectors used in cluster labeling. We further compare our result with the clustering result obtained using the traditional representation vectors. The results of the clustering are shown in Table VII. They show a remarkable improvement (68%) in the total purity of clustering (TP = 0.64), which leads to a significant result in cluster labeling as well.
The cluster labeling method presented in this work is a direct cluster labeling method, in which the candidate labels for clusters are extracted directly from the content of the clusters without using external sources (e.g. Wikipedia). One of the baseline direct approaches that several clustering systems apply for cluster labeling [17] is to select the top-n terms with maximal weights from the cluster centroid as the candidate labels. In our experiments we use this approach as a baseline for comparison. Specifically, we explore the effectiveness of the candidate labels generated by our approach, in addition to the highest weighted terms extracted from the cluster centroid provided by the TF-IDF and FT-Enrich methods.
As an example of cluster labeling, Table VIII shows the top-15 labels produced by the three labeling methods explained above over the first cluster of the 20News dataset, which is labeled "Atheism" by experts.
As can be observed, using the highest weighted terms extracted from the clusters' centroids provided by the FT-Enrich method is more effective than using the ones provided by TF-IDF. The results further show that applying the fusion method (CombMNZ(FT-Enrich)) to the representation vectors generated by the FT-Enrich method provides the best performance for both label quality measures. We can further observe that, for the Match@N measure, the baseline method with the FT-Enrich based cluster centroid requires at least 18 terms to cover 80% of the clusters with a correct label, while the same effectiveness is achieved with a list of only 7 terms using the FT-Enrich method. It is also interesting that the CombMNZ(FT-Enrich) method covers 100% of the clusters with a correct label.

VIII. Conclusion
In this paper, we presented a fusion- and topic-based enriching approach to improve the quality of clustering. We applied a statistical approach, namely topic modeling, to enrich the representation vectors of the documents. To this end, an ensemble of topic models with different parameters for each model is built, and then, using a fusion approach, all the generated results are combined to provide a single vectorial representation for each document. Our experiments on the different datasets show significant improvement in the clustering results. We further show that using such representation vectors in a fusion method provides interesting results in cluster labeling as well.
As future work, we plan to exploit external sources (e.g. WordNet) in both clustering and cluster labeling, to explore the effectiveness of using topic models together with such resources in the corresponding domains.
[Table: CombMNZ (topic-based) over first cluster of 20News dataset "Atheism"]