Effectiveness Assessment of the Cover Coefficient Based Clustering Methodology
Abstract
An algorithm for document clustering is introduced. Its underlying concept, the cover coefficient (CC), provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is also used to identify the cluster seeds and to form clusters around these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the IR effectiveness of the algorithm is comparable to that of a very demanding complete linkage clustering method that is known to perform well. The experiments also show that the algorithm is 15.1 to 63.5 percent (47.5 percent on average) better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm, and have shown improvements in retrieval effectiveness. Two document databases are used in the experiments: TODS and INSPEC; the latter is a common database with 12,684 documents.
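As a rough illustration of how a cover coefficient can yield an estimate of the number of clusters, the sketch below follows the usual formulation of the CC concept, which the abstract itself does not spell out: for a document-by-term matrix D, the extent to which document i is covered by itself is taken as c_ii = alpha_i * sum_k (d_ik * beta_k * d_ik), where alpha_i and beta_k are the reciprocals of the i-th row sum and k-th column sum, and the estimated number of clusters is the sum of the c_ii values. The function name `estimate_cluster_count` and the toy matrix are hypothetical, chosen only for this example; this is a minimal sketch under those assumptions, not the implementation evaluated in the experiments.

```python
def estimate_cluster_count(doc_term):
    """Estimate the number of clusters as the sum of the diagonal
    cover coefficients c_ii of a document-by-term matrix.

    doc_term: list of equal-length rows of non-negative weights;
    every document and every term is assumed to have a nonzero sum.
    """
    m = len(doc_term)
    n = len(doc_term[0])
    row_sums = [sum(row) for row in doc_term]
    col_sums = [sum(doc_term[i][k] for i in range(m)) for k in range(n)]
    if any(s == 0 for s in row_sums) or any(s == 0 for s in col_sums):
        raise ValueError("every document and every term needs a nonzero sum")

    n_c = 0.0
    for i in range(m):
        alpha_i = 1.0 / row_sums[i]  # reciprocal of the i-th row sum
        # c_ii = alpha_i * sum_k d_ik * beta_k * d_ik, with beta_k = 1 / col_sums[k]
        c_ii = alpha_i * sum(
            doc_term[i][k] * doc_term[i][k] / col_sums[k] for k in range(n)
        )
        n_c += c_ii
    return n_c


# Toy binary matrix (hypothetical): two pairs of documents with disjoint vocabularies.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(estimate_cluster_count(toy))
```

For this toy matrix the estimate comes out at about 2, matching the two groups of documents that share no terms; a matrix in which every document has its own unique term gives one cluster per document, and a matrix in which all documents are identical gives a single cluster.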