An empirical analysis of similarity measures for unstructured data

Authors

  • Goswami M, CHRIST (Deemed to be University), Bangalore 560074, India
  • Purkayastha BS, Assam University, Silchar, Assam, India

Keywords:

similarity, cosine similarity, Jaccard similarity, commonality, Pearson, Spearman's correlation

Abstract

With the rapid growth of digital text on the internet and in digital repositories, the pool of digital documents grows every day. Handling such an enormous amount of data requires efficient and effective techniques. Mining documents depends on understanding them properly, and text similarity measurement plays a major role in finding coherence among documents. The goal of similarity computation is to identify cohesion among text documents and to prepare the text for applications such as document organization, plagiarism detection, and query matching. It is one of the most fundamental tasks in information retrieval, information extraction, document organization, plagiarism detection, and text mining, and the effectiveness of document clustering depends heavily on it. In this paper, four similarity measures are implemented and their descriptive statistics are compared. The results are found to be satisfactory, and graphs are drawn to visualize them.
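As an illustration of the kind of measures named in the keywords, the following is a minimal Python sketch of cosine similarity, Jaccard similarity, Pearson correlation, and Spearman's correlation between two short documents. It is not the authors' implementation: the abstract does not state which four measures were implemented or how the text was represented, so the whitespace tokenizer, the raw term-count vectors, and all function names here are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's preprocessing is not given.
    return text.lower().split()


def term_vectors(doc_a, doc_b):
    # Build aligned raw term-count vectors over the joint vocabulary of both documents.
    ca, cb = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    vocab = sorted(set(ca) | set(cb))
    return [ca[t] for t in vocab], [cb[t] for t in vocab]


def cosine(x, y):
    # Cosine of the angle between two vectors; 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def jaccard(doc_a, doc_b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    sa, sb = set(tokenize(doc_a)), set(tokenize(doc_b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pearson(x, y):
    # Pearson correlation coefficient; 0.0 for empty or constant vectors.
    n = len(x)
    if n == 0:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def ranks(values):
    # 1-based ranks, with tied values sharing the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    # Spearman's correlation is the Pearson correlation of the rank vectors.
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    a = "text mining finds patterns in text documents"
    b = "document clustering groups similar text documents"
    x, y = term_vectors(a, b)
    print("cosine  ", cosine(x, y))
    print("jaccard ", jaccard(a, b))
    print("pearson ", pearson(x, y))
    print("spearman", spearman(x, y))
```

Cosine and Jaccard here range from 0 to 1 on count data, while the two correlation coefficients range from -1 to 1, so their descriptive statistics are not directly comparable without keeping that difference in scale in mind.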

Published

2024-02-26

How to Cite

Goswami, M., & Purkayastha, B. (2024). An empirical analysis of similarity measures for unstructured data. COMPUSOFT: An International Journal of Advanced Computer Technology, 8(08), 3302–3306. Retrieved from https://ijact.in/index.php/j/article/view/519

Section

Original Research Article
