Duplicate Detection in Hierarchical XML Multimedia Data Using Improved Multidup Method
Keywords:
Duplicate detection, Data cleaning, XML Data, Bayesian network, Pruning
Abstract
Cleaning data in data warehouses that have a complex hierarchical structure is an important task today. One way to do this is to detect duplicates in large databases, which makes data mining more efficient and effective. Recently proposed algorithms consider relations in a single table, so they can find duplicates easily by comparing records pairwise. Nowadays, however, data is increasingly stored in more complex, semistructured or hierarchical forms, which raises the problem of how to detect duplicates in such XML data. Because the data models differ, algorithms designed for a single relation cannot be applied directly to XML data. The objective of this paper is to detect duplicates in hierarchical data that contains textual and multimedia content such as images, audio, and video. The proposed system also lets the user choose an action on the detected duplicates, such as delete or update, and includes a pruning algorithm to prune duplicate data. A Bayesian network is used for duplicate detection, and experiments on both artificial and real-world datasets are expected to show that the MULTIDUP method performs duplicate detection with high efficiency and effectiveness. The method compares each level of the XML tree from the root to the leaves. Its first step is to traverse the tree structure, comparing each descendant of the two input datasets, and to find duplicates despite differences in the data.
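The top-down comparison with pruning described above can be illustrated with a minimal sketch. It is not the MULTIDUP implementation: the Bayesian network is replaced here by a plain average of child scores, string similarity comes from Python's `difflib`, and the names `text_similarity`, `node_similarity`, and `threshold` are hypothetical. The sketch only shows the general idea of walking two XML trees from the root to the leaves and abandoning a pair once it can no longer reach the duplicate threshold.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """String similarity in [0, 1]; a simple stand-in for a leaf probability."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def node_similarity(n1, n2, threshold=0.0):
    """Score two elements top-down; return 0.0 early if the pair is pruned.

    Assumption: the score of an inner node is the average of the best
    same-tag child matches, not a true Bayesian network inference.
    """
    if n1.tag != n2.tag:
        return 0.0
    children1, children2 = list(n1), list(n2)
    if not children1 and not children2:
        return text_similarity(n1.text, n2.text)

    # Children missing on one side count as zero-score factors.
    total = max(len(children1), len(children2))
    score_sum = 0.0
    for i, c1 in enumerate(children1):
        candidates = [node_similarity(c1, c2) for c2 in children2 if c2.tag == c1.tag]
        score_sum += max(candidates, default=0.0)
        # Pruning: assume every child not yet compared scores 1.0.
        remaining = len(children1) - i - 1
        if (score_sum + remaining) / total < threshold:
            return 0.0
    return score_sum / total


a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
b = ET.fromstring("<movie><title>Matrix, The</title><year>1999</year></movie>")
c = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")

print(node_similarity(a, b))                 # similar records, differing title order
print(node_similarity(a, c, threshold=0.9))  # pruned pair returns 0.0
```

The pruning step uses an optimistic upper bound: if even perfect scores for the remaining children cannot lift the pair above the threshold, the rest of the subtree is skipped.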
License
Copyright (c) 2015 COMPUSOFT: An International Journal of Advanced Computer Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.