Duplicate Detection in Hierarchical XML Multimedia Data Using Improved Multidup Method

Pramod B. Gosavi; M.A. Patel

Authors

Gosavi PB, Associate Professor and HOD, IT Dept., Godavari College of Engineering, Jalgaon
Patel MA, Student, ME Computer Science and Engineering, Godavari College of Engineering, Jalgaon

Keywords:

Duplicate detection, Data cleaning, XML Data, Bayesian network, Pruning

Abstract

Cleaning data in data warehouses that have a complex hierarchical structure is an important task today. One way to do this is to detect duplicates in large databases, which makes data mining more efficient and effective. Recently proposed algorithms consider relations in a single table, so they can find duplicates by comparing records pairwise. Today, however, data is increasingly stored in more complex, semistructured or hierarchical forms, which raises the problem of how to detect duplicates in XML data. Because of the differences between the data models, algorithms designed for a single relation cannot be applied directly to XML data. The objective of this paper is to detect duplicates in hierarchical data that contains textual and multimedia content such as images, audio and video. The proposed system also lets the user choose how to act on the detected duplicates (for example, delete or update them) and includes a pruning algorithm to prune the duplicate data. A Bayesian network is used for duplicate detection, and through experiments on both artificial and real-world datasets the MULTIDUP method is expected to perform duplicate detection with high efficiency and effectiveness. The method compares each level of the XML tree, from the root to the leaves. The first step is to traverse the tree structure, comparing each descendant of the two input datasets, and to find duplicates despite differences in the data.
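The root-to-leaves comparison with pruning described in the abstract can be illustrated with a minimal Python sketch. It is not the MULTIDUP method itself: the Bayesian network is replaced here by a plain average of child scores, string similarity is taken from `difflib`, and the names `text_similarity`, `node_similarity`, `is_duplicate` and the sample `<movie>` records are all hypothetical. Pruning is approximated by an upper bound that assumes every not-yet-compared child tag matches perfectly; the comparison stops as soon as that bound falls below the threshold.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity in [0, 1] between two strings (difflib ratio, an assumed stand-in)."""
    a = (a or "").strip().lower()
    b = (b or "").strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def node_similarity(x, y, threshold=0.0):
    """Score two XML elements top-down; an illustrative average, not a Bayesian network.

    Leaves are compared by text. Inner nodes average, over all child tags,
    the best-match score of the children carrying that tag. If the optimistic
    upper bound drops below `threshold`, the remaining tags are skipped.
    """
    children_x, children_y = list(x), list(y)
    if not children_x and not children_y:
        return text_similarity(x.text, y.text)

    tags = sorted({c.tag for c in children_x} | {c.tag for c in children_y})
    total = 0.0
    for i, tag in enumerate(tags):
        xs = [c for c in children_x if c.tag == tag]
        ys = [c for c in children_y if c.tag == tag]
        if xs and ys:
            # Best partner in y for each child of x, averaged over x's children.
            total += sum(max(node_similarity(a, b) for b in ys) for a in xs) / len(xs)
        # A tag present on only one side contributes 0 to the total.
        remaining = len(tags) - (i + 1)
        upper_bound = (total + remaining) / len(tags)
        if upper_bound < threshold:
            return upper_bound  # pruned: the pair can no longer reach the threshold
    return total / len(tags)


def is_duplicate(x, y, threshold=0.7):
    """Classify a pair as duplicates when its score reaches the (assumed) threshold."""
    return node_similarity(x, y, threshold) >= threshold


# Hypothetical sample records, used only to exercise the functions above.
rec_a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
rec_b = ET.fromstring("<movie><title>Matrix, The</title><year>1999</year></movie>")
rec_c = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")

print(is_duplicate(rec_a, rec_b), is_duplicate(rec_a, rec_c))
```

Child nodes are scored without pruning (threshold 0.0) because a low score in one subtree does not by itself rule out the whole pair; only the top-level bound is checked against the threshold. Multimedia leaves such as images or audio would need their own similarity function in place of `text_similarity`, which this sketch does not attempt.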

