DATA EXTRACTION AND ALIGNMENT USING TAGS AND VALUE SIMILARITY
Keywords:
Data Extraction, QRRs, HTML DOM, Value Similarity
Abstract
Web databases generate query result pages in response to a user’s query. Automatically extracting the data from these pages is important for many applications, such as data integration, which need to cooperate with multiple web databases. This paper presents a novel data extraction and alignment method called DATVS that combines tag and value similarity. DATVS automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in the pages and then aligning the segmented QRRs into a table, in which the data values of the same attribute are put into the same column. Specifically, this paper proposes new techniques to handle the case when the QRRs are not contiguous, which may be due to the presence of auxiliary information such as a comment, recommendation, or advertisement, and to handle any nested structure that may exist in the QRRs. It also designs a new record alignment algorithm that aligns the attributes in a record, first pairwise and then holistically, by combining tag and data value similarity information. Experimental results show that DATVS achieves high precision and outperforms existing state-of-the-art data extraction methods.
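The pairwise alignment step described in the abstract can be illustrated with a minimal sketch. This is not the DATVS algorithm itself: the record representation (a list of tag-path and value pairs), the function names, the equal weighting of tag and value similarity, and the 0.5 threshold are all assumptions made for illustration. The sketch covers only the pairwise step and omits the holistic step, the handling of non-contiguous QRRs, and nested structures.

```python
from difflib import SequenceMatcher

def is_number(text):
    """Treat strings such as '$45.00' or '1,200' as numeric (crude heuristic)."""
    try:
        float(text.replace("$", "").replace(",", ""))
        return True
    except ValueError:
        return False

def value_similarity(a, b):
    """Assumed value similarity: 1.0 for two numbers, 0.0 for mixed types,
    otherwise a character-level ratio between the two strings."""
    if is_number(a) and is_number(b):
        return 1.0
    if is_number(a) != is_number(b):
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def combined_similarity(x, y, tag_weight=0.5):
    """Weighted mix of tag-path equality and value similarity (weights assumed)."""
    tag_sim = 1.0 if x[0] == y[0] else 0.0
    return tag_weight * tag_sim + (1 - tag_weight) * value_similarity(x[1], y[1])

def align_pair(rec_a, rec_b, threshold=0.5):
    """Order-preserving pairwise alignment by dynamic programming.
    Returns the matched (index_in_a, index_in_b) pairs."""
    n, m = len(rec_a), len(rec_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = combined_similarity(rec_a[i - 1], rec_b[j - 1])
            gain = s if s >= threshold else 0.0
            score[i][j] = max(score[i - 1][j - 1] + gain,
                              score[i - 1][j], score[i][j - 1])
    # Trace back from the bottom-right cell to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = combined_similarity(rec_a[i - 1], rec_b[j - 1])
        if s >= threshold and score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def to_table(rec_a, rec_b, pairs):
    """Lay the two records out as rows of one table; None marks a missing value."""
    row_a, row_b, i, j = [], [], 0, 0
    for pi, pj in pairs + [(len(rec_a), len(rec_b))]:  # sentinel flushes the tail
        while i < pi:
            row_a.append(rec_a[i][1]); row_b.append(None); i += 1
        while j < pj:
            row_a.append(None); row_b.append(rec_b[j][1]); j += 1
        if pi < len(rec_a):
            row_a.append(rec_a[pi][1]); row_b.append(rec_b[pj][1])
            i, j = pi + 1, pj + 1
    return row_a, row_b

# Two hypothetical QRRs; the second lacks the author attribute.
rec_a = [("td/a", "Data Mining"), ("td/span", "Han"), ("td/b", "$45.00")]
rec_b = [("td/a", "Data Mining Concepts"), ("td/b", "$39.95")]
pairs = align_pair(rec_a, rec_b)
for row in to_table(rec_a, rec_b, pairs):
    print(row)
```

In this toy input the two titles share a tag path and similar text, and the two prices share a tag path and a numeric type, so they fall into the same columns, while the author value of the first record is left with an empty cell in the second row.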
License
Copyright (c) 2014 COMPUSOFT: An International Journal of Advanced Computer Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.