Data extraction and label assignment for web databases
Keywords:
Web databases, Information extraction, Visual features
Abstract
Deep Web content is accessed through queries submitted to Web databases, and the returned data records are wrapped in dynamically generated Web pages (called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem because of the intricate underlying structure of such pages. Many techniques have been proposed to address this problem, but all of them have limitations because they depend on the programming language of the Web page.
License
Copyright (c) 2015 COMPUSOFT: An International Journal of Advanced Computer Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.