Data extraction and label assignment for web databases

T. Rajesh; T. Prathap; S. Naveen Nambi; A. R. Arunachalam

Authors

Rajesh T UG Student, Department of CSE, Bharath University
Prathap T UG Student, Department of CSE, Bharath University
Nambi SN UG Student, Department of CSE, Bharath University
Arunachalam AR Assistant Professor, Department of CSE, Bharath University

Keywords:

Web databases, Information extraction, Visual features

Abstract

Deep Web contents are accessed by queries submitted to Web databases, and the returned data records are enwrapped in dynamically generated Web pages (called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem because of the intricate underlying structures of such pages. A large number of techniques have been proposed to address this problem, but all of them have limitations because they are Web-page-programming-language-dependent.

