Improving the Performance of Crawler Using Body Text Normalization
Keywords:
Crawler with URL normalization, Crawler with whole page content MD5, Crawler with body text normalization
Abstract
A search engine comprises components such as a crawler, a repository, indexing, querying and ranking. The crawler's job is to crawl the web and download pages, which are then stored in the repository. The crawler should be smart enough to identify which pages it has already crawled and which it has not. Here we propose a mechanism that avoids downloading duplicate page content and also avoids unnecessary URL extraction time. To achieve this, we compute an MD5 digest of the body text of every page.
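The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the body text is normalized, so the lowercasing and whitespace collapsing below are assumptions, and the names `BodyTextExtractor`, `body_text_digest` and `SeenPages` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text_digest(html):
    """Return the MD5 hex digest of the normalized body text of a page."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    # Assumed normalization: collapse whitespace runs and lowercase.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class SeenPages:
    """Remembers body-text digests of pages that were already processed."""

    def __init__(self):
        self.digests = set()

    def is_new(self, html):
        # A page counts as a duplicate when its body-text digest was seen before.
        digest = body_text_digest(html)
        if digest in self.digests:
            return False
        self.digests.add(digest)
        return True
```

In a crawler loop, a page for which `is_new` returns `False` would be neither stored nor passed on to link extraction, which is where the saving in URL extraction time would come from.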
License
Copyright (c) 2013 COMPUSOFT: An International Journal of Advanced Computer Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.