Improving the Performance of Crawler Using Body Text Normalization

Authors

  • Qureshi F Research Scholar, Computer Science Dept, V.I.T.S Kareemnagar
  • Khan AA Asst. Professor, JNTU University

Keywords:

Crawler with URL normalization, Crawler with whole page content MD5, Crawler with body text normalization

Abstract

A search engine comprises components such as a crawler, a repository, indexing, querying and ranking. The crawler's job is to crawl the web and download pages, which are then stored in the repository. The crawler should be smart enough to tell whether or not it has already crawled a given page. Here we propose a mechanism that avoids downloading duplicate page content and avoids spending time on unnecessary URL extraction. To achieve this, we compute an MD5 digest of the body text of every page.
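The idea of fingerprinting a page by an MD5 digest of its body text can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `BodyTextExtractor`, `body_text_digest`, `is_duplicate` and `seen_digests` are hypothetical, and the normalization steps (skipping script and style content, collapsing whitespace, lowercasing) are assumptions, since the abstract does not say how the body text is normalized.

```python
import hashlib
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text_digest(html):
    """Return the MD5 hex digest of the normalized body text of a page."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Assumed normalization: collapse whitespace runs and lowercase.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Digests of pages already stored in the repository (hypothetical store).
seen_digests = set()


def is_duplicate(html):
    """True if a page with the same body text digest was seen before."""
    digest = body_text_digest(html)
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```

In a crawler loop, a check such as `is_duplicate(page_html)` would run before URL extraction, so that a page whose body text has already been seen is neither stored again nor parsed for links. Hashing only the body text means that two pages differing in their head section or markup whitespace still map to the same digest, which a digest of the whole page content would not guarantee.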

Published

2024-02-26

How to Cite

Qureshi, F., & Khan, A. A. (2024). Improving the Performance of Crawler Using Body Text Normalization. COMPUSOFT: An International Journal of Advanced Computer Technology, 2(07), 215–220. Retrieved from https://ijact.in/index.php/j/article/view/40

Section

Original Research Article