Improving the Performance of Crawler Using Body Text Normalization


A search engine comprises components such as a crawler, a repository, indexing, querying, and ranking. The crawler's job is to traverse the web and download pages, which are then stored in the repository. The crawler should be smart enough to tell which pages it has already crawled and which it has not. Here we propose a mechanism that avoids downloading duplicate page content and also avoids spending time on unnecessary URL extraction. To achieve this, we compute an MD5 digest of the body text of every page.
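
As an illustration of the idea, the following is a minimal Python sketch of duplicate detection based on an MD5 digest of normalized body text. It is not the implementation proposed in this work. The names `BodyTextExtractor`, `body_text_digest`, `is_duplicate`, and `seen_digests` are hypothetical, and the normalization rules (dropping script and style content, collapsing whitespace, lowercasing) are assumptions chosen for the example.

```python
import hashlib
import re
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text_digest(html):
    """Return the MD5 hex digest of the normalized body text of a page."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Assumed normalization: collapse whitespace runs and ignore case.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Digests of pages already stored in the repository (in-memory for the sketch).
seen_digests = set()


def is_duplicate(html):
    """Return True if a page with the same body text digest was seen before."""
    digest = body_text_digest(html)
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```

In this sketch, a crawler would call `is_duplicate` on each fetched page and skip both storage and URL extraction when it returns `True`. Two pages whose markup or headers differ but whose body text is the same after normalization yield the same digest.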


