Indexing Crawler Issues

The crawler is an indexing application that spiders hosts and puts the results into the search engine. Like Google, Bing and other search engines, the system searches out new contents on the web continually and adds the content to the search engine database. Usually, these types of activities cause little issues for those whose sites are being indexed, and in fact, over the years an etiquette system based on rules placed in the robots.txt file of a web site has emerged.

Robots.txt files provide a rule set for search engine behaviors. They indicate what areas of a site a crawler may index and what sections of the site are to be avoided. Usually this is used to protect overly dynamic areas of the site where a crawler could encounter a variety of problems or inputs that can have either bandwidth or application issues for either the crawler, the web host or both. 

Sadly, many web crawlers and index bots do not honor the rules of robots.txt. Nor do attackers who are indexing your site for a variety of attack reasons. Given the impacts that some of these indexing tools can have on bandwidth, CPU use or database connectivity, other options for blocking them are sometimes sought. In particular, there are a lot of complaints about and their aggressive parsing, application interaction and deep site inspection techniques. They clearly have been identified as a search engine that does not seem to respect the honor system of robots.txt. A Google search for “ ignores robots.txt” will show you a wide variety of complaints.

In our monitoring of the HITME traffic, we have observed many deep crawls by from a variety of IP ranges. In the majority of them, they either never requested the robots.txt file at all, or they simply ignored the contents of the file altogether. In fact, some of our HITME web applications have experienced the same high traffic cost concerns that other parts of the web community have been complaining about. In a couple of cases, the cost for supporting the scans of represent some 30+% of the total web traffic observed by the HITME end point. From our standpoint, that’s a pain in the pocketbook and in our attention span, to continually parse their alert traffic out of our metrics.

Techniques for blocking more forcibly than robots.txt have emerged. You can learn about some of them by searching “blocking”. The easiest and what has proven to be an effective way, is to use .htaccess rules. We’ve also had some more modest success with forcibly returning redirects to requests with known url parameters associated with, along with some level of success by blocking specific IPs associated with them via an ignore rule in HoneyPoint.

If you are battling crawling and want to get some additional help, drop us a comment or get in touch via Twitter (@lbhuston, @microsolved). You can also give an account representative a call to arrange for a more technical discussion. We hope this post helps some folks who are suffering increased bandwidth use or problems with their sites/apps due to this and other indexing crawler issues. Until next time, stay safe out there!