Yandex.ru Indexing Crawler Issues

The yandex.ru crawler is an indexing application that spiders hosts and puts the results into the yandex.ru search engine. Like Google, Bing and other search engines, the system searches out new contents on the web continually and adds the content to the search engine database. Usually, these types of activities cause little issues for those whose sites are being indexed, and in fact, over the years an etiquette system based on rules placed in the robots.txt file of a web site has emerged.

Robots.txt files provide a rule set for search engine behaviors. They indicate what areas of a site a crawler may index and what sections of the site are to be avoided. Usually this is used to protect overly dynamic areas of the site where a crawler could encounter a variety of problems or inputs that can have either bandwidth or application issues for either the crawler, the web host or both. 

Sadly, many web crawlers and index bots do not honor the rules of robots.txt. Nor do attackers who are indexing your site for a variety of attack reasons. Given the impacts that some of these indexing tools can have on bandwidth, CPU use or database connectivity, other options for blocking them are sometimes sought. In particular, there are a lot of complaints about yandex.ru and their aggressive parsing, application interaction and deep site inspection techniques. They clearly have been identified as a search engine that does not seem to respect the honor system of robots.txt. A Google search for “yandex.ru ignores robots.txt” will show you a wide variety of complaints.

In our monitoring of the HITME traffic, we have observed many deep crawls by yandex.ru from a variety of IP ranges. In the majority of them, they either never requested the robots.txt file at all, or they simply ignored the contents of the file altogether. In fact, some of our HITME web applications have experienced the same high traffic cost concerns that other parts of the web community have been complaining about. In a couple of cases, the cost for supporting the scans of yandex.ru represent some 30+% of the total web traffic observed by the HITME end point. From our standpoint, that’s a pain in the pocketbook and in our attention span, to continually parse their alert traffic out of our metrics.

Techniques for blocking yandex.ru more forcibly than robots.txt have emerged. You can learn about some of them by searching “blocking yandex.ru”. The easiest and what has proven to be an effective way, is to use .htaccess rules. We’ve also had some more modest success with forcibly returning redirects to requests with known url parameters associated with yandex.ru, along with some level of success by blocking specific IPs associated with them via an ignore rule in HoneyPoint.

If you are battling yandex.ru crawling and want to get some additional help, drop us a comment or get in touch via Twitter (@lbhuston, @microsolved). You can also give an account representative a call to arrange for a more technical discussion. We hope this post helps some folks who are suffering increased bandwidth use or problems with their sites/apps due to this and other indexing crawler issues. Until next time, stay safe out there!

Which Application Testing is Right for Your Organization?

Millions of people worldwide bank, shop, buy airline tickets, and perform research using the World Wide Web. Each transaction usually includes sharing private information such as names, addresses, phone numbers, credit card numbers, and passwords. They’re routinely transferred and stored in a variety of locations. Billions of dollars and millions of personal identities are at stake every day. In the past, security professionals thought firewalls, Secure Sockets Layer (SSL), patching, and privacy policies were enough to protect websites from hackers. Today, we know better.

Whatever your industry — you should have a consistent testing schedule completed by a security team. Scalable technology allows them to quickly and effectively identify your critical vulnerabilities and their root causes in nearly any type of system, application, device or implementation.

At MSI, our reporting presents clear, concise, action-oriented mitigation strategies that allows your organization to address the identified risks at the technical, management and executive levels.

There are several ways to strengthen your security posture. These strategies can help: application scanning, application security assessments, application penetration testing, and risk assessments.

Application scanning can provide an excellent and affordable way for organizations to meet the requirements of due diligence; especially for secondary, internal, well-controlled or non-critical applications.

Application security assessments can identify security problems, catalog their exposures, measure risk, and develop mitigation strategies that strengthen your applications for your customers. This is a more complete solution than a scan since it goes deeper into the architecture.

Application penetration testing uses tools and scripts to mine your systems for data and examine underlying session management and cryptography. Risk assessments include all policies and processes associated with the specific application, and will be reviewed depending on the complexity of your organization.

In order to protect your organization against security breaches (which are only increasing in frequency), consider conducting an application scan, application  assessment, application penetration test, or risk assessment on a regular basis. If you need help deciding which choice is best for you, let us know. We’re here to help!