In the last month or so, I have encountered a lot of bots (around a thousand) scraping a forum I'm admin of. After some digging, I've seen that most of the bots come from the 173.192.x.x range.
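Digging through the access logs is how a pattern like this usually shows up. Here is a minimal sketch of counting requests per /16 prefix; the log lines are made up for illustration, and a standard Apache/nginx combined log format (client IP as the first field) is assumed:

```python
from collections import Counter

# Made-up sample lines in Apache combined log format
sample_log = """\
173.192.10.5 - - [01/Jan/2013:10:00:00 +0200] "GET /forum HTTP/1.1" 200 512
173.192.44.9 - - [01/Jan/2013:10:00:01 +0200] "GET /forum/t1 HTTP/1.1" 200 900
81.218.3.7 - - [01/Jan/2013:10:00:02 +0200] "GET / HTTP/1.1" 200 300
"""

def count_prefixes(lines):
    """Count requests per x.y.x.x (/16) prefix of the client IP."""
    counts = Counter()
    for line in lines:
        ip = line.split(" ", 1)[0]          # first field is the client IP
        octets = ip.split(".")
        if len(octets) == 4:
            counts[".".join(octets[:2]) + ".x.x"] += 1
    return counts

print(count_prefixes(sample_log.splitlines()).most_common())
```

Run against a real log (e.g. `count_prefixes(open("access.log"))`), a single dominant prefix like 173.192.x.x stands out immediately.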
I went to a "whois" site, which suggested that the range belongs to SoftLayer's data centers:
NetRange: 173.192.0.0 - 173.193.255.255
CIDR: 173.192.0.0/15
OriginAS: AS36351
NetName: SOFTLAYER-4-8
NetHandle: NET-173-192-0-0-1
Parent: NET-173-0-0-0-0
NetType: Direct Allocation
Comment: SoftLayer provides on-demand IT infrastructure, dedicated servers and cloud resources.
RegDate: 2009-07-21
Updated: 2012-03-09
Ref: http://whois.arin.net/rest/net/NET-173-192-0-0-1

What the? My site is located in Israel, and it's in Hebrew, so there's no reason for them to scan it.
But after googling around, I found this:
The Aboundex Crawler is a bot from Aboundex Search, currently operating out of the SoftLayer network from the 173.192.x.x range.

According to this, the Aboundex Crawler bot ignores the robots.txt file. So why not just ban them?
Reports about the Aboundex crawler claim that it ignores the rules in robots.txt, and that it is a fast page scraper which may switch IPs when blocked from spidering pages.
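A crawler that ignores robots.txt can't be stopped politely, so one heavier-handed option is to deny the whole SoftLayer range at the web server level. A sketch for an Apache 2.2-style server (assuming the standard `Order`/`Deny` directives from mod_authz_host are available):

```apache
# Deny the entire SoftLayer 173.192.0.0/15 range
Order Allow,Deny
Allow from all
Deny from 173.192.0.0/15
```

The trade-off is that this blocks every host in the data center, not just the crawler, so it's worth trying robots.txt first.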
Well, I think if some new search engine wants to build a good reputation, it must follow some simple rules, and of course one of them is respecting robots.txt. So maybe something is wrong with my site? Let's check what the Aboundex site suggests.
The site doesn't seem to be working, as it says "under construction" when you try to search something, but there is an about page with this info (the only link on the site):
How do I stop Aboundexbot from indexing my website? If you have a concern about Aboundexbot, we hope you give us a chance to address it via the email below, but if you need to block Aboundexbot, the robots.txt file will allow you to accomplish that goal.

I guess it's worth a try. What do you think? I'll update this post once I've added it to the forum.
To block Aboundexbot from your entire web site, add this to your robots.txt file:
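Assuming the crawler identifies itself as "Aboundexbot" (the name used on their about page), the standard robots.txt rule is:

```text
User-agent: Aboundexbot
Disallow: /
```

The file must live at the root of the site (e.g. `/robots.txt`), and `Disallow: /` excludes every path on the site for that user-agent.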
Hope you enjoyed :)