Monday, September 3, 2012

Aboundex bot

This is my first post, with a story that may help you. Enjoy!
Over the last month or so, I have seen a lot of bots (around a thousand) scraping a forum I administer. After some digging, I found that most of them come from the 173.192.x.x segment.
A whois lookup showed that the segment belongs to SoftLayer's data centers:
NetRange:       173.192.0.0 - 173.193.255.255
CIDR:           173.192.0.0/15
OriginAS:       AS36351
NetName:        SOFTLAYER-4-8
NetHandle:      NET-173-192-0-0-1
Parent:         NET-173-0-0-0-0
NetType:        Direct Allocation
Comment:        SoftLayer provides on-demand IT infrastructure, dedicated servers and cloud resources.
RegDate:        2009-07-21
Updated:        2012-03-09
Ref:            http://whois.arin.net/rest/net/NET-173-192-0-0-1
What the...? My site is hosted in Israel and it's in Hebrew, so there's no obvious reason for them to scan it.
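If you want to repeat this check on your own logs, here is a minimal Python sketch. It assumes you have already pulled the client IP addresses out of your access log into a list; the sample addresses and the helper names are made up for illustration, and only the CIDR block comes from the whois output above.

```python
import ipaddress
from collections import Counter

# CIDR block taken from the whois output above.
SOFTLAYER_NET = ipaddress.ip_network("173.192.0.0/15")


def in_softlayer_range(ip):
    """Return True if the given IPv4 address string is inside the block."""
    return ipaddress.ip_address(ip) in SOFTLAYER_NET


def count_hits(ips):
    """Count how many addresses fall inside and outside the block."""
    counts = Counter()
    for ip in ips:
        key = "softlayer" if in_softlayer_range(ip) else "other"
        counts[key] += 1
    return counts


# Hypothetical sample; in practice these would come from your access log.
sample_ips = ["173.192.34.95", "173.193.10.20", "192.0.2.10"]
hits = count_hits(sample_ips)
print(hits["softlayer"], "of", len(sample_ips), "requests came from the block")
```

Note that a /15 covers both 173.192.x.x and 173.193.x.x, so this is a bit wider than the segment I first spotted.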
But after googling around, I found this:
The Aboundex Crawler is a bot from Aboundex Search, currently operating out of the Softlayer network with the IP Address 173.192.34.95.
Reports about the Aboundex crawler claim it ignores rules in robots.txt, and is a fast page scraper which may switch IP's when blocked from spidering pages.
According to this, the Aboundex Crawler bot ignores the robots.txt file. So why not just ban them?
Well, I think that if a new search engine (or whatever it is) wants to build a good reputation, it has to follow some simple rules, and of course one of them is respecting robots.txt. So maybe something is wrong with my site? Let's check what the Aboundex site suggests.
The site doesn't seem to be working, as it says "under construction" when you try to search for something, but there is an about page with this info (the only link on the site):
How do i stop Aboundexbot from indexing my website? If you have a concern about Aboundexbot, we hope you give us a chance to address it via the email below but if you need to block Aboundexbot, the robots.txt file will allow you to accomplish that goal.

To block Aboundexbot from your entire web site you add this to your robots.txt file:

User-agent: Aboundexbot
Disallow: /  
I guess it's worth a try. What do you think? I'll add it to the forum's robots.txt and post an update later.
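Before deploying a rule like this, it can be sanity-checked locally. The sketch below uses Python's standard urllib.robotparser to confirm that the two lines block Aboundexbot and nothing else. The example.com URL and the "Googlebot" comparison are just placeholders I picked, and of course this only shows what a well-behaved crawler would do; it says nothing about whether Aboundexbot actually obeys the file.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted from the Aboundex about page.
ROBOTS_TXT = """\
User-agent: Aboundexbot
Disallow: /
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; substitute a real page from your own site.
url = "http://example.com/forum/index.php"

aboundex_allowed = parser.can_fetch("Aboundexbot", url)
other_allowed = parser.can_fetch("Googlebot", url)

print("Aboundexbot allowed:", aboundex_allowed)
print("Other bots allowed:", other_allowed)
```

If the first value is False and the second is True, the rule does what the about page says it should.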

Hope you enjoyed :)
