Topic: Band width, spiders and spammers.

Julian
Band width, spiders and spammers.
« on: February 19, 2013, 12:40:25 PM »
Just been looking through the stats, prompted by RM flagging up the "Who's where" page.  The "who's where" page frequently shows "Guests" logging on and registering, but these "Guests" never seem to appear on the site ... my guess is that they are spammers.

The monthly stats show a healthy increase in traffic, but looking at the forum figures, the highest number of people we've had online in a day is 30, and the usual figure is between 10 and 20.  I assume wiki pages are included in the monthly stats, so that comparison may be a little misleading.

Anyhow, delving a bit further, the "Visiting Spiders" page shows quite a few spiders using up a lot of megabytes ("M"s), and few of them return any traffic ...

googlebot ... 78.2M ... 92.4% of referrals
msnbot\-media ... 30.5M ... can't find any referral figures
baiduspider ... 51.8M ... Chinese spider ... returns no traffic
bot[ :,\.\;\/\\-] ... 36.2M ... can't find any info ... returns no traffic
robot 1, ... 18.5M ... can't find any info ... returns no traffic
yandex ... 14.8M ... appears to be Russian ... 3.3% of referrals ... find that strange!
[ :,\.\;\/\\-]bot ... 11.6M ... can't find any info ... returns no traffic
unknown ... 51.6k ... ?
agabondo ... 16.1M ... appears to be Italian ... returns no traffic
crawl ... 7.0M ... can't find any info ... returns no traffic
slurp ... 2.8M ... doesn't appear in the referrers list, but I think it's an old, genuine search engine
mj12bot ... 3.4M ... can't find much info, but people on Google are complaining about it wasting bandwidth ... returns no traffic
magpie ... 3.1M ... may be something to do with Twitter; people complaining about it wasting bandwidth ... returns no traffic
ia_archiver ... 3.3M ... Alexa; creates some sort of archive but never sends traffic to a site
cfnetwork ... 1.9M ... no idea about this one; lots of info on Google I don't understand! ... returns no traffic
no_user_agent ... 8.6M ... don't understand this; people on Google saying "no user agent eating up bandwidth" ... returns no traffic
bspider ... 1.2M ... believe Japanese ... returns no traffic
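
For a rough sense of scale, the figures in the list above can be totted up with a few lines of Python. This is only a sketch: it assumes "M" and "k" mean megabytes and kilobytes with 1M = 1000k, it treats googlebot and yandex as the only spiders that send visitors back (going by the notes above), and the dictionary keys are just the ids as the stats page shows them.

```python
# Bandwidth per spider, copied from the "Visiting Spiders" list above.
SPIDERS = {
    "googlebot": "78.2M",
    r"msnbot\-media": "30.5M",
    "baiduspider": "51.8M",
    r"bot[ :,\.\;\/\\-]": "36.2M",
    "robot": "18.5M",
    "yandex": "14.8M",
    r"[ :,\.\;\/\\-]bot": "11.6M",
    "unknown": "51.6k",
    "agabondo": "16.1M",
    "crawl": "7.0M",
    "slurp": "2.8M",
    "mj12bot": "3.4M",
    "magpie": "3.1M",
    "ia_archiver": "3.3M",
    "cfnetwork": "1.9M",
    "no_user_agent": "8.6M",
    "bspider": "1.2M",
}

# Assumption: only these two appear to return any traffic.
RETURNS_TRAFFIC = {"googlebot", "yandex"}

# Assumption: 1M = 1000k; the stats page does not say which units it uses.
UNIT_IN_MEGABYTES = {"M": 1.0, "k": 0.001}


def to_megabytes(size: str) -> float:
    """Convert a figure such as '78.2M' or '51.6k' to megabytes."""
    return float(size[:-1]) * UNIT_IN_MEGABYTES[size[-1]]


total = sum(to_megabytes(size) for size in SPIDERS.values())
wasted = sum(
    to_megabytes(size)
    for name, size in SPIDERS.items()
    if name not in RETURNS_TRAFFIC
)

print(f"All spiders: {total:.1f}M")
print(f"Spiders returning no traffic: {wasted:.1f}M "
      f"({100 * wasted / total:.0f}% of spider bandwidth)")
```

The split between "useful" and "wasted" depends entirely on which spiders go in RETURNS_TRAFFIC, so moving slurp or msnbot\-media into that set changes the answer.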

These are the stats for search engines that do refer traffic ...

 Google ... 92.4%
 yandex ... 3.3%
 Yahoo! ... 2.2%
 Search.com ... 1.3%
 Ask ... 0.4%
 Windows Live ... 0.2%
 AOL ... 0.1%
 Google Images ... 0.0%

So, after all that, is it worth adding rules to our robots.txt file to ban all the spiders and bots which pointlessly use bandwidth?  Or perhaps ban all spiders except for the friendly, known ones?  Of the big list above, only two seem to be genuine and only one actually appears to return traffic (not sure about the Russian spider!).  I'm guessing these spiders might screw up the "Visitors by country" figures too.
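
As an illustration of the "ban all spiders except the known ones" idea, here is a sketch of an allow-list robots.txt, checked with Python's standard urllib.robotparser. The choice of Googlebot and Yandex is only an example, the URL is a placeholder, and the names in the stats list are the stats package's own patterns, which may not match the tokens each crawler actually looks for in robots.txt. Bear in mind that robots.txt is purely advisory: polite crawlers obey it, but spam scripts and badly behaved bots can simply ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-list: named crawlers may fetch everything,
# every other crawler is asked to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Yandex
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "http://forum.example/index.php") -> bool:
    """Return True if ROBOTS_TXT lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Googlebot", "Yandex", "Baiduspider", "MJ12bot"):
    print(agent, "allowed" if is_allowed(agent) else "blocked")
```

An empty "Disallow:" line means nothing is off limits for that crawler, while "Disallow: /" under "User-agent: *" covers everyone not named earlier in the file.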

In addition, if we are listed on obscure search engines, could that be a source of targets for spammers, who then use additional bandwidth attempting to register?


I know little about all this, but I'm guessing this might be an issue on the VOD, as Paddy seems to have excessive bandwidth usage.  If someone in the know can verify that my reasoning has a modicum of logic, I'll post on the VOD suggesting someone checks it out.
Used Cooking Oil Collection website ... http://www.surreyusedcookingoilcollection.palmergroup.co.uk

Tony
Re: Band width, spiders and spammers.
« Reply #1 on: February 20, 2013, 10:59:57 AM »
Shouldn't really be a problem; the traffic from web spiders (useful or not) is relatively minor compared to everything else.

Yes, it does skew the stats a little, but then all stats have to be interpreted with an understanding of how they were generated and the pitfalls that come with that.

I suspect that most of the "guest logging in" entries that get shown are scripts trying known username/password combinations to get access to the forum, or automated spam software that has gone through the registration process and is checking whether the registration succeeded.  Not a lot we can do about those.