What Are Web Crawlers And How to Control Them

Written by Tabaré Patiño
A web crawler (also known as a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. This process is called Web Crawling or Spidering, and like most things in life there are those that are good, and those that are bad.


Many legitimate sites (in particular search engines such as Google and Microsoft Bing) use spidering as a means of providing up-to-date data. Web crawlers visit sites, take a copy of the pages they visit and then index them to provide fast searches. Crawlers can also be used for automating maintenance tasks on a web site, such as checking links or validating HTML code. Unfortunately, crawlers also exist that have more sinister intentions, such as harvesting email addresses from web pages for spamming purposes or submitting spam comments to your website forms or blogs.

How can I see crawler activity on my site?

It's normal to see crawlers on your website, and if you want your site searchable via search engines you definitely want them to visit! We offer an application called AWStats within your account cPanel interface which will show you common crawlers that have visited your site. To access it, log in to your control panel, and click on AWStats. You can then select which domain name you wish to view statistics for (if you have more than one).

AWStats will open a new window, and from there you will see a complete analysis of the visitors that have been accessing your website, including the web crawlers/spiders.

You should see something similar to the following screenshot:



In this example, you can see that the site received thousands of visits from web crawlers which consumed quite a large amount of bandwidth.
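
As a rough illustration of the kind of per-crawler summary described above, the following Python sketch tallies hits and bytes per crawler user agent from web server log lines. It is not how AWStats works internally. It assumes the Apache "combined" log format, and the CRAWLER_MARKERS substrings and the summarise_crawlers name are invented for this example.

```python
import re
from collections import defaultdict

# Assumed format (Apache "combined"):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)

# Illustrative substrings only; real crawler lists are much longer.
CRAWLER_MARKERS = ("bot", "spider", "crawl")


def summarise_crawlers(lines):
    """Return {user_agent: (hits, bytes)} for requests that look like crawlers."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent")
        if not any(marker in agent.lower() for marker in CRAWLER_MARKERS):
            continue  # probably a normal browser
        size = match.group("size")
        totals[agent][0] += 1
        totals[agent][1] += 0 if size == "-" else int(size)  # "-" means no body
    return {agent: tuple(counts) for agent, counts in totals.items()}
```

Passing an open log file (or any list of lines) to summarise_crawlers gives one entry per crawler user agent. Where the raw access log lives depends on your hosting setup.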

What problems can web crawlers cause my website or server?

You should bear in mind that not all web crawlers are 'friendly', and even those that are (Google etc.) can affect server performance.

Sometimes, a web crawler will attempt to spider your site too aggressively, which can result in the server overloading. In these instances, our automated protection systems may briefly suspend your site to prevent inconveniencing other clients by causing server overloads or slow-downs. Bots can also consume quite large amounts of bandwidth. Consider the following daily snapshot for one client site which was suddenly aggressively spidered by Bing:



And a monthly view:


What can I do to protect myself from these web crawlers?

Consider the following scenarios and solutions:

1) My website has a web form and I'm receiving a lot of spam

Install a good CAPTCHA system, such as reCAPTCHA from Google. If you have a third-party application, check for plugins or extensions that add CAPTCHAs to web forms.

2) I receive lots of spam in my personal/company e-mail address, which is displayed on my website

If you need to publish your personal/company e-mail address on your website, you need to make sure it is hidden in the source code. Most web crawlers don't work like humans - they just scan the source code of web pages looking for e-mail addresses and collect them. There are ways to combat this, and we'll look at two ways to avoid your e-mail addresses being collected below:

a) Instead of writing your e-mail address in plain text, use an image. You can use any image editing program such as Paint (Windows) or GIMP (Linux, Mac OS) to create a small image that contains your e-mail address. See the following example created using Paint:



To display it on your website, you should save it as a .gif or .png file and add it to your website as an image in line with your text.

b) If you still want your e-mail address in plain text so visitors can easily copy and paste your e-mail address, you can try the following solution:

http://www.maurits.vdschee.nl/php_hide_email/

Here you will find several examples and tools that will help you to hide your e-mail address. Once you implement one of those methods correctly, you should see something like the following:
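
Separately from the tools on that page, the general idea of keeping the address out of the raw page source can be illustrated with a much simpler technique: writing every character as an HTML character reference. This Python sketch is an assumption-laden example, not the method used by the linked tool. The function names and the address info@example.com are made up, and this kind of encoding only deters simple harvesters, since a more capable crawler can decode it.

```python
import html


def obfuscate_email(address):
    """Encode every character of the address as a decimal HTML entity."""
    return "".join(f"&#{ord(ch)};" for ch in address)


def mailto_link(address):
    """Build a mailto link whose source contains only encoded characters."""
    encoded = obfuscate_email(address)
    return f'<a href="mailto:{encoded}">{encoded}</a>'


encoded = obfuscate_email("info@example.com")  # placeholder address
# html.unescape decodes character references much as a browser does,
# so visitors still see (and can copy) the real address.
assert html.unescape(encoded) == "info@example.com"
print(mailto_link("info@example.com"))
```

The generated link can be pasted into a page in place of the plain-text address.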


3) My website is consuming an abnormal amount of bandwidth and I have confirmed that this is due to these web crawlers.

This can be quite tricky to deal with, as you don't want to block 'good' web crawlers or your SEO may be affected. It is, however, possible to block some known bad bots. To do so, edit or create the .htaccess file in your public_html folder and add the code found at the following address to the top of the file: http://pastebin.com/L397kQ9A

Note: Once you open that URL, please click the toggle button
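
To give a feel for what user-agent blocking rules in .htaccess can look like, here is a small Python sketch that generates a set of mod_rewrite lines from a list of bot names. It does not reproduce the list at the address above. The bot names "BadBot" and "EvilScraper" and the build_block_rules helper are hypothetical, and the output assumes mod_rewrite is enabled on the server.

```python
import re

# Hypothetical user-agent tokens, used purely for illustration.
EXAMPLE_BAD_AGENTS = ["BadBot", "EvilScraper"]

# Only simple tokens are accepted so the generated pattern stays predictable.
_SAFE_TOKEN = re.compile(r"^[A-Za-z0-9_-]+$")


def build_block_rules(agents):
    """Return mod_rewrite lines that answer 403 Forbidden to the given agents."""
    if not agents:
        # An empty pattern would match every visitor, so refuse to build it.
        raise ValueError("at least one user-agent token is required")
    for agent in agents:
        if not _SAFE_TOKEN.match(agent):
            raise ValueError(f"unsupported characters in token: {agent!r}")
    pattern = "|".join(agents)
    return "\n".join([
        "RewriteEngine On",
        # [NC] makes the user-agent match case-insensitive.
        f"RewriteCond %{{HTTP_USER_AGENT}} ({pattern}) [NC]",
        # "-" leaves the URL unchanged; [F] returns 403 Forbidden.
        "RewriteRule .* - [F,L]",
    ])


print(build_block_rules(EXAMPLE_BAD_AGENTS))
```

Generating the rules from a list keeps the pattern in one place, which makes it easier to review exactly which user agents will be refused before the lines go into .htaccess.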

4) Search engine bots are visiting and indexing my website so often that it is overloading my account/server:

You can try to prevent this by setting the number of seconds between successive requests to your website by the web crawlers. To do so, edit or create a robots.txt file in the domain's folder on the server (usually public_html unless it is an add-on domain) and add the following lines:
User-agent: * 
Crawl-delay: 5
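
One way to sanity-check these two lines is to parse them the way a well-behaved crawler might. The sketch below uses Python's standard urllib.robotparser module. The read_crawl_delay helper and the crawler name "ExampleBot" are made up for the example, and real crawlers may interpret the directive differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
"""


def read_crawl_delay(robots_text, agent):
    """Return the Crawl-delay (in seconds) that applies to agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.crawl_delay(agent)


# "ExampleBot" is a hypothetical crawler; the "*" group applies to it.
print(read_crawl_delay(ROBOTS_TXT, "ExampleBot"))
```

A crawler that honours the directive would then wait at least that many seconds between successive requests to the site.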

It is also recommended that you block access to folders where you have sensitive data that should not be accessed by web crawlers. For example, if you have a WordPress application installed, you can tell the bots that you don't wish them to access folders and files that shouldn't be indexed by adding the following additional lines to the robots.txt file:
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-*
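
To check which paths such rules would close off to a compliant crawler, the rules can be fed to Python's standard urllib.robotparser module. This is only a sketch under a few assumptions: the Disallow lines sit directly under the same "User-agent: *" group with no blank line in between, "ExampleBot" and is_allowed are made-up names, and the "/wp-*" line is left out because this particular parser treats "*" in a path literally rather than as a wildcard.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
"""


def is_allowed(robots_text, agent, path):
    """Return True if agent may fetch path according to robots_text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, path)


# "ExampleBot" is a hypothetical crawler name.
for path in ("/about/", "/wp-admin/options.php", "/xmlrpc.php"):
    print(path, is_allowed(ROBOTS_TXT, "ExampleBot", path))
```

Keep in mind that robots.txt is only honoured by crawlers that choose to follow it, so a check like this says nothing about bots that ignore the file.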

Finally, note that Google does not obey the Crawl-delay directive; instead, it offers a crawl rate setting in its webmaster tools system. To access this system you will first need to sign up and activate it for your site with Google here:

http://www.google.com/webmasters/

You can then change the crawl rate for Google's web crawlers by following this article:

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=48620

Note: you will need to set the crawl rate again every 90 days, or it will revert to its default value.

We hope that you have found this information useful. As always, if you have any questions please don't hesitate to contact us!

About the Author

Tabaré is a systems administrator at Kualo. He's primarily responsible for making sure that our servers purr along, and has worked tirelessly to improve automation at Kualo.