Guide to robots.txt: Optimising Web Crawling

A robots.txt file is one of the simplest tools for controlling how search engine bots interact with your website. Configuring it correctly keeps crawlers out of areas they have no reason to visit, conserves server resources, and guides them towards your most important content.

What is robots.txt?

A robots.txt file is a plain text file placed in the root directory of your website (for example, https://example.com/robots.txt). It tells web crawlers which parts of your site they may or may not access. It controls crawling rather than indexing: a blocked URL can still appear in search results if other pages link to it. Configuring it properly keeps crawlers away from irrelevant areas, which helps SEO, but it is not a security mechanism.

Locating or creating your robots.txt file

Check for an existing file: Visit your root domain followed by /robots.txt to see whether a file already exists.
Create a new file: If no file exists, create a plain text file named robots.txt and upload it to the root directory of your web server.
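
Because the file always sits at the root of the host, its address can be worked out from any page URL on the site. The following is a minimal Python sketch of that rule; the helper name robots_url is illustrative and not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the host that serves page_url.

    robots_url is a hypothetical helper name used only for illustration.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop the path, query
    # and fragment, because crawlers only look for the file at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=7"))
```

Opening the printed address in a browser shows whether a file is already being served for that host.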

Understanding robots.txt syntax

The three core directives are:

User-agent: - defines which crawler the rule applies to. User-agent: Googlebot targets only Google's crawler; User-agent: * applies to all crawlers.
Disallow: - lists the URLs or directories you want to block. For example, Disallow: /private/ prevents crawlers from accessing anything under /private/.
Allow: - explicitly permits crawling of a URL that falls within a disallowed directory, which is useful for complex site structures.
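
To see how the three directives interact, the rules can be loaded into Python's standard urllib.robotparser module and queried. This is a sketch under assumptions: the paths and the crawler name ExampleBot are made up for illustration. Python's parser applies the first rule that matches, whereas Google applies the most specific one, so the Allow line is placed before the Disallow line here to give the same answer under both approaches.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /private/ but permit one file inside it.
RULES = """\
User-agent: *
Allow: /private/press-kit.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Report whether the (hypothetical) agent may fetch the given path."""
    return parser.can_fetch(agent, "https://example.com" + path)


for path in ("/private/press-kit.html", "/private/notes.html", "/blog/"):
    print(path, "->", "allowed" if allowed(path) else "blocked")
```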

Common rules and practices

Keep crawlers out of administrative areas: Disallow paths such as /admin/ or /private/. This discourages well-behaved crawlers only; it does not secure those directories.
Enable efficient crawling: Allow access to important public directories, particularly those containing media files, to support SEO.
Use comments: Add comments with the # symbol to explain the purpose of each rule.

# Block access to the admin area
User-agent: *
Disallow: /admin/

Special guidelines for WordPress websites

WordPress sites benefit greatly from a customised robots.txt file, particularly to manage the visibility of plugin and theme directories.

WordPress automatically generates a virtual robots.txt when no physical file exists. By default it disallows /wp-admin/ (while allowing /wp-admin/admin-ajax.php), so it may not cover other non-essential areas.

Recommended WordPress configuration

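Block plugin and theme directories: Prevent crawling of /wp-content/plugins/ and /wp-content/themes/ to avoid exposing potentially sensitive files. These directories also hold the CSS and JavaScript that search engines use to render your pages, so check that blocking them does not affect how your pages appear in search.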
Allow media uploads: Keep the /wp-content/uploads/ directory crawlable to maintain content visibility.

User-agent: *
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/
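
Before uploading a configuration like this, it can be checked offline. The sketch below feeds the same rules to Python's urllib.robotparser and tests a few sample paths; the file names and the crawler name ExampleBot are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

WORDPRESS_RULES = """\
User-agent: *
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/
"""

parser = RobotFileParser()
parser.parse(WORDPRESS_RULES.splitlines())

# Hypothetical sample paths, one per rule plus an ordinary page.
SAMPLE_PATHS = [
    "/wp-content/plugins/some-plugin/readme.txt",
    "/wp-content/themes/some-theme/style.css",
    "/wp-content/uploads/2024/photo.jpg",
    "/about/",
]

results = {
    path: parser.can_fetch("ExampleBot", "https://example.com" + path)
    for path in SAMPLE_PATHS
}

for path, ok in results.items():
    print(("allowed " if ok else "blocked ") + path)
```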

Testing and validating your WordPress robots.txt

Use an SEO plugin: Plugins such as Yoast SEO let you edit and manage your robots.txt directly from the WordPress admin panel.
Google Search Console: Use the robots.txt report in Google Search Console, which replaced the older robots.txt Tester, to confirm that Google can fetch and parse your file. Use the URL Inspection tool to check whether a specific URL is blocked.

If you manage WordPress through WP Toolkit in cPanel, check whether your SEO plugin or WP Toolkit is generating a virtual robots.txt, as a physical file in the root directory will take precedence.

Advanced techniques

Handling specific crawlers

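You can write separate rule blocks for different bots. A crawler follows only the block that most specifically matches its name, so a bot with its own block ignores the * block. For example, if Google is your primary traffic source, you could disallow all crawlers except Googlebot.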

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
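
The effect of these two blocks can be confirmed with Python's urllib.robotparser. This is a small sketch; the page path /products/ is an assumption for illustration, and Bingbot stands in for any crawler without a block of its own.

```python
from urllib.robotparser import RobotFileParser

PER_BOT_RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(PER_BOT_RULES.splitlines())

PAGE = "https://example.com/products/"  # hypothetical page

# Googlebot matches its own block, where the empty Disallow permits everything.
googlebot_ok = parser.can_fetch("Googlebot", PAGE)
# Any other crawler falls back to the * block, which blocks the whole site.
bingbot_ok = parser.can_fetch("Bingbot", PAGE)

print("Googlebot allowed:", googlebot_ok)
print("Bingbot allowed:", bingbot_ok)
```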

Crawl-delay directive

The Crawl-delay directive asks a bot to wait a given number of seconds between requests, which can reduce server load during heavy traffic periods. It is a non-standard directive: Bing honours it, but Google ignores it. Google adjusts its crawl rate automatically, and the crawl rate setting that Search Console used to offer has been retired.

User-agent: Bingbot
Crawl-Delay: 10
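
A crawler that respects the directive reads the value and pauses between requests. The sketch below shows how the delay can be read with Python's urllib.robotparser (the crawl_delay method requires Python 3.6 or later); it only reads the value and does not crawl anything.

```python
from urllib.robotparser import RobotFileParser

DELAY_RULES = """\
User-agent: Bingbot
Crawl-Delay: 10
"""

parser = RobotFileParser()
parser.parse(DELAY_RULES.splitlines())

# The delay is reported in seconds for the matching crawler.
bing_delay = parser.crawl_delay("Bingbot")
# Crawlers with no matching block get None, meaning no delay was requested.
other_delay = parser.crawl_delay("Googlebot")

print("Bingbot delay:", bing_delay)
print("Googlebot delay:", other_delay)
```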

Dynamic robots.txt

For large sites with frequent changes, consider generating a dynamic robots.txt programmatically so that rules adapt automatically to different scenarios or promotional events.
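
One way to approach this is to build the file's text from the current state of the site and serve it at /robots.txt. The following is a minimal sketch, not a definitive implementation: the function name build_robots_txt, the environment values, and the blocked paths are all assumptions for illustration.

```python
from typing import Sequence


def build_robots_txt(environment: str, blocked_paths: Sequence[str]) -> str:
    """Return robots.txt text for the given (hypothetical) environment."""
    lines = ["User-agent: *"]
    if environment != "production":
        # Staging and test hosts should not be crawled at all.
        lines.append("Disallow: /")
    elif blocked_paths:
        # In production, block only the listed paths.
        lines.extend(f"Disallow: {path}" for path in blocked_paths)
    else:
        # An empty Disallow permits everything.
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


# Example: temporarily block a hypothetical promotion path in production.
print(build_robots_txt("production", ["/admin/", "/promo-preview/"]))
```

A web framework route or a scheduled job could call a function like this whenever the list of blocked paths changes.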

Do not rely on robots.txt alone to protect sensitive content. A determined crawler can ignore the file entirely. Use proper authentication and access controls for anything that must remain private.

A well-configured robots.txt file directs search engine traffic to the right parts of your website while conserving server resources. Tailoring it to your platform, whether WordPress or another, helps keep your site SEO-friendly; pair it with proper access controls for anything sensitive.
