
🤖 The Ultimate Guide to robots.txt

The robots.txt file is the gateway to your website for search engines and AI crawlers. It tells web robots (crawlers) which pages or files they can or cannot request from your site. This guide covers everything you need to know, from basic syntax to advanced AI-bot blocking.

1. What is a robots.txt file?

A robots.txt file is a simple text file that uses the Robots Exclusion Protocol (REP). It is primarily used to manage crawler traffic to your site and prevent your server from being overwhelmed by requests.

Important: It is not a mechanism for keeping a web page out of Google. To keep a page out of the index, you must use noindex tags or password-protect the page.

2. Where does the robots.txt belong?

The file must be placed in the top-level directory (root) of your website and must be named exactly robots.txt (all lowercase). For example: https://www.example.com/robots.txt.

3. Basic Syntax & Directives

A robots.txt file consists of one or more "groups" of rules. Each group starts with a User-agent line, followed by Disallow or Allow rules.
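For example, a minimal file with two groups (the directory names are illustrative) looks like this:

```
# Group 1: applies only to Googlebot
User-agent: Googlebot
Disallow: /drafts/

# Group 2: applies to every other crawler
User-agent: *
Disallow: /tmp/
```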

4. Common Code Examples

Scenario A: Allow everything (Default)

If you don't have a robots.txt, or if it is completely empty, bots assume they can crawl everything. You can also explicitly state this:

User-agent: *
Disallow:

(Note: the value after Disallow: is intentionally left empty, which means nothing is disallowed.)

Scenario B: Block the entire website

Used for staging environments or sites still in development.

User-agent: *
Disallow: /

Scenario C: Block a specific directory

Prevents crawling of internal or admin pages.

User-agent: *
Disallow: /admin/
Disallow: /internal-search/

Scenario D: Block a specific bot

Allow everyone, but block a specific crawler (e.g., a toxic SEO tool crawler).

User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow:
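You can sanity-check a rule set like this one with Python's standard-library urllib.robotparser; the example.com URLs below are placeholders:

```python
from urllib import robotparser

# The Scenario D rules as a list of lines
rules = """\
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# AhrefsBot is blocked everywhere; everyone else is allowed
print(rp.can_fetch("AhrefsBot", "https://example.com/page"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # True
```

Note that robotparser does the same user-agent group matching a real crawler would, so it is a quick way to catch typos before deploying a file.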

Scenario E: Allow a specific file inside a blocked directory

User-agent: *
Disallow: /images/
Allow: /images/logo.png

Scenario F: Adding a Sitemap

User-agent: *
Disallow: /private/
Sitemap: https://www.rankosaur.com/sitemap_index.xml

5. Advanced Pattern Matching

Googlebot and Bingbot support limited pattern matching for more complex rules using two wildcard characters: * (matches any sequence of characters) and $ (anchors the end of the URL). This is not full regular-expression (RegEx) support.

Block all URLs containing a specific parameter (e.g., internal search):

User-agent: *
Disallow: /*?search=

Block all files of a specific type (e.g., PDFs):

User-agent: *
Disallow: /*.pdf$

(This blocks document.pdf but allows document.pdf?version=2 unless you remove the $)
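This wildcard behavior can be approximated in Python by translating a pattern into a regular expression — a minimal sketch (the function name is illustrative, not a standard API):

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a regex anchored at the path start."""
    # Escape regex metacharacters, then restore the two robots.txt wildcards:
    # '*' matches any character sequence; a trailing '$' anchors the URL end.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

rule = robots_pattern_to_regex("/*.pdf$")
print(bool(rule.match("/files/document.pdf")))       # True: ends in .pdf
print(bool(rule.match("/files/document.pdf?v=2")))   # False: query string follows
```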

6. Controlling AI & LLM Bots (Important!)

With the rise of ChatGPT, Claude, and Google SGE, AI companies are aggressively scraping the web to train their Large Language Models (LLMs). You can block them while still allowing regular search engine indexing.

Note: Rank-O-Saur has a built-in feature to instantly visualize if these AI bots are blocked on the current page you are viewing!

Code snippet to block the most common AI/LLM scrapers:

# Block OpenAI (ChatGPT & training bots)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /

# Block Anthropic (Claude)
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /

# Block Google's Extended AI Training (Does NOT block Googlebot for regular search)
User-agent: Google-Extended
Disallow: /

# Block Common Crawl (Often used for open-source AI training like LLaMA)
User-agent: CCBot
Disallow: /

7. Crucial Rules & Best Practices

  1. Case Sensitivity: Directory names are case-sensitive. Disallow: /Admin/ will not block /admin/.
  2. Order of Precedence: For Googlebot, the most specific rule wins — that is, the Allow or Disallow rule with the longest matching URL path. If an Allow and a Disallow rule tie, the less restrictive (Allow) rule applies.
  3. Group Isolation: Rules for a specific User-agent only apply to that agent. If you have rules for User-agent: Googlebot and rules for User-agent: *, Googlebot will only obey the Googlebot block and ignore the * block.
  4. File Size Limit: Google currently enforces a 500 KiB size limit for robots.txt. Content beyond that limit is ignored.
  5. UTF-8 Encoding: The file must be UTF-8 encoded.
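Rule 3 (group isolation) is easy to verify with Python's standard-library urllib.robotparser; the paths below are placeholders:

```python
from urllib import robotparser

rules = """\
User-agent: Googlebot
Disallow: /beta/

User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot obeys only its own group, so /private/ is NOT blocked for it
print(rp.can_fetch("Googlebot", "https://example.com/private/page"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/beta/page"))     # False
# Other bots fall back to the * group
print(rp.can_fetch("Bingbot", "https://example.com/private/page"))    # False
```

If you want Googlebot to honor the * rules as well, you must copy those rules into the Googlebot group explicitly.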

8. robots.txt vs. noindex

This is the most common SEO misconception.

⚠️ Warning: If you block a URL in robots.txt AND give it a noindex tag, Google will never crawl the page, will therefore never see the noindex tag, and might keep the URL in the search index indefinitely! Never block a page in robots.txt if your goal is to de-index it.
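To actually de-index a page, put the directive on the page itself and leave the page crawlable, either as a meta tag in the HTML head:

```
<meta name="robots" content="noindex">
```

or, for non-HTML files such as PDFs, as the equivalent HTTP response header: X-Robots-Tag: noindex.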