🤖 The Ultimate Guide to robots.txt
The robots.txt file is the gateway to your website for search engines and AI crawlers. It tells web robots (crawlers) which pages or files they can or cannot request from your site. This guide covers everything you need to know, from basic syntax to advanced AI-bot blocking.
📑 Table of Contents
1. What is a robots.txt file?
2. Where does the robots.txt belong?
3. Basic Syntax & Directives
4. Common Code Examples
5. Advanced Pattern Matching
6. Controlling AI & LLM Bots (Important!)
7. Crucial Rules & Best Practices
8. robots.txt vs. noindex
1. What is a robots.txt file?
A robots.txt file is a simple text file that uses the Robots Exclusion Protocol (REP). It is primarily used to manage crawler traffic to your site and prevent your server from being overwhelmed by requests.
Important: It is not a mechanism for keeping a web page out of Google. To keep a page out of the index, you must use noindex tags or password-protect the page.
2. Where does the robots.txt belong?
The file must be placed in the top-level directory (root) of your website and must be named exactly robots.txt (all lowercase).
- ✅ Correct: https://www.rankosaur.com/robots.txt
- ❌ Incorrect: https://www.rankosaur.com/assets/robots.txt
- ❌ Incorrect: https://www.rankosaur.com/Robots.TXT
3. Basic Syntax & Directives
A robots.txt file consists of one or more "groups" of rules. Each group starts with a User-agent line, followed by Disallow or Allow rules.
- User-agent: Identifies the specific bot the rule applies to (e.g., Googlebot, Bingbot). An asterisk (*) targets all bots.
- Disallow: Tells the user-agent not to crawl a specific URL path or directory.
- Allow: Tells the user-agent it can crawl a specific URL or directory. This is often used to override a broader Disallow rule.
- Sitemap: Points crawlers to your XML sitemap. This does not need to be attached to a specific User-agent.
- Crawl-delay: Tells non-Google bots how many seconds to wait between requests (Googlebot ignores this; you must use Google Search Console for Google's crawl rate).
- # (Comments): Anything after a hash is ignored by crawlers.
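To see these directives in action, Python's standard library ships a parser for the Robots Exclusion Protocol. The rules below are purely hypothetical; note that Python's parser applies rules in file order (first match wins), so the narrower Allow line is placed before the broader Disallow:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/secret.html"))     # False
print(rp.can_fetch("*", "https://example.com/admin/public/faq.html")) # True
print(rp.can_fetch("*", "https://example.com/blog/post.html"))        # True
```

Be aware that this built-in parser implements only the basic protocol, not Google's longest-match precedence or wildcard extensions.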
4. Common Code Examples
Scenario A: Allow everything (Default)
If you don't have a robots.txt, or if it is completely empty, bots assume they can crawl
everything. You can also explicitly state this:
User-agent: *
Disallow:
(Notice the empty value after Disallow:)
Scenario B: Block the entire website
Used for staging environments or sites still in development.
User-agent: *
Disallow: /
Scenario C: Block a specific directory
Prevents crawling of internal or admin pages.
User-agent: *
Disallow: /admin/
Disallow: /internal-search/
Scenario D: Block a specific bot
Allow everyone, but block a specific crawler (e.g., a toxic SEO tool crawler).
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow:
Scenario E: Allow a specific file inside a blocked directory
User-agent: *
Disallow: /images/
Allow: /images/logo.png
Scenario F: Adding a Sitemap
User-agent: *
Disallow: /private/

Sitemap: https://www.rankosaur.com/sitemap_index.xml
5. Advanced Pattern Matching
Googlebot and Bingbot support a limited form of pattern matching for more complex rules. Note that this is not full regular-expression syntax; only two special characters are recognized:
- * (Wildcard): Represents any sequence of characters.
- $ (End of URL): Indicates the exact end of a URL string.
Block all URLs containing a specific parameter (e.g., internal search):
User-agent: *
Disallow: /*?search=
Block all files of a specific type (e.g., PDFs):
User-agent: *
Disallow: /*.pdf$
(This blocks document.pdf but allows document.pdf?version=2 unless you remove the $)
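Since Python's built-in robots.txt parser does not understand these wildcard extensions, here is a minimal sketch of how a crawler might evaluate them by translating a pattern into a regular expression. The function name and approach are illustrative, not any official implementation:

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    """Check a URL path against a robots.txt pattern supporting * and $.
    Illustrative sketch of Google-style matching, not an official library."""
    anchored = pattern.endswith("$")   # a trailing $ pins the match to the URL's end
    if anchored:
        pattern = pattern[:-1]
    # Escape every character except *, which becomes "match anything"
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    # Patterns match from the start of the path: prefix match unless anchored
    return re.match(regex + ("$" if anchored else ""), path) is not None

print(robots_match("/*.pdf$", "/files/report.pdf"))            # True: blocked
print(robots_match("/*.pdf$", "/files/report.pdf?version=2"))  # False: $ prevents match
print(robots_match("/*?search=", "/products?search=shoes"))    # True: blocked
```

This mirrors the behavior described above: the $ anchor excludes URLs with trailing query strings, while the * wildcard spans any characters in between.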
6. Controlling AI & LLM Bots (Important!)
With the rise of ChatGPT, Claude, and Google SGE, AI companies are aggressively scraping the web to train their Large Language Models (LLMs). You can block them while still allowing regular search engine indexing.
Note: Rank-O-Saur has a built-in feature to instantly visualize if these AI bots are blocked on the current page you are viewing!
Code snippet to block the most common AI/LLM scrapers:
# Block OpenAI (ChatGPT & training bots)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Block Anthropic (Claude)
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
# Block Google's Extended AI Training (Does NOT block Googlebot for regular search)
User-agent: Google-Extended
Disallow: /
# Block Common Crawl (Often used for open-source AI training like LLaMA)
User-agent: CCBot
Disallow: /
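One way to sanity-check a block list like the one above (a shortened, hypothetical copy is used here) is Python's standard-library parser. It also illustrates that a bot with no matching group is simply allowed when no * group exists:

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the AI-blocking rules, for illustration
ai_rules = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ai_rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False: blocked
print(rp.can_fetch("CCBot", "https://example.com/article"))      # False: blocked
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True: no group applies
```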
7. Crucial Rules & Best Practices
- Case Sensitivity: Directory names are case-sensitive. Disallow: /Admin/ will not block /admin/.
- Order of Precedence: For Googlebot, the most specific Allow or Disallow rule usually wins, based on the length of the matching URL path.
- Group Isolation: Rules for a specific User-agent only apply to that agent. If you have rules for User-agent: Googlebot and rules for User-agent: *, Googlebot will only obey the Googlebot block and ignore the * block.
- File Size Limit: Google currently enforces a 500 KB size limit for robots.txt. Anything beyond that is ignored.
- UTF-8 Encoding: The file must be UTF-8 encoded.
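The longest-match precedence described above can be sketched in a few lines of Python (plain prefix matching only, wildcards omitted; the function and rule format are illustrative, not a real crawler's code):

```python
def decide(rules, path):
    """Pick the winning rule Google-style: the matching pattern with the
    longest path wins, and Allow beats Disallow on a tie. `rules` is a
    list of (directive, pattern) tuples; plain prefix matching only."""
    matches = [(len(p), d == "allow", d) for d, p in rules if path.startswith(p)]
    if not matches:
        return "allow"          # no rule matched: crawling is permitted
    matches.sort()              # longest pattern (then Allow) sorts last
    return matches[-1][2]

rules = [("disallow", "/images/"), ("allow", "/images/logo.png")]
print(decide(rules, "/images/logo.png"))   # allow: the longer Allow rule wins
print(decide(rules, "/images/photo.jpg"))  # disallow
print(decide(rules, "/blog/post"))         # allow: nothing matches
```

This is exactly why Scenario E in Section 4 works: the Allow rule's path is longer than the Disallow rule's, so it takes precedence for the logo file.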
8. robots.txt vs. noindex
This is the most common SEO misconception.
- robots.txt (Disallow): Stops a bot from crawling the page. However, if the page is linked from somewhere else, Google might still index it (showing only the URL in search results, without a description).
- noindex (Meta Tag or HTTP Header): Tells the bot not to index the page. Crucial: For Google to see the noindex tag, it MUST be able to crawl the page.
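For reference, the meta-tag form of noindex goes in the page's head:

```html
<!-- In the <head> of the page you want removed from the index -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the same directive can be sent as the HTTP response header X-Robots-Tag: noindex.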
⚠️ Warning: If you put a URL in the robots.txt AND give it a noindex tag, Google will never crawl it, will never see the noindex tag, and might keep it in the search index indefinitely! Never block a page in robots.txt if your goal is to de-index it.