🤖 The Ultimate Guide to robots.txt
The robots.txt file is the gateway to your website for search engines and AI crawlers. It tells web robots (crawlers) which pages or files they can or cannot request from your site. This guide covers everything you need to know, from basic syntax to advanced AI-bot blocking.
📑 Table of Contents
1. What is a robots.txt file?
2. Where does the robots.txt belong?
3. Basic Syntax & Directives
4. Common Code Examples
5. Advanced Pattern Matching
6. Controlling AI & LLM Bots (Important!)
7. Crucial Rules & Best Practices
8. robots.txt vs. noindex
1. What is a robots.txt file?
A robots.txt file is a simple text file that uses the Robots Exclusion Protocol (REP). It is primarily used to manage crawler traffic to your site and prevent your server from being overwhelmed by requests.
Important: It is not a mechanism for keeping a web page out of Google. To keep a page out of the index, you must use noindex tags or password-protect the page.
2. Where does the robots.txt belong?
The file must be placed in the top-level directory (root) of your website and must be named exactly robots.txt (all lowercase).
- ✅ Correct: https://www.rankosaur.com/robots.txt
- ❌ Incorrect: https://www.rankosaur.com/assets/robots.txt
- ❌ Incorrect: https://www.rankosaur.com/Robots.TXT
3. Basic Syntax & Directives
A robots.txt file consists of one or more "groups" of rules. Each group starts with a User-agent line, followed by Disallow or Allow rules.
- User-agent: Identifies the specific bot the rule applies to (e.g., Googlebot, Bingbot). An asterisk (*) targets all bots.
- Disallow: Tells the user-agent not to crawl a specific URL path or directory.
- Allow: Tells the user-agent it can crawl a specific URL or directory. This is often used to override a broader Disallow rule.
- Sitemap: Points crawlers to your XML sitemap. This does not need to be attached to a specific User-agent.
- Crawl-delay: Tells non-Google bots how many seconds to wait between requests (Googlebot ignores this; you must use Google Search Console for Google's crawl rate).
- # (Comments): Anything after a hash is ignored by crawlers.
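To see these directives in action, Python's standard library ships a parser for the Robots Exclusion Protocol. The rules below are purely hypothetical; note that Python's parser applies rules in file order (first match wins), so the narrower Allow line is placed before the broader Disallow:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/secret.html"))     # False
print(rp.can_fetch("*", "https://example.com/admin/public/faq.html")) # True
print(rp.can_fetch("*", "https://example.com/blog/post.html"))        # True
```

Be aware that this built-in parser implements only the basic protocol, not Google's longest-match precedence or wildcard extensions.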
4. Common Code Examples
Scenario A: Allow everything (Default)
If you don't have a robots.txt, or if it is completely empty, bots assume they can crawl
everything. You can also explicitly state this:
User-agent: *
Disallow:
(Notice the empty value after Disallow:)
Scenario B: Block the entire website
Used for staging environments or sites still in development.
User-agent: *
Disallow: /
Scenario C: Block a specific directory
Prevents crawling of internal or admin pages.
User-agent: *
Disallow: /admin/
Disallow: /internal-search/
Scenario D: Block a specific bot
Allow everyone, but block a specific crawler (e.g., a toxic SEO tool crawler).
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow:
Scenario E: Allow a specific file inside a blocked directory
User-agent: *
Disallow: /images/
Allow: /images/logo.png
Scenario F: Adding a Sitemap
User-agent: *
Disallow: /private/

Sitemap: https://www.rankosaur.com/sitemap_index.xml
5. Advanced Pattern Matching
Googlebot and Bingbot support a limited form of pattern matching for more complex rules. Note that this is not full regular-expression syntax; only two special characters are recognized:
- * (Wildcard): Represents any sequence of characters.
- $ (End of URL): Indicates the exact end of a URL string.
Block all URLs containing a specific parameter (e.g., internal search):
User-agent: *
Disallow: /*?search=
Block all files of a specific type (e.g., PDFs):
User-agent: *
Disallow: /*.pdf$
(This blocks document.pdf but allows document.pdf?version=2 unless you remove the $)
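Since Python's built-in robots.txt parser does not understand these wildcard extensions, here is a minimal sketch of how a crawler might evaluate them by translating a pattern into a regular expression. The function name and approach are illustrative, not any official implementation:

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    """Check a URL path against a robots.txt pattern supporting * and $.
    Illustrative sketch of Google-style matching, not an official library."""
    anchored = pattern.endswith("$")   # a trailing $ pins the match to the URL's end
    if anchored:
        pattern = pattern[:-1]
    # Escape every character except *, which becomes "match anything"
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    # Patterns match from the start of the path: prefix match unless anchored
    return re.match(regex + ("$" if anchored else ""), path) is not None

print(robots_match("/*.pdf$", "/files/report.pdf"))            # True: blocked
print(robots_match("/*.pdf$", "/files/report.pdf?version=2"))  # False: $ prevents match
print(robots_match("/*?search=", "/products?search=shoes"))    # True: blocked
```

This mirrors the behavior described above: the $ anchor excludes URLs with trailing query strings, while the * wildcard spans any characters in between.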
6. Controlling AI & LLM Bots (Important!)
With the rise of ChatGPT, Claude, and Google SGE, AI companies are aggressively scraping the web to train their Large Language Models (LLMs). You can block them while still allowing regular search engine indexing.
Note: Rank-O-Saur has a built-in feature to instantly visualize if these AI bots are blocked on the current page you are viewing!
Code snippet to block the most common AI/LLM scrapers:
# Block OpenAI (ChatGPT & training bots)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Block Anthropic (Claude)
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
# Block Google's Extended AI Training (Does NOT block Googlebot for regular search)
User-agent: Google-Extended
Disallow: /
# Block Common Crawl (Often used for open-source AI training like LLaMA)
User-agent: CCBot
Disallow: /
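One way to sanity-check a block list like the one above (a shortened, hypothetical copy is used here) is Python's standard-library parser. It also illustrates that a bot with no matching group is simply allowed when no * group exists:

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the AI-blocking rules, for illustration
ai_rules = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ai_rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False: blocked
print(rp.can_fetch("CCBot", "https://example.com/article"))      # False: blocked
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True: no group applies
```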
7. Crucial Rules & Best Practices
- Case Sensitivity: Directory names are case-sensitive. Disallow: /Admin/ will not block /admin/.
- Order of Precedence: For Googlebot, the most specific Allow or Disallow rule usually wins, based on the length of the matching URL path.
- Group Isolation: Rules for a specific User-agent only apply to that agent. If you have rules for User-agent: Googlebot and rules for User-agent: *, Googlebot will only obey the Googlebot block and ignore the * block.
- File Size Limit: Google currently enforces a 500 KB size limit for robots.txt. Anything beyond that is ignored.
- UTF-8 Encoding: The file must be UTF-8 encoded.
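The longest-match precedence described above can be sketched in a few lines of Python (plain prefix matching only, wildcards omitted; the function and rule format are illustrative, not a real crawler's code):

```python
def decide(rules, path):
    """Pick the winning rule Google-style: the matching pattern with the
    longest path wins, and Allow beats Disallow on a tie. `rules` is a
    list of (directive, pattern) tuples; plain prefix matching only."""
    matches = [(len(p), d == "allow", d) for d, p in rules if path.startswith(p)]
    if not matches:
        return "allow"          # no rule matched: crawling is permitted
    matches.sort()              # longest pattern (then Allow) sorts last
    return matches[-1][2]

rules = [("disallow", "/images/"), ("allow", "/images/logo.png")]
print(decide(rules, "/images/logo.png"))   # allow: the longer Allow rule wins
print(decide(rules, "/images/photo.jpg"))  # disallow
print(decide(rules, "/blog/post"))         # allow: nothing matches
```

This is exactly why Scenario E in Section 4 works: the Allow rule's path is longer than the Disallow rule's, so it takes precedence for the logo file.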
8. robots.txt vs. noindex
This is the most common SEO misconception.
- robots.txt (Disallow): Stops a bot from crawling the page. However, if the page is linked from somewhere else, Google might still index it (showing only the URL in search results, without a description).
- noindex (Meta Tag or HTTP Header): Tells the bot not to index the page. Crucial: For Google to see the noindex tag, it MUST be able to crawl the page.
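For reference, the meta-tag form of noindex goes in the page's head:

```html
<!-- In the <head> of the page you want removed from the index -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the same directive can be sent as the HTTP response header X-Robots-Tag: noindex.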
⚠️ Warning: If you put a URL in the robots.txt AND give it a noindex tag, Google will never crawl it, will never see the noindex tag, and might keep it in the search index indefinitely! Never block a page in robots.txt if your goal is to de-index it.