Robots.txt Generator & Validator

Last updated: February 10, 2026 | Reading time: 8 min | Algorithm: Robots Exclusion Protocol pattern matching | Author: Thibault Besson-Magdelain

The Robots Exclusion Protocol (1994)

The Robots Exclusion Protocol was created in June 1994 by Martijn Koster as a response to a poorly behaved web crawler that was causing server overloads. At the time, the World Wide Web was still in its infancy, and there was no standardized way for website operators to communicate with automated crawlers about which parts of their sites should not be accessed. The proposed solution was simple: a plain text file called robots.txt, placed at the root of a website, containing directives that crawlers should follow.

The original specification was intentionally simple. It defined two directives: User-agent (specifying which crawler the rules apply to) and Disallow (specifying paths that should not be crawled). The wildcard asterisk (*) for User-agent means "all crawlers." Despite its simplicity, this protocol has remained remarkably stable for over three decades, forming the foundation of how billions of pages communicate their crawling preferences to thousands of different bots.
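In its original form, a complete robots.txt needed nothing more than those two directives. A minimal example (the /cgi-bin/ path is purely illustrative) looks like this:

    User-agent: *
    Disallow: /cgi-bin/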

In 2019, Google proposed formalizing the Robots Exclusion Protocol as an internet standard through the IETF (Internet Engineering Task Force), published as RFC 9309 in September 2022. This formalization clarified ambiguities in the original specification, defined handling of edge cases (like very large robots.txt files or encoding issues), and established the Allow directive as an official part of the standard. The RFC also specified how crawlers should handle temporary HTTP errors when fetching robots.txt, recommending that crawlers assume full allow on 4xx errors and full disallow on 5xx errors.

How Crawlers Interpret robots.txt

When a well-behaved crawler visits a website, its first action is to fetch the robots.txt file at the site's root URL. The crawler then parses the file to find rule groups that match its User-agent string. If a specific rule group matches the crawler's name, those rules are used. If no specific match exists, the crawler falls back to the wildcard (*) rules. If no robots.txt file exists or the server returns a 404, the crawler assumes it may crawl everything.
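Python's standard library ships a parser for exactly this protocol, which makes the fetch-then-check flow easy to see. The sketch below uses urllib.robotparser with example.com and a made-up crawler name as placeholders; note that this built-in parser implements the basic protocol only, not Google's wildcard extensions:

    # Sketch of the crawler-side check using Python's built-in robots.txt parser.
    # The domain and user-agent names here are placeholders.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # fetches and parses the file; a 404 is treated as "allow everything"

    # A group matching the crawler's own name wins; otherwise the * group applies.
    print(parser.can_fetch("MyCrawler", "https://example.com/private/report.html"))
    print(parser.can_fetch("MyCrawler", "https://example.com/blog/post.html"))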

Rule matching follows a longest-match-wins principle for most crawlers. If a robots.txt contains both Disallow: /images/ and Allow: /images/logo.png, the longer, more specific Allow rule takes precedence for the logo.png path. Google specifically follows this longest-match behavior, evaluating all matching rules and applying the most specific one. This enables fine-grained control where you can block an entire directory but allow specific files within it.
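Written out as a rule group, that example looks like the excerpt below; under longest-match evaluation the Allow line wins for /images/logo.png because its path is more specific, while every other URL under /images/ stays blocked:

    User-agent: *
    Disallow: /images/
    Allow: /images/logo.png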

Path matching in robots.txt is case-sensitive and supports pattern matching with wildcards. The asterisk (*) matches any sequence of characters within a path, while the dollar sign ($) matches the end of the URL. For example, Disallow: /*.pdf$ blocks all URLs ending in .pdf regardless of their directory path, while Disallow: /private*/ blocks any directory starting with "private." These patterns provide powerful control without requiring individual rules for every possible URL.
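One way to reason about these patterns is to translate them into regular expressions, which is close to how many parsers handle them internally. The sketch below is only an approximation of that idea, not the exact algorithm any particular crawler uses; it escapes literal characters, turns * into "match anything," and treats a trailing $ as an end-of-URL anchor:

    # Rough illustration of robots.txt pattern matching; not a production parser.
    import re

    def pattern_to_regex(pattern: str) -> re.Pattern:
        # Escape literal characters, then restore * as a wildcard and a trailing $ as an anchor.
        escaped = re.escape(pattern).replace(r"\*", ".*")
        if escaped.endswith(r"\$"):
            escaped = escaped[:-2] + "$"
        return re.compile(escaped)

    # Patterns match as prefixes unless anchored with $.
    print(bool(pattern_to_regex("/*.pdf$").match("/docs/manual.pdf")))           # True
    print(bool(pattern_to_regex("/*.pdf$").match("/docs/manual.pdf?v=2")))       # False
    print(bool(pattern_to_regex("/private*/").match("/private-files/a.html")))   # True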

Google-Specific Behaviors

Google's implementation of the Robots Exclusion Protocol includes several behaviors that website operators should understand. As documented in Google's robots.txt specification, Googlebot follows the RFC 9309 standard but also has some practical behaviors worth noting.

Google caches robots.txt for up to 24 hours (adjusted by Cache-Control max-age headers when present), meaning changes to your robots.txt may take up to a day to take effect. Google also imposes a size limit of 500 kibibytes (512,000 bytes) on robots.txt files: content beyond this limit is ignored, so only the rules that appear within the first 500 KiB are honored, and paths blocked solely by later rules are effectively allowed. For most websites this is not a concern, but sites with very complex rule sets should be aware of this limitation.
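A quick way to check whether your file is anywhere near that ceiling is to fetch it and compare its length in bytes against the limit. The snippet below is a simple sketch of that check; example.com is a placeholder for your own domain:

    # Sanity check against Google's documented 500 KiB robots.txt size limit.
    # Replace example.com with your own domain.
    from urllib.request import urlopen

    GOOGLE_LIMIT_BYTES = 512_000  # 500 KiB

    with urlopen("https://example.com/robots.txt") as response:
        body = response.read()

    print(f"robots.txt size: {len(body)} bytes")
    if len(body) > GOOGLE_LIMIT_BYTES:
        print("Warning: rules beyond the first 500 KiB may be ignored by Googlebot.")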

Google uses multiple crawlers and product tokens with different User-agent names: Googlebot (web search), Googlebot-Image (image search), Googlebot-Video (video search), Googlebot-News (Google News), Google-Extended (a standalone token that controls whether content is used for AI training), Storebot-Google (shopping), and GoogleOther (generic crawling). You can set different rules for each. Notably, Google-Extended was introduced in 2023 specifically to allow website operators to opt out of their content being used for training Google's AI models while still appearing in search results.
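Per-crawler groups are written exactly like the wildcard group, just with a more specific User-agent line; an empty Disallow value means the matching crawler may fetch everything. The excerpt below (paths are illustrative) leaves web search unrestricted, keeps a media directory out of image search, and uses the Google-Extended token to opt out of AI training:

    User-agent: Googlebot
    Disallow:

    User-agent: Googlebot-Image
    Disallow: /media/

    User-agent: Google-Extended
    Disallow: /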

AI Crawler Blocking Controversy

The rise of large language models (LLMs) has created intense debate around web crawling for AI training data. Companies like OpenAI (GPTBot), Anthropic (anthropic-ai), Common Crawl (CCBot), Apple (Applebot), and others operate crawlers that collect web content for various AI-related purposes. Many website operators feel that their content should not be used to train commercial AI models without consent or compensation, leading to a wave of robots.txt updates blocking these crawlers.

As of 2025, studies show that over 25% of the top 1,000 websites block at least one AI crawler through robots.txt. Major publishers including The New York Times, The Washington Post, and Reuters have blocked GPTBot and similar crawlers. However, the effectiveness of this approach is debated. Robots.txt is a voluntary protocol, and while major AI companies claim to respect it, there is limited ability to verify compliance. Furthermore, content may have already been collected before blocking was implemented.

The controversy has spurred discussions about a new generation of machine-readable permissions for AI, beyond what robots.txt was designed to handle. Proposals include TDMRep (Text and Data Mining Reservation Protocol), which allows publishers to express permissions specifically for text and data mining activities. Until a new standard emerges, robots.txt remains the primary mechanism for communicating crawling preferences to AI bots. Our generator includes all major AI crawler user-agents so you can easily block or allow them based on your preferences.

Crawl Budget Optimization

Crawl budget refers to the number of pages that search engine crawlers will crawl on your site within a given timeframe. For small to medium websites (under 10,000 pages), crawl budget is rarely a concern because search engines can easily crawl the entire site. For large sites with hundreds of thousands or millions of pages, crawl budget becomes a critical optimization factor.

Robots.txt plays a direct role in crawl budget optimization by preventing crawlers from wasting time on low-value pages. Common targets for blocking include internal search result pages, URL parameter variations (sort, filter, session IDs), print versions of pages, admin and staging areas, and duplicate content paths. By blocking these low-value URLs, you ensure that crawlers spend their limited budget on your most important content pages.
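In practice this amounts to a short list of Disallow lines. The excerpt below is one illustrative pattern set; the actual paths and parameter names depend entirely on how your site builds its URLs:

    User-agent: *
    Disallow: /search/
    Disallow: /*?sort=
    Disallow: /*?sessionid=
    Disallow: /print/
    Disallow: /admin/
    Disallow: /staging/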

Other factors affecting crawl budget include server response time (faster servers get crawled more), content freshness (frequently updated content gets crawled more often), internal linking (well-linked pages get discovered faster), and XML sitemaps (which help crawlers discover all important URLs). Using robots.txt in conjunction with these other signals creates an efficient crawling environment where search engines find and index your valuable content quickly.

Robots.txt vs. Noindex

One of the most common mistakes in technical SEO is using robots.txt to try to prevent a page from appearing in search results. Robots.txt prevents crawling, not indexing. If other websites link to a URL that you have blocked in robots.txt, search engines may still index that URL based on the anchor text and context from the linking pages. The indexed result will appear without a description (since the crawler never fetched the page content) and with a message indicating that no information is available.

To prevent a page from appearing in search results, use the meta robots noindex tag (<meta name="robots" content="noindex">) or the X-Robots-Tag HTTP header (X-Robots-Tag: noindex). Critically, the page must be crawlable for the search engine to see the noindex directive. If you block a page in robots.txt AND add a noindex tag, the search engine cannot crawl the page to discover the noindex tag, potentially leaving the URL indexed indefinitely.

The correct approach depends on your goal. If you want to conserve crawl budget by preventing access to low-value pages that have no external links, use robots.txt. If you want to ensure a page never appears in search results (like a thank-you page, internal dashboard, or staging content), use meta robots noindex while keeping the page crawlable. For sensitive content that should not be publicly accessible at all, use proper authentication rather than relying on either robots.txt or noindex.

Frequently Asked Questions

What is a robots.txt file?

A robots.txt file is a plain text file placed at the root of a website that tells web crawlers which pages or sections they are allowed or not allowed to crawl. It follows the Robots Exclusion Protocol standard originally established in 1994 and formalized as RFC 9309 in 2022. Every major search engine and responsible web crawler respects robots.txt directives.

Where should robots.txt be placed?

The robots.txt file must be placed at the root of your domain, accessible at yourdomain.com/robots.txt. It only applies to the specific protocol, host, and port where it is placed. Subdomains (such as blog.example.com) need their own separate robots.txt files. The file must be plain text with UTF-8 encoding.

Does robots.txt block indexing?

No. Robots.txt blocks crawling, not indexing. If other pages link to a URL blocked in robots.txt, search engines may still index it without a snippet. To prevent indexing, use the meta robots noindex tag instead, and ensure the page remains crawlable so search engines can read the noindex directive.

How do I block AI crawlers like GPTBot?

Add "User-agent: GPTBot" followed by "Disallow: /" to your robots.txt. For comprehensive AI blocking, also add rules for CCBot, anthropic-ai, Google-Extended, ClaudeBot, and other AI training crawlers. Our Block AI Crawlers template adds all major AI crawlers at once.

What is crawl budget and why does it matter?

Crawl budget is the number of pages a search engine will crawl on your site within a given timeframe. For large sites (100,000+ pages), robots.txt helps optimize crawl budget by blocking low-value pages like internal search results, parameter variations, and duplicate content paths.

Can robots.txt use wildcards?

Yes. Google and most major crawlers support the asterisk (*) as a wildcard matching any sequence of characters, and the dollar sign ($) matching the end of the URL. For example, "Disallow: /*.pdf$" blocks all PDF files regardless of their directory location.

What is the difference between robots.txt and meta robots noindex?

Robots.txt prevents crawling (the crawler never fetches the page), while meta robots noindex prevents indexing (the page is fetched but not added to the search index). Use robots.txt for crawl budget management and noindex for preventing pages from appearing in search results. Never combine both for the same page, as the crawler cannot see the noindex tag if the page is blocked.

Thibault Besson-Magdelain
SEO specialist and web developer building free technical SEO tools.

External resources: Google Robots.txt Specifications | RFC 9309 - Robots Exclusion Protocol
