The short version
robots.txt is a text file at the root of your website that tells web crawlers which pages they should and shouldn't access. It's defined in RFC 9309 and respected by search engines like Google and Bing.
Here's the critical thing to understand: it's a polite request, not a security mechanism. Crawlers can (and do) ignore it.
How it works
The file lives at yoursite.com/robots.txt. Here's a typical example:
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Sitemap: https://yoursite.com/sitemap.xml
- User-agent: Which crawler the rules apply to (* means all).
- Disallow: Paths the crawler should skip.
- Allow: Paths within a Disallowed directory that should be crawled.
- Sitemap: Tells crawlers where to find your sitemap.
Google's robots.txt documentation explains how Googlebot interprets these directives.
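You can check these rules programmatically. A quick sketch using Python's stdlib urllib.robotparser; one caveat: this parser applies the first matching rule, so the Allow line is listed before the broader Disallow below, whereas RFC 9309 (and major search engines) use the most specific match regardless of order.

```python
from urllib import robotparser

# Same rules as the example above, with Allow listed first because
# Python's robotparser applies the first matching rule.
rules = """\
User-agent: *
Allow: /api/public/
Disallow: /admin/
Disallow: /api/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://yoursite.com/admin/settings"))   # False
print(rp.can_fetch("*", "https://yoursite.com/api/public/docs"))  # True
print(rp.can_fetch("*", "https://yoursite.com/blog/post"))        # True
```

This is handy for sanity-checking your own robots.txt before deploying it, but remember that real crawlers each have their own parsing quirks.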
What robots.txt is NOT
It's not access control. Anyone can visit yoursite.com/admin/ in their browser; robots.txt doesn't stop them. It only asks well-behaved crawlers not to crawl those paths, and a disallowed page can still show up in search results if other sites link to it.
It's not private. Your robots.txt file is publicly accessible. Anyone can read it at yoursite.com/robots.txt. If you list /secret-admin-panel/ in your Disallow rules, you've just told every attacker exactly where to look.
It's not a replacement for authentication. If a page should be restricted, use authentication and authorization. Don't rely on robots.txt to keep it hidden.
The security angle
From a security perspective, robots.txt is interesting for two reasons:
What it reveals
Attackers routinely check robots.txt first when scanning a site. It's a roadmap of paths the site owner considers sensitive. Disallowed paths like /admin, /backup, /wp-admin, /phpmyadmin, or /api/internal tell attackers exactly where to focus.
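That reconnaissance step is trivial to automate. A minimal sketch of pulling Disallow paths out of a robots.txt body (the sample content here is made up):

```python
# Extract the Disallow paths from a robots.txt body -- exactly the
# "roadmap" an attacker reads first. Sample content is hypothetical.
sample = """\
User-agent: *
Disallow: /admin
Disallow: /backup
Disallow: /api/internal
"""

disallowed = [
    line.split(":", 1)[1].strip()
    for line in sample.splitlines()
    if line.lower().startswith("disallow:")
]
print(disallowed)  # ['/admin', '/backup', '/api/internal']
```

A handful of lines, and every "sensitive" path you listed is now a target list.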
What it doesn't protect
If you have a page at /admin that's only "protected" by a robots.txt Disallow rule, it's not protected at all. The page is accessible to anyone who knows the URL.
When robots.txt is useful
SEO management. Preventing search engines from indexing duplicate content, staging environments, paginated pages, or filtered views. This is the legitimate use case.
Crawl budget optimization. For large sites, robots.txt helps search engines focus on important pages instead of wasting time on irrelevant ones.
Blocking known bad bots. Some aggressive scrapers respect robots.txt. Blocking them by User-agent can reduce load. But determined scrapers will ignore it.
Best practices
- Use robots.txt for SEO, not security. Control what gets indexed, not what's accessible.
- Don't list sensitive paths. If you Disallow /secret-backup/, you've just advertised it.
- Use authentication for restricted content. Pages that shouldn't be public need login protection.
- Use noindex meta tags for finer control. If you want a page accessible but not indexed, use <meta name="robots" content="noindex"> instead.
- Include your sitemap URL. This helps search engines find and index your content efficiently.
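The same noindex signal can also be sent as the X-Robots-Tag HTTP response header, which works for non-HTML resources like PDFs. A minimal sketch of the routing logic, with made-up path prefixes:

```python
def robots_header(path, noindex_prefixes=("/search", "/filtered")):
    """Return an X-Robots-Tag header value for a path, or None.

    X-Robots-Tag: noindex is the HTTP-header equivalent of
    <meta name="robots" content="noindex">. The prefixes here are
    illustrative; adapt them to your own routes.
    """
    if any(path.startswith(prefix) for prefix in noindex_prefixes):
        return "noindex"
    return None

print(robots_header("/search?q=shoes"))  # noindex
print(robots_header("/blog/post"))       # None
```

Unlike a Disallow rule, this keeps the page reachable and doesn't advertise the path in a public file.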
A sensible robots.txt
User-agent: *
Disallow: /api/
Sitemap: https://yoursite.com/sitemap.xml
Short, simple, doesn't reveal anything sensitive. Your API endpoints don't need to be indexed, and the sitemap helps crawlers find your content.
Check your site
Want to know if your site has this issue? Scan it now and find out in 60 seconds.
Frequently Asked Questions
- Does robots.txt block access to pages?
- No. robots.txt is a suggestion, not an access control mechanism. Well-behaved crawlers respect it, but browsers, attackers, and malicious bots ignore it completely.
- Should I put sensitive paths in robots.txt?
- No. Listing paths in robots.txt tells everyone (including attackers) exactly where your sensitive pages are. It's like putting a 'Do Not Enter' sign on a door with no lock.
- Do I need a robots.txt file?
- For SEO purposes, yes. It helps search engines crawl your site efficiently. For security, it's irrelevant. Don't rely on it to hide anything.
Your AI writes the code. We find what it missed.
Paste your URL. Security audit in 60 seconds.
Scan my app