
robots.txt: What It Does (and Doesn't Do) for Security

Benji · 3 min read

The short version

robots.txt is a text file at the root of your website that tells web crawlers which pages they should and shouldn't access. It's defined in RFC 9309 and respected by search engines like Google and Bing.

Here's the critical thing to understand: it's a polite request, not a security mechanism. Crawlers can (and do) ignore it.

How it works

The file lives at yoursite.com/robots.txt. Here's a typical example:

User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Sitemap: https://yoursite.com/sitemap.xml

Google's robots.txt documentation explains how Googlebot interprets these directives.
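You can check how a compliant parser reads rules like these with Python's standard library. A minimal sketch using the example rules above (the URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The example rules above, as a parser would receive them
# (normally fetched from https://yoursite.com/robots.txt).
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A compliant crawler skips /admin/ but fetches ordinary pages.
print(parser.can_fetch("*", "https://yoursite.com/admin/settings"))  # False
print(parser.can_fetch("*", "https://yoursite.com/blog/post"))       # True
```

One caveat: RFC 9309 resolves Allow/Disallow conflicts by longest matching path, while Python's parser appears to apply rules in file order, so results for overlapping rules like /api/public/ can differ between the two.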

What robots.txt is NOT

It's not access control. Anyone can visit yoursite.com/admin/ in their browser; robots.txt doesn't stop them. It only asks crawlers not to fetch those pages. A disallowed page can even appear in search results if other sites link to it, because the crawler never fetched it and never saw a noindex directive.

It's not private. Your robots.txt file is publicly accessible. Anyone can read it at yoursite.com/robots.txt. If you list /secret-admin-panel/ in your Disallow rules, you've just told every attacker exactly where to look.

It's not a replacement for authentication. If a page should be restricted, use authentication and authorization. Don't rely on robots.txt to keep it hidden.
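To make the distinction concrete, here is a framework-agnostic sketch of what real protection looks like. The `handle_request` function and the `session` shape are hypothetical; the point is that the server itself refuses unauthenticated requests, and robots.txt plays no part in the decision:

```python
def handle_request(path: str, session: dict) -> tuple[int, str]:
    """Hypothetical request handler: access is enforced server-side."""
    # Real protection: credentials are checked on every request.
    if path.startswith("/admin/") and not session.get("authenticated"):
        return 401, "Unauthorized"
    return 200, "OK"

# An anonymous visitor typing /admin/ into a browser is refused...
print(handle_request("/admin/", {}))                       # (401, 'Unauthorized')
# ...while an authenticated user gets through.
print(handle_request("/admin/", {"authenticated": True}))  # (200, 'OK')
```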

The security angle

From a security perspective, robots.txt is interesting for two reasons:

What it reveals

Attackers routinely check robots.txt first when scanning a site. It's a roadmap of paths the site owner considers sensitive. Disallowed paths like /admin, /backup, /wp-admin, /phpmyadmin, or /api/internal tell attackers exactly where to focus.
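Harvesting that roadmap takes one HTTP request and a few lines of code. A sketch of the kind of parsing a scanner does, run here against a sample file rather than a live fetch:

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every Disallow path from a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        # Strip inline comments, then match the directive case-insensitively.
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # a bare "Disallow:" means "allow everything"
                paths.append(path)
    return paths

sample = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /api/internal/
"""
print(disallowed_paths(sample))  # ['/admin/', '/backup/', '/api/internal/']
```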

What it doesn't protect

If you have a page at /admin that's only "protected" by a robots.txt Disallow rule, it's not protected at all. The page is accessible to anyone who knows the URL.

When robots.txt is useful

SEO management. Keeping search engines from crawling duplicate content, staging environments, paginated pages, or filtered views. This is the legitimate use case.

Crawl budget optimization. For large sites, robots.txt helps search engines focus on important pages instead of wasting time on irrelevant ones.

Blocking known bad bots. Some aggressive scrapers respect robots.txt. Blocking them by User-agent can reduce load. But determined scrapers will ignore it.
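A per-bot block looks like this (the bot name is illustrative):

User-agent: BadBot
Disallow: /

User-agent: *
Allow: /

The first group asks "BadBot" to stay away entirely; everyone else is unaffected. For scrapers that ignore robots.txt, you'd need rate limiting or a firewall rule instead.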

Best practices

  1. Use robots.txt for SEO, not security. Control what gets indexed, not what's accessible.
  2. Don't list sensitive paths. If you Disallow /secret-backup/, you've just advertised it.
  3. Use authentication for restricted content. Pages that shouldn't be public need login protection.
  4. Use noindex meta tags for finer control. If you want a page accessible but not indexed, use <meta name="robots" content="noindex"> instead. Note that the crawler has to fetch the page to see the tag, so don't also Disallow it in robots.txt.
  5. Include your sitemap URL. This helps search engines find and index your content efficiently.
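Point 4 can also be enforced at the HTTP layer with the X-Robots-Tag response header, which works for non-HTML resources like PDFs. A minimal sketch; the /drafts/ path and the header-building helper are hypothetical:

```python
def response_headers(path: str) -> dict:
    """Build response headers, adding a noindex hint for draft pages."""
    headers = {"Content-Type": "text/html"}
    # The page stays reachable, but engines are asked not to index it.
    # This only works if the crawler is allowed to fetch the page,
    # so don't also Disallow it in robots.txt.
    if path.startswith("/drafts/"):
        headers["X-Robots-Tag"] = "noindex"
    return headers

print(response_headers("/drafts/notes"))  # includes 'X-Robots-Tag': 'noindex'
print(response_headers("/blog/post"))     # no X-Robots-Tag header
```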

A sensible robots.txt

User-agent: *
Disallow: /api/
Sitemap: https://yoursite.com/sitemap.xml

Short, simple, doesn't reveal anything sensitive. Your API endpoints don't need to be indexed, and the sitemap helps crawlers find your content.

Check your site

Want to know if your site has this issue? Scan it now and find out in 60 seconds.

Frequently Asked Questions

Does robots.txt block access to pages?
No. robots.txt is a suggestion, not an access control mechanism. Well-behaved crawlers respect it, but browsers, attackers, and malicious bots ignore it completely.
Should I put sensitive paths in robots.txt?
No. Listing paths in robots.txt tells everyone (including attackers) exactly where your sensitive pages are. It's like putting a 'Do Not Enter' sign on a door with no lock.
Do I need a robots.txt file?
For SEO purposes, yes. It helps search engines crawl your site efficiently. For security, it's irrelevant. Don't rely on it to hide anything.
