Robots.txt: What It Actually Does (And Common Mistakes to Avoid)
What Robots.txt Does
The robots.txt file tells search engine crawlers which parts of your site they are allowed to access. It must live at the root of your domain (yoursite.com/robots.txt) and is one of the first things crawlers check before exploring your site.
Critical misunderstanding: robots.txt controls crawling, not indexing. Blocking a URL in robots.txt prevents Google from crawling it, but if other pages link to that URL, Google may still index the URL based on external information. To prevent indexing, you need a noindex tag instead.
Basic Robots.txt Syntax
The file uses simple directives:
- User-agent: Specifies which crawler the rules apply to. Use * for all crawlers.
- Disallow: Blocks a specific path from being crawled.
- Allow: Permits crawling of a specific path (overrides a broader Disallow).
- Sitemap: Tells crawlers where to find your XML sitemap.
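A minimal robots.txt combining these directives might look like the following (the paths and sitemap URL are placeholders; substitute your own):

```
# Rules for all crawlers
User-agent: *
Disallow: /admin/
Allow: /admin/public/

Sitemap: https://yoursite.com/sitemap.xml
```

Here the Allow rule carves one subdirectory back out of the broader /admin/ block.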
Common Mistakes That Kill SEO
Blocking Your Entire Site
A Disallow directive with an empty value allows everything; Disallow: / blocks everything. The difference is one character, and getting it wrong makes your entire site invisible to Google.
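Shown side by side as two separate files, the single-character difference looks like this:

```
# File 1: allows crawling of everything
User-agent: *
Disallow:
```

```
# File 2: blocks crawling of everything
User-agent: *
Disallow: /
```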
After every robots.txt change, test it with a validator such as the robots.txt report in Google Search Console (the successor to the older standalone robots.txt Tester).
Blocking CSS and JavaScript Files
If Google cannot access your CSS and JS files, it cannot render your pages properly. This hurts rankings because Google evaluates the rendered page, not just the raw HTML.
Check that your robots.txt does not block /wp-includes/, /assets/, or wherever your CSS and JS files live.
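If a needed asset sits inside a blocked directory, a more specific Allow rule can carve it out rather than unblocking the whole directory. For example, WordPress sites commonly block /wp-admin/ but re-allow admin-ajax.php, which front-end scripts depend on:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```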
Blocking Important Directories
Sites sometimes block directories like /blog/, /products/, or /category/ without realizing the SEO impact. Audit your Disallow rules and verify that every blocked path is intentionally blocked.
Leftover Development Restrictions
During development, sites often include a blanket Disallow: / to prevent indexing. If this is not removed before launch, the production site remains invisible to search engines. This happens more often than anyone admits.
Different Robots.txt for Staging and Production
Ensure your staging site has a restrictive robots.txt (blocking everything) and your production site has the correct, permissive version. Mixing the two up is a common deployment mistake.
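One safeguard is a deploy-time check that fails the build if the file destined for production still contains a blanket Disallow. This is a sketch, not a complete CI step; the default file path is an assumption about your project layout:

```python
from pathlib import Path


def assert_not_blocking_everything(robots_path: str = "public/robots.txt") -> None:
    """Fail loudly if the robots.txt file contains a blanket 'Disallow: /' rule."""
    text = Path(robots_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        # Strip inline comments and surrounding whitespace before comparing.
        rule = line.split("#", 1)[0].strip()
        if rule.lower() == "disallow: /":
            raise SystemExit("Refusing to deploy: robots.txt blocks the entire site")
```

Run it as the last step before your deploy command so a leftover staging file stops the release instead of silently shipping.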
What to Block
Legitimate uses of Disallow:
- Admin pages: /admin/, /wp-admin/
- User-specific pages: /account/, /cart/, /checkout/
- Internal search results: /search/
- Filtered/faceted URLs that create infinite crawl paths
- Thank-you or confirmation pages
- API endpoints that should not be crawled
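Put together, a production robots.txt covering these cases might look like the following. The directory names are illustrative, and the ?filter= pattern assumes your faceted URLs use that query parameter; adjust both to your site's actual structure:

```
User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/
Disallow: /search/
Disallow: /*?filter=
Disallow: /thank-you/
Disallow: /api/

Sitemap: https://yoursite.com/sitemap.xml
```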
What Not to Block
Never block:
- CSS and JavaScript files
- Image directories
- Any page you want indexed
- Your sitemap files
Robots.txt vs Noindex
Use robots.txt to manage crawl budget — preventing Google from wasting time on low-value URLs. Use noindex to prevent specific pages from appearing in search results.
You should not combine both on the same URL. If you block a page in robots.txt, Google cannot see the noindex tag because it never crawls the page.
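For the noindex approach, the page must stay crawlable so Google can see the directive, delivered either as a meta tag in the page's HTML or as an HTTP response header:

```html
<!-- In the page's <head>: -->
<meta name="robots" content="noindex">
```

The header equivalent is `X-Robots-Tag: noindex`, which is useful for non-HTML resources like PDFs.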
Testing and Monitoring
- Test changes using Search Console's robots.txt report (the successor to the robots.txt Tester) before deploying
- Monitor crawl stats in Search Console after any changes
- Set up alerts for unauthorized robots.txt modifications
- Review your robots.txt quarterly as your site structure evolves
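The alerting idea above can be sketched as a small script that fetches the live file and compares it against a known-good copy. The URL and expected content below are placeholders, and the alerting hook itself is left to you (email, Slack, PagerDuty):

```python
import hashlib
import urllib.request

# Known-good copy of your production robots.txt (placeholder content).
EXPECTED = """User-agent: *
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml
"""


def fingerprint(text: str) -> str:
    """Return a stable hash of robots.txt content for comparison."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def robots_changed(live_text: str, expected_text: str = EXPECTED) -> bool:
    """True if the live robots.txt no longer matches the known-good copy."""
    return fingerprint(live_text) != fingerprint(expected_text)


def fetch_robots(url: str = "https://yoursite.com/robots.txt") -> str:
    """Download the live robots.txt (requires network access)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")
```

Scheduled via cron or a CI job, `robots_changed(fetch_robots())` turning True is your cue to fire an alert before rankings are affected.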
A well-configured robots.txt is invisible when it works. A misconfigured one can quietly destroy your organic traffic.