How to Create a Robots.txt File

The robots.txt file tells search engines and other crawlers which parts of your site they can and cannot access. Every website should have one—it's a fundamental piece of technical SEO that takes minutes to set up but can prevent indexing problems, wasted crawl budget, and unwanted content appearing in search results.

What Is Robots.txt?

Robots.txt is a plain text file that lives in your website's root directory. When a crawler visits your site, it checks this file first to understand what it's allowed to access. The file uses a simple syntax to specify rules for different crawlers (called "user agents") and which URL paths they should avoid.

The file must be:

  • Named exactly robots.txt (lowercase)
  • Located at your domain's root (e.g., https://example.com/robots.txt)
  • Accessible via HTTP/HTTPS
  • Plain text format (not HTML)

Basic Robots.txt Syntax

A robots.txt file consists of one or more rule sets, each containing a user-agent declaration followed by directives.

User-Agent

The User-agent line specifies which crawler the following rules apply to:

User-agent: Googlebot

Use an asterisk to target all crawlers:

User-agent: *

Disallow

The Disallow directive tells crawlers not to access specific paths:

User-agent: *
Disallow: /admin/
Disallow: /private/

An empty Disallow means nothing is blocked:

User-agent: *
Disallow:

Allow

The Allow directive permits access to specific paths, useful for overriding broader Disallow rules:

User-agent: *
Disallow: /images/
Allow: /images/public/

This blocks /images/ but permits /images/public/.

Sitemap

The Sitemap directive points crawlers to your XML sitemap:

Sitemap: https://example.com/sitemap.xml

This can appear anywhere in the file and applies globally (not to specific user agents).

Creating Your First Robots.txt File

Step 1: Create the File

Open any plain text editor (Notepad, TextEdit, VS Code, Sublime Text). Create a new file and save it as robots.txt.

Do not use word processors like Microsoft Word—they add hidden formatting that breaks the file.

Step 2: Add Your Rules

Start with a basic configuration that allows all crawlers:

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

This tells all crawlers they can access everything and points them to your sitemap.

Step 3: Upload to Your Root Directory

Upload robots.txt to your website's root directory via FTP, SFTP, or your hosting control panel. The file should be accessible at:

https://yourdomain.com/robots.txt
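
If you have shell access, a single upload command works as well; adjust the user, host, and web-root path to match your server:

scp robots.txt user@yourdomain.com:/var/www/html/robots.txt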

Step 4: Verify It Works

Visit the URL in your browser. You should see your robots.txt content displayed as plain text. If you see a 404 error or HTML content, the file isn't in the right location or isn't named correctly.
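
You can also check it from the command line; a healthy response returns status 200 with a text/plain content type:

curl -I https://yourdomain.com/robots.txt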

Common Robots.txt Configurations

Allow Everything (Default)

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Use this when you want all content crawled and eligible for indexing. The empty Disallow explicitly permits everything.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml

Blocks administrative areas and temporary directories while allowing everything else.

Block Specific File Types

User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /*.xls$

Sitemap: https://example.com/sitemap.xml

Prevents crawling of PDF and Office documents. The $ anchors the pattern to the end of the URL.

Block URL Parameters

User-agent: *
Disallow: /*?*
Allow: /*?page=

Sitemap: https://example.com/sitemap.xml

Blocks URLs with query parameters but allows pagination parameters. This helps prevent duplicate content from filtered or sorted views.

Block Specific Crawlers

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

Blocks specific AI crawlers entirely while applying standard rules to everyone else. More specific user-agent rules take precedence.

Different Rules for Different Bots

User-agent: Googlebot
Disallow: /private/
Allow: /

User-agent: Bingbot
Disallow: /private/
Allow: /

User-agent: *
Disallow: /

Sitemap: https://example.com/sitemap.xml

Allows only Googlebot and Bingbot to crawl the site (everything except /private/) while blocking all other bots. The wildcard rule at the end catches every crawler not explicitly named.

Block Everything (Development/Staging Sites)

User-agent: *
Disallow: /

Prevents all crawlers from accessing any page. Use this for staging environments to prevent accidental indexing. Remember to remove or update this before launching.

Robots.txt Pattern Matching

Robots.txt supports two wildcard characters for flexible pattern matching.

Asterisk (*) Wildcard

Matches any sequence of characters:

Disallow: /images/*.gif

Blocks all .gif files in the /images/ directory and subdirectories.

Disallow: /*?sessionid=

Blocks any URL containing a ?sessionid= parameter.

Dollar Sign ($) End Matcher

Indicates the URL must end with the specified string:

Disallow: /*.php$

Blocks URLs ending in .php but not /page.php?id=1 (which doesn't end with .php).

Allow: /directory$

Allows exactly /directory but not /directory/page or /directory/.

Platform-Specific Instructions

WordPress

WordPress creates a virtual robots.txt automatically, but you can override it:

Option 1: Physical File
Create robots.txt and upload it to your WordPress root directory (where wp-config.php lives). This overrides the virtual file.

Option 2: Plugin
SEO plugins like Yoast and Rank Math let you edit robots.txt through the admin panel under their settings.

Option 3: Theme Functions
Add custom rules programmatically via functions.php:
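
A minimal sketch, assuming you only want to append a rule to the virtual file using WordPress's robots_txt filter (a physical robots.txt file bypasses this filter entirely):

add_filter( 'robots_txt', function ( $output, $public ) {
    // Append a custom rule to the generated virtual robots.txt.
    $output .= "Disallow: /private/\n";
    return $output;
}, 10, 2 );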

Shopify

Shopify generates robots.txt automatically and doesn't allow direct editing. To customize it:

  1. Go to Online Store → Themes
  2. Click Actions → Edit code
  3. Create a new template file called robots.txt.liquid
  4. Add your custom rules using Liquid syntax (see the sketch below)
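
A minimal robots.txt.liquid sketch, assuming Shopify's documented robots.default_groups Liquid object; it re-emits the default rules and appends one extra Disallow to the wildcard group:

{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {%- for rule in group.rules %}
    {{ rule }}
  {%- endfor %}
  {%- if group.user_agent.value == '*' %}
    {{ 'Disallow: /private/' }}
  {%- endif %}
  {%- if group.sitemap != blank %}
    {{ group.sitemap }}
  {% endif %}
{% endfor %}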

Drupal

Drupal includes a physical robots.txt file in the root directory. Edit it directly or use the RobotsTxt module to manage it through the admin interface.

For Drupal 8+, the default file blocks several administrative paths. Review and customize based on your needs.

Magento

Magento 2 includes a default robots.txt in the pub directory. Edit it directly at /pub/robots.txt or configure through:

Stores → Configuration → General → Design → Search Engine Robots

Static Sites / Manual Upload

Simply create the file locally and upload to your web root via FTP/SFTP. Ensure it's named correctly and accessible at your domain root.

Testing Your Robots.txt

Google Search Console

Google Search Console's robots.txt report lets you confirm that Google can fetch your file and see any parsing errors:

  1. Open Search Console for your property
  2. Go to Settings → robots.txt (under Crawling)
  3. View your current file and any errors
  4. Check individual URLs with the URL Inspection tool to see whether they're blocked by robots.txt

Bing Webmaster Tools

Bing offers similar testing under Configure My Site → Block URLs → robots.txt validator.

Manual Testing

Check if a URL is blocked by matching it against your rules:

User-agent: *
Disallow: /private/

  • /private/ — Blocked ✓
  • /private/page.html — Blocked ✓
  • /privateer/ — Not blocked (different path)
  • /my-private/ — Not blocked

Remember that Disallow: /private/ blocks the directory and everything in it, but not URLs that merely contain "private" elsewhere.

Common Mistakes to Avoid

Blocking CSS and JavaScript

Blocking CSS and JS files prevents search engines from rendering your pages properly. Google needs these resources to understand your content and may rank pages lower if it can't render them.
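
For example, blanket rules like these keep Googlebot from fetching the very files it needs to render your pages:

User-agent: *
Disallow: /*.css$
Disallow: /*.js$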

Blocking Important Content Accidentally

Overly broad rules can hide content you want indexed:
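
Disallow: /*search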

This blocks /search/ but also /research/, /job-search-tips/, and any other page with "search" in the URL.

Using Robots.txt for Security

Robots.txt is publicly visible—anyone can view it. Don't rely on it to protect sensitive content:
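
User-agent: *
Disallow: /internal-reports/
Disallow: /user-data/

A list like this simply advertises where the sensitive areas live (the paths here are illustrative).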

Use proper authentication for sensitive areas instead.

Forgetting Trailing Slashes

Be intentional about whether you want to block a specific directory or anything that starts with that string:
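
Disallow: /private
Disallow: /private/

The first rule matches /private/, /private.html, and even /privateer/; the second matches only the /private/ directory and everything inside it.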

Case Sensitivity Issues

Robots.txt paths are case-sensitive:
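
Disallow: /Private/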

This blocks /Private/ but not /private/. If your server treats URLs as case-insensitive, add rules for both.

Multiple User-Agent Lines

Ambiguous:
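
User-agent: Googlebot
User-agent: Bingbot
Disallow: /private/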

Under the current robots.txt standard, stacked User-agent lines form a single group and the rules apply to both bots, but not every crawler parses grouped records consistently. To keep your intent unambiguous, repeat the full block for each bot:
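
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /private/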

Or use the wildcard if the rule applies to everyone:
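
User-agent: *
Disallow: /private/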

Robots.txt vs. Meta Robots vs. X-Robots-Tag

These three mechanisms serve different purposes:

Robots.txt — Controls crawl access at the URL level. Crawlers check this before requesting pages. Cannot control indexing directly.

Meta Robots Tag — HTML tag on individual pages that controls indexing behavior:
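
<meta name="robots" content="noindex">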

X-Robots-Tag — HTTP header that provides the same controls as meta robots, useful for non-HTML files:
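
X-Robots-Tag: noindex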

Important distinction: robots.txt blocks crawling, not indexing. If a page is blocked by robots.txt but linked from other sites, search engines may still index the URL (without content) based on anchor text. To prevent indexing, use noindex via meta tag or header—but the page must be crawlable for search engines to see that directive.

Monitoring Robots.txt Effectiveness

Check Crawl Stats

In Google Search Console, monitor crawl statistics to see if blocked paths are being respected. Unusual crawl patterns might indicate robots.txt issues.

Review Index Coverage

The Index Coverage report shows the pages Google has discovered and their indexing status. Pages you've blocked should appear as "Blocked by robots.txt" if Google found them through links.

Log Analysis

Parse your server logs to verify crawlers are respecting your rules:

grep "Googlebot" access.log | grep "/blocked-path/"

If you see requests to blocked paths, the rules might not be working as expected—or the crawler is ignoring them.