As AI systems increasingly crawl the web to train models and power answer engines, site owners are asking whether their XML sitemaps matter for this new category of bots. The short answer: sitemaps can help AI crawlers discover your content, but they're not as critical as they are for traditional search engines—and many AI companies are building their own discovery methods anyway.
How AI Crawlers Differ from Search Engine Bots
Traditional search engine crawlers like Googlebot have one primary goal: discover and index pages so they can appear in search results. XML sitemaps directly support this by providing a roadmap of URLs to crawl.
AI crawlers serve different purposes. Some crawl to build training datasets for large language models. Others power real-time retrieval systems that pull information to generate answers. A few do both. These different use cases change how sitemaps fit into the picture.
Training Data Crawlers
Companies like OpenAI (GPTBot), Anthropic (ClaudeBot), and others operate crawlers that collect web content for model training. These crawlers typically:
- Prioritize breadth over depth, trying to capture diverse content across the web
- Don't need to recrawl frequently since training happens periodically
- Often rely on existing link graphs and web archives rather than real-time discovery
For training crawlers, your sitemap is one signal among many. They'll likely find your content through links from other sites, Common Crawl archives, or their own web-wide discovery mechanisms. A sitemap won't hurt, but it's not the primary discovery method.
Retrieval and Answer Engine Crawlers
Systems like Perplexity, Google's AI Overviews, and Bing's Copilot need fresher content to generate accurate answers. These crawlers behave more like traditional search bots—they need to discover new content quickly and track changes over time.
For retrieval-focused AI systems, sitemaps provide more value. The <lastmod> tag helps these systems identify recently updated content worth recrawling. Your sitemap serves as a hint about what content exists and how current it is.
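To make that concrete, below is a minimal sketch of the kind of freshness check a retrieval crawler could run against a sitemap. The sitemap URL is a placeholder and the seven-day window is arbitrary; whether any particular AI crawler does exactly this is undocumented. The point is simply that accurate <lastmod> values make this kind of check possible.

# Sketch: list sitemap URLs whose <lastmod> falls in the last seven days.
# The sitemap URL is a hypothetical placeholder; how any given AI crawler
# actually consumes <lastmod> is not publicly documented.
from datetime import datetime, timedelta, timezone
from urllib.request import urlopen
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urlopen(SITEMAP_URL) as response:
    tree = ET.parse(response)

cutoff = datetime.now(timezone.utc) - timedelta(days=7)
for url in tree.getroot().findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    if not lastmod:
        continue
    modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
    if modified.tzinfo is None:  # date-only values parse as naive datetimes
        modified = modified.replace(tzinfo=timezone.utc)
    if modified >= cutoff:
        print("recently updated:", loc)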
Which AI Crawlers Are Out There?
Several major AI crawlers are currently active:
GPTBot (OpenAI): Used for training data collection. Identifies itself as GPTBot in the user agent string.
ChatGPT-User (OpenAI): Fetches content in real time when ChatGPT users browse the web. Unlike GPTBot, it serves live retrieval rather than training.
ClaudeBot (Anthropic): Anthropic's crawler for training data and potentially future retrieval features.
PerplexityBot: Powers Perplexity's answer engine with real-time web content.
Google-Extended: Google's crawler specifically for AI/ML training purposes, separate from Googlebot which handles search indexing.
Bytespider (ByteDance): TikTok's parent company operates this crawler, likely for various AI applications.
CCBot (Common Crawl): The non-profit Common Crawl project, whose archives many AI companies use as a data source.
Each crawler may handle sitemaps differently. There's no universal standard for how AI systems should interact with sitemap files, and most AI companies haven't published clear documentation about their sitemap usage.
Do AI Crawlers Actually Read Sitemaps?
Here's where things get murky. Unlike Google, which explicitly documents how Googlebot uses sitemaps, AI companies have been largely silent on the topic.
Based on observed crawler behavior and the limited documentation available:
Likely yes for retrieval systems: Crawlers powering real-time answer engines probably use sitemaps similarly to search engines—as a discovery aid and freshness signal. These systems benefit from knowing which URLs exist and when they last changed.
Unclear for training crawlers: Large-scale training crawls may not prioritize sitemap-based discovery. When you're trying to capture a broad slice of the web, following links and using existing indexes (like Common Crawl) is often more efficient than parsing millions of individual sitemaps.
Inconsistent implementation: Even if an AI crawler reads sitemaps, it may not respect every signal. Most search engines already ignore <priority> and <changefreq>, and AI crawlers are even less likely to pay attention to these optional hints.
What Sitemaps Can't Do for AI Crawlers
XML sitemaps have significant limitations when it comes to managing AI crawler access:
Sitemaps Don't Control Access
Including a URL in your sitemap doesn't grant permission to crawl it. Excluding a URL doesn't block crawling. Sitemaps are informational—they tell crawlers what exists, not what they're allowed to access.
To actually block AI crawlers, you need robots.txt rules:
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
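To sanity-check rules like these, Python's built-in robots.txt parser shows how a compliant crawler would evaluate them. This only models the well-behaved case; a crawler that ignores robots.txt is unaffected by anything you put in the file.

# Sketch: verify how the robots.txt rules above evaluate for a compliant crawler.
import urllib.robotparser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

for agent in ("GPTBot", "ClaudeBot", "CCBot", "Googlebot"):
    print(agent, "may fetch /:", parser.can_fetch(agent, "https://example.com/"))
# GPTBot, ClaudeBot, and CCBot evaluate to False; Googlebot stays True
# because no rule here applies to it.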
No AI-Specific Sitemap Extensions
The sitemap protocol doesn't include any AI-specific elements. You can't signal that content is suitable for training, specify licensing terms, or indicate preferences for how AI should use your content. Those decisions happen through other mechanisms: robots.txt, terms of service, or emerging proposals such as ai.txt that haven't yet achieved wide adoption.
No Guaranteed Compliance
AI crawlers don't universally respect robots.txt, let alone sitemap hints. Some companies have committed to honoring robots.txt blocks, but enforcement varies. Your sitemap might be read by crawlers you'd prefer to block, and blocked crawlers might ignore your sitemap anyway.
Best Practices for AI Crawler Discovery
If you want AI systems to find and use your content, standard sitemap best practices apply:
Maintain an accurate sitemap: Include all pages you want discovered and exclude pages you don't want indexed. Keep <lastmod> dates accurate so crawlers can identify fresh content.
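If you generate the sitemap yourself rather than relying on a CMS or plugin, wiring <lastmod> to real modification timestamps is straightforward. The sketch below uses a hypothetical page list with hard-coded dates; in practice you would pull URLs and timestamps from your content store or build pipeline.

# Sketch: build a minimal sitemap with <lastmod> taken from real edit timestamps.
# The page list is a hypothetical stand-in for whatever your CMS or build
# pipeline knows about content changes.
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

pages = [
    ("https://example.com/", datetime(2024, 5, 2, tzinfo=timezone.utc)),
    ("https://example.com/blog/ai-crawlers", datetime(2024, 6, 18, tzinfo=timezone.utc)),
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, modified in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = modified.date().isoformat()

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)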
Submit to search engines: Google Search Console and Bing Webmaster Tools submissions help your sitemap reach AI systems integrated with these platforms, like Google's AI Overviews and Bing's Copilot.
Use clear robots.txt directives: If you want to allow some AI crawlers but not others, specify this explicitly:
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /
Structure content well: AI systems extracting information from your pages benefit from clear HTML structure, proper heading hierarchy, and schema markup. This doesn't relate to sitemaps directly, but it affects how useful your content is once discovered.
If You Want to Block AI Crawlers
Sitemaps are irrelevant here—use robots.txt and consider these additional steps:
Block known user agents: Add disallow rules for specific AI crawler user agents. Keep your list updated as new crawlers emerge.
Monitor your logs: Watch for unidentified crawlers with AI-like behavior patterns (rapid requests, systematic URL patterns) and block suspicious agents.
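As a starting point, a short script can tally requests from known AI user agents in your access log. The log path and agent list below are assumptions to adapt to your own setup; any high-volume agent that doesn't match is worth a manual look.

# Sketch: count requests from known AI crawler user agents in an access log.
# The log path and agent list are assumptions; extend the list as new
# crawlers emerge.
from collections import Counter

AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
             "Google-Extended", "Bytespider", "CCBot"]

counts = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        for agent in AI_AGENTS:
            if agent.lower() in line.lower():
                counts[agent] += 1

for agent, hits in counts.most_common():
    print(f"{agent}: {hits} requests")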
Use technical barriers carefully: Rate limiting, requiring JavaScript rendering, or implementing bot detection can reduce unwanted crawling, but may also affect legitimate users and search engines.
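For illustration, here is one crude application-level barrier: refuse requests from self-identified AI crawlers. The framework choice (Flask) and the agent list are assumptions, and this only catches bots that send an honest User-Agent header.

# Sketch: refuse requests whose User-Agent matches a known AI crawler.
# Only catches bots that identify themselves honestly; the framework
# (Flask) and the agent list are illustrative assumptions.
from flask import Flask, abort, request

BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

app = Flask(__name__)

@app.before_request
def refuse_ai_crawlers():
    ua = request.headers.get("User-Agent", "")
    if any(agent.lower() in ua.lower() for agent in BLOCKED_AGENTS):
        abort(403)

@app.route("/")
def index():
    return "Hello, humans and search engines."

Blocking at the CDN or web server level is usually cheaper than doing it in application code, but the logic is the same: match the user agent, return an error status.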
Understand the limitations: Determined scrapers can circumvent most technical barriers. Blocking AI crawlers reduces exposure but doesn't guarantee your content won't end up in training data through other paths (archives, licensed datasets, etc.).
The Bottom Line
XML sitemaps serve AI crawlers in roughly the same way they serve search engines—as a discovery aid and freshness indicator. They're useful but not essential, and they provide no access control.
For most sites, the practical recommendation is simple: maintain a good sitemap for search engines, and it will serve AI crawlers adequately as a side effect. If AI crawler management is a priority for your organization, focus your energy on robots.txt rules and monitoring rather than sitemap optimization.
The AI crawling landscape is evolving rapidly. New standards for communicating AI usage preferences may emerge, and AI companies may publish clearer documentation about their crawler behavior. For now, sitemaps remain a basic building block of web discoverability—for humans, search engines, and AI systems alike.