Getting your content discovered by AI systems—whether for inclusion in training data or retrieval by answer engines—requires making your site accessible and easy to parse. While you can't force AI crawlers to index your content, you can remove barriers and provide signals that make crawling more efficient. This guide covers the technical optimizations that improve AI crawler access.
Allow AI Crawlers in Robots.txt
The first and most critical step: don't block the bots you want to crawl your site.
Identify Which Crawlers to Allow
Major AI crawlers and their user agents include:

- GPTBot — OpenAI's crawler for gathering training data
- ChatGPT-User — OpenAI's agent for user-initiated browsing from ChatGPT
- ClaudeBot — Anthropic's crawler
- Claude-Web — an additional Anthropic user agent seen in some crawl logs
- PerplexityBot — Perplexity's crawler for its answer engine index
- Google-Extended — Google's robots.txt token that controls whether content is used for Gemini and related AI products (it does not affect Google Search crawling)
Configure Explicit Allow Rules
Don't assume AI crawlers have access by default. Some inherit restrictions meant for other bots. Explicitly allow the crawlers you want:
```
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```
If you have sections you want to exclude (like user account areas), be specific:
```
User-agent: GPTBot
Allow: /
Disallow: /account/
Disallow: /checkout/
Disallow: /admin/
```
Avoid Overly Restrictive Wildcard Rules
Check for existing rules that might inadvertently block AI crawlers:
A blanket disallow like this blocks every crawler that doesn't have its own rule group:
```
User-agent: *
Disallow: /
```
If you use restrictive defaults, add AI crawler exceptions before the wildcard block:
```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /
```
Crawlers follow the most specific user-agent group that matches them, so a dedicated GPTBot or ClaudeBot group takes precedence over the wildcard block regardless of where it appears in the file.
Maintain Comprehensive XML Sitemaps
While AI crawlers don't depend on sitemaps as heavily as search engines, sitemaps still aid discovery—particularly for retrieval-focused systems that need to find fresh content quickly.
Include All Valuable Content
Your sitemap should list every page worth indexing:
- Core content pages
- Blog posts and articles
- Product pages
- Category and collection pages
- Resource libraries, documentation, guides
- Author pages with substantial content
Keep lastmod Accurate
The <lastmod> timestamp helps crawlers prioritize recently updated content:
```xml
<url>
  <loc>https://example.com/guides/ai-optimization/</loc>
  <lastmod>2025-01-20T14:30:00+00:00</lastmod>
</url>
```

Only update this value when the content actually changes. False freshness signals (like updating timestamps without content changes) erode trust in your sitemap data.
Use Sitemap Index Files for Large Sites
Organize large content libraries into logical sitemap groups:
```xml
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2025-01-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2025-01-18</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-docs.xml</loc>
    <lastmod>2025-01-15</lastmod>
  </sitemap>
</sitemapindex>
```

This structure helps crawlers identify which content categories have been updated without processing every URL.
Optimize Page Load Performance
AI crawlers operate at scale. Slow pages get deprioritized or abandoned—crawl budgets aren't infinite.
Target Sub-Second Server Response
Aim for server response times (TTFB) under 500ms. Crawlers care less about visual rendering metrics like Largest Contentful Paint, but they do care about how quickly they receive HTML.
Improvements that help:
- Enable server-side caching (Redis, Memcached, Varnish)
- Use a CDN for static assets and consider full-page CDN caching
- Optimize database queries that block page generation
- Ensure adequate server resources during high-traffic periods
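To see where you stand, you can measure TTFB directly from the command line; a quick sketch using curl's built-in timing variables (the URL is a placeholder):

```bash
# Print time-to-first-byte and total fetch time for a page,
# using an AI crawler user agent so any bot-specific handling is included
curl -s -o /dev/null -w "TTFB: %{time_starttransfer}s  Total: %{time_total}s\n" \
  -A "GPTBot" https://example.com/guides/ai-optimization/
```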
Don't Block on JavaScript
Most AI crawlers don't execute JavaScript—they parse raw HTML responses. Content loaded dynamically via JavaScript is often invisible to these bots.
This matters for:
- Single-page applications (SPAs) with client-side rendering
- Lazy-loaded content that requires scroll or interaction
- Dynamic content pulled from APIs after page load
Solutions include server-side rendering (SSR), static site generation (SSG), or hybrid approaches that pre-render content for bots.
Verify Bot Accessibility
Test what crawlers actually see using tools that fetch pages without JavaScript:
```bash
curl -A "GPTBot" https://yourdomain.com/page/
```
If the returned HTML is missing key content, that content won't be indexed by most AI systems.
Structure Content for Machine Readability
AI crawlers extract meaning from HTML structure, not visual presentation. Clean, semantic markup improves content extraction accuracy.
Use Semantic HTML Elements
Prefer semantic tags over generic divs:
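A minimal before-and-after sketch (the class names are illustrative, not from any real site):

```html
<!-- Harder to interpret: generic containers -->
<div class="top">...</div>
<div class="content">
  <div class="post">...</div>
</div>
<div class="bottom">...</div>

<!-- Easier to interpret: semantic elements -->
<header>...</header>
<main>
  <article>...</article>
</main>
<footer>...</footer>
```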
AI systems parsing semantic HTML can better understand content hierarchy and relationships.
Implement Proper Heading Structure
Use headings in logical order to establish a document outline:
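A sketch of a clean outline (the heading text is purely illustrative):

```html
<h1>AI Crawler Optimization Guide</h1>
  <h2>Robots.txt Configuration</h2>
    <h3>Allow Rules</h3>
    <h3>Disallow Rules</h3>
  <h2>Sitemap Maintenance</h2>
```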
Skipping levels (h1 to h3 with no h2) or using headings purely for styling creates confusing document structure.
Add Schema.org Structured Data
Structured data provides explicit metadata that AI systems can parse reliably:
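For example, a minimal JSON-LD block for an article page might look like this (all values are placeholders):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "How to Make Your Site Crawlable by AI Systems",
  "datePublished": "2025-01-20",
  "dateModified": "2025-01-20",
  "author": {
    "@type": "Person",
    "name": "Jane Author"
  }
}
</script>
```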
Useful schema types for AI extraction:
- Article, NewsArticle, BlogPosting for editorial content
- Product for e-commerce pages
- FAQPage for Q&A content
- HowTo for instructional content
- Organization for company information
Provide Clear Content Boundaries
Help crawlers identify primary content versus navigation, ads, and boilerplate:
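A rough sketch of the intended page layout:

```html
<body>
  <nav><!-- site navigation --></nav>
  <main>
    <article>
      <h1>Page title</h1>
      <!-- the substantive content crawlers should extract -->
    </article>
  </main>
  <aside><!-- related links, ads --></aside>
  <footer><!-- boilerplate --></footer>
</body>
```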
The <main> and <article> elements signal where substantive content lives.
Handle Crawl Budget Efficiently
AI crawlers, like search engines, allocate limited resources to each domain. Help them spend that budget on valuable content.
Eliminate Crawl Traps
Common traps that waste crawl budget:
- Infinite calendar pagination: Links to past/future dates that generate pages indefinitely
- Faceted navigation: Filter combinations that create thousands of near-duplicate URLs
- Session IDs in URLs: Unique URLs for every visitor session
- Printer-friendly versions: Duplicate pages at separate URLs
Block or canonicalize these with robots.txt rules and canonical tags.
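For example, parameter-driven duplicates can often be kept out of the crawl with wildcard patterns (support for * in robots.txt varies by crawler, and the parameter names below are hypothetical):

```
User-agent: *
Disallow: /*?sessionid=
Disallow: /*?sort=
Disallow: /*&ref=
```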
Use Canonical Tags Correctly
When duplicate or similar content exists at multiple URLs, specify the authoritative version:
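The tag belongs in the <head> of every duplicate or variant page (the URL below is a placeholder):

```html
<link rel="canonical" href="https://example.com/products/category/product-name/">
```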
This tells crawlers which URL represents the content, preventing wasted budget on duplicates.
Implement Clean URL Structures
Predictable, hierarchical URLs are easier to crawl:
Good (clear hierarchy):

```
/products/category/product-name/
/blog/2025/01/article-title/
```

Problematic (opaque parameters):

```
/p?id=12345&cat=67&ref=home
/index.php?route=product/product&product_id=12345
```
Clean URLs aren't strictly required, but they tend to reflect well-organized site architecture, which correlates with better crawl efficiency.
Ensure High Availability
Crawlers that encounter errors will reduce crawl frequency or give up entirely.
Minimize Downtime
Uptime matters more for AI crawling than you might expect. Training crawls may run on schedules that don't align with your maintenance windows. Retrieval crawlers need your site to be available when users ask questions about your content.
Target 99.9% uptime if AI discoverability matters to you.
Return Proper Status Codes
Correct HTTP status codes help crawlers understand your site:
- 200: Page exists and is available
- 301: Permanent redirect (crawler will update its records)
- 404: Page doesn't exist (crawler will remove from index)
- 503: Temporary unavailability (crawler will retry later)
Avoid soft 404s—pages that return 200 status but show "not found" content. These confuse crawlers and waste budget.
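A quick way to spot-check this is to request a URL that should not exist and confirm the server answers with a 404 rather than a 200 (the path below is just an example):

```bash
# Should print 404; a 200 here indicates a soft-404 configuration
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/this-page-should-not-exist/
```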
Handle Rate Limiting Gracefully
If you implement rate limiting, don't return error codes that suggest permanent problems. Use 429 (Too Many Requests) with a Retry-After header:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
```
This tells crawlers to back off temporarily rather than abandoning your site.
Configure Server Headers for Crawlers
HTTP headers communicate metadata about your pages to crawlers before they process content.
Set Appropriate Cache Headers
Help crawlers understand content freshness:
```http
Cache-Control: max-age=3600
Last-Modified: Mon, 20 Jan 2025 14:30:00 GMT
ETag: "abc123"
```
Crawlers can use conditional requests (If-Modified-Since, If-None-Match) to check for updates without downloading unchanged content.
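For example, a conditional request against the ETag above should come back as 304 Not Modified when nothing has changed (a sketch; the ETag value and URL are from the earlier examples):

```bash
# Expect 304 if the resource is unchanged since the given ETag was issued
curl -s -o /dev/null -w "%{http_code}\n" \
  -H 'If-None-Match: "abc123"' https://example.com/guides/ai-optimization/
```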
Avoid Overly Aggressive Security Headers
Some security configurations block legitimate crawlers:
- IP-based blocking that catches crawler IP ranges
- User-agent filtering that blocks unfamiliar bots
- CAPTCHA challenges on every request
- JavaScript challenges (like some Cloudflare settings) that headless crawlers can't pass
Review your WAF and CDN settings to ensure AI crawlers can access content without friction.
Monitor Crawler Activity
You can't optimize what you don't measure. Track AI crawler behavior through your logs.
Parse Server Logs for Bot Traffic
Identify AI crawler requests by user agent:
```bash
grep -E "GPTBot|ClaudeBot|PerplexityBot|ChatGPT-User" access.log | head -100
```
Look for patterns:
- Which pages are crawled most frequently?
- Are crawlers encountering errors?
- How much of your site have they discovered?
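To answer the first question, for example, you can count requests per URL for a given bot (this assumes a standard combined log format, where the request path is the seventh field):

```bash
# Top 20 URLs requested by GPTBot, by request count
grep "GPTBot" access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
```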
Set Up Crawler-Specific Analytics
Some analytics platforms can segment bot traffic. Create segments for major AI crawlers to track:
- Total requests per crawler
- Pages per session
- Error rates
- Response times experienced by bots
Watch for Unusual Patterns
Sudden drops in AI crawler activity might indicate:
- Robots.txt changes that blocked access
- Server issues causing timeouts
- Rate limiting triggering too aggressively
- Content changes that reduced page count
Investigate significant changes promptly.
Consider Direct Data Partnerships
For organizations with substantial content libraries, technical optimization has limits. Direct relationships with AI companies can provide more reliable inclusion.
Data Licensing Programs
Some AI companies license content directly from publishers. This guarantees inclusion in training data with proper attribution and potentially compensation. Major publishers like AP, Financial Times, and Reddit have established such deals.
API Access for Real-Time Retrieval
Answer engines may support direct integrations that bypass traditional crawling. This ensures your latest content is available for retrieval without waiting for crawl cycles.
Common Crawl Inclusion
Ensuring your site is well-represented in Common Crawl archives improves your chances of inclusion in AI training data, since many companies use Common Crawl as a data source. The technical optimizations in this guide—clean HTML, good availability, proper robots.txt—all contribute to Common Crawl coverage.
Summary: Priority Checklist
For immediate impact, focus on these items first:
- Audit robots.txt — Explicitly allow AI crawlers you want to index your content
- Test JavaScript-free rendering — Verify your content is visible without client-side rendering
- Fix server performance — Ensure fast, reliable responses for crawler requests
- Update sitemap lastmod — Accurate timestamps help crawlers prioritize fresh content
- Implement semantic HTML — Use proper heading structure and semantic elements
- Add structured data — Schema.org markup improves content extraction
- Monitor crawler logs — Track AI bot activity to identify issues early
Each improvement reduces friction between AI systems and your content, increasing the likelihood of discovery, extraction, and inclusion—whether in training datasets or real-time retrieval results.