What can you do with it?

Use Crawling to crawl a website and retrieve the contents of its pages. It systematically extracts content from multiple pages by following links, which makes it well suited to comprehensive data collection, documentation scraping, and content migration. It features configurable depth controls, path filtering, subdomain inclusion, and the ability to apply scrape options (formats, proxy, caching) to all crawled pages.

How to use it?

Basic Command Structure

/crawling [urls] [limit]

Parameters

Required:
  • urls - Array of starting URLs to crawl
Optional Crawl Control:
  • limit - Maximum number of pages to crawl (default: 10)
  • maxDepth - Max depth to crawl (legacy, converts to maxDiscoveryDepth)
  • maxDiscoveryDepth - How many link levels deep to discover
  • allowBackwardLinks - Allow external domains (legacy, converts to allowExternalLinks)
  • allowExternalLinks - Crawl external domains
  • crawlEntireDomain - Crawl entire domain, not just children of URL
  • allowSubdomains - Include subdomains in crawl
Path Filtering:
  • includePaths - Only crawl URLs matching these patterns (e.g., ["/blog/*"])
  • excludePaths - Skip URLs matching these patterns (e.g., ["/admin/*"])
Scrape Options (Applied to All Pages):
  • scrapeOptions - Object containing any scrape parameters:
    • formats - Output formats (["markdown", "html", "links", etc.])
    • onlyMainContent - Extract only main content
    • proxy - Proxy type ("basic", "stealth", "auto")
    • maxAge - Use cache if younger than this (milliseconds)
    • location - Location settings for geo-specific content
    • Any other scrape parameter
File Storage:
  • file_links_expire_in_days - Days until file links expire
  • file_links_expire_in_minutes - Alternative to days
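
Putting these parameters together, a crawl configuration might look like the sketch below. The field names come from the list above; the values are placeholders, and the exact way the command assembles them may differ:

{
  "urls": ["https://docs.example.com"],
  "limit": 50,
  "maxDiscoveryDepth": 3,
  "includePaths": ["/docs/*"],
  "excludePaths": ["/docs/archive/*"],
  "allowSubdomains": false,
  "scrapeOptions": {
    "formats": ["markdown", "links"],
    "onlyMainContent": true,
    "proxy": "auto",
    "maxAge": 3600000
  }
}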

Response Format

{
  "status": "completed",
  "total": 4,
  "completed": 4,
  "creditsUsed": 4,
  "results": [
    {
      "url": "https://example.com/page1",
      "data": {
        "markdown": "# Page Title\n\nContent in markdown...",
        "html": "<html>...</html>",
        "links": ["https://link1.com", "https://link2.com"],
        "metadata": {
          "title": "Page Title",
          "description": "Page description",
          "language": "en",
          "sourceURL": "https://example.com/page1",
          "url": "https://example.com/page1",
          "statusCode": 200
        }
      }
    },
    {
      "url": "https://example.com/page2",
      "data": {
        "markdown": "# Another Page\n\nMore content...",
        "metadata": {
          "title": "Another Page",
          "sourceURL": "https://example.com/page2",
          "url": "https://example.com/page2",
          "statusCode": 200
        }
      }
    }
  ]
}

Examples

Basic Usage

/crawling crawl https://docs.example.com up to 20 pages
Crawl a documentation site up to 20 pages.
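
Roughly equivalent parameters (illustrative only):

{
  "urls": ["https://docs.example.com"],
  "limit": 20
}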

Path Filtering

/crawling crawl https://example.com including only docs and guides folders, excluding blog and news
Crawl with path filtering to focus on specific sections.
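
A plausible parameter mapping for this request; the specific path patterns are assumptions about the site's layout:

{
  "urls": ["https://example.com"],
  "includePaths": ["/docs/*", "/guides/*"],
  "excludePaths": ["/blog/*", "/news/*"]
}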

Entire Domain Crawl

/crawling crawl entire domain of https://example.com including subdomains
Crawls the entire domain and all subdomains, not just children of the URL.
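
Expressed as parameters, roughly:

{
  "urls": ["https://example.com"],
  "crawlEntireDomain": true,
  "allowSubdomains": true
}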

With Caching

/crawling crawl https://site.com with cached results if less than 1 hour old
Uses cached content when available (scrapeOptions with maxAge).
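
One hour is 3,600,000 milliseconds, so the request would carry scrape options along these lines:

{
  "urls": ["https://site.com"],
  "scrapeOptions": {
    "maxAge": 3600000
  }
}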

Stealth Mode Crawl

/crawling crawl https://protected-site.com using stealth proxy for all pages
Applies stealth proxy to all crawled pages.
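
Illustrative parameters:

{
  "urls": ["https://protected-site.com"],
  "scrapeOptions": {
    "proxy": "stealth"
  }
}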

Multi-Format Output

/crawling crawl https://docs.site.com getting markdown, html, and screenshots for each page
Retrieves multiple formats for each crawled page.
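
Assuming the screenshot format is named "screenshot" (the format list above only confirms "markdown", "html", and "links"), the scrape options might be:

{
  "urls": ["https://docs.site.com"],
  "scrapeOptions": {
    "formats": ["markdown", "html", "screenshot"]
  }
}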

Location-Specific Crawl

/crawling crawl https://global-site.com from Japan in Japanese
Crawls as if accessing from Japan with Japanese language preference.
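
The exact shape of the location object is not documented here; one plausible sketch, treating the country and language field names as assumptions:

{
  "urls": ["https://global-site.com"],
  "scrapeOptions": {
    "location": {
      "country": "JP",
      "languages": ["ja"]
    }
  }
}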

Deep Crawl with Limits

/crawling crawl https://site.com up to depth 5 maximum 100 pages
Controls crawl depth and total page limit.
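
Illustrative parameters (a legacy maxDepth value would be auto-converted to maxDiscoveryDepth, per the Notes below):

{
  "urls": ["https://site.com"],
  "maxDiscoveryDepth": 5,
  "limit": 100
}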

Notes

  • Polls job to completion automatically
  • Legacy parameters are auto-converted (maxDepth→maxDiscoveryDepth, allowBackwardLinks→allowExternalLinks)
  • Screenshots are uploaded to storage and returned as file_urls
  • Supports glob patterns for path filtering (e.g., "/docs/*", "/blog/*")
  • All scrape parameters can be applied to crawled pages via scrapeOptions
  • Each page crawled uses one credit