Server path: /web-scraping | Type: Embedded | PCID required: No

Scrape content from web pages, crawl entire websites, map site structure, and read RSS/Atom feeds.

Tools

| Tool | Description |
| --- | --- |
| web-scraping_scrape | Scrape content from one or more web pages |
| web-scraping_crawl | Crawl a website and scrape multiple pages |
| web-scraping_map | Map all URLs on a website |
| web-scraping_rss | Read and parse RSS/Atom feeds |

web-scraping_scrape

Scrape content from one or more web pages. Supports multiple output formats, content filtering, and browser actions before scraping.

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | Yes | | URLs to scrape |
| formats | enum[] | No | ["markdown"] | Output formats: "markdown", "html", "rawHtml", "links", "summary" |
| onlyMainContent | boolean | No | true | Extract only the main content, excluding headers, navs, footers, etc. |
| removeBase64Images | boolean | No | true | Remove base64-encoded images from the output |
| waitFor | number | No | | Milliseconds to wait for the page to load before scraping |
| actions | object[] | No | | Browser actions to perform before scraping. Each action has type (required) and optional fields: milliseconds, selector, direction, fullPage, text, key. |
| includeTags | string[] | No | | HTML tags to include in the output |
| excludeTags | string[] | No | | HTML tags to exclude from the output |
| location | object | No | | Geolocation settings: { country?, languages? } |

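
As a rough illustration of the parameters above, the following sketch assembles and sanity-checks an arguments object for web-scraping_scrape. The helper name `build_scrape_args` is hypothetical, how the arguments are actually sent to the server is outside this page and not shown, and the `"wait"` action type is an assumption, since the accepted action type values are not listed here.

```python
from typing import Any, Optional

# Values copied from the parameter table above.
ALLOWED_FORMATS = {"markdown", "html", "rawHtml", "links", "summary"}
ACTION_FIELDS = {"type", "milliseconds", "selector", "direction", "fullPage", "text", "key"}


def build_scrape_args(
    urls: list[str],
    formats: Optional[list[str]] = None,
    actions: Optional[list[dict[str, Any]]] = None,
    **options: Any,
) -> dict[str, Any]:
    """Assemble and sanity-check an arguments object for web-scraping_scrape."""
    if not urls:
        raise ValueError("urls is required and needs at least one URL")
    args: dict[str, Any] = {"urls": list(urls)}

    if formats is not None:
        unknown = set(formats) - ALLOWED_FORMATS
        if unknown:
            raise ValueError(f"unsupported formats: {sorted(unknown)}")
        args["formats"] = list(formats)

    if actions is not None:
        for action in actions:
            if "type" not in action:
                raise ValueError("every action needs a 'type' field")
            extra = set(action) - ACTION_FIELDS
            if extra:
                raise ValueError(f"unknown action fields: {sorted(extra)}")
        args["actions"] = list(actions)

    # Remaining options (onlyMainContent, waitFor, includeTags, ...) pass through unchecked.
    args.update(options)
    return args


scrape_args = build_scrape_args(
    ["https://example.com/pricing"],
    formats=["markdown", "links"],
    # "wait" is a hypothetical action type; the table does not list accepted values.
    actions=[{"type": "wait", "milliseconds": 500}],
    excludeTags=["nav", "footer"],
)
```

Parameters left out of the object fall back to the defaults in the table, so only the values that differ need to be set.
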
Response fields:

| Field | Type | Description |
| --- | --- | --- |
| results | object[] | Array of scrape results |
| results[].url | string | The scraped URL |
| results[].success | boolean | Whether the scrape succeeded |
| results[].data | object | Scraped content in the requested formats |
| results[].file_urls | string[] | URLs of any files found on the page |
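
Because each entry in results carries its own success flag, a caller will probably want to separate pages that were scraped from pages that failed. The sketch below does that on a hand-written sample. The function name `split_scrape_results` is hypothetical, the sample values are made up, and the assumption that data is keyed by format name is not confirmed by the table above.

```python
from typing import Any


def split_scrape_results(response: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return (data keyed by URL for successful scrapes, URLs that failed)."""
    succeeded: dict[str, Any] = {}
    failed: list[str] = []
    for result in response.get("results", []):
        if result.get("success"):
            succeeded[result["url"]] = result.get("data", {})
        else:
            failed.append(result["url"])
    return succeeded, failed


# Hand-written sample in the documented shape; the contents are invented.
sample_response = {
    "results": [
        {
            "url": "https://example.com/a",
            "success": True,
            "data": {"markdown": "# Page A"},  # assumed: keyed by format name
            "file_urls": [],
        },
        {"url": "https://example.com/b", "success": False, "data": {}, "file_urls": []},
    ]
}

pages, failures = split_scrape_results(sample_response)
print(sorted(pages))
print(failures)
```

The same pattern should carry over to web-scraping_crawl, whose results entries share the url, success, and data fields.
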

web-scraping_crawl

Crawl a website starting from one or more URLs and scrape multiple pages. Follows links up to a configurable depth and page limit.

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | Yes | | Starting URLs to crawl from |
| limit | number | No | 10 | Maximum number of pages to crawl |
| maxDepth | number | No | | Maximum link depth to crawl from the starting URLs |
| includePaths | string[] | No | | Glob patterns for paths to include (e.g. ["/blog/*"]) |
| excludePaths | string[] | No | | Glob patterns for paths to exclude |
| allowExternalLinks | boolean | No | false | Follow links to external domains |
| allowSubdomains | boolean | No | false | Follow links to subdomains of the starting URLs |
| scrapeOptions | object | No | | Options applied to each scraped page: { formats?, onlyMainContent?, proxy?, waitFor? } |

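
A bounded crawl of one site section might be described as in the sketch below, which builds an arguments object from the parameters above and checks the scrapeOptions keys against the four listed. The helper `build_crawl_args` and its snake_case keyword names are hypothetical conveniences, and the transport used to call the tool is not shown.

```python
from typing import Any, Optional

# Keys listed for scrapeOptions in the parameter table above.
SCRAPE_OPTION_KEYS = {"formats", "onlyMainContent", "proxy", "waitFor"}


def build_crawl_args(
    urls: list[str],
    *,
    limit: Optional[int] = None,
    max_depth: Optional[int] = None,
    include_paths: Optional[list[str]] = None,
    exclude_paths: Optional[list[str]] = None,
    scrape_options: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble an arguments object for web-scraping_crawl."""
    if not urls:
        raise ValueError("urls is required and needs at least one starting URL")
    args: dict[str, Any] = {"urls": list(urls)}

    optional = {
        "limit": limit,
        "maxDepth": max_depth,
        "includePaths": include_paths,
        "excludePaths": exclude_paths,
    }
    # Leave out anything not set so the documented defaults apply.
    args.update({key: value for key, value in optional.items() if value is not None})

    if scrape_options is not None:
        unknown = set(scrape_options) - SCRAPE_OPTION_KEYS
        if unknown:
            raise ValueError(f"unknown scrapeOptions keys: {sorted(unknown)}")
        args["scrapeOptions"] = dict(scrape_options)
    return args


crawl_args = build_crawl_args(
    ["https://example.com"],
    limit=25,
    max_depth=2,
    include_paths=["/blog/*"],
    scrape_options={"formats": ["markdown"], "onlyMainContent": True},
)
```

Setting both limit and maxDepth keeps the crawl bounded in two ways: the first caps the total number of pages, the second caps how far from the starting URLs links are followed.
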
Response fields:

| Field | Type | Description |
| --- | --- | --- |
| results | object[] | Array of crawl results, one per scraped page |
| results[].url | string | The crawled URL |
| results[].success | boolean | Whether the page was scraped successfully |
| results[].data | object | Scraped content in the requested formats |

web-scraping_map

Map all discoverable URLs on a website. Useful for understanding site structure before crawling.

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | Yes | | Website URLs to map |
| search | string | No | | Filter term to narrow results to matching URLs |
| limit | number | No | 100 | Maximum number of URLs to return |
| includeSubdomains | boolean | No | false | Include URLs from subdomains |
| sitemap | enum | No | "include" | Sitemap handling: "include" (use sitemap and crawl), "skip" (ignore sitemap), "only" (use sitemap exclusively) |

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| urls | object[] | Array of discovered URLs |
| urls[].url | string | The discovered URL |
| urls[].metadata | object | URL metadata (title, description, etc.) |
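
One way to use a map result for "understanding site structure before crawling" is to count the discovered URLs by their first path segment. The sketch below does this with the standard library on a made-up sample response. The function name `count_by_section` is hypothetical.

```python
from collections import Counter
from typing import Any
from urllib.parse import urlparse


def count_by_section(map_response: dict[str, Any]) -> Counter:
    """Count discovered URLs by their first path segment, e.g. '/blog'."""
    counts: Counter = Counter()
    for entry in map_response.get("urls", []):
        path = urlparse(entry["url"]).path
        first_segment = path.strip("/").split("/")[0]
        counts["/" + first_segment if first_segment else "/"] += 1
    return counts


# Hand-written sample in the documented shape; URLs and titles are invented.
sample_map = {
    "urls": [
        {"url": "https://example.com/", "metadata": {"title": "Home"}},
        {"url": "https://example.com/blog/first-post", "metadata": {"title": "First post"}},
        {"url": "https://example.com/blog/second-post", "metadata": {"title": "Second post"}},
        {"url": "https://example.com/docs/intro", "metadata": {"title": "Intro"}},
    ]
}

sections = count_by_section(sample_map)
for section, total in sections.most_common():
    print(section, total)
```

A section that dominates the counts is a natural candidate for an includePaths pattern such as "/blog/*" in a later web-scraping_crawl call.
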

web-scraping_rss

Read and parse RSS/Atom feeds. Supports checking feed validity, retrieving all items, searching items, and getting the latest entries.

Parameters:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| action | enum | Yes | | Action to perform: "check" (validate feed), "get" (all items), "search" (filter items), "get_latest" (recent items) |
| url | string | Yes | | RSS/Atom feed URL |
| timeout | number | No | 10000 | Request timeout in milliseconds |
| limit | number | No | | Maximum number of items to return |
| query | string | No | | Search query string (used with "search" action) |
| caseSensitive | boolean | No | false | Whether the search query is case-sensitive |
| count | number | No | 10 | Number of items to return (used with "get_latest" action) |

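
Several of these parameters only apply to one action, so a small helper can keep each request limited to the fields that matter. The sketch below is one possible arrangement, not the server's own validation. The name `build_rss_args` is hypothetical, and requiring a query for "search" is a client-side choice, since the table marks query as optional.

```python
from typing import Any, Optional

# Action names copied from the parameter table above.
RSS_ACTIONS = {"check", "get", "search", "get_latest"}


def build_rss_args(
    action: str,
    url: str,
    *,
    query: Optional[str] = None,
    case_sensitive: Optional[bool] = None,
    count: Optional[int] = None,
    limit: Optional[int] = None,
    timeout: Optional[int] = None,
) -> dict[str, Any]:
    """Assemble an arguments object for web-scraping_rss."""
    if action not in RSS_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    args: dict[str, Any] = {"action": action, "url": url}

    if action == "search":
        # Client-side choice: the table marks query as optional,
        # but a search without one is most likely a mistake.
        if not query:
            raise ValueError("the 'search' action expects a query")
        args["query"] = query
        if case_sensitive is not None:
            args["caseSensitive"] = case_sensitive

    # count is documented as used with "get_latest" only.
    if action == "get_latest" and count is not None:
        args["count"] = count

    if limit is not None:
        args["limit"] = limit
    if timeout is not None:
        args["timeout"] = timeout  # milliseconds
    return args


search_args = build_rss_args("search", "https://example.com/feed.xml", query="release", limit=5)
latest_args = build_rss_args("get_latest", "https://example.com/feed.xml", count=3)
```
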
Response fields:

| Field | Type | Description |
| --- | --- | --- |
| action | string | The action that was performed |
| result | object | Result payload (structure varies by action) |

Response for "check" action:

| Field | Type | Description |
| --- | --- | --- |
| result.valid | boolean | Whether the URL is a valid RSS/Atom feed |
| result.title | string | Feed title |
| result.description | string | Feed description |
| result.link | string | Feed website link |

Response for "get", "search", and "get_latest" actions:

| Field | Type | Description |
| --- | --- | --- |
| result.items | object[] | Array of feed items |
| result.items[].title | string | Item title |
| result.items[].link | string | Item URL |
| result.items[].pubDate | string | Publication date |
| result.items[].content | string | Item content or summary |
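
Since the result payload changes shape with the action, response handling has to branch on the action field. The sketch below shows one way to do that on hand-written samples. The function name `summarize_rss` is hypothetical, all sample values are invented, and pubDate is kept as an opaque string because its format is not specified here.

```python
from typing import Any


def summarize_rss(response: dict[str, Any]) -> list[str]:
    """Turn a web-scraping_rss response into printable lines, branching on the action."""
    result = response["result"]
    if response["action"] == "check":
        state = "valid" if result["valid"] else "invalid"
        return [f"{state} feed: {result.get('title', '(no title)')}"]
    # "get", "search", and "get_latest" all return result.items.
    return [
        f"{item['pubDate']} | {item['title']} | {item['link']}"
        for item in result["items"]
    ]


# Hand-written samples in the documented shapes; all values are invented.
check_response = {
    "action": "check",
    "result": {
        "valid": True,
        "title": "Example Blog",
        "description": "Posts from the example team",
        "link": "https://example.com",
    },
}
latest_response = {
    "action": "get_latest",
    "result": {
        "items": [
            {
                "title": "Release notes",
                "link": "https://example.com/blog/release-notes",
                "pubDate": "2024-01-15",
                "content": "Summary of the release.",
            }
        ]
    },
}

check_lines = summarize_rss(check_response)
latest_lines = summarize_rss(latest_response)
print("\n".join(check_lines + latest_lines))
```
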