Server path: /web-scraping | Type: Embedded | PCID required: No

Tools

| Tool | Description |
|---|---|
| web-scraping_scrape | Scrape content from one or more web pages. Returns clean markdown, HTML, or structured data. Supports browser actions like screenshots, clicks, and scrolling for dynamic content. Use this for extracting content from specific URLs. |
| web-scraping_crawl | Crawl a website starting from one or more URLs to discover and scrape multiple pages. Follows links within the site with configurable depth limits and path filtering. Use this to extract content from entire websites or specific sections. |
| web-scraping_map | Generate a map of all URLs on a website without scraping content. Discovers pages via links and sitemap. Use this to understand site structure, find specific pages, or plan what to crawl/scrape. |
| web-scraping_rss | Read and parse RSS/Atom feeds from URLs. Supports checking feed validity, fetching all items, searching items by content, and getting the latest items sorted by date. |

web-scraping_scrape

Scrape content from one or more web pages. Returns clean markdown, HTML, or structured data. Supports browser actions like screenshots, clicks, and scrolling for dynamic content. Use this for extracting content from specific URLs.

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | — | Array of URLs to scrape. Can be full URLs or bare domain names such as `google.com` |
| formats | string[] | No | `"markdown"` | Output formats: `"markdown"`, `"html"`, `"rawHtml"`, `"links"`, `"summary"` |
| onlyMainContent | boolean | No | `true` | Extract only the main content, excluding headers, footers, and navigation |
| removeBase64Images | boolean | No | `true` | Remove base64-encoded images from the output |
| waitFor | number | No | — | Milliseconds to wait before scraping. Use for pages whose dynamic content loads after the initial render. Example: `2000` for 2 seconds |
| actions | object[] | No | — | Browser actions to perform before scraping, executed in order. Examples: wait `{"type": "wait", "milliseconds": 2000}`; click a button `{"type": "click", "selector": "button.load-more"}`; scroll down `{"type": "scroll", "selector": "body", "direction": "down"}`; type in an input `{"type": "write", "selector": "#search", "text": "search query"}`; press Enter `{"type": "press", "key": "Enter"}`; take a screenshot `{"type": "screenshot", "fullPage": true}` |
| includeTags | string[] | No | — | HTML tags to include (e.g., `["div", "p", "h1"]`) |
| excludeTags | string[] | No | — | HTML tags to exclude (e.g., `["script", "style"]`) |
| location | object | No | — | Location/language settings for geo-specific content |

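
The parameters above can be combined into a single arguments object. The following is a minimal sketch, assuming the tool accepts a JSON object keyed by the parameter names in the table; `build_scrape_args` is a hypothetical helper, not part of the server, and how the object is actually sent depends on your client.

```python
import json


def build_scrape_args(urls, formats=None, wait_for_ms=None, actions=None):
    """Hypothetical helper: assemble arguments for web-scraping_scrape."""
    if not urls:
        raise ValueError("urls is required and must list at least one URL")
    args = {"urls": list(urls)}
    # Optional parameters are omitted when unset so the documented defaults apply.
    if formats is not None:
        args["formats"] = list(formats)
    if wait_for_ms is not None:
        args["waitFor"] = wait_for_ms
    if actions is not None:
        args["actions"] = list(actions)
    return args


# Click a "load more" button, pause, then capture a full-page screenshot.
scrape_args = build_scrape_args(
    ["https://example.com/products"],
    formats=["markdown", "links"],
    actions=[
        {"type": "click", "selector": "button.load-more"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "screenshot", "fullPage": True},
    ],
)
print(json.dumps(scrape_args, indent=2))
```

Because actions execute in order, the wait is placed after the click so that newly loaded content has time to render before the screenshot.
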
web-scraping_crawl

Crawl a website starting from one or more URLs to discover and scrape multiple pages. Follows links within the site with configurable depth limits and path filtering. Use this to extract content from entire websites or specific sections.

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | — | Starting URLs to crawl from. Can be full URLs or bare domain names such as `google.com` |
| limit | number | No | 10 | Maximum number of pages to crawl |
| maxDepth | number | No | — | Maximum link depth to follow from the starting URL |
| includePaths | string[] | No | — | Only crawl URLs matching these glob patterns (e.g., `["/blog/*"]`) |
| excludePaths | string[] | No | — | Skip URLs matching these glob patterns (e.g., `["/admin/*"]`) |
| allowExternalLinks | boolean | No | — | Allow crawling external domains |
| allowSubdomains | boolean | No | — | Include subdomains in the crawl |
| scrapeOptions | object | No | — | Options to apply when scraping each crawled page |

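
A crawl request has more knobs than a scrape, so it can help to check an arguments object against the table before sending it. The sketch below is a hypothetical client-side check, not server behavior. It assumes `limit` and `maxDepth` are whole numbers, and that `scrapeOptions` accepts the same fields as web-scraping_scrape, which the table does not confirm.

```python
# Expected Python types per parameter, transcribed from the table above.
CRAWL_PARAM_TYPES = {
    "urls": list,
    "limit": int,
    "maxDepth": int,
    "includePaths": list,
    "excludePaths": list,
    "allowExternalLinks": bool,
    "allowSubdomains": bool,
    "scrapeOptions": dict,
}


def check_crawl_args(args):
    """Hypothetical client-side check; returns a list of problems found."""
    problems = []
    if not args.get("urls"):
        problems.append("urls is required")
    for name, value in args.items():
        expected = CRAWL_PARAM_TYPES.get(name)
        if expected is None:
            problems.append(f"unknown parameter: {name}")
        elif expected is int and isinstance(value, bool):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"{name} should be a number, not a boolean")
        elif not isinstance(value, expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems


# Crawl only the blog section, two links deep, capped at 25 pages.
crawl_args = {
    "urls": ["https://example.com"],
    "limit": 25,
    "maxDepth": 2,
    "includePaths": ["/blog/*"],
    "excludePaths": ["/admin/*"],
    "allowSubdomains": False,
    # Assumption: scrapeOptions mirrors the scrape tool's parameters.
    "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
}
print(check_crawl_args(crawl_args))  # prints []
```

Setting `limit` explicitly is worthwhile here, since the default of 10 pages can cut a section crawl short.
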
web-scraping_map

Generate a map of all URLs on a website without scraping content. Discovers pages via links and sitemap. Use this to understand site structure, find specific pages, or plan what to crawl/scrape.

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | — | Starting URLs to map from. Can be full URLs or bare domain names such as `google.com` |
| search | string | No | — | Filter results to URLs containing this search term |
| limit | number | No | 100 | Maximum number of URLs to return |
| includeSubdomains | boolean | No | — | Include subdomains in the map |
| sitemap | string | No | `"include"` | Sitemap usage: `"include"`, `"skip"` (ignore the sitemap), or `"only"` (use only the sitemap) |

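
One way to use the map for planning is to map first, then split the discovered URLs into batches of scrape arguments. The sketch below assumes the map result can be reduced to a plain list of URL strings; the real response shape is not documented on this page, so `mapped_urls` is a stand-in and `plan_scrape_batches` is a hypothetical helper.

```python
def plan_scrape_batches(mapped_urls, batch_size=5):
    """Hypothetical helper: turn mapped URLs into web-scraping_scrape arguments."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique_urls = list(dict.fromkeys(mapped_urls))
    return [
        {"urls": unique_urls[start:start + batch_size]}
        for start in range(0, len(unique_urls), batch_size)
    ]


# Arguments for web-scraping_map: find pricing pages using only the sitemap.
map_args = {
    "urls": ["https://example.com"],
    "search": "pricing",
    "limit": 50,
    "sitemap": "only",
}

# Stand-in for the tool's result (assumed to be a list of URL strings).
mapped_urls = [
    "https://example.com/pricing",
    "https://example.com/pricing/enterprise",
    "https://example.com/pricing",
]

for batch in plan_scrape_batches(mapped_urls, batch_size=1):
    print(batch)
```

Each printed dictionary can serve as the `urls` portion of a scrape request, with formats or actions added as needed.
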
web-scraping_rss

Read and parse RSS/Atom feeds from URLs. Supports checking feed validity, fetching all items, searching items by content, and getting the latest items sorted by date.

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| action | string | Yes | — | Action to perform: `"check"` (validate the feed and get basic info), `"get"` (fetch all feed items), `"search"` (search items by query), `"get_latest"` (get the most recent items by date) |
| url | string | Yes | — | The URL of the RSS/Atom feed |
| timeout | number | No | 10000 | Request timeout in milliseconds |
| limit | number | No | — | For `"get"` and `"search"` actions: maximum number of items to return |
| query | string | No | — | For `"search"` action: search query to match against item title, description, or content |
| caseSensitive | boolean | No | `false` | For `"search"` action: whether the search is case-sensitive |
| count | number | No | 10 | For `"get_latest"` action: number of latest items to return |
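
Several of these parameters apply only to specific actions, which is easy to get wrong when building requests by hand. The sketch below encodes the pairing from the table as a hypothetical client-side check. Treating `timeout` as valid for every action is an assumption, and the server may well tolerate extra parameters rather than reject them.

```python
# Optional parameters grouped by the action they apply to, per the table above.
# Assumption: timeout is accepted by every action.
ALLOWED_OPTIONS = {
    "check": {"timeout"},
    "get": {"timeout", "limit"},
    "search": {"timeout", "limit", "query", "caseSensitive"},
    "get_latest": {"timeout", "count"},
}


def build_rss_args(action, url, **options):
    """Hypothetical helper: assemble arguments for web-scraping_rss."""
    if action not in ALLOWED_OPTIONS:
        raise ValueError(f"unknown action: {action}")
    unexpected = set(options) - ALLOWED_OPTIONS[action]
    if unexpected:
        raise ValueError(
            f"{sorted(unexpected)} not documented for action '{action}'"
        )
    return {"action": action, "url": url, **options}


# Search a feed for up to five items mentioning "release".
search_args = build_rss_args(
    "search",
    "https://example.com/feed.xml",
    query="release",
    limit=5,
)
print(search_args)

# Fetch the three most recent items with a shorter timeout.
latest_args = build_rss_args(
    "get_latest",
    "https://example.com/feed.xml",
    count=3,
    timeout=5000,
)
print(latest_args)
```

Note that `limit` and `count` are not interchangeable: `limit` caps results for `"get"` and `"search"`, while `count` sets how many items `"get_latest"` returns.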

