| title | Advanced Scraping Guide |
|---|---|
| description | Learn how to improve your Firecrawl scraping with advanced options. |
| og:title | Advanced Scraping Guide \| Firecrawl |
| og:description | Learn how to improve your Firecrawl scraping with advanced options. |
This guide walks you through Firecrawl's endpoints and shows how to use each one with its full set of parameters.
To scrape a single page and get clean markdown content, you can use the /scrape endpoint.
# pip install firecrawl-py
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")
content = app.scrape_url("https://docs.firecrawl.dev")

// npm install @mendable/firecrawl-js
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: 'YOUR_API_KEY' });
const content = await app.scrapeUrl('https://docs.firecrawl.dev');

// go get github.com/mendableai/firecrawl-go
package main

import (
    "fmt"
    "log"

    "github.com/mendableai/firecrawl-go"
)

func main() {
    app, err := firecrawl.NewFirecrawlApp("YOUR_API_KEY")
    if err != nil {
        log.Fatalf("Failed to initialize FirecrawlApp: %v", err)
    }

    // Pass the full URL, including the scheme
    content, err := app.ScrapeURL("https://docs.firecrawl.dev", nil)
    if err != nil {
        log.Fatalf("Failed to scrape URL: %v", err)
    }
    fmt.Println(content)
}

// Install the firecrawl_rs crate with Cargo
use firecrawl_rs::FirecrawlApp;

#[tokio::main]
async fn main() {
    // Initialize the FirecrawlApp with the API key
    let api_key = "YOUR_API_KEY";
    let api_url = "https://api.firecrawl.dev";
    let app = FirecrawlApp::new(api_key, api_url).expect("Failed to initialize FirecrawlApp");

    let scrape_result = app.scrape_url("https://docs.firecrawl.dev", None).await;
    match scrape_result {
        Ok(data) => println!("Scrape Result:\n{}", data["markdown"]),
        Err(e) => eprintln!("Scrape failed: {}", e),
    }
}

curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'

Firecrawl supports scraping PDFs by default. You can use the /scrape endpoint to scrape a PDF link and get the text content of the PDF. You can disable this by setting parsePDF to false.
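As a minimal sketch of the parsePDF switch, the helper below builds a /scrape request body with PDF parsing turned off. The function name build_pdf_scrape_body is hypothetical, and placing parsePDF at the top level of the request body is an assumption based on the sentence above; confirm it against the API reference before relying on it.

```python
import json


def build_pdf_scrape_body(url, parse_pdf=True):
    """Build a /scrape request body for a URL (hypothetical helper).

    Assumption: parsePDF is a top-level field of the request body.
    """
    body = {"url": url}
    if not parse_pdf:
        # Only send the flag when overriding the default (PDF parsing on).
        body["parsePDF"] = False
    return body


if __name__ == "__main__":
    body = build_pdf_scrape_body("https://example.com/report.pdf", parse_pdf=False)
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the JSON body of the same POST request shown in the curl example.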
When using the /scrape endpoint, you can customize the scraping behavior with the following parameters:

formats
- Type: array
- Enum: ["markdown", "links", "html", "rawHtml", "screenshot", "json"]
- Description: Specify the formats to include in the response. Options include:
  - markdown: Returns the scraped content in Markdown format.
  - links: Includes all hyperlinks found on the page.
  - html: Provides the content in HTML format.
  - rawHtml: Delivers the raw HTML content, without any processing.
  - screenshot: Includes a screenshot of the page as it appears in the browser.
  - json: Extracts structured information from the page using the LLM.
- Default: ["markdown"]

onlyMainContent
- Type: boolean
- Description: By default, the scraper will only return the main content of the page, excluding headers, navigation bars, footers, etc. Set this to false to return the full page content.
- Default: true

includeTags
- Type: array
- Description: Specify the HTML tags, classes and ids to include in the response.
- Default: undefined

excludeTags
- Type: array
- Description: Specify the HTML tags, classes and ids to exclude from the response.
- Default: undefined

waitFor
- Type: integer
- Description: Use only as a last resort. Waits the specified number of milliseconds for the page to load before fetching content.
- Default: 0

timeout
- Type: integer
- Description: Set the maximum duration in milliseconds that the scraper will wait for the page to respond before aborting the operation.
- Default: 30000 (30 seconds)
curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "formats": ["markdown", "links", "html", "rawHtml", "screenshot"],
      "includeTags": ["h1", "p", "a", ".main-content"],
      "excludeTags": ["#ad", "#footer"],
      "onlyMainContent": false,
      "waitFor": 1000,
      "timeout": 15000
    }'

In this example, the scraper will:
- Return the full page content as markdown.
- Include the markdown, raw HTML, HTML, links, and screenshot in the response.
- Include only the HTML tags <h1>, <p>, <a>, and elements with the class .main-content, while excluding any elements with the IDs #ad and #footer.
- Wait for 1000 milliseconds (1 second) for the page to load before fetching the content.
- Set the maximum duration of the scrape request to 15000 milliseconds (15 seconds).
Here is the API Reference for it: Scrape Endpoint Documentation
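The scrape options above can also be assembled programmatically. The following is a minimal sketch, not part of any Firecrawl SDK: build_scrape_body is a hypothetical helper, its keyword defaults mirror the defaults listed in this guide, and the validation rules are assumptions added for illustration.

```python
# Format names taken from the Enum list in this guide.
ALLOWED_FORMATS = {"markdown", "links", "html", "rawHtml", "screenshot", "json"}


def build_scrape_body(url, formats=None, only_main_content=True,
                      include_tags=None, exclude_tags=None,
                      wait_for=0, timeout=30000):
    """Assemble a /scrape request body (hypothetical helper)."""
    formats = list(formats) if formats else ["markdown"]
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"Unsupported formats: {unknown}")
    if wait_for < 0 or timeout <= 0:
        raise ValueError("waitFor must be >= 0 and timeout must be > 0")

    body = {
        "url": url,
        "formats": formats,
        "onlyMainContent": only_main_content,
        "waitFor": wait_for,
        "timeout": timeout,
    }
    # includeTags and excludeTags default to undefined, so omit them when unset.
    if include_tags:
        body["includeTags"] = list(include_tags)
    if exclude_tags:
        body["excludeTags"] = list(exclude_tags)
    return body
```

Rejecting unknown format names locally is a design choice of this sketch; it surfaces typos before a request is made.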
When using the /scrape endpoint, you can specify options for extracting structured information from the page content using the extract parameter. Here are the available options:

schema
- Type: object
- Required: False if prompt is provided
- Description: The schema for the data to be extracted. This defines the structure of the extracted data.

systemPrompt
- Type: string
- Required: False
- Description: System prompt for the LLM.

prompt
- Type: string
- Required: False if schema is provided
- Description: A prompt for the LLM to extract the data in the correct structure.
- Example: "Extract the features of the product"
curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://firecrawl.dev",
      "formats": ["markdown", "json"],
      "json": {
        "prompt": "Extract the features of the product"
      }
    }'

{
  "success": true,
  "data": {
    "content": "Raw Content",
    "metadata": {
      "title": "Mendable",
      "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "robots": "follow, index",
      "ogTitle": "Mendable",
      "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "ogUrl": "https://docs.firecrawl.dev/",
      "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Mendable",
      "sourceURL": "https://docs.firecrawl.dev/",
      "statusCode": 200
    },
    "extract": {
      "product": "Firecrawl",
      "features": {
        "general": {
          "description": "Turn websites into LLM-ready data.",
          "openSource": true,
          "freeCredits": 500,
          "useCases": [
            "AI applications",
            "Data science",
            "Market research",
            "Content aggregation"
          ]
        },
        "crawlingAndScraping": {
          "crawlAllAccessiblePages": true,
          "noSitemapRequired": true,
          "dynamicContentHandling": true,
          "dataCleanliness": {
            "process": "Advanced algorithms",
            "outputFormat": "Markdown"
          }
        },
        ...
      }
    }
  }
}

When using the /scrape endpoint, Firecrawl allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.

wait
- Type: object
- Description: Wait for a specified number of milliseconds.
- Properties:
  - type: "wait"
  - milliseconds: Number of milliseconds to wait.
- Example: { "type": "wait", "milliseconds": 2000 }

screenshot
- Type: object
- Description: Take a screenshot.
- Properties:
  - type: "screenshot"
  - fullPage: Should the screenshot be full-page or viewport sized? (default: false)
- Example: { "type": "screenshot", "fullPage": true }

click
- Type: object
- Description: Click on an element.
- Properties:
  - type: "click"
  - selector: Query selector to find the element by.
- Example: { "type": "click", "selector": "#load-more-button" }

write
- Type: object
- Description: Write text into an input field.
- Properties:
  - type: "write"
  - text: Text to type.
  - selector: Query selector for the input field.
- Example: { "type": "write", "text": "Hello, world!", "selector": "#search-input" }

press
- Type: object
- Description: Press a key on the page.
- Properties:
  - type: "press"
  - key: Key to press.
- Example: { "type": "press", "key": "Enter" }

scroll
- Type: object
- Description: Scroll the page.
- Properties:
  - type: "scroll"
  - direction: Direction to scroll ("up" or "down").
  - amount: Amount to scroll in pixels.
- Example: { "type": "scroll", "direction": "down", "amount": 500 }
For more details about the actions parameters, refer to the API Reference.
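The action objects above compose naturally into an ordered list. The sketch below uses small hypothetical constructor functions (none of them come from a Firecrawl SDK) and assumes the list is sent under an actions key in the /scrape request body, which is inferred from the wording above rather than confirmed here.

```python
def wait(milliseconds):
    return {"type": "wait", "milliseconds": milliseconds}


def screenshot(full_page=False):
    return {"type": "screenshot", "fullPage": full_page}


def click(selector):
    return {"type": "click", "selector": selector}


def write(text, selector):
    return {"type": "write", "text": text, "selector": selector}


def press(key):
    return {"type": "press", "key": key}


def scroll(direction, amount):
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    return {"type": "scroll", "direction": direction, "amount": amount}


def build_actions_body(url, actions):
    """Attach an ordered action list to a /scrape request body.

    Assumption: the list is passed under the "actions" key.
    """
    return {"url": url, "formats": ["markdown"], "actions": list(actions)}


# Type a query, submit it, wait for results, scroll, then capture the page.
search_flow = [
    write("Hello, world!", "#search-input"),
    press("Enter"),
    wait(2000),
    scroll("down", 500),
    screenshot(full_page=True),
]
body = build_actions_body("https://example.com", search_flow)
```

Keeping each action as a plain dictionary means the list serializes directly to the JSON shapes shown in the examples above.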
To crawl multiple pages, you can use the /crawl endpoint. Specify a base URL, and Firecrawl crawls it along with all accessible subpages.

curl -X POST https://api.firecrawl.dev/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'

Returns an ID:

{ "id": "1234-5678-9101" }

Used to check the status of a crawl job and get its result.

curl -X GET https://api.firecrawl.dev/v1/crawl/1234-5678-9101 \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY'

If the content is larger than 10MB or if the crawl job is still running, the response will include a next parameter. This parameter is a URL to the next page of results; request it to retrieve that page.
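Following the next parameter can be sketched as a simple loop. The code below is illustrative only: fetch_json and collect_crawl_pages are hypothetical helpers, and the assumption that each response carries its results in a data list is not stated in this guide. The fetch function is injected so the loop can be exercised without network access.

```python
import json
import urllib.request


def fetch_json(url, api_key):
    """GET a URL with the bearer token and decode the JSON response."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_crawl_pages(status_url, fetch, max_pages=100):
    """Follow `next` links and gather the items from every page.

    Assumption: each response holds its results in a `data` list.
    max_pages caps the loop, since a running crawl may keep returning `next`.
    """
    items = []
    url = status_url
    for _ in range(max_pages):
        page = fetch(url)
        items.extend(page.get("data", []))
        url = page.get("next")
        if not url:
            break
    return items


# Usage (performs real requests, so it is left commented out):
# results = collect_crawl_pages(
#     "https://api.firecrawl.dev/v1/crawl/1234-5678-9101",
#     lambda url: fetch_json(url, "YOUR_API_KEY"),
# )
```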
When using the /crawl endpoint, you can customize the crawling behavior with request body parameters. Here are the available options:

includePaths
- Type: array
- Description: URL patterns to include in the crawl. Only URLs matching these patterns will be crawled.
- Example: ["/blog/*", "/products/*"]

excludePaths
- Type: array
- Description: URL patterns to exclude from the crawl. URLs matching these patterns will be skipped.
- Example: ["/admin/*", "/login/*"]

maxDepth
- Type: integer
- Description: Maximum depth to crawl relative to the entered URL. A maxDepth of 0 scrapes only the entered URL. A maxDepth of 1 scrapes the entered URL and all pages one level deep. A maxDepth of 2 scrapes the entered URL and all pages up to two levels deep. Higher values follow the same pattern.
- Example: 2

limit
- Type: integer
- Description: Maximum number of pages to crawl.
- Default: 10000

allowBackwardLinks
- Type: boolean
- Description: This option permits the crawler to navigate to URLs that are higher in the directory structure than the base URL. For instance, if the base URL is example.com/blog/topic, enabling this option allows crawling to pages like example.com/blog or example.com, which are backward in the path hierarchy relative to the base URL.
- Default: false

allowExternalLinks
- Type: boolean
- Description: This option allows the crawler to follow links that point to external domains. Be careful with this option, as it can cause the crawl to stop based only on the limit and maxDepth values.
- Default: false
As part of the crawler options, you can also specify the scrapeOptions parameter. This parameter allows you to customize the scraping behavior for each page.

scrapeOptions
- Type: object
- Description: Options for the scraper.
- Example: {"formats": ["markdown", "links", "html", "rawHtml", "screenshot"], "includeTags": ["h1", "p", "a", ".main-content"], "excludeTags": ["#ad", "#footer"], "onlyMainContent": false, "waitFor": 1000, "timeout": 15000}
- Default: { "formats": ["markdown"] }
- See: Scrape Options
curl -X POST https://api.firecrawl.dev/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "includePaths": ["/blog/*", "/products/*"],
      "excludePaths": ["/admin/*", "/login/*"],
      "maxDepth": 2,
      "limit": 1000
    }'

In this example, the crawler will:
- Only crawl URLs that match the patterns /blog/* and /products/*.
- Skip URLs that match the patterns /admin/* and /login/*.
- Return the full document data for each page.
- Crawl up to a maximum depth of 2.
- Crawl a maximum of 1000 pages.
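The same crawl request can be prepared with Python's standard library. This is a minimal sketch, assuming the v1 endpoint URL used in the curl examples; build_crawl_body and make_crawl_request are hypothetical helpers, and the range checks on maxDepth and limit are assumptions added for illustration. The request object is built but not sent.

```python
import json
import urllib.request

API_BASE = "https://api.firecrawl.dev/v1"


def build_crawl_body(url, include_paths=None, exclude_paths=None,
                     max_depth=None, limit=None, scrape_options=None):
    """Assemble a /crawl request body, omitting options left unset."""
    if max_depth is not None and max_depth < 0:
        raise ValueError("maxDepth must be 0 or greater")
    if limit is not None and limit < 1:
        raise ValueError("limit must be at least 1")
    optional = {
        "includePaths": include_paths,
        "excludePaths": exclude_paths,
        "maxDepth": max_depth,
        "limit": limit,
        "scrapeOptions": scrape_options,
    }
    body = {"url": url}
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


def make_crawl_request(body, api_key):
    """Wrap the body in a POST request without sending it."""
    return urllib.request.Request(
        f"{API_BASE}/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would submit the crawl; leaving unset options out of the body lets the server apply its own defaults.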
The /map endpoint identifies URLs that are contextually related to a given website. This helps you understand a site's link structure, which aids site analysis and navigation planning.

To use the /map endpoint, you need to send a POST request with the URL of the page you want to map. Here is an example using curl:

curl -X POST https://api.firecrawl.dev/v1/map \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'

This will return a JSON object containing links contextually related to the URL.
{
  "success": true,
  "links": [
    "https://docs.firecrawl.dev",
    "https://docs.firecrawl.dev/api-reference/endpoint/crawl-delete",
    "https://docs.firecrawl.dev/api-reference/endpoint/crawl-get",
    "https://docs.firecrawl.dev/api-reference/endpoint/crawl-post",
    "https://docs.firecrawl.dev/api-reference/endpoint/map",
    "https://docs.firecrawl.dev/api-reference/endpoint/scrape",
    "https://docs.firecrawl.dev/api-reference/introduction",
    "https://docs.firecrawl.dev/articles/search-announcement",
    ...
  ]
}

search
- Type: string
- Description: Search for links containing specific text.
- Example: "blog"

limit
- Type: integer
- Description: Maximum number of links to return.
- Default: 100

ignoreSitemap
- Type: boolean
- Description: Ignore the website sitemap when crawling.
- Default: true

includeSubdomains
- Type: boolean
- Description: Include subdomains of the website.
- Default: false
Here is the API Reference for it: Map Endpoint Documentation
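A /map call and its response can be handled with two small helpers. This is a sketch under stated assumptions: build_map_body and links_from_map_response are hypothetical names, the request keys search, limit, ignoreSitemap, and includeSubdomains are assumed spellings for the four options described above, and the response shape follows the sample JSON in this guide.

```python
def build_map_body(url, search=None, limit=100,
                   ignore_sitemap=True, include_subdomains=False):
    """Assemble a /map request body using the defaults listed above."""
    body = {
        "url": url,
        "limit": limit,
        "ignoreSitemap": ignore_sitemap,
        "includeSubdomains": include_subdomains,
    }
    if search:
        # Only narrow the results when a search term is given.
        body["search"] = search
    return body


def links_from_map_response(response):
    """Return the list of links, or raise if the call did not succeed."""
    if not response.get("success"):
        raise RuntimeError("map request did not succeed")
    return list(response.get("links", []))


sample_response = {
    "success": True,
    "links": [
        "https://docs.firecrawl.dev",
        "https://docs.firecrawl.dev/api-reference/endpoint/map",
    ],
}
links = links_from_map_response(sample_response)
```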