Headless browsers

Learn how to scrape the web with a headless browser using only a few lines of code. Chrome, Firefox, Safari, Edge - all are supported.

A headless browser is simply a browser that runs without a user interface (UI). This means that it's normally controlled by automated scripts. Headless browsers are very popular in scraping because they can help you render JavaScript or programmatically behave like a human user to prevent blocking. The two most popular libraries for controlling headless browsers are Puppeteer and Playwright. Crawlee supports both.
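
To make the idea concrete, here's a minimal sketch of driving a headless browser with plain Playwright, no Crawlee involved yet (assuming Playwright is installed and the script runs as an ES module):

import { chromium } from 'playwright';

// Launch a browser with no visible window, open a page,
// and read its title - all controlled purely from code.
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com');
console.log(await page.title());
await browser.close();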

Building a Playwright scraper

We'll be focusing on Playwright in this lesson, as it was developed by the same team that created Puppeteer, and it's newer, with more features and better documentation.

Building a Playwright scraper with Crawlee is extremely easy. To show just how easy, we'll reuse the Cheerio scraper code from the previous lesson. By changing only a few lines, we'll turn it into a full headless scraper.

First, we need to install Playwright in our project:

npm install --save playwright

After Playwright installs, we can proceed with updating the scraper code. As always, the comments describe changes in the code. Everything else is the same as before.

// crawlee.js
import { PlaywrightCrawler, Dataset } from 'crawlee';
// Don't forget to import cheerio; we will need it later.
import * as cheerio from 'cheerio';

// Replace CheerioCrawler with PlaywrightCrawler
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        // Here, we extract the HTML from the browser and parse
        // it with Cheerio. Thanks to that we can use exactly
        // the same code as before, when using CheerioCrawler.
        const $ = cheerio.load(await page.content());

        if (request.userData.label === 'START') {
            await enqueueLinks({
                selector: 'a[href*="/product/"]',
                baseUrl: new URL(request.url).origin,
            });

            // When on the START page, we don't want to
            // extract any data after we extract the links.
            return;
        }

        // We copied and pasted the extraction code
        // from the previous lesson
        const title = $('h3').text().trim();
        const price = $('h3 + div').text().trim();
        const description = $('div[class*="Text_body"]').text().trim();

        // Because we're using a browser, we can now access
        // dynamically loaded data. Our target site has
        // dynamically loaded images.
        const imageRelative = $('img[alt="Product Image"]').attr('src');
        const base = new URL(request.url).origin;
        const image = new URL(imageRelative, base).href;

        // Instead of saving the data to a variable,
        // we immediately push everything to the dataset.
        await Dataset.pushData({
            title,
            description,
            price,
            image,
        });
    },
});

await crawler.addRequests([{
    url: 'https://demo-webstore.apify.org/search/on-sale',
    // By labeling the Request, we can very easily
    // identify it later in the requestHandler.
    userData: {
        label: 'START',
    },
}]);

await crawler.run();

Yup, that's it. To quickly recap, we added 2 lines and changed 1 line of code to transform our crawler from a static HTTP request crawler to a headless-browser crawler. The scraper now runs exactly the same as before, but using a full Chromium browser instead of plain HTTP requests and Cheerio. This is a taste of the true power of Crawlee.

Notice that we are also scraping a new piece of data - image. We were unable to access this content before with Cheerio, because it's loaded dynamically by JavaScript. If you're confused about the differences between PlaywrightCrawler/PuppeteerCrawler and CheerioCrawler, and why you might choose one over the other, give this short article about dynamic pages a quick read.

Using Playwright in combination with Cheerio like this is only one of many ways to utilize Playwright (and Puppeteer) with Crawlee. In the advanced courses of the Academy, we will go deeper into using headless browsers for scraping and web automation (RPA) use cases.
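
For example, instead of handing the rendered HTML to Cheerio, the requestHandler could query the live page directly through Playwright's locator API. Here's a sketch of what the title and price extraction might look like in that style (an illustration, not a drop-in replacement for the scraper above):

requestHandler: async ({ page }) => {
    // Query the live, rendered DOM instead of a static HTML snapshot.
    const title = (await page.locator('h3').first().textContent())?.trim();
    const price = (await page.locator('h3 + div').first().textContent())?.trim();
    // ...
},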

Running in headless mode

We said that headless browsers run without a UI, yet while running the scraper above, your screen was most likely full of browser tabs. That's because in Crawlee, browsers run headful (with a UI) by default. This is useful for debugging and seeing what's going on in the browser. Once the scraper is complete, we can switch it to headless mode using one of two methods:

Environment variable

This is the programmer's preferred solution, because you don't have to change the source code to change the behavior. All you need to do is set the CRAWLEE_HEADLESS=1 environment variable when running your scraper, and it will automatically run headless. For example:

CRAWLEE_HEADLESS=1 node crawlee.js
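
The command above assumes a Unix-like shell. On Windows, the equivalent in PowerShell would look like this:

$env:CRAWLEE_HEADLESS=1; node crawlee.js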

Scraper code

If you always want the scraper to run headless, it might be better to hardcode it in the source. To do that, we need to access Playwright's launch options. In Crawlee we can do that in the PlaywrightCrawler constructor.

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true, // setting headless in code
        },
    },
    requestHandler: async ({ page, request }) => {
        // ...
    },
    // ...
});

launchContext and launchOptions give you fine-grained control over how Crawlee launches browsers.
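
For example, launchContext is also where you can swap the default Chromium for another browser. Here's a minimal sketch, assuming the firefox browser that ships with the playwright package:

import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use Playwright's Firefox instead of the default Chromium.
        launcher: firefox,
        launchOptions: {
            headless: true,
        },
    },
    requestHandler: async ({ page, request }) => {
        // ...
    },
});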

Next up

We learned how to scrape with Cheerio and Playwright, but how do we process the scraped data? Let's learn that in the next lesson.