Skip to main content

Extracting data

Learn how to extract data from a page with evaluate functions, then how to parse it by using a second library called Cheerio.


Now that we know how to execute scripts on a page, we're ready to learn a bit about data extraction. In this lesson, we'll be scraping all the on-sale products from our Fakestore website.

Most web data extraction cases involve looping through a list of items of some sort.

There are two main ways to extract data with Playwright and Puppeteer:

  1. Directly in page.evaluate() and other evaluate functions such as page.$$eval().
  2. In the Node.js context using a parsing library such as Cheerio

Setup

Here is the base set up for our code, upon which we'll be building off of in this lesson:

import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://demo-webstore.apify.org/search/on-sale');

// code will go here

await page.waitForTimeout(10000);

await browser.close();

Extracting from the browser context

Whatever is returned from the callback function in page.evaluate() will be returned by the evaluate function, which means that we can set it to a variable like so:

const products = await page.evaluate(() => ({ foo: 'bar' }));

console.log(products); // -> { foo: 'bar' }

We'll be returning a bunch of product objects from this function, which will be accessible back in our Node.js context after the promise has resolved. Let's now go ahead and write some data extraction code to collect each product:

const products = await page.evaluate(() => {
const productCards = Array.from(document.querySelectorAll('a[class*="ProductCard_root"]'));

return productCards.map((element) => {
const name = element.querySelector('h3[class*="ProductCard_name"]').textContent;
const price = element.querySelector('div[class*="ProductCard_price"]').textContent;

return {
name,
price,
};
});
});

console.log(products);

When we run this code, we see this logged to our console:

Products logged to the console

Using jQuery

Working with document.querySelector is cumbersome and quite verbose, but with the page.addScriptTag() function and the latest jQuery CDN link, we can inject jQuery into the current page to gain access to its syntactical sweetness:

await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.6.0.min.js' });

This function will literally append a <script> tag to the <head> element of the current page, allowing access to jQuery's API when using page.evaluate() to run code in the browser context.

Now, since we're able to use jQuery, let's translate our vanilla JavaScript code within the page.evaluate() function to jQuery:

await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.6.0.min.js' });

const products = await page.evaluate(() => {
const productCards = Array.from($('a[class*="ProductCard_root"]'));

return productCards.map((element) => {
const card = $(element);

const name = card.find('h3[class*="ProductCard_name"]').text();
const price = card.find('div[class*="ProductCard_price"]').text();

return {
name,
price,
};
});
});

console.log(products);

This will output the same exact result as the code in the previous section.

Parsing in the Node.js context

One of the most popular parsing libraries for Node.js is Cheerio, which can be used in tandem with Playwright and Puppeteer. It is extremely beneficial to parse the page's HTML in the Node.js context for a number of reasons:

  • You can easily port the code between headless browser data extraction and plain HTTP data extraction
  • You don't have to worry in which context you're working (which can sometimes be confusing)
  • Errors are easier to handle when running in the base Node.js context

To install it, we can run the following command within your project's directory:

npm install cheerio

Then, we'll import the load function like so:

import { load } from 'cheerio';

Finally, we can create a Cheerio object based on our page's current content like so:

const $ = load(await page.content());

It's important to note that this $ object is static. If any content on the page changes, the $ variable will not automatically be updated. It will need to be re-declared or re-defined.

Here's our full code so far:

import { chromium } from 'playwright';
import { load } from 'cheerio';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://demo-webstore.apify.org/search/on-sale');

const $ = load(await page.content());

// code will go here

await browser.close();

Now, to loop through all of the products, we'll make use of the $ object and loop through them while safely in the server-side context rather than running the code in the browser. Notice that this code is nearly exactly the same as the jQuery code above - it is just not running inside of a page.evaluate() in the browser context.

const $ = load(await page.content());

const productCards = Array.from($('a[class*="ProductCard_root"]'));

const products = productCards.map((element) => {
const card = $(element);

const name = card.find('h3[class*="ProductCard_name"]').text();
const price = card.find('div[class*="ProductCard_price"]').text();

return {
name,
price,
};
});

console.log(products);

Final code

Here's what our final optimized code looks like:

import { chromium } from 'playwright';
import { load } from 'cheerio';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://demo-webstore.apify.org/search/on-sale');

const $ = load(await page.content());

const productCards = Array.from($('a[class*="ProductCard_root"]'));

const products = productCards.map((element) => {
const card = $(element);

const name = card.find('h3[class*="ProductCard_name"]').text();
const price = card.find('div[class*="ProductCard_price"]').text();

return {
name,
price,
};
});

console.log(products);

await browser.close();

Next up

Our next lesson will be discussing something super cool - request interception and reading data from requests and responses. It's just like using DevTools, except programmatically!