Finding links

Learn what links look like in HTML and how to find and collect their URLs when web scraping, using both DevTools and Node.js.

There are many kinds of links on the internet, which we'll cover in the advanced Academy courses. For now, let's think of links as HTML anchor elements with <a> tags. A typical link looks like this:

<a href="https://example.com">This is a link to example.com</a>

On a webpage, the link above will look like this: "This is a link to example.com". When you click it, your browser will navigate to the URL in the <a> tag's href attribute (https://example.com).

href stands for Hypertext REFerence. You don't need to remember this - just know that href typically holds some sort of link.
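
Also note that the value of href doesn't have to be a full URL. For illustration (these example paths are made up), a link can just as well be relative to the current page:

<!-- An absolute URL: -->
<a href="https://example.com/about">About</a>
<!-- A relative URL, resolved against the page you're currently on: -->
<a href="/about">About</a>

We'll run into relative links like these again at the end of this lesson.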

So, if a link is just an HTML element and its URL is just an attribute, we can collect links exactly the same way as we collected data. 💡 Easy!

To test this theory in the browser, we can try running the following code in our DevTools console on any website.

// Select all the <a> elements.
const links = document.querySelectorAll('a');
// For each of the links...
for (const link of links) {
    // get the value of its 'href' attribute...
    const url = link.href;
    // and print it to console.
    console.log(url);
}

Boom 💥, all the links from the website have now been printed to the console. Unless you ran the code on example.com, that's usually a lot of links. By doing this, we can get a first-hand look at how interconnected the web really is.
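
As a quick aside, the same collection can be written as a single expression. This is just a compact variation of the loop above, nothing new:

// Build an array of URLs from all the <a> elements in one go.
const urls = Array.from(document.querySelectorAll('a'), (a) => a.href);
console.log(urls);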

DevTools is a fun playground, but Node.js is way more useful. Let's create a new file in our project called crawler.js and start adding some basic crawling code. We'll start with the same boilerplate as with our original scraper, but this time, we'll download the HTML of the demo site's main page.

// crawler.js
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';

// This time, we download the demo site's main page.
const response = await gotScraping('https://demo-webstore.apify.org/');
const html = response.body;

const $ = cheerio.load(html);

const links = $('a');

for (const link of links) {
    const url = $(link).attr('href');
    console.log(url);
}

Aside from importing libraries and downloading the HTML, we loaded the HTML into Cheerio and then used it to retrieve all the <a> elements. After that, we iterated over the collected links and printed their href attributes, which we accessed with the .attr() function. Remember, Cheerio's functions work the same way as their jQuery counterparts.
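
Since Cheerio mirrors jQuery, the loop can also be written with jQuery-style .map(). This is just an alternative sketch that reuses the $ from the snippet above; use whichever form you find more readable:

// .map() returns a Cheerio object; .get() converts it to a plain array.
const urls = $('a')
    .map((i, el) => $(el).attr('href'))
    .get();
console.log(urls);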

Next Up

After running the code, you will see quite a lot of links in the terminal. Some of them may look weird because they don't start with the regular https:// protocol; these are typically relative links like the ones shown earlier. We'll learn what to do with them in the next lesson.