JavaScript spider examples

This article introduces two ways to crawl web pages with JavaScript. The first is a lightweight approach that uses Axios and Cheerio. The second uses Puppeteer to control a real browser and simulate human actions. All the tools listed here are free.

  1. Axios (an HTTP client) and Cheerio (a lightweight implementation of core jQuery, so you can use jQuery-style selectors) are two small Node libraries that can be used together as a spider.
  2. Puppeteer is also a Node library; it controls Chrome or Chromium. Because this approach uses a real browser, most things that you can do manually in the browser can be automated with Puppeteer. It also tends to cope better with websites that have adopted anti-spider strategies, although it is not guaranteed to get past them.

Method 1 — Use Axios and Cheerio

Use Axios (a simple HTTP client) to fetch an HTML page and Cheerio (a lightweight version of jQuery) to parse the resulting data.

About Axios and Cheerio

  • Axios

    A simple promise-based HTTP client for the browser and Node.js.

    Its GitHub address: https://github.com/axios/axios

  • Cheerio

    Cheerio is a Node library which implements a subset of core jQuery. You can use it much like jQuery.

    It works with a very simple, consistent DOM model and does not behave like a web browser: it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript. As a result, it parses markup and manipulates the resulting data structure very efficiently.

Crawl an HTML page and parse the result

Axios and Cheerio are simple to use. Below is an example that crawls an HTML page and parses its table content.

The HTML page looks like:

<!DOCTYPE html>
<html>
  <head>...</head>
  <body>
    ...     
    <table id="zodiac-signs">
      <tbody>
        <tr>
            <th>Aries</th>
            <th>Taurus</th>
            <th>Gemini</th>
            ...
        </tr>
        <tr>
            <td data-th="Aries"><img src="https://.../aries.png" alt="Aries"></td>
            <td data-th="Taurus"><img src="https://.../taurus.png" alt="Taurus"></td>
            <td data-th="Gemini"><img src="https://.../gemini.png" alt="Gemini"></td>
            ...
        </tr>
      </tbody>
    </table>
    ...
  </body>
</html>

Use the code below to fetch the zodiac image names and source URLs from the table above.

get-zodiacs.js:

const axios = require('axios');
const cheerio = require('cheerio');

function getZodiacs() {
    let url = 'https://...';

    axios.get(url)  // Use Axios to fetch the HTML page.
        .then((response) => { //------- Fetch success handler      
            // Use Cheerio to parse the resulting data structure.

            let $ = cheerio.load(response.data);      
            $('tr', '#zodiac-signs').last().children().each((i, elem) => {
                $(elem).children().each((i, e) => {
                    console.log('zodiac=' + e.attribs['alt'] + ', src=' + e.attribs['src']);
                });
            });
        })
        .catch(error => { //-------- Fetch error handler
            console.log(error);
        })
        .then(() => { //-------- Final handler
            console.log('Finished');
        });
}

getZodiacs();

Run this file with node:

$ node get-zodiacs.js 

The result:

zodiac=Aries, src=https://.../aries.png
zodiac=Taurus, src=https://.../taurus.png
zodiac=Gemini, src=https://.../gemini.png
Finished
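
The src values in this example are already absolute URLs, but many pages use relative paths instead. The following is a minimal sketch, assuming a hypothetical page URL and hypothetical src values, of resolving each src against the page URL with Node's built-in URL class before storing it:

```javascript
// Resolve an image src (absolute or relative) against the URL of the page
// it was scraped from. Uses only Node's built-in WHATWG URL class.
function toAbsoluteUrl(src, pageUrl) {
    return new URL(src, pageUrl).href;
}

// Hypothetical values, for illustration only.
const pageUrl = 'https://example.com/zodiac/index.html';
console.log(toAbsoluteUrl('images/aries.png', pageUrl));
console.log(toAbsoluteUrl('/img/taurus.png', pageUrl));
console.log(toAbsoluteUrl('https://cdn.example.com/gemini.png', pageUrl));
```

In the Cheerio callback above, this helper could wrap e.attribs['src'] so that every logged URL is absolute.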

Method 2 — Use Puppeteer to control a real browser

About Puppeteer

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless (no UI) by default, but can be configured to run full (non-headless) Chrome or Chromium.
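
Headless or full mode is chosen through the options object passed to puppeteer.launch(). As a rough sketch, the helper below builds that object from environment variables; HEADFUL and CHROME_PATH are hypothetical names invented for this example, while headless and executablePath are Puppeteer launch options:

```javascript
// Build the options object for puppeteer.launch() from environment variables.
// HEADFUL and CHROME_PATH are hypothetical variable names chosen for this
// sketch; `headless` and `executablePath` are Puppeteer launch options.
function launchOptionsFromEnv(env = process.env) {
    const options = { headless: env.HEADFUL !== '1' };
    if (env.CHROME_PATH) {
        options.executablePath = env.CHROME_PATH;
    }
    return options;
}

console.log(launchOptionsFromEnv({}));               // headless mode
console.log(launchOptionsFromEnv({ HEADFUL: '1' })); // visible browser window

// Usage with Puppeteer (not run here):
//   const browser = await puppeteer.launch(launchOptionsFromEnv());
```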

What Puppeteer can do

As mentioned at the beginning, most things that you can do manually in the browser can be done using Puppeteer. Here are a few examples to get you started:

  • Generate screenshots and PDFs of pages.
  • Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. “SSR” (Server-Side Rendering)).
  • Automate form submission, UI testing, keyboard input, etc.
  • Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
  • Capture a timeline trace of your site to help diagnose performance issues.
  • Test Chrome Extensions.

The basic usage example

The code below opens a website and saves a screenshot of it to the file example.png.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();
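
If page.goto() or page.screenshot() throws in the example above, browser.close() is never reached and the browser process may be left running. One possible way to guard against that is a small wrapper built on try/finally. The helper name withBrowser is made up for this sketch and is not part of Puppeteer:

```javascript
// Run `task` with a browser and always close the browser afterwards,
// even if the task throws. `launch` is any async function that returns an
// object with a close() method, for example () => puppeteer.launch().
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        return await task(browser);
    } finally {
        await browser.close();
    }
}

// Usage with Puppeteer (not run here):
//   const puppeteer = require('puppeteer');
//   await withBrowser(() => puppeteer.launch(), async (browser) => {
//       const page = await browser.newPage();
//       await page.goto('https://example.com');
//       await page.screenshot({ path: 'example.png' });
//   });
```

Passing the launch function in as a parameter keeps the helper independent of how the browser is configured.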

Crawl a page repeatedly with different search keywords

The example below crawls a website and fetches the meanings of a list of words.

The main elements in the HTML page that we care about look like:

<form method="GET" id="search-form" novalidate="">    
  <!-- The search input box -->
  <input id="search-word">
    ...

  <!-- The search submit button -->
  <button type="submit">
      <i class="i i-search" aria-hidden="true"></i>
  </button>

</form>

<!-- The element that contains the word meaning -->
<div id="word-content">
    ...
</div>
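
Because the form uses method GET, a search on such a site can often be reached directly through a URL with a query string, without typing into the form. The sketch below assumes a query parameter named q, which is purely hypothetical since the input's name attribute is not shown above, and uses Node's built-in URL class to encode the word:

```javascript
// Build a search URL for a word. The query parameter name ('q') is a
// hypothetical placeholder: check the name attribute of the real input.
function buildSearchUrl(baseUrl, word, paramName = 'q') {
    const url = new URL(baseUrl);
    url.searchParams.set(paramName, word); // percent-encodes the value
    return url.href;
}

console.log(buildSearchUrl('https://example.com/dict', 'community'));
console.log(buildSearchUrl('https://example.com/dict', 'ice cream & cake'));
```

Setting the value through searchParams encodes characters such as & and spaces, which encodeURI() would leave partly unescaped.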

get-word-meanings.js:

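const puppeteer = require('puppeteer');
const fs = require('fs');

async function getWordMeanings() {
    let url = 'https://xxx.com/dict'; // Replace it with a real URL

    let words = ['community', 'resident', 'institution'];
    let dictionary = new Map();

    const browser = await puppeteer.launch(
        {
            // Replace it with your own Chrome path
            executablePath: 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
            // Uncomment it if you want to open a visible Chrome window
            // headless: false
        }
    );
    const page = await browser.newPage();

    await page.goto(url, {timeout: 0});

    // NOTE: the steps inside this loop assume that submitting the search
    // form triggers a normal page navigation (the form uses method GET).
    for (let i = 0; i < words.length; i++) {
        let word = words[i];

        // Replace the content of the search input box with the next word.
        await page.$eval('#search-word', (e) => { e.value = ''; });
        await page.type('#search-word', word);

        // Submit the search form and wait for the result page to load.
        await Promise.all([
            page.waitForNavigation({timeout: 0}),
            page.click('#search-form button[type="submit"]'),
        ]);

        // Read the text of the element that contains the word meaning.
        let meaning = await page.$eval('#word-content', (e) => e.textContent);
        console.log(meaning);

        dictionary.set(word, meaning);
    }

    await browser.close();

    // Save the dictionary to a file as an array of [word, meaning] pairs.
    let content = JSON.stringify(Array.from(dictionary), null, 4);
    fs.writeFile('dictionary.json', content, 'utf8', function(err) {
        if (err) {
            console.log('save error:', err);
        }
    });
    console.log(content);
}

getWordMeanings();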

Run this file with node:

$ node get-word-meanings.js

The result:

[
    [
        "community",
        "the people living in one particular area or people who are considered as a unit"
    ],
    [
        "resident",
        "a person who lives or has their home in a place"
    ],
    [
        "institution",
        "a large and important organization"
    ]
]
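
The saved file is an array of [word, meaning] pairs because JSON has no Map type. As a small sketch of the round trip, the two helper functions below (names chosen for this example) show that passing the parsed array to the Map constructor restores the dictionary:

```javascript
// JSON.stringify(Array.from(map)) stores a Map as an array of [key, value]
// pairs, so new Map(JSON.parse(text)) restores it.
function serializeDictionary(map) {
    return JSON.stringify(Array.from(map), null, 4);
}

function parseDictionary(text) {
    return new Map(JSON.parse(text));
}

const original = new Map([
    ['resident', 'a person who lives or has their home in a place'],
]);
const restored = parseDictionary(serializeDictionary(original));
console.log(restored.get('resident'));

// To read the saved file (not run here):
//   parseDictionary(require('fs').readFileSync('dictionary.json', 'utf8'))
```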

Some Puppeteer APIs

Puppeteer API (v13.0.1)

  • page
    • page.setViewport(viewport)
    const page = await browser.newPage();
    await page.setViewport({
      width: 640,
      height: 480,
      deviceScaleFactor: 1,
    });
    await page.goto('https://example.com');
    
    • page.screenshot([options])
    // Generate the screenshot to example.png file
    await page.screenshot({path: 'example.png'})
    
    • page.pdf([options])

    page.pdf() generates a PDF of the page with print CSS media. To generate a PDF with screen media, call page.emulateMediaType('screen') before calling page.pdf().

    // Generates a PDF with 'screen' media type.
    await page.emulateMediaType('screen');
    await page.pdf({ path: 'page.pdf', width: '100px' });
    
    • page.$(selector)

    • selector <string> A selector to query page for

    • returns: <Promise<?ElementHandle>>

    The method runs document.querySelector within the page. If no element matches the selector, the return value resolves to null.

    await page.$('#exampleid');
    
    const tweetHandle = await page.$('.tweet .retweets');
    
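    • page.$$(selector)

    The method runs document.querySelectorAll within the page. If no elements match the selector, the return value resolves to [].

    • page.$$eval(selector, pageFunction[, ...args])

    This method runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to pageFunction.

    const divCount = await page.$$eval('div', (divs) => divs.length);
    const options = await page.$$eval('div > span.options', (options) =>
      options.map((option) => option.textContent)
    );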
    
    • page.$eval(selector, pageFunction[, ...args])

    This method runs document.querySelector within the page and passes the matched element as the first argument to pageFunction. If there is no element matching selector, the method throws an error.

    const searchValue = await page.$eval('#search', (el) => el.value);
    const preloadHref = await page.$eval('link[rel=preload]', (el) => el.href);
    const html = await page.$eval('.main-container', (e) => e.outerHTML);
    
    • page.evaluate(pageFunction[, ...args])

    • pageFunction <function|string> Function to be evaluated in the page context

    • ...args <...Serializable|JSHandle> Arguments to pass to pageFunction

    • returns: <Promise<Serializable>> Promise which resolves to the return value of pageFunction

    const bodyHandle = await page.$('body');
    const html = await page.evaluate((body) => body.innerHTML, bodyHandle);
    await bodyHandle.dispose();
    
  • ElementHandle

    • elementHandle.$(selector)

    • elementHandle.$$(selector)

    • elementHandle.$$eval(selector, pageFunction[, ...args])

    • elementHandle.$eval(selector, pageFunction[, ...args])

    • elementHandle.evaluate(pageFunction[, ...args])

    This method passes this handle as the first argument to pageFunction.

    const tweetHandle = await page.$('.tweet .retweets');
    expect(await tweetHandle.evaluate((node) => node.innerText)).toBe('10');
    
    • elementHandle.type(text[, options])

    • elementHandle.click([options])

    • elementHandle.uploadFile(...filePaths)
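
As noted above, page.$eval() throws when nothing matches the selector, while page.$() resolves to null. When a missing element is an expected case (for example, a word with no dictionary entry), a helper along these lines may be convenient. The name textOrNull is made up for this sketch; it only relies on page.$(), elementHandle.evaluate() and dispose():

```javascript
// Return the text content of the first element matching `selector`,
// or null when nothing matches. page.$eval() would throw in that case,
// so this sketch uses page.$(), which resolves to null instead.
async function textOrNull(page, selector) {
    const handle = await page.$(selector);
    if (handle === null) {
        return null;
    }
    try {
        return await handle.evaluate((node) => node.textContent);
    } finally {
        await handle.dispose(); // release the element handle
    }
}

// Usage with a Puppeteer page (not run here):
//   const meaning = await textOrNull(page, '#word-content');
```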