This article introduces two solutions for crawling web pages with JavaScript. The first is a lightweight solution that uses Axios and Cheerio. The second uses Puppeteer to control a real browser and simulate human actions. All the tools listed here are free.
- Axios (an HTTP client) and Cheerio (a lightweight implementation of core jQuery, so you can use it like jQuery) are two small Node libraries that can be used together as a spider.
- Puppeteer is also a Node library, one that can control Chrome or Chromium. Because this approach uses a real browser, most things that you can do manually in the browser can be done automatically with Puppeteer. So even a website that has adopted anti-spider strategies is usually not a problem.
Method 1 — Use Axios and Cheerio
Use Axios (a simple HTTP client) to fetch an HTML page and Cheerio (a lightweight implementation of jQuery) to parse the resulting data.
About Axios and Cheerio
- Axios
A simple promise based HTTP client for the browser and node.js.
Its GitHub address: https://github.com/axios/axios
- Cheerio
A Node library that implements a subset of core jQuery, so you can use it like jQuery.
Cheerio works with a very simple, consistent DOM model and does not interpret the result as a web browser does: it produces no visual rendering, applies no CSS, loads no external resources, and runs no JavaScript. As a result, it parses markup and manipulates the resulting data structure very efficiently.
Crawl an HTML page and parse the result
The usage of Axios and Cheerio is pretty simple. Below is an example that crawls an HTML page and parses its table content.
The HTML page looks like:
<!DOCTYPE html>
<html>
<head>...</head>
<body>
...
<table id="zodiac-signs">
<tbody>
<tr>
<th>Aries</th>
<th>Taurus</th>
<th>Gemini</th>
...
</tr>
<tr>
<td data-th="Aries"><img src="https://.../aries.png" alt="Aries"></td>
<td data-th="Taurus"><img src="https://.../taurus.png" alt="Taurus"></td>
<td data-th="Gemini"><img src="https://.../gemini.png" alt="Gemini"></td>
...
</tr>
</tbody>
</table>
...
</body>
</html>
Use the code below to fetch the zodiac image names and source URLs from the table above.
get-zodiacs.js:
const axios = require('axios');
const cheerio = require('cheerio');

function getZodiacs() {
    let url = 'https://...';
    axios.get(url) // Use Axios to fetch the HTML page.
        .then((response) => { //------- Fetch success handler
            // Use Cheerio to parse the resulting data structure.
            let $ = cheerio.load(response.data);
            $('tr', '#zodiac-signs').last().children().each((i, elem) => {
                $(elem).children().each((i, e) => {
                    console.log('zodiac=' + e.attribs['alt'] + ', src=' + e.attribs['src']);
                });
            });
        })
        .catch((error) => { //-------- Fetch error handler
            console.log(error);
        })
        .then(() => { //-------- Final handler
            console.log('Finished');
        });
}

getZodiacs();
Run this file with node:
$ node get-zodiacs.js
The result:
zodiac=Aries, src=https://.../aries.png
zodiac=Taurus, src=https://.../taurus.png
zodiac=Gemini, src=https://.../gemini.png
Finished
Method 2 — Use Puppeteer to control a real browser
About Puppeteer
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless (no UI) by default, but can be configured to run full (non-headless) Chrome or Chromium.
What Puppeteer can do
As mentioned in the beginning, most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples to get you started:
- Generate screenshots and PDFs of pages.
- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. “SSR” (Server-Side Rendering)).
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
- Capture a timeline trace of your site to help diagnose performance issues.
- Test Chrome Extensions.
The basic usage example
The code below opens a website and saves its screenshot to an example.png file.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
await page.screenshot({ path: 'example.png' });
await browser.close();
})();
Crawl a page repeatedly with different search keywords
The example below crawls a dictionary website and fetches the meanings of a list of words.
The main elements in the HTML page that we care about look like:
<form method="GET" id="search-form" novalidate="">
<!-- The search input box -->
<input id="search-word">
...
<!-- The search submit button -->
<button type="submit">
<i class="i i-search" aria-hidden="true"></i>
</button>
</form>
<!-- The element that contains the word meaning -->
<div id="word-content">
...
</div>
get-word-meanings.js:
const puppeteer = require('puppeteer');
const fs = require('fs');

async function getWordMeanings() {
    let url = 'https://xxx.com/dict'; // Replace it with a real URL
    let words = ['community', 'resident', 'institution'];
    let dictionary = new Map();
    const browser = await puppeteer.launch(
        {
            // Replace it with your own Chrome path
            executablePath: 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
            // Uncomment it if you want to open Chrome in the foreground
            // headless: false
        }
    );
    const page = await browser.newPage();
    let word = words[0];
    // await page.goto(url + encodeURI(word), {timeout: 0});
    await page.goto(url, {timeout: 0});
    for (let i = 0; i < words.length; i++) {
        word = words[i];
        // Clear the search box, then type the word like a human.
        await page.$eval('#search-word', (e) => { e.value = ''; });
        await page.type('#search-word', word);
        // Submit the search and wait for the result page to load.
        await Promise.all([
            page.click('#search-form button[type=submit]'),
            page.waitForNavigation({timeout: 0}),
        ]);
        // Read the word meaning from the result element.
        let meaning = await page.$eval('#word-content', (e) => e.textContent);
        console.log(meaning);
        dictionary.set(word, meaning);
    }
    await browser.close();
    // Save the dictionary to a file.
    let content = JSON.stringify(Array.from(dictionary), null, 4);
    fs.writeFile('dictionary.json', content, 'utf8', function(err) {
        if (err) {
            console.log('save error:', err);
        }
    });
    console.log(content);
}

getWordMeanings();
Run this file with node:
$ node get-word-meanings.js
The result:
[
[
"community",
"the people living in one particular area or people who are considered as a unit"
],
[
"resident",
"a person who lives or has their home in a place"
],
[
"institution",
"a large and important organization"
]
]
Some Puppeteer APIs
Puppeteer API (v13.0.1)
- page
- page.setViewport(viewport)
const page = await browser.newPage();
await page.setViewport({ width: 640, height: 480, deviceScaleFactor: 1 });
await page.goto('https://example.com');
- page.screenshot([options])
// Generate the screenshot to an example.png file
await page.screenshot({ path: 'example.png' });
- page.pdf([options])
page.pdf() generates a pdf of the page with print CSS media. To generate a pdf with screen media, call page.emulateMediaType(‘screen’) before calling page.pdf().
// Generates a PDF with 'screen' media type.
await page.emulateMediaType('screen');
await page.pdf({ path: 'page.pdf', width: '100px' });
- page.$(selector)
selector: A selector to query the page for
returns: <Promise> resolving to an ElementHandle, or null if no element matches the selector
The method runs document.querySelector within the page.
await page.$('#exampleid');
const tweetHandle = await page.$('.tweet .retweets');
- page.$$(selector)
The method runs document.querySelectorAll within the page. If no elements match the selector, the return value resolves to [].
- page.$$eval(selector, pageFunction[, ...args])
This method runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to pageFunction.
const divCount = await page.$$eval('div', (divs) => divs.length);
const options = await page.$$eval('div > span.options', (options) =>
    options.map((option) => option.textContent)
);
- page.$eval(selector, pageFunction[, …args])
This method runs document.querySelector within the page and passes it as the first argument to pageFunction. If there’s no element matching selector, the method throws an error.
const searchValue = await page.$eval('#search', (el) => el.value);
const preloadHref = await page.$eval('link[rel=preload]', (el) => el.href);
const html = await page.$eval('.main-container', (e) => e.outerHTML);
- page.evaluate(pageFunction[, ...args])
pageFunction: Function to be evaluated in the page context
...args: Arguments to pass to pageFunction
returns: <Promise> resolving to the return value of pageFunction
const bodyHandle = await page.$('body');
const html = await page.evaluate((body) => body.innerHTML, bodyHandle);
await bodyHandle.dispose();
- ElementHandle
- elementHandle.$(selector)
- elementHandle.$$(selector)
- elementHandle.$$eval(selector, pageFunction[, ...args])
- elementHandle.$eval(selector, pageFunction[, ...args])
- elementHandle.evaluate(pageFunction[, ...args])
This method passes this handle as the first argument to pageFunction.
const tweetHandle = await page.$('.tweet .retweets');
expect(await tweetHandle.evaluate((node) => node.innerText)).toBe('10');
- elementHandle.type(text[, options])
- elementHandle.click([options])
- elementHandle.uploadFile(...filePaths)
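These input methods can be combined to drive a form. Below is a minimal sketch, assuming an already-created Puppeteer Page and the selectors from the dictionary example earlier:

```javascript
// Sketch: fill and submit a search form with ElementHandle methods.
// `page` is a Puppeteer Page; the selectors are assumptions from the example above.
async function submitSearch(page, word) {
    const input = await page.$('#search-word');   // ElementHandle (or null if missing)
    await input.click({ clickCount: 3 });         // triple-click to select any existing text
    await input.type(word, { delay: 50 });        // type with a human-like delay
    const button = await page.$('button[type=submit]');
    await button.click();
}
```

elementHandle.uploadFile works similarly for file inputs: grab the handle of an input[type=file] element and pass it one or more local file paths.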