Web scraping with Node.js and ES2015
When lunch time rolls around at work, I usually check the websites of a few local restaurants to see what they’re offering for lunch. Having to load up four or five different sites every time gets tedious though. I wanted a single page where I could see all the lunch menus at a glance.
Time to write a web scraper! We’ll of course use ES2015, the new version of JavaScript that implements the ECMAScript 2015 specification.
The setup
To get started, we create a new folder, open a terminal and run these commands.
npm init
npm install express promise request cors cheerio bluebird aws-sdk --save
npm install -g babel-cli
When that’s done, we need to edit our package.json and add a compile script under scripts. It should look like this.
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1",
"compile": "babel src --out-dir lib"
},
That will have Babel transform our ES2015 JavaScript into regular old ES5 JavaScript, so we don’t need to run the final script with any flags or upgrade Node to a version that supports ES2015 natively. Later, once we’ve written some code, we’ll run the compile script to produce deployable code.
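Depending on your Babel version, the CLI alone won’t apply any transforms. With a Babel 6 setup (which is what the snippet below assumes) you also need to install the ES2015 preset and point Babel at it with a .babelrc file next to package.json.
npm install babel-preset-es2015 --save-dev
Then, in .babelrc:
{
  "presets": ["es2015"]
}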
Now we can start writing our lunch menu scraper. First we create the index.js file we saw referenced in package.json as the main script, and put this code in it.
var app = require('./lib');
app.listen(process.env.PORT || 8080);
console.log('Running');
Notice that app points to the lib folder that the compile script creates. Our actual source files will live in the src directory.
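To keep things straight, here’s the layout we’ll end up with (the file names match the compile output at the end of the post):
lunch-menu-scraper/
  package.json
  index.js        <- entry point, requires ./lib
  src/            <- ES2015 source we write
    index.js
    scraper.js
    scrapers.js
    handlers.js
  lib/            <- ES5 output from npm run compile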
A scraper utility function
We will use a popular library called request to fetch the HTML of the different websites we want to scrape, and another library called Cheerio, which provides a familiar jQuery-like API we can use to interact with the HTML. This lets us easily manipulate the DOM elements we need.
At this stage we’re going to need to make some assumptions about what each web page will look like. It will have some DOM element containing the day of the week, followed by one or more DOM elements containing that day’s lunch menu, and so on for each day of the week. That’s enough for us to write our scraper.
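To make that concrete, here’s a tiny Cheerio session against some made-up markup in the shape we’re assuming (the #weeklymenu id and the menu text are invented for illustration):
import cheerio from 'cheerio';

// Hypothetical markup: day headings followed by that day's menu.
const html = `
  <div id="weeklymenu">
    <h3>Monday</h3><p>Pea soup and pancakes</p>
    <h3>Tuesday</h3><p>Grilled salmon</p>
  </div>`;

const $ = cheerio.load(html);
// .text() flattens the selection to plain text, which is all our scraper relies on.
console.log($('#weeklymenu').text());
// Prints something like: "Monday Pea soup and pancakes Tuesday Grilled salmon" (plus whitespace)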
import request from 'request';
import cheerio from 'cheerio';

const days = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday'];

function scrape(url, selector, callback) {
  let content = '';
  let week = [[], [], [], [], []];

  request(url, (err, response, html) => {
    if (err) {
      // Still hand back the (empty) week so callers waiting on a promise aren't left hanging.
      return callback(week);
    }
    let $ = cheerio.load(html);
    // Get the text from each element within the selector.
    $(selector).each((index, element) => content += ' ' + $(element).text());
    // Transform the content into an array of lowercased words.
    content = content.split(/\s+/).map(e => e.trim().toLowerCase());
    // Slice out the words between each day's name and the next day's.
    for (let i = 0; i < days.length; i++) {
      week[i] = content.slice(content.indexOf(days[i]), i !== days.length - 1 ? content.indexOf(days[i + 1]) : undefined);
    }
    callback(week);
  });
}

// Expose the function to the rest of the app.
module.exports = scrape;
Using the scraper utility
We’re now ready to write some functions using this scraper utility. For each restaurant we want to scrape, we write a new function. This lets us clean up the data returned from the scrape on a site-by-site basis. You will notice that each scraper function returns a promise. This will prove useful later, when we scrape several sites at the same time and only write the result to a file once every scrape has finished.
import scrape from './scraper';
import Promise from 'promise';

function awesomeFoodPlace() {
  return new Promise((resolve, reject) => {
    let url = 'http://awesomefoodplace.com/lunch-menu/';
    let restaurant = {
      name: 'Awesome Food Place',
      url: url,
      menu: []
    };
    scrape(url, '#weeklymenu', week => {
      // Drop the day name itself, then join the rest of that day's words into one line.
      for (let i = 0; i < 5; i++) {
        restaurant.menu[i] = week[i].slice(1).join(' ').replace(/\n+/g, '').replace(/\t+/g, '').trim();
      }
      resolve(restaurant);
    });
  });
}

module.exports = {
  awesomeFoodPlace
};
That was just one scraper, but any number of these functions can be added and exported from this file, as sketched below.
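For example, a second scraper added to scrapers.js might look like this. The restaurant, URL and selector here are entirely made up; only the clean-up logic and the growing export differ from the first one.
// Hypothetical second scraper; the name, URL and selector are invented for illustration.
function tastyCorner() {
  return new Promise((resolve, reject) => {
    let url = 'http://tastycorner.example/menu/';
    let restaurant = { name: 'Tasty Corner', url: url, menu: [] };
    scrape(url, '.weekly-lunch', week => {
      for (let i = 0; i < 5; i++) {
        restaurant.menu[i] = week[i].slice(1).join(' ').trim();
      }
      resolve(restaurant);
    });
  });
}

module.exports = {
  awesomeFoodPlace,
  tastyCorner
};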
Hosting the thing
I’m hosting my scraper on Heroku, using their free tier. Because Heroku wipes files written to disk when the dyno stops (which the free one does after one hour of inactivity), I wanted to try using Amazon S3 for persistent data storage. Luckily Amazon provides a JavaScript SDK for their services.
How to get set up with Heroku and Amazon S3 isn’t really the point of this post, but there are plenty of resources out there to get you started.
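The handler code below does expect a few environment variables. On Heroku you’d set them roughly like this (the bucket name and region are placeholders; AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard credentials the AWS SDK picks up from the environment):
heroku config:set S3_BUCKET=your-bucket-name REGION=eu-west-1 AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=...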
Anyway, on to our handler file.
import Promise from 'bluebird';
import _fs from 'fs';
import aws from 'aws-sdk';
import scrapers from './scrapers';
aws.config.region = process.env.REGION || aws.config.region;
const outputName = 'menus.json';
The JavaScript promise library bluebird lets you wrap callback-based methods (like those in the AWS SDK and the fs module) so that they return promises instead, which will come in handy later. The wrapped copies get an Async suffix: putObjectAsync, readFileAsync and so on. It’s as easy as this:
const s3 = Promise.promisifyAll(new aws.S3());
const fs = Promise.promisifyAll(_fs);
It’s time to create the root handler for our web scraper. Every time this handler is called we will kick off the process of scraping all the sites we’ve added to scrapers.js. We will use Promise.all to defer writing the result to file until all scraping functions have finished.
function rootHandler (req, res) {
  let result = {};
  let promises;
  let params = { Bucket: process.env.S3_BUCKET, Key: outputName };

  result.restaurants = {};
  result.updated = new Date();

  // Get all the promises from the scrapers we have written.
  promises = Object.getOwnPropertyNames(scrapers).map(name => scrapers[name]());

  Promise.all(promises)
    .then(response => {
      response.forEach(restaurant => {
        result.restaurants[restaurant.name] = restaurant;
      });
      params.Body = JSON.stringify(result);
      // Asynchronously save the data to S3.
      return s3.putObjectAsync(params);
    })
    .then(() => {
      // If a local copy is saved on Heroku, delete it.
      if (fs.existsSync(outputName)) {
        fs.unlinkSync(outputName);
      }
      res.send('Scraped and saved to S3.');
    })
    .catch(console.log.bind(console));
}
Right, so with that we’ve made a bunch of requests to the different sites we wanted to scrape and then saved the result as JSON to Amazon S3. But the data doesn’t do anyone any good just sitting in S3. We need a way to get that data back. Let’s create a function called apiHandler.
function apiHandler (req, res) {
  let params = { Bucket: process.env.S3_BUCKET, Key: outputName, ResponseContentType: 'application/json' };
  let output;

  // Try to get the file from local disk storage first.
  fs.readFileAsync(outputName, 'utf8')
    .then(data => {
      res.json(JSON.parse(data));
    })
    .catch(err => {
      // If that fails, get the data from S3 and store it locally.
      return s3.getObjectAsync(params)
        .then(data => {
          output = data.Body.toString();
          return fs.writeFileAsync(outputName, output);
        })
        .then(() => {
          res.json(JSON.parse(output));
        })
        .catch(console.log.bind(console));
    });
}
We’re approaching something working now. We just need to export these two functions, and then create our app module, src/index.js.
module.exports = {
  rootHandler,
  apiHandler
};
import express from 'express';
import cors from 'cors';
import handlers from './handlers';
const app = express();
app.use(cors());
app.get('/', handlers.rootHandler);
app.get('/api/menus', handlers.apiHandler);
exports = module.exports = app;
All that’s left at this point is running the compile script, and we’ll be ready to deploy our code to Heroku (see the note on the Procfile after the build output).
npm run compile
lunch-menu-scraper@0.0.1 compile C:\projects\lunch-menu-scraper
babel src --out-dir lib
src\handlers.js -> lib\handlers.js
src\index.js -> lib\index.js
src\scraper.js -> lib\scraper.js
src\scrapers.js -> lib\scrapers.js
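One deployment detail worth mentioning: Heroku needs to be told how to start the app. A Procfile like the one below (or an equivalent npm start script) should do it, assuming the compiled lib folder is committed or built as part of the deploy:
web: node index.js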
All of this code is available in this GitHub repository.
In the next post I’ll show how I made a simple Angular app to display the scraped data. (Narrator: There never was a next post.)