Web scraping with Node.js and ES2015
When lunch time rolls around at work, I usually check the websites of a few local restaurants to see what they’re offering for lunch. Having to load up four or five different sites every time gets tedious though. I wanted a single page where I could see all the lunch menus at a glance.
To get started, we create a new folder, open a terminal and run these commands.
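The exact commands aren't spelled out here, but given the tools mentioned later in the post (request, Cheerio, Heroku, S3) and the ES2015 compile step, something along these lines is a reasonable sketch; Express and the specific Babel packages are my assumptions:

```sh
mkdir lunch-menus && cd lunch-menus
npm init -y

# Runtime dependencies: HTTP requests, HTML parsing, a small web server, S3 access
npm install --save request cheerio express aws-sdk

# Dev dependencies: compile ES2015 source down to plain Node-friendly JavaScript
npm install --save-dev babel-cli babel-preset-es2015
```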
When that’s done, we need to edit our package.json and add a compile script under scripts. It should look like this.
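Here is a sketch of what that entry might look like, assuming the source lives in a src folder and Babel writes the compiled output to lib (the directory names and the es2015 preset are my assumptions):

```json
{
  "main": "index.js",
  "scripts": {
    "compile": "babel src --out-dir lib --presets es2015"
  }
}
```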
Now, to get started writing our lunch menu scraper, we create the index.js file we saw referenced in package.json as the main script, and put this code in it.
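A minimal sketch of that file, assuming it does nothing more than hand off to the compiled output:

```js
// index.js - the "main" script from package.json.
// It only loads the compiled application code from lib/.
require('./lib/app');
```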
app points to the lib folder that the compile script creates. Our actual source files will live in the src directory.
A scraper utility function
We will use a popular library called request to fetch the HTML of the different websites we want to scrape, and another library called Cheerio, which provides a familiar jQuery-like API we can use to interact with the HTML. This lets us easily manipulate the DOM elements we need.
At this stage we’re going to need to make some assumptions about what each web page will look like. It will have some DOM element containing the day of the week, followed by one or more DOM elements containing that day’s lunch menu, and so on for each day of the week. That’s enough for us to write our scraper.
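Here is a rough sketch of such a utility, built on request and Cheerio as described above. The file name, the function name scrape, and the idea of letting each caller pass in its own extract callback are my assumptions:

```js
// src/scraper.js
import request from 'request';
import cheerio from 'cheerio';

// Fetch a page, load the HTML into cheerio, and let the caller
// pull out whatever it needs through a jQuery-like API.
export default function scrape(url, extract) {
  return new Promise((resolve, reject) => {
    request(url, (error, response, body) => {
      if (error) {
        return reject(error);
      }
      resolve(extract(cheerio.load(body)));
    });
  });
}
```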
Using the scraper utility
We’re now ready to write some functions using this scraper utility. For each restaurant we want to scrape, we’re going to write a new function. This lets us clean up the data returned from the scrape on a site-by-site basis. You will notice that each scraper function returns a promise. This will prove useful later when scraping several sites at the same time and writing the result to a file only when all scrapes have finished.
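One such function might look something like this; the restaurant, URL, and selectors are all placeholders, but the shape (one exported function per restaurant, each returning a promise) follows the description above:

```js
// src/scrapers.js
import scrape from './scraper';

// One function per restaurant. Each resolves to the restaurant name
// plus a list of weekdays with that day's dishes.
export function exampleRestaurant() {
  return scrape('http://example-restaurant.example/lunch', ($) => {
    const days = [];

    // Assumed markup: an h3 per weekday, followed by p elements
    // holding that day's menu items.
    $('h3').each((i, heading) => {
      const day = $(heading);
      const dishes = day.nextUntil('h3', 'p')
        .map((j, dish) => $(dish).text().trim())
        .get();

      days.push({ day: day.text().trim(), dishes });
    });

    return { name: 'Example Restaurant', days };
  });
}
```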
That was just one, but any number of functions can be added and exported from this file.
Hosting the thing
How to get set up with Heroku and Amazon S3 isn’t really the point of this post, but there are plenty of resources out there to help you get started.
Anyway, on to our handler file.
It’s time to create the root handler for our web scraper. Every time this handler is called we will kick off the process of scraping all the sites we’ve added to scrapers.js. We will use Promise.all to defer writing the result until all the scraping functions have finished.
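A sketch of what that handler might look like, assuming Express-style (req, res) handlers and the official aws-sdk package for the S3 upload; the function name, the bucket variable, and the object key are all mine:

```js
// src/handlers.js
import AWS from 'aws-sdk';
import * as scrapers from './scrapers';

const s3 = new AWS.S3();
const BUCKET = process.env.S3_BUCKET; // assumed to be configured on Heroku

// Run every scraper exported from scrapers.js and, once they have all
// finished, store the combined result as a JSON file on S3.
export function scrapeAll(req, res) {
  const jobs = Object.keys(scrapers).map((name) => scrapers[name]());

  Promise.all(jobs)
    .then((menus) => new Promise((resolve, reject) => {
      s3.putObject({
        Bucket: BUCKET,
        Key: 'menus.json',
        Body: JSON.stringify(menus),
        ContentType: 'application/json'
      }, (error) => (error ? reject(error) : resolve(menus)));
    }))
    .then((menus) => res.json(menus))
    .catch((error) => res.status(500).send(error.message));
}
```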
Right, so with that we’ve made a bunch of requests to different sites we wanted to scrape and then saved the result as JSON to Amazon S3. But the data doesn’t do anyone any good just sitting in S3. We need a way to get that data back, so let’s create a function that does just that.
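Something along these lines should work; the name getMenus and the object key are again placeholders, and it lives in the same file so it can reuse the S3 client and bucket from above:

```js
// Also in src/handlers.js: read the stored JSON back out of S3.
export function getMenus(req, res) {
  s3.getObject({ Bucket: BUCKET, Key: 'menus.json' }, (error, data) => {
    if (error) {
      return res.status(500).send(error.message);
    }
    res.type('application/json').send(data.Body.toString());
  });
}
```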
We’re approaching something that works now. We just need to export these two functions and then create the entry point for our app.
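A possible entry point, again assuming Express; the route paths are my choice, with the root route kicking off a scrape as described earlier:

```js
// src/app.js - compiled to lib/app.js, which index.js requires.
import express from 'express';
import { scrapeAll, getMenus } from './handlers';

const app = express();

// The root route triggers a fresh scrape; /menus serves the stored data.
app.get('/', scrapeAll);
app.get('/menus', getMenus);

app.listen(process.env.PORT || 3000);
```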
All that’s left at this point is running the compile script, and we’ll be ready to deploy our code to Heroku.
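Assuming a Heroku remote is already configured for the repository, that boils down to something like:

```sh
npm run compile
git add .
git commit -m "Add compiled lunch scraper"
git push heroku master
```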
All of this code is available in this GitHub repository.
In the next post I’ll show how I made a simple Angular app to display the scraped data.