[Scraping]: Basics

scraping

09/26/2019


Web scraping allows you to extract large amounts of data from websites. I'll use my GitHub page as an example, and we'll use Node.js with Puppeteer for the scraping. Install Node.js & Puppeteer before starting.

1. Decide which information to extract


As an example, I'll extract the inner text of the navigation items in the header tab: About, Posts, SQL, Spring, Swift

2. Inspect the elements on the page



Right-click an element and choose Inspect, or open the developer console by pressing F12. You'll notice that each navigation item sits in an li tag with the class name masthead__menu-item

3. Verify & Select items


Verify

Move to the console and select the specified elements:

JS
document.querySelectorAll('li.masthead__menu-item');

This finds the matching elements on the current page. Here, it returns five li elements with the class name masthead__menu-item


Expand the list to see their values. You'll notice that each li.masthead__menu-item component has the innerText value we're interested in.
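
For instance, the first item:

JS
document.querySelectorAll('li.masthead__menu-item')[0].innerText // "About"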

Select innerText


Get an array of the innerText values from the navbar items:

JS
Array.from(document.querySelectorAll('li.masthead__menu-item')).map((subpages) => subpages.innerText);

Note that the NodeList returned by querySelectorAll needs to be converted into an Array first; NodeList has no map method, so calling it directly would trigger an error
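
As an aside, the spread operator gives an equivalent conversion, if you prefer it over Array.from:

JS
// Equivalent: spread the NodeList into a real Array before mapping
[...document.querySelectorAll('li.masthead__menu-item')].map(subpages => subpages.innerText);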

4. Use Node.js & Puppeteer for scraping

If not previously done, install Node.js & Puppeteer:
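
BASH
$ npm i puppeteer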

JS
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io")
  const navItems = await page.evaluate(() =>
    Array.from(document.querySelectorAll("li.masthead__menu-item")).map(
      subpages => subpages.innerText
    )
  )
  console.log(navItems)
  await browser.close()
})()

Insert the query from the step above into a .js file (e.g. scraper.js), then run it.

Output:

BASH
$ node scraper.js
[ 'About', 'Posts', 'SQL', 'Spring', 'Swift' ]

Output the result array to a file

JS
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io")
  const navItems = await page.evaluate(() =>
    Array.from(document.querySelectorAll("li.masthead__menu-item")).map(
      subpages => subpages.innerText
    )
  )
  writeFile(navItems, "array.txt")
  await browser.close()
})()

// Write the array into a file, one element per line
function writeFile(arr, fileName) {
  var strToBeWritten = arr.toString().replace(/,/g, "\r\n")
  var fs = require("fs")
  fs.writeFile(fileName, strToBeWritten, function(err) {
    if (err) {
      return console.log(err)
    }
    console.log("The file was saved!")
  })
}

We can write the array to a file with the function above.

JS
var strToBeWritten = arr.toString().replace(/,/g, "\r\n")

arr.toString() joins the array elements into a single string separated by ",". The line above then replaces every occurrence of "," with a newline (that's what the /,/g regex does), so the file stores one element per line.
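
For example, with the array we scraped:

JS
var arr = ["About", "Posts", "SQL", "Spring", "Swift"]
arr.toString()                        // "About,Posts,SQL,Spring,Swift"
arr.toString().replace(/,/g, "\r\n")  // "About\r\nPosts\r\nSQL\r\nSpring\r\nSwift"

Note that arr.join("\r\n") would produce the same string in a single step.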

Output in array.txt:

BASH
$ cat array.txt
About
Posts
SQL
Spring
Swift

We have achieved our main goal here. Continue reading if you want to do more with the scraped data.

Downloading images on the page as files (+ request module)

JS
document.querySelectorAll("p>img");

Running the command above in the console selects all the images on this particular post page. Now you can modify the navItems script to collect images instead.

">" sign specfies all img tags that is right under p tag

JS
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io/webscraping/")
  // Get the src of every image that is a direct child of a p tag
  const images = await page.evaluate(() =>
    Array.from(document.querySelectorAll("p>img")).map(imgs => imgs.src)
  )
  console.log(images)
  await browser.close()
})()

Output:

BASH
$ node scraper.js
[ 'https://ellismin.github.io/images/webscrape/1.png',
  'https://ellismin.github.io/images/webscrape/2.png',
  'https://ellismin.github.io/images/webscrape/3.png',
  'https://ellismin.github.io/images/webscrape/4.png',
  'https://ellismin.github.io/images/webscrape/5.png',
  'https://ellismin.github.io/images/webscrape/6.png' ]

We have an array of image sources. Now, we can download them.

Request module

Downloading the images from the URLs we've extracted is a lot easier with the request module. To install it:

BASH
$ npm i request
JS
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io/webscraping/")
  const images = await page.evaluate(() =>
    Array.from(document.querySelectorAll("p>img")).map(imgs => imgs.src)
  )
  downloadImage(images, "imgPrefix")
  await browser.close()
})()

// Download every URL in arr, naming the files imgPrefix1.png, imgPrefix2.png, ...
function downloadImage(arr, imgPrefix) {
  var fs = require("fs"),
    request = require("request")
  // Stream a single URL's response body into a local file
  var download = function(url, filename, callback) {
    request.head(url, function(err, res, body) {
      request(url)
        .pipe(fs.createWriteStream(filename))
        .on("close", callback)
    })
  }
  for (var i = 0; i < arr.length; i++) {
    var imgName = imgPrefix + (i + 1).toString() + ".png"
    download(arr[i], imgName, function() {
      console.log("image created!")
    })
  }
}
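
As an aside, if you'd rather skip the extra dependency, here's a minimal sketch of the same download helper using Node's built-in https module (our image URLs are all https, so it suffices here):

JS
// Alternative download helper without the request package
var https = require("https")
var fs = require("fs")

function download(url, filename, callback) {
  // Stream the HTTP response body straight into the file
  https.get(url, function(res) {
    res.pipe(fs.createWriteStream(filename)).on("close", callback)
  })
}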

Scraping recent posts

Let's now try scraping multiple items and storing them in JSON format: the titles and excerpts of recent posts from the main page.

JS
document.querySelectorAll("article")
document.querySelector("article h2.archive__item-title")
document.querySelector("article p.archive__item-excerpt")
  1. The first line selects all articles, each of which contains a title and an excerpt
  2. The second line selects the title within an article
  3. The third line selects the excerpt within an article

scraper.js

JS
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io")
  // Build an object with a title and an excerpt for each article
  const posts = await page.evaluate(() =>
    Array.from(document.querySelectorAll("article")).map(post => ({
      title: post.querySelector("h2.archive__item-title").innerText,
      excerpt: post.querySelector("p.archive__item-excerpt").innerText,
    }))
  )
  console.log(posts)
  await browser.close()
})()

Output:

BASH
$ node scraper.js
[ { title: 'Web Scraping',
    excerpt: 'Extracting source elements with Node.js & Puppeteer' },
  { title: 'Aggregate Functions', excerpt: 'COUNT • GROUP BY' },
  { title: 'Web Scraping Prerequisites',
    excerpt: 'Use Puppeteer for web scraping' },
  { title: 'Table of Contents',
    excerpt: 'Dynamically generate table of contents with JavaScript' },
  { title: 'Refining Selections',
    excerpt: 'DISTINCT • ORDERBY • LIMIT • LIKE' } ]
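
To actually store the posts in JSON format, as mentioned above, a minimal sketch that goes right after the page.evaluate call in scraper.js (posts.json is an arbitrary file name):

JS
// Serialize the scraped posts and write them to disk
const fs = require("fs")
fs.writeFile("posts.json", JSON.stringify(posts, null, 2), err => {
  if (err) return console.log(err)
  console.log("posts.json saved!")
})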
