[Scraping]: Basics
scraping
09/26/2019
Web scraping lets you extract large amounts of data from websites. I'll use my GitHub page as an example. We'll use Node.js with Puppeteer for web scraping; install both if you haven't already.
1. Decide which information to extract
As an example, I'll extract the inner text of the items in the header navigation bar: About, Posts, SQL, Spring, Swift
2. Inspect the elements on the page
Right-click an element and choose Inspect, or open the debugging console by pressing F12. You'll notice that each navigation item is placed in an li tag with the class name masthead__menu-item
3. Verify & Select items
Verify
Move to the console and select the specified elements:
document.querySelectorAll('li.masthead__menu-item');
This finds the matching elements on the current page. Here, it returns five li elements with the class name masthead__menu-item.
Expand the list to see their values. You'll notice each li.masthead__menu-item component has the innerText value we're interested in.
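For instance, you can check a single item directly in the console (index 0 should be the first nav item, About):
document.querySelectorAll('li.masthead__menu-item')[0].innerText; // "About"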
Select innerText
Get an array of the innerText values from the navbar items:
Array.from(document.querySelectorAll('li.masthead__menu-item')).map((subpages) => subpages.innerText);
Note that the NodeList returned by querySelectorAll must be converted into an Array before calling map; otherwise, it will throw an error.
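As a quick illustration, calling map directly on the NodeList throws, while a converted Array works (the spread syntax is an equivalent alternative to Array.from):
// TypeError: ...map is not a function — a NodeList has forEach but not map
document.querySelectorAll('li.masthead__menu-item').map(item => item.innerText);
// Either of these produces a real Array that supports map
Array.from(document.querySelectorAll('li.masthead__menu-item'));
[...document.querySelectorAll('li.masthead__menu-item')];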
4. Use Node.js & Puppeteer for scraping
If not previously done, install Node.js & Puppeteer
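Assuming Node.js is already installed, Puppeteer can be added to the project with npm:
$ npm i puppeteer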
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io")

  const navItems = await page.evaluate(() =>
    Array.from(document.querySelectorAll("li.masthead__menu-item")).map(
      subpages => subpages.innerText
    )
  )

  console.log(navItems)
  await browser.close()
})()
Save the code above into a js file (scraper.js). Then test it with
$ node scraper.js
- Output:
$ node scraper.js
[ 'About', 'Posts', 'SQL', 'Spring', 'Swift' ]
Output the result array into a file
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io")

  const navItems = await page.evaluate(() =>
    Array.from(document.querySelectorAll("li.masthead__menu-item")).map(
      subpages => subpages.innerText
    )
  )

  writeFile(navItems, "array.txt")
  await browser.close()
})()

// Write the array into a file, one element per line
function writeFile(arr, fileName) {
  var strToBeWritten = arr.toString().replace(/,/g, "\r\n")
  var fs = require("fs")

  fs.writeFile(fileName, strToBeWritten, function(err) {
    if (err) {
      return console.log(err)
    }
    console.log("The file was saved!")
  })
}
We can output the array into a file with the function above.
var strToBeWritten = arr.toString().replace(/,/g, "\r\n")
arr.toString() joins the array elements into a single comma-separated string. The line above replaces every occurrence of "," with a line break (the g flag in /,/g means all occurrences, not just the first), so the file stores each array element on its own line.
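A quick sketch of the transformation, using the navbar values from earlier:
var arr = ['About', 'Posts', 'SQL', 'Spring', 'Swift']
arr.toString()                       // "About,Posts,SQL,Spring,Swift"
arr.toString().replace(/,/g, "\r\n") // "About\r\nPosts\r\nSQL\r\nSpring\r\nSwift"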
Output in array.txt:
$ cat array.txt
About
Posts
SQL
Spring
Swift
We have achieved our main goal here. Continue reading if you want to do more with the scraped data.
Downloading images on a page as files (+ request module)
document.querySelectorAll("p>img");
Run in the console, the command above selects all images on this particular post page. You can now adapt the scraper to collect image sources instead of nav items.
The ">" sign selects all img tags that are direct children of a p tag.
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io/webscraping/")

  // Get the source URL of each image
  const images = await page.evaluate(() =>
    Array.from(document.querySelectorAll("p>img")).map(imgs => imgs.src)
  )

  console.log(images)
  await browser.close()
})()
Output:
$ node scraper.js
[ 'https://ellismin.github.io/images/webscrape/1.png',
  'https://ellismin.github.io/images/webscrape/2.png',
  'https://ellismin.github.io/images/webscrape/3.png',
  'https://ellismin.github.io/images/webscrape/4.png',
  'https://ellismin.github.io/images/webscrape/5.png',
  'https://ellismin.github.io/images/webscrape/6.png' ]
We have an array of image sources. Now, we can download them.
Request module
Downloading images from the URLs we've extracted is much easier with the request module. To install it:
$ npm i request
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io/webscraping/")

  const images = await page.evaluate(() =>
    Array.from(document.querySelectorAll("p>img")).map(imgs => imgs.src)
  )

  downloadImage(images, "imgPrefix")
  await browser.close()
})()

function downloadImage(arr, imgPrefix) {
  var fs = require("fs"),
    request = require("request")

  // Stream a single URL into a local file
  var download = function(url, filename, callback) {
    request.head(url, function(err, res, body) {
      request(url)
        .pipe(fs.createWriteStream(filename))
        .on("close", callback)
    })
  }

  // Download the URLs in arr as imgPrefix1.png, imgPrefix2.png, ...
  for (var i = 0; i < arr.length; i++) {
    var imgName = imgPrefix + (i + 1).toString() + ".png"
    download(arr[i], imgName, function() {
      console.log("image created!")
    })
  }
}
Scraping recent posts
Let's now try scraping multiple items and storing them in JSON format. We'll scrape the titles and excerpts of the recent posts on the main page.
document.querySelectorAll("article")
document.querySelector("article h2.archive__item-title")
document.querySelector("article p.archive__item-excerpt")
- First line selects all articles, each of which contains a title and an excerpt
- Second line selects a title within an article
- Third line selects an excerpt within an article
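Before moving the query into the scraper, you can verify the combined selection in the console, following the same pattern as step 3:
// Map each article to a { title, excerpt } pair
Array.from(document.querySelectorAll("article")).map(post => ({
  title: post.querySelector("h2.archive__item-title").innerText,
  excerpt: post.querySelector("p.archive__item-excerpt").innerText,
}));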
scraper.js
const puppeteer = require("puppeteer")

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto("https://ellismin.github.io")

  // Map each article to an object holding its title and excerpt
  const posts = await page.evaluate(() =>
    Array.from(document.querySelectorAll("article")).map(post => ({
      title: post.querySelector("h2.archive__item-title").innerText,
      excerpt: post.querySelector("p.archive__item-excerpt").innerText,
    }))
  )

  console.log(posts)
  await browser.close()
})()
Output:
$ node scraper.js
[ { title: 'Web Scraping',
    excerpt: 'Extracting source elements with Node.js & Puppeteer' },
  { title: 'Aggregate Functions', excerpt: 'COUNT • GROUP BY' },
  { title: 'Web Scraping Prerequisites', excerpt: 'Use Puppeteer for web scraping' },
  { title: 'Table of Contents',
    excerpt: 'Dynamically generate table of contents with JavaScript' },
  { title: 'Refining Selections', excerpt: 'DISTINCT • ORDERBY • LIMIT • LIKE' } ]
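Since the goal was to store the posts in JSON format, here is a minimal sketch that persists the array with fs; the posts.json filename is my own choice:
// Inside the async function, after `posts` has been scraped (posts.json is illustrative)
const fs = require("fs")
fs.writeFile("posts.json", JSON.stringify(posts, null, 2), function(err) {
  if (err) return console.log(err)
  console.log("posts.json saved!")
})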