[Scraping]: web scraping imgur.com
Node.js
scraping
10/01/2019
Intro
I wanted to scrape multiple wallpaper images from the imgur website.
With the strategy learned in [Scraping]: Basics, I tried to query all image src elements from the website.
However, imgur.com uses an API that doesn't show all the images at once; instead, it loads the next page on scrolling.
Inspect
Inspect element (using Google Chrome), then move to the Network tab and select XHR.
- The XHR filter shows the XMLHttpRequests built with JavaScript; these are the requests sent when the images are loaded.
Hover over the name of a request and take a look at the JSON file.
This reveals the URL where all the image names are stored. Let's open the link to retrieve the JSON data.
https://imgur.com/gallery/SU6bL/comment/best/hit.json
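If you'd rather check the payload outside the browser, a quick Node sketch can fetch and parse it (this assumes the endpoint serves the JSON directly, without redirects or extra headers):

var https = require("https")

https.get("https://imgur.com/gallery/SU6bL/comment/best/hit.json", function(res) {
  var body = ""
  res.on("data", function(chunk) {
    body += chunk
  })
  res.on("end", function() {
    // Inspect the top-level keys of the response
    console.log(Object.keys(JSON.parse(body)))
  })
})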
Getting data from the JSON file
Open the console and type in the following to parse the JSON file.
JSON.parse(document.querySelector("body").innerText)
Looking at the parsed data, you will see the images are stored under data -> image -> album_images -> images.
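Stripped down to the relevant fields, the response is shaped roughly like this (the hashes shown come from the output later in this post; all other fields are omitted):

{
  "data": {
    "image": {
      "album_images": {
        "images": [
          { "hash": "hNmDF6p" },
          { "hash": "7SrF82H" }
        ]
      }
    }
  }
}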
Type the following to get these images:
JSON.parse(document.querySelector("body").innerText).data.image.album_images.images
Then map it into an array of image hashes with the following:
Array.from(
  JSON.parse(document.querySelector("body").innerText).data.image.album_images.images
).map(imageName => imageName.hash)
Now we have extracted the image names. Let's use Puppeteer to download these images.
On Puppeteer
First, test extracting the array:
const puppeteer = require("puppeteer")

async function scrapeJSON() {
  // Open the browser
  const browser = await puppeteer.launch({
    headless: false,
  })
  const page = await browser.newPage()
  await page.goto("https://imgur.com/gallery/SU6bL/comment/best/hit.json")

  // Parse the JSON body and extract the image hashes
  const imageNames = await page.evaluate(() => {
    return Array.from(
      JSON.parse(document.querySelector("body").innerText).data.image
        .album_images.images
    ).map(imageName => imageName.hash)
  })
  console.log(imageNames)
  await browser.close()
}

scrapeJSON()
$ node imgurScrape.js
[ 'hNmDF6p', '7SrF82H', ....... 'hPayG82' ]
We obtained the image names. Let's build the URLs on top of the downloadImage() function from the last post.
Be sure to have the request module available.
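Both modules can be installed from npm:

$ npm install puppeteer request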
var urlPrefix = "https://i.imgur.com/"
var urlAffix = ".png"

// Create a URL from each entry in arr
for (var i = 0; i < arr.length; i++) {
  arr[i] = urlPrefix + arr[i] + urlAffix
}
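Equivalently, the URLs can be built without mutating the input array:

var urls = arr.map(function(hash) {
  return "https://i.imgur.com/" + hash + ".png"
})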
imgurScrape.js
Whole code snippet:
const puppeteer = require("puppeteer")

async function downloadImage(arr, imgPrefix) {
  var fs = require("fs"),
    request = require("request")

  // Download a single image by piping the response into a file
  var download = function(url, filename, callback) {
    request.head(url, function(err, res, body) {
      request(url)
        .pipe(fs.createWriteStream(filename))
        .on("close", callback)
    })
  }

  var urlPrefix = "https://i.imgur.com/"
  var urlAffix = ".png"

  // Create a URL from each entry in arr
  for (var i = 0; i < arr.length; i++) {
    arr[i] = urlPrefix + arr[i] + urlAffix
  }

  // Test downloading one image
  // download(arr[0], "test.png", function() {
  //   console.log("image created!!")
  // })

  // Download images with url
  for (var i = 0; i < arr.length; i++) {
    var imgName = imgPrefix + (i + 1).toString() + ".png"
    download(arr[i], imgName, function() {
      console.log("image created!")
    })
  }
}

async function scrapeJSON() {
  // Open the browser
  const browser = await puppeteer.launch({
    headless: false,
  })
  const page = await browser.newPage()
  await page.goto("https://imgur.com/gallery/SU6bL/comment/best/hit.json")

  // Extract the image hashes from the JSON body
  const imageNames = await page.evaluate(() => {
    return Array.from(
      JSON.parse(document.querySelector("body").innerText).data.image
        .album_images.images
    ).map(imageName => imageName.hash)
  })
  // console.log(imageNames);
  downloadImage(imageNames, "wallpaper-")
  await browser.close()
}

// Run
scrapeJSON()
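A side note on the download step: the request module works here, but the same piping can be done with Node's built-in https module if you'd rather skip the dependency. A minimal sketch, assuming direct, non-redirecting image URLs:

var https = require("https")
var fs = require("fs")

// Same idea as download() above, without the request dependency
function download(url, filename, callback) {
  https.get(url, function(res) {
    res.pipe(fs.createWriteStream(filename)).on("close", callback)
  })
}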
Result