Web scraping using Scala

This post shows how to parse and extract data from HTML documents using Scala and jsoup in a functional style.

In this example we will scrape IMDb reviews for a given movie ID, using jsoup to fetch and parse the HTML.


Create an sbt project and add the jsoup dependency to your build.sbt:

libraryDependencies += "org.jsoup" % "jsoup" % "1.11.2"
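For reference, a minimal build.sbt might look like the following; the project name and version numbers here are illustrative placeholders:

```scala
// build.sbt -- minimal sketch; name and scalaVersion are placeholders
name := "imdb-scraper"
scalaVersion := "2.12.4"

libraryDependencies += "org.jsoup" % "jsoup" % "1.11.2"
```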


The Review class:

class Review(title: String,
             text: String,
             user: String,
             name: String,
             date: String,
             score: String,
             spoiler: Boolean,
             votes: String,
             foundHelpful: String) {
  override def toString: String =
    List(title, name, user, date, spoiler, score, votes, foundHelpful, text).mkString(" | ")
}
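As a quick sanity check, we can construct a Review by hand and print it. The field values below are made up for illustration; the class is repeated here (condensed) so the snippet runs on its own:

```scala
// The Review class from above, condensed; toString joins the fields with " | ".
class Review(title: String, text: String, user: String, name: String,
             date: String, score: String, spoiler: Boolean,
             votes: String, foundHelpful: String) {
  override def toString: String =
    List(title, name, user, date, spoiler, score, votes, foundHelpful, text).mkString(" | ")
}

// Construct a review with made-up values and print it.
val review = new Review("A great film", "Loved every minute.", "ur1234567",
                        "Jane Doe", "1 January 2018", "9", spoiler = false,
                        "2345", "1234")
println(review)
// A great film | Jane Doe | ur1234567 | 1 January 2018 | false | 9 | 2345 | 1234 | Loved every minute.
```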


Given a URL, this method fetches the web page and returns it as a jsoup Document:

def getJsoupDoc(url: String): Document = Jsoup.connect(url).get()
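To experiment with selectors without hitting the network, jsoup can also parse an HTML string directly with Jsoup.parse. The markup in this sketch is made up for illustration:

```scala
import org.jsoup.Jsoup

// Parse an in-memory HTML snippet instead of fetching a URL.
val html = """<div class="title">A great film</div>"""
val doc = Jsoup.parse(html)

// select works the same on a parsed string as on a fetched document.
println(doc.select("div[class=title]").text())  // prints: A great film
```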


This method extracts the data from a jsoup Element and creates a Review model:

def reviewToReviewModel(review: Element): List[Review] = {

  // Not every review has a rating; fall back to "-1" when the score span is missing.
  def scoreOrMinus1(elements: Elements): String = {
    try {
      elements.get(1).text()
    } catch {
      case e: Exception => "-1"
    }
  }

  List(new Review(
    review.select("div[class=title]").text(),
    review.select("div[class=text show-more__control]").text(),
    review.select("div[class=display-name-date]").select("a[href]").attr("href").split("/")(2),
    review.select("div[class=display-name-date]").select("span").get(0).select("a").text(),
    review.select("div[class=display-name-date]").select("span").get(1).text(),
    scoreOrMinus1(review.select("div[class=ipl-ratings-bar]").select("span")),
    !review.select("span[class=spoiler-warning]").isEmpty,
    review.select("div[class=actions text-muted]").text.split("\n")(0).split(" ")(3).replace(",", ""),
    review.select("div[class=actions text-muted]").text.split("\n")(0).split(" ")(0).replace(",", "")
  ))
}
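To see what the split indices in the last two fields extract, here is the same logic applied to a sample of the "found helpful" text. The exact wording of the sample string is an assumption about IMDb's markup:

```scala
// Sample of the text inside div[class=actions text-muted]; the exact wording
// is an assumption about IMDb's markup.
val actions = "1,234 out of 2,345 found this helpful. Was this review helpful?"

val words = actions.split("\n")(0).split(" ")
val votes = words(3).replace(",", "")         // total votes cast
val foundHelpful = words(0).replace(",", "")  // votes that found the review helpful

println(s"$foundHelpful out of $votes")  // prints: 1234 out of 2345
```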


This method loops over the individual elements in elements, where each element is one review container. As the @tailrec annotation indicates, it is tail recursive, and it stops when count >= limit or when every element has been processed.

@tailrec
def scrapeReviewContainer(elements: Elements, idx: Int, acc: List[Review], count: Int, limit: Int): List[Review] = {
  if (idx >= elements.size() || count >= limit) acc
  else scrapeReviewContainer(elements, idx + 1, acc ::: reviewToReviewModel(elements.get(idx)), count + 1, limit)
}
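The same recursion pattern can be seen on a plain Scala list, so it runs without jsoup; the function name here is illustrative:

```scala
import scala.annotation.tailrec

// Same accumulation pattern as scrapeReviewContainer, on a plain List:
// stop at the limit or when the input is exhausted, otherwise append
// the current item to the accumulator and recurse.
@tailrec
def takeUpTo[A](items: List[A], acc: List[A], count: Int, limit: Int): List[A] =
  if (items.isEmpty || count >= limit) acc
  else takeUpTo(items.tail, acc :+ items.head, count + 1, limit)

println(takeUpTo(List("a", "b", "c", "d"), Nil, 0, 2))  // List(a, b)
```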


We have now created all the required methods. All that remains is to build the jsoup Document with getJsoupDoc and pass it to scrapeReviewContainer along with the maximum number of reviews to fetch. The following method returns the list of reviews:

def getAllReviews(id: String, limit: Int): List[Review] = {
  val doc = getJsoupDoc(s"http://www.imdb.com/title/$id/reviews/?sort=submissionDate&dir=desc")
  scrapeReviewContainer(doc.select("div[class=review-container]"), 0, Nil, 0, limit)
}

And we are done.

The complete code, with some added functionality, is available here.


This post is for educational purposes only.