Ruby V | Web Scraping ⛏️

Milan Parmar
3 min read · Jan 17, 2021

Welcome to the fifth instalment in my Ruby series…Web Scraping!

Web Scraping

Web scraping is the process of extracting data from a website. We scrape the data in the form of HTML and then export it in a format that is more useful to us. This process is necessary when there is no API offering the raw information for us to use.

‘There’s a gem for that!’

Say hello to…Nokogiri (鋸) 👋🏼. Nokogiri is the Ruby gem that will help us on our scraping journey. It provides an API for reading, writing, modifying and querying HTML and XML documents.

Vedabase CLI Project

Vedabase is a CLI application that allows users to read an introduction to one of A.C. Bhaktivedanta Swami Śrīla Prabhupāda’s (Founder Ācārya of ISKCON) main books. Users can step into a book, read its introduction, and then go back to read another. I created this application as one of my first projects with Flatiron and will be using examples from Vedabase to show the process of scraping.

Install Nokogiri

First things first, we need to install Nokogiri by running the following:

gem install nokogiri

It is important to add require 'nokogiri' to the environment.rb file like so:

require 'open-uri'
require 'nokogiri'

As you can see, I have also added require 'open-uri'. This is because we are scraping from a remote site, and open-uri encapsulates all the work of making an HTTP request into its open method (URI.open in current Ruby versions), making the operation as simple as opening a file on our own hard drive. 🧑🏽‍💻
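As a rough sketch of what that looks like in practice, the hypothetical helper below (fetch_html is my own name, not part of open-uri) reads a page body with URI.open and returns nil when the server answers with an HTTP error. The opener parameter is also an assumption, added only so the helper can be tried without a network connection.

```ruby
require 'open-uri'
require 'stringio'

# Hypothetical helper: returns the page body as a String, or nil on an HTTP error.
# `opener` defaults to URI.open; passing a stand-in lets us try it offline.
def fetch_html(url, opener: URI.method(:open))
  opener.call(url).read
rescue OpenURI::HTTPError => e
  warn "Could not fetch #{url}: #{e.message}"
  nil
end

# Offline stand-in that behaves like URI.open by returning an IO-like object
fake_opener = ->(_url) { StringIO.new('<h1>Library</h1>') }
puts fetch_html('https://vedabase.io/en/library/', opener: fake_opener)

# Real usage (needs network access):
#   html = fetch_html('https://vedabase.io/en/library/')
```

Returning nil on failure keeps the calling code simple, although a real application might prefer to retry or to surface the error to the user.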

Scraper Class

In this blog, I will not be going through the whole CLI application but just explaining the scraping process. In order to deal with scraping, I first created a scraper class like so:

class Vedabase::Scraper
end

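I also used the scope resolution operator (::) to define Scraper inside the Vedabase namespace. This keeps the class grouped with the rest of the application, and code elsewhere can refer to it by its full name, Vedabase::Scraper. Note that the Vedabase module must already be defined before this line runs.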
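Here is a small sketch of how :: namespacing behaves in plain Ruby. The VERSION constant and the describe method are hypothetical and exist only for illustration.

```ruby
# The namespace has to exist before `class Vedabase::Scraper` can be used
module Vedabase
  VERSION = '0.1.0' # hypothetical constant, for illustration only
end

class Vedabase::Scraper
  def self.describe
    # With this compact class syntax, constants of the outer module are
    # reached through their full path rather than by bare name
    "I am #{name}, version #{Vedabase::VERSION}"
  end
end

puts Vedabase::Scraper.describe
puts Vedabase.const_defined?(:Scraper)
```

If the module line were missing, Ruby would raise a NameError at the class definition, which is why the namespace is usually set up in a file that loads first.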

As I was scraping several pages to retrieve each introduction, I stored their URLs in a class variable:

class Vedabase::Scraper
  @@url_arrays = [
    "https://vedabase.io/en/library/bg/introduction/",
    "https://vedabase.io/en/library/sb/1/introduction/",
    "https://vedabase.io/en/library/cc/adi/introduction/"
  ]
end

I then added a class method, called scrape_title, where I would handle the scraping:

class Vedabase::Scraper
  @@url_arrays = [
    "https://vedabase.io/en/library/bg/introduction/",
    "https://vedabase.io/en/library/sb/1/introduction/",
    "https://vedabase.io/en/library/cc/adi/introduction/"
  ]
  def self.scrape_title
  end
end

Knock, Knock!…Who’s there?…Nokogiri…You May Enter 🚪

Now we call Nokogiri::HTML, which takes in the HTML and wraps it in a special Nokogiri document object.

def self.scrape_title
  doc = Nokogiri::HTML(URI.open("https://vedabase.io/en/library/"))
end

Notice that we called URI.open on the URL. This method comes from open-uri, as mentioned earlier. On Ruby versions before 3.0, a bare open also accepted URLs, but that form was deprecated in Ruby 2.7 and removed in 3.0.

CSS Selectors

Now we need to pick out the particular elements we want from the page. We do this by calling the css method with a CSS selector:

def self.scrape_title
  doc = Nokogiri::HTML(URI.open("https://vedabase.io/en/library/"))
  # Keep only the first three matching book tiles
  books = doc.css("div.col-6.col-sm-3.col-md-2.col-lg-2.text-center.book-item").slice(0, 3)
end

Next Steps

Now this may vary for you, but what I did was map over the results and pick out specific data. The first class method grabs the titles, and I created a second class method to scrape the introduction of each book.

def self.scrape_title
  doc = Nokogiri::HTML(URI.open("https://vedabase.io/en/library/"))
  books = doc.css("div.col-6.col-sm-3.col-md-2.col-lg-2.text-center.book-item").slice(0, 3)
  # Pair each scraped title with the introduction URL at the same index
  books.map.with_index do |book, i|
    title = book.css("a.book-title").text.strip
    url = @@url_arrays[i]
    Vedabase::Vedabase.new(title, url)
  end
end

def self.scrape_intro(url)
  doc = Nokogiri::HTML(URI.open(url))
  doc.css("div#content.row").text.strip
end
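The index-based pairing in scrape_title can be tried without any scraping at all. In this sketch the titles array is a placeholder for the scraped text and Book is a hypothetical stand-in for Vedabase::Vedabase; only the URLs come from the class variable shown earlier.

```ruby
# Hypothetical stand-in for Vedabase::Vedabase
Book = Struct.new(:title, :url)

# Placeholder titles; in the real scraper these come from the page
titles = ['Book One', 'Book Two', 'Book Three']

urls = [
  'https://vedabase.io/en/library/bg/introduction/',
  'https://vedabase.io/en/library/sb/1/introduction/',
  'https://vedabase.io/en/library/cc/adi/introduction/'
]

# map.with_index hands the block each element together with its position,
# so the title at index i is matched with the URL at index i
books = titles.map.with_index { |title, i| Book.new(title, urls[i]) }

# zip expresses the same pairing without a manual index
same_books = titles.zip(urls).map { |title, url| Book.new(title, url) }

books.each { |book| puts "#{book.title} -> #{book.url}" }
puts books == same_books
```

Either way, the pairing relies on the page listing the books in the same order as the URL array, so it is worth rechecking if the site layout changes.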

I then accessed the titles and introductions in a file named vedabase.rb by creating instance variables for them.
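This post does not show vedabase.rb itself, so the following is only a guess at its shape. The constructor arguments match the Vedabase::Vedabase.new(title, url) call above; the all method, the lazy intro reader and the FakeScraper used to try it out are assumptions.

```ruby
module Vedabase
  class Vedabase
    attr_reader :title, :url

    @@all = []

    def initialize(title, url)
      @title = title
      @url = url
      @intro = nil
      @@all << self
    end

    def self.all
      @@all
    end

    # Load the introduction only the first time it is requested.
    # `scraper` is any object responding to scrape_intro(url).
    def intro(scraper)
      @intro ||= scraper.scrape_intro(url)
    end
  end
end

# Offline stand-in for Vedabase::Scraper so the sketch runs without a network
class FakeScraper
  def self.scrape_intro(url)
    "Introduction fetched from #{url}"
  end
end

book = Vedabase::Vedabase.new('Book One', 'https://vedabase.io/en/library/bg/introduction/')
puts book.title
puts book.intro(FakeScraper)
puts Vedabase::Vedabase.all.length
```

Fetching the introduction lazily means the CLI only makes a request for the book the user actually picks, which is one reasonable way to keep start-up fast.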

I hope this short tutorial helps you with the many possibilities of web scraping in your Ruby applications and don’t forget to scrape responsibly!
