Published May 14, 2017

Let's Deal with Web Scraping (Vol-1) the Ruby Way

Web Scraping

An overview of how data is harvested and decorated for users.

Before digging deep into the concepts of web scraping, let us walk through how a user interacts with a website and how data is served to that user.

A broad overview of what happens when you interact with a website and submit a request.
1. The User performs an Action
Pre-requisite: Browser has loaded the web page and its associated javascript for you.
What actually Happens: The user interacts with the web page by clicking a link, typing in a search field, hovering the mouse, or scrolling to the end of the page. Earlier, clicking a link or performing a similar action sent a request to the server, which returned a whole new web page for the browser to load. Nowadays even scrolling down a page can trigger an event that sends a request to the server to fetch the latest updates and populate them in the browser. The Facebook and Twitter timelines work this way: as you scroll down, new posts are added to the feed.

2. The Browser responds to User Action
Pre-requisite: The user has performed some kind of action or a time-driven event has been triggered
What actually Happens: When the user performs an action, the JavaScript that was loaded with the web page runs, and the browser's behavior depends on that action. Say a user fills out the fields of a form and presses the submit button: the JavaScript prepares the form data and sends the request to the server. This processing, in which JavaScript in the browser responds to and guides user actions, is known as client-side processing.
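
The "prepare the form data" step can be imitated in Ruby. The snippet below is only a sketch: the field names and values are made up for illustration, and it assumes the form is submitted in the usual URL-encoded format.

```ruby
require 'uri'

# Hypothetical form fields; a real form defines its own names and values.
form_fields = { 'name' => 'Ada', 'topic' => 'web scraping & Ruby' }

# Encode the fields the way a browser encodes a submitted form
# (application/x-www-form-urlencoded): spaces become "+", "&" becomes "%26".
encoded_body = URI.encode_www_form(form_fields)

puts encoded_body
```

A scraper that needs to submit a form would send a string like this as the request body or query string.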

Let's look at a real-world example of client-side processing.
When "Ru" is typed into the Wikipedia search bar, JavaScript formulates and submits a new web request as soon as you type into the search field.

3. The Browser Sends Request to Server
Pre-requisite: The browser has formulated the request on its side and sent it to the server
What actually Happens: Hitting a request URL such as (https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=Ru&namespace=0&limit=10&suggest=true) in the browser sends a request with certain parameters to the server.

JavaScript in the browser is triggered when you type in the search bar and formulates such a request (https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=Ru&namespace=0&limit=10&suggest=true) without you pressing any submit button. This is where AJAX requests come into play, making the website more user-friendly.
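
To see which parameters travel with that request, the URL can be taken apart with Ruby's standard `uri` library. This is a small sketch; swapping the search term to "Ruby" at the end is just an illustration of how a scraper might reuse the same request.

```ruby
require 'uri'

url = 'https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=Ru&namespace=0&limit=10&suggest=true'

uri = URI.parse(url)
# Turn the query string into a Hash of parameter names and values.
params = URI.decode_www_form(uri.query).to_h

puts uri.host
puts uri.path
params.each { |name, value| puts "#{name} = #{value}" }

# Rebuild the same request for a different search term.
params['search'] = 'Ruby'
uri.query = URI.encode_www_form(params)
puts uri
```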

The browser reacts to your typing in the search field by creating the following web request:
Requested URL: https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=Ru&namespace=0&limit=10&suggest=true
Response received from the server:
["Ru",["Ru","Russia","Russian language","Rugby union","Russian Empire","Ruhollah Khomeini","Russian Revolution","Rubik's Cube","Runes","Rush Limbaugh"],["Ru, ru, or RU may refer to:","Russia (/ˈrʌʃə/; Russian: Россия, tr. Rossiya; IPA: [rɐˈsʲijə]), also officially the Russian Federation (Russian: Российская Федерация, tr.","Russian (ру́сский язы́к, russkiy yazik) is an East Slavic language and an official language in Russia, Belarus, Kazakhstan, Kyrgyzstan and many minor or unrecognised territories.","Rugby union, known in some parts of the world simply as rugby, is a contact team sport which originated in England in the first half of the 19th century.","The Russian Empire (also known as Russia) was a state that existed from 1721 until it was overthrown by the short-lived February Revolution in 1917. One of the largest empires in world history, stretching over three continents, the Russian Empire was surpassed in landmass only by the British and Mongol empires.","Sayyid Ruhollah Mūsavi Khomeini (Persian: سید روح‌الله موسوی خمینی‎‎, [ruːhoɫˈɫɑːhe χomeiˈniː], 24 September 1902 – 3 June 1989), known in the Western world as Ayatollah Khomeini, was an Iranian Shia Muslim religious leader, philosopher, revolutionary, and politician.","The Russian Revolution was a pair of revolutions in Russia in 1917, which dismantled the Tsarist autocracy and led to the eventual rise of the Soviet Union.","Rubik's Cube is a 3-D combination puzzle invented in 1974 by Hungarian sculptor and professor of architecture Ernő Rubik.","Runes (Proto-Norse: ᚱᚢᚾᛟ (runo), Old Norse: rún) are the letters in a set of related alphabets known as runic alphabets, which were used to write various Germanic languages before the adoption of the Latin alphabet and for specialised purposes thereafter.","Rush Hudson Limbaugh III (/ˈlɪmbɔː/, LIM-baw; born January 12, 1951) is an American radio talk show host and conservative political 
commentator."],["https://en.wikipedia.org/wiki/Ru","https://en.wikipedia.org/wiki/Russia","https://en.wikipedia.org/wiki/Russian_language","https://en.wikipedia.org/wiki/Rugby_union","https://en.wikipedia.org/wiki/Russian_Empire","https://en.wikipedia.org/wiki/Ruhollah_Khomeini","https://en.wikipedia.org/wiki/Russian_Revolution","https://en.wikipedia.org/wiki/Rubik%27s_Cube","https://en.wikipedia.org/wiki/Runes","https://en.wikipedia.org/wiki/Rush_Limbaugh"]]
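
The response above is an array of four elements: the search term, a list of titles, a list of descriptions, and a list of URLs. A rough sketch of how Ruby could pair them up follows; it uses a shortened copy of the response (two entries, with abbreviated descriptions) so it runs without a network connection.

```ruby
require 'json'

# Shortened copy of the response above: [term, titles, descriptions, urls].
# The descriptions are abbreviated for readability.
raw = '["Ru",["Ru","Russia"],' \
      '["Ru, ru, or RU may refer to:","Russia, also officially the Russian Federation"],' \
      '["https://en.wikipedia.org/wiki/Ru","https://en.wikipedia.org/wiki/Russia"]]'

term, titles, descriptions, urls = JSON.parse(raw)

# Combine the three parallel lists into one record per suggestion.
suggestions = titles.zip(descriptions, urls).map do |title, description, link|
  { title: title, description: description, url: link }
end

puts "Suggestions for #{term}:"
suggestions.each { |s| puts "#{s[:title]} -> #{s[:url]}" }
```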

4. The Server Side Computation
Pre-requisite: The server has received the browser's request
What actually Happens: The server processes the request
Server-side scripting sets up a connection with the database and fetches the result.
A query along these lines may have been run (a guess, not Wikipedia's actual query): "select topicName from topics where topicName like 'Ru%'"
The computation is done in a server-side program (a Ruby program, for instance) that is hidden from the user, and raw JSON, XML, or HTML formatted data is sent back to the user.
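
The server-side step can be pictured with a toy example. This is not how Wikipedia works internally; it is a sketch that assumes a small in-memory list instead of a database, and the names `TOPICS`, `suggest`, and `handle_search` are invented for illustration.

```ruby
require 'json'

# Hypothetical in-memory stand-in for the topics table.
TOPICS = ['Ru', 'Russia', 'Rugby union', 'Ruby', 'Python'].freeze

# Mimics a SQL prefix match such as: topicName LIKE 'Ru%' (case-insensitive here).
def suggest(prefix, limit: 10)
  matches = TOPICS.select { |topic| topic.downcase.start_with?(prefix.downcase) }
  matches.first(limit)
end

# Package the result as the raw JSON that the browser (or a scraper) receives.
def handle_search(prefix)
  JSON.generate([prefix, suggest(prefix)])
end

puts handle_search('Ru')
```

The user never sees `suggest` or the data behind it, only the JSON string that `handle_search` returns.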

5. The Server Response
Pre-requisite: The server has finished processing the requested URL
What actually Happens: The server sends either the requested data or an error code. The response is typically structured text, such as HTML, XML, or JSON.
This is where the actual web scraping takes place. A Ruby program grabs the response from the server and parses it instead of rendering it in a web browser. The program doesn't need to see a nicely formatted web page; it just needs the server's response in its raw form.
JSON format response:
["Ru",["Ru","Russia","Russian language","Rugby union","Russian Empire","Ruhollah Khomeini","Russian Revolution","Rubik's Cube","Runes","Rush Limbaugh"],["Ru, ru, or RU may refer to:","Russia (/ˈrʌʃə/; Russian: Россия, tr. Rossiya; IPA: [rɐˈsʲijə]), also officially the Russian Federation (Russian: Российская Федерация, tr.","Russian (ру́сский язы́к, russkiy yazik) is an East Slavic language and an official language in Russia, Belarus, Kazakhstan, Kyrgyzstan and many minor or unrecognised territories.","Rugby union, known in some parts of the world simply as rugby, is a contact team sport which originated in England in the first half of the 19th century.","The Russian Empire (also known as Russia) was a state that existed from 1721 until it was overthrown by the short-lived February Revolution in 1917. One of the largest empires in world history, stretching over three continents, the Russian Empire was surpassed in landmass only by the British and Mongol empires.","Sayyid Ruhollah Mūsavi Khomeini (Persian: سید روح‌الله موسوی خمینی‎‎, [ruːhoɫˈɫɑːhe χomeiˈniː], 24 September 1902 – 3 June 1989), known in the Western world as Ayatollah Khomeini, was an Iranian Shia Muslim religious leader, philosopher, revolutionary, and politician.","The Russian Revolution was a pair of revolutions in Russia in 1917, which dismantled the Tsarist autocracy and led to the eventual rise of the Soviet Union.","Rubik's Cube is a 3-D combination puzzle invented in 1974 by Hungarian sculptor and professor of architecture Ernő Rubik.","Runes (Proto-Norse: ᚱᚢᚾᛟ (runo), Old Norse: rún) are the letters in a set of related alphabets known as runic alphabets, which were used to write various Germanic languages before the adoption of the Latin alphabet and for specialised purposes thereafter.","Rush Hudson Limbaugh III (/ˈlɪmbɔː/, LIM-baw; born January 12, 1951) is an American radio talk show host and conservative political 
commentator."],["https://en.wikipedia.org/wiki/Ru","https://en.wikipedia.org/wiki/Russia","https://en.wikipedia.org/wiki/Russian_language","https://en.wikipedia.org/wiki/Rugby_union","https://en.wikipedia.org/wiki/Russian_Empire","https://en.wikipedia.org/wiki/Ruhollah_Khomeini","https://en.wikipedia.org/wiki/Russian_Revolution","https://en.wikipedia.org/wiki/Rubik%27s_Cube","https://en.wikipedia.org/wiki/Runes","https://en.wikipedia.org/wiki/Rush_Limbaugh"]]
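
Since the server may answer with data or with an error code, a scraper usually checks the status before parsing. The helper below is a hypothetical sketch (the name `interpret_response` and its return format are assumptions); it takes the status code, the Content-Type header, and the raw body as plain values so it can be tried without a network connection.

```ruby
require 'json'

# Hypothetical helper: decide what to do with a response given its
# status code, Content-Type header, and raw body.
def interpret_response(status, content_type, body)
  # Anything outside the 2xx range is treated as an error.
  return { error: "HTTP #{status}" } unless (200..299).cover?(status)

  if content_type.to_s.include?('json')
    { data: JSON.parse(body) }
  else
    { data: body } # HTML or XML: keep the raw text for a dedicated parser
  end
end

p interpret_response(200, 'application/json; charset=utf-8', '["Ru",["Russia"]]')
p interpret_response(404, 'text/html', '')
```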

So let's bring web scraping into the picture.
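# rest-client fetches the URL; crack parses the JSON response
require 'crack'
require 'rest-client'

url = 'https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=Ru&namespace=0&limit=10&suggest=true'
response = RestClient.get(url)
# puts flattens the nested array and prints each element on its own line
puts Crack::JSON.parse(response.body)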

This prints the following output:
Ru
Ru
Russia
Russian language
Rugby union
Russian Empire
Ruhollah Khomeini
Russian Revolution
Rubik's Cube
Runes
Rush Limbaugh
etc.
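
The same idea can be sketched with only Ruby's standard library, in case the gems above are not installed. The method names `opensearch_uri` and `fetch_titles` are invented for this sketch, and the query parameters are copied from the request URL used throughout this post. The live call is left commented out because it needs network access.

```ruby
require 'json'
require 'net/http'
require 'uri'

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'

# Build the opensearch URL for a given term.
def opensearch_uri(term, limit: 10)
  uri = URI.parse(API_ENDPOINT)
  uri.query = URI.encode_www_form(
    action: 'opensearch', format: 'json', formatversion: 2,
    search: term, namespace: 0, limit: limit, suggest: true
  )
  uri
end

# Fetch the suggestions and return only the titles; raises on a non-2xx reply.
def fetch_titles(term)
  response = Net::HTTP.get_response(opensearch_uri(term))
  raise "Request failed: HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)[1]
end

# Usage (performs a live request):
# puts fetch_titles('Ru')
```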

In the end, the browser makes it look pretty.

Discover and read more posts from Neha Chopra