The time has come to dig deeper and learn more about the art of web scraping. In our previous articles we covered the possible legal issues of web scraping and showed how to start scraping with Ruby. Now it is the right time to learn how to scrape JavaScript-heavy pages using Wombat and Mechanize.

Why use Wombat and Mechanize?

But wait, you just taught us how to scrape with Ruby. Why do we need another tool? The approach we showed in the previous article works well for static HTML pages, but it falls short on pages that rely on JavaScript. The Wombat and Mechanize gems are more powerful and can handle a much wider range of websites. Keep in mind, however, that they require more resources, so we recommend using them only where they are really needed.

How to use Mechanize?

To get things rolling with Mechanize, install the gem and point it at the page you want to scrape.

[ruby]
$ gem install mechanize

require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get('http://someurl.com/')

puts page.title
[/ruby]

Mechanize will now fetch the page at someurl.com. It uses Nokogiri under the hood to parse the HTML, so once you have pointed it at the required page, you can use the familiar Nokogiri search methods to scrape it, as shown below.

[ruby]
mechanize.get('http://someurl.com/').search("p.posted")
[/ruby]
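
Because search returns an ordinary Nokogiri node set, you can iterate over the results and extract text with the usual Nokogiri methods. Here is a minimal sketch, assuming the hypothetical someurl.com page actually contains p.posted elements:

[ruby]
require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get('http://someurl.com/')

# search returns a Nokogiri::XML::NodeSet, so the familiar
# Nokogiri methods (text, [], map, each, ...) all work here
page.search('p.posted').each do |node|
  puts node.text.strip
end
[/ruby]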

Mechanize's page search accepts both CSS selectors and XPath expressions:

[ruby]
mechanize.get('http://someurl.com/').search(".//p[@class='posted']")
[/ruby]
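
Both calls find the same elements, so which syntax you use is mostly a matter of preference. A small sketch to illustrate, again against the hypothetical someurl.com page:

[ruby]
page = mechanize.get('http://someurl.com/')

# the CSS selector and the XPath expression should match the same
# nodes, assuming the posts use exactly class="posted"
css_nodes   = page.search("p.posted")
xpath_nodes = page.search(".//p[@class='posted']")

puts css_nodes.length == xpath_nodes.length
[/ruby]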

You can find more details in the Mechanize documentation.


How to use Wombat?

The Wombat gem is another powerful tool for scraping almost any page. Let's see how to use it properly.

First install the gem:

[ruby]
gem install wombat
[/ruby]

Then start scraping the page via a Wombat.crawl call:

[ruby]
require 'wombat'

Wombat.crawl do
  base_url "http://someurl.com"
  path "/"

  headline xpath: "//h1"
  subheading css: "p.subheading"

  what_is({ css: ".teaser h3" }, :list)

  links do
    explore xpath: '//*[@id="wrapper"]/div[1]/div/ul/li[1]/a' do |e|
      e.gsub(/Explore/, "Love")
    end

    search css: '.search'
    features css: '.features'
    blog css: '.blog'
  end
end
[/ruby]
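
One handy detail: Wombat.crawl returns the scraped properties as a plain Ruby hash, so you can feed the result straight into the rest of your code. A minimal sketch, again assuming the hypothetical someurl.com page:

[ruby]
require 'wombat'

results = Wombat.crawl do
  base_url "http://someurl.com"
  path "/"

  headline xpath: "//h1"
  subheading css: "p.subheading"
end

# results is a hash keyed by the property names declared above,
# e.g. { "headline" => "...", "subheading" => "..." }
puts results["headline"]
[/ruby]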

Wombat has many more functions and settings than we covered here. You can find more information in its documentation.


To scrape or not to scrape

As you can see from our three articles, web scraping can be very useful when working with large amounts of data. It can save you tons of time and simply make your business more efficient. With just a few Ruby gems you can scrape almost any page. We hope you now know how and when to use web scraping, and we are always ready to help you with any questions you may have.

Questions? Comments? Let’s talk about them in the comments section below.



Author

Daria Stolyar is a Marketing Manager at Rubyroid Labs. You can follow her on LinkedIn.
