We keep telling you how web scraping can make your life a little easier. In part 1 you learned about the legal issues around it, which mostly come down to a discussion of how ethical some people's behavior on the Internet is. The technique of web scraping can do a lot of good. And today we will show you how to use it with the help of Ruby and Nokogiri.

Getting started

Ruby and Nokogiri work great when you need to scrape HTML + CSS pages. They can save you tons of time on migrating and exporting data. Once you learn this technique, you will be ready for any data transition the world has up its sleeve for you. So let's get down to some Ruby stuff! You will need:

  1. Ruby 1.9.3 or higher
  2. RubyGems
  3. The Nokogiri and Sanitize gems, plus the CSV and Find standard libraries
  4. Some really basic Ruby skills; even a beginner's will do.

If you are missing any of the gems, install them with 'gem install' in the Terminal.

Prepare for scraping

The technique of web scraping basically means telling the system to read all the pages and save them locally. If you are on cheap hosting, this might lead to load issues. Another problem is that not all URLs may be accessible to you. But let's imagine we have a pretty accessible and stable website.

Create a new file and name it something like scrape.rb. Declare the required gems at the beginning.

It is a good idea to get rid of all the bad symbols right away, so that you won't have to bother with them later. Here's how we can handle the curly apostrophes and quotes Microsoft Word uses:
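One way to do it is to replace the curly characters with their plain ASCII equivalents; a minimal sketch (the character list here is ours and can be extended):

```ruby
# Replace Microsoft "smart" apostrophes and quotes with plain ASCII ones.
def clean_symbols(text)
  text.gsub(/[\u2018\u2019]/, "'")    # curly single quotes -> '
      .gsub(/[\u201C\u201D]/, '"')    # curly double quotes -> "
end
```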

And now let's clean up the HTML tags to keep the text well-formatted:

Nokogiri time

Now we are ready for our first scrape. The Nokogiri gem will basically do all the hard work for us. Let's see how it scrapes data from /blog:

What happens here is that Ruby checks the files in the blog folder one after another, looking for HTML ones, and makes sure old, draft and archive posts are not included.

Using .css you can pick out the content structure, and the text and to_html methods let you do some formatting.

Rolling things up

Now that Nokogiri has done its scraping magic, let's save the result to a CSV file. Guess what will help us with that? The CSV gem!
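A minimal sketch of the export step, assuming the title/body hashes from the previous step (the headers and file name are our assumptions):

```ruby
require 'csv'

# Dump the extracted posts into a CSV file with a header row.
def save_posts(posts, path = 'posts.csv')
  CSV.open(path, 'w') do |csv|
    csv << %w[title body]
    posts.each { |post| csv << [post[:title], post[:body]] }
  end
end
```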

Don’t forget to echo the job done:
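Something as simple as this will do (the wording of the message is ours):

```ruby
# Echo a short message so you can see the script finished.
def report_done(count)
  "Scraping done: #{count} posts saved to posts.csv"
end

puts report_done(3)   # prints "Scraping done: 3 posts saved to posts.csv"
```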

And trigger the script by running 'ruby scrape.rb' in the Terminal.

That's it. Now you know how to scrape HTML + CSS pages. But remember that this method won't work for JavaScript-rendered pages. What should you do then? We will tell you the answer in the next article. Stay tuned!



Author

Business Analyst at Rubyroid Labs
