We keep telling you how web scraping can make your life a little easier. In part 1 you could learn that the legal issues around it largely boil down to a discussion of how ethically some people behave on the Internet. The technique of web scraping can do a lot of good, and today we will show you how to use it with the help of Ruby and Nokogiri.

Getting started

Ruby and Nokogiri work great when you need to scrape HTML + CSS pages. They can save you tons of time on migrating and exporting data. As soon as you learn this technique, you will be ready for any data transition the world has up its sleeve for you. So let’s get down to some Ruby stuff! You will need:

  1. Ruby 1.9.3 or higher
  2. RubyGems
  3. Nokogiri, CSV, Sanitize, and Find gems
  4. Some really basic Ruby skills that even a beginner has.

If you don’t have some of those, install them with 'gem install nokogiri sanitize' in the Terminal (find and csv ship with Ruby’s standard library).

Prepare for scraping

The technique of web scraping basically means telling the system to read all the pages and save them locally. If you are on cheap hosting, this might lead to some load issues. Another problem is that not all URLs might be available to you. But let’s imagine we have a perfectly accessible and stable website.
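If you still need to pull the pages down to disk first, a few lines of Ruby will do. This is only a minimal sketch; the site URL and page paths below are made up purely for illustration:

[ruby]
# a minimal sketch of saving a few pages locally before scraping;
# the site URL and page paths are made up for illustration
require 'open-uri'
require 'fileutils'

site  = 'https://example.com'
pages = ['/blog/post-1.html', '/blog/post-2.html']

pages.each do |path|
  local = File.join(Dir.getwd, path)
  FileUtils.mkdir_p(File.dirname(local))          # mirror the folder structure locally
  File.write(local, URI.open(site + path).read)   # URI.open needs Ruby 2.5+; use open() on older Rubies
end
[/ruby]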

Create a new file and name it something like scrape.rb. Declare the required gems at the beginning.

[ruby]
# include required gems
require 'find'
require 'rubygems'
require 'nokogiri'
require 'sanitize'
require 'csv'
[/ruby]

It is a good idea to get rid of all bad symbols right away, so that you won’t have to bother later. Here’s how we can replace the smart apostrophes and quotes Microsoft Word uses with plain ones:

[ruby]
# generic function to replace MS Word smart quotes and apostrophes with plain ones
def strip_bad_chars(text)
  text.gsub!(/\u2018/, "'")          # left smart single quote -> apostrophe
  text.gsub!(/\u2019/, "'")          # right smart single quote -> apostrophe
  text.gsub!(/[\u201C\u201D]/, '"')  # smart double quotes -> straight double quote
  return text
end
[/ruby]
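To see what the function does, here is a quick check on a made-up string (the smart quotes are written as Unicode escapes so they survive copy-pasting):

[ruby]
# quick sanity check with a made-up string containing MS-style smart quotes
title = "Don\u2019t \u201Cquote\u201D me"
puts strip_bad_chars(title)   # => Don't "quote" me
[/ruby]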

And now let’s clean HTML tags to keep text well-formatted:

[ruby]
def clean_body(text)
  # collapse newlines and runs of whitespace
  text.gsub!(/(\r)?\n/, "")
  text.gsub!(/\s+/, ' ')

  # extra muscle: clean up crappy HTML tags and specify which attributes are allowed
  text = Sanitize.clean(text,
    :elements => ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'a', 'b', 'strong', 'em', 'img', 'iframe'],
    :attributes => {
      'a'      => ['href', 'title', 'name'],
      'img'    => ['src', 'title', 'alt'],
      'iframe' => ['src', 'url', 'class', 'id', 'width', 'height', 'name'],
    },
    :protocols => {
      'a'      => { 'href' => ['http', 'https', 'mailto'] },
      'iframe' => { 'src'  => ['http', 'https'] }
    })

  # strip leading and trailing whitespace
  text = text.strip
  return text
end
[/ruby]
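To get a feel for the result, here is what the function does to a small, made-up snippet of markup (the exact output may vary slightly between Sanitize versions):

[ruby]
# example input: a wrapper div and an inline style, neither of which is allowed
html = "<div class='wrap'>\n  <p style='color:red'>Hello <em>world</em></p>\n</div>"
puts clean_body(html)   # => <p>Hello <em>world</em></p>
[/ruby]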

Nokogiri time

Now we are ready for our first web scraping. The Nokogiri gem will basically do all the hard work for us. Let’s see how it scrapes data from the /blog folder:

[ruby]
# this is the main logic that recursively searches from the current directory down
# and parses the HTML files
def parse_html_files
  Find.find(Dir.getwd) do |file|
    if !File.directory?(file) and File.extname(file) == '.html'
      # exclude and skip if in a bad directory
      # we may be on an html file, but some we just do not want
      current = File.new(file).path

      # stick to just the blog folder
      if not current.match(/(blog)/)
        next
      end

      # however, skip these folders entirely
      if current.match(/(old|draft|archive)/)
        next
      end

      # open file, pluck content out by its element(s)
      page = Nokogiri::HTML(open(file))

      # grab the title
      title = page.css('title').text.to_s
      title = strip_bad_chars(title)

      # for the page title, destroy any pipes (plain or MS-style) and everything after them
      title.gsub!(/[│|],{0,}(.*)+/, '')

      # grab the body content
      body = page.css('section article').to_html
      body = clean_body(body)

      # clean the file path
      path = File.new(file).path
      path.gsub! $base_path, "/"

      # if we have content, add this page to our posts array
      if body.length > 0
        $count += 1
        puts "Processing " + title

        # insert into the array
        data = {
          'path'  => path,
          'title' => title,
          'body'  => body,
        }
        $posts.push data
      end
    end
  end

  write_csv($posts)
  report($count)
end
[/ruby]

What happens here is that Ruby walks through the files in the blog folder one after another, looking for HTML ones, and makes sure posts in the old, draft and archive folders are not included.

Using .css you can pluck content out by its selectors, while the .text and .to_html methods let you control how the extracted content is formatted.
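Here is a tiny illustration of those calls on a made-up page snippet:

[ruby]
# a made-up page snippet to show what .css, .text and .to_html return
page = Nokogiri::HTML("<html><head><title>My post</title></head><body><section><article><p>Hi</p></article></section></body></html>")
puts page.css('title').text               # => My post
puts page.css('section article').to_html  # => <article><p>Hi</p></article>
[/ruby]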

Rolling things up

Now that Nokogiri has done its scraping magic, let’s save the results to a CSV file. Guess what will help us with that? The CSV gem!

[ruby]
# this creates a CSV file from the posts array built above
def write_csv(posts)
  CSV.open('posts.csv', 'w') do |writer|
    writer << ["path", "title", "body"]
    posts.each do |c|
      writer << [c['path'], c['title'], c['body']]
    end
  end
end
[/ruby]
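If you want to double-check the output, reading the file back is a few lines (assuming posts.csv was written to the current directory):

[ruby]
# quick sanity check of the generated file
CSV.foreach('posts.csv', :headers => true) do |row|
  puts "#{row['title']} -> #{row['path']}"
end
[/ruby]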

Don’t forget to echo the job done:

[ruby]
# echo to the console how many posts were written to the CSV file
def report(count)
  puts "#{count} html posts were processed to #{Dir.getwd}/posts.csv"
end
[/ruby]

And trigger the script:

[ruby]
# trigger everything

parse_html_files
[/ruby]
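One thing worth noting: the snippets above rely on the globals $posts, $count and $base_path, which never get initialized in the listings. Here is a minimal sketch of how they might be set up near the top of scrape.rb, right after the requires (the $base_path value is just an assumption, adjust it to your own folder layout):

[ruby]
# assumed setup for the globals the script relies on
$posts     = []          # array of scraped pages
$count     = 0           # number of processed posts
$base_path = Dir.getwd   # assumption: strip the working directory from saved file paths
[/ruby]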

That is it. Now you know how to scrape HTML + CSS pages. But remember that this method won’t work for JavaScript-rendered pages. What should you do then? We will tell you the answer in the next article. Stay tuned!



Author

Daria Stolyar is a Marketing Manager at Rubyroid Labs. You can follow her on LinkedIn.
