We keep telling you how web scraping can make your life a little easier. In part 1 you learned that the legal issues around it mostly come down to discussions of how ethical some people's behavior on the Internet is. Web scraping itself can do a lot of good, and today we will show you how to use it with the help of Ruby and Nokogiri.
Getting started
Ruby and Nokogiri work great when you need to scrape HTML + CSS pages, and they can save you tons of time on migrating and exporting data. Once you learn this technique, you will be ready for any data transition the world has up its sleeve for you. So let’s get down to some Ruby stuff! You will need:
- Ruby 1.9.3 or higher
- RubyGems
- Nokogiri, CSV, Sanitize, and Find gems
- And some really basic Ruby skills that even a beginner has.
If you don’t have any of those, install them with 'gem install <gem name>' in the Terminal. Find and CSV ship with Ruby’s standard library, so in practice only Nokogiri and Sanitize usually need installing.
Prepare for scraping
Web scraping basically means telling the system to read all the pages and save them locally. If the site sits on cheap hosting, this might cause load issues, and not every URL may be accessible to you. But let’s imagine we have a reasonably accessible and stable website.
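If you don’t already have a local copy of the site, one way to get the pages onto your disk is to fetch them with Ruby’s built-in net/http. This is just a minimal sketch, separate from the script we build below; the URL and the blog folder name are placeholders for illustration:
[ruby]
# a minimal sketch (separate from the script below): fetch pages over HTTP
# and save them locally, so the scraper can work on files instead of the live site
require 'net/http'
require 'fileutils'

urls = ['http://example.com/blog/first-post.html'] # replace with the real blog URLs
FileUtils.mkdir_p('blog')

urls.each do |url|
  html = Net::HTTP.get(URI(url))                   # raw HTML of the page
  File.write(File.join('blog', File.basename(url)), html)
end
[/ruby]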
Create a new file and name it something like scrape.rb. Declare the required gems at the beginning:
[ruby]
# include required gems
require 'find'
require 'rubygems'
require 'nokogiri'
require 'sanitize'
require 'csv'
[/ruby]
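The functions we are about to write also rely on a few globals: an array to collect the posts, a counter, and the base path that gets stripped from file names. Let’s define them right away. Pointing $base_path at the current directory is a reasonable choice, assuming you run the script from the root of your exported site:
[ruby]
# globals used by the functions below
$posts = []            # every scraped page ends up in this array
$count = 0             # number of processed pages
$base_path = Dir.getwd # stripped from file paths to get site-relative paths
[/ruby]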
It is a good idea to get rid of all the bad characters right away, so that you won’t have to bother with them later. Here’s how we can replace the smart quotes and apostrophes Microsoft Word likes to insert:
[ruby]
# generic function to replace MS Word smart quotes and apostrophes
def strip_bad_chars(text)
  text.gsub!(/\u2018/, "'")          # left single smart quote
  text.gsub!(/\u2019/, "'")          # right single smart quote / apostrophe
  text.gsub!(/[\u201C\u201D]/, '"')  # left and right double smart quotes
  return text
end
[/ruby]
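To see the helper in action, here is a quick check with a made-up string full of Word-style quotes:
[ruby]
# quick check with a made-up example string
sample = "Don\u2019t \u201Cscrape\u201D blindly"
puts strip_bad_chars(sample)   # prints: Don't "scrape" blindly
[/ruby]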
And now let’s clean up the HTML tags to keep the text well-formatted:
[ruby]
def clean_body(text)
  text.gsub!(/(\r)?\n/, "")
  text.gsub!(/\s+/, ' ')
  # extra muscle: clean up crappy HTML tags and specify which attributes are allowed
  text = Sanitize.clean(text,
    :elements => ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'a', 'b', 'strong', 'em', 'img', 'iframe'],
    :attributes => {
      'a' => ['href', 'title', 'name'],
      'img' => ['src', 'title', 'alt'],
      'iframe' => ['src', 'url', 'class', 'id', 'width', 'height', 'name'],
    },
    :protocols => {
      'a' => {
        'href' => ['http', 'https', 'mailto']
      },
      'iframe' => {
        'src' => ['http', 'https']
      }
    })
  # clean start and end whitespace
  text = text.strip
  return text
end
[/ruby]
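To get a feel for what clean_body does, here is a quick example with a made-up snippet. (A side note: on newer versions of the Sanitize gem the same call is spelled Sanitize.fragment instead of Sanitize.clean, so adjust the code above if you are on a recent release.)
[ruby]
# made-up snippet: the span tag and its onclick attribute are stripped, whitespace is collapsed
html = "<p>Hello\n   <span onclick=\"evil()\">world</span></p>"
puts clean_body(html)   # prints: <p>Hello world</p>
[/ruby]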
Nokogiri time
Now we are ready for our first scrape. The Nokogiri gem will do most of the hard work for us. Let’s see how it scrapes data from the /blog folder:
[ruby]
# this is the main logic that recursively searches from the current directory down and parses the HTML files
def parse_html_files
  Find.find(Dir.getwd) do |file|
    if !File.directory? file and File.extname(file) == '.html'
      # exclude and skip if in a bad directory
      # we may be on an html file, but some we just do not want
      current = File.new(file).path
      # stick to just the blog folder
      if not current.match(/(blog)/)
        next
      end
      # however, skip these folders entirely
      if current.match(/(old|draft|archive)/)
        next
      end
      # open file, pluck content out by its element(s)
      page = Nokogiri::HTML(File.read(file))
      # grab title
      title = page.css('title').text.to_s
      title = strip_bad_chars(title)
      # for the page title, destroy any pipes (and the "MS pipe" \u2502) plus whatever follows them
      title.gsub!(/[\u2502|].*/, '')
      # grab the body content
      body = page.css('section article').to_html
      body = clean_body(body)
      # clean the file path
      path = File.new(file).path
      path.gsub! $base_path, ''
      # if we have content, add this as a page to our page array
      if (body.length > 0)
        $count += 1
        puts "Processing " + title
        # insert into array
        data = {
          'path' => path,
          'title' => title,
          'body' => body,
        }
        $posts.push data
      end
    end
  end
  write_csv($posts)
  report($count)
end
[/ruby]
What happens here is that Ruby walks the current directory recursively, looking for HTML files, keeps only the ones inside the blog folder, and makes sure old, draft, and archive posts are not included.
The .css method selects elements by CSS selector; the text and to_html methods then give you their contents as plain text or as HTML, so you can format the data the way you need.
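For instance, given a tiny made-up page, the three methods behave like this:
[ruby]
# a tiny made-up page to illustrate .css, .text and .to_html
page = Nokogiri::HTML('<title>Hi | My Blog</title><section><article><p>Post body</p></article></section>')
puts page.css('title').text               # prints: Hi | My Blog
puts page.css('section article').to_html  # prints the <article> element and its children as HTML
[/ruby]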
Rolling things up
Now that Nokogiri has done its scraping magic, let’s save the results to a CSV file. Guess what will help us with that? The CSV gem!
[ruby]
# this creates a CSV file from the posts array built above
def write_csv(posts)
  CSV.open('posts.csv', 'w') do |writer|
    writer << ["path", "title", "body"]
    posts.each do |c|
      writer << [c['path'], c['title'], c['body']]
    end
  end
end
[/ruby]
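If you want to double-check the result once the script has run, the same gem can read the file straight back:
[ruby]
# quick sanity check: read the CSV back and print each post title
CSV.foreach('posts.csv', :headers => true) do |row|
  puts row['title']
end
[/ruby]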
Don’t forget to echo the job done:
[ruby]
# echo to the console how many posts were written to the CSV file
def report(count)
  puts "#{count} html posts were processed to #{Dir.getwd}/posts.csv"
end
[/ruby]
And trigger the script:
[ruby]
# trigger everything
parse_html_files
[/ruby]
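Run the script from the root folder of your exported site (the one that contains the blog directory) with ruby scrape.rb, and posts.csv will show up in that same directory.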
That’s it. Now you know how to scrape HTML + CSS pages. Just remember that this method won’t work for pages rendered with JavaScript. What should you do then? We will tell you in the next article. Stay tuned!