Getting Started with Nokogiri

What Is Nokogiri?

Nokogiri is a Ruby document parser that was released in October 2008. It’s great for reading, searching, editing, and otherwise mucking around with HTML (and XML).

Nokogiri Basics

After installing Nokogiri, run the following Ruby program:

require 'rubygems' # only needed on Ruby 1.8; later versions load RubyGems automatically
require 'nokogiri'

html = %Q{
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>A Riddle</title>
  </head>
  <body>
    <p id="question">
      How many web servers does it take to screw in a light bulb?
    </p>
    <p id="answer">
      200, OK?
    </p>
  </body>
</html>
}
 
doc = Nokogiri::HTML(html)
body = doc.at_css("body")
body['style'] = "background-color: blue;"
html = doc.to_html
 
puts html

First we translate the HTML string into a Nokogiri document. Then we find the body element, change its style, and translate the Nokogiri document back into an HTML string. Easy as pi!

Locating Elements

Nokogiri makes locating HTML elements easy. You can search the HTML with XPath, but Nokogiri also lets you locate elements using CSS3 selectors.

This finds all paragraphs:

doc.css("p")

This finds the paragraph with id "answer":

doc.at_css("p#answer")

Notice that doc.css returns every matching element (also referred to as a node), while doc.at_css returns only the first matching element, or nil if nothing matches. Multiple nodes are returned as a Nokogiri::XML::NodeSet object, which is an array-like collection of Nokogiri::XML::Node objects.

Changing Elements

This is how you change an element's text:

doc = Nokogiri::HTML(html)
question = doc.at_css("p#question")
question.content = "What is 50*PI rounded to the nearest hundred?"
html = doc.to_html

We assign to content to replace the node's text. We can also set attributes on the node like this:

question['class'] = "math"

Adding Elements

First we need to create a new node:

new_paragraph = Nokogiri::XML::Node.new("p", doc)
new_paragraph.content = "Thanks!"

Then we add it to the body node:

body = doc.at_css("body")
body << new_paragraph

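This makes it the last paragraph. A node can only live in one place in the document, so adding it again somewhere else moves it. We can make it the first paragraph instead like this: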

body.children.first.add_previous_sibling(new_paragraph)

Deleting Elements

This is how you delete a node:

answer_paragraph = doc.at_css("p#answer")
answer_paragraph.remove

Diving Deeper into Nokogiri

As you can see, it’s really easy to get started with Nokogiri. Here are some resources for going deeper:

As you’re working with Nokogiri, you may notice that it can very quickly turn from easy to frustrating since document parsing is an inherently complex problem. Be sure to read the sequel to this blog post, Getting Mugged by Nokogiri, to learn how to avoid some Nokogiri pitfalls.

One Response to “Getting Started with Nokogiri”

  1. Thanks for the overview of Nokogiri, and also for the precise links to documentation. I’d read that the nokogiri website was helpful, but I found it quite the opposite until I followed your pointer to the Nokogiri::XML::Node page [http://nokogiri.org/Nokogiri/XML/Node.html]. Nicholas’s Blog was also helpful. Thanks again.

    P.S. Nice puns in your Nokogiri posts.
