Getting Mugged by Nokogiri

The Scene of the Crime

Once you’ve gotten started with Nokogiri, you’ll expect everything to be as easy as the beginning, and then–

BAM!

–you’ll get sucker-punched.

Here’s the scene of the crime:

html = %Q{
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>A Riddle</title>
  </head>
  <body>
    <p id="question">
      Did you hear about the HTTP quiz?
    </p>
    <p id="answer">
      Yeah, it was <span class="number">300</span> multiple choices.
    </p>
  </body>
</html>
}

The Kidney Punch

Let’s say you innocently want to change the text of all paragraphs to uppercase.

doc = Nokogiri::HTML(html)
doc.css("p").each do |p|
  p.content.upcase!
end
html = doc.to_html

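You may notice that nothing happens! That’s because content hands you a new Ruby string every time you call it, so upcase! only changes a throwaway copy and never touches the document. To change the text you have to go through Nokogiri’s content= method like this: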

p.content = p.content.upcase

The Groin Kick

So now the code looks like so:

doc = Nokogiri::HTML(html)
doc.css("p").each do |p|
  p.content = p.content.upcase
end
html = doc.to_html

At first glance that seems to work, but then you take a closer look at the output:

<!DOCTYPE html>
<html lang="en">
<head><title>A Riddle</title></head>
<body>
    <p id="question">
      DID YOU HEAR ABOUT THE HTTP QUIZ?
    </p>
    <p id="answer">
      YEAH, IT WAS 300 MULTIPLE CHOICES.
    </p>
  </body>
</html>

Where did the span tag go? The content method returns only the text inside a node and leaves out any internal tags, so writing that text back loses the span. Go ahead and see what happens when you call content on the body node:

puts doc.at_css("body").content

You’ll get:

Did you hear about the HTTP quiz?
 
Yeah, it was 300 multiple choices.

See? No tags!

What’s going on?

Understanding the Document Structure

To understand the right way to solve this problem, we need to understand the underlying document structure. Everything is a hierarchy of nodes, even the text. So the body node looks like this:

  body
   |
   |--> p#question
   |      |
   |      `--> "Did you hear about the HTTP quiz?"
   |
   `--> p#answer
          |
          |--> "Yeah, it was "
          |
          |--> span.number
          |      |
          |      `--> "300"
          |
          `--> " multiple choices."

Notice how some nodes are text nodes and others are tag nodes, but everything is a node. So if we replace the content of the p#answer node with "foo", all of p#answer’s children are deleted and replaced with a single text node containing "foo":

  body
   |
   |--> p#question
   |      |
   |      `--> "Did you hear about the HTTP quiz?"
   |
   `--> p#answer
          |
          `--> "foo"

So we can’t just replace content willy-nilly. Fortunately, once you know the document structure, the problem has a straightforward solution:

doc.at_css("body").traverse do |node|
  if node.text? && (node.parent.name == "p")
    node.content = node.content.upcase
  end
end

Nokogiri gives us a nice traverse method which walks through a node and all of its descendants recursively, visiting children before the node itself. We check whether the node is a text node and whether its parent node is a paragraph. If so, we uppercase its content. This works!

Note, however, that if there were letters inside the span tag, they wouldn’t get uppercased. I’ll leave it as an exercise for the reader to figure out how to uppercase all of the text inside of paragraph tags that include other tags (hint: Nokogiri provides an ancestors method).

You may also notice that Nokogiri provides an inner_html method for nodes. This doesn’t strip out the child tags like content does. For example, if you try this:

puts doc.at_css("body").inner_html

this is what you get:

<p id="question">
  Did you hear about the HTTP quiz?
</p>
<p id="answer">
  Yeah, it was <span class="number">300</span> multiple choices.
</p>

This means that you could also capitalize your paragraphs this way:

doc.css("p").each do |p|
  p.inner_html = p.inner_html.upcase
end

While this works for this example, it’s not as robust. For one, you are uppercasing the entire inner_html string, which includes the tag and attribute names. Those only get downcased again because Nokogiri’s HTML parser normalizes them when it reparses the string; attribute values are not restored, so class="number" quietly becomes class="NUMBER". Also, for more complex algorithms where you are traversing the node tree, I think you can end up with some pretty weird states if you change the inner_html (which contains tags) while you are traversing. In other words, if you are traversing a node tree, you shouldn’t do anything that could add or remove nodes (unless you want to get sucker-punched).

The Suplex

So you think you’re done, right? Well, try out this program:

require 'rubygems'
require 'nokogiri'
 
html = %Q{
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>A Riddle</title>
  </head>
  <body>
    <p id="question">
      Why was 6 afraid of 7?
    </p>
    <p id="answer">
      Because 7, 8, 9 & 10!
    </p>
  </body>
</html>
}
 
doc = Nokogiri::HTML(html)
doc.at_css("body").traverse do |node|
  if node.text? && (node.parent.name == "p")
    node.content = node.content.upcase
  end
end
html = doc.to_html
 
puts html

And here’s what you get:

<!DOCTYPE html>
<html lang="en">
<head><title>A Riddle</title></head>
<body>
    <p id="question">
      WHY WAS 6 AFRAID OF 7?
    </p>
    <p id="answer">
      BECAUSE 7, 8, 9 &amp;amp;amp; 10!
    </p>
  </body>
</html>

What’s with all the amps?! Well, Nokogiri’s content= method helpfully HTML-escapes the text for you. The problem is that the text gets HTML-escaped again later (probably during the to_html call). This means the text gets escaped twice!

So how do you get around this? You can’t, at least not through the public API: Nokogiri doesn’t supply us with an option to skip the HTML escaping in content=. Fortunately, Ruby lets us call private methods through send, so there is a work-around until Nokogiri offers this functionality:

doc.at_css("body").traverse do |node|
  if node.text? && (node.parent.name == "p")
    new_text = node.content.upcase
    node.send(:native_content=, new_text)
  end
end

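The private native_content= method is basically the same as content= but without the HTML escaping. Since it is private, it is not part of Nokogiri’s supported interface and could change without notice.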

Conclusion

You may think that I’m complaining about Nokogiri, but I’m not. I’m a huge fan! It’s just that HTML parsing is a difficult problem and any library that deals with the problem will have some counter-intuitive parts. If you’ve come across any Nokogiri gotchas, please share them in the comments. Thanks!

3 Responses to “Getting Mugged by Nokogiri”

  1. Andrew Mager says:

    Awesome post

  2. I just had my BAM! moment with Nokogiri, thank you for pointing me in the right direction.

  3. rimkashox says:

    keep the kung fu going, like Nokogiri says… XML is like violence, if it doesn’t solve all yours problems, you’re not using enough of it !!! ;)
