about:benjie

Random learnings and other thoughts from an unashamed geek

Octopress UTF-8 Issues

After running exitwp to import my blog, I ran rake generate to build it and got the following error:

$ rake generate
(in /Users/benjiegillam/Documents/Blog/octopress)
## Generating Site with Jekyll
Configuration from /Users/benjiegillam/Documents/Blog/octopress/_config.yml
unchanged sass/screen.scss
Building site: source -> public
/Users/benjiegillam/Documents/Blog/octopress/plugins/raw.rb:11:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/benjiegillam/Documents/Blog/octopress/plugins/raw.rb:11:in `unwrap'
    [...]

However, the file was perfectly valid UTF-8, as confirmed by an iconv -c conversion followed by a diff -u.
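The same sanity check can be done from Ruby itself. This is only a minimal sketch: valid_utf8? and invalid_utf8_lines are names I've made up for illustration, and the file path in the final comment is a placeholder rather than a real post.

```ruby
# Hypothetical helper: true if the given bytes form valid UTF-8.
def valid_utf8?(bytes)
  # dup so the caller's string is not re-tagged (and frozen strings work)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Hypothetical helper: 1-based numbers of the lines containing invalid UTF-8.
def invalid_utf8_lines(bytes)
  bad = []
  # String#b gives a binary copy, so splitting on "\n" never trips over encoding
  bytes.b.each_line.with_index(1) do |line, number|
    bad << number unless line.force_encoding(Encoding::UTF_8).valid_encoding?
  end
  bad
end

# Usage, assuming a post at a placeholder path:
#   content = File.binread("path/to/post.markdown")
#   puts valid_utf8?(content)
#   p invalid_utf8_lines(content)
```

For a file that passes the iconv check, as mine did, the first call should return true and the second an empty list, which points the finger at something later in the pipeline rather than at the source file.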

After a quick bit of hacking in the octopress/plugins/raw.rb file to print the content that was being converted, I found the file at fault. After some iteration I got to the root of the issue: Octopress’ default markdown parser, rdiscount, REALLY doesn’t like non-ASCII UTF-8 characters in link URLs. I’ve built a test here:

rdiscount UTF-8 URL test (rdiscounttest.rb)
# encoding: utf-8
require 'rdiscount'

print "Test 1: "
markdown = RDiscount.new("http://commons.wikipedia.org/wiki/Image:Dara_Ó_Briain.jpg")
if markdown.to_html == "<p>http://commons.wikipedia.org/wiki/Image:Dara<em>Ó</em>Briain.jpg</p>\n"
  puts "PASS"
else
  print "FAIL: "
  puts markdown.to_html
end
# Correct: <p>http://commons.wikipedia.org/wiki/Image:Dara<em>Ó</em>Briain.jpg</p>

print "Test 2: "
markdown = RDiscount.new("[Wikipedia](http://commons.wikipedia.org/wiki/Image:Dara_Ó_Briain.jpg)")
if markdown.to_html == "<p><a href=\"http://commons.wikipedia.org/wiki/Image:Dara_%C3%93_Briain.jpg\">Wikipedia</a></p>\n"
  puts "PASS"
else
  print "FAIL: "
  puts markdown.to_html
end
# Incorrect: <p><a href="http://commons.wikipedia.org/wiki/Image:Dara_?%93_Briain.jpg">Wikipedia</a></p>
# Note the invalid character (represented by ?)
# Expected: <p><a href="http://commons.wikipedia.org/wiki/Image:Dara_%C3%93_Briain.jpg">Wikipedia</a></p>

It was converting Dara_Ó_Briain.jpg to Dara_?%93_Briain.jpg, where the ? stands for a byte that is not valid UTF-8 on its own. (It should be Dara_%C3%93_Briain.jpg.)
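To see why that output is invalid, it helps to look at the bytes. The sketch below assumes (from comparing the broken and expected output, not from reading rdiscount's source) that the ? is the raw first byte 0xC3 left unescaped while only the second byte was percent-encoded.

```ruby
# encoding: utf-8

char = "\u00D3"      # the letter Ó (U+00D3)
bytes = char.bytes   # two bytes in UTF-8: 0xC3, 0x93

# Correct percent-encoding escapes every byte of the character.
correct = bytes.map { |byte| format("%%%02X", byte) }.join

# Assumed reconstruction of the broken output: first byte raw, second escaped.
broken = [bytes[0]].pack("C") + format("%%%02X", bytes[1])
broken.force_encoding(Encoding::UTF_8)

puts correct                 # prints %C3%93
puts broken.valid_encoding?  # prints false
```

A lone 0xC3 is a UTF-8 lead byte that must be followed by a continuation byte, so once it is followed by a literal % the string no longer decodes, which is consistent with the gsub error from raw.rb.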

Solution?

Annoyingly, pandoc (a tool employed by exitwp) seems to be converting the link from Dara_%C3%93_Briain.jpg to Dara_Ó_Briain.jpg in the markdown file, which then breaks when it is rdiscounted. As it only affected 2 characters in my entire blog history, I’ve not bothered with an automated fix; I just manually re-encoded the characters. I’ve commented on this issue with pandoc so that hopefully they will fix it.
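For anyone with more than a couple of affected links, an automated fix might look something like the following. It is a rough sketch, not what I actually ran: escape_non_ascii and escape_markdown_link_urls are hypothetical helper names, and the pattern only understands plain inline links of the form [text](url).

```ruby
# encoding: utf-8

# Percent-encode each non-ASCII character as its UTF-8 bytes (Ó becomes %C3%93).
def escape_non_ascii(text)
  text.gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format("%%%02X", byte) }.join
  end
end

# Hypothetical helper: apply the escaping only to the URL part of inline links.
# Links with a title are not matched at all, and a URL containing parentheses
# is only handled up to its first closing parenthesis.
def escape_markdown_link_urls(markdown)
  markdown.gsub(/\]\(([^)\s]+)\)/) do
    "](#{escape_non_ascii(Regexp.last_match(1))})"
  end
end

puts escape_markdown_link_urls("[Wikipedia](http://commons.wikipedia.org/wiki/Image:Dara_Ó_Briain.jpg)")
# prints [Wikipedia](http://commons.wikipedia.org/wiki/Image:Dara_%C3%93_Briain.jpg)
```

Restricting the rewrite to link URLs leaves non-ASCII characters in the post text alone, which matters because bare text was handled fine in Test 1.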
