After running exitwp to import my blog, I ran rake generate to build it, and got the following issue:
$ rake generate
(in /Users/benjiegillam/Documents/Blog/octopress)
## Generating Site with Jekyll
Configuration from /Users/benjiegillam/Documents/Blog/octopress/_config.yml
unchanged sass/screen.scss
Building site: source -> public
/Users/benjiegillam/Documents/Blog/octopress/plugins/raw.rb:11:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/benjiegillam/Documents/Blog/octopress/plugins/raw.rb:11:in `unwrap'
[...]
However, the file was perfectly valid UTF-8, as confirmed by an iconv -c conversion followed by a diff -u.
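A rough Ruby equivalent of that validity check, as a minimal sketch. The helper name `valid_utf8?` and the example path are hypothetical, and this is not the iconv/diff pipeline described above; it just asks Ruby whether the raw bytes decode as UTF-8.

```ruby
# Hypothetical helper: true if the given raw bytes form valid UTF-8.
def valid_utf8?(bytes)
  # dup so a frozen or shared string is never modified in place
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Usage on a file (the path is a placeholder, not a real post):
#   valid_utf8?(File.binread("source/_posts/some-post.markdown"))
```

A check like this passing on the source file suggests the bad bytes are being introduced later, during conversion, rather than living in the post itself.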
After a quick bit of hacking in the octopress/plugins/raw.rb file to spit out the content that was being converted, I found the file at fault. After some iteration I got to the root of the issue: Octopress’ default Markdown parser, rdiscount, REALLY doesn’t like raw (un-escaped) UTF-8 characters in URLs. I built a small test case to confirm this.
rdiscount was converting Dara_Ó_Briain.jpg to Dara_?%93_Briain.jpg, where the ? is an invalid UTF-8 byte. (It should have been Dara_%C3%93_Briain.jpg.)
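To make the difference concrete, here is a minimal Ruby sketch that does not use rdiscount at all. The helper `percent_encode_non_ascii` is a made-up name, and the exact bytes of the broken output are an assumption: since Ó is the two bytes C3 93 in UTF-8, the `?` is presumably a raw C3 byte left behind while only the 93 was escaped.

```ruby
# Hypothetical helper: percent-encode every non-ASCII byte of a string.
def percent_encode_non_ascii(str)
  str.bytes.map { |b| b < 0x80 ? b.chr : format("%%%02X", b) }.join
end

# "\u00D3" is the Ó character; in UTF-8 it is the two bytes C3 93.
good = percent_encode_non_ascii("Dara_\u00D3_Briain.jpg")
puts good # → Dara_%C3%93_Briain.jpg

# Assumed shape of the broken output: raw C3 byte followed by a literal "%93".
broken = "Dara_\xC3%93_Briain.jpg".dup.force_encoding(Encoding::UTF_8)
puts broken.valid_encoding? # → false

# A regex gsub on a string like this raises ArgumentError in Ruby,
# which is the same kind of failure as in the rake output.
begin
  broken.gsub(/_/, " ")
rescue ArgumentError => e
  puts e.message
end
```

The point is that both bytes of the character have to be escaped together; escaping only one of them leaves a lone lead byte, which is not valid UTF-8 on its own.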
Solution?
Annoyingly, pandoc (a tool employed by exitwp) seems to be converting the link from Dara_%C3%93_Briain.jpg to Dara_Ó_Briain.jpg in the Markdown file, which then breaks when rdiscount processes it. As it only affected two characters in my entire blog history, I’ve not bothered with an automated fix; I just manually re-encoded the characters. I’ve commented on this issue with pandoc so that hopefully they will fix it.
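For anyone with more than a handful of affected links, an automated fix might look roughly like the sketch below. It is untested against real exitwp output and rests on some assumptions: the helper name `reencode_link_urls` is made up, it only handles inline links and images of the form `[text](url)`, and it assumes URLs contain no spaces or nested parentheses.

```ruby
# Hypothetical helper: percent-encode non-ASCII characters inside inline
# Markdown link/image destinations, leaving all other text untouched.
def reencode_link_urls(markdown)
  # Match "](" followed by the URL, stopping at whitespace or ")".
  markdown.gsub(/\]\(([^)\s]+)/) do
    url = Regexp.last_match(1)
    # Replace each non-ASCII character with the %XX form of its UTF-8 bytes.
    encoded = url.gsub(/[^\x00-\x7F]/) do |ch|
      ch.bytes.map { |b| format("%%%02X", b) }.join
    end
    "](#{encoded}"
  end
end

puts reencode_link_urls("![Dara](/images/Dara_\u00D3_Briain.jpg)")
# → ![Dara](/images/Dara_%C3%93_Briain.jpg)
```

Running something like this over the generated posts before rake generate would put the escapes back that pandoc removed. Reference-style links and anything outside the parentheses are deliberately left alone.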