Justin Makes a Ruby Program!

I’m helping a friend convert some HTML-formatted posts from his blog into an ePub. I wanted to use the great Markdown editor Ulysses to produce the ePub, because it’s a nice program and because being able to edit in Ulysses would give me some easy control over the formatting of the ePub.

This required two steps.

The first was to get the HTML files into a format suitable for importing into Ulysses and outputting into an ePub.

This was a bit tricky, because I didn’t want to just do a straightforward conversion from HTML to Markdown. I needed to retain some HTML so that it could be styled by the CSS style sheet I’d use in Ulysses and look correct. So I needed to be able to selectively remove formatting while getting the result Markdown-ish enough for Ulysses to be able to output to ePub.

The second step would be to actually output the ePub with the desired styling. This would involve customizing a CSS style sheet in Ulysses.

Step two was basically trivial: I modified Jennifer Mack’s excellent KBasic style sheet, essentially copying and pasting (with modification) some of the relevant CSS from my friend’s blog. No prob.

So this post will focus on the first step, the Ruby program and associated gems I used to get the HTML files in shape for Ulysses, since that was the interesting part.

Step by step

This bit lets us work with various Ruby Gems necessary for the project:

I wanted the script to iterate through a whole directory of HTML files. That’s what this next bit does:
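A minimal sketch of that kind of loop, using only the standard library. The helper name `html_files_in` and the sample file names are made up for illustration; the demo runs against a throwaway directory rather than real blog posts.

```ruby
require "tmpdir"
require "fileutils"

# Return the HTML files in a directory, sorted so the processing order is stable.
def html_files_in(dir)
  Dir.glob(File.join(dir, "*.html")).sort
end

# Demo against a temporary directory so the sketch runs anywhere.
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, "first-post.html"))
  FileUtils.touch(File.join(dir, "second-post.html"))
  FileUtils.touch(File.join(dir, "notes.txt")) # skipped: not an HTML file

  html_files_in(dir).each do |path|
    puts File.basename(path) # each post would be processed here
  end
end
```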

The next few steps use Nokogiri, an HTML/XML parser, to pull the information we want from the HTML file we’re currently working with.

This initializes a variable that will let us work with the content of the HTML file in Nokogiri:

This initializes a variable for the content of the blog post we are gonna turn into an ePub:

In contrast with what I do below on the blog post’s title, I am not using the .text Nokogiri method to extract the text of the article. Why? Because I need the HTML formatting of the article for various purposes (to convert to Markdown and to have properly formatted blockquotes). Nokogiri has a css method that lets you extract elements from web pages using CSS selectors.

We want to put the body of the post into a string for further manipulation, so we do that next:

This initializes a variable for the title of the article. We just want the text here, so we’ll use the .text Nokogiri method:

This concatenates the title with a leading “# ” so that it will be a proper Markdown title, which will automate the process of turning blog posts into individual chapters in an ePub:

(The strip method prevents the title from appearing on a different line from the “#”.)
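As an illustration of why strip matters, here is a small sketch. The helper name `markdown_title` is hypothetical, and the raw title imitates text that comes back from a page with stray whitespace and newlines around it.

```ruby
# Turn raw title text into a Markdown top-level heading.
# strip removes leading/trailing whitespace, including newlines,
# so the text stays on the same line as the "# ".
def markdown_title(raw_title)
  "# " + raw_title.strip
end

puts markdown_title("\n  My First Post \n") # prints "# My First Post"
```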

Next we’re gonna combine the post title and body. First we initialize an empty string:

Then we concatenate the strings we’ve got for the title and body, with the title being added first of course:
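A sketch of that combining step, wrapped in a hypothetical helper called `combine`. The blank line between title and body is an assumption added here so the heading ends up on its own line; it is not something the description above specifies.

```ruby
# Build the article string by appending the title, then the body.
def combine(post_title, body_of_article)
  string_of_article = +""              # unary plus gives a mutable empty string
  string_of_article << post_title
  string_of_article << "\n\n"          # assumption: blank line after the heading
  string_of_article << body_of_article
  string_of_article
end

puts combine("# My First Post", "<p>Hello world</p>")
```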

This next bit puts blockquotes on their own lines, which helps Ulysses handle blockquotes correctly when making ePubs:
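One way this could be done with gsub is sketched below. The helper name `blockquotes_on_own_lines` and the exact newline placement (one before the opening tag, one after the closing tag) are assumptions for illustration.

```ruby
# Put each blockquote on its own line by adding a newline before the
# opening tag and another after the closing tag.
def blockquotes_on_own_lines(html)
  html
    .gsub(/<blockquote>/, "\n<blockquote>")
    .gsub(%r{</blockquote>}, "</blockquote>\n")
end

puts blockquotes_on_own_lines("<p>Before</p><blockquote>A quote</blockquote><p>After</p>")
```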

This converts <h3> tags to Markdown-appropriate format, which Upmark wasn’t handling well for some reason:
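A rough sketch of this kind of substitution follows. The helper name `convert_h3` is made up, and the non-greedy `(.*?)` is a choice made here so that two headings in one string are matched separately.

```ruby
# Replace each <h3>...</h3> pair with a Markdown "### " heading on its own line.
def convert_h3(html)
  html.gsub(%r{<h3[^>]*>(.*?)</h3>}m) { "\n### #{$1.strip}\n" }
end

puts convert_h3("<p>Intro</p><h3>Part One</h3><p>Text</p>")
```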

Sanitize “cleans up” HTML files by removing stuff that’s not on the whitelist. I had to build my own custom whitelist to get curi blog posts to work correctly. I basically modified an example whitelist by adding a couple of things:

Upmark is a Ruby gem for converting HTML to Markdown.

(NOTE: I’ve manually MODIFIED the version of the Upmark Ruby gem I’m running (specifically the markdown.rb script) by commenting out the portion that handles <br> tags. Leaving the break tags in, as opposed to replacing them with newlines, keeps the formatting correct inside of blockquotes.)

This last bit saves our file with a Markdown extension and closes the loop we opened way up top:
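A minimal sketch of the saving step, assuming the output sits next to the input file and uses a `.md` extension. Both helper names are hypothetical, and the demo writes into a temporary directory.

```ruby
require "tmpdir"

# Swap the .html extension for .md, keeping the same directory and base name.
def markdown_path_for(html_path)
  File.join(File.dirname(html_path), File.basename(html_path, ".html") + ".md")
end

# Write the converted text to disk and return the path that was written.
def save_markdown(html_path, markdown)
  out_path = markdown_path_for(html_path)
  File.write(out_path, markdown)
  out_path
end

Dir.mktmpdir do |dir|
  written = save_markdown(File.join(dir, "first-post.html"), "# My First Post\n")
  puts File.basename(written)
end
```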

The Script

Here’s the whole script, for reference:

5 thoughts on “Justin Makes a Ruby Program!”

  1. string_of_article = “”
    string_of_article << post_title
    string_of_article << body_of_article

    I would write:

    string_of_article = post_title + body_of_article

    You're mutating state for no reason and making it longer.

  2. FYI my blog uses Redcarpet for markdown.

    For sanitizing, shouldn’t you keep div and span tags? I’ve used those for formatting sometimes (offhand I know some blockquotes use divs). Also shouldn’t you allow hr tags? I use those.

    Also shouldn’t you convert h1 and h2 tags similar to the h3? And I don’t understand why the sanitizer doesn’t have b, i, u, em, strong on the list.

  3. gsub(/(.*)/) {“\n” + “### ” + $1 + “\n”}

    you don’t need a block. just:

    gsub(/(.*)/, “\n### \\1\n”

    \\ stands for \, in the same way \n stands for newline (since backslash is used for stuff like \n, it has to have a special way to write a backslash). also you don’t need to add string literals together (the \n and ###) when you can just write them next to each other in one string.

    also for making string_of_article don’t you need a space or newline in between the title and body?

    for headers do this:

    1.upto(6) do |n|
    string_of_article.gsub!(/(.*)/, “\n#{“#”*n}\\1\n”}
    end

  4. missing a space before the \\1 at the end of the previous comment. also FYI when you copy/paste that code to ruby you need to fix the quotes to be straight quotes not curly. also i indented the second to last line with 2 spaces but they aren’t being displayed 🙁

  5. test
