A Ruby Script to Go From Raw Source of Emails to Something I Can Put On My Blog

I wanted to take an email from Apple Mail and paste it to my blog.

If I copy/paste from the normal email, I lose the > marks you need for proper quoting in Markdown.

Ah but there’s the option to view the Raw Source of an email in Apple Mail (Opt-Cmd-U).

But if you paste from that you get something like this:

British=E2=80=94again, really English=E2=80=94society remained defined =
by a national culture that Orwell would have recognized. In that year, =
however, Tony Blair=E2=80=99s just-elected first Labour government =
launched a demographic=E2=80=94and, concomitantly, a =
cultural=E2=80=94revolution, a revolution that historians and =
commentators of all political stripes now recognize as by far Blair=E2=80=99=
s most historically significant legacy.

ewww what are those =E2=80=blahblahs 🙁

Turns out Raw emails are encoded in Quoted-Printable. Ok cool, so how do we deal with that?

First we will need a couple of gems.

Let’s go!

gem install clipboard
gem install mail

Clipboard lets you easily interact with the Mac clipboard, and Mail has a decoder that turns Quoted-Printable text back into normal text.

The workflow is: I view Raw Source on an email I wanna paste, and select the body of the email, and copy it into the clipboard. Then I run the following script:
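A minimal sketch of the decoding step, not necessarily the exact script. It assumes the copied text is UTF-8, and it swaps in Ruby's built-in `unpack1("M")` for the Mail gem's decoder so that it runs without any gems. The method name `decode_quoted_printable` is made up for illustration. In the real workflow the input would come from `Clipboard.paste` and the result would go back with `Clipboard.copy`:

```ruby
# Decode Quoted-Printable text with Ruby's built-in String#unpack1.
# This stands in for the Mail gem's decoder (an assumption made so the
# sketch has no gem dependencies). "=E2=80=94" style escapes become real
# bytes, and a trailing "=" at the end of a line (a soft line break) is removed.
def decode_quoted_printable(raw)
  raw.unpack1("M").force_encoding("UTF-8")
end

# Stand-in for text copied from Raw Source. With the clipboard gem this
# would be: raw = Clipboard.paste
raw = "> Tony Blair=E2=80=99s just-elected first Labour government =\n" \
      "> launched a demographic=E2=80=94and, concomitantly, a =\n" \
      "> cultural=E2=80=94revolution"

decoded = decode_quoted_printable(raw)
puts decoded
# With the clipboard gem the result would go back with: Clipboard.copy(decoded)
```

Because soft line breaks are removed, wrapped lines get joined back together, so a quoted paragraph may need its > marks tidied afterwards.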

and now I have the plain text of my email with the appropriate > marks. Nice!

However, this would be a lot better if I could set this up as a system-wide service I could summon with a shortcut key… but I’m not sure if it’s possible to use an environment besides the system default Ruby to set such a service up in Automator.

Justin Makes a Ruby Program Part 2

Elliot Temple replied to my last post with several good comments and I wanted to address them.

Elliot Temple’s comments have a yellow background:

I would write:

You’re mutating state for no reason and making it longer.

Great suggestion, thanks!

For sanitizing, shouldn’t you keep div and span tags? I’ve used those for formatting sometimes (offhand I know some blockquotes use divs). Also shouldn’t you allow hr tags? I use those.

All good suggestions, thanks. I don’t really have a good understanding of HTML and CSS yet so I’m not good at judging things like what’s important to include in a sanitize list. I was mostly judging by noting that the output files looked good enough, but the hr tag in particular helps with a problem I’d noticed on some of the pages so thanks for pointing that out.

The <hr> tag was not agreeable to the Upmark gem, so I added the following to handle that issue:
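A minimal sketch of one way to handle this. It assumes the fix is to swap each hr tag for a Markdown horizontal rule before the HTML reaches Upmark; the helper name `replace_hr_tags` is hypothetical:

```ruby
# Replace <hr>, <hr/> and <hr /> with a Markdown horizontal rule.
# Assumption: the tags carry no attributes (e.g. <hr class="x"> is not matched).
def replace_hr_tags(html)
  html.gsub(%r{<hr\s*/?>}i, "\n\n---\n\n")
end

puts replace_hr_tags("<p>Part one</p><hr><p>Part two</p>")
```

The blank lines around the rule keep Markdown from reading `---` as a heading underline for the text above it.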

Also shouldn’t you convert h1 and h2 tags similar to the h3?

Yeah it doesn’t hurt to do them all!

Elliot also suggested I handle the RegEx for header tags by just doing substitution instead of using a block. This is what I came up with for the header issue with Elliot’s help. (My syntax highlighter isn’t handling #{} well, so ignore the highlighting on this code snippet):
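A sketch of what substitution-based header handling can look like, under the assumption that each header level maps to the same number of `#` characters. The method name `convert_headers` is illustrative, not taken from the original script:

```ruby
# Convert <h1>, <h2> and <h3> elements to Markdown headers using plain
# string substitution: the replacement string refers to the captured
# header text as \1, so no block is needed on gsub.
def convert_headers(html)
  (1..3).reduce(html) do |text, level|
    text.gsub(%r{<h#{level}[^>]*>(.*?)</h#{level}>}m, "\n#{'#' * level} \\1\n")
  end
end

puts convert_headers("<h2>Background</h2><p>Some text</p>")
```

The `#{level}` interpolation inside the regex is what lets one line cover all three header levels.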

And I don’t understand why the sanitizer doesn’t have b, i, u, em, strong on the list.

Fixed thanks.

Thanks for your help Elliot!

With regards to code in blog comments, Markdown has been activated in the comments, and it plays nicely with my syntax highlighting plugin! (I’m using Crayon if anyone is curious)

Inline code syntax is `back-ticks like this`

which produces code styled like this

and for blocks of code

```

it's a fence of three backticks around your code like this

```

Justin Makes a Ruby Program!

I’m helping a friend convert some HTML-formatted posts from his blog into an ePub. I wanted to use the great Markdown editor Ulysses to produce the ePub, because it’s a nice program and because being able to edit in Ulysses would give me some easy control over the formatting of the ePub.

This required two steps.

The first was to get the HTML files into a format suitable for importing into Ulysses and outputting into an ePub.

This was a bit tricky, because I wouldn’t want to just do a straightforward conversion from HTML to Markdown. I needed to retain some HTML so that it could be styled by the CSS sheet I’d use in Ulysses and look correct. So I needed to be able to selectively remove formatting while getting it Markdown-ish enough for Ulysses to be able to output to ePub.

The second step would be to actually output the ePub with the desired styling. This would involve customizing a CSS Style sheet in Ulysses.

Step two was basically trivial — I modified Jennifer Mack’s excellent KBasic style sheet, essentially doing a copy-and-paste (with modification) of some of the relevant CSS stylesheet from my friend’s blog. No prob.

So this post will focus on the first step, the Ruby program and associated gems I used to get the HTML files in shape for Ulysses, since that was the interesting part.

Step by step

This bit lets us work with various Ruby Gems necessary for the project:

I wanted the script to iterate through a whole directory of HTML files. That’s what this next bit does:
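Roughly, the loop can be sketched like this with only the standard library. The helper name `html_files_in` and the throwaway demo directory are assumptions for illustration:

```ruby
require "tmpdir"

# Return every .html file in a directory, sorted so the order is predictable.
def html_files_in(directory)
  Dir.glob(File.join(directory, "*.html")).sort
end

# Demo against a temporary directory so the sketch runs anywhere.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "post-1.html"), "<p>one</p>")
  File.write(File.join(dir, "notes.txt"), "not a post")

  html_files_in(dir).each do |path|
    puts "Processing #{File.basename(path)}"
    # The per-file steps described below would run here.
  end
end
```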

The next few steps use Nokogiri, an HTML/XML parser, to pull the information we want from the HTML file we’re currently working with.

This initializes a variable that will let us work with the content of the html file in Nokogiri:

This initializes a variable for the content of the blog post we are gonna turn into an ePub:

In contrast with what I do below on the blog post’s title, I am not using the .text Nokogiri method to extract the text of the article. Why? Cuz I need the HTML formatting of the article for various purposes (to convert to Markdown and to have properly formatted blockquotes). Nokogiri has a css selector that lets you extract elements from web pages. See how it works here

We want to put the body of the post into a string for further manipulation, so we do that next:

This initializes a variable for the title of the article. We just want the text here, so we’ll use the .text Nokogiri method:

This concatenates the title with a leading “# ” so that it will be a proper Markdown title, which will automate the process of turning blog posts into individual chapters in an ePub:

(The strip method prevents the title from appearing on a separate line from the “#”)
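A small sketch of that step. It assumes the extracted title text arrives with stray whitespace or newlines around it; the method name `markdown_title` is illustrative:

```ruby
# Build a Markdown level-one heading from the raw title text.
# strip removes the surrounding whitespace and newlines, so the title
# stays on the same line as the "# ".
def markdown_title(raw_title)
  "# " + raw_title.strip
end

puts markdown_title("\n   An Example Post Title \n")
```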

Next we’re gonna combine the post title and body. First we initialize an empty string:

Then we concatenate the strings we’ve got for the title and body, with the title being added first of course:
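Sketched out, the two steps look something like this. The blank line between title and body is an assumption, as is the method name `combine_title_and_body`:

```ruby
# Start from an empty string, then append the title followed by the body.
def combine_title_and_body(title, body)
  combined = String.new          # the empty string we initialize first
  combined << title << "\n\n"    # title goes in first
  combined << body
  combined
end

puts combine_title_and_body("# My Post", "<p>Body of the post</p>")
```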

This next bit puts blockquotes on their own lines, which helps Ulysses handle blockquotes correctly when making ePubs:
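One way to sketch this, assuming it is the blockquote tags themselves that need to sit on their own lines (the method name `isolate_blockquotes` is hypothetical):

```ruby
# Put every opening and closing blockquote tag on its own line by
# replacing the tag, plus any whitespace around it, with the tag
# wrapped in newlines.
def isolate_blockquotes(html)
  html.gsub(%r{\s*(</?blockquote[^>]*>)\s*}i, "\n\\1\n")
end

puts isolate_blockquotes("<p>He said:</p><blockquote>Quoted text</blockquote><p>Reply</p>")
```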

This converts <h3> tags to markdown appropriate format, which Upmark wasn’t handling well for some reason:

Sanitize “cleans up” HTML files by removing stuff that’s not on the whitelist. I had to build my own custom whitelist to get curi blog posts to work correctly. I basically modified an example whitelist by adding a couple things:

Upmark is a Ruby gem for converting HTML to Markdown.

(NOTE: I’ve manually MODIFIED the version of Upmark ruby gem I’m running (specifically the markdown.rb script) by commenting out the portion that handles <br> tags. Leaving the break tags in, as opposed to replacing them with newlines, keeps the formatting correct inside of blockquotes)

This last bit saves our file with a Markdown extension and closes the loop we opened way up top:
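A sketch of the saving step, assuming the Markdown file goes next to the source file with a `.md` extension. Both method names are illustrative:

```ruby
# Work out the output path: same directory and base name, .md extension.
def markdown_path_for(html_path)
  File.join(File.dirname(html_path), File.basename(html_path, ".*") + ".md")
end

# Write the converted text to that path and return where it went.
def save_markdown(html_path, markdown)
  output_path = markdown_path_for(html_path)
  File.write(output_path, markdown)
  output_path
end

puts markdown_path_for("posts/example-post.html")
```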

The Script

Here’s the whole script, for reference: