Heroku's just Linux... Sometimes

Heroku is often treated as a black box into which a developer puts code and out pops a running web server. There is certainly a good amount of magic involved, but keep in mind that at the end of the day, those Dynos are just custom Ubuntu instances.

When Is It Linux?

Did you know you can do this?

~ $ heroku run bash
Running `bash` attached to terminal... up, run.5617

~ $ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:  Ubuntu 10.04 LTS
Release:  10.04

~ $ uptime
 20:50:25 up 35 days, 23:48,  0 users,  load average: 13.64, 9.56, 8.78


Why does it matter? Because sometimes programming languages are slow (looking at you, Ruby). Sometimes what you need to do has already been written in C and left behind on standard Linux distributions, waiting for you to use it.

For example, say we wanted to take a rather large CSV, sort it by the 2nd and then the 1st column, and return the first row.

I was able to find an 8MB CSV from baseball-databank.org. Here is an example Sinatra app:

require 'csv'
require 'benchmark'
require 'sinatra/base'

class HerokuTest < Sinatra::Base

  # RSS of the current process, in kilobytes
  def mem_usage
    `ps -o rss= #{Process.pid}`.to_i
  end

  def results(time, mem_total, line)
    output = []
    output << "Finished in #{time.real}"
    output << "Memory growth: #{mem_total}"
    output << "First sorted line:"
    output << line + "\n"
    output.join("\n")
  end

  get("/linux_sort") do
    mem_start = mem_usage
    time = Benchmark.measure do
      # Splitting on commas isn't proper CSV parsing but it works for our purposes here.
      # Sort numerically by the 2nd field, then by the 1st.
      system("sort --field-separator=',' --key=2,2g --key=1,1 fielding.csv > sorted.csv")
    end
    File.open("sorted.csv") do |f|
      mem_end = mem_usage
      results(time, mem_end - mem_start, f.readline)
    end
  end

  get("/ruby_sort") do
    mem_start = mem_usage
    time = Benchmark.measure do
      @file = CSV.parse(File.read("fielding.csv"))
      @file.sort_by! { |row| row.values_at(1, 0) }
    end
    mem_end = mem_usage
    results(time, mem_end - mem_start, @file.first.join(","))
  end
end

The app can be checked out at https://github.com/ejfinneran/herokus-just-unix.

So we're doing the same thing in each endpoint: sorting the file and returning the first line, along with some stats about the request.

These results are running on Ruby 2.1.2 on a standard 1X Cedar stack Dyno.

Let's see what the results are:

Ruby Sort

$ curl http://murmuring-castle-8562.herokuapp.com/ruby_sort
Finished in 5.419442253
Memory growth: 188716
First sorted line:

So it took 5.4 seconds and grew our memory footprint by roughly 188,000KB (ps reports RSS in kilobytes), or about 184MB. Subsequent requests don't grow memory nearly as much, but if we were parsing a different file in each request, you'd start seeing much worse memory growth.
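For reference on what those numbers mean: the mem_usage method above shells out to ps, which reports RSS in kilobytes. You can check this against any running process locally (a quick sketch, nothing Heroku-specific):

```shell
# ps -o rss= prints the resident set size of a process in kilobytes.
# $$ is the current shell's PID -- the same trick mem_usage uses with Process.pid.
ps -o rss= -p $$
```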

Linux Sort

$ curl http://murmuring-castle-8562.herokuapp.com/linux_sort
Finished in 0.84406482
Memory growth: 12
First sorted line:

Wow. The sort binary installed on the Dyno did the same job in under a second and added only 12KB to the memory footprint.

Keep this in mind even outside of Heroku. If you can drop down into those Linux tools to process data in large batches, you will save quite a bit of processing time and memory overhead.
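To make that concrete outside the app, here's the same sort-then-take-the-first-row operation as a plain pipeline, run against a made-up three-row file rather than the baseball data:

```shell
# Toy stand-in for fielding.csv
printf 'b,2\na,1\nc,1\n' > /tmp/sample.csv

# Sort by the 2nd column, then the 1st, and keep only the first row
sort -t, -k2,2 -k1,1 /tmp/sample.csv | head -n 1
# prints "a,1"
```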

This also means you can write a custom app that does something simple like sort in a language like Go and shell out to it. We've used that method at Cloudability to handle fetching and parsing very large files outside of Ruby.

When Is It Not Linux?


STDERR munging

Heroku munges STDERR into STDOUT when using the run command, so you can't use the standard Linux patterns for filtering output:

heroku run "ls not_real_file" 2>/dev/null
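You can see why the redirection fails with a quick local illustration (no Heroku required; the nonexistent filename is made up): once stderr has been merged into stdout, 2>/dev/null has nothing left to filter.

```shell
# Locally, stderr is its own stream, so this prints nothing:
ls no_such_file_here 2>/dev/null

# But after the streams are merged -- which is effectively what `heroku run`
# does before the output reaches your machine -- the same redirection
# no longer filters the error:
(ls no_such_file_here 2>&1) 2>/dev/null
```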

This issue is documented here if you want to track its progress.


Output truncation

Heroku also truncates output of the run command seemingly at random.

We can run nl on the server and see that this CSV is 164,898 lines.

$ heroku run "nl fielding.csv | tail"
Running `nl fielding.csv | tail` attached to terminal... up, run.9082
164889  zuverge01,1955,2,BAL,AL,P,28,5,259,7,19,3,2,,,,,
164890  zuverge01,1956,1,BAL,AL,P,62,0,292,8,17,0,2,,,,,
164891  zuverge01,1957,1,BAL,AL,P,56,0,338,5,26,0,2,,,,,
164892  zuverge01,1958,1,BAL,AL,P,45,0,207,4,19,0,1,,,,,
164893  zuverge01,1959,1,BAL,AL,P,6,0,39,0,4,0,0,,,,,
164894  zwilldu01,1910,1,CHA,AL,OF,27,,,45,2,3,1,,,,,
164895  zwilldu01,1914,1,CHF,FL,OF,154,,,340,15,14,3,,,,,
164896  zwilldu01,1915,1,CHF,FL,1B,3,,,3,0,0,0,,,,,
164897  zwilldu01,1915,1,CHF,FL,OF,148,,,356,20,8,6,,,,,
164898  zwilldu01,1916,1,CHN,NL,OF,10,,,11,0,0,0,,,,,

However, if we try to cat the file back to our machine, we see that we get cut off somewhere between the 3,800 and 5,000 line mark.

$ heroku run "cat fielding.csv" | nl | tail
 5077 ayalabo01,1998,1,SEA,AL,P,62,0,226,8,8,5,1,,,,,
 5078 ayalabo01,1999,1,MON,NL,P,53,0,198,7,10,4,1,,,,,
 5079 ayalabo01,1999,2,CHN,NL,P,13,0,48,1,2,0,0,,,,,
 5080 ayalalu01,2003,1,MON,NL,P,65,0,213,8,19,1,0,,,,,
 5081 ayalalu01,2004,1,MON,NL,P,81,0,271,9,21,0,4,,,,,
 5082 ayalalu01,2005,1,WAS,NL,P,68,0,213,7,15,0,1,,,,,
 5083 ayalalu01,2007,1,WAS,NL,P,44,0,127,3,8,0,0,,,,,
 5084 ayalalu01,2008,1,WAS,NL,P,62,0,173,1,10,0,1,,,,,
 5085 ayalalu01,2008,2,NYN,NL,P,19,0,54,1,2,0,0,,,,,

$ heroku run "cat fielding.csv" | nl | tail
 3886 aparilu01,1964,1,BAL,AL,SS,145,144,3815,260,437,15,98,,,,,
 3887 aparilu01,1965,1,BAL,AL,SS,141,139,3809,238,439,20,87,,,,,
 3888 aparilu01,1966,1,BAL,AL,SS,151,151,4098,303,441,17,104,,,,,
 3889 aparilu01,1967,1,BAL,AL,SS,131,128,3440,221,333,25,67,,,,,
 3890 aparilu01,1968,1,CHA,AL,SS,156,151,4069,269,535,19,92,,,,,
 3891 aparilu01,1969,1,CHA,AL,SS,154,153,3965,248,563,20,94,,,,,
 3892 aparilu01,1970,1,CHA,AL,SS,146,140,3588,251,483,18,99,,,,,
 3893 aparilu01,1971,1,BOS,AL,SS,121,121,3171,194,338,16,56,,,,,
 3894 aparilu01,1972,1,BOS,AL,SS,109,109,2807,183,304,16,54,,,,,
 3895 aparilu01,1973,1,BOS,AL,SS,132,129,3314,190,404%

Documented here: https://github.com/heroku/heroku/issues/674

Using the Replicate gem with Heroku


Update 01/2013:

I've recently found that the heroku run command can randomly truncate output, so your mileage on this may vary, especially with larger sets of data. GitHub issue here: https://github.com/heroku/heroku/issues/674

While trying to use the Replicate gem with Heroku, I ran into a strange issue. Namely, Heroku didn't like the byte stream that Ruby's Marshal class was producing:

╰─○  heroku run 'replicate -r ./config/environment -d "Post.first"' > file.dump
 !    Heroku client internal error.
 !    Search for help at: https://help.heroku.com
 !    Or report a bug at: https://github.com/heroku/heroku/issues/new

    Error:       invalid byte sequence in UTF-8 (ArgumentError)


I did find a workaround though. The base64 binary is present on Heroku VMs, so you can pipe replicate's output through that to get data that Heroku can handle. However, there are a couple of caveats here:

  1. Heroku squashes STDERR into STDOUT, which means you need to filter out STDERR (via 2>/dev/null) before sending the data back to the client.

  2. You need to do a little manual work on the data you get back to filter out the lines Heroku adds to the output.

  3. You need to decode the base64 file on your end before passing it to your local replicate instance.
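The encode/decode round trip itself is just standard base64: arbitrary bytes go in, ASCII-safe text crosses the wire, and the original bytes come back out. (GNU coreutils decodes with -d; the macOS binary uses -D, as shown further down.)

```shell
# Any byte stream survives the trip once it's base64-encoded.
printf 'marshal-ish bytes' | base64 | base64 -d
```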


Here is the flow that works for me:

heroku run 'replicate -r ./config/environment -d "Post.first" 2>/dev/null | base64' > file.dump

Open file.dump and remove any of the "…attached to terminal" lines Heroku adds. In the following case, just the first line.

Running `replicate -r ./config/environment -d "Post.first" 2>/dev/null | base64` attached to terminal... up, run.7145
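Rather than editing file.dump by hand, that banner line can be stripped automatically (a sketch; the match string assumes Heroku's banner always contains "attached to terminal"):

```shell
# Drop Heroku's "attached to terminal" banner line, keep everything else.
grep -v 'attached to terminal' file.dump > file.clean.dump
```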

Decode the base64 file back into a Marshal friendly file and load it into Replicate:

╰─○ base64 -i file.dump -D | replicate -r ./config/environment -l
zsh: correct './config/environment' to './config/environments' [nyae]? n
==> loaded 2 total objects:
Post      1
User      1

Hope that helps anyone working with Replicate on Heroku.

Rails UJS and jQuery Mobile

I ran into a gotcha when working with jQuery Mobile and the Rails UJS helpers. As you probably know, adding remote: true to your form_for arguments triggers Rails to add some UJS helpers that submit that form via AJAX rather than a normal POST.

However, jQuery Mobile looks for any form element and does the same. As a result, I was getting double posts each time I clicked submit on what should have been a very simple form.

Adding a data attribute of data-ajax='false' prevents jQuery Mobile from trying to muck with the form at all and lets Rails do its thing.

OpenStack Simple Scheduler

Interesting gotcha found while playing with OpenStack.

We were working on a two node OpenStack cluster but ran into a problem where the scheduler was not properly balancing between the two nodes. We'd spin up 10 instances and eight would spawn on one box and two on the other. Not ideal, obviously.

We figured out that you can specify a different scheduler via nova.conf. We added: --scheduler_driver=nova.scheduler.simple.SimpleScheduler

Despite its name, SimpleScheduler tries to intelligently schedule new instances based on the current load of the available compute nodes. That solved our issue!

This isn't very well documented. I had to go digging through the OpenStack code to figure out the syntax we needed but that's why open source is awesome. :-)

Overall, OpenStack has blown me away. Awesome stuff.

Red Hat needs a build-essentials package

Dear Red Hat,

Please fix this:

[root@localhost ~]# yum groupinstall "Development Tools"
Transaction Summary
Install 133 Package(s)
Upgrade 9 Package(s)

Total download size: 137 M

vs Ubuntu

user@ubuntu:~$ sudo apt-get install build-essential
Reading package lists... Done
Building dependency tree 
Reading state information... Done
0 upgraded, 16 newly installed, 0 to remove and 0 not upgraded.
Need to get 19.6MB of archives.
After this operation, 66.8MB of additional disk space will be used.

19.6MB to your 137MB.

I just want the packages I need to compile and install software; I don't want every development library available on the system.