Tuesday, November 25, 2008

Ruby's Bowels: Array

Exactly how Ruby’s arrays work has always been somewhat of a mystery to me. They’re certainly fast enough for most things, but where exactly they shine and where exactly they fail was for me and, I imagine, most Ruby programmers not well known.

So, let’s get right down to it. Ruby’s Arrays are backed by actual C arrays. When you first declare a basic array ([] or Array.new) an instance of the RArray struct is created, and it points to a C array which is allocated with a size of 16. (As indiciated in array.c by the const ARY_DEFAULT_SIZE.)

So, what happens when your array gets larger than 16 elements? Let’s take a look at the code which handles that situation. This code is swiped from the rb_ary_store function in array.c. rb_ary_store is the function which runs when you try to store a new value in an array. The snippet below is what runs when the index of that value exceeds the currently allocated space.
  if (idx >= RARRAY(ary)->aux.capa) {
    long new_capa = RARRAY(ary)->aux.capa / 2;

    if (new_capa < ARY_DEFAULT_SIZE) {
      new_capa = ARY_DEFAULT_SIZE;
    }
    new_capa += idx;
    if (new_capa * (long)sizeof(VALUE) <= new_capa) {
      rb_raise(rb_eArgError, "index too big");
    }
    REALLOC_N(RARRAY(ary)->ptr, VALUE, new_capa);
    RARRAY(ary)->aux.capa = new_capa;
  }
Alright, let’s break this down. RARRAY(ary)->aux.capa is the current size of the C array. (Note that this is not necessarily the same size as the Ruby array, but it is at least the size of said Ruby array.) A few lines down from where we see the initial use of the above statement we see: new_capa += idx. idx refers to the index of the value that we are currently trying to store in the array, which has caused this code to run.

Now, new_capa is now the old size divided by 2 plus the index of the value we’re trying to add.

After a check to make sure we haven’t exceeded the maximum size of Ruby’s max array size, we run REALLOC_N (a helper function defined in the Ruby source code) to increase the size of C array to new_capa and finally we store new_capa in place of the old value.

So, let’s take a snippet of Ruby code and see what this code would try to do.

  a = []            # the C array is initialized to 16 elements
  a[16] = 'foo'     # it is reallocated to (16 / 2 + 16) or 24
  a[50] = 'blah!'   # now it is (24 / 2 + 50) or 62

So, what if you were to run the following, incredibly common, code in Ruby?

  a = []
  100_000.times do |i|
    a << 'shenanigans!'
  end

At what indices do the reallocations take place? You can use the following little snippet to figure that out.

  x = 16
  ultimate_size = 100_000
  ultimate_size.times do |i|
    if i >= x
      x = x / 2 + i
      print "#{x}, " 
    end
  end

That blurb yields the following output: 24, 36, 54, 81, 121, 181, 271, 406, 609, 913, 1369, 2053, 3079, 4618, 6927, 10390, 15585, 23377, 35065, 52597, 78895, 118342. So, appending 100,000 elements to sequentially to an empty Ruby array means reallocating the C array backing it 22 times. It also means that the larger the size of the array the greater the amount of wasted memory. After the last reallocation the C array has a size of 118,342. We end up using 100,000. So we end up with 18,342 bytes of wasted memory. 18k isn’t terrible. But how about a 1 million element array? How much is wasted for that? It ends up being 347,984 bytes.

That sucks, but if you know how many things you’re going to be putting into it, this problem has a solution. Array.new(n) where n is the number of elements. Ruby will initialize the C array to the correct number of elements, and no reallocations will have to take place.

So, you might be thinking that doing so many reallocations is probably really slow, right? Well… Not really. Keep in mind that the difference between the number of allocations for 100,000 elements and 1,000,000 elements is only 6. It goes from 22 to 28. (Thus the increase in wasted memory.) Beyond that, reallocations are really pretty fast.

The reason I started looking into Ruby’s arrays in more depth is because I’m working on a package to extend the number of data structures available in Ruby. There are of course, many many more data structures than the standard Array and Hash that Ruby offers. (I know there are more, but those are the most common.) It seems to me that offering more data structures could be a very good thing… If people can figure out which ones are best for their problem, of course.

As part of that, I think that offering linked lists could be a nifty feature, depending on what you’re doing. I figured that adding an arbitrary number of elements might be somewhat faster, due to not having to reallocate things. Turns out I was wrong. Here’s the results of sequentially appending 1,000,000 elements onto a blank list and a blank array.

          user       system     total    real
  list:   0.430000   0.020000   0.450000 (  0.449394)
  array:  0.330000   0.000000   0.330000 (  0.332871)

Turns out that (using my linked list implementation anyway) it’s slower. Now, of course, that’s just when appending to the array. Pushing things onto the front of the array is a totally different story.

Since Ruby arrays are backed by actual C arrays, pushing something onto the front of the array means moving all of the rest of the elements one byte forward. Here, the linked lists shine. (I reduced the amount to 100,000 elements, because I’m impatient.)

          user       system     total    real
  list:   0.040000   0.000000   0.040000 (  0.046550)
  array:  3.840000   0.010000   3.850000 (  3.855183)

So, in this case the code runs about 100 times slower than the equivalent appending code. The story gets more interesting when we change the number of elements we’re appending. The above 100x slowdown was for 100,000 records. My results for 50,000 records was a 50x slowdown. 170x for 200,000 records, and 358x for 400,000 records. This all points to the fact that unshift as well as shift are O(n) operations. (As opposed to the O(1) of appending.)

Of course, random access is in Ruby arrays is blazingly fast as well.

So, to sum up: Ruby arrays are backed by actual C arrays, which are reallocated on the fly. Appending to Ruby arrays is O(1) and fast. Prepending to Ruby arrays is O(n)... Don’t do it.

Saturday, October 25, 2008

Collaborative Filtering Slides

I meant to post this, oh I dunno, maybe 6 months ago. Here are the slides from my talk on Collaborative Filtering at Boston.rb from this past summer.

Monday, August 25, 2008

Funny How That Works

I spent a couple months, weeks at least, working on my Collaborative Filter project. It’s in use in a fairly large production system. I presented a walkthrough of using it at the Boston Ruby Group. There are exhaustive examples. Despite this the project currently has 4 followers on GitHub. One is my brother. Two are people I know personally.

Meanwhile, a few weeks ago I realized it’d be handy if I could control the computers I use via a Jabber IM client. So, I set about writing Uppercut. The first version took… two hours, maybe. Since, then it’s gotten a bit more complicated and I’ve added tests and examples and whatnot.

Uppercut currently has 50 followers. It has 4 forks. I’ve received patches from people who I don’t personally know. It’s in use in at least one real application. And today it was featured at the top of RubyInside’s list of “What’s Hot on GitHub: August 2008”.

In the larger scheme of things, that’s all pretty minor. Jamis Buck could really a steaming pile and get far more attention. (Not that Jamis Buck would release such a thing.) But, comparative to Collaborative Filter there’s a raving party of admirers outside of the Uppercut studios.

This initially seemed odd to me. For most metrics which matter to me, Collaborative Filter is a superior project. The algorithms are most interesting and in my opinion, it is ultimately more useful than Uppercut.

So why the disparity in public interest?

I think it’s two things. And they’re the opposites of each other. Uppercut is easy. It’s really easy. You could have a simple agent up and running in 5 minutes. It’s easy to think of simple uses for Uppercut as well. I already have about half a dozen uses for it planned.

On the other hand, Collaborative Filter is much harder to wrap your head around. It has dozens of configuration options. The average programmer probably does not see a need for such a thing in the first place! Collaborative filtering is hard. There’s not getting around that. (Although the underlying idea, at least with recommendations, is really simple. Picture a sparsely filled matrix. Users to Items. Your job is to fill in the missing values.)

So, I guess the take home for me is that I need to simplify the way Collaborative Filter works…

...And give it a silly name.

Thursday, February 14, 2008

Function Composition in Ruby

I’m sure several dozen other people have done this already, but I found it nifty. Often when I’m using the Rails console I do things like this:

Model.find(:all).map(&:something).map(&amp;:others).map(&amp;:size)

Not bad all in all. But, like most good things, it spoils you eventually. Now I want it shorter… simpler… better! One thing I’ve always liked about Lisp is function composition. Paul Graham’s Arc has it down pat, making it as simple as func1:func2:func3 to compose functions.

Doing simple function composition in Ruby is really quite effortless.

class Proc
  def |(other)
    lambda { |*args| self[other[*args]] }
  end
end

add1 = lambda { |i| i + 1 }
sub3 = lambda { |i| i - 3 }

[1,2,3].map(add1|sub3) #=> [-1,0,1]

Hurray. We’re done! Or are we? Well, not quite. Notice in the initial code above, we’re using blockified symbols in the map calls. We’re going to do this part a little differently…

class Symbol
  def &lt;(other)
    self.to_proc | other.to_proc
  end
  def &gt;(other)
    other.to_proc | self.to_proc
  end
end

Well whats the point of all this? Well, traditionally functions composed as a|b|c will end up as a( b( c( blah ) ) ). However, if we were to do that with the initial example above, we’d up with something along the lines of…

Model.find(:all).map &amp;(:size|:others|:something)

Notice that they’re reversed from what they were when using the maps above? That might be fine, but it seems counter intuitive to me. In this case, I’d like to be able to write them in the reverse of that… but at the same time, the I’d like to keep the traditional ordering around. So, we’re using symbols which give a bit of direction.

So, using the code above, we could write our initial code as…

Model.find(:all).map &amp;(:size&gt;:others&gt;:something)

Edit: Yeah, I don’t know what up with the semi-colons before the symbols in the example code. Something is screwing it up… Sorry!

Friday, December 28, 2007

Metaprogramming MUD Commands

About a year and a half ago I threw together a “mud” in the space of a few days. I use the word mud loosely. Really, I just threw together the parts to make a basic world (editable only by changing the database entries the program used by hand), allow multiple people to log on, and talk. It worked and I was proud of my Ruby skills, as I hadn’t done too much with Ruby outside of the Rails context at that point.

Of course the problem with just throwing the parts in a bin and shaking it is that when you want to add new things, it gets harder and harder as you go. So, I quickly lost interest, as doing it right would have required a complete rewrite.

Fast forward to a few days ago, and I’ve decided to start working on a MUD again… and do it right. I’ve spent the last couple years doing a ton of Ruby work. So, I’m more confident now that I can do it right.

One thing that is now pretty thoroughly etched into my head, that wasn’t a year ago, is metaprogramming with Ruby. I understood how to do it back then… and how to use DSLs in previous years, however, now I have a better grip on just how freaking great it is. (Partially thanks to Paul Graham’s writings convincing me that bottom-up is awesome.)

I worked on the mud code quite a bit on the days I had off for the holidays, and then most everything evening since. So, I’ve got a pretty good base built up. One part of that base is the “command system”.

The command system is what the mud uses to identify and dispatch commands sent by the players. Things like:

say Hey whats up man
eq sword
score
area list
area delete 1
area create New Area
area set 1 room_start=1000 room_end=1100

Theres a few things that are probably immediately noticeable about this list of commands. First, some of these commands should not be usable by the average player. (Allowing ‘area delete’ for normal players would be a disaster.) Secondly, some commands have sub commands.

The naive (and sad) way to handle this would just be nested case statements. Which, of course, would lead to a tremendous amount of boilerplate code and repetition. Not acceptable… and certainly not worthy of a blog post. However, I think I’ve come up with a pretty slick way of handling this situation, and since it involves metaprogramming, which I love, I thought I’d share it. So, first off, what does the code need to do? It needs to…

  1. Record a block of code to be run when the command is dispatched
  2. Exclude the command based upon roles of players
  3. Facilitate nested (or namespaced) commands
  4. Easily create reusable methods (helpers)

So what would code that accomplishes all those goals look like? Well… Heres what I came up with:

module Roles
  extend RoleBuilder

  role(:player) { |c| true }
  role(:immortal) { |c| c.level &gt; 100 }
  role(:admin) { |c| c.level &gt; 105 }
end

class Command &lt; CommandBase
  player(:say) { |input| message_room input }

  namespace(:area) do |area|
    area.admin(:create) do |i|
      # blah blah blah
    end

    area.immortal(:info) do |i|
      # etc etc etc
    end

    area.player(:list) do |i|
      # pretend theres real code here
    end
  end
end

module Helpers
  def current_char
    @client.character
  end
end


Hey! Thats not too bad. So in the previous code we’ve defined three roles (player, immortal, admin) and four commands (say, area create, area info, area list). In the Roles module, we define the different roles… which are then used in the Command class. Those roles are defined as methods in the Roles module. That module is included in CommandBase (which Command inherits from). Those dynamically generated methods, in turn, define methods which are called by a dispatch method (which is outside the scope of this article, but suffice it to say dispatch looks at “area info 1” and says “okay, call the ‘area’ command”... more on this later). The way namespacing works is pretty self explanatory as well. You might guess that the code that lets all this happen is a bit gnarly. Actually, its not bad. Its been through several revisions, the first of which were fairly nasty… but the “final” version is short and pretty simple. This was aided, in part, by Ruby 1.9. (Bonus points if you can point out the 1.9 feature in use in the code below.)

module RoleBuilder
  def role(name, &amp;role_block)
    define_method(name) do |command,&amp;command_block|
      define_method(command) do |input|
        command_block.call(input) if role_block.call(current_char)
      end
    end
  end
end

So, lets start with the RoleBuilder module. Really quite simple, we have defined a method “role” which creates a function, which in turn will create the functions which actually run commands. So in this case, use of the RoleBuilder module might look like this:

module Roles
  extend RoleBuilder

  role(:player) { |c| true }
  role(:admin) { |c| c.level &gt; 100 }
end

In the above example, we end up defining two new methods: player and admin. The “c” being passed into each of the blocks is the logged in character we’re dealing with. That code is outside the scope of this post (and not interesting). So, just pretend that its obvious that a Character object should be passed into those. Then as long as the role block evaluates to true eventually the resulting command will be called (line 5 of the previous code). Moving on…

Alright, it gets a bit more interesting here. If we want to namespace a command (area list, area create, area delete) we use the “namespace” method defined here. Looking back at the code that uses this code (the first code sample) we see that namespace is used like this:

namespace(:area) do |area|
  area.admin(:create) do |i|
    # code to create a new area here
  end

  area.immortal(:info) do |i|
    # code to display area info here
  end
end

class CommandBase
  extend Roles
  extend Namespace
  include Helpers
end

module Namespace
  def namespace(name,&amp;block)
    ns_class = Class.new(CommandBase)
    yield ns_class
    ns = ns_class.new

    define_method(name) do |input|
      m = input.match(/([^\s]+)\s/)
      ns.send(m[1],m.post_match)
    end
  end
end

So what happens is that “namespace” creates a temporary class which inherits from CommandBase, just like Command, as well as a typical command method, like player or admin. When that method gets called (in this case “area”), it then re-dispatches the command, using the next argument. So, its like this…

  1. input: “area delete 1”
  2. dispatch calls “area”
  3. input: “delete 1”
  4. area command calls its own dispatch
  5. dispatch calls “delete”
  6. input: “1”

Pretty neat. Its almost, sorta, recursive. (Except that its not calling the same code, but rather, very similar code)

Theres at least one other good way I can think of to deal with writing a command system. That way involves pattern matching. If I was going to go back and do this again, I might try that, just for my own amusement, but I think the way I just described wins in practicality and “coolness”. And really… whats programming without coolness?

Saturday, December 15, 2007

This article was written 10 seconds ago

You’ve undoubtedly used programs where the timing of something is described as “{integer} {units} ago”. (Apple Mail, Reddit, uh… HeavyInk) In these programs, the integer counts up and the units become larger as time passes.

Getting to the point, I’ve seen a few implementations of this. However, they all involve lots of silly copy/paste type code in if or case constructs. So, for something I was working on over at HeavyInk I implemented my own variation of this, without the drivel.

TIME_UNITS = %w(year month week day hour minute)

def time_ago(time)
  diff = Time.now - time
  TIME_UNITS.select{ |unit| 
    diff > 1.send(unit) ? units_ago(unit,diff) : nil 
  } || "Less than a minute ago" 
end

def units_ago(unit,seconds)
  in_units = seconds / 1.send(unit)
  "#{in_units} #{in_units != 1 ? unit.to_s.pluralize : unit} ago" 
end

This was a little bit strange looking to me when I first came back to it, so I think an explanation is in order. First off, TIME_UNITS contains all of the units that we’re interested in, ordered from largest to smallest. We start off by passing a Time object to time_ago.

time_ago(10.minutes.ago) #=> "10 minutes ago" 

(Yeah, I know its a pretty silly example.) How it works becomes fairly obvious. Iterate through each of the time units, and find the first one we’ve exceeded at least one of. Divide our number of seconds of difference by the number of seconds in whatever unit we’ve chosen. (In this case, minutes, so 60.) So then we have our unit and the number of those units that we have.

Simple, and in my opinion, pretty elegant. Certainly better than copy/pasting lots of lines of code

Wednesday, November 14, 2007

Treetop Parser

I’ve been doing some work with Coco/R for Ruby lately. I understand that Coco/R is classic. I understand that in some languages, it might be great. But the Ruby implementation of it is a half-finished, hacked-together, piece of crap.

Seriously. Its rubbish.

So, anyway, I’ve spent a couple days at the office trying to get it to parse these, fairly complicated strings of text. An example might look something like…

Some Title Name #5 Vol. 01 Subtitle:Subsubtitle (AAA123456) (More Extra Data) TYLER CVR A

That in itself would not be hard… its the fact that they’re all horribly different. Is data consistency really that freaking hard?! ...But thats a whole other digression.

Anyway… Getting to the point. At RubyConf, Nathan Sobo of Pivotal Labs introduced a parser written in Ruby. It uses a completely different theory than traditional compilers. I haven’t looked into the gory details much, so I can’t really comment.

What I can comment on, however, is the fact that it works really well. I decided to toy around with a bit, before I switch my project at work to it. So, I watched the screencast which Nathan put together, and I got to work…

After maybe 30 minutes of hacking on it, I have what I believe to be a pretty decent CSS parser. It lacks some things at the moment, specifically application… But whatever. Here it is, for your amusement:

grammar Css
  rule stylesheet
    whitespace rule_set* whitespace
  end

  rule rule_set
    whitespace selector+ whitespace '{' whitespace instruction* whitespace '}'
  end

  rule selector
    selector_key whitespace
  end

  rule selector_key
    [a-zA-Z#.:]+
  end

  rule instruction
    instruction_key whitespace ':' whitespace instruction_value ';' whitespace
  end

  rule instruction_key
    [a-z-]+
  end

  rule instruction_value
    [a-z]+
  end

  rule whitespace
    [\s]*
  end
end

I’m sure it can be done more elegantly and more solidly… but for a first pass in 30 minutes, I’m pleased. What strikes me most of all is how easy it is. Easy to get setup, easy to write grammar files for, and easy to use.

It took me significantly longer than this just to get Coco/R running…

Monday, November 12, 2007

Pattern matching in Ruby

So, I’ve been reading “Practical OCaml” for the last couple days. I was impressed with the way OCaml handles pattern matching. Its not quite as succinct as in Haskell, but I believe its a bit more flexible. (Although I’m not great with Haskell, so I could be wrong)

Anyhow… I was reading Hacker News this evening and came across a post by Eric at http://blog.pretheory.com on pattern matching in Ruby.

It uses Raganwald’s domain-specific language code, which I have yet to use, but from looking at the code for a few moments, it seems pretty slick. Overall, a good addition to Ruby, in my opinion.

However…

Being that I’m reading about OCaml’s fabulous pattern matching at the moment, I see some shortcomings, which I’d like to address.

First is regular expression matching…

def which_country(n)
  match n do
    with(/\d{3}[ -]\d{3}[ -]\d{4}/) { "US phone number" }
    with(/\d-\d{3} \d{2} \d{2}/)    { "Chilean phone number" }
    with(/\d{3} \d{2} \d{2} \d{2}/) { "French phone number" }
    otherwise { "Quit making crap up..." }
  end
end

By default, any call to this function would yield “Quit making crap up…”. However, due to the slick way in which Eric wrote the pattern matching code, all we need to add is this…

class Regexp
  def patmatch(arg,mapping)
    arg =~ self
  end
end

Which allows us to yield…

&gt; which_country("555 338 5913")
"US phone number" 
&gt; which_country("9-555 23 23")
"Chilean phone number" 
&gt; which_country("555 23 45 67")
"French phone number" 

Not really any different than a case statement… but I like it anyway.

The second thing is referred to as “guarded matches” in OCaml. Essentially, each “with” is a boolean function.

def which_mod(n)
  match n do
    with( lambda{|n| n % 2  0} ) { "divisable by two" }
    with( lambda{|n| n % 3  0} ) { “divisable by three” }
    with( lambda{|n| n % 5 == 0} ) { “divisable by five” }
    otherwise { “lets pretend its prime” }
  end
end

&gt; which_mod(4)
"divisable by two" 
&gt; which_mod(17)
"lets pretend its prime" 

Getting this working is just as easy as getting regular expressions working.

class Proc
  def patmatch(arg,mapping)
    self.call(arg)
  end
end

Have fun and thanks again for such a neat hack, Eric.

Sunday, November 04, 2007

Hash#to_styles

On multiple occasions I’ve had the need to generate a bit of CSS on the fly, in Ruby. This was always bothersome, not because it was hard, but because it led to ugly, messy code. Then I realized… What is a block of CSS if not a glorified Hash?

class Hash
  def to_styles
    self.inject("") do |styles,p|
      next styles if p[1].nil?
      styles + "#{p[0]}:#{p[1]};" 
    end
  end
end

Interestingly, it wouldn’t take much more to represent an entire stylesheet using nested Hashes.

class Hash
  def to_stylesheet
    self.inject("") do |stylesheet,rules|
      stylesheet + "#{rules[0]} { #{rules[1].to_styles} }" 
    end
  end
end

Now, taking a stylesheet and making it into a Hash on the otherhand, is a bit harder. Rubis is doing something like that at the moment, except that it allows nested selectors. But that is the subject of a completely different digression.

Anyway, hopefully this post saves some irritated programmer from the dreaded folly of stylesheet creation by string concatenation.