Incremental Pearson
The Pearson correlation algorithm is a stalwart part of the the collaborative-filtering-enthusiasts’s arsenal of awesome. For reference, here is an implementation of it in Clojure, swiped from my clojure-cluster project.
(defn pearson [v1 v2]
(let [v1mean (mean v1)
v2mean (mean v2)]
(/ (sum (map
#(* (- %1 v1mean) (- %2 v2mean))
v1 v2))
(sqrt (* (sum (map #(square (- % v1mean)) v1))
(sum (map #(square (- % v2mean)) v2)))))))
Alex MacCaw recently posed an interesting question to me regarding Pearson. What if your vectors are too large to be held in memory? How do you correlate them then? Is it even possible?
I scratched my head for a while and then realized… Yeah. It is. And it’s easy.
I should note first though that the one caveat of this method is that you do need to calculate the mean of the vectors first. There’s no way to get around that and still have a perfect calculation. Now, on the other hand, you could estimate it without that… But that’s a whole other post.
(sum (map
#(* (- %1 v1mean) (- %2 v2mean))
v1 v2))
(sum (map #(square (- % v1mean)) v1))
(sum (map #(square (- % v2mean)) v2))
Notice that in the snippets from the pearson function above we’re summing up calculations applied to each node in the vectors. Within those calculations the only other variables we use are the means of either vector. So! If we simply take chunks of the vectors we want to correlate and just keep summing up these calculations… We can then take those numbers and finish the calculation without the need of the vectors themselves. Think of it like this…
(/ numerator (sqrt (* v1-denominator v2-denominator)))
That is the Pearson function… We just need to find the numerator, v1-denominator, and v2-denominator. So, I’ll stop babbling and let the code speak for itself…
(defn pearson-start [v1mean v2mean]
{ :v1 { :mean v1mean } :v2 { :mean v2mean } })
(defn pearson-finish [state]
(/ (get state :numerator)
(sqrt (get (get :v1 state) :denominator)
(get (get :v2 state) :denominator))))
(defn pearson-increment [v1 v2 state]
(let [numerator (or (get state :numerator) 0)
v1denom (or (get (get :v1 state) :denominator) 0)
v2denom (or (get (get :v2 state) :denominator) 0)
v1mean (get (get :v1 state) :mean)
v2mean (get (get :v2 state) :mean)]
{ :v1 { :mean v1mean
:denominator (+ v1denom (sum (map #(square (- % v1mean)) v1))) }
:v2 { :mean v2mean
:denominator (+ v2denom (sum (map #(square (- % v2mean)) v2))) }
:numerator (+ numerator (sum (map #(* (- %1 v1mean) (- %2 v2mean)) v1 v2))) }))
The use of this is pretty straightforward. First you call pearson-start, passing in the means of the two vectors. This returns a HashMap which you’ll need to hold on to. Then you take your massive vectors and break them down into chunks. You take two equally sized chunks (one from each vector) pass them into pearson-increment along with the value returned from pearson-start. Again, remember to hold onto the return value.
When you’re done processing all of the chunks of the vectors simply pass the latest return value from pearson-increment into pearson-finish and it finishes the pearson calculation… returning the final value.