So, first: super amused with this project!
Second, to the point: data collection likely needs to consider context.
As an example, when looking at my own profile (http://osrc.dfm.io/weierophinney), perhaps the most amusing stat to me was this:
I hate to say it but Matthew is becoming—as one of the top 66% most vulgar users on GitHub—a tad foul-mouthed (with a particular affinity for filthy words like 'coochie').
What was amusing is that I interact with a contributor by the handle @hoochie-coochie; I'm definitely not prone to saying the word "coochie" in comments, commits, or other GitHub-related dialog, but I do reference that user when addressing them in comments, and always with the @ annotation. (Ironically, I know that this comment will just inflate the counts of that word!)
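To make the suggestion concrete, here is a minimal sketch of stripping @-mentions before counting words. I don't know how the project actually tokenizes text, so the names `MENTION_RE`, `WORD_RE`, and `count_words` are hypothetical, and the handle pattern is my assumption about what a GitHub username looks like (alphanumerics and hyphens).

```python
import re
from collections import Counter

# Assumed handle pattern: "@" followed by alphanumerics and hyphens.
# The lookbehind skips "@" inside e-mail addresses such as foo@example.com.
MENTION_RE = re.compile(r"(?<![\w.])@[A-Za-z0-9][A-Za-z0-9-]*")
WORD_RE = re.compile(r"[a-z']+")


def count_words(text, vocabulary):
    """Count occurrences of vocabulary words, ignoring @-mentions."""
    # Replace each mention with a space so neighbouring words stay separate.
    stripped = MENTION_RE.sub(" ", text)
    words = WORD_RE.findall(stripped.lower())
    return Counter(word for word in words if word in vocabulary)
```

With this, a comment like "Thanks @hoochie-coochie, merged!" contributes nothing to the count, while the bare word still does.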
As another data point, from my own profile:
Matthew and zendframework are probably friends or at least virtual friends.
Well, yes, yes, that is true! However, "zendframework" in this case is not a user, but an organization. The algorithm should check the status of a "user" to see if they are actually an organization, and omit organizations when considering affinity -- or at least alter how the data is presented. (As an example, considering "zendframework" is an organization, the next sentence causes a lot of amusement: "it's worth noting that zendframework is less of a PHP aficionado.")
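As a rough sketch of the organization check: the GitHub API's account payload (for example from `GET /users/{username}`) includes a `type` field that is `"Organization"` for organizations and `"User"` for people. The helper names and the sample data below are hypothetical, and I'm assuming the project already has those payloads on hand.

```python
def is_organization(account):
    """Return True if a GitHub account payload describes an organization."""
    return account.get("type") == "Organization"


def friend_candidates(accounts):
    """Return the logins of accounts that are not organizations."""
    return [a["login"] for a in accounts if not is_organization(a)]


# Hypothetical sample payloads, trimmed to the two fields used here.
accounts = [
    {"login": "zendframework", "type": "Organization"},
    {"login": "hoochie-coochie", "type": "User"},
]
print(friend_candidates(accounts))
```

Organizations could also be kept and simply described with different wording instead of being dropped.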
Again, however, very much enjoying the project!