I See Dead Code

… as sounding brass, or a tinkling cymbal.


Stack Overflow Statistics

February 8th, 2009

Going through a computational linguistics program will bring you in touch with Zipf’s Law. Its core claim:

In a corpus, the frequency of any word is inversely proportional to its rank.

In less wordy terms: a few words (events) occur very often, while many words occur only a few times, or only once.
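To make the rank/frequency relationship concrete, here is a minimal sketch that ranks the words of a tiny made-up corpus by frequency. The function name `rank_frequency` and the sample sentence are my own illustrations, not part of any library. If Zipf’s Law held perfectly, the product of rank and count would stay roughly constant; a corpus this small can only hint at that.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    ranked = counts.most_common()  # sorted by descending count
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Toy corpus, invented purely for illustration
text = "the cat and the dog and the bird saw the fish"
table = rank_frequency(text.split())

for rank, word, count in table:
    # Under an ideal Zipf distribution, rank * count is about constant
    print(rank, word, count, rank * count)
```

On a real corpus you would plot rank against count on log-log axes and look for an approximately straight line.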

Zipf’s Law also holds for similar structures like DNA, and the same distribution can be observed in user reputation on Stack Overflow. The following three graphs plot reputation (X) against the frequency of that particular reputation value (Y) on log-scaled axes. With increasing normalization, the plot gets more Zipf-like, with the typical long “tails” at the lower end.

Distribution of Reputation, no normalization

Distribution of Reputation, 10r normalization

Distribution of Reputation, 100r normalization
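The frequency counts behind graphs like these could be computed roughly as follows. This is only a sketch under an assumption: I read “10r” and “100r” normalization here as rounding each reputation value down to a multiple of 10 or 100 before counting. The function `reputation_histogram` and the sample values are hypothetical.

```python
from collections import Counter

def reputation_histogram(reputations, bucket=1):
    """Count how many users fall into each reputation bucket.

    bucket=1 keeps the exact values; bucket=10 or bucket=100 rounds each
    value down to a multiple of 10 or 100 (an assumed reading of the
    "10r" and "100r" normalization).
    """
    return Counter((r // bucket) * bucket for r in reputations)

# Made-up reputation values, for illustration only
sample = [1, 1, 1, 6, 11, 15, 23, 101, 118, 1250]

exact = reputation_histogram(sample)
by_ten = reputation_histogram(sample, bucket=10)
by_hundred = reputation_histogram(sample, bucket=100)

print(sorted(by_hundred.items()))
```

Coarser buckets merge sparse neighbouring values, which is one plausible reason the plot looks smoother as the normalization increases.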

If we plot the mass distribution of reputation, ordered by decreasing reputation, on log-log axes, we get something that looks like the cumulative distribution function of an exponential distribution:

Reputation Mass Distribution

On 2009-02-05, the total reputation on Stack Overflow was 8,491,989. Around 15% of the users hold 85% of the reputation (not quite Pareto’s 80-20), and the top user (of 41,082 users) owns 0.39% of the overall reputation.
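A concentration figure like “15% of the users hold 85% of the reputation” can be computed along these lines. The helper `top_share` and the sample list are hypothetical; only the total of 8,491,989 and the 0.39% share come from the numbers above.

```python
def top_share(reputations, fraction):
    """Share of total reputation held by the top `fraction` of users."""
    ordered = sorted(reputations, reverse=True)
    k = max(1, round(len(ordered) * fraction))  # at least one user
    return sum(ordered[:k]) / sum(ordered)

# Made-up reputation values, for illustration only
sample = [1000, 400, 200, 100, 50, 50, 50, 50, 50, 50]

print(top_share(sample, 0.10))  # share held by the single top user
print(top_share(sample, 0.20))  # share held by the top two users

# Rough estimate of the top user's reputation from the figures above;
# 0.39% is a rounded percentage, so this is only approximate.
total_reputation = 8_491_989
top_user_estimate = total_reputation * 0.0039
print(round(top_user_estimate))
```

With the real data, `top_share(reputations, 0.15)` would be the number to compare against the 85% figure.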

For these graphs, I scraped the user overview pages. Scraping every single user page would allow for more interesting (and more accurate, since inactive users could be removed) statistics, but I’d rather wait for a proper API.

Tags: lang:en · other
