m on October 16th, 2009
A better photo of the figurines.

A better photo of the figurines.

Obviously, I suppose, my previous post about the likelihood of getting all 8 anniversary Astérix figurines in only 14 tries of Kinder Surprise eggs was related to a classic statistics problem, the Coupon Collector’s Problem. The problem assumes a uniform random distribution of coupons (I guess, in cereal boxes, or something), and then goes about figuring out both expected values of trials (that is, what’s the average number of eggs you need to randomly acquire before you get all of the figurines) as well as probabilities at certain moments. Wikipedia even generalizes the problem out, and, as is the case with nearly every math-related article, quickly gets into proofs and pictures of formulas from LaTeX and so on. My head starts to hurt a bit, though I did go through the effort of figuring out what a harmonic number is as well as Euler’s solving of the Basel problem.

Nevertheless, simple answers eluded me, namely, what was the expected value for getting all of the Astérix figurines? In the last post, I brute forced an answer that said that I should expect to get all 8 by the 14th trial only 20% of the time. So I know the expected value had to be greater than 14. But how much? Furthermore, would it be possible to figure out what the distribution was with this problem, thereby being able to make more interesting probability predictions, like, “there’s a 1-in-n chance that you’ll get all the characters in only 10 tries.” The math eluded me, and naming your program “R” makes it impossible to google tutorials that may show how to solve it, so I turned back to my two old friends, brute force and perl. This time, instead of counting how many successes I had after 14 trials, I’d keep each effort going until I had success, and then I’d count the trials (or number of eggs). The code:

#!/usr/bin/perl -w

# there are 8 characters [0-7], and I'll later add 1 to give us [1-8].
$range = 8;
# let's run this a large number of times.
$sim = 1000000;
# And we want to take an average, so we need a sum of trials.
$trialsum = 0;
# Finally, let's keep track of the number of trials per run in a file.
open (OUT, "> /tmp/coupon.trials.out");

# do a simulation of $sim runs.
for($j = 1; $j <= $sim; $j++){
        local($trials, $c, @characters);
        # First, pick seven eggs to start the process.
        # 7 because it's impossible to get all 8 in the first 7.
        $trials = 7;
        for($i = 1; $i <= 7; $i++){
                push(@characters, int(rand($range)) +1);
        }
        $c = join(" ", @characters);
        # Now see if all eight characters are present.
        $all_eight = 0; # since there are only 7 eggs!
        do {
                $rand = int(rand($range)) +1;
                $c = $c . " " . $rand;
                $trials++;
                alleight($c);
        } until ($all_eight == 1);
        print OUT $trials . "\n";
        $trialsum += $trials;
}
print "avg num of trials to get all the figurines: " . $trialsum/$sim . "\n";

sub alleight {
        $c = shift;
        if(($c =~ /1/) && ($c =~ /2/) && ($c =~ /3/) && ($c =~ /4/)
        && ($c =~ /4/) && ($c =~ /5/) && ($c =~ /6/) && ($c =~ /7/)
        && ($c =~ /8/)){
                $all_eight = 1;
        }else{
                $all_eight = 0;
        }
}

As you can see, it writes the number of trials to a file on the server, giving each trial its own row. This will come in handy in the next part of the post. There’s a huge assumption I’m making in this code, of course, which is that every attempt will eventually yield all of the figurines. There’s no reason that must be so, since it’s possible that I can be stuck on 7 characters for life, but I think it’s a chance I’m willing to take.

So after a million trials, here was the result:

avg num of trials to get all the figurines: 21.735367

Now, I lied a little, when, earlier, I said the math was above me. I actually was totally able to estimate the expected value mathematically. It’s just 8/8 + 8/7 + 8/6 + 8/5 + 8/4 + 8/3 + 8/2 + 8/1. If you ask Google what that is, Google will tell you it’s 21.7428571. I’d put that in the realm of “close enough.” But I want more, dammit, and for more, I did, in fact, need the simulation.

The next step is to bring the 2.8mb file of trials into R, and to get some preliminary information about it:

> coupons = read.table('coupon.trials.out.txt')
> trials = coupons$V1
> summary(trials)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
8.00   16.00   20.00   21.74   26.00  138.00
> sd(trials)
[1] 8.71664

I feel bad for the poor guy who bought 138 boxes of Kinder Surprises before he got all eight figurines. Then again, that’s a lot of tasty chocolate. A lot. That’s also, for those of you counting at home, 15,456 calories. About a week’s worth of food! Since the distribution isn’t normal, I can’t just call that 138 an outlier. It’s part of the fat tail, after all, a tail that yielded 11 results over 100.

Sadly, here’s where about my abilities fall apart, as I don’t know how to do anything with distributions that are neither normal nor binomial. But I can make a pretty picture:

The trials against a normal distribution with the mean and variance of the trials.

A normal distribution with the mean and variance of the trials is in red.

As you can see, then, it is only with great risk that we try to determine any kind of probabilities regarding the winning of the figurines with the normal distribution. But we can assume that one’s chances of getting all eight figurines quickly are much, much better than normal!

Tags: , , , ,

One Response to “Astérix and the coupon collector’s problem”

  1. Hey really awesome article! Yeah, I hate the name R…

Leave a Reply