Sunday, November 11, 2007

Random number challenge

I read a blog post by Ariya Hidayat on a challenge to generate discrete integer random number 1..7 given a function that is able to generate random numbers (discrete integer) 1..5. This is my answer from the computer engineering point of view:

int r,x;
do{
do{
r = rand5();

if (r > 2 && r <= 4) x = rand5() + 5;

else if (r <= 2) x = rand5();

}while (r > 4);
}while (x > 7);
return x;

This answer is actually a bad one, at least compared to Ariya's answer in the sense that the probability that a call to this function will generate the respective random number directly without involving the loop is really low. Huh ?!?!? Okay, basically let me say that the inner loop will generate hit 80 % of the time. Why? Because there are 5 possibilities, and 4 of them are used to generate discrete integer random number 1..10 while 1 is deliberately ignored. Meanwhile, the outer loop will obviously generate 70 % hit. So, in the end the code will be effective 56 % (as in 80 % x 70 %, right ?) of the time without involving any loop. Actually, I think Ariya's answers are some of the most efficient implementations possible, especially the first one, hehe. Had I seen the challenge earlier, I wouldn't have been able to generate an implementation of the same quality, it would be so so like the above :(.
Now, let's take a glimpse from the mathematical point of view. If you've taken at least one math subject in your undergrad, I believe you should have heard of this term called central limit theorem. Huh?!? what's that?!? hehe, read here. I only know the basic. Basically, if you sum up m random numbers of the same property(mean, variance, etc) you'll end up with a random number with the property of normal distribution with mean m times the original mean and with variance m times the the original mean. So now, you have the ability to generate random number 1..5, and from this you're able to get some normally distributed random numbers, what's next? First, let's see the code to generate rand5 (implemented in matlab for laziness reason).

function r = rand5(m,n)
r = floor(5 * rand(m,n) + 1);
This code will generate a matrix of m row and n column of discrete random numbers. Let's see the code to generate rand7.

function r = rand7(n)
m = 1000;
sig = sqrt(2);
mu = 3;
temp = sum(rand5(m,n));
zn = (temp - m * mu)/(sig * sqrt(m));
x=0.5*erfc(-zn/sqrt(2));
r = floor(mod(x*70,7))+1;

The code above will generate a vector of n discrete random numbers 1..7.
So, how it works? The sig variable is obviously the standard deviation (sqrt of variance) of the random number generated by rand5, while mu is the mean of it which is 3 (correct? (1 + 2 + 3 + 4 + 5)/5 = ?). Now we call the rand5 to generate m by n matrix of it and sum it column by column, hence we get a vector of m number of rows and put it to variable temp. I put m to 1000 here(I know it s*cks and too large and not practical, but my purpose here is to learn something rite :p) such that it is large enough to generate a normal distribution with mean m * mu and standard deviation sig * sqrt(m). The next step is to convert it to standard normal distribution random number zn with mean 0 and standard deviation 1. Finally, convert it to uniform distribution by virtue of standard normal distribution cumulative distribution function(cdf). Again, another uncommon name. Well, basically in the program sense, cdf(x) gives the probabilty (from 0 to 1 obviously) that a random number generator will generate random number less or equal than x with some distribution (normal here) property. So here we get a uniform distribution which seems continous from 0 to 1 but actually it's discrete (and depends on m{that's why the larger the better, as I've told you}). The next step is trivial and obvious.
Now, by calling rand7 function so many times, we are able to generate so many random numbers that we could get a histogram of them. First, let's see the histogram of zn.


As we can see, it is indeed normally distributed with mean 0 (obvious) and variance 1 (unless you're a statistic professor you can't readily see it).
Now, let's check the histogram of rand7.

As we can, not so obviously, see, it is indeed uniformly distributed with discrete integer state 1..7. Cool huh !?! Well, that's about it.
Anyway, this method will make you look knowledgeable and yet at the same time stupid and dumb (because of redundancy and not much logic in it) in front of your interviewer and in the end it's not fun at all as you couldn't "tease" your interviewer with it and it's actually not "harmless" if you don't code in matlab as it requires some matrix and vector optimization to speed up your code in such large numbers of array manipulation.

1 comment:

Ariya Hidayat said...

This is an interesting trick, particularly useful when rand5() is not guaranteed to be uniformly distributed.