Browser history to analyze gender

A script that analyzes your browser history to guess your gender seems to have gone viral. The script looks up the male-to-female ratio of visitors to each of the top websites in your history (in mine: slashdot.org has a 1.73:1 male-to-female ratio, while facebook.com has a ratio of 0.83 and google.com 0.98). According to the post:

I then apply the ratio of male to female users for each site and with some basic math determine a guestimate of your gender. The math is really quite simple, I just take:
1 / (1 + r_1 * r_2 * … * r_n)
where r_i is the ratio of men-to-women for the specific site.
Now, I'm not against simple formulae, but the above formula is mathematically absurd for two main reasons:

1. The limit is wrong. The more ratios below 1 you multiply, the closer the product gets to 0. So, if you visit a lot of popular websites (whose ratios, due to demographics, are all slightly less than 1), the formula goes to 1/(1 + 0) = 1.0, i.e., you will be female.

2. Independence is assumed but does not hold. By multiplying the individual ratios, you are treating the site visits as independent. But if visiting a website is indicative of gender, then obviously, they are not independent. You can't multiply like this.
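
To see the first problem concretely, here is a minimal Python sketch of the quoted product formula. The function name product_score and the made-up history (every site at a 0.9 male-to-female ratio) are assumptions for illustration only; they are not taken from the original script. It needs Python 3.8 or later for math.prod.

```python
from math import prod

def product_score(ratios):
    """Score from the quoted formula: 1 / (1 + r_1 * r_2 * ... * r_n)."""
    return 1.0 / (1.0 + prod(ratios))

# Hypothetical history: n sites, each with a 0.9 male-to-female ratio.
# The per-site evidence never changes, yet the score keeps climbing with n.
for n in (1, 10, 50, 100):
    print(n, round(product_score([0.9] * n), 4))
```

The score moves steadily toward 1.0 as n grows, even though every site carries the same mild skew. The verdict ends up depending on how many sites you visit, not on how skewed they are.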

Enough with the criticism. How would I fix the formula while keeping the math simple? Change the formula to:
1 / (1 + Average(r_1, r_2, …, r_n))
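
A minimal Python sketch of the averaged formula, using the three ratios from my own history quoted at the top of this post. The function name average_score is an assumption for illustration, and the average is taken to be a plain arithmetic mean.

```python
def average_score(ratios):
    """Score from the averaged formula: 1 / (1 + mean(r_1, ..., r_n))."""
    if not ratios:
        raise ValueError("need at least one site ratio")
    mean_ratio = sum(ratios) / len(ratios)
    return 1.0 / (1.0 + mean_ratio)

# Ratios quoted above: slashdot.org, facebook.com, google.com
history = [1.73, 0.83, 0.98]
print(round(average_score(history), 3))

# Unlike a product, the mean does not drift as more similar sites are added.
print(round(average_score([0.9] * 100), 3))
```

For my history the mean ratio is 1.18, so the score comes out at about 0.46. A hundred sites at 0.9 score the same as one site at 0.9, so the result no longer depends on the length of the history.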
