Hacker News

meander_water today at 1:23 AM

Overall really interesting read, but I'm having trouble processing this:

> OpenRouter performs internal categorization on a random sample comprising approximately 0.25% of all prompts

How can you arrive at any conclusion with such a small random sample size?


Replies

hoppoli today at 1:47 AM

Statistical significance comes mostly from N (the number of samples) and the variance of the dimension you're trying to measure [1]. If the variance is high, you'll need a higher N. If the variance is low, you can get by with a lower N. The percentage of the population sampled is largely irrelevant: N = 1000 might be significant whether it's 1% or 30% of the population.

[1] This is a simplification. Strictly, it depends on the standard error of your statistic, i.e., the thing you're trying to estimate (estimating the max of a population takes more samples than estimating the mean). That standard error, in turn, depends on the standard deviation of the dimension you're measuring. For example, if you're estimating mean height, the relevant quantity is the standard deviation of height in the population.
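To make the footnote concrete, here is a minimal sketch of the textbook standard error of a sample mean, sigma / sqrt(n). The function name and the 7 cm standard deviation for height are illustrative assumptions, not figures from the article. Note that the population size does not appear in the formula at all.

```python
import math


def standard_error_of_mean(sigma: float, n: int) -> float:
    """Standard error of a sample mean for n independent draws.

    sigma is the population standard deviation of the measured quantity.
    The population size is not an input: only sigma and n matter here.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)


# Assumed, illustrative value: standard deviation of height in cm.
HEIGHT_SD_CM = 7.0

for n in (100, 400, 1600):
    se = standard_error_of_mean(HEIGHT_SD_CM, n)
    print(f"n={n:5d}  standard error = {se:.3f} cm")
```

Each fourfold increase in n halves the standard error, regardless of how large the population being sampled is.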

piskov today at 1:31 AM

https://en.wikipedia.org/wiki/Central_limit_theorem

For example, even 300 truly random people are enough to ascertain, within a margin of error, the population distribution of some measurement (say, a personality feature).

That’s the basis of all polls and what have you
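As a rough sketch of the polling arithmetic, the usual normal-approximation margin of error for a proportion is z * sqrt(p * (1 - p) / n). The function name is hypothetical, and the defaults (worst case p = 0.5, z = 1.96 for roughly 95% confidence) are conventional assumptions rather than anything stated in the thread.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Normal-approximation margin of error for an estimated proportion.

    n: number of independent random respondents
    p: assumed true proportion (0.5 is the worst case)
    z: critical value (1.96 corresponds to about 95% confidence)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


for n in (300, 1000, 10000):
    print(f"n={n:6d}  margin of error = ±{100 * margin_of_error(n):.1f} points")
```

With 300 respondents the margin comes out near ±5.7 percentage points, which is usable for many purposes but only if the sample is genuinely random.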

abdullahkhalids today at 1:37 AM

Because the accuracy of an estimated quantity depends mostly on the size of the sample, not on the size of the population [1]. This does require assumptions, such as a reasonably homogeneous population and roughly normal distributions. However, these assumptions often hold.

[1] https://stats.stackexchange.com/questions/166/how-do-you-dec...
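One way to see why population size barely matters is the finite population correction, the factor sqrt((N - n) / (N - 1)) that scales the standard error when sampling without replacement. The sketch below uses made-up population sizes. Only the 0.25% sampling fraction comes from the quoted article, and the actual prompt counts are not known here.

```python
import math


def finite_population_correction(population: int, sample: int) -> float:
    """Factor applied to the standard error when sampling without replacement.

    Close to 1.0 means the population size has almost no effect.
    It reaches 0.0 when the whole population is sampled.
    """
    if not 0 < sample <= population or population < 2:
        raise ValueError("need population >= 2 and 0 < sample <= population")
    return math.sqrt((population - sample) / (population - 1))


# Hypothetical population sizes, each sampled at 0.25%.
for population in (400_000, 40_000_000):
    sample = population // 400  # 0.25% of the population
    fpc = finite_population_correction(population, sample)
    print(f"N={population:>11,d}  n={sample:>7,d}  correction = {fpc:.5f}")
```

At a 0.25% sampling fraction the correction stays within a fraction of a percent of 1, so the precision is governed almost entirely by the absolute sample size n.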

jfrbfbreudh today at 1:26 AM

with enough samples