Hacker News

lordnacho, yesterday at 10:00 PM

It seems the criticism is indeed Berkson's paradox, but the example differs from the canonical example of the paradox.

In the canonical example, you have uncorrelated attributes, e.g. skill and attractiveness in actors, forming a round scatter plot with no correlation. Selecting a subpopulation of top actors who are either skilled or attractive, you get a negative correlation. You can visualize this as chopping off the top-right of the round scatter plot: the chopped-off piece is stretched roughly along a line of negative slope.
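As a rough check of the canonical case, here is a minimal Python sketch. It assumes skill and attractiveness are independent standard normals and that "top actor" means skill + attractiveness above a cutoff; the cutoff of 1.0, the sample size, and the names `pearson`, `actors`, and `top` are illustrative choices, not anything taken from the thread or the paper.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)   # fixed seed so the run is repeatable
N = 20_000               # assumed sample size
CUTOFF = 1.0             # assumed selection threshold on skill + attractiveness

# Independent attributes: the full population is a round, uncorrelated cloud.
actors = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(N)]

# Keep only the "top" actors, i.e. the chopped-off top-right piece.
top = [(s, a) for s, a in actors if s + a > CUTOFF]

r_all = pearson(*zip(*actors))
r_top = pearson(*zip(*top))
print(f"whole population: r = {r_all:+.3f}")
print(f"selected subset:  r = {r_top:+.3f}")
```

Under these assumptions the whole-population correlation should sit near zero, while the selected subset should come out clearly negative.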

In this example, if you look in the linked paper inside the post by Dimakis, there is a positively correlated scatter plot: you can tell from the shape that youth and adult performance are positively correlated. But in this case, if you condition on the extremes of performance, you end up selecting a cloud of points that has flat to slightly negative correlation.


Replies

MontyCarloHall, yesterday at 10:18 PM

Correlated attributes can still lead to the paradox, so long as the error measured parallel to the cutoff line (the "fuzziness" of the correlation) is greater than the slope of the cutoff line. Here are a couple of cartoons to demonstrate. Each datapoint is marked I or E, depending on whether it's included in or excluded from the region x + y > z.

Uncorrelated attributes:

   y
   │   ∙                
   │    ∙∙ IIIIIII      
   │     E∙∙IIIIIIII    
   │    EEEE∙∙IIIIIII   
   │    EEEEEE∙∙IIIII   
   │    EEEEEEEE∙∙III   
   │     EEEEEEEEE∙∙    
   │       EEEEEEE  ∙∙  
   │                  ∙ 
   └───────────────────x

Looking at just the Included points shows clear (spurious) negative correlation.

Correlated attributes:

   y
   │  ∙              
   │   ∙∙   IIII   
   │     ∙∙IIIIII  
   │      E∙∙IIIII   
   │     EEEE∙III    
   │    EEEEEE∙∙     
   │     EEEE   ∙∙   
   │       E      ∙∙ 
   │                ∙
   └─────────────────x

The Included points still have a spurious negative correlation, though it's smaller than in the uncorrelated cartoon.
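To put rough numbers on the two cartoons, here is a small Python sketch that repeats the x + y > z selection for several underlying correlations. Everything specific in it is an assumption for illustration: bivariate standard normal data, a cutoff of z = 1.5, the sample size, the three values of rho, and the helper names `pearson` and `selected_correlation`.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def selected_correlation(rho, cutoff=1.5, n=50_000, seed=0):
    """Correlation among points with x + y > cutoff, where (x, y) are
    standard normals with population correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # Mix x with independent noise so that corr(x, y) = rho.
        y = rho * x + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        if x + y > cutoff:  # the Included region from the cartoons
            xs.append(x)
            ys.append(y)
    return pearson(xs, ys)


# rho = 0.0 mirrors the uncorrelated cartoon, rho = 0.5 the correlated one,
# and rho = 0.8 is a tighter cloud added for contrast.
results = {rho: selected_correlation(rho) for rho in (0.0, 0.5, 0.8)}
for rho, r in results.items():
    print(f"population rho = {rho:.1f} -> selected r = {r:+.3f}")
```

Under these assumptions the selected correlation should be strongly negative for rho = 0.0, still negative but weaker for rho = 0.5, and positive for rho = 0.8, where the cloud is too tight for this cutoff to flip the sign.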