June 7th, 2013

BIG VS. LITTLE: P-Values and Coefficients


The Quick and Dirty:

For the moment let’s assume that we have some a priori hypothesis that we want to test. We can talk about two things: how big the relationship is and how strong the evidence for it is. P-values don’t care about big – they only care about strong.
To get a sense for this, recall the fairly common test statistic F from ANOVA. We decide whether or not there is reason to believe that our data reflect a true underlying relationship between the variables of interest based on whether the F statistic generated by our data falls on one side or the other of the rejection threshold corresponding to a critical F.
We have some chosen level α (.05 by convention). In the simplest sense, F depends on the number of conditions we’re testing, the number of observations, and the variance explained by the treatment variable relative to the residual variance in the dependent variable. Whether you are working with a logit regression or a Gaussian model, most statistical software, including R, will give you a P-value. The P-value gives you information about an independent variable, and is unique to each independent variable in your model. You can see it in the R output below, in the far right column.

glm(formula = HrsSlp ~ +HrsTVR)

Coefficients:

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    1.7483      1.2957    1.349   0.19037
HrsTVR         0.4615      0.1449    3.186   0.00412

The P-value is the probability, assuming the null hypothesis and all of your other assumptions hold, that repeating the sampling (or your experiment) in the same way would produce a test statistic at least as extreme as the one you have now. It lets us skip entirely the cumbersome calculation of a threshold statistic. It doesn’t tell you the probability that your hypothesis is true.
In the output above I ran a really quick little glm on Hours of Sleep as a function of Hours of TV. Ignore everything else and note that the P-value associated with HrsTVR is 0.00412 – not surprising, considering that the two variables came from a set of numbers that I cooked up to play around with (and so are related by design). We reject our tacit assumption that the two variables we’re interested in aren’t related if P < α.
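
For the curious, here is a minimal sketch of how this kind of fit can be reproduced and checked in R. The variable names come from the output above, but the data are invented here (the cooked-up numbers aren’t shown in the post), so the estimates won’t match exactly.

set.seed(1)
HrsTVR <- runif(25, 0, 6)                           # hours of TV watched (invented data)
HrsSlp <- 1.7 + 0.5 * HrsTVR + rnorm(25, sd = 1)    # sleep, related by design

fit <- glm(HrsSlp ~ HrsTVR)        # Gaussian family by default
summary(fit)$coefficients          # Estimate, Std. Error, t value, Pr(>|t|)

# Pr(>|t|) is just the two-sided tail probability of the t statistic,
# which is why no table of critical values is needed:
tval <- summary(fit)$coefficients["HrsTVR", "t value"]
2 * pt(abs(tval), df = fit$df.residual, lower.tail = FALSE)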

Moving along

Let’s assume that the two variables we’re interested in aren’t really that related. In the example, I’ve chosen the Number of Hours of Sleep and the Number of Hours of TV Watched (by the guy three houses away).
I operationalized this in R as `glm(formula = HrsSlp ~ +HrsTVNR)` for n = 25 observations and got the following output:

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    5.3218      0.7607    6.996  3.95e-07
HrsTVNR        0.1454      0.2674    0.544     0.592

Not surprising – the large P-value tells us that an estimate like this is easy to get by chance alone, and that a different sample could easily give a very different estimate of how much some guy’s TV habits affect hours of sleep. But what if the two really are related? What if we just happened, by chance, to sample 25 nights that are totally uncharacteristic?
Below are the P-values for increasingly large samples of observed nights’ sleep and TV watching (a small simulation reproducing this kind of progression follows the table). Not surprisingly, the P-values get smaller as the data get larger. Were we to continue this way, we would eventually reach the point where we have such a large number of observations that significance can be taken for granted; we will not only have passed the threshold P < α, we will have passed it long ago.
  N      P
 50    0.353
150    0.220
300    0.152   (estimated coefficient: 0.08578)
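
Here is a minimal sketch of the same exercise in R. The data are simulated with a deliberately tiny slope baked in (an assumption on my part – the original numbers aren’t available), so the exact values won’t match the table, but the broad pattern is the same: the estimate stays small while the P-value drifts downward as n grows.

set.seed(42)
for (n in c(25, 50, 150, 300, 3000)) {
  HrsTVNR <- runif(n, 0, 6)                    # the neighbour's hours of TV (simulated)
  HrsSlp  <- 7 + 0.05 * HrsTVNR + rnorm(n)     # near-zero true slope, by assumption
  fit <- glm(HrsSlp ~ HrsTVNR)
  co  <- summary(fit)$coefficients["HrsTVNR", ]
  cat(sprintf("n = %4d   estimate = %6.3f   p = %.3f\n",
              n, co["Estimate"], co["Pr(>|t|)"]))
}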
 
At this point you should be bothered: we have two essentially unrelated things, and earlier I said that according to the P-value we could say with relative confidence that one variable didn’t have much to do with the other. We’re fine, though; while the P-value signals ever stronger evidence, the estimated coefficient is heading toward zero, which is exactly what should happen. As n increases, our conclusion that X and Y aren’t meaningfully related rests less on the relative variance of the two quantities and more on our increasing certainty that an increasingly narrow window around something very close to 0 captures the true relationship. The P-value tells us how strong the evidence is; the coefficient tells us how big the relationship is.
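
One way to see that narrowing window directly is to compare confidence intervals for the slope at a small and a large sample size. Another sketch with simulated, genuinely unrelated data (again my assumption, not the post’s numbers):

set.seed(7)
fit_n <- function(n) {
  HrsTVNR <- runif(n, 0, 6)                # simulated TV hours
  HrsSlp  <- rnorm(n, mean = 7, sd = 1)    # sleep with no true relationship at all
  glm(HrsSlp ~ HrsTVNR)
}
# Wald-style 95% intervals for the HrsTVNR coefficient:
confint.default(fit_n(25))["HrsTVNR", ]      # wide interval straddling 0
confint.default(fit_n(5000))["HrsTVNR", ]    # much narrower, still hugging 0

Whether that narrow interval excludes zero and whether its contents are practically meaningful are two different questions – which is the whole big-versus-little point.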
