June 7th, 2013

BIG VS. LITTLE: P-Values and Coefficients


The Quick and Dirty:

For the moment, let’s assume that we have some a priori hypothesis that we want to test. We can talk about two things: how big the relationship is and how strong it is. P-values don’t care about big; they only care about strong.
To get a sense for this, recall from ANOVA the fairly common test statistic F. We decide whether there is reason to believe that our data reflect a true underlying relationship between the variables of interest based on whether the F statistic generated by our data falls on one side or the other of the rejection threshold, the critical F.
We have some chosen level α (normally .05 by convention), which fixes that critical F. In the simplest sense, F depends on the number of conditions we’re testing, the number of observations, and how much of the variance in the dependent variable is explained by the treatment variable relative to the variance left unexplained. Regardless of whether you are working from a logit regression or a Gaussian one, most statistical software, including R, will give you a P-value. Each independent variable in your model gets its own P-value, which tells you about that variable. You can see it in the far right column of the R output below.

glm(formula = HrsSlp ~ +HrsTVR)


              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     1.7483      1.2957    1.349   0.19037
HrsTVR          0.4615      0.1449    3.186   0.00412
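
The post's data set isn't included, so here is a minimal sketch with simulated stand-in data showing how a table like the one above is produced and how to pull out the P-value column. The sample size, the seed, and the numbers used to generate HrsTVR and HrsSlp are assumptions for illustration only, so the estimates will not match the output above.

```r
# Simulated stand-in data: the generating values below are assumptions,
# not the post's original numbers.
set.seed(1)
n <- 25
HrsTVR <- runif(n, min = 0, max = 10)            # hours of TV watched
HrsSlp <- 2 + 0.5 * HrsTVR + rnorm(n, sd = 1.5)  # sleep related to TV by design

# glm() defaults to family = gaussian, i.e. ordinary linear regression.
fit <- glm(HrsSlp ~ HrsTVR)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
print(coefs)

# The P-value for the slope sits in the last column.
p_slope <- coefs["HrsTVR", "Pr(>|t|)"]
print(p_slope)
```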

The P-value is the probability, computed under the assumption that there is no true relationship (the null hypothesis), of a test statistic at least as extreme as the one observed. It lets us skip the cumbersome calculation of a threshold statistic entirely and states directly the chance that, if all of your assumptions hold and you repeat the sampling (or your experiment) in the same way, you will observe a test statistic at least as extreme as the one you have now. It doesn’t tell you the probability that your hypothesis is true.
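
As a rough check on this definition, the P-value for HrsTVR can be recomputed by hand from its reported t value. The sketch assumes the first model was fit on n = 25 observations like the second one (the post does not say), which leaves 23 residual degrees of freedom.

```r
# Reported t value for HrsTVR in the output above
t_value <- 3.186

# Assumption: n = 25 observations and 2 estimated coefficients
# (intercept and slope), so 25 - 2 = 23 residual degrees of freedom.
df_resid <- 25 - 2

# Two-sided P-value: the probability, if the true slope were zero, of a
# t statistic at least this far from zero in either direction.
p_value <- 2 * pt(-abs(t_value), df = df_resid)
print(p_value)
```

Under that assumption the result comes out at roughly 0.004, in line with the value R reported.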
In the output above I ran a really quick little glm on Hours of Sleep as a function of Hours of TV. Neglect everything else, and note that the P-value associated with HrsTVR is 0.00412. That is not surprising, considering that the two variables came from a set of numbers that I cooked up to play around with (and so are related by design). We reject our tacit assumption that the two variables we’re interested in aren’t related if P < α.
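
Comparing a test statistic to its critical value and comparing the P-value to α are two routes to the same decision. The sketch below shows this for the F test of a simple regression. The data are simulated stand-ins (the seed and generating values are assumptions), and lm() is used in place of glm() only because its summary exposes the F statistic directly.

```r
# Simulated stand-in data (values are assumptions for illustration).
set.seed(2)
n <- 25
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1.5)

fit <- lm(y ~ x)
alpha <- 0.05

# summary()$fstatistic holds the F value together with its numerator
# and denominator degrees of freedom.
f <- summary(fit)$fstatistic
f_obs  <- unname(f["value"])
df_num <- unname(f["numdf"])
df_den <- unname(f["dendf"])

# Route 1: compare the observed F to the critical F at level alpha.
f_crit <- qf(1 - alpha, df_num, df_den)
reject_by_threshold <- f_obs > f_crit

# Route 2: compare the P-value to alpha.
p_value <- pf(f_obs, df_num, df_den, lower.tail = FALSE)
reject_by_p <- p_value < alpha

print(c(F = f_obs, F_critical = f_crit, p = p_value))
print(reject_by_threshold == reject_by_p)
```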

Moving along

Let’s assume that the two variables we’re interested in aren’t really that related. In the example, I’ve chosen the Number of Hours of Sleep and the Number of Hours of TV Watched (by the guy three houses away).
I operationalized this in R as glm(formula = HrsSlp ~ +HrsTVNR) for n = 25 observations and got the following output:

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     5.3218      0.7607    6.996  3.95e-07
HrsTVNR         0.1454      0.2674    0.544  0.592

Not surprising: a P-value of 0.592 tells us that an estimate this far from zero would turn up more often than not even if some guy’s TV habits had no impact on hours of sleep at all, so a different sample could easily give a quite different estimate. But what if the two really are related? What if, just by chance, we sampled 25 nights that are totally uncharacteristic?
Below are the P-values for increasingly large samples of observed nights of sleep and TV watching. Here the P-values get smaller as the samples get larger. If there is any association at all in the data, however slight, then continuing this way we would eventually reach a number of observations so large that significance can be taken for granted; we will not only have passed the threshold P < α, we will have passed it long ago.
n      P-value
50     0.353
150    0.220
300    0.152   (estimated coefficient: 0.08578)
At this point you should be bothered: we have two totally unrelated things, and earlier I said that according to the P-value we could say with relative confidence that one variable didn’t have much to do with the other. We’re good, though. As the P-value shrinks, the estimated coefficient is going toward zero, which is exactly what should happen. As n increases, our conclusion that X and Y aren’t meaningfully related rests less on the relative variance of the two quantities and more on our increasing certainty that an increasingly narrow window around something close to 0 correctly captures the true relationship. A small P-value says the relationship is strong, in the sense of reliably estimated; only the coefficient says whether it is big.
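
This pattern can be sketched with a small simulation. The true slope of 0.05, the noise level, the seed, and the sample sizes are all assumptions chosen for illustration; they are not the post's data. The point is only that the standard error shrinks roughly like 1/sqrt(n), so the estimate settles near a small value while the P-value can keep falling.

```r
set.seed(3)

# Assumed data-generating process: a very weak but nonzero relationship.
true_slope <- 0.05
sizes <- c(25, 50, 150, 300, 3000, 30000)

# Fit the regression on a fresh sample of size n and return the slope's
# estimate, standard error, and P-value as a one-row data frame.
fit_one <- function(n) {
  tv  <- runif(n, min = 0, max = 10)
  slp <- 6 + true_slope * tv + rnorm(n, sd = 1.5)
  row <- summary(glm(slp ~ tv))$coefficients["tv", ]
  data.frame(n = n,
             estimate  = row[["Estimate"]],
             std_error = row[["Std. Error"]],
             p_value   = row[["Pr(>|t|)"]])
}

results <- do.call(rbind, lapply(sizes, fit_one))
print(results)
```

Reading down the printed table, compare the estimate column (how big) with the p_value column (how strong): a coefficient can stay tiny while its P-value drops well below α.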
