###### By: H2O.ai

**The Quick and Dirty:**

For the moment, let’s assume that we have some a priori hypothesis that we want to test. We can talk about two things: how *big* the relationship is and how *strong* it is. P-values don’t care about big – they only care about strong.

To get a sense for this, recall from ANOVA the fairly common test statistic F. We decide whether there is reason to believe that our data reflect a true underlying relationship between the variables of interest based on whether the F statistic generated by our data falls above or below the critical F that marks the rejection threshold.
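As a minimal sketch of that comparison, the R snippet below runs a one-way ANOVA on simulated data and checks the observed F against the critical F at a chosen level of 0.05. The groups, means, and sample sizes are made up for illustration; only `lm`, `anova`, `qf`, and `pf` from base R are used.

```r
# Hypothetical data: 3 conditions with 10 observations each (simulated).
set.seed(1)
group <- factor(rep(c("a", "b", "c"), each = 10))
y <- rnorm(30, mean = c(5, 5.5, 7)[as.integer(group)])

# One-way ANOVA table: row 1 is the treatment, row 2 the residuals.
tab <- anova(lm(y ~ group))
f_obs <- tab[["F value"]][1]
df1 <- tab[["Df"]][1]   # number of conditions - 1
df2 <- tab[["Df"]][2]   # number of observations - number of conditions

alpha <- 0.05
f_crit <- qf(1 - alpha, df1, df2)                 # rejection threshold
p_val <- pf(f_obs, df1, df2, lower.tail = FALSE)  # tail area beyond f_obs

# The two decision rules always agree.
reject_by_threshold <- f_obs > f_crit
reject_by_p <- p_val < alpha
print(c(F = f_obs, F_crit = f_crit, p = p_val))
```

Comparing `f_obs` with `f_crit` and comparing `p_val` with `alpha` are the same test stated two ways, which is why software can report the P-value and leave the threshold out.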

We have some chosen level α (normally .05 by convention). In the simplest sense, F depends on the number of conditions we’re testing, the number of observations, and how much of the variance in the dependent variable is accounted for by the treatment variable of interest relative to what is left unexplained. Regardless of whether you are working from a logit regression or a Gaussian one, most statistical software, including R, will give you a P-value. Each independent variable in your model gets its own P-value, and that P-value gives you information about that variable. You can see it in the far-right column of the R output below.

```
glm(formula = HrsSlp ~ +HrsTVR)
```

Coefficients:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1.7483 | 1.2957 | 1.349 | 0.19037 |
| HrsTVR | 0.4615 | 0.1449 | 3.186 | 0.00412 |

The P-value is a probability attached to the observed test statistic, and it lets us skip the cumbersome calculation of a threshold statistic entirely. It states directly the chance that, *if* all of your assumptions hold (including the assumption that there is no real relationship) and *if* you repeat your sampling (or your experiment) in the same way, you will observe a test statistic at least as extreme as the one you have now. It doesn’t tell you the probability that your hypothesis is true.

In the output above I ran a really quick little glm on Hours of Sleep as a function of Hours of TV. Ignore everything else, and note that the P-value associated with HrsTVR is 0.00412 – not surprising, considering that the two variables came from a set of numbers that I cooked up to play around with (and so are related by design). We reject our tacit assumption that the two variables we’re interested in aren’t related if P < α, and here 0.00412 is well below .05.

### Moving along

Let’s assume that the two variables we’re interested in aren’t really that related. In the example, I’ve chosen the Number of Hours of Sleep and the Number of Hours of TV Watched (by the guy three houses away).

I operationalized this in R as `glm(formula = HrsSlp ~ +HrsTVNR)` for n = 25 observations and got the following output:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 5.3218 | 0.7607 | 6.996 | 3.95e-07 |
| HrsTVNR | 0.1454 | 0.2674 | 0.544 | 0.592 |
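To make the far-right column less of a black box, here is a rough sketch of how the HrsTVNR P-value can be recomputed by hand from the rest of the row. It assumes the usual Gaussian `glm` setup, where the t value is the estimate divided by its standard error and the residual degrees of freedom are n minus the two estimated coefficients; the numbers are the rounded ones from the table, so the result should land close to, not exactly on, the reported values.

```r
# Values copied (rounded) from the HrsTVNR row of the table above.
estimate <- 0.1454
std_error <- 0.2674
n <- 25              # number of observations stated in the text
df_resid <- n - 2    # assumes an intercept plus one slope were estimated

# Test statistic: how many standard errors the estimate sits from zero.
t_value <- estimate / std_error

# Two-sided P-value: probability of a |t| at least this large
# if the true coefficient were zero.
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(round(c(t = t_value, p = p_value), 3))
```

Note that nothing in this calculation asks how large the coefficient is in practical terms; only its size relative to its standard error matters.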

Not surprising – a P-value of 0.592 tells us that an estimate this far from zero would turn up more often than not even if some guy’s TV habits had no impact at all on hours of sleep, so a different sample could easily give us a very different estimate. But what if the two really are related? What if, just by chance, we sampled 25 nights that are totally uncharacteristic?

Below are the P-values for increasingly large samples of observed nights of sleep and TV watching. In these data, the P-values get smaller as the sample gets larger. If the estimated relationship is anything other than exactly zero, however tiny, then were we to continue this way we would eventually reach a number of observations so large that significance can be taken for granted; we will have not only passed the threshold P < α, but we will have passed it long ago.

| N | P |
|---|---|
| 50 | 0.353 |
| 150 | 0.220 |
| 300 | 0.152 |

At N = 300, the estimated coefficient is 0.08578.

At this point you should be bothered; we have two totally unrelated things, and earlier I said that according to the P-value we had no reason to think that one variable had much to do with the other. We’re good, though; while the P-value is shrinking, the estimated coefficient is going toward zero (0.1454 at n = 25, 0.08578 at n = 300), which is exactly what should happen. As n increases, our conclusion that X and Y aren’t meaningfully related is based less on the relative variance of the two quantities, and more on our increasing certainty that the true relationship lies in an increasingly narrow window around something close to 0.
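The pattern is easy to reproduce with a small simulation. The sketch below is not the data used above; it invents a data-generating process in which TV hours affect sleep through a deliberately tiny slope (the hypothetical `true_slope`), then fits the same kind of `glm` at growing sample sizes.

```r
set.seed(42)

# Hypothetical data-generating process (an assumption, not the data above):
# sleep depends on TV hours only through a very small slope.
true_slope <- 0.05
sample_sizes <- c(50, 500, 5000, 100000)

results <- do.call(rbind, lapply(sample_sizes, function(n) {
  tv <- runif(n, min = 0, max = 6)                  # hours of TV
  slp <- 6 + true_slope * tv + rnorm(n, sd = 1.5)   # hours of sleep
  fit <- glm(slp ~ tv)                              # Gaussian by default
  row <- coef(summary(fit))["tv", ]
  data.frame(n = n,
             estimate = row[["Estimate"]],
             std_error = row[["Std. Error"]],
             p_value = row[["Pr(>|t|)"]])
}))

print(results)
```

The standard error shrinks roughly with the square root of n, so the window around the estimate keeps narrowing. With enough observations, even a slope this small ends up with a tiny P-value: strong, but not big.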