Significance: To p or not to p?

Something that is deemed significant is, by definition, both worthy of attention and important. This holds true for most “significant” things – be they words or events – that we might encounter in our day-to-day lives. It is because of this familiarity that, when faced with something that has been found to be statistically significant, we may well overestimate its weight and value. For you see, statistical significance has a slightly different meaning – it concerns the probability of obtaining results at least as extreme as those observed if there were, in fact, no real effect. While this in itself sounds plausible and irreproachable, this probability is easily manipulated to suit the needs of the researcher. This can lead to the influential reporting of statistically significant results that are, in fact, of little consequence.

Let’s take a brief (click here for a comprehensive view) look at what we’re discussing. Statistical significance applies to statistical hypothesis testing, wherein two hypotheses are developed: the research hypothesis, the theory for which the researcher wishes to find supporting evidence, and the null hypothesis, which is the hypothesis actually tested and which states that there is no effect – the opposite of the research hypothesis. Using sample data collected during the experiment, the null hypothesis is tested statistically, leading to the generation of what is called a p-value. These p-values tend to be interpreted in one of the following three ways: p is the probability that the results found were due to chance; 1 − p is the reliability of the result (how replicable the finding is); p is the probability that the null hypothesis is in fact true. The last interpretation is the most common and is in fact what we were taught, almost as gospel, last year. However (brace yourselves), it is not true: the p-value is the probability of obtaining data at least as extreme as that observed assuming the null hypothesis is true, which is not the same thing at all. Berger and Sellke (1987) presented a case where the probability that the null was true was 0.52, and yet the associated p-value was a statistically significant 0.05. Clearly, these two values aren’t remotely near each other. So, as you see, a low p-value is not all it’s cracked up to be.
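If you’d like to see the machinery in action, here is a minimal Python sketch of where a p-value comes from. To be clear, the sample, the 500 ms null mean, and every number below are invented purely for illustration:

```python
# A minimal sketch (with invented numbers) of where a p-value comes from.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical sample: reaction times (ms) from 30 participants.
sample = rng.normal(loc=510, scale=40, size=30)

# Null hypothesis: the population mean reaction time is 500 ms.
t_stat, p_value = stats.ttest_1samp(sample, popmean=500)

# p is the probability of data at least this extreme *if* the null
# were true -- it is not the probability that the null is true.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```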

On the subject of low p-values, I’m sure you all remember that this is what we were taught to look for in a study (this particular value was our “significance level”). But for something that is given as much respect and attention as a low p-value (usually below the arbitrarily chosen 0.05, sometimes 0.01), it is ridiculously easy to manipulate. Because the p-value depends on sample size as well as effect size, a result of p = 0.051, which sits just above the common significance threshold, can often be nudged to p = 0.049 or lower simply by increasing the sample size. Exactly how much difference is there between the first and second p-value? A minuscule 0.002, and yet the latter is significant while the former is not. Here we can see that p-values are easily manipulated and that our judgement of them is dictated by an arbitrary cut-off.
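To see this for yourself, here is a small simulation – again with made-up numbers – in which the very same trivial effect drifts towards “significant” purely because more data were collected:

```python
# An illustration (numbers invented) of how sample size alone can drag a
# p-value across the 0.05 line while the effect itself stays trivial.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_shift = 2.0   # a 2 ms shift from the null mean of 500 -- negligible
sd = 40.0

for n in (100, 400, 1600, 6400):
    sample = rng.normal(loc=500 + true_shift, scale=sd, size=n)
    _, p = stats.ttest_1samp(sample, popmean=500)
    print(f"n = {n:5d}  p = {p:.3f}")

# The p-values tend to shrink as n grows, even though the effect
# being detected never gets any bigger or more interesting.
```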

Things are not looking good for statistical significance so far, and I hate to kick a theory when it’s down, but I have one further criticism to make. Statistical hypothesis testing can also tempt one into comparing studies by their respective significance levels – something that has been done many a time. This article is a lengthy (but fascinating!) discourse on why this should never be done. It presents the example of two studies: one that was significant with p ≤ 0.01 and one that failed to reach significance (its estimate being one standard error from 0). Despite this, the difference between the two results was not itself significant. The gap between a significant and a nonsignificant finding can be, in itself, completely nonsignificant!
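A quick back-of-the-envelope sketch, using hypothetical numbers in the spirit of that article’s example, shows how this plays out:

```python
# Hypothetical numbers in the spirit of that example: study A looks
# "significant", study B does not, yet A and B barely differ.
import math
from scipy import stats

def two_sided_p(estimate, se):
    """Two-sided p-value for a normal test of estimate / se against 0."""
    z = estimate / se
    return z, 2 * stats.norm.sf(abs(z))

est_a, se_a = 30.0, 10.0   # z = 3.0 -> p < 0.01, "significant"
est_b, se_b = 10.0, 10.0   # z = 1.0 -> one SE from 0, "nonsignificant"

for name, est, se in (("A", est_a, se_a), ("B", est_b, se_b)):
    z, p = two_sided_p(est, se)
    print(f"study {name}: z = {z:.2f}, p = {p:.3f}")

# Testing the *difference* between the studies directly:
se_diff = math.sqrt(se_a**2 + se_b**2)         # ~14.1
z, p = two_sided_p(est_a - est_b, se_diff)     # z ~ 1.41, p ~ 0.16
print(f"difference: z = {z:.2f}, p = {p:.3f}")
```

One study clears the threshold and one does not, yet the test of their difference is nowhere near significant – which is exactly why comparing studies by their significance labels is so misleading.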

So should we really make use of p-values for significance? The answer is that we probably shouldn’t, as they are both misused and misleading. It was for this reason that the APA considered banning their use in publications (it backed down, not wishing to act as a censoring body). Reporting effect sizes is now considered a better way of evaluating results, and I hope that, having read my blog, you can see why it is recommended that you not rely on p-value significance tests.
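For the curious, here is a small sketch of one such effect-size measure, Cohen’s d (the groups and every number below are, as ever, invented for illustration):

```python
# A sketch of Cohen's d, one common effect-size measure (all numbers
# here are invented for illustration).
import numpy as np

def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
treatment = rng.normal(505, 40, size=5000)   # tiny 5-unit shift
control = rng.normal(500, 40, size=5000)

# With n this large the difference will test as "significant", but a
# d of roughly 0.12 tells you the effect is tiny -- exactly the
# information a bare p-value hides.
print(f"d = {cohens_d(treatment, control):.2f}")
```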

Thank you for reading and tune in again next week 🙂

7 thoughts on “Significance: To p or not to p?”

  1. The first thing I noticed about this blog were the links. I don’t know if these were intentional or not, but I must admit I was confused. They make it seem as though the work was just copied. I’m sure it wasn’t, but that’s the impression they gave me. Were they to try and engage the reader and encourage them to delve deeper into where your information is coming from, or were they to be used as a form of reference? If the latter, then perhaps it would have been better to put a reference list at the end. I’m also not keen on how the word THIS is hyperlinked as opposed to the article or publisher itself. I’m sorry to sound so critical, but I’m just trying to get clarification.

    • Not a statistical comment: not for grading by TA.

      My links are given in the standard blog format – when posting like this, it is usual to present hidden hyperlinks. WordPress provides an easy-to-use button for this very function! I’m sorry that you would prefer a reference list, but as this is a blog and not an essay, I will continue to follow standard practice. I don’t understand how these links made my work appear plagiarised – it is all my own!

      Hope that’s cleared everything up and that I haven’t scared you away.

  2. Hi there. Your topic this week is a very hard one that could have easily ended badly… but I like your blog, so do not fret! Towards the end of the first paragraph you mention briefly how the researcher can manipulate results to suit their own needs. In my opinion, you should have gone into further detail on this, especially in relation to outliers. Researchers can come so close to a significant result that they manipulate the outliers, or change a two-tailed test to a one-tailed one, altering the results in order to reach significance. This kind of researcher abuse of significance levels is something you could have addressed more, with regard to reliability, to further support your final statement.


  3. This is a very good blog. I appreciated your example of how significance tests can be easily manipulated; it was parsimonious and easy to follow. I chose the same topic for my blog and came across a vast amount of research discrediting significance testing. I was shocked by this, and I am now even more shocked that the APA considered banning its use! There is a lot of emphasis placed upon significance testing in psychology, despite its contentious status. Some of the research I found whilst writing my blog (Lykken, 1968; Schmidt, 1996) argued that significance testing is the least important aspect of a good experiment. The fact that there is nearly a thirty-year gap between the two papers tells us that the dispute about the importance of significance testing has been going on for a long time.
    Another contentious issue I encountered whilst writing my blog was that statistical significance does not imply practical significance. Your results may satisfy the criterion set by your alpha level yet have no importance in the real world, often because the effect size is too small. This is reminiscent of a Type I error: finding evidence for a significant result when there is no real effect in the population.
    In contrast to this, I found a study (Krueger et al., 2000) on psoriasis treatments reporting that the development of useful treatments was being limited by the requirement that results show clinical significance. For Krueger et al. (2000), statistical significance alone should have been enough to approve treatments.

    References

    Krueger, G. G., et al. (2000). Two considerations for patients with psoriasis and their clinicians: What defines mild, moderate, and severe psoriasis? What constitutes a clinically significant improvement when treating psoriasis? http://www.sciencedirect.com/science/article/pii/S0190962200605911

    Lykken, D. T. (1968). Statistical significance in psychological research. http://psycnet.apa.org/psycinfo/1968-18058-001

    Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. http://psycnet.apa.org/journals/met/1/2/115/


  4. Very good blog. Indeed, Beale (1972) found that the importance researchers place on obtaining a good significant p-value means that, if they do not get one, they may ignore the results and try to manipulate the data. They are under such pressure that they will do almost anything to their data in order to reach significance, even if this renders the data invalid. This, of course, undermines the point of the study.
