To further confirm this observation, we also plotted the data for n = 14 and n = 126. As illustrated in Fig. 1, the relationship remained exactly the same, except that each “+” in the right panel represents nine pairs of data rather than one. However, the p value decreased as the sample size increased and fell below 0.05, i.e., became “significant”, when n reached 126, at our eighth duplication of the data. If
we now draw a conclusion about the relationship between GPA and GRE based on the p value, we arrive at a completely different one: GRE is significantly related to GPA. This clearly demonstrates the problem of drawing conclusions based merely on a p value, since the p value is biased by the sample size! When a sample size is large enough, almost any statistical finding can attain a p value less than 0.05 and become “significant”; in contrast, even when there is a high correlation, or a meaningful treatment effect, the p value can be larger than 0.05 if the sample size is small. The problem of drawing research conclusions based merely on the p value has been criticized for a long time. It is nearly
a century (over a century, in fact, if we count Karl Pearson’s work in 1901) since Ronald Fisher advocated the concept and procedure of hypothesis testing in 1925. Known today as “significance” testing, hypothesis testing is the most widely used decision-making procedure in scientific research. Yet hypothesis testing has been criticized from the very beginning, mainly in three respects 1, 2, 3, 4 and 5: (a) hypothesis testing (deductive) and scientific inference (inductive) address different questions; (b) hypothesis testing is a trivial exercise, a point Tukey 6 drove home when he commented “the effects of A and B are always different—in some decimal place—for any A and B. Thus asking ‘Are the effects different?’ is foolish”; and (c) hypothesis testing adopts a fixed level of significance (i.e., p < 0.05 or 0.01), which forces researchers to turn a continuum of uncertainty into a dichotomous “reject or do-not-reject” decision. Furthermore, as illustrated above, since a large sample size can lead to almost every
comparison being “significant”, the word “significant” itself becomes meaningless. In 1970, a group of sociologists extensively criticized the p value practice in their book The Significance Test Controversy 7 (see also the more recent publications What if There Were No Significance Tests?, edited by Harlow et al. 8, and The Cult of Statistical Significance by Ziliak and McCloskey 9). Almost 20 years ago, Cohen 3 published his well-known article The earth is round (p < 0.05), in which he concluded that “After four decades of severe criticism, the ritual of null hypothesis significance testing (mechanical dichotomous decisions around a sacred 0.05 criterion) still persists.” If we look at today’s widespread and even worse p value driven practice, we have to conclude, “Sadly, the earth is still round (p < 0.05).”
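The sample-size effect described above can be reproduced in a few lines of Python. The GPA/GRE numbers below are simulated for illustration only (they are not the dataset behind Fig. 1), and scipy's `pearsonr` stands in for whatever software computed the original correlations:

```python
# Sketch of the sample-size effect: duplicating a sample leaves the
# correlation coefficient r unchanged but shrinks the p value.
# (Simulated data for illustration; not the paper's GPA/GRE dataset.)
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
gpa = rng.uniform(2.0, 4.0, 14)               # 14 hypothetical GPA scores
gre = 200 + 30 * gpa + rng.normal(0, 40, 14)  # loosely related GRE scores

r14, p14 = pearsonr(gpa, gre)                            # original sample, n = 14
r126, p126 = pearsonr(np.tile(gpa, 9), np.tile(gre, 9))  # nine copies, n = 126

print(f"n =  14: r = {r14:.3f}, p = {p14:.4f}")
print(f"n = 126: r = {r126:.3f}, p = {p126:.4f}")
```

The correlation coefficient is identical in both runs; only the p value changes, because the t statistic behind it grows with the square root of n − 2 for a fixed r.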