Experimental design and statistics

  • 1. Course on Laboratory Animal Science. Experimental design and statistics. Peter Klaren, Dept. of Animal Ecology & Physiology, Institute for Water and Wetland Research, Faculty of Science
  • 2. “I have a study and I need to know how many patients I need. I think I only need three patients.” https://www.youtube.com/watch?v=PbODigCZqL8
  • 4. Biomedical research’s replication crisis. Douglas Altman (1994): “Huge sums of money are spent (...) on research that is seriously flawed through the use of inappropriate designs, unrepresentative samples, small samples, incorrect methods of analysis, and faulty interpretation.” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2539276/pdf/bmj00425-0005.pdf Richard Smith (2014): “Twenty years later I fear that things are not better but worse.” http://blogs.bmj.com/bmj/2014/01/31/richard-smith-medical-research-still-a-scandal/
  • 5. Healthy and cancer counts per treatment group (doi:10.1001/jama.2017.2115):
      Group | Healthy | Cancer
      Vitamin D3 + calcium | 1111 | 45
      Placebo | 1083 | 64
    χ² = 3.63, p = 0.06 (NS), but 30% lower risk... “Scientific conclusions (...) should not be based only on whether a p-value passes a specific threshold.” Wasserstein & Lazar (2016) The ASA’s statement on statistical significance and p-values: context, process and purpose. Am. Stat. 70: 129-133 (https://doi.org/10.1080/00031305.2016.1154108)
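The χ² value and the "30% lower risk" on this slide can be checked by hand. The sketch below assumes a Pearson chi-square test without continuity correction (the slide does not say which variant was used; this one reproduces the reported 3.63). The function name `chi_square_2x2` is made up for this illustration.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Counts from the slide: (healthy, cancer) for each arm.
chi2, p = chi_square_2x2(1111, 45, 1083, 64)
risk_treated = 45 / (1111 + 45)
risk_placebo = 64 / (1083 + 64)
relative_risk = risk_treated / risk_placebo
print(f"chi2 = {chi2:.2f}, p = {p:.3f}, relative risk = {relative_risk:.2f}")
```

The p-value sits just above 0.05 while the relative risk is about 0.70, which is the tension the slide points at.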
  • 6. Why talk about power, sample size, experimental design? Because many studies are underpowered. Average statistical power reported for detecting effects of different sizes:
      Research domain | “Small” effects | “Medium” effects | “Large” effects
      Medicine | 14% | 39% | 61% *
      Sociology | 55% | 84% | 94%
      Applied psychology | 25% | 67% | 86%
      Social psychology | 18% | 48% | 83%
      Mass communication | 34% | 76% | 91%
      Marketing research | 24% | 69% | 87%
    Lipsey MW (1990) Design Sensitivity. Statistical Power for Experimental Research. SAGE, London.
    * Interpretation: when a “large” effect indeed is present, then only 61% of the reported experimental designs will detect this as a statistically significant result. 61% is only marginally better than the flip of a coin (50%). Medicine in particular seems to be doing a pretty lousy job.
  • 7. Frequency distribution of the median power calculated from 49 meta-analyses of 730 individual neuroscience studies (figure markers: median power; coin-flip power). One third (15/49) of the meta-analyses concern studies with a median power < 11%. The median power (24/49) is 21%. Button et al. (2013) Nature Rev. Neurosci. 14: 365-76. doi:10.1038/nrn3475
  • 8. Why talk about power, sample size, experimental design? Authors almost never present a sample size calculation in their papers. Prevalence (%) of reporting of (A) randomisation, (B) blinded assessment of outcome, (C) sample size calculations, and (D) conflict of interest for 2,671 publications describing the efficacy of interventions in animal models of eight different diseases. Macleod et al. (2015) Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biol. 13(10): e1002273. doi:10.1371/journal.pbio.1002273
  • 9. “Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies to the most modern molecular research. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims. However, this should not be surprising. It can be proven that most claimed research findings are false.” Ioannidis (2005) PLoS Medicine 2(8): e124. http://dx.doi.org/10.1371/journal.pmed.0020124
  • 10. Today’s menu: Why talk about experimental design and statistics?
      • What is science all about?
        - Measuring stuff and demonstrating causes and effects (cause → effect)
      • What are experimental design & statistics all about?
        - Control of variability
        - Isolating causal factors
        - Demonstrating interactions between factors
      • Which are the key steps in experimental design?
        - Start out with a well-defined hypothesis and research question
        - Design the appropriate control group(s)
        - Determine the appropriate sample size and replicate
        - Randomly allocate experimental units to treatments
        - Control and reduce variability through blocking (stratification) and factorial experiments
    http://articles.extension.org/pages/67849/experimental-design
  • 11. Example: Does drug X lower diastolic blood pressure in mice? (1) • In an ideal world without biological variation, without measurement errors... • ...one observation would be enough. BP in mouse given drug X: 89 mmHg. Blood pressure is 89 mmHg. Mission accomplished?
  • 12. Example: Does drug X lower diastolic blood pressure in mice? (2) • In an ideal world without biological variation, without measurement errors... • …I have to include a control group for comparison… • ...but one observation per group still would be enough. BP in control mouse: 95 mmHg; BP in mouse given drug X: 89 mmHg. Okido. Blood pressure drops by 6 points (or 6.3%). Mission accomplished?
  • 13. Example: Does drug X lower diastolic blood pressure in female mice? (2) • In the ugly real world we have biological variation, and measurements come with errors... • ...replicate!
      Observation | BP (mmHg) in control group | BP (mmHg) in experimental group
      1 | 95 | 89
      2 | 92 | 85
      3 | 90 | 86
      4 | 98 | 94
      5 | 99 | 91
      n | ... | ...
  • 14. How to house treatment (control, experimental) groups? The importance of randomization. (Figure labels: animals on drug X; control animals.)
  • 15. Randomization is key to statistical analysis • Experimental units (“animals”) must be assigned to treatment or control groups at random. • Measurements must be made in random order and blind to the group assignment. • Randomization and blinding prevent bias. • Randomization allows us to use probability distributions and statistical theory.
  • 16. Sampling at random is not the same as haphazardly sampling. Catching order correlates with plasma levels of the stress hormone cortisol, a glucocorticoid, in fish (common carp, Cyprinus carpio). (Figure: fish numbered by catching order.) From: Weyts et al. (1997) Brain Behav. Immun. 11: 95-105
  • 17. Pseudoreplication: analyzing experiments in which treatments are not replicated (though samples may be). Design panels: no replication of treatments (n = 1); segregation of treatments (n = 1, 4 measurements); completely randomized (n = 4). Consider: are both chambers identical? What are possible confounders? Whitlock & Schluter (2015) The Analysis of Biological Data, 2nd ed., Roberts and Company Publishers, Inc. Hurlbert (1984) Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54: 187-211
  • 18. Example: Does drug X lower diastolic blood pressure in female mice? (3) • In the ugly real world we have biological variation, and measurements come with errors... • ...replicate!
      BP (mmHg) in control group | BP (mmHg) in experimental group
      95 | 89
      92 | 85
      90 | 86
      98 | 94
      99 | 91
      mean ± s (n = 5): 94.8 ± 3.8 | 89.0 ± 3.7 *
    Student’s t-test for independent samples: t = 2.44, df = (5+5)-2 = 8, p = 0.020 (one-sided). So, blood pressure is significantly reduced by 6% following treatment. Drug X is effective, then? When I repeat the experiment, will I get a significant result again?
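The t statistic on this slide can be reproduced from the ten observations. This is a minimal sketch assuming a pooled-variance (Student) t-test and a one-sided p-value, which is what matches the reported p = 0.020. The helper names `t_upper_tail` and `pooled_t_test` are invented for this illustration; the p-value is obtained by numerically integrating the t density rather than by calling a statistics package.

```python
import math
from statistics import mean, variance

def t_upper_tail(t, df, steps=2000):
    """One-sided p-value P(T >= t) for t >= 0, via Simpson integration of the t density."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: coef * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

def pooled_t_test(x, y):
    """Student's t statistic and degrees of freedom for two independent samples."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(x) - mean(y)) / se, df

control = [95, 92, 90, 98, 99]
treated = [89, 85, 86, 94, 91]
t_stat, df = pooled_t_test(control, treated)
p_one_sided = t_upper_tail(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df}, one-sided p = {p_one_sided:.3f}")
```

A two-sided test on the same data would double the p-value to roughly 0.04.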
  • 19. Samples (experiments) are a snapshot of the population. Hence: you never draw the same sample twice. And thus: you never do the same experiment twice. Power analysis by simulation: 10^4 samples (n = 5) from two populations with body weights 94.8 ± 3.8 and 89.0 ± 3.7 g (mean ± s), tested with a one-sided t-test. Effect size: 94.8 – 89.0 = 5.8 g. Critical p-value: 0.05. In ca. 7200/10000 experiments the effect is detected, and in ca. 2800/10000 experiments the effect is missed. Power ca. 72%.
  • 20. The power of a statistical test indicates the sensitivity of a test to detect an effect when there is one. In ca. 7200/10000 experiments the effect is detected: power = 72%. Power analysis by formal calculation using a t-distribution (www.gpower.hhu.de).
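A power analysis by simulation, as described on slide 19, can be sketched in a few lines. The code below is an illustration, not the author's script: it assumes normally distributed populations, a pooled-variance t-test, and the tabulated one-sided 5% critical value for 8 degrees of freedom (1.860), so it only applies to n = 5 per group. The function name `simulate_power` and the seed are arbitrary choices.

```python
import math
import random
from statistics import mean, variance

N_PER_GROUP = 5
T_CRIT_DF8 = 1.860  # one-sided 5% critical value of t with 8 df (table value)

def simulate_power(mu1, sd1, mu2, sd2, runs=10_000, seed=1):
    """Fraction of simulated experiments in which group 1 is significantly higher than group 2."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(runs):
        x = [rng.gauss(mu1, sd1) for _ in range(N_PER_GROUP)]
        y = [rng.gauss(mu2, sd2) for _ in range(N_PER_GROUP)]
        pooled_var = (variance(x) + variance(y)) / 2  # equal group sizes
        t = (mean(x) - mean(y)) / math.sqrt(pooled_var * 2 / N_PER_GROUP)
        if t > T_CRIT_DF8:
            detected += 1
    return detected / runs

power = simulate_power(94.8, 3.8, 89.0, 3.7)
print(f"estimated power: {power:.2f}")
```

With 10^4 runs the estimate fluctuates by about half a percentage point between seeds, so it should land close to the 72% quoted on the slide without matching it exactly.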
  • 21. B·E·A·N·S (more on this later…). Power analysis by formal approximation assuming a normal (z) distribution (which gives an overestimation of power): $z_\beta = z_\alpha - \dfrac{ES \cdot \sqrt{n}}{\sqrt{2} \cdot \sigma}$. With the example values: $z_\beta = 1.645 - \dfrac{(94.8 - 89.0) \cdot \sqrt{5}}{\sqrt{2} \cdot 3.75} = -0.800$, so $p(z \geq -0.800) = 0.7881$; power ca. 79%.
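The normal approximation on this slide is easy to evaluate with the standard library. A minimal sketch, assuming a one-sided test at α = 0.05 and the slide's values (ES = 5.8, σ = 3.75, n = 5); the function name `power_normal_approx` is chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_normal_approx(effect, sd, n, alpha=0.05):
    """z_beta and power for a one-sided two-sample comparison, normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)                      # 1.645 for alpha = 0.05
    z_beta = z_alpha - effect * sqrt(n) / (sqrt(2) * sd)
    power = 1 - z.cdf(z_beta)                           # p(z >= z_beta)
    return z_beta, power

z_beta, power = power_normal_approx(effect=94.8 - 89.0, sd=3.75, n=5)
print(f"z_beta = {z_beta:.3f}, power = {power:.3f}")
```

The result (about 79%) is higher than the 72% from the t-distribution, which is the overestimation the slide mentions.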
  • 22. How sample size, variability, and significance level affect the power of a statistical analysis. Starting point: power 72%. Sample size doubles: power increases ca. 30% (from 72% to 95%). Variability (SD) doubles: power more than halves (from 72% to 30%). Significance level five times more stringent (from 5% to 1%): power halves (from 72% to 39%).
  • 23. How many times will your test give a significant outcome when there is no difference between groups? 10^5 samples (n = 5) from two populations with the same body weights 94.8 ± 3.8 g (mean ± s), tested with a one-sided t-test.
  • 24. Example: Does drug X lower diastolic blood pressure in female mice? (4) • In the ugly real world we have biological variation, and measurements come with errors... • ...replicate! • ...control! - Paired controls remove inter-individual “background” variation
      Mouse | BP (mmHg) before 4-week treatment | BP (mmHg) after 4-week treatment | Difference, Δ
      1 | 95 | 89 | -6
      2 | 92 | 85 | -7
      3 | 90 | 86 | -4
      4 | 98 | 94 | -4
      5 | 99 | 91 | -8
      mean ± s | 94.8 ± 3.8 | 89.0 ± 3.7 | -5.8 ± 1.8
    Student’s t-test for paired samples: t = 7.25, df = 5-1 = 4, p = 0.00096 ***
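The paired analysis uses the same ten numbers as the independent-samples test but works on the per-mouse differences. A minimal sketch, assuming a one-sided p-value (which matches the reported 0.00096); `t_upper_tail` is a helper written for this illustration that integrates the t density numerically.

```python
import math
from statistics import mean, stdev

def t_upper_tail(t, df, steps=2000):
    """One-sided p-value P(T >= t) for t >= 0, via Simpson integration of the t density."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: coef * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

before = [95, 92, 90, 98, 99]
after = [89, 85, 86, 94, 91]
diffs = [a - b for a, b in zip(after, before)]   # -6, -7, -4, -4, -8

n = len(diffs)
se = stdev(diffs) / math.sqrt(n)                 # standard error of the mean difference
t_stat = mean(diffs) / se                        # negative: pressure went down
df = n - 1
p_one_sided = t_upper_tail(abs(t_stat), df)
print(f"mean diff = {mean(diffs):.1f}, t = {t_stat:.2f}, df = {df}, p = {p_one_sided:.5f}")
```

The standard deviation of the differences (1.8) is about half that of the raw readings (3.7 to 3.8), which is why |t| rises from 2.44 to 7.25 on identical data.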
  • 25. Example: Does drug X lower diastolic blood pressure in female mice? (5) • In the ugly real world we have biological variation, and measurements come with errors... • ...replicate! • ...control! - Pair up! - Placebo!
      Mouse | BP before 4-week treatment (mmHg) | BP after 4-week treatment (mmHg) | Δ
      1 (drug X) | 95 | 89 | -6
      2 (drug X) | 92 | 85 | -7
      3 (drug X) | 90 | 86 | -4
      4 (drug X) | 98 | 94 | -4
      5 (drug X) | 99 | 91 | -8
      mean ± s | 94.8 ± 3.8 | 89.0 ± 3.7 | -5.8 ± 1.8
      6 (placebo) | 90 | 88 | -2
      7 (placebo) | 96 | 93 | -3
      8 (placebo) | 99 | 92 | -7
      mean ± s | 95.0 ± 4.6 | 91.0 ± 2.7 | -4.0 ± 2.6
    Student’s t-test for independent samples, using the mean differences of the two groups: t = 1.17, df = 6, p = 0.143 (ns). But, how ‘bout the power of this experiment?
  • 26. Control in the context of experimental design:
      • Ceteris paribus: all other things being equal
      • Any treatment against which one or more other treatments are to be compared
        - “no treatment”
        - “procedural treatment”: saline injection, sham operation, placebo administration, paired observations (“before – after” measurements)
    Control in the technical context of experimentation:
      • The regulation of the physical environment in which an experiment is conducted
        - …but this is not a statistical consideration
  • 27. “A study with low statistical power has a reduced chance of detecting a true effect, ... ...but low power also reduces the likelihood that a statistically significant result reflects a true effect.” Button et al. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature Rev. Neurosci. 14: 365-76 doi:10.1038/nrn3475
  • 28. “In well-planned research investigations, the question of appropriate sample size is crucial.” Hinkle, Wiersma, Jurs (2003) Applied Statistics for the Behavioral Sciences, 5th ed.
  • 29. Let’s perform an experiment: does drug “Y” affect the body weight of rats? Groups: without “Y” (placebo, control) and treated with “Y”.
  • 30. Again: in a perfect world, without any variation or error... Groups: without “Y” (placebo, control) and treated with “Y”.
  • 31. ...two (2) observations would suffice (because the difference in body weights can only be explained by the treatment). Animal without “Y” (placebo, control); animal treated with “Y”.
  • 32. But: we have to deal with biological variation (1). 3,2'-dimethyl-4-aminobiphenyl (DMAB) carcinogenicity differs between rat strains. Shirai et al. (1990) Carcinogenesis 11: 793-97.
  • 33. Biological variation (2). Variation in avoidance learning between strains and between individual mice. (Figure panels: one outbred and three inbred strains.) Bovet et al. (1969) Science 163(3863): 139-49.
  • 34. Biological variation (3). Clinical reference values are not fixed, but come in an interval.
  • 35. Biological variation (4). Clinical reference intervals* differ per hospital/medical centre:
      Analyte | Jeroen Bosch Ziekenhuis, ‘s-Hertogenbosch | St. Elisabeth Ziekenhuis, Tilburg | VU Medisch Centrum, Amsterdam
      osmolality (mOsmol/kg) | 275 – 300 | 275 – 300 | 280 – 300
      albumin (g/l) | 35 – 50 | 35 – 50 | 35 – 52
      glucose (fasted) (mmol/l) | 4.0 – 6.0 | 3.6 – 5.6 | < 6.1
      creatinine (μmol/l) | 60 – 110 ♂, 50 – 100 ♀ | 60 – 110 ♂, 50 – 90 ♀ | 60 – 104 ♂, 49 – 90 ♀
      iron (μmol/l) | 14 – 30 ♂, 10 – 25 ♀ | 11 – 28 ♂, 6.6 – 26 ♀ | 11 – 32 ♂, 11 – 27 ♀
      TSH (mU/l) | 0.35 – 4.0 | 0.3 – 3.2 ♂, 0.3 – 3.9 ♀ | 0.3 – 4.5
    * Intervals comprise 95% of the measurements in healthy persons. (So 2.5% of these will have a higher, and 2.5% a lower than normal result.)
  • 36. (Biological) variation is not without consequence: in the real world, no two samples are ever identical... Four simulated samples (n = 4; panels 1 to 4) from two normally distributed rat populations with body weights of, respectively, 100 ± 15 g and 120 ± 15 g (population means 100 and 120; the standard deviation is 15 g).
  • 37. Samples from two populations N(100, 15²) and N(120, 15²). The larger the sample, the more reliably its statistics reflect the population’s parameters. Figure annotations: P = 0.00028 (***); P = 0.060 (n.s.); x̄ = 116.0, s = 12.6; x̄ = 104.2, s = 15.0.
  • 38. The larger the sample the better, but too large a sample is a waste of resources
      • Reduce the number of treatment groups being compared.
      • Find a more precise measurement.
      • Decrease the variability in the measurements.
        - Make subjects more homogeneous.
        - Use stratification.
        - Control for other variables (e.g., sex, weight).
        - Average multiple measurements on each subject.
      • Remember the 3 Rs (Replacement, Reduction, Refinement) in laboratory animal practice: Russell WMS, Burch RL (1959) The Principles of Humane Experimental Technique, Methuen, London
  • 39. Which factors determine whether our statistical test gives a significant outcome? B·E·A·N·S
  • 40. B·E·A·N·S:
      B: beta (β), the Type-2 error (false negative) probability. Power is calculated as (1 − β). Typically: 0.10 ≤ β ≤ 0.20.
      E: effect size (ES) and effect direction; one- or two-sided testing. To be stated by the researcher.
      A: alpha (α), the Type-1 error (false positive) probability. Conventionally: 0.01 ≤ α ≤ 0.05.
      N: n, the number of experimental units (“number of animals”). To be calculated by the researcher.
      S: s, the variability in measurements (standard deviation (s, σ) or standard error (SE)). To be controlled by the researcher.
  • 42. You never draw the same sample twice from the same population… Body weight (in kg) of 50 Bioscience students. Body weight distributions in 9×2 repeated samples (n = 5) from 50 Bioscience students.
  • 43. You never draw the same sample twice from the same population… but larger samples are more reliable. Body weight (in kg) of 50 Bioscience students. Body weight distributions in 9×2 repeated samples (n = 15) from 50 Bioscience students.
  • 44. You never draw the same sample twice from the same population… Body weight (in kg) of 50 Bioscience students. Body weight distributions in 9×2 repeated samples (n = 5) from 50 Bioscience students. If this represents your experiment, you would conclude that the samples are from two different populations. They aren’t. You see an effect where there is none. This is a false positive. You commit a Type-1 error.
  • 45. Any effect size becomes statistically significant...when the sample size is large enough.
      Population IQ (µ ± σ) | Sample size (n) | Sample IQ (mean ± s) | Effect size (ES) | p
      100.0 ± 15.0 | 4 | 112.3 ± 15.0 | 12.3 | 0.05
      | 25 | 104.9 | 4.9 | 0.05
      | 64 | 103.1 | 3.1 | 0.05
      | 400 | 101.24 | 1.24 | 0.05
      | 2500 | 100.50 | 0.50 | 0.05
      | 10000 | 100.25 | 0.25 | 0.05
    A difference of ca. 12 IQ points in a sample of 4 is just as significant, statistically, as a difference of 0.25 IQ points in a sample of 10000. Statistically significant, yes. But (clinically) relevant?? A p-value alone is uninformative. Equally important are: • How big is the effect? (What is the “oomph” factor?) • Should we care/who cares? (What is the relevance?)
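The effect sizes in this table can be reproduced under one assumption that the slide does not state explicitly: a one-sided, one-sample z-test with known σ = 15 at α = 0.05. Under that assumption, the smallest difference that just reaches p = 0.05 is z_α · σ / √n. The function name below is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

SIGMA = 15.0                               # population SD of IQ, from the slide
Z_CRIT = NormalDist().inv_cdf(0.95)        # one-sided 5% critical z, about 1.645

def smallest_significant_effect(n, sigma=SIGMA):
    """Difference from the population mean that gives exactly p = 0.05 (one-sided z-test)."""
    return Z_CRIT * sigma / sqrt(n)

effects = {n: smallest_significant_effect(n) for n in (4, 25, 64, 400, 2500, 10000)}
for n, es in effects.items():
    print(f"n = {n:>5}: ES = {es:.2f} IQ points")
```

The printed values agree with the ES column to within rounding, so the table appears to be a list of "just significant" differences at each sample size.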
  • 47. You never draw the same sample twice from the same population… Body weight distributions in 9×2 repeated samples (n = 5). If this represents your experiment, you would conclude that the samples are from the same population. They are not. You missed an effect. You see a false negative. You commit a Type-2 error.
  • 48. Let’s repeat the experiment not 9, but 1000 times... Females: 61.7 ± 9.1 kg, males: 75.9 ± 9.1 kg. ES = 14.2 kg, d = 14.2/9.1 = 1.56. Ca. 2700 out of 10^4 samples (27%) do not reach statistical significance in a t-test (p > 0.05). Power is 1 – 0.27 = 0.73 = 73%.
  • 49. Biological variation causes Type-1 and Type-2 errors: false positives and false negatives. “You’re not pregnant.” (false negative); “You’re pregnant.” (false positive)
  • 51. Any effect becomes statistically significant...when the sample size is large enough.
      Population IQ (µ ± σ) | Sample size (n) | Sample IQ (mean ± s) | Relative effect size (d) | p
      100.0 ± 15.0 | 4 | 112.3 ± 15.0 | 12.3/15.0 = 0.82 | 0.05
      | 25 | 104.9 | 0.33 | 0.05
      | 64 | 103.1 | 0.2 | 0.05
      | 400 | 101.24 | 0.08 | 0.05
      | 2500 | 100.50 | 0.033 | 0.05
      | 10000 | 100.25 | 0.0167 | 0.05
    This illustrates an “inverse square law”: to detect an effect 2× as small, sample size has to increase 2² = 4-fold. How ‘bout when effect size is 3× smaller?
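The inverse square law can be written out as a short worked equation. The derivation assumes the one-sided, one-sample z-test with known σ that reproduces the table (an assumption, since the slide does not name the test); k is an arbitrary shrink factor introduced here.

```latex
% At the significance threshold the effect equals z_alpha standard errors:
ES = z_\alpha \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
n = \left(\frac{z_\alpha\,\sigma}{ES}\right)^{2} = \left(\frac{z_\alpha}{d}\right)^{2},
\qquad d = \frac{ES}{\sigma}.

% Shrinking the effect by a factor k multiplies the required n by k^2:
n' = \left(\frac{z_\alpha}{d/k}\right)^{2} = k^{2}\, n,
\qquad k = 2 \Rightarrow n' = 4n, \qquad k = 3 \Rightarrow n' = 9n.
```

So an effect 3× smaller needs a 9-fold larger sample; in the table, going from d = 0.82 to d = 0.0167 (about 50× smaller) takes n from 4 to 10000 (2500× larger).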
  • 52. Effects of body weight (BMI) and relative risk for disease. Small effects do matter, sometimes! For a female 1.70 m tall, going from BMI 23.5 to 22.5 requires a weight loss of ca. 3 kg (ca. 4%). Would you buy a diet pill for that? Her relative risk for type-2 diabetes is reduced by 1/3. That would be worth a diet pill! Willett et al. (1999) N. Engl. J. Med. 341: 427
  • 54. Reducing variability by clever experimental design. Example. Suppose:
      • We have 20 male mice and 20 female mice.
      • Half to be treated; the other half left untreated.
      • We can only work with 4 mice per day.
      • How to assign individuals to treatment groups and to days?
  • 55. A very bad experimental design. Week 1 (Mon to Fri, 4 animals per day): all 20 animals are controls (C). Week 2 (Mon to Fri, 4 animals per day): all 20 animals are treated (T). Legend: C, control; T, treated; squares are colour-coded by sex (female, male).
  • 56. A fully randomized design (using a table of random numbers, tossing a die, flipping a coin, ...)
      Week 1
      Mon | Tues | Wed | Thurs | Fri
      T | T | C | T | T
      C | T | C | C | T
      C | C | C | T | C
      C | C | C | T | T
      Week 2
      Mon | Tues | Wed | Thurs | Fri
      C | T | C | C | T
      T | C | T | C | C
      C | T | T | T | T
      T | C | C | T | T
    Full randomization doesn’t always work well with small sample sizes as it can lead to segregation of treatments or factors. (See Week 1/Wednesday, Week 2/Wednesday)
  • 57. A stratified (blocked) design
      Week 1
      Mon | Tues | Wed | Thurs | Fri
      C | C | C | C | T
      C | T | T | T | C
      T | T | T | C | C
      T | C | C | T | T
      Week 2
      Mon | Tues | Wed | Thurs | Fri
      C | C | C | T | T
      T | C | T | C | T
      C | T | T | T | C
      T | T | C | C | C
    Two strata (blocks): 1. Treatment group: treated vs control animals 2. Sex: female vs male
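One way to generate a blocked schedule like the one on slide 57 is sketched below. It assumes that every working day should contain one treated and one control animal of each sex, which is consistent with the grid shown but is not spelled out on the slide. The animal labels (F01, M01, ...), the function name, and the seed are invented for this illustration.

```python
import random

DAYS = [f"week {w} {d}" for w in (1, 2) for d in ("Mon", "Tues", "Wed", "Thurs", "Fri")]

def blocked_schedule(seed=0):
    """Assign 20 females and 20 males to 10 days; each day gets F-treated, F-control, M-treated, M-control."""
    rng = random.Random(seed)
    females = [f"F{i:02d}" for i in range(1, 21)]
    males = [f"M{i:02d}" for i in range(1, 21)]
    rng.shuffle(females)  # random order decides both the day and the treatment
    rng.shuffle(males)
    schedule = {}
    for k, day in enumerate(DAYS):
        f_treated, f_control = females[2 * k], females[2 * k + 1]
        m_treated, m_control = males[2 * k], males[2 * k + 1]
        slots = [(f_treated, "T"), (f_control, "C"), (m_treated, "T"), (m_control, "C")]
        rng.shuffle(slots)  # randomize the handling order within the day
        schedule[day] = slots
    return schedule

schedule = blocked_schedule(seed=42)
for day, slots in schedule.items():
    print(day, slots)
```

Treatment and sex are balanced within every day by construction, so a "Wednesday with only controls" cannot occur, while the choice of which animal lands where remains random.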
  • 58. Randomization and stratification • If you can (and want to), fix a variable. - e.g., use only 8-week old male mice from a single strain. • If you don’t fix a variable, stratify (block) it. - e.g., use both 8-week and 12-week old male mice, and stratify with respect to age. • If you can neither fix nor stratify a variable, randomize it.
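  • 60. B·E·A·N·S: knowing 4 parameter values, you can calculate the 5th. $n = \dfrac{2 s^2 (z_\alpha - z_\beta)^2}{ES^2}$, or as a mnemonic $N = \dfrac{2 S^2 (z_A - z_B)^2}{E^2}$; $ES = \dfrac{\sqrt{2} \cdot s \cdot (z_\alpha - z_\beta)}{\sqrt{n}}$; $z_\beta = z_\alpha - \dfrac{ES \cdot \sqrt{n}}{\sqrt{2} \cdot s}$. B – β, Type-2 error proba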
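As an illustration of the first formula, the sketch below solves for n using the blood pressure example from the earlier slides (ES = 5.8, s = 3.75). The choices of one-sided α = 0.05 and 80% power are assumptions made here, not values given on this slide, and the function name is hypothetical. Following the slides' convention, z_β is the z-value whose upper tail equals the power, so it is negative when power exceeds 50%.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """n per group from n = 2 s^2 (z_alpha - z_beta)^2 / ES^2 (normal approximation, one-sided)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # 1.645 for alpha = 0.05
    z_beta = z.inv_cdf(1 - power)    # -0.842 for 80% power
    return 2 * sd ** 2 * (z_alpha - z_beta) ** 2 / effect ** 2

n_exact = sample_size_per_group(effect=5.8, sd=3.75)
n_animals = ceil(n_exact)            # always round up to whole animals
print(f"n = {n_exact:.2f} -> {n_animals} animals per group")
```

Because the normal approximation overestimates power for small samples, a calculation based on the t-distribution (for example in G*Power) can be expected to ask for somewhat more animals than this.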