Biostatistics & Epidemiology Slide Book 2020 - Amazon Web Services


Boards & Beyond: Biostatistics & Epidemiology Slides
Color slides for USMLE Step 1 preparation from the Boards and Beyond Website
Jason Ryan, MD, MPH
2021 Edition

Boards & Beyond provides a virtual medical school curriculum used by students around the globe to supplement their education and prepare for board exams such as USMLE Step 1.

This book of slides is intended as a companion to the videos for easy reference and note-taking. Videos are subject to change without notice. PDF versions of all color books are available via the website as part of membership.

Visit www.boardsbeyond.com to learn more.

Copyright 2021 Boards and Beyond
All rights reserved.


Table of Contents
- Basic Statistics
- Hypothesis Testing
- Tests of Significance
- Correlations
- Study Designs
- Risk Quantification
- Sensitivity/Specificity
- Positive & Negative Predictive Values
- Diagnostic Tests
- Bias
- Clinical Trials
- Evidence-Based Medicine


Basic Statistics

Statistical Distribution
[Figure: random blood glucose values in healthy subjects, plotted as number of subjects vs. blood glucose level]

Central Tendency
- Normal or Gaussian distribution
- Center of normal distribution
- Three ways to characterize:
  - Mean: average of all numbers
  - Median: middle number of data set when all lined up in order
  - Mode: most commonly found number

Mean and Mode
- Six blood pressure readings: 90, 80, 80, 100, 110, 120
- Mean = (90 + 80 + 80 + 100 + 110 + 120)/6 = 96.7
- Mode is most frequent number = 80

Median
- Odd number of data elements in set
  - 80-90-110
  - Middle number is median = 90
- Even number of data elements
  - 80-90-110-120
  - Halfway between middle pair is median = 100
- Note: must put data set in order to find median
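As a quick check of the definitions above, the following minimal sketch uses Python's standard `statistics` module on the six blood pressure readings and on the two small median examples. The variable names are illustrative only; the median of the six readings is computed here and is not stated on the slide.

```python
from statistics import mean, median, mode

# Six blood pressure readings from the slide
readings = [90, 80, 80, 100, 110, 120]

mean_bp = mean(readings)      # sum of readings divided by 6
median_bp = median(readings)  # median() sorts internally; even count, so it averages the middle pair
mode_bp = mode(readings)      # most frequent value

# Median examples from the slide: odd and even number of elements
median_odd = median([80, 90, 110])
median_even = median([80, 90, 110, 120])

print(round(mean_bp, 1), median_bp, mode_bp, median_odd, median_even)
```

Because `median` sorts its input, the reminder to put the data in order first applies to hand calculation rather than to this code.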

Central Tendency
[Figure: symmetric distribution; x-axis blood glucose level, y-axis number of subjects]
- Mode is always the highest point
- If distribution is even, mean = median = mode

Central Tendency: Negative Skew
[Figure: negatively skewed distribution; from the tail toward the peak: mean, median, mode]

Central Tendency: Positive Skew
[Figure: positively skewed distribution; from the peak toward the tail: mode, median, mean]

Key Points
- If distribution is equal, mean = mode = median
- Mode is always at peak
- In skewed data:
  - Mean is always furthest away from mode, toward the tail
  - Median is between mean and mode
- Mode is least likely to be affected by outliers
  - Adding one outlier changes mean, median
  - Only affects mode if it changes most common number
  - Outliers are unlikely to change most common number

Dispersion
[Figure: two blood glucose distributions with the same mean and different spread; 10 mg/dl marked on each]

Dispersion Measures
- Standard deviation (SD)
- Variance
- Standard error of the mean (SEM)
- Z-score
- Confidence interval

Standard Deviation

$\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

- $x - \bar{x}$ = difference between data point and mean
- $\sum (x - \bar{x})$ = sum of differences
- $\sum (x - \bar{x})^2$ = sum of differences squared
- $n$ = number of samples

Standard Deviation: Group 1 (mean 10)

Value   Difference from mean   Squared
9       -1                     1
8       -2                     4
9       -1                     1
10       0                     0
11       1                     1
12       2                     4
10       0                     0
10       0                     0

Sum of squared differences = 11, n = 8
$\sigma = \sqrt{11/7} \approx 1.25$

Standard Deviation: Group 2 (mean 10)

Value   Difference from mean   Squared
5       -5                     25
6       -4                     16
9       -1                     1
10       0                     0
12       2                     4
13       3                     9
15       5                     25
14       4                     16

Sum of squared differences = 96, n = 8
$\sigma = \sqrt{96/7} \approx 3.70$

Both groups have a mean of about 10, but Group 2 is more spread out and so has the larger standard deviation.

[Figure: normal curve; the range from -1σ to +1σ contains 68% of values]
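The worked tables can be reproduced with a short sketch. The helper `sd_about` is a hypothetical name introduced here; it applies the slide formula using the rounded mean of 10 as the slide's difference columns do, while `statistics.stdev` uses each group's exact mean, so the two results differ slightly.

```python
from math import sqrt
from statistics import stdev

group1 = [9, 8, 9, 10, 11, 12, 10, 10]
group2 = [5, 6, 9, 10, 12, 13, 15, 14]

def sd_about(values, center):
    """Slide formula: sqrt(sum((x - center)^2) / (n - 1))."""
    squared = [(x - center) ** 2 for x in values]
    return sqrt(sum(squared) / (len(values) - 1))

# Differences taken from the rounded mean of 10, as in the slide tables
sd1_rounded_mean = sd_about(group1, 10)
sd2_rounded_mean = sd_about(group2, 10)

# Sample standard deviation about each group's exact mean
sd1_exact = stdev(group1)
sd2_exact = stdev(group2)

print(round(sd1_rounded_mean, 2), round(sd2_rounded_mean, 2))
print(round(sd1_exact, 2), round(sd2_exact, 2))
```

Either way, the more scattered Group 2 has roughly three times the standard deviation of Group 1.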

Standard Deviation
[Figure: normal curve; -2σ to +2σ contains 95% of values, -3σ to +3σ contains 99.7%]

Standard Deviation: Sample Question
- A test is administered to 200 medical students. The mean score is 80 with a standard deviation of 5. The test scores are normally distributed. How many students scored above 90 on the test?
- 90 is two standard deviations above the mean
- 2.5% of students score in this range (1/2 of the 5% outside +/- 2σ)
- 2.5% of 200 = 5

Variance
- Standard deviation: $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
- Variance: $\sigma^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

Standard Error of the Mean
- How precisely you know the true population mean
- SD divided by square root of n: $\text{SEM} = \dfrac{\sigma}{\sqrt{n}}$
- More samples -> less SEM (closer to true mean)
- Big σ means big SEM
  - Need lots of samples (n) for small SEM
- Small σ means small SEM
  - Need fewer samples (n) for small SEM
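The sample question relies on the 95% rule of thumb. As a rough illustration, the sketch below compares that rule with the exact normal tail using `statistics.NormalDist` (Python 3.8+); the variable names are illustrative assumptions.

```python
from statistics import NormalDist

students = 200
scores = NormalDist(mu=80, sigma=5)  # mean 80, SD 5, as in the question

# Rule of thumb from the slide: 5% lie outside +/- 2 SD, half of that above
rule_of_thumb_fraction = 0.05 / 2
rule_of_thumb_count = rule_of_thumb_fraction * students

# Exact area of the normal curve above a score of 90 (2 SD above the mean)
exact_fraction = 1 - scores.cdf(90)
exact_count = exact_fraction * students

print(rule_of_thumb_count, round(exact_count, 1))
```

The exact tail is slightly smaller than 2.5%, but the rule of thumb is the level of precision the question expects.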

Z score
- Z score of 0 is the mean
- Z score of +1 is 1 SD above mean
- Z score of -1 is 1 SD below mean
[Figure: normal curve with the x-axis labeled both in SD units (-3σ to +3σ) and Z scores (-3 to +3)]

Z score
- Suppose test grade average (mean) = 79
- Standard deviation = 5
- Your grade = 89
- Your Z score = (89 - 79)/5 = 2

Confidence Intervals
- Mean values often reported with 95% CIs
  - Mean is 120 mg/dl +/- 5 mg/dl
- Range in which 95% of repeated measurements would be expected to fall
- Confidence intervals are for estimating the population mean from a sample data set
  - Suppose we take 10 samples of a population of 1M people
  - Mean of 10 samples is X
  - How sure are we the mean of 1M people is also X?
  - Confidence intervals answer this question

Confidence Intervals
- CI95% = Mean +/- 1.96*(SEM)
- Suppose mean = 10, SD = 4, n = 16
- SEM = 4/sqrt(16) = 4/4 = 1
- CI = 10 +/- 1.96*(1), approximately 10 +/- 2
- 95% of repeated means fall between 8 and 12
  - Upper confidence limit = 12
  - Lower confidence limit = 8

Confidence Intervals
- Don't confuse SD with confidence intervals
- Standard deviation is for a given dataset
  - Suppose we have ten samples
  - These samples have a mean and standard deviation
  - 95% of these samples fall between +/- 2SD
  - This is a descriptive characteristic of the sample
- Confidence intervals
  - This does not describe the sample
  - An inferred value of where the true mean lies for the population

95%
- This value is often confusing
- Read carefully: what are they asking for?
- Range in which 95% of measurements in a dataset fall
  - Mean +/- 2SD
- 95% confidence interval of the mean
  - Mean +/- 1.96*SEM
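The distinction between the two "95%" ranges can be made concrete with the slide's numbers (mean 10, SD 4, n 16). This is a minimal sketch; the function names are hypothetical and chosen only for illustration.

```python
from math import sqrt

def z_score(value, mean, sd):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / sd

def sd_range_95(mean, sd):
    """Descriptive range: about 95% of individual measurements (mean +/- 2 SD)."""
    return mean - 2 * sd, mean + 2 * sd

def ci_95(mean, sd, n):
    """Inferential range: 95% confidence interval of the mean (mean +/- 1.96 SEM)."""
    sem = sd / sqrt(n)
    return mean - 1.96 * sem, mean + 1.96 * sem

grade_z = z_score(89, 79, 5)            # test grade example
data_low, data_high = sd_range_95(10, 4)
ci_low, ci_high = ci_95(10, 4, 16)

print(grade_z, (data_low, data_high), (round(ci_low, 2), round(ci_high, 2)))
```

The SD range (2 to 18) describes where individual values fall, while the much narrower confidence interval (about 8 to 12) describes where the true mean is likely to lie.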

Hypothesis Testing

Hypothesis Testing
- A cardiologist discovers a protein level that may be elevated in myocardial infarction called MIzyme. He wishes to use this to detect heart attacks in the ER. He samples levels of MIzyme among 100 normal subjects and 100 subjects with a myocardial infarction. The mean level in normal subjects is 1 mg/dl. The mean level in myocardial infarction patients is 10 mg/dl.
- Can this test be used to detect myocardial infarction in the general population?

Hypothesis Testing
- Other way to think about it: Does the mean value of MIzyme in normal subjects truly differ from the mean in myocardial infarction patients?
- Or was the difference in our experiment simply due to chance?
- Depends on several factors:
  - Difference between means normal/MI (d)
  - Scatter of data
  - Number of subjects tested

Hypothesis Testing: Scatter
[Figure: MIzyme level in normal and MI groups, separated by difference d, with little vs. wide scatter]
- Key Point: Scatter of data points influences likelihood that there is a true difference between means

Hypothesis Testing: Number of samples
[Figure: MIzyme level in normal and MI groups, separated by difference d, with few vs. many data points]
- Key Point: Number of data points influences likelihood that there is a true difference between means

Hypothesis Testing
- Hypothesis testing mathematically calculates probabilities (i.e. 5% chance, 50% chance) that the two means are truly different and not just different by chance in our experiment
- Math is complex (don't need to know)
- Probabilities by hypothesis testing depend on:
  - Difference between means normal/MI
  - Scatter of data
  - Number of subjects tested

Hypothesis Testing
- Two possibilities of our test of MIzyme
- #1: MIzyme does NOT distinguish between normal/MI
  - Difference in means was by chance; true means are the same
- #2: MIzyme DOES distinguish between normal/MI
  - Difference in means is real
- Null hypothesis (H0) = #1
- Alternative hypothesis (H1) = #2

Hypothesis Testing
- In reality, either H0 or H1 is correct
- In our experiment, either H0 or H1 will be deemed correct
- Hypothesis testing determines likelihood our experiment matches with reality

Hypothesis Testing
- Four possible outcomes of our experiment:
  - #1: There is a difference in reality and our experiment detects it. This means the alternative hypothesis (H1) is found true by our study.
  - #2: There is no difference in reality and our experiment also finds no difference. This means the null hypothesis (H0) is found true by our study.
  - #3: There is no difference in reality but our study finds a difference. This is an error! Type 1 (α) error.
  - #4: There is a difference in reality but our study misses it. This is also an error! Type 2 (β) error.

Hypothesis Testing

                  Reality: H1    Reality: H0
Experiment: H1    Power          α
Experiment: H0    β              H0 correct

- Each of the four outcomes has a probability based on:
  - Difference between means normal/MI
  - Scatter of data
  - Number of subjects tested
- Power = chance of detecting a difference
- α = chance of seeing a difference that is not real
- β = chance of missing a difference that is really there
- Power = 1 - β

Power
- Chance of finding a difference when one exists
- Or chance of rejecting no difference (because there really is one)
  - Also called rejecting the null hypothesis (H0)
- Power is increased when:
  - Increased sample size
  - Large difference of means
  - Less scatter of data (more precise measurements)

Power
- Maximize power to detect a true difference
- In study design, you have little/no control over:
  - Scatter of data
  - Difference between means
- You DO have control over:
  - Number of subjects
- Number of subjects chosen to give a high power
- This is called a power calculation

Statistical Errors
- Type 1 (α) error
  - False positive
  - Finding a difference/effect when there is none in reality
  - Rejecting null hypothesis (H0) when you should not have
  - Example: Researchers conclude a drug benefits patients but it does not
  - Null hypothesis generally not rejected unless α < 0.05
  - Similar to (but different from) the p value
    - p value is calculated from the group comparison
    - α is set by study design

Statistical Errors
- Type 2 (β) error
  - False negative
  - Finding no difference/effect when there is one in reality
  - Accepting null hypothesis (H0) when you should not have
  - Example: Researchers conclude a drug does not benefit patients but a later study finds that it does
  - Can get type 2 error if too few patients

Tests of Significance

Comparing Groups
- Many clinical studies compare group means
- Often find differences between groups
  - Different mean ages
  - Different mean blood levels, etc.
- Need to compare differences to determine the likelihood that they are real and not due to chance
- Are the differences "statistically significant?"

Comparing Groups
[Figure: test results for Group 1 and Group 2 separated by difference d. Left: little scatter of data in groups, groups far apart relative to scatter. Right: lots of scatter of data in groups, groups not far apart relative to scatter.]

Comparing Groups
- Key Point: Scatter of data points relative to difference in means influences likelihood that difference between means is due to chance
- This is how differences between means are tested to determine likelihood that they are different due to chance
- Don't need to know the math
- Just understand principle

Comparing Groups
[Figure: test results for Group 1 and Group 2 separated by difference d, with few vs. many data points]
- Key Point: Number of data points also influences likelihood that difference between means is due to chance

Comparing Groups
- Three key tests
  - t-test
  - ANOVA
  - Chi-square
- Determine likelihood difference between means is due to chance
- Likelihood of difference due to chance based on:
  - Scatter of data points
  - How far apart the means are from each other
  - Number of data points

Data Types
- Quantitative variables:
  - 1, 2, 3, 4
- Categorical variables:
  - High, medium, low
  - Positive, negative
  - Yes, No
- Quantitative variables often reported as a number
  - Mean age was 62 years old
- Categorical variables often reported as percentages
  - 40% of patients take drug A
  - 20% of patients are heavy exercisers

T-test
- Compares two MEAN quantitative values
- Yields a p-value
- p value is the probability of seeing a difference at least this large by chance if the null hypothesis is true
  - Null hypothesis: no difference between means
- If p < 0.05 we usually reject the null hypothesis and state that the difference in means is "statistically significant"

T-test
- A researcher studies plasma levels of sodium in patients with SIADH and normal patients. The mean value in SIADH patients is 128 mEq/L with a standard deviation of 2. The mean value in normal patients is 136 mEq/L with a standard deviation of 3. Is this difference significant?
- Common questions:
  - Which test to compare the means? (t-test)
  - What p-value indicates significance? (< 0.05)

T-test
- For the same SIADH study: if the p value is high (non-significant), why might that be the case?
  - Need more patients
  - Increase sample size -> increase power to detect differences

ANOVA
- Analysis of variance
- Used to compare more than two quantitative means
- Consider:
  - Plasma level of creatinine determined in non-pregnant, pregnant, and post-partum women
  - Three means determined
  - Cannot use t-test (two means only)
  - Use ANOVA
- Yields a p-value like t-tests

Chi-square
- Compares two or more categorical variables
- Must use this test if results are not hard numbers
- When asked to choose a statistical test for a dataset, always ask yourself whether the data is quantitative or categorical
- Beware of percentages - often categorical data

Confidence Intervals
- Sixteen normal subjects have their blood glucose level sampled. The mean blood glucose level is 90 mg/dl with a standard deviation of 4 mg/dl. What is the likelihood that the mean glucose level of another ten subjects would also be 90 mg/dl?
- How confident are we in the number 90 mg/dl?

Confidence Intervals
- In scientific literature, means are reported with a confidence interval
- Study subjects: Mean glucose was 90 +/- 4
- Authors believe that if the study subjects were resampled, the mean result would fall between 86 and 94 for 95% of re-samples
- For 5% of re-samples, the result would fall outside of 86 to 94

Confidence Intervals
- To calculate a confidence interval you need 2 things
  - Standard deviation (σ)
  - Number of subjects tested to find mean value (n)
- $\text{Confidence Interval} = \pm Z \cdot \dfrac{\sigma}{\sqrt{n}}$
- Z = 1.96 for 95% CI
- Z = 2.58 for 99% CI

Confidence Interval
- Sixteen normal subjects have their blood glucose level sampled. The mean blood glucose level is 90 mg/dl with a standard deviation of 4 mg/dl. What is the likelihood that the mean glucose level of another sixteen subjects would also be 90 mg/dl?
- $\text{Confidence Interval} = Z \cdot \dfrac{\sigma}{\sqrt{n}} = 1.96 \cdot \dfrac{4}{\sqrt{16}} = 1.96 \approx 2$
- 95% chance that the mean of the next 16 samples would fall between 88 and 92 mg/dl

Confidence Interval
- Don't confuse with standard deviation
- Mean +/- 2SD
  - 95% of samples fall in this range
- Mean +/- CI
  - 95% chance that repeated measurement of the mean falls in this range
- If you see 95% in a question stem
  - Read carefully: what are they asking for?
  - Range of 95% of samples?
  - 95% confidence interval of the mean?
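For the glucose example, a small sketch shows how the interval widens when moving from 95% to 99% confidence, using the two Z values listed on the slide. The dictionary `Z_VALUES` and the function name are assumptions made for this illustration.

```python
from math import sqrt

# Z values quoted on the slide for each confidence level
Z_VALUES = {95: 1.96, 99: 2.58}

def confidence_interval(mean, sd, n, level=95):
    """Return (lower, upper) limits: mean +/- Z * sd / sqrt(n)."""
    half_width = Z_VALUES[level] * sd / sqrt(n)
    return mean - half_width, mean + half_width

# Sixteen subjects, mean glucose 90 mg/dl, SD 4 mg/dl
ci95 = confidence_interval(90, 4, 16, level=95)
ci99 = confidence_interval(90, 4, 16, level=99)

print([round(x, 2) for x in ci95], [round(x, 2) for x in ci99])
```

Asking for more confidence (99%) gives a wider interval around the same mean.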

Confidence Intervals: Odds and Risk Ratios
- Some studies report odds or risk ratios with CIs
- If range includes 1.0 then exposure/risk factor does not significantly impact disease/outcome
- Example:
  - Risk of lung cancer among chemical workers studied
  - Risk ratio = 1.4 +/- 0.5
  - Confidence interval includes 1.0
  - Chemical work not significantly associated with lung cancer
  - (Formal statement: Null hypothesis not rejected)

Confidence Intervals: Group Comparisons
- Many studies report differences between groups
- Can average differences and calculate CIs
- If includes zero, no statistically significant difference
- Example:
  - Mean difference between two groups is 1.0 +/- 3.0
  - Includes zero
  - No significant difference between groups
  - Similar to p > 0.05
  - (Formal statement: Null hypothesis not rejected)

Confidence Intervals: Group Comparisons
- Some studies report group means with CIs
- If ranges overlap, no statistically significant difference
- Group 1 mean: 10 +/- 5; Group 2 mean: 8 +/- 4
  - Confidence intervals overlap
  - No significant difference between means
  - Similar to p > 0.05 for comparison of means
- Group 1 mean: 10 +/- 5; Group 2 mean: 30 +/- 4
  - Confidence intervals do not overlap
  - Significant difference between means
  - Similar to p < 0.05 for comparison of means
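The three reading rules above (does the interval include 1.0, does it include zero, do two intervals overlap) can be sketched with two small helper functions. Both helpers are hypothetical names written for this illustration, not part of any library.

```python
def includes(center, half_width, value):
    """True if value lies inside center +/- half_width."""
    return center - half_width <= value <= center + half_width

def intervals_overlap(mean1, half1, mean2, half2):
    """True if mean1 +/- half1 and mean2 +/- half2 share any values."""
    return (mean1 - half1) <= (mean2 + half2) and (mean2 - half2) <= (mean1 + half1)

# Risk ratio 1.4 +/- 0.5: includes 1.0, so not significant
rr_not_significant = includes(1.4, 0.5, 1.0)

# Mean difference 1.0 +/- 3.0: includes zero, so not significant
diff_not_significant = includes(1.0, 3.0, 0.0)

# Group means with CIs: 10 +/- 5 vs 8 +/- 4 overlap; 10 +/- 5 vs 30 +/- 4 do not
overlap_a = intervals_overlap(10, 5, 8, 4)
overlap_b = intervals_overlap(10, 5, 30, 4)

print(rr_not_significant, diff_not_significant, overlap_a, overlap_b)
```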

Correlations

Correlation Coefficient: Pearson Coefficient
[Figure: scatter plots of lifespan (y-axis) vs. pack-years of smoking (x-axis)]

Correlation Coefficient: Pearson Coefficient
- Measure of linear correlation between two variables
- Represents strength of association of two variables
- Number from -1 to 1
- Closer to 1 or -1, stronger the relationship
- (-) number means inverse relationship
  - More smoking, less lifespan
- (+) number means positive relationship
  - More smoking, more lifespan
- 0 means no relationship

Correlation Coefficient: Pearson Coefficient
Direction of Relationship
[Figure: three scatter plots: r = -0.5 (negative), r = 0 (no relationship), r = +0.5 (positive)]

Correlation Coefficient: Pearson Coefficient
Strength of Relationship
[Figure: two scatter plots: r = +0.5 and r = +0.9 (stronger relationship)]

Correlation Coefficient: Pearson Coefficient
- Studies will report relationships with a correlation coefficient (CC)
- Example:
  - Study of pneumonia patients
  - WBC on admission evaluated for relationship with length of stay (LOS)
  - r = +0.5
  - Higher WBC -> higher LOS
- Sometimes a p value is also reported
  - p < 0.05 indicates significant correlation
  - p > 0.05 indicates no significant correlation

Coefficient of Determination (r2)
- Sometimes r2 reported instead of r
- Always positive
- Indicates % of variation in y explained by x
[Figure: two scatter plots: r2 = 0.6 (60% of variation in y explained by x) and r2 = 1 (100% of variation in y explained by x)]
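To show how r and r2 relate, here is a minimal sketch that computes the Pearson coefficient from its textbook definition. The pack-year and lifespan values are made-up illustrative data, not figures from the slides, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data for illustration only
pack_years = [0, 10, 20, 30, 40]
lifespan = [85, 80, 78, 70, 66]

r = pearson_r(pack_years, lifespan)  # negative: more smoking, less lifespan
r_squared = r ** 2                   # always positive; share of variation explained

print(round(r, 2), round(r_squared, 2))
```

Squaring r discards the direction of the relationship, which is why r2 is always positive.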

Study Designs

Epidemiology Studies
- Goal: Determine if exposure/risk factor associated with disease
- Many real world examples
  - Hypertension -> stroke
  - Smoking -> lung cancer
  - Exercise -> fewer heart attacks
  - Toxic waste -> leukemia

Types of Studies
- Determine association of exposure/risk with disease
  - Cross-sectional study
  - Case-control study
  - Cohort study (prospective/retrospective)

Cross-sectional Study
- Patients studied based on being part of a group
  - New Yorkers
  - Women
  - Tall people
- Frequency of disease and risk factors identified
  - How many have lung cancer?
  - How many smoke?
- Snapshot in time
  - Patients not followed for months/years

Cross-sectional Study
- Main outcome of this study is prevalence
  - 50% of New Yorkers smoke
  - 25% of New Yorkers have lung cancer
- May have more than one group
  - 50% of men have lung cancer, 25% of women have lung cancer
  - But groups not followed over time (i.e. years)
- Can't determine:
  - How much smoking increases risk of lung cancer (RR)
  - Odds of getting lung cancer in smokers vs. non-smokers (OR)

Cross-sectional Study
- New Yorkers were surveyed to determine whether they smoke and whether they have morning cough. The study found a smoking prevalence of 50%. Among responders, 25% reported morning cough.
- Note the absence of a time period
  - Patients not followed for 1-year, etc.
- Likely questions:
  - Type of study? (cross-sectional)
  - What can be determined? (prevalence of disease)

Cross-sectional Study
- Using a national US database, rates of lung cancer were determined among New Yorkers, Texans, and Californians. Lung cancer prevalence was 25% in New York, 30% in Texas, and 20% in California. The researchers concluded that living in Texas is associated with higher rates of lung cancer.
- Key points:
  - Presence of different groups could make you think of other study types
  - However, note lack of time frame
  - Study is just a fancy description of disease prevalence

Cross-sectional Study
- Researchers discover a gene that they believe leads to development of diabetes. A sample of 1000 patients is randomly selected. All patients are screened for the gene. Presence or absence of diabetes is determined from a patient questionnaire. It is determined that the gene is strongly associated with diabetes.
- Key points:
  - Note lack of time frame
  - Patients not selected by disease or exposure (random)
  - Just a snapshot in time

Case Series
- Purely descriptive study (similar to cross-sectional)
- Often used in new diseases with unclear cause
- Multiple cases of a condition combined/analyzed
  - Patient demographics (age, gender)
  - Symptoms
- Done to look for clues about etiology/course
- No control group

Cohort Study
- Compares group with exposure to group without
- Did exposure change likelihood of disease?
- Prospective
  - Monitor groups over time
- Retrospective
  - Look back in time at groups

Cohort Study
[Figure: a cohort is split into exposed (smokers) and unexposed (non-smokers); each group is followed to disease (cancer) or no disease]

Cohort Study
- Main outcome measure is relative risk (RR)
  - How much does exposure increase risk of disease
- Patients identified by risk factor (i.e. smoking or non)
  - Different from case-control (by disease)
- Example results
  - 50% of smokers get lung cancer within 5 years
  - 10% of non-smokers get lung cancer within 5 years
  - RR = 50/10 = 5
  - Smokers 5 times more likely to get lung cancer

Cohort Study
- A group of 100 New Yorkers who smoke were identified based on a screening questionnaire at a local hospital. These patients were compared to another group that reported no smoking. Both groups received follow-up surveys asking about development of lung cancer annually for the next 3 years. The prevalence of lung cancer was 25% among smokers and 5% among non-smokers.
- Likely questions:
  - Type of study? (prospective cohort)
  - What can be determined? (relative risk)

Cohort Study
- A group of 100 New Yorkers who smoke were identified based on a screening questionnaire at a local hospital. These patients were compared to another group that reported no smoking. Hospital records were analyzed going back 5 years for all patients. The prevalence of lung cancer was 25% among smokers and 5% among non-smokers.
- Likely questions:
  - Type of study? (retrospective cohort)
  - What can be determined? (relative risk)

Cohort Study
- Problem: Does not work with rare diseases
- Imagine:
  - 100 smokers, 100 non-smokers
  - Followed over 1 year
  - Zero cases of lung cancer in both groups
- In rare diseases need LOTS of patients for LONG time
- Easier to find cases of lung cancer first, then compare to cases without lung cancer

Case-control study
- Compares group with disease to group without
- Looks for exposure or risk factors
- Opposite of cohort study
- Better for rare diseases

Case-Control Study
[Figure: patients with disease (cases) and without disease (controls) are each divided into exposed and unexposed; rates of exposure are compared]

Case-control study
- Main outcome measure is odds ratio
  - Odds of disease exposed/odds of disease unexposed
- Patients identified by disease or no disease

Case-control study
- A group of 100 New Yorkers with lung cancer were identified based on a screening questionnaire at a local hospital. These patients were compared to another group that reported no lung cancer. Both groups were questioned about smoking within the past 10 years. The prevalence of smoking was 25% among lung cancer patients and 5% among non-lung cancer patients.
- Likely questions:
  - Type of study? (case-control)
  - What can be determined? (odds ratio)

Matching
- Selection of control group (matching) key to getting good study results
- Want patients as close to disease patients as possible (except for disease)
- Matching reduces confounding
- Want all potential confounders balanced between cases and controls

Randomized Trials
- Don't confuse with case-control
- Patients identified by disease like case-control
- Exposure determined randomly

Case-control vs. Cohort

                 Patients identified by   Main outcome
Case-control     Disease                  Odds ratio
Cohort           Exposure                 Relative risk

How to Identify Study Types?
- #1: How were patients identified?
  - Cross-sectional: By location/group (i.e. New Yorkers)
  - Cohort: By exposure/risk factors (i.e. smokers)
  - Case-control: By disease (i.e. lung cancer)

How to Identify Study Types?
- #2: Time period of the study
  - Cross-sectional: No time period (i.e. snapshot)
  - Retrospective: Look backward for disease/exposure
  - Prospective: Follow forward in time for disease/exposure

How to Identify Study Types?
- #3: What numbers are determined from study?
  - Cross-sectional: Prevalence of disease (possibly by group)
  - Cohort: Relative risk (RR)
  - Case-control: Odds ratio (OR)

Risk Quantification

Why Risk is Important
- Understanding of disease causes comes from estimating risk
  - Smoking increases risk of lung cancer
  - Exercise decreases risk of heart attacks
- We know these things from quantifying risk
  - Smoking increases risk of lung cancer X percent
  - Exercise decreases risk of heart attacks Y percent

Data for Risk Estimation
- Obtained by studying:
  - Presence/absence of risk factor/exposure
  - In people with and without disease
- Cohort study
- Case-control study

The 2 x 2 Table

               Disease +   Disease -
Exposure +     A           B
Exposure -     C           D

Uses of the 2x2 Table
- Can calculate many things:
  - Risk of disease
  - Risk ratio
  - Odds ratio
  - Attributable risk
  - Number needed to harm

Risk of Disease
- Risk in exposed group = A/(A + B)
- Risk in unexposed group = C/(C + D)

Risk Ratio
- Risk of disease with exposure vs. non-exposure
- Usually from cohort study
- $\text{RR} = \dfrac{A/(A + B)}{C/(C + D)}$, using the 2 x 2 table (A, B: exposed with and without disease; C, D: unexposed with and without disease)

Risk Ratio
- Ranges from zero to infinity
  - RR = 1: No increased risk from exposure
  - RR > 1: Exposure increases risk
  - RR < 1: Exposure decreases risk

Risk Ratio
- Example #1:
  - 10% of smokers get lung cancer
  - 10% of nonsmokers get lung cancer
  - RR = 1
- Example #2:
  - 50% of smokers get lung cancer
  - 10% of nonsmokers get lung cancer
  - RR = 5
- Example #3:
  - 10% of smokers get lung cancer
  - 50% of nonsmokers get lung cancer
  - RR = 0.2
  - Smoking protective!

Risk Ratio
- RR = 5
- Smokers 5x more likely to get lung cancer than nonsmokers

Risk Ratio
- A group of 1000 college students is evaluated over ten years. Two hundred are smokers and 800 are nonsmokers. Over the 10 year study period, 50 smokers get lung cancer compared with 10 non-smokers.
- Fill in the 2 x 2 table (exposure by disease) and apply $\text{RR} = \dfrac{A/(A + B)}{C/(C + D)}$
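The risk ratio formula can be applied directly to 2 x 2 counts. The sketch below uses a hypothetical helper `risk_ratio` and works through the college student example; the cell counts B = 150 and D = 790 are derived here from the totals given in the vignette, and the resulting RR is computed rather than quoted from the slide.

```python
def risk_ratio(a, b, c, d):
    """RR = [A/(A+B)] / [C/(C+D)] for a 2 x 2 table.

    a: exposed with disease      b: exposed without disease
    c: unexposed with disease    d: unexposed without disease
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

# College students: 200 smokers (50 with cancer), 800 non-smokers (10 with cancer)
rr_students = risk_ratio(a=50, b=150, c=10, d=790)

# Example #2 expressed per 100 people: 50% of smokers vs 10% of nonsmokers
rr_example2 = risk_ratio(a=50, b=50, c=10, d=90)

print(round(rr_students, 1), round(rr_example2, 1))
```

In the student cohort, the risk is 25% in smokers and 1.25% in non-smokers, which is where the large ratio comes from.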

Odds Ratio
- Usually from case-control study
- Odds of exposure-disease/odds of exposure-no-disease
- $\text{OR} = \dfrac{A/C}{B/D} = \dfrac{A \cdot D}{B \cdot C}$, using the 2 x 2 table (A, B: exposed with and without disease; C, D: unexposed with and without disease)

Odds Ratio
- Ranges from zero to infinity
  - OR = 1: Exposure equal among disease/no-disease
  - OR > 1: Exposure increased among disease/no-disease
  - OR < 1: Exposure decreased among disease/no-disease

Odds Ratio
- Example #1:
  - 10x lung cancer patients smoke vs. non-smokers
  - 10x non-lung cancer patients smoke vs. non-smokers
  - OR = 1
- Example #2:
  - 50x lung cancer patients smoke vs. non-smokers
  - 10x non-lung cancer patients smoke vs. non-smokers
  - OR = 5
- Example #3:
  - 10x lung cancer patients smoke vs. non-smokers
  - 50x non-lung cancer patients smoke vs. non-smokers
  - OR = 0.2

Risk vs. Odds Ratio
- Risk ratio is the preferred metric
  - Easy to understand
  - Tells you how much exposure increases risk
- Why not calculate it in all studies?
  - Not valid in case-control studies
  - RR is different depending on the number of cases you choose

Risk vs. Odds Ratio
- Suppose we find 100 cases and 200 controls

               Lung Cancer +   Lung Cancer -   Total
Smoking +      50              50              100
Smoking -      50              150             200

- RR = (50/100)/(50/200) = 2.0
- OR = (50/50)/(50/150) = 3.0

Risk vs. Odds Ratio
- Now suppose we find 200 cases and 200 controls

               Lung Cancer +   Lung Cancer -   Total
Smoking +      100             50              150
Smoking -      100             150             250

- RR = (100/150)/(100/250) = 1.67
- OR = (100/100)/(50/150) = 3.0
- OR does not change with case number

Risk vs. Odds Ratio
- Risk ratio is dependent on number of cases/controls
- Invalid to use risk ratio in case-control
- Must use odds ratio instead

Rare Disease Assumption
- $\text{OR} = \dfrac{A/C}{B/D} = \dfrac{A \cdot D}{B \cdot C}$
- $\text{RR} = \dfrac{A/(A + B)}{C/(C + D)} \approx \dfrac{A/B}{C/D} = \dfrac{A \cdot D}{B \cdot C}$ when $B \gg A$ and $D \gg C$
- So OR ≈ RR when B >> A and D >> C
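The two tables above can be checked numerically. This minimal sketch uses hypothetical helper functions to show that doubling the number of cases changes the risk ratio but not the odds ratio, and that the two measures nearly agree when disease is rare. The rare-disease counts in the last example are invented for illustration.

```python
def risk_ratio(a, b, c, d):
    """RR = [A/(A+B)] / [C/(C+D)]"""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR = (A*D) / (B*C)"""
    return (a * d) / (b * c)

# 100 cases, 200 controls: smokers 50/50, non-smokers 50/150
rr_100_cases = risk_ratio(50, 50, 50, 150)
or_100_cases = odds_ratio(50, 50, 50, 150)

# 200 cases, 200 controls: smokers 100/50, non-smokers 100/150
rr_200_cases = risk_ratio(100, 50, 100, 150)
or_200_cases = odds_ratio(100, 50, 100, 150)

# Hypothetical rare disease: few diseased in either exposure group
rr_rare = risk_ratio(10, 990, 5, 995)
or_rare = odds_ratio(10, 990, 5, 995)

print(round(rr_100_cases, 2), round(rr_200_cases, 2), or_100_cases, or_200_cases)
print(round(rr_rare, 2), round(or_rare, 2))
```

The risk ratio moves with the investigator's choice of how many cases to enroll, which is why it is not valid in a case-control design.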

Rare Disease Assumption
- OR ≈ RR
- Most exposed/unexposed have no disease (-)
- Few disease (+) among exposed/unexposed
- Allows use of a case-control study to determine RR
- Commonly accepted number is prevalence < 10%

Rare Disease Assumption
- Case-control studies easy/cheap
- But odds ratio is weak association
- Classic question:
  - Description of case-control study
  - RR reported
  - Is this valid?
  - Answer: Only if disease is rare

Attributable Risk
- Suppose 1% chance of lung cancer in non-smokers
- Suppose 21% chance in smokers
- Attributable risk = 20%
- Added risk due to exposure to smoking

Attributable Risk
- $\text{AR} = \dfrac{A}{A + B} - \dfrac{C}{C + D}$, using the 2 x 2 table (A, B: exposed with and without disease; C, D: unexposed with and without disease)

Attributable Risk Percentage
- (risk exposed - risk unexposed)/risk exposed
- Represents % of disease explained by risk factor
- Suppose ARP for smoking and lung cancer = 80%
  - Indicates 80% of lung cancers explained by smoking
- Can be calculated directly from RR
- $\text{ARP} = \dfrac{\text{RR} - 1}{\text{RR}}$

Number Needed to Harm
- Number of patients on average needed to be exposed for one episode of disease on average to occur
- Example: Average number of people who need to smoke for one case of lung cancer to develop
- If attributable risk to smoking is 20%, then NNH is 1/0.2 = 5
- $\text{NNH} = \dfrac{1}{\text{AR}}$
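The three formulas can be tied together with the smoking numbers from the attributable risk slide (21% risk in smokers, 1% in non-smokers). This is a minimal sketch with illustrative variable names; the ARP value for these particular numbers is computed here and does not appear on the slides.

```python
risk_exposed = 0.21    # chance of lung cancer in smokers
risk_unexposed = 0.01  # chance of lung cancer in non-smokers

# Attributable risk: added risk due to the exposure
attributable_risk = risk_exposed - risk_unexposed

# Number needed to harm: 1 / AR
nnh = 1 / attributable_risk

# Attributable risk percentage, computed two equivalent ways
rr = risk_exposed / risk_unexposed
arp_from_rr = (rr - 1) / rr
arp_from_risks = (risk_exposed - risk_unexposed) / risk_exposed

print(round(attributable_risk, 2), round(nnh, 1), round(arp_from_rr, 3))
```

The two ARP expressions agree because dividing (risk exposed - risk unexposed) by risk exposed is the same as (RR - 1)/RR.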

Sensitivity and Specificity

Incidence and Prevalence
- Suppose 1,000 new cases of diabetes per year
  - This is the incidence of diabetes
- Suppose 100,000 cases of diabetes at one point in time
  - This is the prevalence of diabetes for the population

Incidence and Prevalence
- Incidence rate = new cases / population at risk
  - Determined for a period of time (e.g. one year)
  - Population at risk = total pop - people with disease
- Example:
  - 40,000 people
  - 10,000 with disease
  - 1,000 new cases per year
  - Incidence rate = 1,000 / (40k - 10k) = 1,000 cases/30,000

Incidence and Prevalence
- For chronic diseases
  - Prevalence > incidence
- For rapidly fatal diseases
  - Incidence ≈ prevalence
- New primary prevention programs
  - Both in
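The incidence rate example can be sketched in a few lines. The variable names are illustrative, and the prevalence figure is computed here from the same population numbers rather than quoted from the slide.

```python
total_population = 40_000
existing_cases = 10_000     # people who already have the disease
new_cases_per_year = 1_000

# People who already have the disease cannot become new cases
population_at_risk = total_population - existing_cases

incidence_rate = new_cases_per_year / population_at_risk  # per person per year
prevalence = existing_cases / total_population            # at one point in time

print(round(incidence_rate * 1000, 1), "new cases per 1,000 at risk per year")
print(round(prevalence * 100, 1), "% prevalence")
```

Using the population at risk (30,000) rather than the total population (40,000) in the denominator is the step most often missed.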
