Numbers Rule Your World - Kaiser Fung

Matheus Puppe


  • The book looks at positive examples of how people are using statistics and data effectively, rather than focusing on common misuses and lies with numbers.

  • It profiles 10 case studies of individuals and organizations that are applying statistical thinking and analysis successfully in areas like transportation, public health, risk assessment, testing, data mining, lotteries, queues/waiting times, credit scoring, and air travel.

  • The statistical way of thinking focuses on deviations from averages, variability, patterns of correlation even without known causes, and noticing hidden nuances between groups that average statistics may obscure.

  • The stories are organized around 5 principles: 1) concern with variability not just averages, 2) acceptance of “wrong” models based on correlation, 3) awareness of differences between groups, 4) asymmetry in impacts and responses, 5) reckoning with improbable events.

  • The goal is to highlight how statistics are being adapted, refined and applied scientifically in these fields to better understand systems and improve outcomes, rather than to focus on new inventions.

In summary, the introduction presents the book’s positive perspective on statistics and sets up the 10 case studies that explore the statistical way of thinking in practice.

The insurance industry adjusts prices to reflect the difference in the amount of exposure to hurricanes between coastal and inland properties. Coastal properties face higher risks from hurricanes and thus are charged higher premiums.

When designers of standardized tests attempt to eliminate the gap in performance between black and white students, they try to make the tests culturally neutral and applicable to all groups. However, some argue that standardized tests can still reflect cultural biases that disadvantage some groups.

The passages then go on to discuss several principles of statistical thinking based on examples from different fields like insurance, testing, transportation, and more. They cover ideas like balancing different types of errors in decision making, using statistical testing to decide if evidence fits a scenario or if alternative explanations are needed, and following specific protocols rather than assuming conclusions. The chapters aim to show how these statistical principles can be applied to make better decisions.

  • The “average bear” that Yogi Bear can outsmart is Boo Boo Bear.

  • The “average” conference call would be a routine, regularly scheduled call between work colleagues.

  • The “average” day would be a typical day without anything unusually good or bad happening.

The passage discusses how averages simplify reality by reducing diversity and variations. It warns against over-relying on averages and overlooking individual differences. Statisticians study the nature of variability - how much things change and what causes variations. Computing averages was meant to measure diversity, not be the end goal. The passage then uses examples of theme park wait times to illustrate how variability, not design flaws, cause queues even when capacity meets average demand. Computer simulations that model thousands of scenarios are needed to account for uneven patron arrivals and movement throughout the day.
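The point that variability alone can create queues can be illustrated with a small simulation. This is a minimal sketch, not the kind of simulation a theme park would actually run: the single ride, its capacity of 20 guests per minute, the 600-minute day, and the coin-flip arrival pattern are all hypothetical choices made for illustration.

```python
import random

def simulate_queue(arrivals_per_minute, capacity_per_minute):
    """Return the average queue length over the day for a single ride."""
    queue = 0
    total = 0
    for arrivals in arrivals_per_minute:
        # Guests beyond this minute's capacity carry over to the next minute.
        queue = max(0, queue + arrivals - capacity_per_minute)
        total += queue
    return total / len(arrivals_per_minute)

def uneven_arrivals(minutes, mean, rng):
    """Hypothetical arrival pattern: same average, but random minute to minute."""
    # Each minute, 2 * mean potential guests each show up with probability 0.5,
    # so the expected count is `mean` while the actual count fluctuates.
    return [sum(rng.random() < 0.5 for _ in range(2 * mean)) for _ in range(minutes)]

rng = random.Random(42)
minutes, capacity = 600, 20  # assumed values, not taken from the book

# Perfectly even arrivals that exactly match capacity never form a queue.
steady_queue = simulate_queue([capacity] * minutes, capacity)

# Uneven arrivals with the same average, repeated over many simulated days.
bumpy_queues = [
    simulate_queue(uneven_arrivals(minutes, capacity, rng), capacity)
    for _ in range(50)
]
avg_bumpy_queue = sum(bumpy_queues) / len(bumpy_queues)

print(f"average queue, even arrivals:   {steady_queue:.1f}")
print(f"average queue, uneven arrivals: {avg_bumpy_queue:.1f}")
```

In both cases average demand equals capacity. Only the run with fluctuating arrivals builds up a line, because a slow minute cannot "bank" unused capacity for a busy one.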

The passage discusses how variable commute times are more frustrating for drivers than average commute lengths. Factors like accidents, weather, and poorly designed infrastructure contribute to unpredictable delays on roads.

Ramp metering is presented as an effective traffic management technique used in places like Minnesota and Seattle. It regulates the flow of vehicles entering highways using stoplights during congested periods. This helps maintain highway speeds near an optimal range to maximize throughput.

Ramp metering reduces variable commute times in two ways. First, it regulates speeds which leads to more reliable journeys. Second, spacing out entering vehicles through metering lowers accident rates, reducing unexpected delays. Fewer accidents means less overall congestion.

Studies show ramp metering can cut travel times in half during peak periods while increasing traffic volumes - referred to as the “freeway congestion paradox.” This challenges the notion that more vehicles always mean slower speeds. Properly throttling inflows helps keep highways operating near full capacity for longer stretches of time.

In summary, the passage discusses how variable commute times frustrate drivers more than average lengths. It presents ramp metering as an effective traffic management strategy that helps reduce unpredictability through regulating highway speeds and spacing out vehicles.

  • Ramp meters work by controlling the flow of traffic entering freeways at on-ramps. They space out vehicles to reduce traffic fluctuations and ensure consistent freeway speeds.

  • Minnesota pioneered ramp metering in the US. It now has over 400 metered ramps, one of the largest systems in the country. Experts view Minnesota’s system as a national model.

  • Both freeway managers and Disney face the challenge of congestion. While capacity expansion helps, it is limited and unreliable due to fluctuating demand. Optimizing existing capacity is a cheaper, faster alternative.

  • Disney pioneered perception management to shape people’s experience of waiting. Tactics like theming lines and posting overestimated wait times make lines feel shorter. The FastPass system also changes perceptions, allowing people to avoid physically waiting in lines.

  • In reality, FastPass does not reduce total waiting time. It just allows the waiting to happen away from the line, while guests do other activities. But the changed perception greatly enhances customer satisfaction.

  • Ramp meters and FastPass both work by controlling flows and spacing out arrivals to reduce fluctuations and ensure consistent throughput. Both optimize existing capacity to manage congestion.

  • Senator Day argued that ramp metering (traffic lights on highway on-ramps) was not helping traffic congestion in the Twin Cities and was a symbol of government overreach. He tapped into frustration from drivers about waiting at ramp meters.

  • The state legislature mandated an experiment where all ramp meters were turned off for 6 weeks. Engineers predicted much worse traffic, while Senator Day predicted better traffic.

  • During the experiment, both objective data (traffic speeds, volumes, crashes) and subjective public opinions showed that traffic actually got worse without ramp meters. Travel times increased 22% and crashes during merging increased 26%.

  • However, many individual drivers still perceived that their commute was better without waiting at ramp meters, even if data showed overall traffic flow was reduced. This showed a blind spot in how engineers accounted for public perception.

  • The study concluded ramp metering reduced overall congestion and travel times, but steps were taken to limit individual wait times at meters based on feedback that drivers disliked any forced waiting, even if it benefited traffic flow overall. Engineers learned to better account for subjective public perceptions in their policies.

This paragraph discusses the continued efforts by public health officials in Wisconsin and at the federal level to prevent more citizens from falling ill due to an outbreak of E. coli illness. It mentions that in early September 2006, health officials in Wisconsin and Oregon detected clusters of E. coli cases with matching DNA fingerprints, suggesting a common source. The CDC was notified and began investigating by comparing the fingerprints to those in their national database to try and identify the source. The goal was to support the agencies and protect public health.

  • An outbreak of E. coli O157:H7 was reported across multiple states. The DNA profile matched cases from nine other states.

  • Epidemiologists from the CDC pulled the trigger and launched an investigation across states. Interviews with patients failed to uncover the source initially.

  • In Wisconsin, one victim’s husband said she did not eat red meat or drink, but liked green salads, puzzling investigators who usually link E. coli to beef.

  • Oregon conducted extensive interviews with 450 questions each to cast a wide net. Four of the five people interviewed mentioned bagged spinach, suggesting it as the source.

  • Wisconsin interviews also implicated spinach. New Mexico separately tested bags of spinach.

  • Scientists from different states connected and concluded bagged spinach was the likely cause of the multistate outbreak based on matching DNA profiles and epidemiological evidence.

  • The case count rose across more states. On the 7th day of testing leftover spinach, E. coli was found matching the outbreak strain, confirming bagged spinach as the source.

  • Past outbreaks of E. coli linked to bagged spinach farms in California narrowed the investigation to nine ranches. Initial environmental samples from these ranches tested negative.

  • A lot code from leftover spinach packaging (“P227A”) led investigators to four specific fields in California. Samples from one field tested positive for the outbreak strain in river water and animal feces.

  • As new cases ceased and the bagged spinach theory gained traction, investigators were finally able to declare victory after a 39-day investigation involving tight teamwork between state and federal agencies.

  • However, the recall’s impact was unclear as the outbreak may have dissipated on its own due to the perishable nature of spinach. The alternative scenario without a recall could not be known.

  • The recall destroyed the $300M bagged spinach industry for 6 months and impacted other salad sales. Small farms were also affected despite not being implicated.

  • Later evidence found four unrelated E. coli cases in Wisconsin, suggesting overreaction. Epidemiology involves complex statistical analysis and multiple corroborating lines of evidence are needed to determine causation.

  • Case-control studies were invented in the 1950s to prove smoking causes lung cancer. Sir Bradford Hill established nine viewpoints on determining causation that are still used today.

  • The 2006 spinach E. coli outbreak investigation used a case-control study design. It satisfied six of Hill’s nine viewpoints for determining causation between eating spinach and the outbreak.

  • Statisticians play a vital role in outbreak investigations despite facing challenges like minimal data, urgency, incomplete information, and consequence of mistakes. Their work has significantly contributed to outbreak control success.

  • The US credit reporting system enables things like instant credit approval, which would not be possible otherwise. Credit scores summarize credit reports and are heavily relied on by lenders and other organizations to assess risk.

  • Before credit scoring, lending decisions involved interviewing applicants and applying experience-based rules of thumb to judge creditworthiness. FICO introduced standardized credit scoring models that transformed lending by objectively assessing risk based on past payment history data.

  • In the 1960s, Bill Fair and Earl Isaac developed the FICO credit score, which uses statistical modeling to predict the likelihood a borrower will default on a loan in the next two years. Higher scores indicate a lower risk of default.

  • Credit scoring algorithms use large datasets to develop “rules” linking various borrower characteristics (like years employed, income, credit history, etc.) to creditworthiness ratings. Computers can consider many more factors and combinations than humans.

  • The FICO model focuses on a borrower’s past loan repayment, current debt levels, credit history length, credit applications, and types of existing credit.

  • Credit scoring streamlined and automated loan underwriting, allowing much faster approvals. This expanded access to credit and fueled growth in consumer spending and the US economy.

  • Statistical modeling proved more accurate than subjective human judgment. It allowed customized assessments beyond crude rules like “don’t lend to painters.” Over time, scoring identified new combinations of characteristics correlated with default risk.

  • Widespread credit scoring adoption in the 1980s-1990s transformed the consumer lending industry by slashing costs and boosting throughput while maintaining or reducing default rates. This expanded access to credit across socioeconomic groups.

  • Credit scoring technology has allowed lenders to efficiently evaluate and manage risk, expanding access to credit. Those who adopted scoring (“Haves”) could cherry-pick good risks, while “Have-Nots” saw deteriorating performance and eventually also adopted scoring.

  • However, consumer advocacy groups argue credit scoring is flawed and perpetuates economic inequalities. They want more regulation and transparency around scoring models and credit report data.

  • Critics argue scoring models don’t prove causal relationships and rely on inaccurate/incomplete credit report data. Supporters counter that correlation is sufficient and unavoidable errors are mitigated through model design.

  • Fair Credit Reporting Act amendments aimed to increase consumer rights like accessing/repairing scores, but this risks damaging the instant credit system through credit repair scams and distorted scores. Overall there is debate around balancing risks and expanding access through scoring versus regulating potential harms and inequities.

This passage discusses the challenges and tensions around group risk pooling in the insurance industry. It uses the example of J. Patrick Rooney, a prominent Republican businessman who ran a large individual health insurance company called Golden Rule.

While Rooney was politically conservative, he unexpectedly fought for civil rights by suing testing company ETS over unfair licensing exams that disqualified black applicants at disproportionate rates. The resulting “Golden Rule settlement” established a method to identify unfair test questions where white and black test-takers performed significantly differently. However, statisticians were unhappy with this approach, as it could undermine the validity of standardized testing.

More broadly, the insurance industry faces a dilemma around how to pool risk across diverse policyholders in a fair and equitable way. Grouping people together based on broad characteristics inevitably leads to some cross-subsidization, where lower-risk individuals effectively subsidize higher-risk ones. This poses challenges for balancing access, affordability, and the accurate assessment of individual risk. The passage suggests there are no easy answers in balancing these competing objectives around group risk pooling.

  • In 1975, Illinois launched a new licensing exam for insurance agents developed by ETS. The passing rate was only 31%, much lower than the previous exam.

  • One of Rooney’s managers was concerned about the lack of Black insurance agents in Chicago, a key market. Rooney seized on this issue to sue ETS and Illinois, arguing the new exam was effectively excluding Blacks.

  • ETS twice revised the exam, raising the overall pass rate but not closing the gap between Black and white pass rates.

  • In 1984, Rooney and ETS settled, requiring ETS to conduct scientific analysis to ensure fair testing and prevent unintended discrimination.

  • The lawsuit and settlement stimulated significant rethinking and research on fair testing at ETS, which administered many admissions exams. This helped address issues of ensuring tests did not favor some groups over others due to factors unrelated to ability.

  • While Rooney had clear commercial motivations as an insurance executive, his advocacy also helped promote fairness in standardized testing more broadly beyond just the insurance industry exam.

  • Predicting test item difficulty is challenging, as there are many factors to consider beyond just content. Identifying unfair items that put minorities at a disadvantage is even more difficult.

  • Constructing standardized tests like the SAT is a massive undertaking involving many statisticians carefully selecting and arranging test items over 18+ months based on extensive analysis and reviews to avoid unintended bias.

  • Issues like the racial scoring gap on tests have long been observed but interpreting the causes is complex, with debate around differences in ability, unfair test construction, or both.

  • A 1976 lawsuit filed by Patrick Rooney alleged unfair tests underestimated black students’ true ability. This led to the 1984 Golden Rule settlement formalizing fairness reviews and consideration of racial impact, pioneering new scientific techniques like DIF analysis.

  • However, the Golden Rule thresholds also produced many “false alarms” by flagging items that appeared biased but in which developers could not identify any actual unfairness. Identifying unfairness remained challenging without fully explaining why differences occurred.

The key insight from ETS statisticians was to compare test performance between groups that have similar ability levels, rather than directly comparing overall group performance. This approach, called DIF (differential item functioning) analysis, helped untangle whether score differences were due to unfair test questions or underlying differences in ability levels between groups.

DIF analysis works by matching students of similar ability across racial/gender groups, then comparing their performance on individual test questions. If students of similar ability perform differently based on their group, that suggests the test question may be unfairly favoring one group.
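The matching idea can be sketched in a few lines. This is a simplified stand-in for illustration, not ETS’s actual procedure: the function names, the 10-point score bands used for matching, the group labels, and all of the data are hypothetical.

```python
from collections import defaultdict

def raw_gap(records):
    """Unmatched gap in proportion correct on one item (reference minus focal)."""
    rates = {}
    for group in ("reference", "focal"):
        answers = [correct for g, _, correct in records if g == group]
        rates[group] = sum(answers) / len(answers)
    return rates["reference"] - rates["focal"]

def dif_gap(records, band_width=10):
    """Average gap in proportion correct among test-takers matched on total score.

    records: iterable of (group, total_score, answered_correctly).
    """
    bands = defaultdict(lambda: {"reference": [0, 0], "focal": [0, 0]})
    for group, total_score, correct in records:
        cell = bands[total_score // band_width][group]
        cell[0] += int(correct)  # number answering the item correctly
        cell[1] += 1             # number of test-takers in this cell

    weighted_gap, weight = 0.0, 0
    for cell in bands.values():
        (ref_right, ref_n), (foc_right, foc_n) = cell["reference"], cell["focal"]
        if ref_n == 0 or foc_n == 0:
            continue  # no matched comparison is possible in this band
        n = ref_n + foc_n
        weighted_gap += n * (ref_right / ref_n - foc_right / foc_n)
        weight += n
    return weighted_gap / weight if weight else 0.0

def make(group, score, n_right, n_total):
    """Build hypothetical records: n_right of n_total answered correctly."""
    return [(group, score, i < n_right) for i in range(n_total)]

# Item A: the groups differ overall, but matched test-takers perform identically.
item_a = (make("reference", 80, 6, 8) + make("focal", 80, 3, 4)
          + make("reference", 40, 1, 4) + make("focal", 40, 2, 8))

# Item B: even test-takers with similar total scores perform differently.
item_b = (make("reference", 80, 6, 8) + make("focal", 80, 1, 4)
          + make("reference", 40, 2, 4) + make("focal", 40, 2, 8))

print(f"item A: raw gap {raw_gap(item_a):.3f}, matched gap {dif_gap(item_a):.3f}")
print(f"item B: raw gap {raw_gap(item_b):.3f}, matched gap {dif_gap(item_b):.3f}")
```

Item A shows a raw gap only because the two groups have different mixes of total scores, so the gap disappears once test-takers are matched. Item B keeps a gap after matching, which is the pattern that would mark a question for review.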

Two ETS researchers, Curley and Schmitt, tested variations of questions that previously showed DIF to understand why. Their research, using real SAT data, found modifying questions could reduce or eliminate DIF in some cases. For example, substituting a less ambiguous vocabulary word reduced DIF between racial groups on one question. Changing a question’s context from military to economic reduced DIF between girls and boys.

This work helped validate that test questions could unintentionally disadvantage groups, and highlighted the challenge of identifying unfairness without real test data and statistical analysis like DIF. It also showed the potential to revise questions to make them fairer to all groups. DIF analysis remains a key method used by test developers to screen for and address unfair test questions.

  • Bill Poe was a prominent insurance entrepreneur in Florida who founded the largest insurance brokerage there and later started his own insurance underwriting company, Poe Financial.

  • Poe Financial grew rapidly by taking on home insurance policies from the state government after hurricanes, slashing rates. It became Florida’s largest private property insurer.

  • However, the extremely active 2004 and 2005 hurricane seasons, with 8 hurricanes hitting Florida, caused unprecedented losses that wiped out Poe Financial’s profits and surpluses. The company became insolvent despite following regulations.

  • The huge losses from those seasons also caused national insurers to raise rates significantly or drop policies altogether. Over 500,000 policies were terminated as insurers said hurricane risk in Florida had become uninsurable. This created a crisis in the state’s property insurance market.

  • For insurance markets to function properly, people must see hazards as insurable and willingly pay premiums. But the back-to-back hurricane disasters made insurers believe the risks were no longer acceptable, causing the market breakdown.

  • Insurance actuaries set rates to cover expected claims on average and in any given year. However, if payouts drain the cash reserves, the insurer cannot cover claims and goes bust.

  • Risk pools need similar risk levels across members. Small differences are addressed by charging higher rates for riskier drivers. Large differences drive away safe drivers who feel they pay too much, and attract risky drivers hoping to profit.

  • Hurricane insurers underestimated risks in Florida for years. The 2004-2005 seasons caused $36 billion in losses, far above the $6-8 billion expected. This exposed flawed risk projections and pricing.

  • Insurers relied too much on the concept of a “100-year storm” without recognizing multiple severe storms could occur in succession. This gave a false sense of security.

  • Natural disaster insurers cannot diversify risk geographically like auto insurers. The 2004-2005 seasons caused too many concurrent claims, bankrupting some insurers with concentrated Florida risks.

  • Risk pools need some members not claiming in disaster seasons, or all claims together overwhelm insurers. But revealing subsidies caused low-risk inland residents and foreigners to leave risk pools. This further concentrated remaining risks.

  • Steroid and drug testing in sports has become more common, but baseball’s testing still lags behind international standards due to player resistance. Players worry about false positives ruining their careers.

  • This focus on false positives has unintentionally helped drug cheats, as it allows the discussion to center on false accusations rather than cleaning up the sport.

  • Separately, US troops in Iraq and Afghanistan faced challenges screening local job applicants to identify any insurgent ties. Interrogators relied on experience and intuition but made mistakes, putting soldiers at risk. Over time, newer recruits had less experience for this type of work.

  • The baseball and military examples both involve trying to separate one group from another (clean athletes from dopers, safe applicants from insurgents) through some form of testing or screening. But imperfections in the methods allowed some to evade detection, compromising safety or fairness. The focus on avoiding false positives unintentionally enabled cheating or security risks. Experience was important for the difficult work of distinguishing between groups.

  • Portable lie detectors are handheld computers that measure skin conductivity and pulse rate when asking subjects yes/no questions. They provide a verdict of green (truthful), red (deceptive), or yellow (inconclusive) within minutes by processing the physiological data.

  • Unlike polygraphs, they remove the human element but come with risks of false positives and negatives. Army leaders instructed researchers to optimize the devices to minimize false negatives (insurgents being cleared), which unintentionally impaired their ability to detect potential suspects.

  • Statisticians note all detection systems involve a tradeoff between false positives and negatives. Focusing on one type of error tends to neglect and increase the other type of error.

  • Mike Lowell provided a rational explanation for why players’ unions are wary of steroid testing - even a 1% error rate could mean false positives that destroy players’ careers. He argued testing must be 100% accurate to avoid this risk.

  • False positives ruin careers, as cyclist Tyler Hamilton discovered when he tested positive for blood doping in 2004 despite claiming innocence. Defending against such accusations typically involves claims of laboratory errors or medical anomalies.

  • Tyler Hamilton claimed he had never tested positive before, but had his appeal rejected and received a two-year ban for doping. If he was truly clean, this false positive would validate concerns about careers being ruined.

  • However, truly innocent and guilty athletes often pursue the same defense strategies when testing positive - hiring lawyers, challenging procedures, etc. It is difficult to distinguish true and false positives once they are “commingled”.

  • Athletes like Mike Lowell demand 100% accurate tests with no false positives, but statisticians say even perfect tests would have false negatives. False negatives, not false positives, are the real issue with doping detection.

  • Both Bjarne Riis and Marion Jones passed all drug tests during their careers but later admitted to doping. Their stories show that testing negative means very little, as the vast majority of dopers likely receive false negatives. False negatives, not positives, allow most dopers to escape detection.

  • False negatives in drug testing, where dopers test clean, have largely been ignored compared to concerns over false positives. Athletes caught doping often claim every negative test proves their innocence.

  • False negatives have victims besides just other competitors - teammates who lost medals and sponsors due to doping by others on their team. Accurate testing is needed to ensure the deserving receive glory and rewards.

  • Estimates suggest drug testing only catches about 1 in 10 dopers, as testers face incentives to minimize false positives over false negatives. This allows many athletes to get away with cheating.

  • Common tricks used by athletes to produce false negatives include tampering with samples, using new undetectable drugs, and exploiting loopholes in testing procedures and exemptions.

  • The tradeoff testers face is that minimizing false positives leads to more false negatives, while minimizing false negatives increases false positives. This was seen in the use of hematocrit testing in cycling - where the threshold was set impacted the balance between the two error types.

  • While anti-doping aims to reduce false positives, athletes still argue most positives are false, showing a lack of trust and issues with accurately detecting all cases of doping.
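
The threshold tradeoff described in the bullets above can be sketched with simulated readings. This is an illustration under loudly stated assumptions: the two normal distributions, their means and spreads, and the candidate thresholds are made up and are not real hematocrit data or any agency’s actual rule.

```python
import random

rng = random.Random(0)

# Hypothetical readings: means and spreads are invented for illustration only.
clean = [rng.gauss(45, 3) for _ in range(10_000)]
doped = [rng.gauss(51, 3) for _ in range(10_000)]

def error_rates(threshold):
    """Flag a reading as positive when it exceeds the threshold."""
    # Clean athletes above the line are falsely accused.
    false_positive_rate = sum(x > threshold for x in clean) / len(clean)
    # Doped athletes at or below the line escape detection.
    false_negative_rate = sum(x <= threshold for x in doped) / len(doped)
    return false_positive_rate, false_negative_rate

results = {t: error_rates(t) for t in (46, 48, 50, 52)}
for threshold, (fp, fn) in results.items():
    print(f"threshold {threshold}: false positives {fp:.1%}, false negatives {fn:.1%}")
```

Raising the threshold protects clean athletes from false accusations but lets more dopers through, and lowering it does the reverse. No threshold removes both errors as long as the two groups’ readings overlap.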

Zach Lund, an American skeleton rider, claimed his masseuse rubbed a steroid-laced cream on his legs without his knowledge. While the cream contained finasteride, a baldness treatment, this did not constitute a false positive on drug tests since it still introduced a banned substance to his body.

The modern polygraph machine measures physiological responses like breathing, blood pressure, and skin conductivity that indicate anxiety or stress, not deception itself. A skilled examiner is needed to interpret these readings and determine if inconsistencies suggest lying. Lie detector tests are still not admissible in US courts but are eagerly used by athletes, politicians, and celebrities to defend their reputations in the court of public opinion.

Jose Canseco took and passed polygraph tests to support claims made in his books about widespread steroid use in Major League Baseball. He said the tests showed he was truthful about his conversations with Mark McGwire and about injecting him. Other athletes like Marion Jones and Roger Clemens also tried using polygraphs to deny doping allegations. While not conclusive proof, polygraph results are persuasive to many viewers and can sway public opinion.

  • Polygraphs have been repeatedly shown to be unreliable by scientific reports, but are still widely used by law enforcement. They are used more to coerce confessions than for accurate lie detection.

  • Police can legally tell suspects they failed a polygraph even if they didn’t take one, to coerce confessions. Confessions hold powerful sway in court.

  • The FBI, CIA and most local police routinely use polygraphs to screen employees, despite the 1988 law banning their use in private companies. Government agencies screen thousands of employees.

  • The case of Jeffrey Deskovic shows how police convinced an innocent suspect to confess using a polygraph. He was wrongfully convicted based on this coerced confession and lack of evidence.

  • Despite the National Academy of Sciences’ conclusion that polygraphs have unacceptable error rates for screening, the military continued to develop and use the portable PCASS polygraph in Iraq and Afghanistan for screening without adequate testing.

In summary, the passage outlines the widespread yet unreliable use of polygraphs by law enforcement and government, and how they are used more to coerce confessions than for accurate lie detection, which can lead to wrongful convictions like the one in Deskovic’s case.
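
Why error rates matter so much more in mass screening than in a targeted investigation comes down to simple arithmetic. The sketch below assumes a test that is right 90% of the time for both liars and truth-tellers, a screening pool where 1 in 1,000 people is a threat, and a small suspect pool where 1 in 5 is guilty. All of these figures are hypothetical and chosen only to show the effect.

```python
def screening_outcomes(population, threat_rate, sensitivity, specificity):
    """Expected counts of correctly flagged threats and wrongly flagged innocents."""
    threats = population * threat_rate
    innocents = population - threats
    true_positives = threats * sensitivity            # threats correctly flagged
    false_positives = innocents * (1 - specificity)   # innocents wrongly flagged
    return true_positives, false_positives

# Mass screening: threats are rare (assumed 1 in 1,000).
screen_tp, screen_fp = screening_outcomes(10_000, 0.001, 0.9, 0.9)

# Targeted investigation: a short list of suspects (assumed 1 in 5 guilty).
target_tp, target_fp = screening_outcomes(10, 0.2, 0.9, 0.9)

print(f"screening: {screen_fp / screen_tp:.0f} false alarms per real threat flagged")
print(f"targeted:  {target_fp / target_tp:.2f} false alarms per real threat flagged")
```

With identical accuracy, the screening pool produces over a hundred false alarms for every real threat flagged, while the targeted pool produces fewer than one. The difference comes entirely from how rare the threats are.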

  • Congress has not held any hearings to seriously examine the claims of efficacy of the portable lie detector known as PCASS, despite concerns about lack of science backing its effectiveness.

  • In response to these concerns, the Army acknowledged weaknesses in PCASS and restricted its use to screening job applicants and potential insurgents, rather than using it for immediate consequences. However, critics argue this just lowers expectations and that PCASS is still not useful for screening large numbers of mostly innocent people.

  • Research shows that when accuracy is the same, lie detectors are far less useful for screening large populations where very few are actual threats, compared to targeted investigations like police lineups. This is because even a small error rate results in many false positives when screening vast numbers of innocent people.

  • The Jeffrey Deskovic case illustrated how a false positive polygraph result led an innocent man to confess and be convicted, spending 16 years in prison before being exonerated by DNA evidence. Psychological research shows false confessions are common due to investigative techniques used during polygraphs.

  • Experts warn that deploying polygraphs like PCASS for screening thousands of people could result in hundreds or thousands of false positives and wrongful confessions/convictions for every actual security threat identified.

  • National security screening systems like polygraphs aim to minimize false negatives (undetected threats) but unavoidably generate many false positives due to the tradeoff between errors.

  • This prioritizes preventing rare but harmful events like terrorist attacks over wrongly accusing innocents. One false negative can have devastating consequences but false positives are less visible.

  • Statistics show polygraphs result in many more false positives than true detections. For every insurgent caught, around 100 regular people may face false accusations.

  • False positives ruin innocent lives and careers through coerced confessions, wasted investigations, and damaged reputations. It took 16 years for one man, Jeffrey Deskovic, to clear his wrongful conviction.

  • Large-scale government data mining programs intended to detect rare terrorist plots will likely generate an unmanageable number of false alarms due to the tradeoff between errors and the rarity of meaningful patterns. Even very accurate systems may find vastly more false threats than real plots.

  • The asymmetric costs of errors in national security screening skew systems towards minimizing rare but dangerous false negatives at the expense of tolerating widespread false positives and their harmful impacts on civil liberties and innocent people.

  • On October 31, 1999, EgyptAir Flight 990 plunged into the Atlantic Ocean near Nantucket Island, killing all 217 people on board. This shocked local residents who witnessed a fireball in the sky.

  • Plane crashes get significantly more media coverage than other forms of death like homicides or cancer. The EgyptAir crash was heavily covered by major news networks.

  • Newspaper coverage followed typical disaster story formats - facts of the case, human interest stories, feel-good community response stories, investigative stories compiling expert analysis, and editorials looking at broader issues.

  • Editorials often included tables of recent plane crashes in the area, fueling a narrative of a “Bermuda Triangle” effect near Nantucket. Polls also found increased public worry about air travel after such incidents.

  • Theories about the crash cause ranged widely, from equipment failures to atmospheric anomalies. Speculation increased as experts gave conflicting analyses. The available information created more confusion and speculation among the public.

  • After the EgyptAir crash in 1999, public fear and anxiety around air travel increased significantly. Many travelers canceled or postponed trips and opted to drive instead.

  • People were trying to make sense of the unlikely coincidence of four fatal air crashes in four years in the same general region. While crashes are statistically very rare, some believed there must be some hidden causal factor like equipment failure, pilot errors, or the Bermuda Triangle.

  • Statisticians regard this type of reasoning about patterns as valid; it is called “statistical testing.” However, aviation safety experts say the fears are overblown given how exceptionally rare crashes actually are.

  • The story then shifts to an example of a lottery scam in Canada. A store owner cheated an elderly man out of a $250,000 lottery win by tricking him into thinking he only won a free ticket. Statistical analysis of lottery data uncovered insider wins were far more common than expected by chance, indicating fraud.

  • This example illustrates how statistical analysis of patterns, as in the lottery wins, can reveal anomalies too extreme to be due to random luck alone and point to an underlying cause, such as insider fraud in the lottery case or equipment failure or human error in a cluster of crashes.

  • Professor Arnold Barnett studied airline safety data for over 30 years and found that fatal airline crashes have become essentially random events with an extremely small chance of occurring.

  • He proved that major US airlines have equivalent safety records and that no airline can be consistently predicted to be safer than others. Crashes are simply matters of chance.

  • Barnett also showed that foreign airlines have equivalent safety records to US airlines on international routes where they directly compete. Developing world carriers are no more risky than carriers from developed countries in these situations.

  • Barnett used statistical analysis and testing to back up his conclusions, examining proportions of flights and fatalities over many years. His work helped show that fears about airline safety are generally unfounded and not supported by data.

  • Barnett delivered a prescient lecture in 2001 warning of new threats to aviation like sabotage and collisions, just before the 9/11 terrorist attacks confirmed his predictions. He is considered a pioneering researcher in using data and statistics to accurately measure airline safety.

  • Statistical thinking focuses on variability rather than just averages. Averages hide fluctuations and risks. Investors in Bernie Madoff’s Ponzi scheme relied too heavily on the average returns they were shown, ignoring variability.

  • Reducing variability, rather than just average wait times, is important for things like traffic congestion and Disney lines. FastPass and metered highways aim to smooth out spikes in demand rather than just shorten average wait times.

  • Perceived wait times matter more than actual wait times to customers. Disney focused on perception management through techniques like inflated wait time estimates posted on signs.

  • Applied scientists have to consider political and social factors in addition to technical solutions. Minnesota ramp meters faced opposition from a state senator even though they reduced travel times, showing the need to manage public perceptions of policies.

So in summary, statistical thinking looks beyond averages to understand variability and risks, aims to reduce variability rather than just optimize averages, considers human psychology and perception, and navigates political/social impacts in addition to technical impacts.
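As a minimal sketch of why an average alone can mislead, the Python snippet below compares two sets of wait times that share the same mean but differ sharply in spread. The numbers are hypothetical and invented for illustration; they are not taken from the book.

```python
from statistics import mean, pstdev

# Hypothetical wait times in minutes; both lists are invented for illustration.
steady_waits = [28, 30, 32, 29, 31, 30]
spiky_waits = [5, 10, 75, 8, 72, 10]


def summarize(waits):
    """Return (average, standard deviation, worst case) for a list of waits."""
    return mean(waits), pstdev(waits), max(waits)


steady = summarize(steady_waits)
spiky = summarize(spiky_waits)

for label, (avg, spread, worst) in (("steady", steady), ("spiky", spiky)):
    print(f"{label}: average={avg:.1f} min, std dev={spread:.1f} min, worst={worst} min")
```

Both lists average 30 minutes, yet only the second one contains waits of more than an hour. A policy that smooths out those spikes can improve the experience without changing the average at all.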

  • Before a “meters shutoff” experiment in Minnesota, engineers tried to delay traffic congestion to maintain highway capacity and smooth traffic flow. The experiment found this approach had benefits like smoother traffic that outweighed drawbacks like waiting at on-ramps.

  • However, commuters disliked waiting at ramps more than stop-and-go traffic. Statistical analysis of the experiment looked at pre-experiment and post-experiment data on traffic metrics to attribute differences to the ramp meter shutoff.

  • Proper experimental design and statistical expertise are important for interpreting such pre-post studies, as there may be hidden factors besides the intervention that influence outcomes.

  • Variability is key to insurance - by pooling many independent risks, insurers can reliably predict average losses and set premiums. But catastrophe insurance faces more extreme variability, as major events can cause claims exceeding total premiums collected and insurer balances.

  • Variability also complicates developing accurate tests, like those for doping - natural variations mean tests must distinguish normal from enhanced levels, accepting some errors to minimize false accusations. Correlation without fully understood causation can still yield useful statistical models.

  • Statistical models like credit scoring can generate uniform scoring rules to evaluate all loan applicants in a consistent, unbiased way. This contributed to a large expansion of consumer credit and economic growth.

  • However, models do not replace human judgment. Businesses still determine their risk tolerance and make final lending decisions based on model scores.

  • In epidemiology, merely showing a correlation is not enough. Investigators must establish causation by tracing the complete causal path from source to infection. This requires combining statistical tools with fieldwork and laboratory analysis.

  • Outbreak investigations are challenging, and mistakes sometimes occur due to uncertainties. But statisticians see value even in being wrong occasionally, and in clearly solving complex puzzles.

  • The same principles apply to other domains like food safety monitoring and law enforcement that rely on correlations. Multiple lines of evidence are needed to avoid falsely implicating innocent parties based solely on correlated factors. Causal theories must keep pace with new technologies.

  • When reporting group statistics, differences between groups are important. Aggregating dissimilar groups can obscure important nuances or create misleading overall impressions through Simpson’s paradox effects. Factors like ability levels may vary across demographic groups.

  • Simpson’s paradox is a statistical phenomenon where a trend appears in different groups of data but disappears or reverses when these groups are combined. It often occurs when a hidden (confounding) variable is not properly accounted for.

  • Applying differential item functioning (DIF) analysis, which divides test-takers into ability groups and compares average performance within groups, helped resolve paradoxes in standardized test scoring and established a standard for fair testing.

  • Stratification creates like groups for comparison, avoiding Simpson’s paradox. Case-control epidemiological studies also implement this by matching sick and healthy individuals on relevant factors to compare exposure rates between the groups.

  • While stratification facilitates fair comparisons, random assignment of subjects to groups is generally preferred by statisticians as it ensures all groups are similar prior to any treatment or exposure. However, randomization is not always feasible.

  • When generalizing from study results, statisticians acknowledge margins of error to account for possible false positives or false negatives. The tolerances for these errors involve considering asymmetric costs and incentivizing accurate detection of both positive and negative cases.
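The Simpson’s paradox and stratification bullets above can be illustrated with a small Python sketch. The counts are hypothetical (made up so that the reversal appears), and the two-level ability split is a simplification, not the actual DIF procedure used by test designers.

```python
# Hypothetical counts for one test item: (correct answers, test-takers).
# All numbers are invented for illustration.
counts = {
    "group_a": {"high": (720, 900), "low": (40, 100)},
    "group_b": {"high": (85, 100), "low": (405, 900)},
}


def overall_rate(strata):
    """Correct-answer rate after lumping all ability levels together."""
    correct = sum(c for c, _ in strata.values())
    total = sum(t for _, t in strata.values())
    return correct / total


# Aggregate comparison: ignores ability level.
overall = {group: overall_rate(strata) for group, strata in counts.items()}

# Stratified comparison: compares like with like.
within = {
    group: {level: c / t for level, (c, t) in strata.items()}
    for group, strata in counts.items()
}

print("overall:", overall)
print("within ability level:", within)
```

In this made-up data, group A looks far stronger overall, yet group B does better within both the high-ability and the low-ability stratum. The aggregate gap comes entirely from the different mix of ability levels in the two groups, which is why comparisons are made within like groups.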

False negatives are invisible unless the athletes who slipped through confess, while false positives are publicly mocked. This gives test administrators an incentive to call fewer positives, which means missing real cases.

In national security screening, false negatives could have serious consequences, while false positives only become known if authorities reverse mistakes and victims report them. As a result, the U.S. Army configures portable polygraphs to minimize false negatives.

Decision makers focus most on errors that invite bad press, neglecting other errors that go unseen. This means false negatives are a concern in steroid testing and false positives in polygraph/terrorist screening. Estimates suggest 10 dopers escape detection for every one caught, and hundreds or thousands of innocent people are falsely implicated for each terrorist found. The ratios are worse for rarer targets.
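The claim that the ratios get worse for rarer targets follows from simple arithmetic. The sketch below is a minimal illustration in Python; the accuracy figures and prevalence values are assumptions chosen for the example, not numbers from the book.

```python
def false_alarms_per_hit(prevalence, sensitivity, false_positive_rate):
    """Expected number of innocent people flagged for each real target flagged.

    prevalence: fraction of screened people who are real targets
    sensitivity: chance a real target is flagged
    false_positive_rate: chance an innocent person is flagged
    """
    true_hits = prevalence * sensitivity
    false_alarms = (1 - prevalence) * false_positive_rate
    return false_alarms / true_hits


# Assumed screening accuracy: catches 90% of targets, flags 1% of innocents.
SENSITIVITY = 0.90
FALSE_POSITIVE_RATE = 0.01

# Assumed prevalence levels: 1 target in 10,000 and 1 target in 100,000.
common = false_alarms_per_hit(1 / 10_000, SENSITIVITY, FALSE_POSITIVE_RATE)
rare = false_alarms_per_hit(1 / 100_000, SENSITIVITY, FALSE_POSITIVE_RATE)

print(f"1 in 10,000:  {common:.0f} false alarms per real target")
print(f"1 in 100,000: {rare:.0f} false alarms per real target")
```

Under these assumptions, a test that is wrong about innocent people only 1% of the time still produces about 111 false alarms per real target at a prevalence of 1 in 10,000, and about 1,111 at 1 in 100,000. Making the target ten times rarer makes the ratio roughly ten times worse.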

Minimizing one error type inevitably increases the other, and, as behavioral economics suggests, asymmetric costs and incentives shape which error decision makers choose to tolerate. Textbooks often assume error costs are equal, but in reality costs differ depending on societal goals and individual characteristics.

The passage discusses a statistical test of whether ramp metering reduces highway congestion. Senators doubted the effectiveness of ramp metering and wanted to know the likelihood that average trip times would rise by 22% (the improvement claimed by traffic engineers) if the meters were shut off. Consultants analyzed an experiment in which the meters were shut off and concluded that ramp metering was effective at reducing congestion, because the likelihood (p-value) of seeing the observed deterioration in travel times by chance alone was small. With so small a likelihood, statisticians rejected the explanation that some rare chance event, rather than the meter shutoff, had caused the results. For experiments to provide useful insights, they should be conducted under normal conditions; if an abnormal event influenced the results, a new experiment would be needed.
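To make the p-value idea concrete, here is a minimal Python sketch of a permutation test on a before/after comparison. The trip times are hypothetical, chosen so that the rise is roughly 22%, and the permutation test is an assumed stand-in; the summary does not say which method the consultants actually used.

```python
import random

# Hypothetical daily average trip times in minutes; invented for illustration.
meters_on = [21.5, 22.0, 20.8, 21.9, 22.4, 21.1, 20.9, 22.2]
meters_off = [25.9, 27.1, 26.4, 25.2, 27.8, 26.0, 26.7, 25.5]


def average(values):
    return sum(values) / len(values)


def permutation_p_value(before, after, n_shuffles=10_000, seed=0):
    """Estimate how often chance alone produces a rise at least as large.

    Labels are shuffled at random; if the intervention had no effect,
    the "before" and "after" labels would be interchangeable.
    """
    rng = random.Random(seed)
    observed = average(after) - average(before)
    pooled = list(before) + list(after)
    n_before = len(before)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = average(pooled[n_before:]) - average(pooled[:n_before])
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (extreme + 1) / (n_shuffles + 1)


observed_rise, p_value = permutation_p_value(meters_on, meters_off)
print(f"observed rise: {observed_rise:.2f} min, estimated p-value: {p_value:.4f}")
```

A small p-value says that random relabeling almost never reproduces a rise this large, so chance is a poor explanation. It does not rule out hidden factors such as unusual weather during the shutoff period, which is why the experiment needs to run under normal conditions.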

  • The article discusses various studies and resources related to traffic congestion, transportation optimization problems, and efforts to manage reliability and reduce trip times. It mentions works by the PATH research group at UC Berkeley on ramp metering, two FHWA reports on ramp management and bottlenecks, and Anthony Downs’ book Still Stuck in Traffic.

  • It then shifts to discussing Walt Disney World’s FastPass system for minimizing wait times, citing analyses and tips from various Disney fans and resources like The Unofficial Guide to Walt Disney World.

  • Next, it covers epidemiological investigations of disease outbreaks like the 2006 E. coli outbreak linked to bagged spinach. It mentions seminal texts, case studies, reports on the specific outbreak investigation, and media coverage.

  • Finally, it discusses credit scoring algorithms and references works analyzing FICO scores and other credit scoring models from statisticians and companies like Fair Isaac Corporation. It provides an overview of the key topics and sources covered without summarizing any specific studies or analyses in depth.

  • The passage discusses several references for learning about differential item functioning (DIF) analysis, which identifies unfair test items that function differently for different groups. The seminal reference is Holland and Wainer’s volume from the 1980s.

  • Techniques for DIF analysis include standardized differences, Mantel-Haenszel statistics, and item response models. ETS researchers have employed these techniques in studies of SAT items.

  • Identifying the sources of inequity in unfair test items is challenging. DIF analysis requires classifying examinees by ability based on their test scores excluding unfair items, which critics argue can be circular.

  • The passage also discusses references related to analyzing the impacts of credit scoring, natural disaster insurance risk modeling, Simpson’s paradox example in graduate admissions data, and measuring achievement gaps between student groups.

  • Key sources include books, research papers, industry reports, and news articles providing both technical and practical perspectives on these statistical topics as applied in different contexts.

  • The detection of performance-enhancing drugs like steroids in sports relies on statistical analysis of test results using concepts like conditional probabilities and Bayes’ rule. However, publicly reported rates of positive tests at major events are much lower than what textbook analyses would predict based on assumed test accuracy levels.

  • Detecting certain drugs like erythropoietin (EPO) requires hematocrit testing as well as advanced testing techniques, and tests still have false negative limitations. Experts acknowledge some positive samples likely go undetected.

  • Obtaining a therapeutic use exemption allows athletes to legally use certain banned substances for medical reasons, and TUE rates are high in some sports. This opens issues around abuse of TUE systems.

  • When athletes do test positive, the real debate often centers on alternative explanations for the results rather than the chemical analysis, moving the discussion into the realm of lie detection. Published sources provide details on drug testing programs, exemptions, and limitations.

  • The passage discusses numerous sports doping scandals involving baseball, track and field, cycling, and other sports. It covers investigations, reports, and fallout from steroid and other performance-enhancing drug use by high-profile athletes like Barry Bonds, Marion Jones, Lance Armstrong, and others.

  • It discusses investigative reports that named athletes using steroids, congressional testimony on the issue, and bans/suspensions of athletes after testing positive or admitting to doping.

  • Coaches, trainers, and suppliers of drugs to athletes were also sometimes investigated and faced legal consequences.

  • The passage references reports, books, and interviews that helped uncover the rampant doping in some sports and its ethical implications. It also discusses the challenges of drug testing and catching drug cheats.

  • Overall, the passage provides an overview of the proliferation of performance-enhancing drug use in elite sports over decades, and the gradual exposure of widespread cheating programs and their impacts on athletes’ careers and reputations. It touches on medical, legal, and ethical dimensions of the high-stakes doping scandals.

  • A source of contamination was discovered during the investigation of an E. coli outbreak that sickened 37 people. However, the exact source is not specified in the provided text.

  • Bob Edmonds is mentioned on pages 144-146 but no additional context is provided about who he is or his relevance.

  • The Educational Testing Service (ETS) administered the SAT and was involved in lawsuits related to claims of racial bias in test questions. They worked with Golden Rule and other organizations on studies analyzing test fairness.

  • Other organizations mentioned include the Epidemic Intelligence Service (EIS) and PulseNet, which were involved in disease outbreak investigations and information sharing.
