
Winning with Data Science A Handbook for Business Leaders - Howard Steven Friedman

Matheus Puppe

“If you liked the book, you can purchase it using the links in the description below. By buying through these links, you contribute to the blog without paying any extra, as we receive a small commission. This helps us bring more quality content to you!”

BOOK LINK:

CLICK HERE

Here is a summary of the key points from the introduction of the book “Winning with Data Science”:

  • The book aims to help non-technical managers, executives, and business professionals work effectively with data science teams by understanding basic concepts and terminology.

  • It teaches how to communicate requirements clearly, evaluate proposed solutions, and ask good questions to maximize value from data science projects.

  • The goal is not to turn readers into data scientists, but rather to speak the language and understand at a high level what different analytical approaches can do.

  • It covers important topics like common tools/technologies, roles on data science teams, foundational statistical concepts, modeling techniques for different problems, project management best practices, and ethics.

  • By the end, readers should be able to evaluate which models might work best for their needs, probe experts intelligently with insightful questions, and avoid wasting time/money on improper tools or approaches.

  • Programming and advanced math skills are not the focus. The book aims to demystify data science for non-technical partners so they can effectively work with these teams.

  • The passage introduces Steve, who has work experience as a handyman but is now in his first real job out of business school at Shu Money Financial.

  • Steve is tasked with developing a new strategy for prioritizing cases sent to the Recoveries Department at Shu Money. The current process is that if a customer hasn’t paid anything owed in 6 months, the debt is charged off.

  • An analogy is made that just like different tools are needed for different jobs as a handyman, different specialists are needed for a successful data science team.

  • The roles of data engineers, machine learning specialists, data visualization experts and other specialists on a data science team will be explained. How to prioritize projects and allocate resources will also be covered.

  • The passage will introduce some basic concepts involving key programming languages and modeling techniques used in data science projects. Ethics in data science and measuring return on investment from projects will also be discussed.

  • The concepts will be explained through the perspectives of Steve at Shu Money Financial and Kamala at a health insurance company as they both seek to advance their careers by extracting value from data science.

  • The Recoveries Department at Shu Money Financial is tasked with collecting debt from customers who have charged-off or defaulted on loans, credit cards, etc.

  • Steve met with the data science team to discuss using data and analytics to improve how the Recoveries Department prioritizes debt collection cases.

  • Currently, prioritization is done through simple rules based on past payment history and contact with the customer. But there may be other predictive factors that could optimize recovery amounts.

  • The data science workflow involves data collection, storage, preparation, exploration, and modeling. Proper data quality checks and transformations are important before modeling to ensure accurate results.

  • The data team would extract relevant data from internal and external sources. Data would be transformed through standardization, error-checking, cleaning, etc. Then the cleaned data would be loaded into storage for analysis.

  • Steve wants the team to analyze the data and develop a predictive model to identify which customers are most likely to provide the highest recovery amounts. This would help optimize how collection cases are prioritized.

  • In the past, data was stored locally on devices like hard drives and backups were inconsistent. This led to data loss risks from physical failures or disasters.

  • Cloud computing provides data storage and computing resources over the internet, avoiding these risks. Data can be accessed from anywhere with an internet connection.

  • Cloud storage has advantages of automatic backups, scalability, global accessibility, and pay-per-use pricing. Security and maintenance are handled by cloud providers.

  • However, cloud storage also has disadvantages like potential performance issues from internet speeds, vendor lock-in, and concerns about hacks targeting major cloud providers.

  • For a large company, corporate-level decisions will dictate the data architecture. But for smaller companies or individual projects, a local solution may suffice depending on needs for computing power, storage, security, backups, etc.

  • Integrating APIs allows sharing data across different systems and organizations by facilitating communication between applications and servers. Data scientists must understand API specifications to extract needed data.
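As a rough illustration of the API point above, here is a minimal Python sketch of pulling records from a hypothetical REST endpoint; the URL, parameters, and field names are invented and would come from the actual API specification.

```python
import requests
import pandas as pd

# Hypothetical endpoint and API key -- replace with the real specification
url = "https://api.example-vendor.com/v1/customers"
params = {"status": "charged_off", "limit": 100}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()          # fail loudly on HTTP errors

records = response.json()            # assume the API returns a JSON list
df = pd.DataFrame(records)           # tabular form for downstream analysis
print(df.head())
```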

  • Companies have access to vast amounts of data from both internal and external sources like social media, websites, APIs, data vendors, etc. This includes both structured data stored in databases as well as unstructured data like text, images, audio, etc.

  • Unstructured data is often stored initially in a data lake before being processed and loaded into a data warehouse for analysis. Technologies like NLP can extract useful insights from unstructured data sources.

  • Identifying all potential data sources is important for a data science project. The breadth and quality of data impact the accuracy of analyses and models.

  • Data quality must be assured through steps like removing duplicates, validating data types, checking for valid values, and understanding missing data. Customer involvement helps improve data quality.
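A minimal sketch of those data quality checks in pandas, using a tiny made-up extract in place of the real claims data:

```python
import numpy as np
import pandas as pd

# A tiny, invented extract standing in for the real claims data
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "age": [34, 34, 131, 58, np.nan],
    "claim_amount": ["1200", "1200", "850", "oops", "430"],
})

# Remove exact duplicates
df = df.drop_duplicates()

# Validate data types: coerce amounts to numeric, flagging failures as missing
df["claim_amount"] = pd.to_numeric(df["claim_amount"], errors="coerce")

# Check for valid values, e.g. ages outside a plausible range
# (missing ages are also flagged here)
invalid_age = ~df["age"].between(0, 120)

# Quantify missingness per column before deciding how to handle it
missing_share = df.isna().mean()

print(f"implausible ages: {invalid_age.sum()}")
print(missing_share)
```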

  • Common coding languages for data science include SQL for querying databases, Python and R for advanced modeling, analysis and reporting. Other languages like Scala, Julia, JavaScript and Java are also used.

  • While learning a new language can be worthwhile, regularly using it is important to retain skills - otherwise knowledge decays quickly. Online resources make learning programming accessible.
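To show how SQL and Python typically work together, here is a small sketch that queries a database and hands the result to pandas; an in-memory SQLite database and the table/column names are stand-ins for whatever the team actually uses.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the real database; table and column
# names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recovery_payments "
    "(customer_id INT, amount_recovered REAL, charge_off_date TEXT)"
)
conn.executemany(
    "INSERT INTO recovery_payments VALUES (?, ?, ?)",
    [(1, 250.0, "2023-02-10"), (1, 100.0, "2023-03-15"), (2, 75.0, "2023-05-01")],
)

query = """
    SELECT customer_id,
           SUM(amount_recovered) AS total_recovered,
           COUNT(*)              AS n_payments
    FROM   recovery_payments
    GROUP  BY customer_id
"""
recoveries = pd.read_sql_query(query, conn)   # hand the result to Python for analysis
print(recoveries)
conn.close()
```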

  • Data products are the practical solutions developed by the data science team to solve the customer’s problem based on the raw data and analysis.

  • The customer should work with the data science team to define the specific type of data product needed - e.g. raw data, automated decision tools, or something in between.

  • The optimal product depends on the problem being solved and the customer’s existing resources like talent, software, hardware, and budget.

  • For a customer with in-house analytics skills, raw cleaned data may be sufficient if they can do additional work.

  • A more sophisticated option is a quality-checked, documented derived database ready for modeling, to save time for customers without dedicated data preparation support.

  • The key is delivering a product tailored to the customer’s specific needs and capabilities, rather than just raw analysis results, for the solution to be implemented successfully.

  • Kamala is a talented director of clinical strategy and marketing at Stardust Health Insurance. Her goal is to increase the company’s profitability through data-driven decisions on prescription drugs and medical procedures.

  • Specifically, her team aims to analyze which options are most cost-effective for common medical conditions like pneumonia and high blood pressure when accounting for patient factors like demographics, medical history, and other treatments.

  • They need to determine if some options lead to better health outcomes while also reducing future healthcare costs through fewer visits, ER trips, or hospitalizations. If so, those options should be added to Stardust’s drug formulary and incentivized through lower copays.

  • However, the patient still has the final say in their treatment. Kamala’s role also includes marketing campaigns to promote Stardust and sway patients toward preferred, more cost-effective options through incentives and awareness of benefits.

  • The overall objective is to nudge patients to choose treatments that provide the best outcomes at the lowest total costs for Stardust, thereby increasing the company’s profitability through data-driven clinical and marketing strategies.

  • The project aims to analyze patient demographic data to better serve underserved populations through insurance advertising and healthcare access.

  • Kamala emphasized the importance of project management after her previous startup failed due to lack of deliverables and prototypes.

  • She introduces Phil, an experienced project manager, to oversee the data science team’s work.

  • Phil walks the team through the key phases of project management: concept, planning, implementation, and closeout.

  • In the planning phase, activities, sequencing, resources, budget, and risks are discussed.

  • Implementation considers human resources, quality control, and risk management. Roles and contingency plans are clarified.

  • Quality standards are established and how they will be tested is discussed.

  • Risk identification and mitigation strategies are developed to address potential issues like costs, schedules, performance, operations, legal risks, and human resource risks.

So in summary, proper project management processes and oversight are being established to help ensure the demographic data analysis project is successfully delivered.

  • The key is to treat data science projects like other projects by applying best practices of project management to reduce failure and ensure the product meets customer needs.

  • Phil drafted a thorough project plan document outlining activities, roles, tasks, risks, specifications, etc. and got sign-off from all involved.

  • As project manager, Phil’s role is to keep the project on track for on-time delivery of high quality work while documenting risks and issues.

  • Specific roles like data engineer, data analyst, data scientist were discussed. Data engineers build data structures and pipelines. Analysts do data preparation, cleaning, and transformations. Scientists build machine learning models with skills in programming, feature engineering, etc.

  • Other roles include data visualization specialists who communicate results to non-technical audiences, as well as more specialized roles focused on areas like NLP, computer vision, networks, etc. depending on the project needs.

  • Proper project management involves understanding each person’s specific skills and how they contribute to ensure the right people are assigned to the right tasks.

  • When doing statistical tests, the statistician should verify that the key assumptions of the test are valid and the test is well matched to the specific problem and data set. Medical and healthcare fields employ many statisticians since their work is often scrutinized.

  • Identifying the needed skills at different project phases is important for project management of a data science project. This includes determining when data engineers are needed and for how long, and where machine learning expertise is required.

  • The available resources must be mapped against the identified skill requirements to find any gaps. Short-term options include internal transfers, but long-term needs may require external hires.

  • Prioritizing projects is important when there are more projects than resources. Factors to consider include return on investment, problem/opportunity importance, project feasibility, data availability/quality, and costs.

  • Qualitatively rating projects on factors such as impact, testability, and how well they address customer issues helps compare projects initially; more quantification is ideal when possible.

  • OKRs provide a framework with objectives and key results to measure project achievement.

  • Understanding available internal and external data sources, including quality, limitations, costs and representativeness, is key.

  • Alternatives should be considered in case a project is unsuccessful or over budget to allow flexibility in planning.

  • The buy versus build decision for software involves comparing costs of licensing to potential benefits like reduced timelines and improved skills. Both costs and benefits need quantification.

The discussion focuses on quantifying the benefits of adopting automated machine learning (AutoML) software in a cost-benefit analysis for model development. Several key points are mentioned:

  • Assuming future staff costs would be reduced due to efficiencies and less need for advanced data science talent with AutoML. This should be included in the cost-benefit analysis.

  • Performing a sensitivity analysis that does not include staff cost savings to be conservative.

  • Addressing concerns about less confidence in AutoML outputs since it uses a “black box” approach. Phil responds that AutoML provides interpretability and insights into models.

  • Phil develops some rough cost-benefit analysis estimates for AutoML. The sensitivity analysis shows it is not necessarily a major win, so they decide to stick with current Python/R approach for now.

The discussion emphasizes quantifying the potential human-resource savings from AutoML's efficiencies in model development and its usability by less technical personnel. A sensitivity analysis without this assumption is recommended to be cautious. The interpretability of AutoML outputs helps address concerns about it being a “black box.”
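A toy version of that cost-benefit comparison, with all dollar figures invented purely to illustrate the sensitivity analysis:

```python
# Back-of-the-envelope cost-benefit comparison for an AutoML license.
# All figures are invented for illustration; plug in real estimates.
license_cost_per_year = 150_000
faster_delivery_benefit = 80_000      # value of shorter model timelines
staff_cost_savings = 120_000          # assumed efficiency gain

base_case = faster_delivery_benefit + staff_cost_savings - license_cost_per_year
# Sensitivity analysis: drop the staff-savings assumption entirely
conservative_case = faster_delivery_benefit - license_cost_per_year

print(f"Base case net benefit:  {base_case:+,}")
print(f"Without staff savings:  {conservative_case:+,}")
```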

Here are the key points associated with the project team’s questions:

  • Contingency plan if an important team member is unable to work: The project should have contingency plans in place in case a key team member gets unexpectedly sidelined. This could involve having backup resources ready to take over responsibilities or ensuring documentation is kept up-to-date so others can easily get up to speed if needed.

  • Standards of quality and testing: The project should establish clear standards for expected quality and performance. There needs to be processes for testing work against these standards throughout development to ensure quality is maintained. Any issues should be documented and addressed.

  • What can go wrong: Some potential problems that could arise include delays, cost overruns, lack of required resources or skills, technical difficulties, incorrect assumptions, change requests, lack of stakeholder buy-in or oversight, security vulnerabilities, and poor communication/coordination between teams. Thorough risk planning can help mitigate many of these.

The key points focus on contingency planning, quality assurance, and risk management. These are important aspects for any project to address to help prevent and handle potential problems that could arise. Clear documentation, backup plans, testing processes, and risk planning can help the project team navigate issues and meet expectations.

  • The data science team is exploring claims data from a large health insurance company to understand high-cost patients. The data includes over 1 million patients each year from 2018-2019.

  • Some initial descriptive statistics found that the total number of patients and the average claims per patient both increased from 2018 to 2019. However, the range of claims per patient in 2019 was unusually wide, reaching almost 200,000.

  • Measures of central tendency like the average/mean can be useful high-level metrics but they mask variation in the data. The wide range in 2019 claims suggests outliers that skew the mean.

  • Other measures of spread like percentiles and interquartile range provide more context on variation. The analyst calculates these measures to better understand the typical claims range without outliers skewing it.

  • Excluding outliers and using measures like the interquartile range that are less sensitive to them can provide a more accurate picture of how most patients’ claims are distributed. This will help identify which patients are true high-cost outliers.

  • The discussion emphasizes how different summary metrics can capture different aspects of large, variable datasets and considering multiple perspectives is important for accurate insights. Outlier handling and alternative measures of spread beyond just the mean are important exploratory tools.

In summary, the team discusses how best to summarize and understand variation in the claims data through meaningful statistical metrics and careful outlier identification and handling, so that true high-cost patient patterns can be identified accurately.
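A small NumPy sketch, using synthetic right-skewed data, showing why the median and interquartile range tell a different story than the mean when outliers are present:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed claims data standing in for the real 2019 file
claims = np.concatenate([rng.gamma(2.0, 1_500, 10_000),   # typical patients
                         rng.gamma(2.0, 40_000, 50)])     # a few extreme outliers

mean = claims.mean()
median = np.median(claims)
p25, p75 = np.percentile(claims, [25, 75])
iqr = p75 - p25

print(f"mean:   {mean:,.0f}")        # pulled upward by the outliers
print(f"median: {median:,.0f}")      # robust to them
print(f"IQR:    {p25:,.0f} - {p75:,.0f} (width {iqr:,.0f})")
```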

  • An exploratory data analysis found a moderately positive correlation (0.4) between claimant age and number of claims filed. This suggested that older claimants on average file more claims.

  • Kamala cautioned against jumping to conclusions, as correlation does not necessarily mean causation. She proposed some alternative explanations for the trend, like bringing on employers with older workers or that older patients may use healthcare more for preventive care.

  • Kamala asked the data science team to help understand why older patients tend to file more claims - is it because they are sicker, use healthcare more, or something else?

  • The team identified three hypotheses to test: 1) Are older patients sicker? 2) Do older patients use healthcare more? 3) Is the relationship driven by recently added employers of older workers?

  • They defined age groups as under 40, 40-65, 65+ and agreed to measure sickness using preexisting conditions and diagnosis codes initially.

  • However, Kamala realized this was not a good measure of how sick someone is, as it does not capture severity. The team will need to consider alternative ways to measure health status/sickness.

In summary, the exploratory analysis raised questions the team aims to investigate more rigorously by testing specific hypotheses and defining their variables and measures more clearly.

Here is a summary of the key points discussed:

  • Kamala and the data science team aligned on the specific hypotheses/questions they wanted to test regarding age, health status, and healthcare utilization.

  • They agreed to use procedure codes to define relevant healthcare services, with Kamala’s team providing the list of codes.

  • To test if older patients are sicker, they would conduct a statistical hypothesis test comparing average claims between age groups.

  • The effect size from the test was a difference of 22 claims per year on average between older and younger patients.

  • While an effect size provides a quantified result, it does not tell the whole story on its own. We must also consider the p-value and limitations of the data/analysis.

  • Random chance can influence effect size estimates, especially with small sample sizes. More data increases confidence in the effect size.

  • Statistical power refers to the probability that a hypothesis test will correctly detect an effect if there is truly a difference between groups. More data increases power.

So in summary, the discussion centered on aligning on clear questions, methodology, interpreting results appropriately considering limitations and randomness, and how sample size impacts effect size estimates and statistical power. Effective collaboration and communication were important to define the analysis properly.
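As a hedged illustration of the hypothesis test described above, the sketch below runs a two-sample (Welch's) t-test on simulated claim counts; the numbers are invented and not from the book.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated annual claim counts for two age groups (numbers are illustrative)
younger = rng.poisson(lam=8, size=2_000)
older = rng.poisson(lam=30, size=2_000)

effect_size = older.mean() - younger.mean()          # difference in means
t_stat, p_value = stats.ttest_ind(older, younger, equal_var=False)

print(f"difference in average claims: {effect_size:.1f}")
print(f"p-value: {p_value:.2g}")   # interpret alongside effect size and data limits
```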

Statistical power is the probability of correctly rejecting a false null hypothesis, that is, of detecting an effect when one truly exists.

The key points are:

  • Statistical power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. In other words, it’s the probability of detecting an effect when there is truly an effect.

  • The power of a hypothesis test depends on the number of data points, the size of the effect, and the significance level/p-value threshold. More data, larger effects, and lower p-value thresholds increase power.

  • Data scientists calculate needed sample sizes using a desired power level (typically 80%), significance level (typically 5%), and an estimated effect size. Larger estimated effects require fewer data points to achieve adequate power.

  • Power calculations are done beforehand to determine the appropriate amount of data needed to correctly detect effects, avoiding unnecessary additional data collection.

So in summary, statistical power refers to the ability of a test to correctly detect real effects and reject false null hypotheses, with key determinants being sample size, effect size, and significance level. It quantifies the probability of avoiding Type II errors when alternatives are true.
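A minimal power/sample-size calculation with statsmodels, assuming a standardized effect size of 0.3 (an illustrative choice) together with the conventional 80% power and 5% significance level:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size needed per group for a two-sample t-test
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.3,   # standardized (Cohen's d), assumed
                                   power=0.80,
                                   alpha=0.05)
print(f"required sample size per group: {n_per_group:.0f}")
```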

Modeling can help identify the drivers of certain phenomena by examining the relationships between different variables. One approach is to generate a scatterplot to visualize the association between two continuous variables, such as height and weight. A line of best fit can then be fitted to the scatterplot to quantify the relationship. The slope of this line indicates how much one variable changes on average for each unit change in the other variable.

This process of regression analysis allows us to understand how an outcome or dependent variable is associated with various independent or predictor variables. The goal can be inference, to understand general trends in a population, or prediction, to forecast outcomes for individuals. It is important to carefully define the outcome variable to ensure it accurately reflects the phenomenon of interest. Independent variables that may help explain changes in the outcome can then be included in the model. This modeling approach aims to identify the key drivers and their relative influences on the outcome.
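A tiny sketch of fitting a line of best fit and reading off the slope, using synthetic height/weight data:

```python
import numpy as np

rng = np.random.default_rng(2)
height_cm = rng.normal(170, 10, 500)
weight_kg = 0.9 * height_cm - 90 + rng.normal(0, 8, 500)   # synthetic relationship

slope, intercept = np.polyfit(height_cm, weight_kg, deg=1)  # line of best fit
print(f"slope: {slope:.2f} kg per additional cm of height")
print(f"intercept: {intercept:.1f} kg")
```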

Here are the key points from the summary:

  • The team built two models - one focusing on younger patients and one on older Medicare Advantage patients, to examine which variables were associated with number of claims filed for each group.

  • A major finding was that presence of chronic diseases was a significant predictor of number of claims filed per year. However, it affected the two groups differently.

  • For younger patients, chronic diseases were associated with about 4 more claims per year. For older patients, chronic diseases were associated with around 15 more claims per year.

  • This raises the question of why chronic diseases have a greater effect on claims for older vs younger patients. Some potential explanations discussed were that older patients tend to be sicker overall.

  • Generating this question from the modeling results was the goal - to use statistical analysis to identify meaningful patterns and insights, not just report numbers.

  • Key tools for evaluating results included checking assumptions, sample size, effect sizes, alternative explanations like confounding variables, and interpreting p-values alongside effect sizes.

The summary demonstrates how a hypothesis-driven modeling approach can yield useful findings, and the importance of thoughtful interpretation and follow-up questions to ensure results are meaningful. Comparing groups addressed their original question in an insightful way.

  • Selectively reporting only positive or significant results from data analysis while ignoring non-significant ones leads to a biased representation of the analysis. This cherry-picking approach does not present an accurate picture of what the data shows.

  • “Data dredging” or “p-hacking” refers to the unethical practice of performing many statistical tests on data until a significant result is found by chance. Repeated testing increases the likelihood of finding statistically significant results even if no true effect exists. This misleads readers into thinking an effect was found when it was just due to the multiple testing.

In summary, to avoid bias and present results accurately, analysts should not selectively report only favorable results and should not perform excessive testing that artificially inflates statistical significance through chance alone. Highlighting only positive findings or hunting for significance through multiple comparisons distorts the truth about what the data actually demonstrates.
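A quick simulation that makes the multiple-testing problem concrete: even when there is no real effect, running many tests produces "significant" results by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_tests = 100
false_positives = 0

for _ in range(n_tests):
    a = rng.normal(size=200)          # both groups drawn from the same distribution,
    b = rng.normal(size=200)          # so there is no true effect to find
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests were 'significant' by chance alone")
```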

  • Kyra from the marketing team explains how they use randomized experiments (A/B tests) to inform their advertising strategy.

  • The first step is to clearly define a causal question in the form of “What is the effect of [intervention] on [outcome]?”

  • For an example marketing test, the intervention may be a new advertisement and the outcome could be number of incoming client calls.

  • It’s important to precisely define both the intervention and outcome to allow the experiment to be replicated. Vague terms should be avoided.

  • Outcomes also need to be meaningful measures of the question being asked.

  • Kyra describes how they work with the data science team to embed randomized experiments and causal inference into their workflow. This has provided valuable insights into what marketing strategies are most effective.
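As a sketch of how such an A/B test might be analyzed, the example below compares call rates for two ads with a two-proportion z-test; the counts are invented for illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B test: incoming client calls per ad impression
calls = [480, 530]            # calls generated by ad A and ad B
impressions = [100_000, 100_000]

z_stat, p_value = proportions_ztest(count=calls, nobs=impressions)
print(f"call rate A: {calls[0] / impressions[0]:.4%}")
print(f"call rate B: {calls[1] / impressions[1]:.4%}")
print(f"p-value: {p_value:.3f}")
```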

  • Researchers want to conduct a study to assess the effectiveness of a new drug, ClaroMax, for prostate cancer. They must clearly define the intervention (ClaroMax), the outcome they want to measure (such as survival time, side effects, etc.), and how it relates to the research question.

  • They will also need to define the study population using inclusion/exclusion criteria (e.g. only include prostate cancer patients over 65).

  • Participants will need to be recruited in a way that allows results to generalize beyond the specific study population (e.g. recruit from multiple clinical settings).

  • The intervention needs to be randomly allocated to participants, such as randomizing at the clinic level.

  • Data on the intervention and outcomes must be accurately collected over time, such as through nurses recording drug administrations and coordinators following up with patients.

  • Once completed, the study would compare outcomes like survival time between the ClaroMax and placebo groups using statistical tests to draw causal conclusions about ClaroMax’s effectiveness.

  • However, randomized experiments are not always possible. Observational studies using techniques like natural experiments can sometimes estimate causal effects from non-experimental data when randomization is not feasible.

One cited natural experiment allowed economists to draw a causal conclusion about the effect of minimum wage increases on employment by comparing employment levels before and after a minimum wage policy change was enacted. The key assumption was that the exact timing of the policy change was essentially random, so the period before the change could be treated as a control group and the period after as an intervention group. Comparing the two periods using an interrupted time series methodology provided evidence about the causal impact of the policy on employment levels.

Here is a summary of the key points about biases and fallacies in data analysis:

  • Interrater and intrarater reliability refer to the consistency of ratings or judgments between or within raters. Low reliability can be an issue for studies involving multiple data collectors. Standardizing data collection processes can help improve reliability.

  • Reporting bias occurs when what is reported does not accurately reflect reality, such as plane crashes receiving more news coverage than tuberculosis despite the latter causing far more deaths.

  • Publication bias is a form of reporting bias where studies showing significant/positive findings are more likely to be published than those without. This can skew the overall evidence.

  • p-hacking is exploring many statistical associations in a data set until a significant finding emerges by chance, inflating the chances of false positives. Pre-specifying analysis plans can help avoid this.

  • Correlation does not necessarily imply causation. Higher mortality rates for one surgeon could correlate with harder case assignments rather than skill.

  • The conditional probability fallacy involves confusing P(A|B) with P(B|A), like assuming anyone with a fever has Ebola just because Ebola causes fever.

  • Absolute versus percentage changes can give different impressions. A 30% increase sounds large, but going from 10 to 13 people is modest in absolute terms.

  • Improved detection, like from new breast cancer screening methods, can cause higher reported rates without affecting actual disease incidence or mortality.

In summary, these biases and logical fallacies need to be guarded against to ensure valid interpretation and avoid misleading conclusions from data analysis.
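A worked example of the conditional probability fallacy, with invented numbers, showing how Bayes' theorem separates P(disease | symptom) from P(symptom | disease):

```python
# Why P(disease | fever) is not P(fever | disease), with invented numbers.
p_fever_given_ebola = 0.90      # most Ebola patients have a fever
p_ebola = 0.0001                # but the disease itself is very rare
p_fever = 0.05                  # fevers are common for many other reasons

# Bayes' theorem: P(Ebola | fever) = P(fever | Ebola) * P(Ebola) / P(fever)
p_ebola_given_fever = p_fever_given_ebola * p_ebola / p_fever
print(f"P(Ebola | fever) = {p_ebola_given_fever:.4f}")   # ~0.0018, not 0.90
```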

Here are the key steps I would recommend to Steve and the data science team to determine if there are different types of application fraud and how to potentially address them:

  1. Perform exploratory data analysis and visualize the application fraud data. Look for patterns, groupings or outliers. This may reveal initial evidence of different fraud profiles.

  2. Run clustering algorithms like k-means to group the fraudulent applications based on their attributes. Specify a range of cluster numbers and evaluate the resulting clusters to identify a good structure.

  3. Interpret the clusters to understand if they represent meaningfully different fraud strategies. Compare the attributes of applications in each cluster.

  4. Build separate prediction models for each cluster, if they seem to capture different fraud types. Evaluate if the cluster-specific models perform better than a single model.

  5. Consider dimensionality reduction techniques like principal component analysis prior to clustering to reduce noise. Techniques like t-SNE can also be used to visualize clustering in lower dimensions.

  6. Continue monitoring new fraud patterns over time. Periodically repeat the analysis to ensure cluster structure and models stay up-to-date. Fraud strategies may evolve.

  7. Communicate findings to stakeholders clearly - focus on the potential business impact of differentiated fraud types and detection capabilities.

The goal is to determine if a single model is oversimplifying, and if targeting specific fraud profiles could improve results. Clustering is a logical first step to explore different types of application fraud present in the data.

  • The data scientist Brett proposes using unsupervised machine learning like principal component analysis (PCA) to explore patterns in fraudulent credit card applications without predicting targets.

  • PCA helps visualize high-dimensional data by reducing dimensions while retaining important information. It maps data to a new space defined by principal components.

  • Components represent combinations of variables that explain variance in the data. The first component explains most variance, second less, and so on.

  • Interpreting component coefficients relative to variables helps characterize what each represents, like one component correlating credit score and early spending with fraud.

  • Focusing on the top few components (e.g. first 3) that explain most variance makes analysis more tractable while retaining important information.

  • Dimensionality reduction through PCA allows faster solutions by working with fewer dimensions/components rather than all original variables.

  • Partnership between data scientists and business experts is important to properly interpret results and meet business needs.

So in summary, PCA is proposed as an initial exploration technique to visualize patterns in fraud data through dimensionality reduction and characterization of principal components.
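A minimal PCA sketch in scikit-learn, using random stand-in data, showing the explained variance ratios and the loadings used to interpret each component:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Stand-in for the application data: 1,000 applications x 20 numeric attributes
X = rng.normal(size=(1_000, 20))

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to scale
pca = PCA(n_components=3)                      # keep the top few components
components = pca.fit_transform(X_scaled)

print("variance explained:", np.round(pca.explained_variance_ratio_, 3))
# The loadings show which original variables drive each component
print("first component loadings:", np.round(pca.components_[0], 2))
```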

  • The data science team identified different clusters/groups of fraudulent credit card applications through principal components analysis and scatterplots.

  • Three main groups were identified based on credit score and speed/types of transactions: “bold, hesitant, and riding the fence.”

  • They planned to build separate models targeted at each cluster to better detect different fraud styles.

  • Using the principal components directly as inputs could provide a stronger yet more minimal model.

  • Reason codes would help investigators understand why accounts were flagged and what type of fraud was suspected.

  • Too many clusters could exceed investigators’ capacity to consider different protocols, so discussions on operational limitations were needed.

  • Cluster profiling describes each cluster based on distinguishing variable averages to understand differences.

  • Getting feedback from fraud experts on cluster distinguishability and usefulness could improve the analysis.

  • Exploration of clusters can also uncover new predictive features not originally in the model.

The summary focuses on the key aspects of identifying clusters of fraudulent applications, using them to build targeted models, linking the results to investigators, and getting feedback to improve the analysis, without discarding critical details.

  • Brett explained K-means clustering to Steve, noting it is a popular partitioning method where the number of clusters K is preset. Observations are assigned to the nearest cluster center and centroids are recalculated iteratively until cluster assignments stabilize.

  • They discussed how to determine the optimal number of K - by testing all values from 1 to 10 clusters and letting the data show the best solution. Multiple measures can assess cluster quality.

  • Hierarchical clustering was also described. It creates nested clusters similar to Russian dolls, either bottom-up (agglomerative) or top-down (divisive). Distance measures impact results.

  • Clustering can segment customers operationally or build separate predictive fraud models for clusters. Labeled data can help interpret clusters by known fraud types.

  • For Steve's data, Brett ran K-means and found six clusters to be optimal. Principal component analysis was used to preprocess the data first.

  • Clustering allows building more accurate predictive models tailored to specific customer segments vs one single model. This improves fraud detection rates and prevents more losses.
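A small sketch of the K-selection process Brett describes, scanning candidate values of K on synthetic (PCA-style) data; the data and the silhouette score used here as the quality measure are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
# Synthetic stand-in for the (PCA-preprocessed) application data
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 3)) for c in (0, 3, 6)])

for k in range(2, 11):                      # scan candidate values of K
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, km.labels_) # one of several cluster-quality measures
    print(f"K={k:2d}  inertia={km.inertia_:9.1f}  silhouette={score:.3f}")
```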

  • The company Stardust Health is seeing rising costs and poor decision making around approving or denying spine surgeries for back pain through their prior authorization process.

  • They want to build a machine learning model to help predict which patients have a high likelihood of success with nonsurgical treatment, so those patients could be denied surgery. This would help control costs and ensure better outcomes.

  • Scoping the problem is important - they need to define the prediction goal, outcome measure, data sources, and intended model use.

  • The outcome being predicted is the chance of a patient’s success with nonsurgical treatment. Possible outcomes discussed are overcoming back pain, avoiding hospitalization/complications, or not needing additional care.

  • Insurance claims data will be used since it provides accessible historical data on patient health care utilization, but it does not directly capture patient-reported pain levels.

  • The goal is prediction, not explanation, so the model building approach will differ from a prior cost explanation model built by the data team.

  • The team discusses using healthcare utilization as the outcome variable for their predictive model, looking specifically at utilization related to back pain like medications, injections, and physical therapy. They agree to measure utilization over a 3-year period after the initial prior authorization request for surgery.

  • They acknowledge this outcome could introduce bias if some patients are less likely to seek healthcare. They plan to account for this by analyzing differences in utilization across patient groups.

  • Setting criteria like maximum age (e.g. 60 or younger) is proposed to avoid issues with patients who may die within the measurement period.

  • Features are identified as key determinants of model success. Temporal and data type restrictions are discussed. Features can be engineered from various data sources like clinician notes, lab results, and claims data by converting unstructured data into a table format the model can understand.

  • Claims data in particular offers opportunities to create new features by aggregating and combining existing data in different ways, like counting prior hospitalizations from claims records.

  • In summary, the team works through decisions around the outcome, population, available data sources, and approach to feature engineering to define the core components needed to build their predictive model.

  • Feature engineering involves creating new features from existing data that may help predict the outcome. Some techniques discussed include aggregating data (e.g. total visits, medications), calculating rate of change (e.g. % increase in prescriptions over time), and features from text/PDF data.

  • Having more features is better up to a point, as it gives the model more information to learn from. However, too many features relative to the amount of data can cause overfitting. The ratio of data points to features should generally be at least 10:1.

  • During model training, features are evaluated to identify the most predictive ones. Methods include filtering low-variance features, wrappers that add/remove features, and embedded methods where selection is part of model optimization.

  • Model testing involves showing the model previously unseen data to evaluate its predictive performance after training. The goal is for the model to generalize what it learned from the training data.

  • Feature selection is important to identify relevant predictors without introducing too much noise. Both domain knowledge and data-driven methods are useful. Methods like LASSO perform selection automatically during model optimization.

  • In summary, feature engineering, selection, training and testing are key parts of building an effective predictive model from data. The goal is for the model to learn meaningful patterns without overfitting.
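A brief sketch of embedded feature selection with LASSO in scikit-learn, on synthetic data where only a few features truly matter:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
# 1,000 patients x 30 engineered features; only the first 3 truly matter
X = rng.normal(size=(1_000, 30))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(scale=0.5, size=1_000)

lasso = LassoCV(cv=5).fit(X, y)           # selection happens during optimization
selected = np.flatnonzero(lasso.coef_)    # features with nonzero coefficients
print("features kept by LASSO:", selected)
```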

Here is a summary of the key points about how machine learning models identify statistical patterns:

  • Models are trained on labeled training data to find patterns that predict the labels/outcomes. This involves estimating parameters that optimize the “fit” of the model to the training data.

  • For simple linear regression models, the parameters are the slope and y-intercept of the best-fit line. For more complex models, there are many more parameters.

  • The goal is to find the parameters that minimize the error between predicted and actual labels on the training data. This process is called “fitting” the model.

  • Once trained, the model can make predictions on new, unlabeled data by applying what it learned from the training process.

  • It’s important to validate the model on a separate test set to avoid overfitting, where the model fits the noise in the training data too closely.

  • Performance could degrade significantly on new data that is very different from the training data in meaningful ways. Representativeness and diversity of training data is important.

  • The validation set can be used for iterative development, but final performance should be evaluated on the held-out test set to avoid “cheating.”

So in summary, models learn statistical patterns by optimizing parameters during training to fit labeled examples, and these patterns are then applied to make predictions on new examples.

The passage discusses various approaches to evaluate machine learning models, including cross-validation, tuning hyperparameters, and measuring model performance.

Cross-validation is recommended over using a single validation set to avoid overfitting. It involves splitting the data into folds and training/validating on different combinations of folds.

If a model’s performance is not satisfactory, there are three main ways to improve it: modifying the data (adding features), changing the model type/structure, and tuning hyperparameters.

To determine if a model is good enough, its performance needs to be compared to requirements for the intended use case. There is no fixed rule for when to stop improving - it depends on tradeoffs like feature engineering costs.

Common metrics for regression models include mean absolute error and mean squared error. For binary classification, metrics include Brier score and calibration. Brier score evaluates predicted probabilities against outcomes, while calibration checks if predicted probabilities match true probabilities.

The key points are that cross-validation helps avoid overfitting, and there are different levers to pull and metrics to use depending on the type of model and prediction task. Determining acceptable performance involves understanding application needs.
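A hedged sketch of these evaluation ideas on a synthetic classification problem: cross-validation instead of a single split, plus a Brier score on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1_000)

# 5-fold cross-validation instead of relying on a single validation split
cv_acc = cross_val_score(model, X, y, cv=5)
print("cross-validated accuracy:", np.round(cv_acc, 3))

# Brier score on a held-out set: evaluates predicted probabilities, lower is better
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("Brier score:", round(brier_score_loss(y_te, probs), 3))
```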

Based on the passage, some key points about building and improving machine learning models include:

  • Start with a simple model first to validate the approach and understand the data, before building more complex models. This helps manage resources efficiently.

  • It’s common to not achieve perfect performance (e.g. 0% incorrect decisions), as there are inherent uncertainties in data and outcomes. The goal is iterative improvement.

  • After initial validation of a simple model, more advanced modeling techniques can be explored to further optimize performance.

  • Regular feedback from domain experts on model performance helps identify areas for improvement and refine model objectives.

  • An iterative, evidence-based approach of building, evaluating, getting feedback and making incremental enhancements is effective for machine learning applications in practice.

  • Collaboration between data scientists and domain experts is important throughout the model development process.

So in summary, the key is to start simple, validate the approach works, get operational feedback, and then explore more sophisticated techniques - while recognizing perfect performance may not be achievable and focusing on iterative improvement. This balanced approach manages resources effectively.

  • The initial model used linear regression with features like demographics, medical claims, and prior authorization details to predict healthcare expenditures. It had an average error of $12,000.

  • To improve it, David organized a hackathon where teams tried different approaches.

  • The first team examined residuals and found manual laborers had higher errors. Adding job features like physical demands reduced the error for that group by 40%.

  • The second team added interaction terms between all feature pairs to model nonlinear relationships. They also transformed features like age that weren’t linear. This reduced the average error to $8,000.

  • The third team used weighting and outlier handling to make the model more robust. Down-weighting frequent but low-cost patients addressed skewness in the data, and removing outliers improved extrapolation to new data.

  • Each team improved the model without changing its architecture, instead applying techniques like residual analysis, interaction terms, transformations, weighting, and outlier handling. This showed that complex models may not help if the right tweaks to simpler models aren't tried first.

Here is a summary of the key points about the data used to train the model:

  • The training data included prior authorization requests from the past 5 years.

  • About 2-3 years ago, the company changed its procedures to make it easier for Medicare Advantage patients (typically aged 65+) to submit prior authorizations.

  • This meant the demographic of patients requesting authorizations skewed older in the past 2 years of data compared to earlier years.

  • The training data included 3 years of data from before the policy change and 2 years after, when the patient demographic had changed.

  • There was a concern that the older data from before the change would skew the model toward younger patients and not reflect the current demographic.

  • To address this, the team used weighted regression to weight the more recent 2 years of data more heavily than the older 3 years when fitting the model.

  • This allowed them to leverage more of the data while accounting for the shift in patient demographic over time.
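A minimal sketch of that weighting idea, giving the post-change years three times the weight of the earlier years; the data, weights, and cut-off year are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
# X, y: features and outcomes for five years of prior-authorization data
X = rng.normal(size=(5_000, 6))
y = X @ rng.normal(size=6) + rng.normal(size=5_000)
year = rng.integers(2019, 2024, size=5_000)   # calendar year of each record

# Weight the two post-change years more heavily than the three earlier years
weights = np.where(year >= 2022, 3.0, 1.0)

model = LinearRegression().fit(X, y, sample_weight=weights)
print("coefficients:", np.round(model.coef_, 2))
```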

  • Naive Bayes is a simple and fast classification algorithm that works well with small datasets and high-dimensional feature spaces. It requires little training data to estimate parameters for classification.

  • It works by calculating, for each class, the probability of each feature independently and multiplying these probabilities together with the class prior. This assumes features are conditionally independent given the class.

  • The training process involves estimating the prior probability of each class and the conditional probability of each feature given the class. These probabilities are then used to classify new data points.

  • It may not perform as well if the independence assumption doesn’t hold and features are highly correlated. However, it is powerful and efficient for classification problems with many features.

  • Decision trees like CART can model nonlinear relationships without needing to specify them manually. They are easy to interpret and work well with mixed data types and missing data.

  • Ensembling methods like stacking and bagging combine multiple models to create a single, more accurate predictive model. This allows weighting individual models where they perform best to improve overall predictions.
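A short Naive Bayes sketch in scikit-learn, illustrating on synthetic data the fit-then-predict-probabilities flow described above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1_000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)        # estimates class priors and per-feature
                                         # conditional distributions
print("test accuracy:", round(nb.score(X_te, y_te), 3))
print("class probabilities for one case:", nb.predict_proba(X_te[:1]).round(3))
```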

  • Steve was promoted to help build models for a new Shu Financial division focused on buying, renovating, and renting homes.

  • He will work closely with Jerry, who has business development and data science experience, to develop the data science strategy.

  • Jerry emphasizes focusing on critical problems first while flagging nice-to-haves for later.

  • AI refers broadly to automating tasks usually done by humans. Deep learning uses neural networks with many layers, but other methods may work better depending on the problem.

  • Autoencoders use neural networks to compress input data into fewer features, such as for image compression. The goal is to remove unnecessary information while keeping essential components.

  • Jerry advises focusing on solving problems rather than buzzwords, and using the right tool for each problem rather than whatever was recently read about.

  • Their key challenge is accurately predicting future rental prices, which drives their financial model’s return on investment calculations for purchase and renovation decisions.

  • Jerry and Steve are working on building a model to predict future rental prices for a company. Currently they are using an off-the-shelf model but it is expensive and not very accurate.

  • Steve suggests building their own model using machine learning techniques. Jerry agrees this is a good idea but notes they will need relevant data to build an accurate model.

  • They discuss potential data sources like real estate listings, neighborhood data from third parties, government census data, natural language analysis of home descriptions, and computer vision of property photos.

  • Steve will focus first on learning natural language processing to analyze home descriptions. Jerry advises starting small with NLP since it’s a complex area.

  • Charissa approves Brett joining the project to help combine business and data science knowledge.

  • Jerry and Steve plan an initial focus on identifying important data sources and features before building modeling techniques.

  • A brainstorming session identifies additional potential features from property manager insights like online home viewing reviews and key words used by real estate brokers.

  • Steve realizes he needs basic familiarity with techniques like NLP rather than expert-level knowledge to help guide the business decisions.

  • Common NLP preprocessing steps include data cleaning, tokenization, removing stop words, stemming/lemmatization. These prepare the text for analysis.

  • Bag-of-words and term frequency-inverse document frequency (tf-idf) are commonly used to measure similarities between texts and importance of words.

  • Word2vec creates word vectors to measure semantic similarity between words. It can be used for sentence completion.

  • Sentiment analysis uses text to identify if a sentiment is positive, negative or neutral. It can provide additional insights for tasks like rental price estimation.

  • Sentiment analysis approaches include rule-based (using predefined word lists) and machine learning (trained on pre-scored texts).

  • To determine if a team is truly doing NLP vs just text searching, ask detailed questions about the techniques, tools and reasoning for their approach. More advanced NLP uses complex models trained on large datasets.

In short, the team laid out the specific NLP preprocessing, modeling, and analysis techniques they planned to apply, including bag-of-words, word2vec, and sentiment analysis, to gauge whether these provided additional insight for the rental price estimation task.
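A compact tf-idf sketch with scikit-learn, using made-up listing descriptions; the vectorizer handles tokenization, lowercasing, and stop-word removal internally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "Sunny two-bedroom apartment with renovated kitchen near downtown",
    "Spacious family home, large backyard, newly renovated bathrooms",
    "Cozy studio close to transit, hardwood floors, pet friendly",
]   # invented listing descriptions

# Tokenization, lowercasing, and stop-word removal happen inside the vectorizer
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(descriptions)      # documents x terms matrix

print(vectorizer.get_feature_names_out()[:10])
print(tfidf.shape)      # these weights can feed a rental-price model as features
```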

  • Large language models (LLMs) like GPT, LLaMA, and PaLM have significantly advanced natural language processing (NLP) by enabling the generation of human-like text. They work by predicting the most likely next word(s) based on their training.

  • With simple prompt engineering, users can now perform tasks like sentiment analysis and text extraction using LLMs with minimal effort.

  • The most effective use of LLMs is to combine them with high-quality, domain-specific datasets. Proprietary organizational data is more valuable than any specific model.

  • While powerful, LLMs have limitations like hallucination and need to be deployed carefully with security in mind.

  • Geospatial analysis involves using location data like latitude/longitude as part of modeling. Key sources include property locations and demographic/quality of life data available at various geographic levels.

  • Linking location data to other datasets and assigning values allows inclusion of geospatial features. Metrics like average prices within a radius can be computed.

  • More advanced analysis includes travel times, density mapping, and other GIS techniques for data visualization and insight.

  • Geospatial data comes in vector and raster formats, with the latter representing gridded/pixel data useful for mapping trends.

  • Combining available internal and external data sources with geospatial analysis can enrich predictive modeling.
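A small sketch of the "average price within a radius" idea, computing haversine distances from a target property to a few hypothetical comparables:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical comparable listings: latitude, longitude, monthly rent
lats = np.array([40.71, 40.72, 40.75, 40.80])
lons = np.array([-74.00, -74.01, -73.98, -73.95])
rents = np.array([2800, 3100, 3500, 2600])

target_lat, target_lon = 40.715, -74.005
dist = haversine_km(target_lat, target_lon, lats, lons)
within_2km = dist <= 2.0
print("average rent within 2 km:", rents[within_2km].mean())
```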

Here is a summary of the key points about potential features of computer vision for use in rental price modeling:

  • Computer vision can analyze images and videos to extract useful information like objects present, their location, distinguishing characteristics, distance, movement, and anomalies.

  • For real estate applications, computer vision could identify features of homes like roof/chimney condition, kitchen appliances, bathroom fixtures, floor/ceiling quality from photos.

  • Convolutional neural networks are commonly used computer vision algorithms that can identify objects and their properties with various processing layers.

  • Additional photos of properties may need to be collected from listings, social media, or on-site photos to have a sufficient dataset for computer vision analysis.

  • Computer vision outputs like probabilities of features present could be added as new data points to enhance rental price models.

  • Network analysis of real estate agents was also proposed, to examine how well-connected agents may achieve higher prices and identify inaccurate listing descriptions. This would require access to an agent social network like Agentster.

So in summary, computer vision has potential to automatically extract new visual data points on property features that could strengthen predictive models, if sufficient imagery data sources can be accessed.
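For orientation only, here is a deliberately tiny convolutional network in PyTorch; real property-photo models would be much larger and typically pretrained, and the output classes here are hypothetical.

```python
import torch
import torch.nn as nn

# Stacked convolution + pooling layers extract visual features, then a linear
# layer scores a hypothetical property attribute (e.g. "modern kitchen present").
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2),    # two hypothetical output classes
)

photo_batch = torch.randn(4, 3, 64, 64)       # four 64x64 RGB listing photos
scores = model(photo_batch)
probs = scores.softmax(dim=1)                 # probability of each class
print(probs.shape)                            # (4, 2)
```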

  • Networks consist of nodes (individuals/entities) and connections between nodes. Common ways to represent networks include adjacency matrices and edge lists.

  • The brokerage network being analyzed is undirected, meaning connections go both ways. Some networks are directed, with connections only going one way.

  • Individuals are more likely connected to similar others (homophily). Connections also tend to form between friends of friends (triadic closure).

  • Key network metrics include density (how many connections exist out of all possible ones), average path length (typical number of steps between nodes), and the network's diameter (the longest path between any two nodes).

  • Important node-level metrics include number of connections (degree centrality), eigenvector centrality (quality of connections), and closeness centrality (distance to other nodes). These help identify influential nodes.

  • While degree centrality is simple, eigenvector and closeness centrality provide more meaningful measures of influence by accounting for quality/strength of connections and distance within the network.

  • Calculating these standard centrality metrics is straightforward and can help identify key brokers to analyze in predicting rental pricing outcomes.
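A minimal networkx sketch computing those centrality measures on a toy, undirected broker network with invented names:

```python
import networkx as nx

# Toy undirected broker network; edges are professional connections (invented)
G = nx.Graph()
G.add_edges_from([
    ("Ana", "Ben"), ("Ana", "Cal"), ("Ana", "Dee"),
    ("Ben", "Cal"), ("Dee", "Eli"), ("Eli", "Fay"),
])

degree = nx.degree_centrality(G)                       # number of connections
eigen = nx.eigenvector_centrality(G, max_iter=1000)    # quality of connections
closeness = nx.closeness_centrality(G)                 # distance to everyone else

for broker in G.nodes:
    print(f"{broker}: degree={degree[broker]:.2f}, "
          f"eigenvector={eigen[broker]:.2f}, closeness={closeness[broker]:.2f}")
```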

Here are the key points from the summary:

  • Kamala and her company were concerned about data ethics after a competitor faced backlash for an algorithm that discriminatorily increased insurance premiums for Black patients.

  • The CEO called a meeting to develop a comprehensive data ethics strategy to avoid similar issues.

  • Kamala and David were tasked with leading the strategy development, as Kamala oversees decision-making teams and David’s team is closest to the data handling.

  • The CEO wants a strategy to govern everything from data storage and analysis to use in decision-making, to minimize harms and disservices to patients.

  • Kamala recognizes she needs to quickly get up to speed on the technical aspects of how data is handled, as David’s team would know best.

The main focus is on developing a strategy to ensure data is collected, handled, and used ethically by the company to avoid discrimination or unfair harm to patients. Kamala needs to collaborate with David's technical team to understand current practices and develop comprehensive policies. Timeliness is also important, as the CEO wants a strategy within a couple of days.

  • Kamala asks David if there is a code of ethics for data scientists similar to the Hippocratic Oath for doctors.

  • David outlines several existing codes of ethics for data science, including oaths from the National Academies and checklists from organizations like Datapractices.org.

  • The key concepts in these codes are fairness, privacy/security, transparency/reproducibility, and social impact.

  • David then discusses bias and fairness in machine learning models. He explains how biases in training data can lead to biased models, and provides examples like Google Photos labeling Black people as gorillas.

  • Kamala shares an analogy about medical textbooks only showing light-skinned patients, which can lead to poorer treatment of skin conditions in dark-skinned patients.

  • They discuss how metrics like group fairness and accuracy across demographic groups can be used to quantify and address biases in machine learning models. The goal is ensuring equitable performance and outcomes for all groups.

In summary, David outlines existing codes of ethics for data science and explains how biases can arise in machine learning if not properly addressed. They discuss the importance of fairness testing and metrics to identify and mitigate biases.

  • Predictive models need to be evaluated for fairness and potential bias before deployment, to avoid exacerbating existing healthcare inequities.

  • Models should be tested on different racial/demographic subgroups to check for differences in accuracy or outcomes.

  • If bias is found, potential remedies include removing biased data elements, modifying predictions across groups, or debiasing techniques during training.

  • Sensitive variables like race are complex - they may be predictive but could also encode real-world biases. Approaches include checking necessity, data quality, and excluding them from models like credit/insurance where prohibited.

  • Representativeness of training data is important - models should reflect intended usage populations.

  • “Data drift” can occur over time as populations change, affecting model performance. Periodic re-evaluation and retraining may be needed to adjust for concept or feature drift.

So in summary, thorough fairness evaluations, representative data, transparency around sensitive variables, and monitoring for data drift over time are all important considerations for developing models for healthcare applications in an ethical manner.
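
To illustrate the representativeness point, here is a minimal sketch comparing a hypothetical training sample’s demographic mix against the population a model is meant to serve. All counts, group names, and shares are invented for the example.

```python
import pandas as pd

# Hypothetical group counts in the training data vs. shares in the intended population
# (e.g., taken from census or member enrollment data).
train_counts = pd.Series({"A": 7200, "B": 1800, "C": 1000})
population_share = pd.Series({"A": 0.55, "B": 0.30, "C": 0.15})

train_share = train_counts / train_counts.sum()

report = pd.DataFrame({
    "train_share": train_share.round(3),
    "population_share": population_share,
    "gap": (train_share - population_share).round(3),
})
print(report)
# Large negative gaps indicate groups under-represented in the training data,
# which may warrant collecting more data or reweighting during training.
```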

  • When deploying a machine learning model, it is important to monitor for data drift over time by comparing new input data to the original training data on a recurring basis. This helps detect if the data distribution is changing in a way that impacts model performance.

  • It is also important to monitor model performance over time to catch issues like calibration drift. If data drift or performance degradation is detected, the model may need to be retrained on more recent data (a minimal drift check is sketched after this list).

  • When working with sensitive health data, strong data security practices must be followed to comply with regulations like HIPAA and protect patient privacy. This includes encrypting devices, secure servers with access restrictions, using VPNs, and de-identifying data when possible.

  • For research to be credible and reproducible, independent scientists should be able to design similar experiments and achieve matching results. Lack of reproducibility could point to scientific uncertainty, errors, or misconduct.

  • In data science, reproducibility could be facilitated by openly sharing data and code. However, privacy and IP issues may sometimes prevent full sharing. Documentation of methods and decisions is also important for reproducibility. Overall transparency improves accountability and allows others to evaluate scientific claims.
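
As a rough sketch of the drift monitoring described earlier in this list, the following compares a feature’s distribution at training time with recent production data using a two-sample Kolmogorov-Smirnov test. The synthetic data, the choice of test, and the alerting threshold are all illustrative assumptions rather than the book’s prescribed method.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical numeric feature: values seen at training time vs. values arriving in production.
train_feature = rng.normal(loc=50, scale=10, size=5000)
recent_feature = rng.normal(loc=55, scale=12, size=1000)  # the distribution has shifted

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the distributions differ.
stat, p_value = ks_2samp(train_feature, recent_feature)
print(f"KS statistic={stat:.3f}, p-value={p_value:.4f}")

# The alerting threshold is a judgment call; 0.01 here is purely illustrative.
if p_value < 0.01:
    print("Possible data drift detected - consider re-evaluating or retraining the model.")
```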

It would waste company resources to have data scientists struggling to reconstruct past work because it was never clearly documented. Without documentation, analyses must be redone and prior explorations rediscovered, which slows projects down and squanders effort. Clear documentation lets data scientists build efficiently on what has already been completed instead of repeating it.

Besides generating useful insights, machine learning can often serve as a data-processing step before a final predictive model is developed. Supervised machine learning aims to predict outcomes and is widely used across industries for applications such as predicting sales, assessing risk levels, and detecting spam. While techniques vary, the basic process is similar: preparing data, identifying important features, training and testing models, and improving performance. A key question is assessing impact: did the implementation actually make a difference? Randomized studies comparing groups that did and did not receive a program are often used (a simple comparison of this kind is sketched below); other approaches are possible but require care to avoid mistaking bias for a true program effect. Ethical considerations have also grown in importance, to ensure technical solutions are not just scientifically sound but also morally right.
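
Here is a minimal sketch of the randomized-comparison idea: testing whether outcomes differ between a treatment and a control group with a chi-square test. The scenario, counts, and choice of test are assumptions for illustration; a real impact assessment would involve more careful design and analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical randomized comparison: customers assigned to a new outreach program vs. business as usual.
# Rows: treatment / control; columns: repaid / did not repay. All counts are made up.
table = np.array([
    [220, 780],   # treatment group: 1,000 customers, 220 repaid
    [180, 820],   # control group:   1,000 customers, 180 repaid
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square={chi2:.2f}, p-value={p_value:.4f}")
# A small p-value suggests the difference in repayment rates is unlikely to be due to chance alone,
# which is the kind of evidence a randomized study can provide about program impact.
```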

Here is a summary of the key points from the paper “A Draft of a Paper for Interface ’98” (unpublished manuscript, January 1998):

  • The paper discusses best practices and frameworks for successful data science projects, including project management methodologies, team structures, defining goals and metrics, and data foundations.

  • It outlines elements important for success like defining clear objectives, using an iterative process, managing expectations, ensuring resources are available, and establishing governance.

  • Statistical and data analysis foundations covered include data cleaning, missing-data imputation, descriptive statistics, correlation, hypothesis testing, parameter estimation, and model selection (a small illustrative sketch follows this summary).

  • The paper discusses making decisions from data through approaches like causal inference, assessing relationships between variables, and drawing conclusions rather than just predictions.

  • Overall it provides an overview of frameworks, processes, and analytical techniques relevant for building and managing effective data science projects that can inform decisions. The references cited cover topics in project management, statistics, data analysis, and causal inference.
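
To make a couple of the statistical foundations above concrete, here is a minimal sketch of median imputation, descriptive statistics, and a correlation on a small, made-up table. It only illustrates the kinds of steps listed; it is not drawn from the paper itself.

```python
import numpy as np
import pandas as pd

# Hypothetical claims dataset with missing values to illustrate simple cleaning steps.
df = pd.DataFrame({
    "age":        [34, 51, np.nan, 45, 29, 62],
    "claim_cost": [1200, 3400, 2100, np.nan, 800, 5100],
})

# Median imputation: one of the simplest missing-data strategies.
df_clean = df.fillna(df.median(numeric_only=True))

# Descriptive statistics and the correlation between the two variables.
print(df_clean.describe())
print("Correlation:", df_clean["age"].corr(df_clean["claim_cost"]))
```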

Here is a summary of key points from the papers:

  • Paper 1 examines the effect of prior authorization on opioid prescribing rates and healthcare costs. It finds prior authorization reduced opioid prescriptions filled and lowered healthcare spending.

  • Paper 2 argues for more conservative treatment of chronic back pain given limited evidence supporting common interventions like opioids and surgery. It advocates for “backing off” such treatments.

  • Paper 3 provides an overview of data science and discusses ethical issues like privacy, bias, and fairness that must be addressed when building models using patient data.

  • Paper 4 describes the data and demographics of patient cohorts in a clinical study of back pain treatments. It details characteristics like age, gender, symptoms for groups with disc herniation, stenosis and spondylolisthesis.

  • Paper 5 identifies barriers to healthcare access in rural areas such as lack of insurance, providers, transportation and health literacy. It discusses how these factors impact utilization and health outcomes.

  • Paper 6 discusses techniques for auditing algorithms used in risk scoring and predictive models to assess fairness and address biases that could disadvantage certain groups. It emphasizes the importance of model transparency.

Here is a summary of the key points from the paper “Accountable Decision Systems, New York, February 2018”:

  • The paper discusses the need for accountability in machine learning systems and decision making algorithms. As these systems are increasingly used to make important decisions about people’s lives, it is critical that they can be explained and evaluated.

  • Different technical approaches for developing accountable systems are described, including model explainability techniques like LIME and anchors, which allow a model’s predictions or decisions to be explained on a case-by-case basis.

  • Procedural accountability ensures a fair decision making process through mechanisms like impact assessments, oversight, and opportunities for individuals to challenge or appeal decisions.

  • The tradeoffs between maximizing predictive performance and ensuring accountability are discussed. In some cases it may be necessary to sacrifice some predictive power to gain accountability.

  • A principled framework for accountable algorithmic decision making systems is proposed, incorporating both technical and procedural accountability. This could guide the development, assessment and governance of these important systems.

  • Overall the paper argues that accountability must be prioritized as machine learning is increasingly applied to domains like healthcare, justice and public services where decisions profoundly impact people’s lives. Both technical and procedural approaches are needed to ensure these systems and their decisions can be properly understood, evaluated and challenged.

Here is a summary of the referenced article (48/rg.2016150080):

This article discusses the use of machine learning and data science techniques for real estate analysis and applications. It first provides an overview of common natural language processing techniques, such as text classification, sentiment analysis, and named entity recognition, that can be used to analyze real estate listings, documents, and online discussions. It then discusses machine learning algorithms for prediction and analytics, such as regression, clustering, and neural networks. Potential applications include predicting real estate prices, identifying new market opportunities, applying computer vision to property photos, and using social media data. The article also reviews network science and graph analytics approaches for understanding real estate markets as complex networks, and it acknowledges challenges like data biases and ethical issues that need consideration. Overall, it presents a broad survey of data science and AI techniques applicable to common real estate analysis scenarios.

Here are summaries of the key papers referenced:

  1. Liu et al. describe a deep learning system that achieved dermatologist-level accuracy for classifying various skin conditions using a dataset of 129,450 clinical images.

  2. Verma and Rubin discuss different definitions of fairness in machine learning and their implications.

  3. Kamiran and Calders discuss techniques for preprocessing data to help reduce discrimination in classification models.

  4. Zhang et al. and Zafar et al. propose adversarial learning and optimization-based approaches, respectively, for mitigating unwanted biases in models.

  5. Jhala and Majumdar provide an overview of model checking techniques for formally verifying properties of software systems.

  6. Bellamy et al. describe AI Fairness 360, an open-source toolkit for detecting and mitigating unwanted bias in machine learning models, including risk assessment models.

  7. Gama et al. survey approaches for detecting concept drift, where the statistical properties of a target variable change over time.

  8. Several papers discussed techniques for detecting concept drift in streaming data.

  9. Annas discusses the introduction of HIPAA regulations governing privacy of medical records in the US.

  10. The National Academies publication discusses challenges and best practices for achieving reproducibility in scientific research.

  11. Two references discuss OpenAI’s decision not to publicly release one of its text generation models due to safety and bias concerns.

Here is a summary of the key terms:

  • Large language models and neural networks, built and maintained by machine learning engineers, underpin natural language processing applications such as sentiment analysis, speech recognition, and machine translation.

  • Linear regression, logistic regression, and other predictive models like random forests and neural networks are commonly used supervised learning algorithms. Regularization techniques like LASSO and Ridge regression help with overfitting.

  • Network analysis examines properties of nodes, edges, connectivity, and centrality in networks. Unsupervised clustering can find patterns in unlabeled data.

  • Randomized experiments, along with quasi-experimental methods like regression discontinuity for observational data, help establish causality by reducing selection bias.

  • Core machine learning tasks involve feature engineering, model training and tuning, performance evaluation, and deployment. Programming languages like Python and R are commonly used (a short Python sketch follows this list).

  • Privacy, security, bias, and other ethical concerns are important to consider, as are reproducing and sharing results transparently. Stakeholder needs around objectives, impacts, and success measures are also key.
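
Tying a few of these terms together, here is a minimal end-to-end sketch: synthetic data, a train/test split, an L1-regularized (LASSO-style) logistic regression, and a simple accuracy check. Everything here, including the scikit-learn choices and parameter values, is an illustrative assumption rather than an example from the book.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real feature table (all values are made up).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out a test set to evaluate performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# L1 (LASSO-style) regularization shrinks some coefficients to zero, which can reduce overfitting.
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Nonzero coefficients:", (model.coef_ != 0).sum())
```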

#book-summary