Developing a ‘Smart’ Recovery
Monitoring and Support System
John J. Curtin, Ph.D.
University of Wisconsin-Madison
April 9, 2025
About a decade ago I was approaching the middle of my career. I had developed a successful basic clinical science research program as a psychophysiologist running experiments to understand the effects of drugs and drug withdrawal on stress.
The work was intellectually stimulating, we were publishing it in good outlets, and we were getting grants, but my heart was increasingly not in it.
I’d become a clinical psychologist to help people struggling with alcohol and other SUDs.
My paternal grandmother died of complications secondary to alcoholism.
My dad has struggled with his use of alcohol for his entire adult life, and during periods when he lost control, it affected all of us.
My cousin, Stephen, has a severe substance use disorder and has been incarcerated several times for drug-related offenses. He has had periods of stability, but they have always ended in another relapse.
My Aunt Cathy and Stephen’s brother, Colin, had reached out to me on numerous occasions to ask what could be done to help Stephen.
And it was those conversations that really got me thinking about how I could re-direct my research program to help people like Stephen and my dad and my grandmother.
Precision Mental Health for Continuing Care
It was around that time that my colleague, Dave Gustafson, reached out to me. Dave directs a center at UW that develops digital therapeutics for substance use disorders. These are essentially smartphone apps that provide ongoing continuing care for patients during their recovery. He had just completed a large randomized controlled trial demonstrating that his app meaningfully decreased heavy drinking days and increased abstinence rates over the first year of recovery.
However, he also noticed many of the people who had relapsed hadn’t used the app in the days leading up to that relapse. And others who had relapsed hadn’t used the specific supports in the app that he would have thought would be most effective for them.
Precision Mental Health for Continuing Care
“Could you predict not only who might be at greatest risk for relapse … but precisely when that relapse might occur … and how best to intervene to prevent it?”
Dave knew that we were exploring the factors that motivated alcohol and other drug use and he asked us a simple question:
“Could you predict not only who might be at greatest risk for relapse but precisely when that relapse might occur and how best to intervene to prevent it?”
… because if we could develop a system to do this, he could embed it into his app to guide people to the most effective supports at the most critical moments in their recovery.
Precision Mental Health for Continuing Care
Precision mental health requires us to provide the right interventions and supports to the right people at the right time, every time
SUD continuing care requires
Long-term monitoring
Ongoing lifestyle adjustments and support
These questions that Dave was asking are at the heart of what we now call precision mental health. How can we provide the right interventions and supports to the right people at the right time, every time?
And this focus on the right time is particularly important for recovery from substance use disorders. Substance use disorders are chronic relapsing conditions and therefore successful recovery requires lifelong monitoring and support to prevent relapse.
And, critically, the optimal supports for any specific individual can change month to month, day to day, and even from moment to moment.
Precision Mental Health for Continuing Care
Precision mental health requires us to provide the right interventions and supports to the right people at the right time, every time
SUD continuing care requires
Long-term monitoring
Ongoing lifestyle adjustments and support
A “Smart” Recovery Monitoring and Support System can provide temporally precise, dynamic, personalized continuing care by combining:
Sensing
Artificial Intelligence/Machine learning
And it goes without saying that this is hard, and it's why lapses and relapses are so common for so many people in recovery.
But we believed that we could harness and combine two technologies that were emerging at that time, personal sensing and artificial intelligence algorithms, to develop a smart recovery monitoring and support system that could both predict lapses before they occurred and provide personalized support and recommendations to patients about how to prevent those lapses from occurring.
And what I’d like to do today is tell you a bit more about how we are doing this, what we have learned so far, and where we are going next with this system
Model Output: Lapses
Lapses
are clearly defined,
have a temporally precise onset, and
can serve as an early warning sign for relapse (precede and predict)
“Abstinence violation effects” can increase relapse risk
Even a single lapse can result in overdose and/or death for some drugs
[PAUSE]
OK - so how do we develop this Recovery Monitoring and Support System?
To start, we have begun to develop risk prediction models that focus both on predicting and explaining future lapses
Our focus is on future lapses, rather than other clinically meaningful outcomes like substance use-related problems or full-blown relapse, for several reasons.
To start, lapses are
Clearly defined and have a temporally precise onset
They can serve as an early warning sign for relapse because they both precede and predict it
Lapses are also important targets for intervention because we know that maladaptive thoughts and feelings following a lapse - often called abstinence violation effects - can start a downward spiral that leads to relapse by itself if not addressed
And sadly, for some drugs, even a single lapse can result in overdose and death
So for these reasons, we are developing risk models that predict the probability of a future lapse
Lapse Prediction for AUD
151 individuals with moderate to severe AUD
Early in recovery (1-8 weeks)
Committed to abstinence throughout study
Followed with sensing for up to 3 months
Ecological Momentary Assessments
Contextualized Geolocation
Contextualized Smartphone Communications
(also sensed physiology, sleep, coarse self-report)
So let me transition now to describing the progress we have made so far to develop this recovery monitoring and support system for SUD
As the first step, in 2020 we completed a first NIAAA-funded project where we recruited 151 participants who were in early recovery from a moderate to severe alcohol use disorder.
These participants were committed to abstinence at the start of the study and we followed them for up to three months, using our three sensing methods and also recording any lapses back to alcohol use.
[PAUSE]
We are in the early stages of model building at this point and I will focus today primarily on results from preliminary models using only EMA.
However, we are actively working with GPS and cellular communications as well and I will give you a clear sense of how we are developing models with those signals too.
I’ll also give you more detail about how we think we can implement these models for clinical benefits.
Participant Characteristics
Let me start highlighting the characteristics of the sample we are using to develop and evaluate the AUD lapse models.
We had reasonable diversity across many participant characteristics including age, sex at birth, marital status, education, and income.
However, given the methods we used for recruiting, we have very little racial and ethnic diversity in the sample. The sample is predominantly White and non-Hispanic.
I’ll return to this later, both when we evaluate issues of algorithmic fairness and when I talk about another, larger NIDA-funded project where we are correcting this issue of racial and ethnic representation in our models.
Participant Characteristics
All participants met criteria for moderate to severe AUD
Reported abstinence goals
I also want to highlight that this is a sample with clinically meaningful alcohol problems, consistent with their pursuit of abstinence goals.
On the right, I am showing you a histogram of the DSM-5 AUD symptom counts to confirm that everyone reported 4 or more symptoms of alcohol use disorder consistent with a moderate to severe presentation
Ecological Momentary Assessments
Current/Recent Experiences
Craving
Emotional state
Recent past alcohol use
Recent risky situations
Recent stressful events
Recent pleasant event
Future Expectations
Risky situations
Stressful events
Abstinence Confidence
So let me tell you a bit more about the ecological momentary assessments (or EMAs) that we used as part of our sensing methods. These EMAs are brief surveys that participants completed on their smartphones. They take 20-30 seconds to complete and we collected them several times per day.
On each EMA, participants reported the date and time of any lapses back to alcohol use that they hadn’t previously reported. These lapse reports are used for the lapse outcomes that we train our models to predict. And these lapses were also confirmed by study staff during lab visits using a follow-back procedure.
All of the EMAs also asked participants about their current craving, emotional state, recent risky situations, and recent stressful and pleasant events since their last EMA.
And on the first EMA each day, they also reported any future risky situations and stressful events that they expected in the next week and their confidence that they would remain abstinent.
Modeling: Feature Engineering
Features based on recent past experiences (12, 24, 48, 72, 168 hours)
Min, max, and median response (all items)
History (count) of past lapses (item 1) and completed EMAs (compliance)
Raw scores and change scores (from baseline/all past responses)
We used the raw responses to the EMAs to engineer about 300 features to use in our models to predict future lapses
We formed features by aggregating EMA items over various past time periods ranging from 12 - 168 hours in the past relative to the window we want to predict into.
We calculated mins, maxes and medians for the EMA items in these time periods
We also calculated counts of past lapses and counts of past missing EMAs to index engagement with our monitoring system.
And we included these scores both in raw form and as change from baseline for the participant based on all their previous responses since the start of the study.
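To make this feature engineering concrete, here is a minimal sketch in Python. The EMA records (timestamp, craving rating, lapse flag), item names, and window choices are all hypothetical illustrations, not our actual pipeline:

```python
from datetime import datetime, timedelta

# Hypothetical EMA records for one participant: (timestamp, craving 0-10, lapse flag)
emas = [
    (datetime(2020, 1, 1, 8),  2, False),
    (datetime(2020, 1, 1, 20), 5, False),
    (datetime(2020, 1, 2, 8),  7, True),
    (datetime(2020, 1, 2, 20), 4, False),
]

def ema_features(emas, prediction_time, windows=(12, 24, 48)):
    """Aggregate EMA items over trailing windows ending at prediction_time."""
    feats = {}
    for hours in windows:
        start = prediction_time - timedelta(hours=hours)
        vals = [c for t, c, _ in emas if start <= t < prediction_time]
        feats[f"craving_min_{hours}h"] = min(vals) if vals else None
        feats[f"craving_max_{hours}h"] = max(vals) if vals else None
        # Simple median (upper middle for even counts)
        feats[f"craving_med_{hours}h"] = sorted(vals)[len(vals) // 2] if vals else None
    # Counts of past lapses and completed EMAs index risk history and compliance
    feats["n_past_lapses"] = sum(lapse for t, _, lapse in emas if t < prediction_time)
    feats["n_past_emas"] = sum(1 for t, _, _ in emas if t < prediction_time)
    # Change score: most recent craving relative to the participant's running mean
    past = [c for t, c, _ in emas if t < prediction_time]
    feats["craving_change"] = past[-1] - sum(past) / len(past) if past else None
    return feats

f = ema_features(emas, datetime(2020, 1, 3, 0))
```

The same pattern extends to the other EMA items and to the longer 72- and 168-hour windows.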
Modeling: Predictions
Predict hour-by-hour probability of future lapse
[PAUSE]
OK, now that we’ve talked about sensing and features, I want to transition to talk about how we train models to use these features to predict lapses.
For our purposes today I won't dive deep into the machine learning methods, but let me highlight a few high-level details.
We used the features that I just described to make predictions about the hour-by-hour probability of a future lapse. We are developing separate models for three future lapse windows – lapses in the next week, lapses in the next day, and lapses in the next hour.
Modeling: Predictions
Predict hour-by-hour probability of future lapse
For example, if I was in recovery from an AUD, I could use these models to generate the probability that I would lapse after I leave the presentation today at 7pm. One model would generate the probability I would lapse at some point between 7 pm today and 7 pm next Wednesday, the second would predict the probability of a lapse between 7 pm today and 7 pm tomorrow and the third and most temporally precise model would provide the probability of a lapse in the next hour, between 7pm today and 8pm today.
And of course, all of the models would only use data collected prior to 7 pm today so that they are “predicting”, in the full sense of the word, into the future and not just demonstrating an association.
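A minimal sketch of how these rolling outcome windows could be labeled, assuming a hypothetical list of self-reported lapse timestamps (the window names are illustrative):

```python
from datetime import datetime, timedelta

def lapse_labels(prediction_time, lapse_times):
    """Binary outcome for each future window starting at prediction_time."""
    windows = {"next_hour": timedelta(hours=1),
               "next_day": timedelta(days=1),
               "next_week": timedelta(weeks=1)}
    return {name: any(prediction_time <= t < prediction_time + width
                      for t in lapse_times)
            for name, width in windows.items()}

# A lapse at 11 pm tonight is a positive outcome for the day and week
# windows, but not for the next-hour window starting at 7 pm.
labels = lapse_labels(datetime(2025, 4, 9, 19),       # 7 pm today
                      [datetime(2025, 4, 9, 23)])     # 11 pm today
```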
Modeling: Algorithms and Resampling
XGBoost - Boosted decision trees
Also considered:
ElasticNet GLM (e.g., LASSO, ridge regression)
Random Forest
KNN
Using grouped (by participant), nested, repeated k-fold CV
30 “held-out” test sets
New participants and observations not used for training
We are evaluating machine learning model configurations that differ by common statistical algorithms that I can talk more about later if there is interest.
And we are rigorously evaluating the performance of these models using grouped, nested, repeated k-fold cross validation.
For our purposes, what this means is that we evaluate model performance in 30 separate held-out test sets, and each of these sets contains new observations from new participants that were not used to train the models.
And again, this is consistent with what we mean by prediction. We don’t care how our models perform with the participants we used to train them. We want to know how well the models will work when we implement them with new people in the future.
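The key idea behind grouping — every observation from a given participant lands on the same side of each split — can be sketched as follows. This is a simplified stand-in for the nested, repeated cross-validation we actually use:

```python
import random

def grouped_kfold(participant_ids, k=3, seed=0):
    """Assign whole participants (not observations) to folds so that each
    held-out fold contains only participants unseen during training."""
    ids = sorted(set(participant_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    fold_of = {pid: i % k for i, pid in enumerate(ids)}
    # Each split: (train observation indices, test observation indices)
    splits = []
    for fold in range(k):
        test = [i for i, p in enumerate(participant_ids) if fold_of[p] == fold]
        train = [i for i, p in enumerate(participant_ids) if fold_of[p] != fold]
        splits.append((train, test))
    return splits

# Two observations each from six hypothetical participants
obs_participants = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"]
splits = grouped_kfold(obs_participants, k=3)

# No participant appears in both train and test of any split
leak_free = all(
    {obs_participants[i] for i in tr}.isdisjoint({obs_participants[i] for i in te})
    for tr, te in splits
)
```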
Predicted Lapse Probabilities: Next Week Model
Model predicts probability of lapse in next week for “new” observations in test sets
Can panel predictions by ground truth (i.e., true lapse vs. no-lapse observations)
Want high probabilities for true lapses and low probabilities for true no lapses
OK, let's begin to explore how well we can do.
Let's start with the model that provides the coarsest level of temporal specificity – one week – and let me take a moment to make the predictions that this machine learning model provides more concrete for you.
On the right, you are looking at histograms of the lapse probability predictions that the model makes for all the weeks for all the patients in the held-out test sets.
I’ve paneled these histograms by whether a lapse did or did not happen in reality for each predicted week. The top panel is for weeks with lapses and the bottom panel is for weeks with no lapses.
Ideally, you want the predicted probabilities to be very high for weeks when there was a lapse and very low for weeks when there was no lapse.
And this is exactly what we see for the one week lapse window model
Understanding the Models
[PAUSE]
Of course, if we want to implement these models in a real world system, we need to understand how they work and what features are driving the predictions. And in recent years, the field of interpretable AI has made big strides in developing tools to help us look under the hood, so to speak, of these models to better understand them.
One of the more promising of these tools is SHAP or Shapley Additive Explanations. SHAP is a method for interpreting the output of machine learning models that is based on cooperative game theory. It provides a principled way to assign each feature or category of features an importance value for a particular prediction.
We can use this approach to understand why the model makes a specific prediction for a specific participant at a specific moment in time. And I will do this a bit later when we talk about how to make personalized support recommendations.
But we can also use SHAP values to understand the global feature importance of each feature across all participants and observations for any of our models, so lets take a look at this first.
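As a toy illustration of the aggregation step (the SHAP values and feature names below are invented, not our actual results), global importance is simply the mean absolute SHAP value per feature across all predictions:

```python
# Hypothetical per-observation SHAP values (rows: observations, cols: features)
shap_values = [
    [0.30, -0.10, 0.02],
    [-0.25, 0.05, -0.01],
    [0.35, -0.15, 0.03],
]
features = ["past_lapses", "craving", "age"]

# Global importance: mean absolute SHAP value per feature across observations
importance = {
    f: sum(abs(row[j]) for row in shap_values) / len(shap_values)
    for j, f in enumerate(features)
}
ranked = sorted(importance, key=importance.get, reverse=True)
```

In practice we compute these values with a SHAP explainer over the fitted model; the averaging shown here is what produces the bar widths in the plot.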
Understanding the Models: Next Hour Model
All EMA items impact lapse probability
[PAUSE]
The plot on the right shows feature categories and their associated importance as indexed by mean absolute SHAP values across all the predictions made by the Next Hour Model.
The bar width shows the relative importance of each feature, globally across all participants and observations. From this we see a few important characteristics of the NEXT HOUR model.
First, all of the EMA items affect predictions about lapse probability across observations.
As you might expect, history of past lapses has a big influence on the probability of a future lapse. But self reported abstinence efficacy, craving, emotional state, history of stressful events, and other features from the EMA all make meaningful contributions to lapse probability in the next hour across observations.
Understanding the Models: Next Hour Model
All EMA items impact lapse probability
Lapse day and lapse hour are useful
We also see that we can use lapse day and lapse hour to make predictions of the lapse probability in the next hour.
Not surprisingly, people are more likely to lapse on weekends and during evening hours and the hour level model can use that information to improve its lapse predictions.
These features likely contribute to the superior performance of the hour model relative to the day and week interval prediction models.
Understanding the Models: Next Hour Model
All EMA items impact lapse probability
Lapse day and lapse hour are useful
Demographics not particularly important
And finally, we see that demographics were not particularly important for predicting lapses. In other words, the frequency of observed lapses did not differ meaningfully by these demographic characteristics.
This shouldn’t be too surprising either, because these demographic characteristics are stable for an individual over time and therefore can’t explain changes in lapse probability within an individual over time.
Interim Summary and Next Steps
Very strong overall performance
Temporally precise models for immediate future lapse risk
EMA risk features are interpretable and sensible
So let's pause here for a quick recap of where we are so far.
We have models that predict exceptionally well
These models have a high degree of temporal specificity, even down to hour level resolution
The risk features from EMA map sensibly onto known lapse risks and we have interventions and supports designed to address many of these risks
So, obviously, we are really excited about the potential capabilities of a recovery monitoring and support system that includes these models to provide personalized continuing care.
BUT, we still have some very important work to do with respect to several issues that are critical to resolve before we can implement these models effectively and without causing potential harm.
Next Steps: Algorithmic Fairness
To start, our models have some serious, but unfortunately not unexpected, problems given what I told you earlier about some of the demographic limitations of our training data.
I’ve already shown you that our models perform exceptionally well when evaluated across the full sample. However, when we evaluate model performance, it is critical that we look at performance in subgroups that experience health disparities. And too often, these analyses are not done or reported.
It's only very recently that we have begun to take this seriously, and we must. If we hope to use our system to address existing disparities in SUD outcomes, then our models must perform well for all groups, regardless of their privilege; otherwise, the use of these models may exacerbate rather than reduce existing mental healthcare disparities.
Next Steps: Algorithmic Fairness
As one example of this issue, the data we collected to train these models from this first NIAAA grant did not include much racial or ethnic diversity among the participants. We collected those data at a time when we weren’t yet thinking as carefully about issues of representation as we try to do now.
Next Steps: Algorithmic Fairness
Given that, we were dismayed but not surprised to find that those models perform substantially worse (with auROCs that are .19 units lower) when predicting lapses for anyone who wasn’t White and non-Hispanic.
Algorithmic Fairness
And these fairness issues were not limited to just race and ethnicity. We see worse performance for participants with incomes below the poverty line and, to a lesser extent, for female participants.
This is obviously unacceptable and we are working now to correct this.
Next Steps: Algorithmic Fairness
NIDA project recruited ~ 400 patients in recovery from Opioid Use Disorder
National sample (size; diversity: demographics, location)
More variation in stage of recovery (1 – 6 months at start)
Sensing for 12 months
For example, we have just recently completed data collection for a NIDA funded project that collected a more racially diverse sample using nationwide recruiting techniques.
This sample also includes much needed geographic diversity because the factors that predict lapse in urban settings may be different from those that predict lapse in rural settings.
I should also note that in this project, we also administered EMAs only once per day to reduce measurement burden given that we tracked participants for up to a full year.
Next Steps: Algorithmic Fairness
Excellent performance: auROC ~ 0.94
[PAUSE]
We have just begun to train models using EMA features from this new dataset, so these next results and data figures should be considered preliminary. But we are excited to see that even with only one EMA per day, we are now getting the best performance we have seen to date when predicting lapses in the next day.
An auROC of .94
Next Steps: Algorithmic Fairness
Excellent performance: auROC ~ 0.94
And more importantly, this next day prediction model appears much fairer with respect to its performance in all three of the subgroups where our original models were deficient.
[PAUSE]
But let me explicitly say that issues related to fairness and the potential to exacerbate health disparities through the use of a sensing and AI based monitoring and support system are much more complicated than what I’ve had time to present here.
I’d be happy to dive a bit deeper into this during the discussion period if people are interested to engage more with this.
Next Steps: Sensing Geolocation and Communications
[PAUSE]
As our monitoring and support system continues to mature, we will also want a richer, broader set of lapse risk features so that we can distinguish better between different situations that require different supports. We can do this by engineering features from our location and communication signals, which tap into different experiences than what we measure by EMA.
And as an added benefit, the use of passive sensing rather than EMA, which requires active input from the user, may also lower the patient burden of using these systems long term.
Let’s take a look at what we can get from geolocation and communications signals to provide you with some intuition about how we think this will work.
Next Steps: Sensing Geolocation and Communications
Here is a wide view of my moment-by-moment location detected by a GPS app over a month when we were first experimenting with this sensing method. The app recorded the paths that I traveled, with movement by car in green and running in blue.
The red dots indicate places that I stopped to visit for at least a few minutes.
And although not displayed here, the app recorded the days and exact times that I was at each of these locations.
From these data, you can immediately see that I am a runner, with long runs leaving from downtown Madison and frequent trail runs on the weekends in the county and state parks to the west and northwest.
Next Steps: Sensing Geolocation and Communications
Zooming in to the Madison isthmus, these data show that I drove my children halfway around the lake each morning to their elementary school. And from these data we might be able to detect those stressful mornings when getting my young kids dressed and fed didn’t go as planned and we were late, sometimes very late, to school!
The app recorded my daily running commute through downtown Madison to and from my office. From this, we can observe my long days at the office and also those days that I skipped out.
Looking at the red dots indicating the places I visit, the app can detect the restaurants, bars, and coffee shops where I eat, drink and socialize. We can use public map data to identify these places and make inferences about what I do there.
…Imagine my text messages…
In addition to geolocation, we also collected my smartphone communications logs and even the content of my text messages.
And no such luck, I don’t plan to show you my actual text messages!
[PAUSE]
But imagine what we could learn about me from the patterns of my communications - Who I was calling, when I made those calls, and even the content of what I sent and received by text message.
Context is Critical
We believe we can improve the predictive strength of these geolocation and communication signals even further by identifying the specific people and places that make us happy or sad or stressed, those that we perceive support our mental health and recovery and those who undermine it.
Context is Critical
For example, consider what this brief text message thread between a hypothetical patient and their drinking buddy implies about the probability that they might lapse back to drinking in the coming hours.
[PAUSE]
… And how would your prediction change if this wasn’t their drinking buddy but instead their mom, who was a big supporter of their recovery?
This interpersonal context matters!!!
Context is Critical
We gather this contextual information quickly by asking a few key questions about the people and places we interact with frequently over the first couple of months that we record these signals. And we can identify these frequent contacts and locations directly from these signals.
In our current projects, we target people and places that we interact with at least twice a month or more for more detailed follow-up to gather context. And it turns out that this really isn’t that burdensome. Most of us are creatures of habit and if we set a threshold for 2x monthly interactions, we typically only have 10-30 people and places that meet this threshold. And it’s the same people and places each month so we can build this context up when the person first starts to use the system and after that it only needs to be updated occasionally when we go somewhere new or make a new friend.
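The identification step is simple frequency counting over the sensed logs. A sketch with a hypothetical month of communication events (names invented):

```python
from collections import Counter

# Hypothetical one-month communication log: who each call/text involved
log = ["mom", "mom", "boss", "sponsor", "sponsor", "sponsor",
       "plumber", "mom", "boss"]

# People contacted at least twice this month get the brief context survey;
# one-off contacts (the plumber) are skipped to keep burden low
counts = Counter(log)
frequent = sorted(p for p, n in counts.items() if n >= 2)
```

The same counting applies to stopped-at locations from the geolocation signal.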
Contextualized Geolocation
Location type (e.g., home, home of friend, bar, restaurant, liquor store, work, health care, AA/recovery meeting, gym/fitness center)
Is alcohol available at this location?
Have you drunk alcohol at this location?
Is your experience at this location generally pleasant, unpleasant, mixed, or neutral?
This location is (high risk, moderate risk, low risk, no risk) for my recovery
HIGHLIGHT THE CONTEXT INFO WE ARE COLLECTING
NOTE THAT SOME CAN COME FROM PUBLIC DATA
Contextualized Communications
Have you drunk alcohol with this person?
What is their drinking status (e.g., drinker, non-drinker)?
Would you expect them to drink in your presence?
Are they currently in recovery from alcohol or other substances?
Do they know about your recovery goals, and if so, are they supportive?
Are your experiences with them typically pleasant, unpleasant, mixed, or neutral?
TALK FIRST ABOUT COMMUNICATIONS CONTEXT
THEN TALK ABOUT PRELIMINARY ANALYSES FROM CLAIRE, COCO, AND KENDRA
Next Steps: Clinical Uses
[PAUSE]
OK, now that you have a sense of how we have developed our prediction models and their capabilities, I’d like to spend the remaining time unpacking how we are thinking about implementing and evaluating this system for patients.
Next Steps: Clinical Uses
Do NOT provide model output to clinicians
Clinicians are over-burdened
Not ready for new data streams
When we started this work, we believed we were building this system to inform clinicians about their caseload.
Today’s digital therapeutics have clinician dashboards built into them and we, perhaps naively, thought clinicians could use this information to prioritize their resources to patients who had the greatest need.
But as we talked to clinicians, it became very clear that they do not want any more info at this point. Post-pandemic, they are barely keeping their heads above water and are definitely not ready to add new systems and data streams in place. This may change in the future, but we’ve moved away from this idea for now.
In contrast, our work with participants suggested that they did see potential value in monitoring their recovery using our monitoring and support system. So we have pivoted to considering what information might be most useful to provide directly to them.
Next Steps: Clinical Uses
Do NOT predict class labels (lapse vs. no-lapse)
Iatrogenic effects?
Information loss
Lets start first with how we should NOT use this system with individuals with SUDs.
I have been intentionally focusing on the lapse probabilities that are natively output from our prediction models. However, it's common for others to use these models to predict formal class labels. In other words, to specifically predict whether a lapse will happen or not. Basically, a threshold is set, for example, 0.5, and if the probability of a lapse exceeds that threshold, the model predicts that a lapse will occur. Otherwise, it predicts that no lapse will occur.
But there are several reasons that we DO NOT want to predict dichotomous class labels.
Next Steps: Clinical Uses
Do NOT predict class labels (lapse vs. no-lapse)
Iatrogenic effects?
Information loss
To start, there may be concerns about possible iatrogenic effects associated with telling a person that they are going to lapse. Or at least this is a risk that should be carefully considered if we provide these blunt dichotomous predictions to individuals.
Second, these days, I think we have all come to understand that taking scores that are natively quantitative, like probabilities, and artificially dichotomizing them results in a substantial loss of potentially valuable information.
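A two-line example makes that information loss concrete: with hypothetical predicted probabilities for two patients, thresholding at 0.5 erases a large and clinically meaningful difference in risk:

```python
threshold = 0.5

# Hypothetical predicted lapse probabilities for two patients
probs = {"patient_a": 0.51, "patient_b": 0.99}

# Dichotomizing collapses both to the same "lapse" label,
# discarding the difference in risk between them
labels = {patient: int(p >= threshold) for patient, p in probs.items()}
```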
Next Steps: Clinical Uses
DO use lapse probability
auROCs range from 0.90 - 0.94
Instead of using the models to predict class labels, we believe that there is potentially high value in using the original lapse probabilities directly output by the model.
I’ve already shown you that these probabilities can discriminate very well between lapse and no-lapse observations, correctly assigning a higher probability to lapses more than 90% of the time.
Next Steps: Clinical Uses
DO use lapse probability
auROCs range from 0.90 - 0.94
Probabilities are calibrated and ordinal
Provides fine gradations of relative risk for clinical decision-making
And critically, these probabilities are very well calibrated and at least ordinal in their relationship with the true probability that a lapse will occur.
On the right, I am showing you a simple calibration plot. On the x-axis, I’ve binned predicted lapse probabilities into bin widths of 10 percent and for each of these bins, I display the actual observed probability of lapses for observations in that bin.
If the probabilities were perfectly calibrated, the bin means would all fall on the dotted line with the bin from 0 - .1 having an observed probability of .05, the bin from .1 - .2 having a probability of .15, and so on. And this is essentially what we see for our models.
Given this, we believe that the lapse probabilities can provide precise, fine gradations of risk for clinical decision making.
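The calibration check just described amounts to binning predicted probabilities and comparing each bin to the observed lapse rate. A self-contained sketch with toy numbers (not our study data):

```python
def calibration_bins(probs, outcomes, width=0.1):
    """Bin predicted probabilities and compute the observed lapse rate per bin."""
    n_bins = int(round(1 / width))
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p / width), n_bins - 1)   # p == 1.0 falls in the top bin
        bins[idx].append(y)
    # (bin lower edge, observed lapse rate) per bin; None for empty bins
    return [(i * width, sum(b) / len(b)) if b else (i * width, None)
            for i, b in enumerate(bins)]

# Toy predictions: well-calibrated probabilities track observed rates
probs = [0.05, 0.05, 0.05, 0.05, 0.95, 0.95]
outcomes = [0, 0, 0, 0, 1, 1]
cal = calibration_bins(probs, outcomes)
```

Plotting the observed rate per bin against the bin midpoints produces the calibration plot on the slide.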
Next Steps: Personalized Daily Support Recommendations
SHAP values from the NEXT DAY model can identify the most important risk features for a specific individual on each day
These features can be used to personalize daily support recommendations
But we can get much more than just lapse probabilities from these models.
Previously I showed you how we can use global SHAP values to determine which features are important in the aggregate across all people and timepoints.
We can also use SHAP values to understand which features contributed most strongly to any single prediction for a specific person at a specific moment in time.
This allows us to understand not only WHEN a lapse might occur but also WHY and therefore potentially how best to intervene.
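For a linear model, local SHAP values have a closed form, coefficient times the feature's deviation from its average, which makes the idea easy to sketch without the full `shap` library. The feature names, weights, and values below are purely illustrative:

```python
import numpy as np

def linear_shap(coefs, x, background_means):
    """Exact SHAP values for a linear model: each feature's contribution
    to this one prediction, relative to the average prediction."""
    return np.asarray(coefs) * (np.asarray(x) - np.asarray(background_means))

# Hypothetical day-30 observation for one participant
feature_names = ["past_craving", "recent_lapses", "stressors"]
coefs = np.array([0.8, 0.5, 0.3])   # illustrative model weights
x = np.array([0.9, 0.1, 0.2])       # today's feature values
means = np.array([0.3, 0.1, 0.2])   # feature averages in the training data

contribs = linear_shap(coefs, x, means)
top = feature_names[int(np.argmax(np.abs(contribs)))]
```

Here the elevated craving score accounts for essentially all of this person's elevated prediction, which is exactly the information a support recommendation can be keyed to.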
Next Steps: Personalized Daily Support Recommendations
For example, in this first plot, I am showing you SHAP values for someone who would be predicted to have a high lapse probability on day 30 of their recovery because they have been reporting high craving. For that person, we could recommend urge surfing techniques or remind them that distracting activities can help get them through short periods of craving that day.
Next Steps: Personalized Daily Support Recommendations
In contrast, a second person might have similarly high lapse probability on day 30 of their recovery, but instead because they have lapsed a few times in recent weeks. They could be encouraged and assisted to complete activities designed to increase their motivation for abstinence.
Next Steps: Personalized Daily Support Recommendations
And at a later point in time, on day 70, that same person may have improved their abstinence motivation but now be at increased risk for a lapse because of a string of recent past and anticipated stressors.
Now they could be provided with guided stress reduction or relaxation techniques that they could use each day.
In this way, we can provide personalized recommendations to patients that are tailored to their unique risk profile at that moment in time.
Next Steps: Personalized Daily Support Recommendations
SHAP values from the NEXT DAY model can identify the most important risk features for a specific individual on each day
These features can be used to personalize daily support recommendations
We can also eventually learn which interventions are best for which risk
As a starting point, the mapping between important local risk features and specific interventions or supports can be created using clinical domain expertise. In other words: what would a clinician tell their patient to do in those circumstances? We can simply hard-code these clinically derived mappings between risk features and support recommendations in our monitoring and support system, and this is what we are doing currently.
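A hard-coded clinical mapping of this kind can be as simple as a lookup table. The feature names and message text here are hypothetical stand-ins, not the system's actual content:

```python
# Hypothetical clinician-derived mapping from the most important local
# risk feature to a support recommendation.
SUPPORT_MAP = {
    "past_craving": "Try urge surfing or a distracting activity today.",
    "recent_lapses": "Complete an abstinence-motivation exercise.",
    "stressors": "Use a guided relaxation or stress-reduction exercise.",
}

def recommend(top_risk_feature, fallback="Check in with your support network."):
    """Return the clinically derived recommendation for today's top risk."""
    return SUPPORT_MAP.get(top_risk_feature, fallback)
```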
However, with enough training data, reinforcement learning can also be applied to this problem to allow the model to learn the best intervention to recommend given a set of risk features to reduce subsequent lapse probability.
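One simple version of that learning problem is a contextual bandit. The sketch below uses an epsilon-greedy rule, with reward coded 1 when no lapse followed the recommendation; this is an illustration of the general approach, not the planned implementation:

```python
import random
from collections import defaultdict

class EpsilonGreedyRecommender:
    """Minimal sketch of learning which support works best for each
    risk context from observed outcomes."""

    def __init__(self, supports, epsilon=0.1, seed=0):
        self.supports = list(supports)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(lambda: defaultdict(int))
        self.values = defaultdict(lambda: defaultdict(float))

    def choose(self, risk_context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.supports)          # explore
        vals = self.values[risk_context]
        return max(self.supports, key=lambda s: vals[s])   # exploit

    def update(self, risk_context, support, reward):
        # incremental mean of rewards for this context/support pair
        self.counts[risk_context][support] += 1
        n = self.counts[risk_context][support]
        self.values[risk_context][support] += (reward - self.values[risk_context][support]) / n
```

In practice this would sit on top of the clinician-derived mapping, gradually shifting recommendations toward whatever the outcome data support.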
Next Steps: Advance Warning
Previous models only predict immediate future
Advance warning needed for some types of support
[PAUSE]
Now, you might have noticed that up until this point, I’ve only talked about predictions for the immediate future - the next week, the next day, or the next hour. And in these past few examples, I’ve been talking about using one of those immediate models - the next day model - to provide personalized support recommendations that the patient can implement that next day.
However, for some types of support, people may need some advance warning to put those supports into place. For example, if you need to meet with your AA sponsor or get in to see your therapist, you will need some lead time to schedule those meetings or appointments.
Next Steps: Advance Warning
Previous models only predict immediate future
Advance warning needed for some types of support
Can lag model up to two weeks into the future
Performance drops but still remains good
Given this, we have begun to explore various methods for predicting lapse probabilities further into the future.
As a first step, we took the model which predicted lapses in the next week and lagged it by different periods, from one day up to two weeks into the future. For example, the two-week-lagged model predicts the probability of a lapse during a one-week window that begins two weeks from now.
And as you can see from the plot on the right, this approach does result in a drop in performance as we move further into the future. But even with a two-week lag, we still have an auROC of .85, which is potentially clinically useful.
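Constructing the targets for a lagged model amounts to shifting the one-week label window. Here is a sketch assuming a daily 0/1 lapse series per participant; this is an illustration, not the project's actual pipeline:

```python
import numpy as np

def lagged_week_labels(daily_lapses, lag_days):
    """For each day t, label whether any lapse occurs during the one-week
    window starting lag_days from now: days [t + lag, t + lag + 6]."""
    lapses = np.asarray(daily_lapses, dtype=int)
    n = len(lapses)
    labels = np.full(n, -1)  # -1 where the full window is not yet observed
    for t in range(n):
        start, end = t + lag_days, t + lag_days + 7
        if end <= n:
            labels[t] = int(lapses[start:end].any())
    return labels
```

With `lag_days=0` this reproduces the original next-week labels; increasing the lag pushes the prediction window further out while keeping its one-week width.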
And we are currently exploring other methods for predicting lapse probabilities further into the future that may perform better than this simple lagging approach. I'd be happy to talk more about those during the discussion period if anyone is interested.
Next Steps: Optimize System Feedback to Patients
Sensing EMA and geolocation for four months
Model updated each night for next day
Lapse probability predictions
Important risk features
Risk relevant support recommendations
Participants receive daily messages containing varying combinations of these components
Measure trust, engagement, and clinical outcomes
And that brings us to perhaps the most important next step to implementing our Smart Recovery Monitoring and Support System.
Just because our prediction models perform well doesn't guarantee that they will provide meaningful clinical benefits.
We need to be able to provide feedback from these models to patients such that they trust the system and its feedback, find it useful, and engage with it over time in a way that improves their recovery. Questions about what information to provide, how to present it, and when to present it are all critical to the success of this system.
And we were just awarded another grant from the NIAAA to do exactly this.
Next Steps: Optimize System Feedback to Patients
Sensing EMA and geolocation for four months
Model updated each night for next day
Lapse probability predictions
Important risk features
Risk relevant support recommendations
Participants receive daily messages containing varying combinations of these components
Measure trust, engagement, and clinical outcomes
In this project, we have embedded a prediction model that uses inputs from both EMA and geolocation within our Smart Recovery Monitoring and Support System.
Participants will use this system for 4 months. Each day, the model will
make daily lapse probability predictions for each participant
identify current personalized lapse risks contributing to that prediction,
and map those risks to behavioral and support recommendations that are specific to each person each day.
We can then manipulate what information we include in daily messages from the system to participants to increase their trust and engagement with the system, as well as to formally evaluate its clinical benefits.
Obviously, we are very excited to get started on this project because it will bring us one step closer to providing meaningful support to individuals in recovery.
Thanks for your time. I am eager to hear your reactions to all of this and I’d be happy to answer any questions.
CRediTs
Acknowledgements (recent projects first)
Acknowledgements (alphabetized)