Project 2

Project summary

Existing models for multimorbidity tend to be crude and are premised on incomplete data that often lack context. In Project 2, we aim to use complex population-level and individual-level data sets to build models that predict specific health outcomes related to diseases and multimorbidities that are common in populations. We will use data from two African Health and Demographic Surveillance System (HDSS) communities in which prevalent diseases tend to show specific clustering patterns: Bushbuckridge in Mpumalanga province, South Africa (rural/semi-urban), and Nairobi, Kenya (urban). Both sites have rich longitudinal data and many nested research studies, including genomic studies, that contribute additional project-specific data on subsets of residents.

Aim 1

Aim 1 will provide detailed scoping of the data described and characterised in Project 1. These data will be available for the models and will include individual-level data in the following broad domains: demography, clinic visits, health, medication, verbal autopsy, behaviour, infections, laboratory tests, image data and genetic data. For many variables that change over time, we have access to longitudinal data (e.g., weight, laboratory assays on blood and urine, disease states, infection and medication). Importantly, we have genome-wide genotyping (~4500 participants) and whole genome sequencing (~100 participants) data from studies in these two regions, which will allow us to assess the contribution of genetic variation to multimorbidities. We will focus on the following common traits and diseases: hypertension, obesity, dyslipidemia, diabetes, kidney diseases, HIV infection and tuberculosis.

Aim 2

Aim 2 focuses on developing prediction models using two approaches. The first is to calibrate existing risk prediction algorithms using local data with a limited number of variables (age, sex, obesity, smoking, alcohol consumption, high blood pressure and blood glucose, medication use, HIV and tuberculosis status). The second is to employ black-box machine learning models that allow for more complex interactions between features. One outcome of the models will be population stratification, which addresses the following questions: Which sub-populations appear more susceptible to developing a disease (or a second or third)? How does treatment (or lack thereof) vary between sub-populations? How are the diseases and treatments linked to mortality?
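As a rough illustration of the first approach, one common way to calibrate an existing risk score to local data is logistic recalibration, which fits a calibration intercept a and slope b in logit(p_new) = a + b * logit(p_old). The sketch below is an assumption about how such a step could look, not the project's chosen method. The data are synthetic, and the function names (fit_recalibration, recalibrate) are hypothetical.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_recalibration(p_old, y, n_iter=25):
    """Fit logit(p_new) = a + b * logit(p_old) by Newton-Raphson."""
    x = [logit(p) for p in p_old]
    a, b = 0.0, 1.0  # start from "the existing score is already calibrated"
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g_a += yi - mu          # gradient w.r.t. intercept
            g_b += (yi - mu) * xi   # gradient w.r.t. slope
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        # Newton step: solve the 2x2 system H * delta = g
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def recalibrate(p_old, a, b):
    return [sigmoid(a + b * logit(p)) for p in p_old]


# Synthetic stand-in for local data (NOT real HDSS data).
rng = random.Random(42)
p_old = [rng.uniform(0.05, 0.60) for _ in range(2000)]
# Assumed "true" local risk: the existing score overestimates risk
# for most records in this made-up population.
p_true = [sigmoid(-0.5 + 0.8 * logit(p)) for p in p_old]
y = [1 if rng.random() < p else 0 for p in p_true]

a, b = fit_recalibration(p_old, y)
p_new = recalibrate(p_old, a, b)

print(f"calibration intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"observed event rate       = {sum(y) / len(y):.3f}")
print(f"mean risk, existing score = {sum(p_old) / len(p_old):.3f}")
print(f"mean risk, recalibrated   = {sum(p_new) / len(p_new):.3f}")
```

Because the fit includes an intercept, the mean recalibrated risk matches the observed event rate in the fitting data. A slope b below 1 would suggest that the existing score is too extreme for the local population.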

Aim 3

Aim 3 has a more general objective: to assess the explainability of the models trained in Aim 2. Critically, we propose a new research direction in explainability that detects model interactions for a group of records. Current methods explain either single records or all records in aggregate.

The main outcomes of Project 2 will be (1) automated stratification to identify key high-risk clusters for targeted public health interventions and (2) models to estimate the risk of developing multimorbidities in populations and individuals (including genetic predisposition) to guide clinical decisions and to inform public health interventions.
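To make the Aim 3 distinction between single-record, aggregate and group-level explanations more concrete, the sketch below computes a simple permutation-based sensitivity within a chosen group of records. This is a stand-in technique used purely for illustration, not the new method proposed here. The toy risk model, its coefficients, the records and the function name group_sensitivity are all hypothetical.

```python
import itertools
import math
import random


def toy_risk(rec):
    """Hypothetical risk model: obesity only matters for HIV-positive records."""
    z = -2.0 + 1.2 * rec["hypertension"] + 1.5 * rec["hiv"] * rec["obese"]
    return 1.0 / (1.0 + math.exp(-z))


def group_sensitivity(model, group, feature, n_repeats=20, seed=0):
    """Mean absolute change in predicted risk when `feature` is shuffled
    among the records of `group` only."""
    rng = random.Random(seed)
    baseline = [model(r) for r in group]
    total = 0.0
    for _ in range(n_repeats):
        values = [r[feature] for r in group]
        rng.shuffle(values)
        for r, v, base in zip(group, values, baseline):
            permuted = dict(r)  # copy so the original record is untouched
            permuted[feature] = v
            total += abs(model(permuted) - base)
    return total / (n_repeats * len(group))


# Made-up records: every combination of four binary features, repeated 4 times.
features = ["hypertension", "hiv", "obese", "smoker"]
records = [
    dict(zip(features, bits))
    for bits in itertools.product([0, 1], repeat=len(features))
] * 4

groups = {
    "all_records": records,
    "hiv_positive": [r for r in records if r["hiv"] == 1],
    "hiv_negative": [r for r in records if r["hiv"] == 0],
}

sens = {
    name: {f: group_sensitivity(toy_risk, grp, f)
           for f in ("hypertension", "obese", "smoker")}
    for name, grp in groups.items()
}

for name, scores in sens.items():
    print(name, {f: round(s, 4) for f, s in scores.items()})
```

In this toy setting the aggregate view reports a moderate effect of obesity, while the group-level view shows that the effect is confined to HIV-positive records and absent in HIV-negative ones. An aggregate explanation would blur that interaction, and a single-record explanation would not generalise it.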