Predicting the time to get back to work using statistical models and machine learning approaches

Participants

The Inspiring Families programme was a partnership between the Department for Work and Pensions and Serco International PLC. It aimed to help people from families with complex issues, including, for example, a history of crime, substance dependence, or anti-social behaviour. It was open to people legally resident in the UK, with a right to work, aged 16 or over, who self-declared criteria that made it difficult for them to gain employment. Participants lived mainly in north and east London. The intervention began in January 2017, and we had access to data up to the point at which the dataset was frozen for transfer to the University of Warwick in June 2019.

People were referred to Serco by staff from the Department for Work and Pensions (DWP). If people wished to join the programme, they were invited to attend a face-to-face Initial Engagement meeting of at least one hour within five days of the DWP referral. At this meeting, a personal adviser initiated an initial needs assessment and conducted a financial ‘better-off in-work’ calculation for the participant. Advisers also collected basic demographic data, and customers completed a 32-item questionnaire on family circumstances and on obstacles to getting a job, identifying potential challenges they might face when looking for work; many of the items were binary variables, such as whether there was a history of drug/alcohol abuse, convictions, the ability to drive, etc. (Table 1 & Appendix, Table S1). Identifying which of these factors are most important for predicting return to work was the focus of this project. Consent for customers’ information to be shared with others where appropriate, and for Serco to contact employers as participants found work during the programme, was sought at this meeting.

The needs assessment included a series of detailed questions about the participants’ work history, and discussion of an Initial Action Plan commenced based on the identified needs. The Action Plan was started at this first meeting but could be completed at one of the subsequent follow-up meetings (below). It set out what was assessed as needing to be done, who needed to do it, and by when. It documented plans for support with family issues, facilitation of access to vacancies, job-search support, personal coaching to build motivation and confidence, help with CV preparation, help with economic calculations, and advice on other supporting services. Personal advisers aimed to contact participants at least once per fortnight, with increased contact as required on a case-by-case basis. Action Plans were regularly reviewed, and actions were recorded as completed as soon as this was identified. Our analyses focus on the date of first starting work, measured from the first interview, i.e., the start of the Initial Action Plan.

Analysis

Our analysis plan was to compare and contrast the results from three statistical time-to-event regression models and two machine learning approaches. Since machine learning algorithms lose efficiency when values are missing, and multiple imputation methods have yet to be integrated with them, we analysed a single dataset imputed using the hot-deck method [9, 10]. Nevertheless, we recognise the superiority of the multiple imputation approach for dealing effectively with missing values and the limitations of the hot-deck method in complying with Rubin’s approach [11]. The rationale behind our combined approach (statistical and machine learning methods) was to find out whether machine learning approaches would lead to additional insights into which prognostic variables affect the employability of programme participants, compared with well-established statistical models. We evaluated the performance of all five models on this dataset. Our outcome of interest was the date of starting the first job after joining the programme.
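
As an illustration of the imputation step, a minimal sketch of a simple random hot-deck imputation in Python is shown below. It assumes a pandas data frame df of demographic and questionnaire variables; the imputation reported here was carried out during data preparation (see Implementation software) and may have used donor classes that this sketch omits.

```python
# Minimal sketch of simple random hot-deck imputation (the data frame
# `df` is assumed; the authors' implementation may use donor classes).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)

def hot_deck_impute(data: pd.DataFrame) -> pd.DataFrame:
    """Replace each missing value with a random draw (a 'donor')
    taken from the observed values of the same column."""
    out = data.copy()
    for col in out.columns:
        missing = out[col].isna()
        donors = out.loc[~missing, col].to_numpy()
        if missing.any() and len(donors) > 0:
            out.loc[missing, col] = rng.choice(donors, size=int(missing.sum()))
    return out

df_imputed = hot_deck_impute(df)  # df: raw demographic + questionnaire data
```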

Initially, before implementing any model, we derived Kaplan-Meier curves and assessed linearity using Martingale residuals. We then checked proportionality using visual assessments and the Schoenfeld residuals test for proportional hazards.
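
A minimal sketch of these preliminary checks in Python is given below. It uses the lifelines library (not one of the libraries listed under Implementation software; the checks reported here were run in Stata), and the data frame df and column names time_to_job and started_job are hypothetical.

```python
# Illustrative only: Kaplan-Meier curve, Martingale residuals and the
# Schoenfeld-residuals proportional-hazards test using lifelines.
# `df`, `time_to_job` and `started_job` are assumed/hypothetical names.
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import proportional_hazard_test

# Kaplan-Meier estimate of the time to starting work
kmf = KaplanMeierFitter()
kmf.fit(durations=df["time_to_job"], event_observed=df["started_job"])
kmf.plot_survival_function()
plt.xlabel("Days from start of Initial Action Plan")
plt.ylabel("Proportion not yet in work")

# Fit a Cox model, then check its assumptions
cph = CoxPHFitter()
cph.fit(df, duration_col="time_to_job", event_col="started_job")

# Martingale residuals for assessing linearity of continuous covariates
martingale = cph.compute_residuals(df, kind="martingale")

# Schoenfeld-residuals test of the proportional-hazards assumption
ph_test = proportional_hazard_test(cph, df, time_transform="rank")
ph_test.print_summary()
```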

Semi-Parametric Regression – Cox

This model has been used for analysing time-to-event data since the 1970s [12, 13]. Although the semi-parametric approach adds flexibility to the model by allowing the baseline hazard function to remain unspecified, there are two strict model assumptions: the hazard ratio is assumed to be constant over time, and the survivorship effect is proportional over time. Deviation from the proportional-hazards assumption increases the chance of bias, and accommodation of time-varying effect(s) is suboptimal. The model is not robust to deviations from the proportional-hazards assumption, in which case stratification should be used [14].
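
Under this model the hazard for a participant with covariates x is h(t | x) = h0(t) exp(xβ), with the baseline hazard h0(t) left unspecified. A minimal, illustrative sketch of the fit and of the stratification fall-back is shown below; the reported model was fitted in Stata, lifelines is used here only for illustration, and the column names, including the stratification variable, are hypothetical.

```python
# Illustrative Cox proportional-hazards fit with lifelines. The
# stratification fall-back estimates a separate baseline hazard per
# stratum while keeping the remaining coefficients common.
from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(df, duration_col="time_to_job", event_col="started_job")
cph.print_summary()  # hazard ratios exp(coef), confidence intervals, concordance

# If a covariate (hypothetical name below) violates proportional hazards:
cph_strat = CoxPHFitter()
cph_strat.fit(df, duration_col="time_to_job", event_col="started_job",
              strata=["drug_alcohol_history"])
```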

Flexible-Parametric (Royston-Parmar) Regression

Specifying a distribution for the baseline hazard is a more theory-driven alternative, since replacing the (typical) step-function approximation of the event-time distribution can improve model fit. The Royston-Parmar model provides a viable, flexible specification that uses splines for a smoothed parameterisation of the baseline hazard [15, 16].

The major features of this model are flexibility of prediction under adjustment for covariates, especially time-varying covariates, and more transparent and more efficient handling of time-varying covariates. It has been suggested that the standard (non-tuned) model specification should be the hazard-scale model with a three-degrees-of-freedom spline for the function that represents the baseline (log cumulative-hazard) function [16]; this is a restricted cubic spline with two internal knots. For time-dependent covariates, one degree of freedom is used.
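
The model reported here was fitted in Stata. As a rough Python analogue, a minimal sketch is shown below using the spline-based baseline hazard available for the Cox model in lifelines; this is similar in spirit to, but not the same as, the Royston-Parmar specification, the number of knots is an assumption rather than the authors' setting, and the column names are hypothetical.

```python
# Rough illustrative analogue: a Cox model whose baseline cumulative
# hazard is modelled with splines (lifelines), giving smooth rather than
# step-function baseline predictions.
from lifelines import CoxPHFitter

# n_baseline_knots controls the flexibility of the spline baseline;
# the value below is an assumption, not the authors' setting.
cph_spline = CoxPHFitter(baseline_estimation_method="spline",
                         n_baseline_knots=3)
cph_spline.fit(df, duration_col="time_to_job", event_col="started_job")
cph_spline.print_summary()

# Smooth predicted survival curves for the first few participants
surv = cph_spline.predict_survival_function(df.iloc[:5])
```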

Cox elastic net

Use of penalised regression has long been advocated because it supports model parsimony. Where a large number of variables are analysed in a model, a method for removing non-contributing variables is very useful. Briefly, the elastic net approach combines two individual regression penalties, the lasso (L1) and ridge (L2) penalties, borrowing strength from the flexibility of both. Balancing these two penalties is expected to result in better model performance. Here, we use this regularised regression to ease selection of the variables that are important in predicting job uptake [17, 18].
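
A minimal sketch of this step with scikit-survival's CoxnetSurvivalAnalysis is shown below; the data frame and column names are hypothetical, and the mixing value (l1_ratio = 0.5) is an assumption rather than the authors' setting.

```python
# Illustrative elastic-net Cox regression with scikit-survival.
# The penalty mixes the lasso (l1_ratio = 1) and ridge (l1_ratio = 0)
# penalties; variables whose coefficients shrink exactly to zero drop out.
import numpy as np
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

features = df.drop(columns=["time_to_job", "started_job"])
X = features.to_numpy(dtype=float)
y = Surv.from_arrays(event=df["started_job"].astype(bool),
                     time=df["time_to_job"])

coxnet = CoxnetSurvivalAnalysis(l1_ratio=0.5, alpha_min_ratio=0.01)
coxnet.fit(X, y)  # fits a path of penalty strengths (alphas)

# Variables retained (non-zero coefficient) for at least one alpha
selected = features.columns[np.any(coxnet.coef_ != 0, axis=1)]
print(list(selected))
```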

Random survival forest

These (semi-)parametric approaches are known to perform adequately in time-to-event studies of average size with few variables, where interest lies primarily in pre-specified variables, as either factors or interactions, commonly adjusted for selected covariates [19]. However, when the number of covariates increases substantially, likelihood-based optimisation algorithms underperform, sometimes substantially [20]. On the other hand, classic frequentist inference may be of secondary importance compared with predicting who is going to experience the event. Constrained by numerous p-value-based inefficiencies, data analysts are looking for big-data alternatives that can handle numerous covariates simultaneously. Machine learning approaches are expected to perform more efficiently in this high-dimensional context, either by employing the proportional-hazards assumption or less restrictive alternatives. Here, we use the random survival forest, an ensemble algorithm based on bagging of classification trees for survival data [21, 22]. We expected that the advantages of this fully data-driven approach would benefit variable selection and the prediction of event occurrence, avoiding common regression issues such as convergence problems, heteroskedasticity and correlations among covariates.
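
A minimal sketch of a random survival forest with scikit-survival is shown below, including permutation importance as one data-driven view of which variables matter most. X and y are assumed to be the design matrix and structured survival outcome from the elastic-net sketch above, and the hyper-parameter values are assumptions rather than the authors' settings.

```python
# Illustrative random survival forest (scikit-survival), evaluated on a
# held-out 25% test split; hyper-parameters are assumed values only.
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sksurv.ensemble import RandomSurvivalForest

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2019)

rsf = RandomSurvivalForest(n_estimators=500, min_samples_leaf=15,
                           n_jobs=-1, random_state=2019)
rsf.fit(X_train, y_train)

# .score() returns Harrell's concordance index (C-index)
print("Test C-index:", rsf.score(X_test, y_test))

# Permutation importance: drop in C-index when each variable is shuffled
importance = permutation_importance(rsf, X_test, y_test,
                                    n_repeats=10, random_state=2019)
```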

Survival support vector machine

This is a flexible machine learning algorithm that does not rely on asymptotic properties, although it draws on statistical learning theory. As a classifier, the support vector machine ‘clusters’ observations so as to provide a high margin, that is, the threshold between the clusters that can accurately classify observations into the groups/classes of interest [23]. In other words, the aim of the algorithm is to find a classification boundary that maximises the distance to the nearest observations of all classes. It is this optimal boundary, known as the maximum-margin hyperplane, that makes the algorithm a useful analysis tool. The larger the margin, the lower the error, as the boundary is less sensitive to outliers and noise. However, like other models, the support vector machine is frequently subject to overfitting.
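
A minimal sketch of the survival adaptation of this idea, using scikit-survival's FastSurvivalSVM on the training split from the random-survival-forest sketch above, is shown below; the regularisation value is an assumption, not the authors' setting.

```python
# Illustrative survival support vector machine (scikit-survival).
# With the default rank_ratio = 1.0 the model is trained on a ranking
# objective, so predictions are relative risk scores used to rank
# participants by their predicted time to starting work.
from sksurv.svm import FastSurvivalSVM

svm = FastSurvivalSVM(alpha=1.0, max_iter=100, random_state=2019)
svm.fit(X_train, y_train)

risk_scores = svm.predict(X_test)
```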

Implementation software

Two programming languages were used to implement the analyses reported here. The data preparation and the statistical modelling were implemented in Stata (Stata v. 17, StataCorp. 2022, College Station, TX: Stata Press). The machine learning models were fitted using Python 3 and the libraries pysurvival (0.1.2, 2019), scikit-survival (0.19.0, 2022) and matplotlib (3.6, 2022) [24, 26]. We validated the machine learning models using the ‘train_test_split’ function from the scikit-learn library (1.1.3), splitting the dataset 75/25 into training and test sets [27]. We did not validate the semi-parametric and flexible parametric models because of the unreliability of validation for time-to-event models with high levels of censoring [28, 29].
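
As an illustration of this validation step, the sketch below computes Harrell's concordance index on the 25% test split for the two machine learning models, using the rsf and svm estimators and the split from the sketches above (all of these names are assumptions, not the authors' code).

```python
# Illustrative hold-out evaluation: Harrell's C-index on the test split
# for the two machine learning sketches above.
from sksurv.metrics import concordance_index_censored

for name, model in [("random survival forest", rsf),
                    ("survival SVM", svm)]:
    risk = model.predict(X_test)  # relative risk scores for ranking
    c_index = concordance_index_censored(y_test["event"],
                                         y_test["time"],
                                         risk)[0]
    print(f"{name}: test C-index = {c_index:.3f}")
```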
