Representation learning for clinical time series prediction tasks in electronic health records

Dataset generation: heart failure selection

The EHR data used in this paper were collected from Shuguang Hospital, a first-class general hospital in Shanghai. The CDR of Shuguang Hospital between January 2005 and April 2016 contains approximately 350,000 hospital records.

In this paper, a sub-repository focusing on heart failure is constructed from the above CDR. We select patients who satisfy the following criteria: the patient has at least two hospital records, and an ICD-10 code associated with heart failure appears in the diagnoses or medical orders of these records. Specifically, clinical experts defined a list of 63 ICD-10 codes related to heart failure.

Our dataset consists of 4682 patients with 10,898 inpatient records, of whom 568 patients (about 12.1%) died in the hospital; the remaining patients are difficult to track. To enrich our dataset, we split the patients’ hospital records into cumulative prefixes and obtain 10,898 samples. For instance, if a patient has three inpatient records, we construct three samples by respectively selecting only the first record, the first and second records, and all three records.
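The prefix-splitting scheme described above can be sketched as follows; the function name is illustrative, not from the paper:

```python
def expand_to_prefix_samples(records):
    """Return one sample per cumulative prefix of a patient's record sequence:
    the t-th sample contains the first t inpatient records."""
    return [records[:t] for t in range(1, len(records) + 1)]

# A patient with three inpatient records yields three samples.
samples = expand_to_prefix_samples(["r1", "r2", "r3"])
# samples == [["r1"], ["r1", "r2"], ["r1", "r2", "r3"]]
```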

Data preprocessing

For each patient in the sub-repository, auxiliary information, general demographic details (i.e., age and gender), and clinical events are retained. Auxiliary information contains the EMPI (the patient's unique identifier), the hospital ID (the inpatient record's unique identifier), the admission time, and the death time. We use the auxiliary information to organize and preprocess the EHR data. The general demographic details (i.e., age and gender) need only two dimensions to describe, and the age value is first normalized in a way that preserves sparsity. The clinical events include diagnoses, medications, and lab tests. To convert clinical events into computable sequences, the normalization process varies by event type. In particular, we convert the clinical event information of one record into a multi-hot vector. Finally, a multi-hot vector with 1309 dimensions is obtained according to the following principles:

  • Diagnoses: The patient records of the heart failure repository include 1232 distinct ICD-10 codes in total. As a result, we represent the ICD-10 codes with 1232 dimensions.

  • Medications: Based on how commonly each medication is used for heart failure in China, 61 kinds of medications were chosen manually by clinical specialists, who classified them into 11 groups, such as ACE-I, ARA, and ARB. Accordingly, we represent the medications with 11 dimensions.

  • Lab Tests: Clinical experts chose 22 laboratory tests related to heart failure for this research. According to the reference range of each lab test, a flag taking the value high, low, or normal denotes the result. Therefore, three dimensions are required to convert the result of one lab test into a binary feature. Eventually, we represent the lab tests with 66 dimensions.

In total, the raw feature comprises the clinical events and demographic details, and one record of the raw feature is described with 1311 dimensions.
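A minimal sketch of how one record could be encoded as the 1309-dimensional multi-hot vector described above (1232 diagnosis codes + 11 medication groups + 22 lab tests × 3 flags); the segment ordering and index maps are illustrative assumptions, not specified by the paper:

```python
N_DIAG, N_MED, N_LAB = 1232, 11, 22       # dimensions from the text
FLAGS = {"low": 0, "normal": 1, "high": 2}  # illustrative flag ordering

def encode_record(diag_idx, med_idx, lab_flags):
    """diag_idx / med_idx: sets of code indices present in the record;
    lab_flags: {lab test index: 'low' | 'normal' | 'high'}."""
    vec = [0] * (N_DIAG + N_MED + 3 * N_LAB)   # 1309 dimensions in total
    for i in diag_idx:                          # diagnosis segment
        vec[i] = 1
    for i in med_idx:                           # medication-group segment
        vec[N_DIAG + i] = 1
    for lab, flag in lab_flags.items():         # 3-dim one-hot per lab test
        vec[N_DIAG + N_MED + 3 * lab + FLAGS[flag]] = 1
    return vec

v = encode_record({5}, {2}, {0: "high"})
# v has 1309 dimensions with exactly three active entries
```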

Patient representation learning

Figure 2 describes a straightforward motivation for using distributed representations for patients. The size of the tensor representation is variable because different patients may have different numbers of inpatient visits (i.e., x, y, or z visits). As shown in Fig. 2a, it is challenging to use tensors of variable length as the input of prediction models. To solve this issue, the representation method in Fig. 2b computes statistics over all the inpatient records of each patient, such as the sum, average, or maximum. For example, the value of each dimension of the patient vector can be the sum of the corresponding medical event over all inpatient records. Therefore, the dimension of the patient vector equals the number of distinct medical events appearing in the raw data. However, this kind of representation is still high-dimensional and sparse. Moreover, it does not take the time series information in EHRs into consideration. A better way to represent patients is shown in Fig. 2c: using the RNN-DAE model, we learn distributed representations of patients as low-dimensional real-valued vectors that capture the time series information between records.
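The statistic-based representation of Fig. 2b can be sketched as a per-dimension summary over a patient's record vectors (function and argument names are illustrative):

```python
def summarize_patient(record_vectors, stat="sum"):
    """Collapse a variable-length list of equal-length multi-hot record
    vectors into one fixed-length vector via a per-dimension statistic."""
    dims = len(record_vectors[0])
    cols = [[r[j] for r in record_vectors] for j in range(dims)]
    if stat == "sum":
        return [sum(c) for c in cols]
    if stat == "mean":
        return [sum(c) / len(c) for c in cols]
    return [max(c) for c in cols]   # stat == "max"

# Two inpatient records, three medical-event dimensions:
patient = summarize_patient([[1, 0, 1], [1, 1, 0]])
# patient == [2, 1, 1] -- per-event counts across all records
```

Note that the result always has as many dimensions as there are distinct medical events, regardless of the statistic chosen, which is exactly why it stays high-dimensional and sparse.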

Fig. 2

Three different forms of patient representation. Here, a patient may have various numbers of inpatient visits (e.g., x, y, z). The tensor representation of each patient consists of multiple N-dimensional multi-hot vectors (i.e., N = 1309). The statistic-based representation is derived by applying summary statistics, yielding a vector with N dimensions. The distributed representation is a better representation with D dimensions (i.e., D = 300), where D is much lower than N. a Tensor representation of patients. b Statistic-based representation of patients. c Distributed representation of patients


Given a sequence of inpatient records X=(x1,x2,⋯,xn), where xt (t=1,⋯,n) is a multi-dimensional multi-hot vector representing an inpatient clinical event record at time step t, our goal is to summarize a feature vector representation c from this sequence of clinical events. Finally, c is concatenated with the demographic details to obtain our “Deep Feature”.

RNNs are widely used to cope with time-series prediction problems [28, 29]. An RNN can remember historical information because the value of the current hidden layer depends on the input of the current layer and the output of the previous layer. Based on the standard RNN, Hochreiter et al. [30] proposed the long short-term memory (LSTM) model to cope with the gradient exploding and vanishing problems [31, 32]. To simplify the structure of the LSTM, one of its most popular variants, the gated recurrent unit (GRU) model [33], was developed. The GRU model keeps the advantages of both the RNN and the LSTM, that is, supporting longer sequences while consuming less training time [34]. Therefore, we replace the standard RNN unit with the GRU in our research.
We develop a recurrent neural network based denoising autoencoder (RNN-DAE) model in this paper, which combines the ideas of SDAs [13] and sequence autoencoders [35]. In detail, our model trains a GRUencoder to convert the input features into a vector, and a GRUdecoder is then trained to predict the input features sequentially. Specifically, the decoder reconstructs the initial inputs from a noisy version of the input features. Figure 3 illustrates the architecture of our RNN-DAE model.

Fig. 3

The architecture of our proposed RNN-DAE model. Multi-hot vectors (xt) in time-series order are corrupted with Gaussian noise and then encoded by a GRUencoder model into the patient vector (c). Given the patient vector, another GRUdecoder model decodes it so that the input (xt) and the output (yt) are as consistent as possible


To avoid over-fitting when training our model, the input vectors X are first mapped through a stochastic mapping \(\boldsymbol {\tilde {X}} \thicksim \boldsymbol {q_{D}(\tilde {X}|X)}\). Specifically, we adopt Gaussian noise as the stochastic mapping to get \(\boldsymbol {\tilde {X}}\); Gaussian noise is a series of random numbers drawn from a Gaussian distribution. The GRUencoder reads \(\boldsymbol {\tilde {X}}\) and turns it into a vector c, where c is the last hidden state of the GRUencoder and summarizes the whole input sequence. The GRUencoder predicts the next state ht at time step t given the input xt and the previous hidden state ht−1 as follows:

$$ \boldsymbol{z}_{t} = \boldsymbol{\delta}(\boldsymbol{W}_{z}\cdot[\boldsymbol{h}_{t-1},\boldsymbol{x}_{t}]) $$

(1)

$$ \boldsymbol{r}_{t} = \boldsymbol{\delta}(\boldsymbol{W}_{r}\cdot[\boldsymbol{h}_{t-1},\boldsymbol{x}_{t}]) $$

(2)

$$ \boldsymbol{\tilde{h}}_{t} = \boldsymbol{tanh}(\boldsymbol{W}\cdot[\boldsymbol{r}_{t}\ast\boldsymbol{h}_{t-1},\boldsymbol{x}_{t}]) $$

(3)

$$ \boldsymbol{h}_{t} = (1-\boldsymbol{z}_{t})*\boldsymbol{h}_{t-1}+\boldsymbol{z}_{t}*\boldsymbol{\tilde{h}}_{t} $$

(4)

where rt is the reset gate, zt is the update gate, δ(·) denotes the sigmoid activation function, and tanh(·) denotes the hyperbolic tangent activation function. The reset gate reads the values of ht−1 and xt and outputs values (between 0 and 1) applied to the state ht−1 of each cell through Eq. (2). The update gate updates the hidden state to the new state ht.
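One encoder step, Eqs. (1)–(4), can be sketched in numpy as below. The concatenated-weight form W·[h, x] matches the notation above; biases are omitted, as in the equations, and the dimensions are illustrative rather than the paper's (which uses 1309-dimensional inputs and a 300-dimensional state):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_encoder_step(x_t, h_prev, W_z, W_r, W):
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                     # update gate, Eq. (1)
    r = sigmoid(W_r @ hx)                                     # reset gate, Eq. (2)
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))  # candidate, Eq. (3)
    return (1 - z) * h_prev + z * h_tilde                     # new state, Eq. (4)

rng = np.random.default_rng(0)
d_in, d_h = 8, 4                                  # toy dimensions
W_z, W_r, W = (rng.standard_normal((d_h, d_h + d_in)) for _ in range(3))
h = np.zeros(d_h)
for x_t in rng.standard_normal((3, d_in)):        # a 3-step input sequence
    h = gru_encoder_step(x_t, h, W_z, W_r, W)
# h is the last hidden state -- the patient vector c
```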
After encoding, the GRUdecoder is used to predict the next state yt at time step t based on the global patient vector c and the previous hidden state st−1 as follows:

$$ \boldsymbol{z}_{t} = \boldsymbol{\delta}(\boldsymbol{W}_{z}\cdot[\boldsymbol{s}_{t-1},\boldsymbol{c}]) $$

(5)

$$ \boldsymbol{r}_{t} = \boldsymbol{\delta}(\boldsymbol{W}_{r}\cdot[\boldsymbol{s}_{t-1},\boldsymbol{c}]) $$

(6)

$$ \boldsymbol{\tilde{s}}_{t} = \boldsymbol{tanh}(\boldsymbol{W}\cdot[\boldsymbol{r}_{t}*\boldsymbol{s}_{t-1},\boldsymbol{c}]) $$

(7)

$$ \boldsymbol{s}_{t} = (1-\boldsymbol{z}_{t})*\boldsymbol{s}_{t-1}+\boldsymbol{z}_{t}*\boldsymbol{\tilde{s}}_{t} $$

(8)

$$ \boldsymbol{y}_{t} = \boldsymbol{s}_{t} $$

(9)

where st is the hidden state of the decoder at time t.
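A matching numpy sketch of one decoder step, Eqs. (5)–(9): the gating is identical to the encoder's, but every step reads the same patient vector c instead of a new input, and the output yt is simply the hidden state st. Dimensions are again illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_decoder_step(c, s_prev, W_z, W_r, W):
    sc = np.concatenate([s_prev, c])
    z = sigmoid(W_z @ sc)                                    # update gate, Eq. (5)
    r = sigmoid(W_r @ sc)                                    # reset gate, Eq. (6)
    s_tilde = np.tanh(W @ np.concatenate([r * s_prev, c]))   # candidate, Eq. (7)
    s = (1 - z) * s_prev + z * s_tilde                       # new state, Eq. (8)
    return s, s                                              # y_t = s_t, Eq. (9)

rng = np.random.default_rng(1)
d_c, d_s = 4, 4                                   # toy dimensions
W_z, W_r, W = (rng.standard_normal((d_s, d_s + d_c)) for _ in range(3))
c, s = rng.standard_normal(d_c), np.zeros(d_s)
outputs = []
for _ in range(3):                                # decode a 3-step sequence
    s, y = gru_decoder_step(c, s, W_z, W_r, W)
    outputs.append(y)
```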

The reconstruction error L(X,Y) is defined as the loss function, and the model optimizes its parameters by minimizing the reconstruction error. We utilize the cross-entropy function to calculate the reconstruction error as follows:

$$ L(X, Y) = -\sum_{i=1}^{n} \sum_{j=1}^{d} \left[x_{i}^{(j)}\log y_{i}^{(j)} + \left(1-x_{i}^{(j)}\right)\log\left(1-y_{i}^{(j)}\right)\right] $$

(10)

where \(x_{i}^{(j)}\) is the j-th element of xi and \(y_{i}^{(j)}\) is the j-th element of yi. d is the dimension of xi and yi.
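Eq. (10) can be computed as below; the small clipping constant is an implementation detail added here to keep the logarithms finite, not something the paper specifies:

```python
import numpy as np

def reconstruction_error(X, Y, eps=1e-12):
    """Element-wise binary cross-entropy between inputs X and
    reconstructions Y, summed over all time steps i and dimensions j."""
    X = np.asarray(X, dtype=float)
    Y = np.clip(np.asarray(Y, dtype=float), eps, 1 - eps)  # keep log finite
    return -np.sum(X * np.log(Y) + (1 - X) * np.log(1 - Y))

X = [[1.0, 0.0], [0.0, 1.0]]
good = reconstruction_error(X, [[0.99, 0.01], [0.01, 0.99]])
poor = reconstruction_error(X, [[0.5, 0.5], [0.5, 0.5]])
# good < poor: better reconstructions yield a lower loss
```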

The Gaussian noise is set with a mean of 0 and a variance of 0.1. The output dimensions of the GRUencoder and GRUdecoder are both 300; therefore, c is a 300-dimensional vector. When training the network, the loss is minimized by gradient-based optimization with mini-batches of size 100.
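The stochastic corruption \(q_{D}(\tilde{X}|X)\) with these parameters amounts to adding zero-mean Gaussian noise to every dimension of every record vector, e.g.:

```python
import random

# Additive Gaussian corruption with mean 0 and variance 0.1,
# i.e. standard deviation sqrt(0.1), as stated in the text.
def corrupt(x, std=0.1 ** 0.5):
    return [v + random.gauss(0.0, std) for v in x]

random.seed(0)  # for reproducibility of this sketch
noisy = corrupt([1.0, 0.0, 1.0])
# noisy keeps the same dimensionality but perturbs every entry
```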

Finally, each patient vector consists of 302 dimensions and is named the “Deep Feature”. Of these, 2 dimensions are the demographic details (i.e., age and gender), and the other 300 dimensions are the output of our representation model (i.e., the RNN-DAE). We do not input the demographic details into our model, because they have a significant direct effect on clinical tasks; the vector c is derived by encoding clinical events only.
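Assembling the 302-dimensional “Deep Feature” then reduces to a concatenation. The age normalization below is illustrative only; the paper states that age is normalized without breaking sparsity but does not give the exact scheme:

```python
def deep_feature(age, gender, c, max_age=120.0):
    """Concatenate normalized age and gender (2 dims) with the
    300-dimensional patient vector c from the RNN-DAE encoder."""
    return [age / max_age, float(gender)] + list(c)

f = deep_feature(age=65, gender=1, c=[0.0] * 300)
# f has 302 dimensions: 2 demographic + 300 learned
```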
