Dataset generation: heart failure selection
The EHR data used in this paper were collected from Shuguang Hospital, a first-class general hospital in Shanghai. The CDR of Shuguang Hospital between January 2005 and April 2016 contains approximately 350,000 hospital records.
In this paper, a sub-repository focusing on heart failure is constructed from the above CDR. We select patients who satisfy the following criterion: the patient has at least two hospital records, and an ICD-10 code associated with heart failure appears in the diagnoses or medical orders of those records. Specifically, clinical experts defined a list of 63 ICD-10 codes related to heart failure.
Our dataset consists of 4682 patients with 10,898 inpatient records, of whom 568 patients (about 12.1%) died in the hospital; the remaining patients are difficult to track. To enrich the dataset, we split each patient's hospital records into cumulative prefixes and obtain 10,898 samples. For instance, if a patient has three inpatient records, we construct three samples: the first record only, the first and second records, and all three records.
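The prefix-splitting step above can be sketched as follows; the function name and record placeholders are illustrative, not from the paper:

```python
# Prefix splitting: a patient with k chronologically ordered inpatient
# records yields k samples (the first record alone, the first two, ..., all k).

def make_prefix_samples(records):
    """Return all cumulative prefixes of one patient's ordered records."""
    return [records[:i] for i in range(1, len(records) + 1)]

# Example: a patient with three inpatient records r1, r2, r3.
samples = make_prefix_samples(["r1", "r2", "r3"])
# samples == [["r1"], ["r1", "r2"], ["r1", "r2", "r3"]]
```

Applied to the whole repository, this turns the 4682 patients' record sequences into the 10,898 training samples.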
Data preprocessing
- Diagnoses: The patient records in the heart failure repository include 1232 ICD-10 codes in total. We therefore represent the diagnoses with 1232 dimensions.
- Medications: Based on common medication practice for heart failure in China, 61 kinds of medications were chosen manually by clinical specialists, who classified them into 11 groups, such as ACE-I, ARA, and ARB. We therefore represent the medications with 11 dimensions.
- Lab Tests: Clinical experts chose 22 laboratory tests related to heart failure for this research. According to the reference range of each lab test, a flag taking one of three values (high, low, or normal) denotes the result, so three dimensions are required to encode each lab test result as a one-hot binary feature. Eventually, we represent the lab tests with 66 dimensions.
Specifically, the raw feature includes clinical events and demographic details, and one raw-feature record is described with 1311 dimensions in total.
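The feature layout above can be sketched as code; the index maps and helper below are illustrative assumptions, not the paper's implementation. The clinical-event part is 1232 diagnosis dimensions + 11 medication-group dimensions + 22 lab tests × 3 flags = 1309 dimensions, and appending age and gender gives 1311 in total:

```python
# Illustrative raw-feature encoding for one inpatient record.
N_DIAG, N_MED, N_LAB = 1232, 11, 22
FLAGS = {"high": 0, "low": 1, "normal": 2}   # one-hot slot per lab-test flag

def encode_record(diag_idxs, med_idxs, lab_flags, age, gender):
    """diag_idxs/med_idxs: code indices; lab_flags: {lab index: flag}."""
    vec = [0.0] * (N_DIAG + N_MED + 3 * N_LAB)
    for i in diag_idxs:                      # multi-hot ICD-10 codes
        vec[i] = 1.0
    for i in med_idxs:                       # multi-hot medication groups
        vec[N_DIAG + i] = 1.0
    for lab, flag in lab_flags.items():      # one-hot flag per lab test
        vec[N_DIAG + N_MED + 3 * lab + FLAGS[flag]] = 1.0
    return vec + [age, gender]               # demographics appended last

v = encode_record([5, 40], [2], {0: "high", 7: "normal"}, 63, 1)
assert len(v) == 1311
```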
Patient representation learning
Figure: Three different forms of patient representation. A patient may have a varying number of inpatient records (e.g., x, y, z). The tensor representation of each patient consists of multiple N-dimensional multi-hot vectors (N = 1309). The statistic-based representation applies summary statistics to obtain a single N-dimensional vector. The distributed representation is a more compact D-dimensional vector (D = 300), where D is much lower than N. a Tensor representation of patients. b Statistic-based representation of patients. c Distributed representation of patients
Given a sequence of inpatient records X=(x1,x2,⋯,xn), where xt (t=1,⋯,n) is a multi-dimensional multi-hot vector representing the inpatient clinical-event record at time step t, our goal is to summarize a feature-vector representation c from this sequence of clinical events. Finally, c is concatenated with the demographic details to obtain our “Deep Feature”.
Figure: The architecture of our proposed RNN-DAE model. Multi-hot vectors (xt) in time order are corrupted with Gaussian noise and then encoded by a GRUencoder model into the patient vector (c). Given the patient vector, another GRUdecoder model decodes it so that the output (yt) reconstructs the input (xt) as closely as possible
$$ \boldsymbol{z}_{t} = \sigma(\boldsymbol{W}_{z}\cdot[\boldsymbol{h}_{t-1},\boldsymbol{x}_{t}]) $$
(1)
$$ \boldsymbol{r}_{t} = \sigma(\boldsymbol{W}_{r}\cdot[\boldsymbol{h}_{t-1},\boldsymbol{x}_{t}]) $$
(2)
$$ \tilde{\boldsymbol{h}}_{t} = \tanh(\boldsymbol{W}\cdot[\boldsymbol{r}_{t}\ast\boldsymbol{h}_{t-1},\boldsymbol{x}_{t}]) $$
(3)
$$ \boldsymbol{h}_{t} = (1-\boldsymbol{z}_{t})\ast\boldsymbol{h}_{t-1}+\boldsymbol{z}_{t}\ast\tilde{\boldsymbol{h}}_{t} $$
(4)
$$ \boldsymbol{z}_{t} = \sigma(\boldsymbol{W}_{z}\cdot[\boldsymbol{s}_{t-1},\boldsymbol{c}]) $$
(5)
$$ \boldsymbol{r}_{t} = \sigma(\boldsymbol{W}_{r}\cdot[\boldsymbol{s}_{t-1},\boldsymbol{c}]) $$
(6)
$$ \tilde{\boldsymbol{s}}_{t} = \tanh(\boldsymbol{W}\cdot[\boldsymbol{r}_{t}\ast\boldsymbol{s}_{t-1},\boldsymbol{c}]) $$
(7)
$$ \boldsymbol{s}_{t} = (1-\boldsymbol{z}_{t})\ast\boldsymbol{s}_{t-1}+\boldsymbol{z}_{t}\ast\tilde{\boldsymbol{s}}_{t} $$
(8)
$$ \boldsymbol{y}_{t} = \boldsymbol{s}_{t} $$
(9)
where st is the hidden state of the decoder at time t, and c is the patient vector produced by the encoder, on which the decoder conditions at every time step.
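The encoder and decoder updates in Eqs. (1)–(9) can be sketched in NumPy. The toy sizes and random weights below are illustrative assumptions; in the paper the weights are learned and the hidden size is 300:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, Wz, Wr, W):
    """One GRU update: h_t from h_{t-1} and input x (Eqs. 1-4 / 5-8)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                                     # update gate
    r = sigmoid(Wr @ hx)                                     # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                    # interpolation

rng = np.random.default_rng(0)
D, N, T = 4, 6, 3                 # toy hidden size, input size, sequence length
Wz, Wr, W = (rng.standard_normal((D, D + N)) for _ in range(3))

# Encoder: fold the sequence of multi-hot records into the patient vector c.
h = np.zeros(D)
for x_t in rng.integers(0, 2, size=(T, N)).astype(float):
    h = gru_step(h, x_t, Wz, Wr, W)
c = h

# Decoder: the input at every step is the fixed patient vector c (Eqs. 5-8),
# and the output is y_t = s_t (Eq. 9). Decoder weights have shape (D, 2D).
Vz, Vr, V = (rng.standard_normal((D, 2 * D)) for _ in range(3))
s = np.zeros(D)
ys = []
for _ in range(T):
    s = gru_step(s, c, Vz, Vr, V)
    ys.append(s)
```

Because the same `gru_step` serves both directions, the only structural difference between encoder and decoder is the input: the record xt for the former, the fixed vector c for the latter.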
$$ L(X, Y) = -\sum_{i=1}^{n} \sum_{j=1}^{d} \left[x_{i}^{(j)}\log y_{i}^{(j)} + \left(1-x_{i}^{(j)}\right)\log\left(1-y_{i}^{(j)}\right)\right] $$
(10)
where \(x_{i}^{(j)}\) and \(y_{i}^{(j)}\) are the j-th elements of xi and yi, respectively, and d is the dimension of xi and yi.
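Eq. (10) is a summed binary cross-entropy, which can be written directly as code. The clipping epsilon below is an implementation detail added here to avoid log(0); it is not part of the paper's equation:

```python
import numpy as np

def reconstruction_loss(X, Y, eps=1e-12):
    """Summed binary cross-entropy between inputs X and reconstructions Y."""
    Y = np.clip(Y, eps, 1.0 - eps)           # keep log arguments positive
    return -np.sum(X * np.log(Y) + (1.0 - X) * np.log(1.0 - Y))

X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
# A perfect reconstruction drives the loss toward 0; a maximally uncertain
# one (all elements 0.5) costs log(2) per element, i.e. 6*log(2) here.
assert reconstruction_loss(X, X) < 1e-6
assert abs(reconstruction_loss(X, np.full_like(X, 0.5)) - 6 * np.log(2)) < 1e-9
```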
The Gaussian noise has a mean of 0 and a variance of 0.1. The output dimensions of the GRUencoder and GRUdecoder are both 300; therefore, c is a 300-dimensional vector. When training the network, the loss is minimized by gradient-based optimization with mini-batches of size 100.
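The denoising corruption described above can be sketched as follows; the `corrupt` helper is an illustrative name, and the large zero vector exists only to check the noise statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, var=0.1):
    """Add Gaussian noise with mean 0 and the given variance (paper: 0.1)."""
    return x + rng.normal(0.0, np.sqrt(var), size=x.shape)

x = np.zeros(10_000)          # large sample so the empirical moments converge
x_noisy = corrupt(x)
noise = x_noisy - x
```

In the RNN-DAE, this corruption is applied to each multi-hot input xt before encoding, while the clean xt remains the reconstruction target.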
Finally, each patient vector consists of 302 dimensions and is referred to as the “Deep Feature”. Of these, 2 dimensions are demographic details (i.e., age and gender), and the other 300 dimensions are the output of our representation model (i.e., RNN-DAE). We do not feed the demographic details into the representation model, because they already have a significant effect on clinical tasks on their own; the vector c is derived by encoding clinical events only.
