I read about the .json
format from this
series on Tutorialspoint.
Looking at the test data (STC_dx.json
) that I’ve been given,
"data"
is an array of objects.
Each object represents a single diagnosis,
containing a diagnosis code, date, patient id, and diagnosis name.
Here are an example 2 objects (out of 9):
{
"data": [
{
"DX_CODE": "427.31",
"DIAG_DATE": "2012/03/25 00:00:00.000000000",
"P_ID": "Z324563",
"DX_NM": "Atrial fibrillation (CMS/HCC)",
"P_MRN_ID": "000167426",
"E_ID": null
},
{
"DX_CODE": "725",
"DIAG_DATE": "2011/11/10 00:00:00.000000000",
"P_ID": "Z273652",
"DX_NM": "Polymyalgia rheumatica (CMS/HCC)",
"P_MRN_ID": "000256352",
"E_ID": 12256152437
}
]
}
My first task is to convert this data into DataFrame
format, with the rows
representing each patient and the columns representing 2-year intervals.
So, each cell should contain a patient’s diagnoses for a 2-year interval.
Here is a table I made to visualize the general form:
YYYY–YYYY+1 | YYYY+2–YYYY+3 | |
---|---|---|
P_IDA | DX_CODE1—DX_NM1; DX_CODE2—DX_NM2 | DX_CODE3—DX_NM3; DX_CODE4—DX_NM4 |
P_IDB | DX_CODE5—DX_NM5; DX_CODE6—DX_NM6 | DX_CODE7—DX_NM7; DX_CODE8—DX_NM8 |
I began by simply reading the .json
file with pd.read_json(json_file)
.
The resulting DataFrame looked like this:
diagnosis
0 {'DX_CODE': '427.31', 'DIAG_DATE': '2012/03/25...
1 {'DX_CODE': 'H53.9', 'DIAG_DATE': '2014/07/17 ...
2 {'DX_CODE': '725', 'DIAG_DATE': '2009/11/10 00...
3 {'DX_CODE': '725', 'DIAG_DATE': '2011/08/11 00...
4 {'DX_CODE': None, 'DIAG_DATE': '2004/10/15 00:...
5 {'DX_CODE': '272.0', 'DIAG_DATE': '2006/09/21 ...
6 {'DX_CODE': '729.81', 'DIAG_DATE': '2012/11/10...
7 {'DX_CODE': '446.5', 'DIAG_DATE': '2010/07/11 ...
8 {'DX_CODE': '725', 'DIAG_DATE': '2011/11/10 00...
All the data was stored in one column; I needed to distribute it to take advantage of Pandas’ functions. Following this Stack Overflow answer, I added code to flatten the object, and I dropped the unnecessary columns. Here is the resulting DataFrame:
DX_CODE DIAG_DATE P_ID DX_NM
0 427.31 2012-03-25 Z324563 Atrial fibrillation (CMS/HCC)
1 H53.9 2014-07-17 Z324563 Visual disturbance
2 725 2009-11-10 Z324563 Polymyalgia rheumatica (CMS/HCC)
3 725 2011-08-11 Z324563 Polymyalgia rheumatica (CMS/HCC)
4 None 2004-10-15 Z273652 Disorder of bone and cartilage
5 272.0 2006-09-21 Z273652 Pure hypercholesterolemia
6 729.81 2012-11-10 Z273652 Swelling of limb
7 446.5 2010-07-11 Z273652 Giant cell arteritis (CMS/HCC)
8 725 2011-11-10 Z273652 Polymyalgia rheumatica (CMS/HCC)
Now we’re talking. My plan now to transform the data is:
- Get the earliest and latest years.
- Create a list of lists for rows.
- Iterate through the year ranges to fill in the sublists.
- Create a new
DataFrame
from the list of lists.
I will implement these steps tomorrow.