Validate and load electronic health records

Introduction

This vignette demonstrates how to clean and validate hospital records and load them into a data warehouse.

Ramses is provided simulated datasets consisting of records of 355 hospitalisations for 199 patients, complete with prescriptions, diagnoses and care episodes records.

Setup

First, packages must be loaded.

library(Ramses)
library(dplyr)    # to facilitate data manipulations

Second, a relational database server is needed.

For the purposes of this vignette, one may use on a small DuckDB database on the local hard drive. This database will behave in a similar way as a dedicated relational database server. This method should only be used for training with simulated data as it is not a secure method of handling actual patient data.

ramses_db <- connect_local_database("ramses-db.duckdb")
#> DuckDB database created in 
#> ramses-db.duckdb
#> Please do not use real patient data.

Alternatively, one may use a dedicated PostgreSQL server (other SQL databases are not currently supported). In this case, we use the DBI and RPostgres packages to initiate the connection:

ramses_db <- DBI::dbConnect(
  RPostgres::Postgres(), 
  host = 'example_server_name' ,
  dbname = 'example_database_name',
  user = 'username', 
  password = 'password', 
  port = 5432, 
  # Do not forget to specify a time zone to correctly collect datetime records
  timezone = "Europe/London")

# Specify the name of the schema Ramses should use on the database 
# (if not, Ramses will use 'public' by default)
DBI::dbExecute(ramses_db, "SET search_path = ramses_schema, public;")

Whatever method is used, the database connection information is stored in an object (here ramses_db) which must be provided to most functions in the Ramses package.

We can now begin preparing records and loading them into the database. This involves three steps:

Transform: adding metadata and transforming records to conform with Ramses data validation constraints
Validate: using dedicated functions (eg: validate_inpatient_episodes()) to verify that all data conform to Ramses data validation constraints
Load: loading records into a SQL database.

Medication records

Drug prescriptions

We begin with prescription records (you may choose to start with other records).

str(drug_prescriptions)
#> 'data.frame':    367 obs. of  12 variables:
#>  $ patient_id        : chr  "5124578766" "4874231672" "6292626973" "6292626973" ...
#>  $ prescription_id   : chr  "66cac1c5eab88d72c8b7687966357f5b" "8ccd67f4730b62ceafb8bcb27996c10c" "72cf4b592b0f4e2143b4bb9d7c569c97" "806c86b55cf50505a20b722f081c4075" ...
#>  $ rxsummary         : chr  "Piperacillin / Tazobactam IVI 4.5 g TDS" "Ciprofloxacin ORAL 500 mg BD" "Flucloxacillin ORAL 500 mg 6H" "Metronidazole ORAL 400 mg TDS" ...
#>  $ authoring_date    : POSIXct, format: "2015-08-04 13:07:16" "2017-07-06 09:49:31" ...
#>  $ prescription_start: POSIXct, format: "2015-08-04 14:45:16" "2017-07-06 10:26:31" ...
#>  $ prescription_end  : POSIXct, format: "2015-08-07 14:45:16" "2017-07-07 22:26:31" ...
#>  $ tr_DESC           : chr  "Piperacillin / Tazobactam" "Ciprofloxacin" "Flucloxacillin" "Metronidazole" ...
#>  $ route             : chr  "IV" "ORAL" "ORAL" "ORAL" ...
#>  $ dose              : num  4.5 500 500 400 4.5 500 4.5 4.5 600 2 ...
#>  $ units             : chr  "g" "mg" "mg" "mg" ...
#>  $ frequency         : chr  "TDS" "BD" "6H" "TDS" ...
#>  $ duration          : num  3 1.5 4 2 4 3 3 2 1 2 ...

Prescriptions records must conform to data validation requirements listed in the documentation for validate_prescriptions()

We first map drug concepts using the as.ab() function provided by the AMR package. This function takes a character variable as input and returns a three-letter code identifying antifungals and antibacterials. These give access to the AMR package’s wide range of analytical methods. Although the AMR package does not currently provide equivalent mapping for antiparasitic and antiviral drugs, these can be mapped to any substance-level terminology of your choice. RxNorm Ingredients and SNOMED CT UK Edition Virtual Therapeutic Moieties are both good options.

# attempting to map drug name using AMR package
drug_prescriptions$drug_code <- AMR::as.ab(drug_prescriptions$tr_DESC)

Some manual editing may be required when drugs are not successfully mapped, as shown in the example below.

# editing drug names
drug_prescriptions$drug_code <- gsub("Vancomycin protocol", 
                                   "Vancomycin",
                                   drug_prescriptions$tr_DESC)
# mapping drug name using AMR package
drug_prescriptions$drug_code <- AMR::as.ab(drug_prescriptions$drug_code)
drug_prescriptions$drug_name <- AMR::ab_name(drug_prescriptions$drug_code)

We also map the route of administration to the World Health Organisation Anatomical Therapeutic Chemical Classification System (ATC) codes:

# recoding route of administration
drug_prescriptions <- mutate(drug_prescriptions, 
  ATC_route = 
    case_when(
      route %in% c(NULL) ~ "Implant", 
      route %in% c("NEB", "INHAL") ~ "Inhal", 
      route %in% c("TOP", "EYE", "EYEL", "EYER", "EYEB", 
                   "EAR", "EARL", "EARR", "EARB") ~ "Instill", 
      route %in% c("NASAL", "NOST", "NOSTL", "NOSTR", "NOSTB") ~ "N", 
      route %in% c("ORAL", "NAS", "PEG") ~ "O", 
      route %in% c("IV", "IVB", "IVI", "IMI", "IT", "IVT") ~ "P", 
      route %in% c("PR") ~ "R", 
      route %in% c("BUCC", "SB", "OROM", "SUBL") ~ "SL", 
      route %in% c("SC", "ID") ~ "TD", 
      route %in% c("PV") ~ "V", 
      TRUE ~ NA_character_
    ))

Using the AMR package, it is possible to infer important properties of antibacterial/antifungal drugs:

ATC classes and groups, which can be useful for instance in order to identify antifungals (group ‘Antimycotics for systemic use’)
Defined Daily Doses (DDDs), which are commonly used to measure drug consumption.

# extracting the ATC code (eg J01CR02) group (eg Beta-lactam antibacterials, penicillins)
drug_prescriptions$ATC_code <- AMR::ab_atc(drug_prescriptions$drug_code, only_first = TRUE)
drug_prescriptions$ATC_group <- AMR::ab_atc_group1(drug_prescriptions$drug_code)
# removing antifungal drugs, if applicable
drug_prescriptions <- filter(drug_prescriptions, 
                             ATC_group != "Antimycotics for systemic use")

Working out the prescription DDDs can be complex in the case of compound medications. The ATC provides reference DDDs with respect to the main substance in a given medication. For example:

Co-amoxiclav is composed of amoxicillin and clavulanic acid (ATC code: J01CR02). Its reference DDD can be obtained directly from

paste0(AMR::ab_ddd("co-amoxiclav", "oral"), 
       AMR::ab_ddd("co-amoxiclav", "oral", unit = T))

that is, 1.5g. The corresponding web page notes this dose refers to the amoxicillin content only. A prescription of Augmentin 625mg TDS (three times a day) tablets containing 500mg of amoxicillin, the total dose prescribed per day is 1.5g or of amoxicillin per day, or 1.0 DDD.
More complex compounds are documented here. For instance, co-trimoxazole (ATC code: J01EE01) is composed of 5 parts of sulfamethoxazole to 1 part of trimethoprim. The ATC reference DDD for sulfamethoxazole is 2g while the DDD of trimethoprim is 400mg: with this ratio of 5 to 1, DDDs calculated on the basis of the dosage of sulfamethoxazole will be valid for co-trimoxazole prepared with 5 parts to 1 part. These just some examples.

In order to obtain the correct DDDs for compound medications, it is recommended to map data to a suitable dictionary of medicines such as the RxNorm or the SNOMED CT UK Drug Extension and extract the corresponding ATC codes and DDDs.

In this example, however, we use a look-up table for the strength of compound medications.

# prepare DDD extraction
compound_strength_lookup <- data.frame(list(
  drug_code = c("AMC", "AMC", "TZP", "SMX"),
  route = c("oral", "oral", "oral", "oral"),
  dose = c(625, 1.2, 4.5, 480),
  units = c("mg", "g", "g", "mg"),
  strength = c(500, 1, 4, 400),
  basis_of_strength = c("AMX", "AMX", "PIP", "SMX")
), stringsAsFactors = F)

drug_prescriptions <- merge(drug_prescriptions, 
                            compound_strength_lookup, 
                            all.x = T)

drug_prescriptions <- drug_prescriptions %>% 
  mutate(strength = if_else(is.na(strength), dose, strength),  
         basis_of_strength = if_else(is.na(basis_of_strength),
                                     as.character(drug_code),
                                     basis_of_strength))

Ramses provides another look-up table Ramses::reference_drug_frequency to help convert the frequency character variable into a numeric variable indicating the daily frequency of administration.

drug_prescriptions <- merge(drug_prescriptions, 
                            reference_drug_frequency, by = "frequency", all.x = T)

We can now compute the prescription DDD based on the ATC reference DDDs and the daily frequency of administration:

drug_prescriptions <- drug_prescriptions %>% 
  mutate(daily_dose = strength * daily_frequency) %>% 
  mutate(DDD = compute_DDDs(
    ATC_code = AMR::ab_atc(basis_of_strength, only_first = TRUE),
    ATC_administration = ATC_route,
    dose = daily_dose,
    unit = units)) %>% 
  mutate(duration_days = if_else(
    daily_frequency == -1,
    "one-off", 
    if_else(
      round(difftime(prescription_end, 
                     prescription_start, 
                     units = "days")) == 1,
      "1 day", 
      paste(round(difftime(prescription_end, 
                           prescription_start, 
                           units = "days")),
            "days")
      )
  ))

drug_prescriptions <- drug_prescriptions %>% 
  transmute(patient_id,
            prescription_id,
            # prescription summary for displaying in interfaces
            prescription_text = paste0(
              drug_name, " ", 
              route, " ", 
              dose, units, " ", 
              duration_days
            ),
            drug_code,
            drug_name = drug_name,
            drug_display_name = drug_name,
            # in this example dataset, only antibacterial drugs are included
            antiinfective_type = "antibacterial", 
            ATC_code,
            ATC_group,
            ATC_route,
            authoring_date,
            prescription_start,
            prescription_end,
            prescription_status = "completed",
            prescription_context = "inpatient",
            dose,
            unit = units,
            route,
            frequency,
            daily_frequency,
            DDD)

Drug administrations

An identical process is followed for drug administration, with the difference that DDDs refer to the dose of a single administration, and thus do not involve a frequency variable.

drug_administrations$drug_code <- gsub("Vancomycin protocol", 
                                     "Vancomycin", 
                                     drug_administrations$tr_DESC)
drug_administrations$drug_code <- AMR::as.ab(drug_administrations$drug_code)
drug_administrations$drug_name <- AMR::ab_name(drug_administrations$drug_code)
drug_administrations$drug_group <- AMR::ab_group(drug_administrations$drug_code)
# recoding route of administration
drug_administrations <- mutate(drug_administrations, 
  ATC_route = 
    case_when(
      route %in% c(NULL) ~ "Implant", 
      route %in% c("NEB", "INHAL") ~ "Inhal", 
      route %in% c("TOP", "EYE", "EYEL", "EYER", "EYEB", 
                   "EAR", "EARL", "EARR", "EARB") ~ "Instill", 
      route %in% c("NASAL", "NOST", "NOSTL", "NOSTR", "NOSTB") ~ "N", 
      route %in% c("ORAL", "NAS", "PEG") ~ "O", 
      route %in% c("IV", "IVB", "IVI", "IMI", "IT", "IVT") ~ "P", 
      route %in% c("PR") ~ "R", 
      route %in% c("BUCC", "SB", "OROM", "SUBL") ~ "SL", 
      route %in% c("SC", "ID") ~ "TD", 
      route %in% c("PV") ~ "V", 
      TRUE ~ "NA_character_"
    ))
drug_administrations$ATC_code <- AMR::ab_atc(drug_administrations$drug_code, only_first = TRUE)
drug_administrations <- filter(drug_administrations, 
                               drug_group != "Antifungals/antimycotics")

drug_administrations <- merge(drug_administrations, compound_strength_lookup, all.x = T)
drug_administrations <- drug_administrations %>% 
  mutate(strength = if_else(is.na(strength), dose, strength),
         basis_of_strength = if_else(is.na(basis_of_strength),
                                     as.character(drug_code),
                                     basis_of_strength))

drug_administrations <- drug_administrations %>% 
  mutate(DDD = compute_DDDs(
    ATC_code = AMR::ab_atc(basis_of_strength, only_first = TRUE),
    ATC_administration = ATC_route,
    dose = dose,
    unit = units 
  )) 

# Create an identifier
# Note: An alternative is encryption using openssl::sha256()
      
drug_administrations <- drug_administrations %>%
  group_by(patient_id, drug_code, route,
           dose, units, administration_date) %>% 
  mutate(administration_id = dplyr::cur_group_id()) %>% 
  ungroup()

drug_administrations <- drug_administrations %>% 
    transmute(
      patient_id,
      administration_id = as.character(administration_id),
      prescription_id,
      administration_text = paste0(
          drug_name, " ", route, " ", dose, units),
      drug_code,
      drug_name,
      drug_display_name = drug_name,
      drug_group,
      # in this example dataset, only antibacterial drugs are included
      antiinfective_type = "antibacterial", 
      ATC_code,
      ATC_route,
      dose,
      unit = units,
      route,
      administration_date,
      administration_status = "completed",
      DDD
    )

This data frame can now be validated, before being loaded into the database.

validate_administrations(drug_administrations)
#> NULL

load_medications(conn = ramses_db, 
                 prescriptions = drug_prescriptions, 
                 administrations = drug_administrations, 
                 overwrite = TRUE,
                 silent = FALSE)
#> Transitive closure of therapy episodes beginning...
#> Registered S3 method overwritten by 'igraph':
#>   method   from
#>   plot.sir AMR
#> Transitive closure of therapy combinations beginning...

Inpatient care records

Admissions

We now turn to records of patient demographics, admissions, discharges and transfers. Ramses is designed around the notion of hospital encounters identified by variable encounter_id, which correspond to the window from the time a patient is admitted to the time a patient is discharged.

In addition, Ramses use subdivisions of hospital encounters called episodes of care, which designate a continuous period of inpatient care under the responsibility of a department, specialty, or clinical team. This corresponds to the ‘finished consultant episode’ in the UK NHS Admitted Patient Care data model (Herbert et al. 2017). All hospital encounters consist of at least one episode of care. Episodes of care are used to calculate bed-days as well as rates of infection, prescribing and other stewardship metrics for individual specialties and teams of clinicians.

Hospital encounters and episodes of care are recorded in a single table in which every row corresponds to one episode of care.

str(inpatient_episodes)
#> 'data.frame':    1082 obs. of  11 variables:
#>  $ patient_id               : chr  "6145252493" "5124578766" "5124578766" "5124578766" ...
#>  $ encounter_id             : chr  "5458286195" "1275624266" "1275624266" "1275624266" ...
#>  $ admission_method         : chr  "1" "2" "2" "2" ...
#>  $ admission_date           : POSIXct, format: "2015-09-24 09:13:22" "2015-08-04 11:27:16" ...
#>  $ discharge_date           : POSIXct, format: "2015-09-24 19:16:41" "2015-08-12 12:54:48" ...
#>  $ episode_number           : int  1 1 2 3 4 5 1 1 2 1 ...
#>  $ last_episode_in_encounter: chr  "1" "2" "2" "2" ...
#>  $ episode_start            : POSIXct, format: "2015-09-24 09:13:22" "2015-08-04 11:27:16" ...
#>  $ episode_end              : POSIXct, format: "2015-09-24 19:16:41" "2015-08-06 11:49:09" ...
#>  $ consultant_code          : chr  "C1000003" "C3400039" "C1000010" "C3400036" ...
#>  $ main_specialty_code      : chr  "100" "340" "100" "340" ...

Dates of admissions and episode start/end must be valid:

there must be no overlap between admissions or between episodes
there must be no gap between episodes in an admission.

validate_inpatient_episodes(patients = patients,
                            episodes = inpatient_episodes)
#> [1] TRUE

Optionally, Ramses can use data on ward movements. These records must be validated together with episodes of care

validate_inpatient_episodes(patients = patients,
                            episodes = inpatient_episodes, 
                            wards = inpatient_wards)
#> [1] TRUE

The data can now be loaded into the database.

load_inpatient_episodes(conn = ramses_db,
                        patients_data = patients,
                        episodes_data = inpatient_episodes,
                        wards_data =  inpatient_wards)

Diagnosis codes (ICD-10)

In order to exploit clinical diagnoses coded by doctors during hospital admissions, Ramses relies on a range of look up tables which must be built to suit the precise ICD-10 version used by the data. Ramses uses three types of look up tables:

an ICD-10 reference table containing all ICD-10 codes, their full-text description, and chapter headings
a categorisation based on the AHRQ Clinical Classification Software (CCS) or its Revised version (CCSR)
a classification of infections and whether antibiotics are commonly indicated to treat them, adapted from Hashimoto et al. (2020)

ICD-10 reference table

This example uses the Ramses::inpatient_diagnoses example dataset. First, we download the ICD-10-CM reference look up table

icd10cm <- download_icd10cm()
str(icd10cm)
#> 'data.frame':    72184 obs. of  6 variables:
#>  $ icd_code            : chr  "A000" "A001" "A009" "A0100" ...
#>  $ icd_display         : chr  "A00.0" "A00.1" "A00.9" "A01.00" ...
#>  $ icd_description     : chr  "Cholera due to Vibrio cholerae 01, biovar cholerae" "Cholera due to Vibrio cholerae 01, biovar eltor" "Cholera, unspecified" "Typhoid fever, unspecified" ...
#>  $ category_code       : chr  "A00" "A00" "A00" "A01" ...
#>  $ category_description: chr  "Cholera" "Cholera" "Cholera" "Typhoid and paratyphoid fevers" ...
#>  $ edition             : chr  "ICD-10-CM 2020" "ICD-10-CM 2020" "ICD-10-CM 2020" "ICD-10-CM 2020" ...

We verify that all ICD-10 codes in the dataset can be mapped to the reference table. The first validation fails, because some of the icd_code in inpatient_diagnoses are missing (all diagnoses that are not infections). These are removed before validation is attempted again. In the present example, the dataset is simulated using ICD-10 5th Edition, rather than ICD-10-CM. The match is therefore not complete on the second validation attempt. Although the validation produces a warning, it passes (returns TRUE). This means it will not prevent loading the data into the warehouse.

validate_inpatient_diagnoses(inpatient_diagnoses, icd10cm)
#> Warning in .validate_variable_no_missing(data = diagnoses_data, vectorname =
#> diagnoses_data_schema[diagnoses_data_schema[["must_be_nonmissing"]], : The
#> following variables contain missing data: `icd_code`
#> Warning: some `icd_code` values in `diagnoses_data` do not match any `icd_code`
#> in `diagnoses_lookup`
#> [1] FALSE

inpatient_diagnoses <- filter(inpatient_diagnoses, !is.na(icd_code))
validate_inpatient_diagnoses(inpatient_diagnoses, icd10cm)
#> Warning: some `icd_code` values in `diagnoses_data` do not match any `icd_code`
#> in `diagnoses_lookup`
#> [1] TRUE

Next, we classify the reference look-up table to other classifications:

common infections and indications for antibiotic prescribing (see Ramses::antibiotic_icd_indications)
Charlson Comorbidity Index weights
Clinical Classifications Software (CCS)
Clinical Classifications Software Refined (CCSR)

Although this is done automatically when loading diagnoses into the warehouse, this step can be useful to first check that the ICD-10 reference table is successfully mapped to other classifications.

icd_infection_lkup <- icd10cm %>% 
  select(icd_code) %>% 
  map_infections_abx_indications(df = ., icd_column = "icd_code")

comorbidity_lkup <- icd10cm %>% 
  select(icd_code) %>% 
  map_charlson_comorbidities(df = ., icd_column = "icd_code")

ccsr_lkup <- icd10cm %>% 
  select(icd_code) %>% 
  map_ICD10_CCSR(df = ., icd_column = "icd_code")

ccs_lkup <- icd10cm %>% 
  select(icd_code) %>% 
  map_ICD10_CCS(df = ., icd_column = "icd_code")

head(icd_infection_lkup)
#>   icd_code infection_group1_code      infection_group1_label
#> 1     A000                   p02 Gastrointestinal infections
#> 2     A001                   p02 Gastrointestinal infections
#> 3     A009                   p02 Gastrointestinal infections
#> 4    A0100                   p02 Gastrointestinal infections
#> 5    A0101                   p02 Gastrointestinal infections
#> 6    A0102                   p02 Gastrointestinal infections
#>           infection_group2_label infection_group2_code antibiotics_indicated
#> 1 Intestinal infectious diseases                 p0201           potentially
#> 2 Intestinal infectious diseases                 p0201           potentially
#> 3 Intestinal infectious diseases                 p0201           potentially
#> 4 Intestinal infectious diseases                 p0201           potentially
#> 5 Intestinal infectious diseases                 p0201           potentially
#> 6 Intestinal infectious diseases                 p0201           potentially

Data may now be loaded into the database. Diagnoses and their reference look-up are loaded in one step.

load_inpatient_diagnoses(conn = ramses_db, 
                         diagnoses_data = inpatient_diagnoses, 
                         diagnoses_lookup = icd10cm, 
                         overwrite = TRUE)
#> Warning: some `icd_code` values in `diagnoses_data` do not match any `icd_code`
#> in `diagnoses_lookup`

Investigation records

Microbiology culture isolates and susceptibilities

A small sample of records of microbial isolates is provided along with susceptibility testing results:

str(Ramses::inpatient_microbiology)
#> 'data.frame':    38 obs. of  10 variables:
#>  $ patient_id           : chr  "99999999999" "99999999999" "99999999999" "99999999999" ...
#>  $ specimen_id          : chr  "CCC2" "CCC1" "BBB8" "BBB7" ...
#>  $ status               : chr  NA NA NA NA ...
#>  $ specimen_type_display: chr  "MRSA Screen" "Blood Culture" "MRSA Screen" "Faeces" ...
#>  $ specimen_datetime    : POSIXct, format: "2017-02-14" "2017-02-13" ...
#>  $ organism_display_name: chr  "No growth" "No growth" "No growth" "No growth" ...
#>  $ isolate_id           : chr  "2300967" "2365356" "2284807" "2392702" ...
#>  $ agent_display_name   : chr  NA NA NA NA ...
#>  $ rsi_code             : chr  NA NA NA NA ...
#>  $ isolation_datetime   : POSIXct, format: "2017-02-17" "2017-02-16" ...

Before validating and loading these records into the database, some transformations will be required to create three separate data frames:

specimens contains one row per microbial sample sent to the laboratory
isolates contains zero, one, or multiple rows for every specimen – as many rows as the number of microorganisms isolated from the specimen, whatever the technique (eg mass spectrometry, PCR, bacterial culture)
susceptibility contains results of susceptibility testing: one row per substance tested.

For this, we create a list of data frames named microbiology_data, and commence with cleaning the raw data. We employ the AMR package to code microorganism species names (AMR::as.mo()) and antimicrobial names (AMR::as.ab()).

microbiology_data <- list()

# Start by cleaning the raw data
microbiology_data$raw <- inpatient_microbiology
microbiology_data$raw <- microbiology_data$raw %>% 
  mutate(
    organism_code = AMR::as.mo(if_else(
      organism_display_name == "No growth",
      NA_character_,
      organism_display_name)),
    agent_code = AMR::as.ab(agent_display_name)) %>% 
  mutate(organism_name = AMR::mo_name(organism_code),
         agent_name = AMR::ab_name(agent_code))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `agent_code = AMR::as.ab(agent_display_name)`.
#> Caused by warning:
#> ! in as.ab(): these values could not be coerced to a valid antimicrobial
#> ID: "".

As for specimen types (eg bood, bronchoalveolar lavage fluid, etc.), they must be coded using SNOMED CT specimen concepts. A complete list of concepts from the SNOMED CT International Edition can be found in Ramses::reference_specimen_type.

Susceptibility testing results in the example dataset are already interpreted as Resistant/Increased exposure/Susceptible (RSI). Where needed, minimum inhibitory concentrations and disk diffusion diameters can be interpreted as RSI using the as.sir() function provided by the AMR package.

If possible, use the most specific concept available (eg 309176002 | Bronchial brushings sample (specimen) |) over a more generic term (eg 258415003 | Biopsy sample (specimen) |). Specific concepts can always be reclassified to higher-level concepts using SNOMED CT inference (in this case, not just 258415003 | Biopsy sample (specimen) |, but also 258603007 | Respiratory sample (specimen) |). The snomedizer package can be used to query SNOMED CT concept parents and descendants.

microbiology_data$raw <- microbiology_data$raw %>% 
  mutate(specimen_type_code = case_when(
    specimen_type_display == "Blood Culture" ~ 
      "446131002", # Blood specimen obtained for blood culture
    specimen_type_display == "Faeces" ~ 
      "119339001", # Stool specimen
    specimen_type_display == "MRSA Screen" ~ 
      "697989009", # Anterior nares swab 
    specimen_type_display == "Urine" ~ 
      "122575003", # Urine specimen
    TRUE ~ NA_character_
  )) %>% 
  left_join(transmute(reference_specimen_type,
                      specimen_type_code = conceptId,
                      specimen_type_name = pt_term))
#> Joining with `by = join_by(specimen_type_code)`

The three data frames can now be created.

microbiology_data$specimens <- microbiology_data$raw %>% 
  transmute(specimen_id,
            patient_id,
            status = "available",
            specimen_datetime,
            specimen_type_code,
            specimen_type_name,
            specimen_type_display) %>% 
  distinct() # Removing duplicates created by multiple isolates and susceptibility testing

microbiology_data$isolates <- microbiology_data$raw %>% 
  transmute(isolate_id,
            specimen_id,
            patient_id,
            organism_code,
            organism_name,
            organism_display_name,
            isolation_datetime) %>% 
  distinct() # Removing duplicates created by susceptibility testing

microbiology_data$susceptibilities <- microbiology_data$raw %>% 
  filter(!is.na(organism_code)) %>%  # Remove no growth
  transmute(isolate_id,
            specimen_id,
            patient_id,
            organism_code,
            organism_name,
            organism_display_name,
            agent_code,
            agent_name,
            agent_display_name,
            rsi_code,
            concept_code = NA_character_) %>% 
  distinct()

An optional step is to classify isolates by multidrug resistance phenotype (Magiorakos et al. 2012). The AMR package function AMR::mdro() provides classification based on:

international guidelines on microorganism intrinsic resistance
records of susceptibility testing (RSI).

In this vignette, we adopt intrinsic resistances as defined by EUCAST guidelines (European Committee on Antimicrobial Susceptibility Testing 2020). The tidyr package is used to pivot susceptibility test results into wide format (antimicrobials arranged as columns).

mdr_classification <- microbiology_data$susceptibilities  %>% 
  select(isolate_id,
         organism_code,
         agent_code,
         rsi_code) %>% 
  tidyr::pivot_wider(id_cols = c("isolate_id", "organism_code"),
                     names_from = agent_code,
                     values_from = rsi_code)

mdr_classification$multidrug_resistance <- 
  as.character(AMR::mdro(
    x = mdr_classification,
    guideline = "EUCAST",
    col_mo = "organism_code"))

# Merging back into the isolates dataset
microbiology_data$isolates <- left_join(
  microbiology_data$isolates,
  select(mdr_classification, isolate_id, multidrug_resistance)
)
#> Joining with `by = join_by(isolate_id)`

The data may now be validated, and loaded.

validate_microbiology(specimens = microbiology_data$specimens,
                      isolates = microbiology_data$isolates,
                      susceptibilities = microbiology_data$susceptibilities)
#> [1] TRUE

load_inpatient_microbiology(
  conn = ramses_db,
  specimens = microbiology_data$specimens,
  isolates = microbiology_data$isolates,
  susceptibilities = microbiology_data$susceptibilities,
  overwrite = TRUE
)

Clinical investigations

Finally, we import all other observation and test results, from basic vitals such as respiration rate and blood pressure, to urine antigen test results. Analysis of such data requires the use of standard test vocabularies.

The international standard code system for laboratory data is LOINC. A sample of important codes may be found in dataset Ramses::reference_loinc. Other observable entity concepts or evaluation procedure concepts may be imported from SNOMED CT.

data(inpatient_investigations)
str(inpatient_investigations)
#> 'data.frame':    1428 obs. of  13 variables:
#>  $ observation_id           : chr  "1427" "1431" "1430" "1432" ...
#>  $ patient_id               : chr  "99999999999" "99999999999" "99999999999" "99999999999" ...
#>  $ encounter_id             : chr  "9278078393" "9278078393" "9278078393" "9278078393" ...
#>  $ status                   : chr  "final" "final" "final" "final" ...
#>  $ request_datetime         : POSIXct, format: "2016-03-30 13:51:13" "2016-03-30 13:51:13" ...
#>  $ observation_datetime     : POSIXct, format: "2016-03-30 13:51:45" "2016-03-30 13:51:45" ...
#>  $ observation_value_text   : chr  "" "" "" "" ...
#>  $ observation_value_numeric: num  35.7 154 2 17 95 108 4 36.6 113 0 ...
#>  $ observation_name         : chr  "Body temperature" "Systolic blood pressure" "SEWS score" "Respiratory rate" ...
#>  $ observation_display      : chr  "Temperature" "Systolic BP" "SEWS score" "Resps" ...
#>  $ observation_code_system  : chr  "http://loinc.org" "http://loinc.org" "http://snomed.info/sct" "http://loinc.org" ...
#>  $ observation_code         : chr  "8310-5" "8480-6" "1104051000000101" "9279-1" ...
#>  $ observation_unit         : chr  "degree_Celsius" "mm_Hg" "" "breaths/min" ...

One particular set of validation constraints apply:

units must conform to the Unified Code for Units of Measure (UCUM). Units are implemented using the units package in R.
an observation_code must have one and only one observation_unit to pass validation. If observation records come in a variety of units, they must first be converted to a common unit.

Many important units in medicine are not listed in the UCUM. For instance:

respiration rate (breaths/min)
heart rate (beats/min)
alkaline phosphatase in serum (international unit per liter notated U/L)

By default, these three custom units are passed to the custom_units argument of validate_investigations(). Any other custom unit in the dataset must be handled with this argument.

validate_investigations(inpatient_investigations, 
                        custom_units = c("breaths", "beats", "U"))
#> [1] TRUE

Data may now be loaded into the database.

load_inpatient_investigations(
  conn = ramses_db,
  investigations_data = inpatient_investigations, 
  overwrite = TRUE
)

Final steps

There is one final step to create ancillary tables:

# Create bridge tables
bridge_tables(conn = ramses_db, overwrite = TRUE)
#> [1] TRUE

# Create reference AWaRe table (see ?reference_aware)
dplyr::copy_to(dest = ramses_db,
               df = filter(reference_aware, version == "England" & year == "2019"),
               name = "reference_aware",
               temporary = FALSE,
               overwrite = TRUE)

The database is now ready.

The main vignette (getting started) demonstrates how to use it.

Once you are done

Always remember close database connections when you are finished.

DBI::dbDisconnect(ramses_db, shutdown = TRUE)

References

European Committee on Antimicrobial Susceptibility Testing. 2020. “EUCAST Expert Rules Version 3.2. Intrinsic Resistance and Unusual Phenotypes. February 2020.” European Society of Clinical Microbiology; Infectious Diseases. https://www.eucast.org/expert_rules_and_intrinsic_resistance/.

Hashimoto, Hideki, Makoto Saito, Jumpei Sato, Kazuo Goda, Naohiro Mitsutake, Masaru Kitsuregawa, Ryozo Nagai, and Shuji Hatakeyama. 2020. “Indications and Classes of Outpatient Antibiotic Prescriptions in Japan: A Descriptive Study Using the National Database of Electronic Health Insurance Claims, 2012–2015.” International Journal of Infectious Diseases 91: 1–8. https://doi.org/10.1016/j.ijid.2019.11.009.

Herbert, Annie, Linda Wijlaars, Ania Zylbersztejn, David Cromwell, and Pia Hardelid. 2017. “Data Resource Profile: Hospital Episode Statistics Admitted Patient Care (HES APC).” International Journal of Epidemiology 46 (4): 1093–1093i. https://doi.org/10.1093/ije/dyx015.

Magiorakos, A.-P., A. Srinivasan, R. B. Carey, Y. Carmeli, M. E. Falagas, C. G. Giske, S. Harbarth, et al. 2012. “Multidrug-resistant, extensively drug-resistant and pandrug-resistant bacteria: an international expert proposal for interim standard definitions for acquired resistance.” Clinical Microbiology and Infection 18 (3): 268–81. https://doi.org/10.1111/j.1469-0691.2011.03570.x.

11 February 2024