Package design vignette for {readepi}

Concept and motivation

This document outlines the design decisions that guide the development of the {readepi} R package, the reasoning behind them, and the possible pros and cons of each decision.

Importing data from various sources into the R environment is the first step in the workflow of outbreak analysis. Health data are often stored in individual files of different formats or in relational database management systems (RDBMS). More importantly, many health organizations store their data in health information systems (HIS) that are exposed through specific Application Programming Interfaces (APIs).

Many R packages have been developed over the years to read data stored in a file or in a directory containing multiple files. We recommend the {rio} package for importing data that are relatively small in size and the {data.table} package for large files. For retrieving data from RDBMS, we recommend the {DBI} package.

Several R packages read data from HIS, such as {fingertipsR}, {REDCapR}, {godataR}, and {globaldothealth}, which fetch data from Fingertips, REDCap, Go.Data, and Global.Health respectively. However, these packages are usually designed to read from one specific HIS and can’t be used to query others. Users who work with several HIS must therefore depend on many packages, and there is no unified framework for importing data from multiple HIS. As such, we propose {readepi}, a centralized tool that lets users import data from various HIS and RDBMS.

{readepi} aims to import data from several potential sources in the same way. The data sources include distributed health information systems and public databases, as shown in the figure below.

readepi roadmap

Scope

The {readepi} package is designed to import data from two common sources of institutional health-related data: HIS wrapped with specific APIs and RDBMS that run on specific servers.

To import data from these sources, users must have read access and provide the relevant query parameters to fetch the target data. The current version of {readepi} supports importing data from:

- HIS: DHIS2 and SORMAS
- RDBMS: MS SQL, MySQL, PostgreSQL, and SQLite

In upcoming releases, we plan to add features for reading data from additional HIS such as Go.Data, Global.Health, and ODK, as well as from RDBMS such as MS Access.

Diagram of current functions available in {readepi}

Output

The main functions of the {readepi} package return a data frame object that contains the data fetched from the target source with the specified request parameters. The login() function returns a connection object that is used in the subsequent queries.

Design decisions

The aim of {readepi} is to simplify and standardize the process of fetching data from APIs and servers. We strive to make this easy for users by limiting the number of arguments required to access and retrieve the data of interest from the target source. As such, the package is structured around three main functions, read_dhis2(), read_sormas(), and read_rdbms(), and one auxiliary function, login().

Authentication

The login() function is used to establish a connection with the data source. It verifies the user’s identity and determines whether they are authorized to access the requested database or API. Establishing this connection is crucial for a successful data import. However, basic authentication does not work for SORMAS. To keep the design of the package consistent across all HIS, the login() function still returns an object when importing data from SORMAS.

Once authentication credentials are provided, they are securely stored within the connection object. This removes the need to supply them again for subsequent requests in other functions. The figure below lists the arguments needed to call the login() function.

The type argument refers to the name of the data source of interest. The current version of the package covers the following types:

i) RDBMS: “ms sql”, “mysql”, “postgresql”, “sqlite”
ii) APIs: “dhis2”, “sormas”
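
The grouping of type values above can be sketched as a small validation step. This is only an illustration in base R: source_category() is a hypothetical helper, not a {readepi} function, and the package may check the type argument differently.

```r
# Hypothetical helper, for illustration only; not the {readepi} implementation.
# It checks a `type` value against the supported sources listed above and
# reports whether it refers to an RDBMS or to an API-based HIS.
source_category <- function(type) {
  rdbms_types <- c("ms sql", "mysql", "postgresql", "sqlite")
  api_types <- c("dhis2", "sormas")

  # match.arg() stops with an informative error for unsupported values
  type <- match.arg(tolower(type), c(rdbms_types, api_types))

  if (type %in% rdbms_types) "rdbms" else "api"
}

source_category("sqlite")
source_category("DHIS2")
```

Validating the value early means that an unsupported source fails with a clear message before any connection is attempted.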

Data import

Depending on the data source, use one of the following functions: read_rdbms(), read_dhis2(), or read_sormas().

Note that, when reading from RDBMS, the query argument can be either an SQL query or a list with a vector of table names, fields, and rows to subset on. For HIS, we strongly recommend reading the vignette on the query_parameters for more details about the request parameters supported in the current version of the package.
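
To illustrate the two query forms for RDBMS, the sketch below uses {DBI} with an in-memory SQLite database (this assumes the {RSQLite} package is installed). The build_select() helper, the list element names (table, fields, filter), and the patients table are all hypothetical and chosen for illustration; they do not describe how read_rdbms() is implemented or what its list format looks like.

```r
library(DBI)

# Hypothetical helper (not part of {readepi}): translate a list-style query
# into a SELECT statement. The element names are assumptions for this sketch.
# Values are pasted into the SQL string as-is, so only use trusted input.
build_select <- function(query) {
  fields <- if (is.null(query$fields)) "*" else paste(query$fields, collapse = ", ")
  sql <- sprintf("SELECT %s FROM %s", fields, query$table)
  if (!is.null(query$filter)) {
    sql <- paste(sql, "WHERE", query$filter)
  }
  sql
}

# In-memory SQLite database with a small made-up table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "patients", data.frame(
  id = 1:4,
  age = c(12, 35, 47, 60),
  sex = c("F", "M", "F", "M")
))

# The same request expressed as an SQL string and as a list
sql_query <- "SELECT id, age FROM patients WHERE age > 30"
list_query <- list(table = "patients", fields = c("id", "age"), filter = "age > 30")

from_sql <- dbGetQuery(con, sql_query)
from_list <- dbGetQuery(con, build_select(list_query))

dbDisconnect(con)
```

Both calls return the same data frame. The list form spares users from writing SQL, while the string form keeps the full expressiveness of the query language.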

Dependencies

These functions also require system dependencies on macOS and Linux systems, detailed in the install drivers vignette.

Additionally, developing the package requires the following packages:

- {checkmate}
- {httptest2}
- {bookdown}
- {rmarkdown}
- {testthat} (>= 3.0.0)
- {knitr}
- {cli}
- {DiagrammeR}
- {cyclocomp}

Contribute

There are no special requirements for contributing to {readepi}; please follow the package contributing guide.