---
title: "wrictools"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{wrictools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(wrictools)
```
Welcome!

This document is a guided tutorial to help you get started with the `wrictools` package.

<details>
  <summary>**Click here if this is your first time programming in R**</summary>
  If this is your first time programming in R please make sure you have the following installed:
  
  - **R**: The programming language we’ll be using for data analysis and visualization.
  - **A Programming Environment (RStudio)**: I recommend using RStudio, but feel free to use a different IDE.
  
  - **[Install R](https://cran.r-project.org/mirrors.html)**
  - **[Install RStudio](https://posit.co/download/rstudio-desktop)**
  
  ## FAQ and "Programming Terms"
  ### What is the difference between R and RStudio?
  **R** is a programming language, while **RStudio** is a so-called integrated development environment (IDE) designed specifically for R, offering a user-friendly interface for coding, plotting, and managing projects. You can think of R as a language like English, and of RStudio as Word: a program in which you can write English. You could just as well write in other programs, for example LibreOffice or LaTeX.
  
  ### R vs Rmd 
  There are different types of files in which you can write R code. A file ending in **.R** is a standard R script for writing and running R code, while an **.Rmd (R Markdown)** file lets you combine code and text for dynamic reports or, for example, this tutorial/vignette.
</details>



# Getting Started
We are in the process of submitting `wrictools` to CRAN, so for now you can install the current wrictools **development** version via GitHub:
```{r eval = FALSE}
library(remotes)
install_github("NinaZiegenbein/wrictools")
```

Once the package is on CRAN you can install the `wrictools` package from CRAN as normal:
```{r eval = FALSE}
install.packages("wrictools")
```

Once installed, load the package:

```{r}
library(wrictools)
```


## Preprocess WRIC data
Now let's preprocess the txt files created by the WRIC (Omnical software from Maastricht Instruments). The function `preprocess_wric_file()` separates the metadata at the top of the file (ID, comment etc.) from the actual data and creates data.frames (and optionally csv files) with that data, split by room and summarized across the two measurements for each room. This works both for files created with the "old" software version 1.x, which wrote multiple rooms into one file, and for files created with the current software version 2.x.
```{r}
data_txt <- system.file("extdata", "data.txt", package = "wrictools") # loading example data
result <- preprocess_wric_file(data_txt)

r1_metadata <- result$metadata$r1
r2_metadata <- result$metadata$r2
df_room1 <- result$dfs$room1
df_room2 <- result$dfs$room2
```

The function returns a named list with the elements `metadata` and `dfs`.
For software version 1, `metadata` contains `r1` and `r2`, and `dfs` contains `room1` and `room2`.  
For version 2, `metadata` and `dfs` each contain a single data frame, named `metadata` and `data` respectively. 
If `save_csv` is `TRUE`, the data.frames are saved as csv files named "id_comment_WRIC_data.csv" and "id_comment_WRIC_metadata.csv".

The example above uses data generated by software version 1, which is why it returns metadata and data.frames for both room 1 and room 2.

Let's take a quick look at the output for room 1:
```{r}
head(r1_metadata)
head(df_room1)
```

Let's see what the output would look like for data from software version 2:
```{r}
data_v2_txt <- system.file("extdata", "data_v2.txt", package = "wrictools") # loading example data
result <- preprocess_wric_file(data_v2_txt)

# For version 2 we only have data from one of the rooms, so only one metadata and one wric-data data.frame
metadata <- result$metadata$metadata
df <- result$dfs$data

# Let's look at the first rows (head) of the data.frame
head(df)
```
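
Because the shape of the returned list depends on the software version, code that should handle both can simply loop over whatever elements are present. The following is a minimal sketch that uses hand-made stand-in lists (`mock_v1`, `mock_v2`) instead of real output. The helper `stack_rooms()` and the added `source` column are illustrative assumptions and not part of `wrictools`.

```{r}
# Stand-ins that mimic the two list shapes described above (not real WRIC data)
mock_v1 <- list(dfs = list(
  room1 = data.frame(RER = c(0.80, 0.82)),
  room2 = data.frame(RER = c(0.90, 0.91))
))
mock_v2 <- list(dfs = list(
  data = data.frame(RER = c(0.85, 0.86, 0.87))
))

# Hypothetical helper: stack all data.frames in `dfs` and record where each row came from
stack_rooms <- function(result) {
  parts <- lapply(names(result$dfs), function(name) {
    part <- result$dfs[[name]]
    part$source <- name
    part
  })
  do.call(rbind, parts)
}

stacked_v1 <- stack_rooms(mock_v1)
stacked_v2 <- stack_rooms(mock_v2)
table(stacked_v1$source)
table(stacked_v2$source)
```

The same helper works for both shapes because it never refers to `room1`, `room2` or `data` by name.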


The `preprocess_wric_file()` function can do a lot more, though, and has many optional parameters. The following is the exact same function call, but with every optional parameter written out at its default value, that is, the value the parameter takes if you do not specify it.
```{r}
result <- preprocess_wric_file(
  filepath = data_txt, 
  code = "id", 
  manual = NULL, 
  save_csv = FALSE, 
  path_to_save = NULL, 
  combine = TRUE, 
  method = "mean", 
  start = NULL, 
  end = NULL, 
  notefilepath = NULL, 
  keywords_dict = NULL, 
  entry_exit_dict = NULL
)
```
Here are explanations and options for all parameters you can specify:

- **filepath** [String] Path to the WRIC .txt file.
- **code** [String] Method for generating subject IDs. Default is "id". Also possible are "id+comment", where the ID and comment values are combined, and "manual", where you specify your own.
- **manual** [String] Custom codes for the subjects in Room 1 and Room 2 if `code` is "manual".
- **save_csv** [Logical] Whether to save the extracted metadata and data to CSV files. Default is `FALSE`.
- **path_to_save** [String] Directory path for saving CSV files. `NULL` (the default) uses the current directory.
- **combine** [Logical] Whether to combine the S1 and S2 measurements. Default is `TRUE`.
- **method** [String] Method for combining measurements ("mean", "median", "s1", "s2", "min", "max"). Default is "mean".
- **start** [character, POSIXct or NULL] Rows before this datetime are removed, e.g. "2023-11-13 11:43:00". If `NULL`, the data starts at the first row.
- **end** [character, POSIXct or NULL] Rows after this datetime are removed, e.g. "2023-11-13 11:43:00". If `NULL`, the data ends at the last row.
- **notefilepath** [String] Path to the corresponding note file. If specified, the code tries to automatically extract the datetime and the current protocol specification (sleeping, exercising, eating etc.) from it. If possible, please read the [How To Note File](https://github.com/hulmanlab/wrictools/blob/main/HowToNoteFile.pdf) before you start your study, so that notes are taken consistently. If there is a timestamp in the note, e.g. "Participant starts eating at 16:10", the creation time of the note is overwritten with the time specified in the free text of the note. The "protocol" is extracted by keyword search. You can check the currently included keywords in the `keywords_dict` of the `extract_note_info()` function in the preprocessing.R file and extend them via the `keywords_dict` parameter.
- **keywords_dict** [Nested List] A "dictionary" with keywords for extracting protocol information from the note file.
- **entry_exit_dict** [List] Keywords that mark the participant entering (`start`) or exiting (`end`) the chamber; see the section on extracting data from the notefile below.

To explore the available functionality and arguments for key functions, simply call:
```{r}
?preprocess_wric_file
```
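
To make the `combine` and `method` parameters more concrete, here is a rough sketch of what combining two measurement sets row by row could look like. It runs on made-up numbers; the column names `S1_RER` and `S2_RER` and the helper `combine_measurements()` are assumptions for illustration and do not show how `wrictools` implements this internally.

```{r}
# Synthetic readings from two measurement sets; the column names are made up
readings <- data.frame(
  S1_RER = c(0.80, 0.84, 0.90),
  S2_RER = c(0.82, 0.80, 0.94)
)

# Hypothetical helper mirroring the documented `method` options
combine_measurements <- function(s1, s2, method = "mean") {
  both <- cbind(s1, s2)
  switch(method,
    mean   = rowMeans(both),
    median = apply(both, 1, median),
    s1     = s1,
    s2     = s2,
    min    = pmin(s1, s2),
    max    = pmax(s1, s2),
    stop("Unknown method: ", method)
  )
}

combine_measurements(readings$S1_RER, readings$S2_RER, method = "mean")
combine_measurements(readings$S1_RER, readings$S2_RER, method = "max")
```

With only two measurements per row, "mean" and "median" give the same value; "s1" and "s2" simply keep one of the two.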

### <span style="color:green">Your Turn</span>
So now it is your turn. Using the `preprocess_wric_file()` function, create a csv file from "data.txt" in the folder example_data.

1) create a csv with the name "XXXX_WRIC_data.csv" combining S1 and S2 measurements by taking the mean between them.
2) create a csv, but cut-off the start to 22:45 on 13/11/2023 and the end to 23:45 on the same day. The csv should be saved as "testing_start_end_parameter_WRIC_data.csv".
3) _Optional:_ Try out the `notefilepath` parameter and see what happens.


## Automatic note file extraction - adaptation to your notes
One helpful feature of the `preprocess_wric_file()` function is that it can automatically extract the protocol from the note file that is filled in manually during the experiment. "Protocol" here means a code for whether the participant is currently sleeping, eating, exercising etc. This makes it quick and easy to extract and compare specific periods, e.g. eating periods. Let's try it:

```{r}
note_txt <- system.file("extdata", "note.txt", package = "wrictools") # loading example data
result <- preprocess_wric_file(data_txt, 
                            notefilepath=note_txt)
head(result$dfs$room1)
```
When looking at `room1` now, we can see a new column called "protocol". We can see the file starts with 0 and at 22:41:21 changes to 1.

### <span style="color:green">Your Turn</span>
1) Look into the `note.txt` file and find out why there is a change at 22:41:21 and what 0 and 1 might represent. Are there more numbers? What do they represent?
2) When comparing with previous results, notice that the file now starts at a later time and stops at an earlier one. Why might that be?

_Attention:_ Since we keep reusing variable names (result, room1 etc.) and use the same data.txt file to create csv files, we overwrite those files and variables. That is completely fine for this tutorial, where the focus is on how to use the package and not on the results. But be careful in your own work!

## A bit more information about extracting data from the notefile
When specifying a notefilepath, the function will 

1) Check whether there is a time in the first row. If there is, this will be used to calculate the drift of the system. This drift will be added to all further datetimes you specify within the notefile.
2) Check whether there is information about the participant entering or exiting the chamber. If yes, the data is cut to only include times in which the participant is in the chamber.

<details><summary>**What are the keywords for entering/exiting?**</summary>
    start = c("enter", "entry", "ind i kammer", "ind")
    This is only checked in the first three rows. The reasoning is that the first row shows the time drift, and there might then be two rows, one for each participant, detailing their entry into the chamber.
    end = c("ud", "exit", "out")
    This is only checked for the last two rows.
    
    This package was developed in Denmark, which is why it includes Danish signal words. If your notefile is in a language other than English or Danish, you can specify your own keywords (or simply replace all of them with the English words):
    ```{r}
    # Example how to specify your own keywords in German
    entry_exit_dict <- list(
      end = c("aus", "raus", "Ausgang", "Ende"),
      start = c("rein", "in der Kammer", "innen", "hinein")
    )
    preprocess_wric_file(data_txt, entry_exit_dict = entry_exit_dict)
    ```
</details>

3) Read each row and check whether it contains a keyword that corresponds to one of the predefined keywords. If yes, change the label for that time and all following times until the next match. If you do not specify the `keywords_dict` parameter, a default dictionary of keywords and protocol values is used:
```{r}
keywords_dict <- list(
  sleeping = list(keywords = list(c("seng", "sleeping", "bed", "sove", "soeve", "godnat", "night", "sleep")), value = 1),
  eating = list(keywords = list(c("start", "begin", "began"), c("maaltid", "eat", "meal", "food", "spis", "maal", "mad", "frokost", "morgenmad", "middag", "snack", "aftensmad")), value = 2),
  stop_sleeping = list(keywords = list(c("vaagen", "vaekke", "wake", "woken", "vaagnet")), value = 0),
  stop_anything = list(keywords = list(c("faerdig", "stop", "end ", "finished", "slut")), value = 0),
  activity = list(keywords = list(c("start", "begin", "began"), c("step", "exercise", "physical activity", "active", "motion", "aktiv")), value = 3),
  ree_start = list(keywords = list(c("start", "begin", "began"), c("REE", "BEE", "BMR", "RMR", "RER")), value = 4)
)
```
If there are two lists, e.g. for `eating`, at least one word from each list needs to be present for the note to be classified as eating. The `value` at the end of each entry is the value used in the protocol column of the created data.frame.

4) Check whether there is a timestamp within the comment. If yes, the time drift is added (if available) and the time in the comment is used instead of the creation time of the note.
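
The steps above can be sketched in a few lines of base R. This is only an illustration on made-up notes, not the package's actual implementation: the helpers `time_in_text()` and `matches_all_groups()`, the shortened `rules` list, and in particular the sign convention for the drift (system time minus the time written in the first note) are assumptions.

```{r}
# Made-up notes: creation time as logged by the system, plus the free text
notes <- data.frame(
  created = as.POSIXct(c("2023-11-13 22:40:30", "2023-11-13 22:45:10",
                         "2023-11-13 23:20:05"), tz = "UTC"),
  text = c("Time check 22:40", "Participant in bed", "Started eating meal at 23:15"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first "HH:MM" in a note, placed on the note's own date
time_in_text <- function(text, created) {
  hhmm <- regmatches(text, regexpr("[0-2]?[0-9]:[0-5][0-9]", text))
  if (length(hhmm) == 0) return(NULL)
  as.POSIXct(paste(format(created, "%Y-%m-%d"), hhmm),
             format = "%Y-%m-%d %H:%M", tz = "UTC")
}

# Hypothetical helper: at least one keyword from EVERY group must occur in the text
matches_all_groups <- function(text, keyword_groups) {
  all(vapply(keyword_groups, function(group) {
    any(vapply(tolower(group), grepl, logical(1), x = tolower(text), fixed = TRUE))
  }, logical(1)))
}

# Shortened keyword dictionary in the same shape as `keywords_dict`
rules <- list(
  sleeping = list(keywords = list(c("bed", "sleep")), value = 1),
  eating   = list(keywords = list(c("start", "begin"), c("eat", "meal")), value = 2)
)

# Step 1: drift from the first row (assumed here: system time minus written time)
drift <- difftime(notes$created[1],
                  time_in_text(notes$text[1], notes$created[1]), units = "secs")

# Steps 3 and 4: label every later note and prefer a time stated in the text
events <- do.call(rbind, lapply(seq_len(nrow(notes))[-1], function(i) {
  hit <- Filter(function(rule) matches_all_groups(notes$text[i], rule$keywords), rules)
  if (length(hit) == 0) return(NULL)
  stated <- time_in_text(notes$text[i], notes$created[i])
  when <- if (is.null(stated)) notes$created[i] else stated + drift
  data.frame(datetime = when, protocol = hit[[1]]$value)
}))
events
```

The second note has no time in its text, so its creation time is kept; the third note states 23:15, so that time (plus the drift) replaces the creation time.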



### <span style="color:green">Your Turn</span>

1) Look at the `note_new.txt` notefile. Then use notefilepath and specify keywords_dict to automatically process the notefile. Use `data.txt` as the data file. Which comment might be hard to catch with keywords and should be avoided in the notefile?


## Batch Processing
Next, let's look at processing multiple files together. You might have all of your wric_data files in one folder and want to process them at the same time. This is an example of just that. 
```{r eval = FALSE}
# Specify the folder with the wric_data
data_folder <- "./example_data/my_project"

# Find all files in the folder that start with "Results_"
data_files <- list.files(data_folder, pattern = "^Results_", full.names = TRUE)

# Iterate over all files, call the function and save the csv-files in the same folder
for (data_file in data_files) {
  preprocess_wric_file(data_file, path_to_save = data_folder, code = "id+comment")
}
```

When you also want to process note files, the code becomes a little more complex, since you need to make sure that the correct files are processed together. You can pair files based on the shared date in the filename or, more labour-intensive but maybe easier, create a list of filename pairs by hand. Below you can see both options:

##### Option 1 - Pairs based on shared dates
```{r eval = FALSE}
data_folder <- "./example_data/my_project"

data_files <- list.files(data_folder, pattern = "^Results_.*_(\\d{12})\\.txt$", 
                        full.names = TRUE)
note_files <- list.files(data_folder, pattern = "^note_(\\d{12})\\.txt$", 
                        full.names = TRUE)

# Create a lookup table by extracting the 12-digit date from the filenames
note_lookup <- setNames(note_files, sub("^(note_)(\\d{12})\\.txt$", "\\2", 
                        basename(note_files)))

# Loop through the data files and match the date with the note_lookup
for (data_file in data_files) {
  date <- sub(".*_(\\d{12})\\.txt$", "\\1", basename(data_file))
  print(date)
  if (date %in% names(note_lookup)) {
    # [[ ]] returns the plain path string instead of a named vector
    preprocess_wric_file(data_file, notefilepath = note_lookup[[date]], 
                          path_to_save = data_folder, code = "id+comment")
    message("Processed: ", data_file)
  }
}
```

##### Option 2 - Based on File-Pairs
```{r, eval=FALSE}
# Manually specify the pairs of data files and note files (these are made up examples)
filename_pairs <- list(
  list(
    data_file = "./example_data/my_project/Results_1m_0101_202501130800.txt",
    note_file = "./example_data/my_project/note_202501130800.txt"
  ),
  list(
    data_file = "./example_data/my_project/Results_1m_0101_202501190800.txt",
    note_file = "./example_data/my_project/note_202501190800.txt"
  ),
  list(
    data_file = "./example_data/my_project/Results_1m_0101_202501250800.txt",
    note_file = "./example_data/my_project/note_202501250800.txt"
  )
)

# Loop through the filename pairs and process them
for (pair in filename_pairs) {
  preprocess_wric_file(pair$data_file, notefilepath = pair$note_file, 
                        path_to_save = "./example_data/my_project", 
                        code = "id+comment")
  message("Processed: ", pair$data_file, " and ", pair$note_file)
}
```

Your folder structure will look different, so adjust this code to fit your folders and files.
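
Before starting a long batch run, it can be worth checking that every data file actually has a matching note file. The following sketch does this on made-up filenames that follow the naming pattern used above; the helper `extract_stamp()` is hypothetical, and in practice the two vectors would come from `list.files()`.

```{r}
# Made-up filenames; in practice these would come from list.files()
data_names <- c("Results_1m_0101_202501130800.txt",
                "Results_1m_0101_202501190800.txt",
                "Results_1m_0101_202501250800.txt")
note_names <- c("note_202501130800.txt", "note_202501250800.txt")

# Hypothetical helper: the 12-digit timestamp directly before ".txt"
extract_stamp <- function(filenames) {
  sub(".*_(\\d{12})\\.txt$", "\\1", filenames)
}

# One row per data file; note_file is NA when no note shares the timestamp
file_pairs <- data.frame(
  data_file = data_names,
  note_file = note_names[match(extract_stamp(data_names), extract_stamp(note_names))]
)
file_pairs

# Data files without a matching note file
file_pairs$data_file[is.na(file_pairs$note_file)]
```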

## RedCap
You can also use RedCap's API (Application Programming Interface) to use files directly from RedCap, and also upload the resulting files. To loop over record IDs and process all files within a project on RedCap, use the `preprocess_wric_files` function. To find out more about using RedCap, please see the `RedCAP` vignette.

_ATTENTION!_ During processing, the data-file(s) will be downloaded and afterwards deleted again. If the data is not allowed to be on your personal device at any point, please use this package on a secure server, where you are allowed to (temporarily) store the data.


## Working with a subset of the data (specific time)
Often you are interested in a certain time period (e.g. after eating or during exercise) and want to perform some calculations based on those time frames. Let's look at how we would do that.

1) Import the preprocessed data (the csv file). _You can skip this step if you already have the data.frame, for example right after calling_ `preprocess_wric_file()`.
```{r eval = FALSE}
data <- read.csv("./example_data/my_project/XXXX_comment_WRIC_data.csv") 
head(data)
```

```{r}
result <- preprocess_wric_file(data_txt, notefilepath = note_txt)
data <- result$dfs$room1
head(data)
```

2) Let's extract the data that we are interested in. Let's start with the first time our participant is eating (breakfast) including 15min afterwards.
```{r}
# we take the first (1) instance where the protocol is 2 (eating)
breakfast_index <- which(data$protocol == 2)[1] 
print(breakfast_index)
# we create a new data.frame with the row at breakfast_index and the 14 rows after it
data_breakfast <- data[breakfast_index:(breakfast_index + 14),] 
head(data_breakfast) # Let's look at the data to check whether it worked correctly
```

Maybe we want to compare RER after breakfast with RER after dinner. So let's extract the dinner time. As participants are eating for some time (e.g 15min) there are 15 rows where protocol is 2. So it would not work to just take the second instance, but we need to identify transitions from another number to 2 and then choose the second transition.

```{r}
# we additionally check whether the row right before (lag) is not 2 and 
# then take the second instance (2) to get the dinner time
dinner_index <- which(data$protocol == 2 & dplyr::lag(data$protocol) != 2)[2] 
print(dinner_index)
data_dinner <- data[dinner_index:(dinner_index + 14),]
head(data_dinner) # Let's look at the data to check whether it worked correctly
```
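
If you would rather not depend on `dplyr` for this step, the same transitions can be found with base R. The sketch below uses a short synthetic `protocol` vector instead of the example data, so the numbers are purely illustrative.

```{r}
# Synthetic protocol column: two eating periods (value 2) separated by other states
protocol <- c(0, 0, 2, 2, 2, 0, 1, 1, 2, 2, 0)

# TRUE where the protocol is 2; FALSE is prepended so that an eating period
# starting in the very first row would also be detected
is_eating <- c(FALSE, protocol == 2)

# A period starts where the value steps from FALSE to TRUE (a difference of +1)
eating_starts <- which(diff(is_eating) == 1)
eating_starts
```

Here the eating periods start in rows 3 and 9, so `eating_starts[2]` plays the role of `dinner_index` above.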

3) Now we can compare the two data.frames. For this example let's use a paired t-test to check whether there are differences in RER between breakfast and dinner.

```{r}
t.test(data_breakfast$RER, data_dinner$RER, paired = TRUE)
```

_Please note that this is shortened synthetic example data, so the "dinner" is actually 15min after the "breakfast". This code only demonstrates how to use the package together with some helpful analysis code, so you cannot draw any conclusions from this random synthetic data, which is completely unphysiological._

Some more helpful functions you might want to use on your sub-dataframes:

- `add_relative_time(dataframe)` - Renumbers _relative_time_ column starting from 0. Might be more intuitive for further use. Example: `data_dinner <- add_relative_time(data_dinner)`
- `cut_rows`- With this function you can easily create a sub-dataframe (like we did above) based on datetime values (instead of the protocol value). Example: `data_dinner <- cut_rows(data, start="2023-11-14 20:04:00", end="2023-11-14 21:04:00")`

Of course, you can run these analyses batch-wise as well, in the same way as above. Here is an example:
```{r, eval=FALSE}
# Folder containing the preprocessed csv files (adjust to your own project)
folder_path <- "./example_data/my_project"
files <- list.files(folder_path, pattern = "_data\\.csv$", full.names = TRUE)
for (file in files) {
    data <- read.csv(file)
    breakfast_index <- which(data$protocol == 2)[1]
    data_breakfast <- data[breakfast_index:(breakfast_index + 59),] 
    dinner_index <- which(data$protocol == 2 & dplyr::lag(data$protocol) != 2)[2] 
    data_dinner <- data[dinner_index:(dinner_index + 59),]
    message("T-Test Result for: ", file)
    # print() is needed because results are not auto-printed inside a for loop
    print(t.test(data_breakfast$RER, data_dinner$RER, paired = TRUE))
}
```
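
When processing many files, it is often handier to collect the results in one table than to read them off the console. The following sketch shows one way to do that on randomly generated stand-in data; the list `sessions`, the helper `compare_meals()` and the period length of 5 rows are assumptions for illustration, and the resulting numbers have no physiological meaning.

```{r}
set.seed(1)  # synthetic numbers only, no physiological meaning

# Stand-ins for the per-file data.frames that read.csv() would return
sessions <- list(
  subject_a = data.frame(protocol = rep(c(2, 0, 2, 0), each = 5),
                         RER = runif(20, 0.7, 1.0)),
  subject_b = data.frame(protocol = rep(c(2, 0, 2, 0), each = 5),
                         RER = runif(20, 0.7, 1.0))
)

# Hypothetical helper: paired t-test between the first and second eating period
compare_meals <- function(session, n_rows = 5) {
  starts <- which(diff(c(FALSE, session$protocol == 2)) == 1)
  first  <- session$RER[starts[1]:(starts[1] + n_rows - 1)]
  second <- session$RER[starts[2]:(starts[2] + n_rows - 1)]
  test <- t.test(first, second, paired = TRUE)
  data.frame(mean_diff = unname(test$estimate), p_value = test$p.value)
}

# One row per session, named after the list elements
meal_results <- do.call(rbind, lapply(sessions, compare_meals))
meal_results
```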

## Visualizing
Let's try to visualize the data highlighted by the protocol.
```{r}
example_csv <- system.file("extdata", "example.csv", package = "wrictools") # loading example data
visualize_with_protocol(example_csv) 
```
We can see that it plotted RER (Respiratory Exchange Ratio) over time and highlighted the protocol. But what if we wanted to plot energy expenditure instead? To see all parameters we can specify, let's use `?function_name` again.

### <span style="color:green">Your Turn</span>

1) Use `?function_name` to find the parameters you can specify.
2) Try adjusting the `protocol_colors_labels` parameter.
3) Run it on the example_csv data.


Here is an example of batch processing over multiple files:
```{r eval = FALSE}
# Path to the folder containing the files
folder_path <- "example_data/my_project"

# Get all files ending with "_data.csv"
csv_files <- list.files(folder_path, pattern = "_data\\.csv$", full.names = TRUE)

protocol_colors_labels <- data.frame(
  protocol = c(0, 1, 2, 3, 4),
  color = c("white", "purple", "#4b3302", "#48c5a6", "#d0a4c6"),
  label = c("Normal", "Something", "Nothing", "Third Thing", "?")
)

for (file in csv_files) {
  visualize_with_protocol(file, plot="Energy Expenditure (kcal/min)", 
                          protocol_colors_labels = protocol_colors_labels,
                          save_png = TRUE)
}
```


That concludes this tutorial. You now know the basic functionality of the package and are ready to use it in your own projects and with real data. Have fun!
