ztils

Various utilities meant to aid in speeding up common statistical operations, such as:

- removing outliers and extremes
- generating probability density and cumulative distribution graphs with ggplot2
- running one-sample Kolmogorov-Smirnov tests against multiple distributions at once
- generating prediction plots with ggplot2
- scaling data and performing principal component analysis (PCA)
- plotting PCA with ggplot2

Installation

To install from CRAN:

install.packages("ztils")

To install the development version:

remotes::install_github("zachpeagler/ztils")

no_outliers()

Description

This function keeps only the rows of the dataframe whose value for the target variable lies between the first quartile minus 1.5 times the interquartile range (IQR) and the third quartile plus 1.5 times the IQR.

Usage

This function has no defaults, as it is entirely dependent on the user input.

no_outliers(data,
            var
            )

Arguments

data: the dataframe from which to remove rows containing outliers
var: the variable in which to look for outliers

Returns

Returns the specified dataframe data minus the rows containing outliers in the var variable.

Examples

no_outliers(iris, Sepal.Length)

This isn’t a great example because Sepal.Length in the iris dataset does not contain any values outside those bounds, so no rows are removed.
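The 1.5 x IQR rule described above can be sketched in base R. This is only an illustration under stated assumptions: `iqr_filter` is a hypothetical helper, not part of ztils; it takes the column name as a string, and it relies on `quantile()` with its default settings, which may differ from how no_outliers() computes quartiles. Sepal.Width is used here because, unlike Sepal.Length, it has a few values outside the bounds.

```r
# Hypothetical helper (not part of ztils): keep rows whose value in `col`
# lies within [Q1 - k * IQR, Q3 + k * IQR].
iqr_filter <- function(data, col, k = 1.5) {
  x <- data[[col]]
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  lower <- q[[1]] - k * iqr
  upper <- q[[2]] + k * iqr
  # Rows with NA in `col` are dropped as well, since they cannot be compared.
  data[!is.na(x) & x >= lower & x <= upper, , drop = FALSE]
}

filtered <- iqr_filter(iris, "Sepal.Width")
nrow(iris) - nrow(filtered)  # number of rows removed
```

Passing `k = 3` to the same sketch gives the wider bounds used for extremes.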

no_extremes()

Description

This function keeps only the rows of the dataframe whose value for the target variable lies between the first quartile minus 3.0 times the interquartile range (IQR) and the third quartile plus 3.0 times the IQR.

Usage

This function has no defaults, as it is entirely dependent on the user input.

no_extremes(data,
            var
            )

Arguments

data: the dataframe from which to remove rows containing extremes
var: the variable in which to look for extremes

Returns

Returns the specified dataframe data minus the rows containing extremes in the var variable.

Examples

no_extremes(iris, Sepal.Length)

This isn’t a great example because Sepal.Length in the iris dataset does not contain any extremes, so no rows are removed.
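To see how the 1.5 and 3.0 multipliers differ in practice, the following base R sketch counts the values that fall outside each set of bounds for every numeric column of iris. `count_outside` is a hypothetical helper written for this illustration, not a ztils function, and it assumes the default `quantile()` definition of quartiles.

```r
# Hypothetical helper (not part of ztils): count values outside
# [Q1 - k * IQR, Q3 + k * IQR] for one numeric vector.
count_outside <- function(x, k) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  sum(x < q[[1]] - k * iqr | x > q[[2]] + k * iqr, na.rm = TRUE)
}

numeric_cols <- iris[sapply(iris, is.numeric)]
fence_counts <- data.frame(
  variable = names(numeric_cols),
  outliers_1.5 = sapply(numeric_cols, count_outside, k = 1.5),
  extremes_3.0 = sapply(numeric_cols, count_outside, k = 3)
)
fence_counts
```

Because the 3.0 bounds are wider, the extremes count can never exceed the outliers count for the same variable.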

multipdf_cont()

Description

This function gets the probability density function (PDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).

Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.

Usage

multipdf_cont(var, 
              seq_length = 50, 
              distributions = "all"
              )

Returns

This function returns a dataframe with seq_length rows containing the real density of var and its probability density function under each selected distribution.

Arguments

var: the variable of which to get the PDF
- no default
seq_length: the number of points at which to evaluate the distributions
- default 50
distributions: the distributions to fit var against
- default “all”

Examples

multipdf_cont(iris$Petal.Length)

multipdf_cont(iris$Sepal.Length, 100, c("normal", "lognormal"))
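As a rough picture of the kind of table described above, the sketch below evaluates four fitted densities on a grid of seq_length points using only base R. It is not the ztils implementation: `pdf_grid` is a hypothetical helper, the column names are invented for the example, and the parameters are estimated with simple moment-based formulas, which may differ from whatever fitting method multipdf_cont() uses.

```r
# Hypothetical helper (not part of ztils): evaluate fitted densities on a grid.
# Assumes x is strictly positive, since log(x) is needed for the lognormal fit.
pdf_grid <- function(x, seq_length = 50) {
  grid <- seq(min(x), max(x), length.out = seq_length)
  m <- mean(x)
  v <- var(x)
  data.frame(
    x = grid,
    pdf_normal = dnorm(grid, mean = m, sd = sqrt(v)),
    pdf_lognormal = dlnorm(grid, meanlog = mean(log(x)), sdlog = sd(log(x))),
    # Method-of-moments estimates: shape = mean^2 / var, rate = mean / var.
    pdf_gamma = dgamma(grid, shape = m^2 / v, rate = m / v),
    pdf_exponential = dexp(grid, rate = 1 / m)
  )
}

pdf_table <- pdf_grid(iris$Sepal.Length, seq_length = 50)
head(pdf_table)
```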

multipdf_plot()

Description

This function extends multipdf_cont() and gets the probability density functions (PDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots the result using ggplot2 and a scico palette, using var_name for the plot labeling, if specified. If not specified, it will use var instead.

Usage

multipdf_plot(var, 
              seq_length = 50,
              distributions = "all", 
              palette = "oslo", 
              var_name = NULL
              )

Returns

A plot showing the PDF of the selected variable against the selected distributions over the selected sequence length.

Arguments

var: the variable of which to get the PDF
seq_length: the number of points at which to evaluate the distributions
distributions: the distributions to fit var against
palette: A scico palette to use on the graph, with each distribution corresponding to a color. For all possible palettes, call scico_palette_names().
var_name: A name to use in the title and x axis label of the plot.

Examples


multipdf_plot(iris$Sepal.Length)

multipdf_plot(iris$Sepal.Length,
              seq_length = 100,
              distributions = c("normal", "lognormal", "gamma"),
              palette = "bilbao",
              var_name = "Sepal Length (cm)"
              )

multicdf_cont()

Description

This function gets the cumulative distribution function (CDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).

Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.

Usage

multicdf_cont(var, 
              seq_length = 50, 
              distributions = "all"
              )

Returns

This function returns a dataframe with seq_length rows containing the real cumulative distribution of var and its cumulative distribution function under each selected distribution.

Arguments

var: the variable of which to get the CDF
- no default
seq_length: the number of points at which to evaluate the distributions
- default 50
distributions: the distributions to fit var against
- default “all”

Examples

multicdf_cont(iris$Petal.Length)

multicdf_cont(iris$Sepal.Length,
              100, 
              c("normal", "lognormal")
              )
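For intuition, an empirical CDF can be compared with fitted CDFs on a grid using base R alone. This is a sketch under assumptions, not the package's code: the column names are made up for the example, and the normal and lognormal parameters are plain sample estimates.

```r
x <- iris$Sepal.Length
grid <- seq(min(x), max(x), length.out = 50)

cdf_table <- data.frame(
  x = grid,
  # ecdf() returns a step function; calling it on the grid gives P(X <= grid).
  cdf_empirical = ecdf(x)(grid),
  cdf_normal = pnorm(grid, mean = mean(x), sd = sd(x)),
  # The lognormal fit needs strictly positive data because of log(x).
  cdf_lognormal = plnorm(grid, meanlog = mean(log(x)), sdlog = sd(log(x)))
)
head(cdf_table)
```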

multicdf_plot()

Description

This function extends multicdf_cont() and gets the cumulative distribution functions (CDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots the result using ggplot2 and a scico palette, using var_name for the plot labeling, if specified. If not specified, it will use var instead.

Usage

multicdf_plot(var, 
              seq_length = 50,
              distributions = "all", 
              palette = "oslo", 
              var_name = NULL
              )

Returns

A plot showing the CDF of the selected variable against the selected distributions over the selected sequence length.

Arguments

var: the variable of which to get the CDF
seq_length: the number of points at which to evaluate the distributions
distributions: the distributions to fit var against
palette: A scico palette to use on the graph, with each distribution corresponding to a color. For all possible palettes, call scico_palette_names().
var_name: A name to use in the title and x axis label of the plot.

Examples


multicdf_plot(iris$Sepal.Length)

multicdf_plot(iris$Sepal.Length,
              seq_length = 100,
              distributions = c("normal", "lognormal", "gamma"),
              palette = "bilbao",
              var_name = "Sepal Length (cm)"
              )

multiks_cont()

Description

This function gets the distance and p-value from a one-sample Kolmogorov-Smirnov (KS) test for selected distributions against a continuous input variable. Possible distributions include “normal”, “lognormal”, “gamma”, “exponential”, and “all”.

Usage

multiks_cont(var,
             distributions = "all"   
             )

Note: If using “lognormal” or “gamma” distributions, the target variable must be non-negative.

Arguments

var: The variable to perform one-sample KS tests on
distributions: The distributions to test against

Returns

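Returns a dataframe with the distance and p-value for each performed KS test. The distance is the KS statistic, the largest absolute difference between the empirical CDF of the target variable and the CDF of the tested distribution, so smaller values indicate a closer fit. A p-value > 0.05 indicates that the target variable’s distribution is not significantly different from the specified distribution at the 5% level.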

Examples

multiks_cont(iris$Sepal.Length)

multiks_cont(iris$Sepal.Length, c("normal", "lognormal"))
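The same kind of table can be approximated with `ks.test()` from the stats package. The sketch below is an assumption about the general approach, not the package's code: the parameters are simple sample estimates, the column names are invented, and warnings about ties are suppressed because iris measurements are rounded.

```r
x <- iris$Sepal.Length

# Each entry holds the CDF name followed by its estimated parameters.
fits <- list(
  normal = list("pnorm", mean = mean(x), sd = sd(x)),
  lognormal = list("plnorm", meanlog = mean(log(x)), sdlog = sd(log(x))),
  exponential = list("pexp", rate = 1 / mean(x))
)

ks_table <- do.call(rbind, lapply(names(fits), function(nm) {
  # Equivalent to ks.test(x, "pnorm", mean = ..., sd = ...) and so on.
  res <- suppressWarnings(do.call(ks.test, c(list(x), fits[[nm]])))
  data.frame(
    distribution = nm,
    distance = unname(res$statistic),
    p_value = res$p.value
  )
}))
ks_table
```

One caveat worth keeping in mind: when the distribution parameters are estimated from the same data being tested, the standard KS p-values tend to be too large, so they are best read as a relative guide rather than a strict test.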

glm_pseudor2()

Description

This function calculates the pseudo R^2 (an approximation of the proportion of variance explained by the model) for a generalized linear model (GLM). GLMs don’t have a true R^2 due to the intrinsic difference between a linear model and a generalized linear model, but we can still calculate an approximation of the R^2 as (1 - (deviance/null deviance)).

Usage

glm_pseudor2(mod)

Arguments

mod: The glm object to calculate a pseudo-R^2 for.

Returns

Returns the pseudo R^2 value of the model.

Examples

gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)
glm_pseudor2(gmod)
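The formula from the description can be checked by hand using the `deviance` and `null.deviance` components that every glm object carries. For a Gaussian GLM with the identity link (the glm() default), the deviance is the residual sum of squares, so the result should match the ordinary R^2 of the equivalent lm() fit.

```r
gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Pseudo R^2 as 1 - (deviance / null deviance).
pseudo_r2 <- 1 - gmod$deviance / gmod$null.deviance

# Same model fitted with lm() for comparison.
lm_r2 <- summary(lm(Sepal.Length ~ Petal.Length + Species, data = iris))$r.squared

all.equal(pseudo_r2, lm_r2)
```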

pca_plot()

Description

This function performs a principal component analysis (PCA) for the selected pcavars with the option to automatically scale the variables. It then graphs PC1 on the x-axis and PC2 on the y-axis using ggplot2, coloring the graph with a scico palette over the specified groups. This is similar to the biplot command from the stats package, but performs all the steps required in graphing a PCA for you.

Usage

pca_plot(group,
         pcavars,
         scaled = FALSE,
         palette = "oslo"
         )

Arguments

group: The group column, used for assigning colors.
pcavars: The variables (columns) to perform a principal component analysis on. Should be explanatory variables and not response variables.
scaled: A boolean (TRUE or FALSE) indicating whether the pcavars have already been scaled or if they should be scaled in the function.
palette: A scico palette used to color the graph. For all possible palettes, call scico_palette_names(). If non-scico palettes are desired, the palette can be overridden with scale_color and scale_fill functions.

Returns

A ggplot object showing PC1 on the x-axis and PC2 on the y-axis, colored by group, with vectors and labels showing the individual PCA variables.

Examples

pca_plot(iris$Species, iris[,c(1:4)])

pca_plot(iris$Species, iris[,c(1:4)], FALSE, "bilbao")

pca_data()

Description

This function performs a principal component analysis (PCA) on the specified variables, pcavars, and attaches the resulting principal components to the specified dataframe, data, with optional variable scaling.

Usage

pca_data(data,
         pcavars,
         scaled = FALSE
         )

Arguments

data: The dataframe to attach principal components to.
pcavars: The variables to use in the principal component analysis.
scaled: A logical value (TRUE or FALSE) indicating if pcavars have already been scaled or if they should be scaled in the function.

Returns

Returns a dataframe with principal components as additional columns.

Examples

pca_data(iris, iris[,c(1:4)], FALSE)
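For readers who want to see the individual steps, a comparable result can be produced with `prcomp()` from the stats package. This is a sketch that assumes a prcomp-style PCA; the source does not say which PCA routine pca_data() uses internally, and the column names it produces may differ.

```r
pcavars <- iris[, 1:4]

# scale. = TRUE standardizes each variable to unit variance before the PCA,
# which corresponds to letting the function do the scaling (scaled = FALSE).
pca <- prcomp(pcavars, center = TRUE, scale. = TRUE)

# pca$x holds the principal component scores, one row per observation.
iris_pca <- cbind(iris, pca$x)
names(iris_pca)
```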

predict_plot()

Description

This function performs a prediction based on the supplied model, then graphs it using ggplot2. Options are available for predicting based on the confidence or prediction interval, as well as for applying corrections, such as exponential and logistic.

I would like to alter this function to reduce the number of required inputs, as all the information should be available from the model call, but that’s a work in progress.

Usage

predict_plot(mod,
             data,
             rvar,
             pvar,
             group = NULL,
             length = 50,
             interval = "confidence",
             correction = "normal",
             palette = "oslo"
             )

Arguments

mod: A univariate linear model to base predictions on.
data: The dataframe used in the model. Will be used to pull variables for plotting.
rvar: The response variable (y-axis), must be the same as the one in the model.
pvar: The predictor variable (x-axis), must be the same as the one in the model.
group: An optional grouping variable. If a group is present, separate predictions will be made for each group.
length: The length to predict over. A longer length will result in more precision.
interval: Tells the function to predict over either the confidence interval or the prediction interval.
- “confidence” or “prediction”
correction: If you log transform or logit transform the variables in the model, you can choose to apply a correction to the predicted output to reverse that transformation.
- “normal”, “exponential”, or “logit”
palette: A scico palette used to color the graph. For all possible palettes, call scico_palette_names(). If non-scico palettes are desired, the palette can be overridden with scale_color and scale_fill functions.

Returns

Returns a plot with the observed (real) data plotted as points and the prediction plotted as lines, with a 95% confidence or prediction interval.

This function has a known issue with the colors on ungrouped predictions being kind of funky, as the function uses the predictor variable (x-axis) for the color, which works for the actual data (points), but doesn’t translate well to the predicted lines and ribbon.

Examples

mod1 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
predict_plot(mod1, iris, Sepal.Length, Petal.Length, Species)
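The prediction step behind a plot like this can be sketched with `predict()` on an lm object. The code below is an illustration under assumptions rather than the function's internals: it builds a grid of `length_out` predictor values per group (the name `length_out` is chosen for this example) and requests a 95% confidence interval; passing `interval = "prediction"` would give the prediction interval instead.

```r
mod1 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
length_out <- 50

# One row per combination of predictor value and group. Note that this grid
# spans the full Petal.Length range for every species, not each group's own range.
newdata <- expand.grid(
  Petal.Length = seq(min(iris$Petal.Length), max(iris$Petal.Length),
                     length.out = length_out),
  Species = levels(iris$Species)
)

# predict() returns a matrix with columns fit, lwr, and upr.
pred <- predict(mod1, newdata = newdata, interval = "confidence", level = 0.95)
pred_table <- cbind(newdata, as.data.frame(pred))
head(pred_table)
```

If the response had been log-transformed in the model, applying `exp()` to the fit, lwr, and upr columns would be one way to return them to the original scale.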

Bug reporting

If you find any bugs, please report them at https://github.com/zachpeagler/ztils/issues.