Type: Package
Title: Datasets and Basic Statistics for Symbolic Data Analysis
Version: 0.1.8
Date: 2026-02-11
Author: Po-Wei Chen [aut], Chun-houh Chen [aut], Han-Ming Wu [cre]
Maintainer: Han-Ming Wu <wuhm@g.nccu.edu.tw>
Description: Collects a diverse range of symbolic data and offers a comprehensive set of functions that facilitate the conversion of traditional data into the symbolic data format.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
Depends: R (≥ 4.0.0)
Suggests: testthat (≥ 2.1.0), knitr, rmarkdown
VignetteBuilder: knitr
Imports: magrittr, tidyr, dplyr, RSDA, HistDAWass
NeedsCompilation: no
Packaged: 2026-02-11 14:31:09 UTC; hmwu
Repository: CRAN
Date/Publication: 2026-02-11 15:00:02 UTC

MM to RSDA

Description

To convert MM format interval dataframe to RSDA format (symbolic_tbl).

Usage

MM_to_RSDA(data)

Arguments

data

The dataframe with the MM format (paired _min/_max columns).

Value

Return a symbolic_tbl dataframe with complex-encoded interval columns.

Examples

data(mushroom.int)
mm <- RSDA_to_MM(mushroom.int, RSDA = FALSE)
rsda <- MM_to_RSDA(mm)
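
The "complex-encoded" interval columns mentioned under Value can be pictured with a small base-R sketch. It is not the package's implementation: the toy data frame and the convention (lower bound in the real part, upper bound in the imaginary part) are assumptions made for illustration.

```r
# Toy MM-format table: one _min and one _max column per variable (invented data).
mm_toy <- data.frame(
  Length_min = c(1.0, 2.5, 4.0),
  Length_max = c(3.0, 6.0, 4.5)
)

# Encode each interval as one complex number: lower bound in the real part,
# upper bound in the imaginary part (assumed convention).
length_cplx <- complex(real = mm_toy$Length_min, imaginary = mm_toy$Length_max)

# Decoding with Re() and Im() recovers the paired _min/_max columns.
decoded <- data.frame(
  Length_min = Re(length_cplx),
  Length_max = Im(length_cplx)
)
print(length_cplx)
print(decoded)
```

Packing both bounds into one vector keeps one column per variable while still allowing the bounds to be extracted with ordinary arithmetic functions.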

MM to iGAP

Description

To convert MM format to iGAP format.

Usage

MM_to_iGAP(data)

Arguments

data

The dataframe with the MM format.

Value

Return a dataframe with the iGAP format.

Examples

data(face.iGAP)
face <- iGAP_to_MM(face.iGAP, 1:6)
MM_to_iGAP(face)
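
The relationship between the two layouts can be sketched in base R. The sketch assumes that an iGAP cell stores an interval as a single "lower,upper" text value, so that one iGAP column corresponds to a pair of _min/_max columns in MM format; the toy data and the helper split_interval are hypothetical and not part of the package.

```r
# Hypothetical iGAP-style table: each cell holds "lower,upper" as text (assumed layout).
igap_toy <- data.frame(
  Length = c("1.0,3.0", "2.5,6.0"),
  Width  = c("0.5,0.9", "1.1,2.0"),
  stringsAsFactors = FALSE
)

# Split one "lower,upper" column into numeric _min/_max columns.
split_interval <- function(col, name) {
  parts <- do.call(rbind, strsplit(col, ",", fixed = TRUE))
  out <- data.frame(as.numeric(parts[, 1]), as.numeric(parts[, 2]))
  names(out) <- paste0(name, c("_min", "_max"))
  out
}

# unname() keeps cbind from prefixing the new column names with the list names.
mm_toy <- do.call(cbind, unname(Map(split_interval, igap_toy, names(igap_toy))))

# Going back: paste the two bounds into one text cell per variable.
length_igap <- paste(mm_toy$Length_min, mm_toy$Length_max, sep = ",")
print(mm_toy)
print(length_igap)
```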

RSDA Format

Description

This function changes the format of the data to conform to RSDA format.

Usage

RSDA_format(data, sym_type1 = NULL, location = NULL, sym_type2 = NULL, var = NULL)

Arguments

data

A conventional data frame.

sym_type1

Labels for the symbolic type: I denotes an interval variable and S denotes a set variable.

location

The location of the sym_type in the data.

sym_type2

Labels for the symbolic type: I denotes an interval variable and S denotes a set variable.

var

The name of the symbolic variable in the data.

Value

Return a dataframe in which a type label is added in the column preceding each symbolic variable.

Examples

data("mushroom")
mushroom.set <- set_variable_format(data = mushroom, location = 8, var = "Species")
mushroom.tmp <- RSDA_format(data = mushroom.set, sym_type1 = c("I", "S"),
                            location = c(25, 31), sym_type2 = c("S", "I", "I"),
                            var = c("Species", "Stipe.Length_min", "Stipe.Thickness_min"))

RSDA to MM

Description

To convert RSDA format interval dataframe to MM format.

Usage

RSDA_to_MM(data, RSDA)

Arguments

data

An interval dataframe in the RSDA format.

RSDA

Whether to load the RSDA package.

Value

Return a dataframe with the MM format.

Examples

data(mushroom.int)
RSDA_to_MM(mushroom.int, RSDA = FALSE)

RSDA to iGAP

Description

To convert RSDA format interval dataframe to iGAP format.

Usage

RSDA_to_iGAP(data)

Arguments

data

An interval dataframe in the RSDA format.

Value

Return a dataframe with the iGAP format.

Examples

data(mushroom.int)
RSDA_to_iGAP(mushroom.int)

SODAS to MM

Description

To convert SODAS format interval dataframe to the MM format.

Usage

SODAS_to_MM(XMLPath)

Arguments

XMLPath

Path on disk to the SODAS *.XML file.

Value

Return a dataframe with the MM format.

Examples

## Not run:
data(abalone.int)

SODAS to iGAP

Description

To convert SODAS format interval dataframe to the iGAP format.

Usage

SODAS_to_iGAP(XMLPath)

Arguments

XMLPath

Path on disk to the SODAS *.XML file.

Value

Return a dataframe with the iGAP format.

Examples

## Not run:
data(abalone.int)

Abalone Dataset (iGAP Format)

Description

Interval-valued dataset of 24 units from the UCI Abalone dataset, aggregated by sex and age. iGAP format for matrix visualization.

Usage

data(abalone.iGAP)

Format

An object of class data.frame with 24 rows and 7 columns.

Source

UCI Machine Learning Repository.

References

Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.

Examples

data(abalone.iGAP)

Abalone Interval Dataset

Description

Interval-valued dataset of 24 units from the UCI Abalone dataset, aggregated by sex and age. Standard data frame format.

Usage

data(abalone.int)

Format

An object of class data.frame with 24 rows and 14 columns.

Source

UCI Machine Learning Repository.

Examples

data(abalone.int)

Acid Rain Pollution Indices Interval Dataset

Description

Interval-valued acid rain pollution indices for sulphates and nitrates (kg/hectares) by US state.

Usage

data(acid_rain.int)

Format

A data frame with 2 observations and 2 interval-valued variables:

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.21.

Examples

data(acid_rain.int)

Age-Cholesterol-Weight Interval Dataset

Description

Interval-valued dataset relating age, cholesterol, and weight measurements.

Usage

data(age_cholesterol_weight.int)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 7 rows and 4 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(age_cholesterol_weight.int)

JFK Airport Airline Flights Histogram-Valued Dataset

Description

Histogram-valued dataset of 16 airlines flying into JFK Airport. Six variables (Flight Time, Taxi In, Arrival Delay, Taxi Out, Departure Delay, Weather Delay) recorded as frequency distributions.

Usage

data(airline_flights.hist)

Format

An object of class data.frame with 16 rows and 17 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.7.

Examples

data(airline_flights.hist)

JFK Airport Airline Flights Modal-Valued Dataset

Description

Modal-valued version of the airline flights dataset. See airline_flights.hist.

Usage

data(airline_flights2)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 16 rows and 6 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.7.

Examples

data(airline_flights2)

Bank Interest Rates AR Model Symbolic Dataset

Description

Symbolic dataset of autoregressive time series models for 4 banks. Each bank is described by AR model order, parameters, and noise variance.

Usage

data(bank_rates)

Format

An object of class data.frame with 4 rows and 6 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.9.

Examples

data(bank_rates)

Baseball Teams Interval Dataset

Description

Interval-valued data for baseball teams with player statistics.

Usage

data(baseball.int)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 19 rows and 3 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(baseball.int)

Bat Species Interval Dataset

Description

Interval-valued data for 21 bat species described by 4 morphological measurements. Benchmark dataset for matrix visualization.

Usage

data(bats.int)

Format

A data frame with 21 observations and 4 interval-valued variables:

Details

Used to demonstrate color coding schemes, the HCT-R2E seriation algorithm, and distance measure comparisons (Gowda-Diday, Hausdorff, City-Block, L1, L2, etc.) for interval data.

References

Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.

Examples

data(bats.int)

Bird Species Mixed Symbolic Dataset

Description

Mixed symbolic data for bird species with interval-valued morphological measurements and categorical symbolic variables (habitat, color).

Usage

data(bird.mix)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 20 rows and 2 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.5.

Examples

data(bird.mix)

Bird Species Mixed Symbolic Dataset

Description

Symbolic data for 3 bird species (Swallow, Ostrich, Penguin) with interval-valued size, categorical flying, and categorical migration. Foundational SDA example from 600 individual bird observations.

Usage

data(bird_species.mix)

Format

A data frame with 3 observations and 3 symbolic variables:

References

Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 1.2, p.6.

Examples

data(bird_species.mix)

Bird Species Extended Mixed Symbolic Dataset

Description

Three bird species (Geese, Ostrich, Penguin) with interval-valued height, histogram-valued color distribution, and categorical flying/migratory variables.

Usage

data(bird_species_extended.mix)

Format

A data frame with 3 observations and 4 symbolic variables.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.19.

Examples

data(bird_species_extended.mix)

Blood Pressure Interval Dataset

Description

Interval-valued blood pressure measurements by patient groups.

Usage

data(blood_pressure.int)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 15 rows and 3 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(blood_pressure.int)

Car Models Interval Dataset

Description

Interval-valued data for car models with price, engine, speed, acceleration.

Usage

data(car.int)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 8 rows and 5 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(car.int)

Cars Interval Dataset

Description

Interval-valued data for 27 car models classified into four classes (Utilitarian, Berlina, Sportive, Luxury), described by Price, EngineCapacity, TopSpeed and Acceleration intervals.

Usage

data(cars.int)

Format

A data frame with 27 observations and 5 variables.

Source

https://CRAN.R-project.org/package=MAINT.Data

References

Duarte Silva, A.P., Brito, P., Filzmoser, P. and Dias, J.G. (2021). MAINT.Data: Modelling and Analysing Interval Data in R. R Journal, 13(2).

Examples

data(cars.int)

China Meteorological Stations Quarterly Temperature Interval Dataset

Description

Interval-valued temperature data (Celsius) for 60 Chinese meteorological stations observed over the four quarters of years 1974 to 1988. One outlier observation (YinChuan_1982) has been discarded.

Usage

data(china_temp.int)

Format

A data frame with 899 observations and 5 variables.

Details

Originates from the Long-Term Instrumental Climatic Database of the People's Republic of China. Widely used in the SDA literature for demonstrating standardization, clustering, self-organizing maps, MLE and MANOVA.

Source

https://CRAN.R-project.org/package=MAINT.Data

References

Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with Normal and Skew-Normal distributions. J. Appl. Stat., 39(1), 3-20.

Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.

Examples

data(china_temp.int)

clean_colnames

Description

This function is used to clean up variable names to conform to the RSDA format.

Usage

clean_colnames(data)

Arguments

data

A conventional data frame.

Value

Data after cleaning variable names.

Examples

data(mushroom)
mushroom.clean <- clean_colnames(data = mushroom)

Credit Card Expenses Interval Dataset

Description

Interval-valued credit card spending aggregated by person-month. Three individuals' (Jon, Tom, Leigh) monthly expenditures across five categories.

Usage

data(credit_card.int)

Format

A data frame with person-month rows and 5 interval-valued columns:

Details

The original classical dataset (Table 2.3) records individual transactions. The symbolic version (Table 2.4) aggregates into interval-valued observations for each person-month combination.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Tables 2.3-2.4.

Examples

data(credit_card.int)
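
The aggregation described under Details, from individual transactions to one interval per person-month, can be sketched in base R. The transaction log below is invented for illustration and does not reproduce the values in Tables 2.3-2.4; only one spending category is shown.

```r
# Hypothetical transaction log; names and amounts are invented.
tx <- data.frame(
  person = c("Jon", "Jon", "Jon", "Tom", "Tom"),
  month  = c("Feb", "Feb", "Mar", "Feb", "Feb"),
  food   = c(23.5, 31.0, 18.2, 40.1, 35.6)
)

# One row per person-month: the observed minimum and maximum form the interval.
lo <- aggregate(food ~ person + month, data = tx, FUN = min)
hi <- aggregate(food ~ person + month, data = tx, FUN = max)
food_int <- merge(lo, hi, by = c("person", "month"),
                  suffixes = c("_min", "_max"))
print(food_int)
```

A person-month with a single transaction yields a degenerate interval whose lower and upper bounds coincide.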

Crime Demographics Dataset

Description

Crime-related demographic variables with symbolic data types.

Usage

data(crime)

Format

An object of class data.frame with 15 rows and 7 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(crime)

Crime Demographics Modal-Valued Dataset

Description

Modal-valued version of the crime demographics dataset.

Usage

data(crime2)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 15 rows and 3 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(crime2)

European Employment by Gender and Age Interval Dataset

Description

Interval-valued proportions for 12 sex-age population groups across employment variables (employment type, education, industry sector, occupation, marital status). Used for factorial discriminant analysis.

Usage

data(employment.int)

Format

A data frame with 12 sex-age group observations and interval-valued proportion variables.

References

Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 18.1.

Examples

data(employment.int)

US Energy Consumption Distribution-Valued Dataset

Description

Distribution-valued dataset of energy consumption across US states. Each energy type described by Normal distribution parameters (mean, SD).

Usage

data(energy_consumption.distr)

Format

A data frame with 5 observations and 3 variables:

Details

Five types: Petroleum, Natural Gas, Coal, Hydroelectric, Nuclear Power. Values are rescaled consumption from the US Census Bureau (2004).

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.8.

Examples

data(energy_consumption.distr)

Face Dataset (iGAP Format)

Description

Symbolic data matrix with all interval-type variables for facial measurements, in iGAP format.

Usage

data(face.iGAP)

Format

An object of class data.frame with 27 rows and 6 columns.

References

Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.

Examples

data(face.iGAP)

Finance Sector Interval Dataset

Description

Interval-valued data for 14 business sectors described by job-related financial variables (job cost codes, activity codes, budgets). Used for PCA demonstrations.

Usage

data(finance.int)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 14 rows and 7 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 5.2.

Examples

data(finance.int)

Fuel Consumption by Region Dataset

Description

Modal-valued dataset describing fuel consumption patterns across 10 regions by proportions of heating fuel types (gas, oil, electricity, coal, none) and central heating presence.

Usage

data(fuel_consumption)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 10 rows and 3 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 3.7.

Examples

data(fuel_consumption)

Health Insurance Mixed Symbolic Dataset

Description

Health insurance data grouped by disease type and gender with classical and symbolic variables of mixed types.

Usage

data(health_insurance.mix)

Format

An object of class data.frame with 51 rows and 30 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Tables 2.1-2.2.

Examples

data(health_insurance.mix)

Health Insurance Modal-Valued Dataset

Description

Modal-valued version of the health insurance dataset.

Usage

data(health_insurance2)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 6 rows and 6 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.2b.

Examples

data(health_insurance2)

Hierarchy Dataset

Description

Classical dataset illustrating hierarchical data structures.

Usage

data(hierarchy)

Format

An object of class data.frame with 20 rows and 6 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.15.

Examples

data(hierarchy)

Hierarchy Interval Dataset

Description

Interval-valued version of the hierarchy dataset.

Usage

data(hierarchy.int)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 20 rows and 6 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.15.

Examples

data(hierarchy.int)

Statistics for Histogram Data

Description

Functions to compute the mean, variance, covariance, and correlation of histogram-valued data.

Usage

hist_mean(x, var_name, method = "BG", ...)

hist_var(x, var_name, method = "BG", ...)

hist_cov(x, var_name1, var_name2, method = "BG")

hist_cor(x, var_name1, var_name2, method = "BG")

Arguments

x

histogram-valued data object.

var_name

the variable name or the column location.

method

methods to calculate statistics: mean and var: BG (default), L2W; cov and cor: BG (default), BD, B, L2W.

...

additional parameters.

var_name1

the variable name or the column location.

var_name2

the variable name or the column location.

Details

...

Value

A numeric value: the mean, variance, covariance, or correlation.

Author(s)

Po-Wei Chen, Han-Ming Wu

See Also

int_mean int_var int_cov int_cor

Examples

library(HistDAWass)
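
As a rough illustration of what a histogram mean involves, the base-R sketch below treats each observation as a set of bins with relative frequencies and assumes values are uniformly spread within each bin. The list structure and the helper hist_unit_mean are hypothetical; they are not the HistDAWass classes or the package's own code, and the correspondence with the BG method is an assumption.

```r
# Two histogram-valued observations (invented): bin breaks and relative frequencies.
hist_obs <- list(
  list(breaks = c(0, 10, 20), p = c(0.4, 0.6)),
  list(breaks = c(0, 10, 30), p = c(0.5, 0.5))
)

# Mean of one histogram: frequency-weighted average of the bin midpoints.
hist_unit_mean <- function(h) {
  mids <- (head(h$breaks, -1) + tail(h$breaks, -1)) / 2
  sum(h$p * mids)
}

unit_means <- vapply(hist_obs, hist_unit_mean, numeric(1))

# Sample mean of the histogram-valued variable: average over observations.
overall_mean <- mean(unit_means)
print(unit_means)
print(overall_mean)
```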

Horse Breeds Interval Dataset

Description

Interval-valued data for 8 horse breeds (CES, CMA, PEN, TES, CEN, LES, PES, PAM) described by 6 variables: minimum/maximum weight, minimum/maximum height, cost of mares, cost of fillies.

Usage

data(horses.int)

Format

A data frame with 8 observations and 6 interval-valued variables.

Details

Extensively used in SDA for demonstrating divisive clustering, distance computation, hierarchy/pyramid construction, and complete objects.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 7.14.

Examples

data(horses.int)

iGAP to MM

Description

To convert iGAP format to MM format.

Usage

iGAP_to_MM(data, location)

Arguments

data

The dataframe with the iGAP format.

location

The location of the symbolic variable in the data.

Value

Return a dataframe with the MM format.

Examples

data(abalone.iGAP)
abalone <- iGAP_to_MM(abalone.iGAP, 1:7)

iGAP to RSDA

Description

To convert iGAP format interval dataframe to RSDA format (symbolic_tbl).

Usage

iGAP_to_RSDA(data, location)

Arguments

data

The dataframe with the iGAP format.

location

The location of the symbolic variable in the data.

Value

Return a symbolic_tbl dataframe with complex-encoded interval columns.

Examples

data(abalone.iGAP)
rsda <- iGAP_to_RSDA(abalone.iGAP, 1:7)

Convert Interval Data Format

Description

Automatically detect the format of interval data and convert it to the target format.

Usage

int_convert_format(x, to = "MM", from = NULL, ...)

Arguments

x

interval data in one of the supported formats

to

target format: "MM", "iGAP", "RSDA", "SODAS" (default: "MM")

from

source format (optional): "MM", "iGAP", "RSDA", "SODAS". If NULL, will auto-detect.

...

additional parameters passed to specific conversion functions

Details

This function provides a unified interface for all interval format conversions. It automatically detects the source format (unless specified) and applies the appropriate conversion function.

Supported conversions:

Value

Interval data in the target format

Author(s)

Han-Ming Wu

See Also

int_detect_format int_list_conversions RSDA_to_MM iGAP_to_MM MM_to_iGAP MM_to_RSDA iGAP_to_RSDA

Examples

# Auto-detect and convert to MM
data(mushroom.int)
data_mm <- int_convert_format(mushroom.int, to = "MM")

# Explicitly specify source format
data(abalone.iGAP)
data_mm <- int_convert_format(abalone.iGAP, from = "iGAP", to = "MM")

# Convert MM to iGAP
data_igap <- int_convert_format(data_mm, to = "iGAP")

# Convert multiple datasets to MM
datasets <- list(mushroom.int, abalone.int, car.int)
mm_datasets <- lapply(datasets, int_convert_format, to = "MM")

# Check what conversions are available
int_list_conversions()

Detect Interval Data Format

Description

Automatically detect the format of interval data.

Usage

int_detect_format(x)

Arguments

x

interval data in unknown format

Details

Detection rules:

Value

A character string indicating the detected format: "RSDA", "MM", "iGAP", "SODAS", or "unknown"

Examples

data(mushroom.int)
int_detect_format(mushroom.int)  # Should return "RSDA"

data(abalone.iGAP)
int_detect_format(abalone.iGAP)  # Should return "iGAP"

List Available Format Conversions

Description

List all available format conversion functions.

Usage

int_list_conversions(from = NULL, to = NULL)

Arguments

from

source format (optional): "RSDA", "MM", "iGAP", "SODAS"

to

target format (optional): "RSDA", "MM", "iGAP", "SODAS"

Value

A data.frame showing available conversions

Examples

# List all conversions
int_list_conversions()

# List conversions from RSDA
int_list_conversions(from = "RSDA")

# List conversions to MM
int_list_conversions(to = "MM")

Distance Measures for Interval Data

Description

Functions to compute various distance measures between interval-valued observations.

int_dist_all computes all available distance measures at once.

Usage

int_dist(x, method = "euclidean", gamma = 0.5, q = 1, p = 2, ...)

int_dist_matrix(x, method = "euclidean", gamma = 0.5, q = 1, p = 2, ...)

int_pairwise_dist(x, var_name1, var_name2, method = "euclidean", ...)

int_dist_all(x, gamma = 0.5, q = 1)

Arguments

x

interval-valued data with symbolic_tbl class, or an array of dimension [n, p, 2]

method

distance method: "GD", "IY", "L1", "L2", "CB", "HD", "EHD", "nEHD", "snEHD", "TD", "WD", "euclidean", "hausdorff", "manhattan", "city_block", "minkowski", "wasserstein", "ichino", "de_carvalho"

gamma

parameter for the Ichino-Yaguchi distance, 0 <= gamma <= 0.5 (default: 0.5)

q

parameter for the Ichino-Yaguchi distance (Minkowski exponent) (default: 1)

p

power parameter for Minkowski distance (default: 2)

...

additional parameters

var_name1

first variable name or column location

var_name2

second variable name or column location

Details

Available distance methods:

Value

A distance matrix (class 'dist') or numeric vector

Author(s)

Han-Ming Wu

References

Gowda, K. C., & Diday, E. (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition, 24(6), 567-578.

Ichino, M. (1988). General metrics for mixed features. Systems and Computers in Japan, 19(2), 37-50.

Chavent, M., & Lechevallier, Y. (2002). Dynamical clustering of interval data. In Classification, Clustering and Data Analysis (pp. 53-60). Springer.

Tran, L., & Duckstein, L. (2002). Comparison of fuzzy numbers using a fuzzy distance measure. Fuzzy Sets and Systems, 130, 331-341.

Verde, R., & Irpino, A. (2008). A new interval data distance based on the Wasserstein metric.

Kao, C.-H. et al. (2014). Exploratory data analysis of interval-valued symbolic data with matrix visualization. CSDA, 79, 14-29.

See Also

int_dist_matrix int_dist_all int_pairwise_dist

Examples

# Using symbolic_tbl format
data(mushroom.int)
d1 <- int_dist(mushroom.int[, 3:4], method = "euclidean")
d2 <- int_dist(mushroom.int[, 3:4], method = "hausdorff")
d3 <- int_dist(mushroom.int[, 3:4], method = "GD")

# Using array format: 4 concepts, 3 variables
x <- array(NA, dim = c(4, 3, 2))
x[,,1] <- matrix(c(1,2,3,4, 5,6,7,8, 9,10,11,12), nrow=4)
x[,,2] <- matrix(c(3,5,6,7, 8,9,10,12, 13,15,16,18), nrow=4)
d4 <- int_dist(x, method = "snEHD")
d5 <- int_dist(x, method = "IY", gamma = 0.3)
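
To make the array input concrete, here is a minimal base-R sketch of a Hausdorff-type distance on an [n, p, 2] array. It assumes slice 1 holds lower bounds and slice 2 upper bounds, and that per-variable distances are summed; the helper hausdorff_ij is hypothetical, and the package may aggregate or normalize differently.

```r
# Interval data as an [n, p, 2] array (assumed: slice 1 = lower, slice 2 = upper).
x <- array(NA_real_, dim = c(4, 3, 2))
x[, , 1] <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4)
x[, , 2] <- matrix(c(3, 5, 6, 7, 8, 9, 10, 12, 13, 15, 16, 18), nrow = 4)

# Hausdorff distance between observations i and j: per variable, the larger of
# the lower-bound gap and the upper-bound gap, summed over variables.
hausdorff_ij <- function(x, i, j) {
  lower_gap <- abs(x[i, , 1] - x[j, , 1])
  upper_gap <- abs(x[i, , 2] - x[j, , 2])
  sum(pmax(lower_gap, upper_gap))
}

n <- dim(x)[1]
d <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- hausdorff_ij(x, i, j)
  }
}
hd <- as.dist(d)
print(hd)
```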

Geometric Properties of Interval Data

Description

Functions to compute geometric characteristics of interval-valued data.

Usage

int_width(x, var_name, ...)

int_radius(x, var_name, ...)

int_center(x, var_name, ...)

int_overlap(x, var_name1, var_name2, ...)

int_containment(x, var_name1, var_name2, ...)

int_midrange(x, var_name, ...)

Arguments

x

interval-valued data with symbolic_tbl class.

var_name

the variable name or the column location (multiple variables are allowed).

...

additional parameters

var_name1

the first variable name or column location.

var_name2

the second variable name or column location.

Details

These functions compute basic geometric properties:

Value

A numeric matrix or value

Author(s)

Han-Ming Wu

See Also

int_width int_radius int_center int_overlap

Examples

data(mushroom.int)

# Calculate interval widths
int_width(mushroom.int, var_name = "Pileus.Cap.Width")
int_width(mushroom.int, var_name = 2:3)

# Calculate interval radius
int_radius(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"))

# Get interval centers
int_center(mushroom.int, var_name = 2:4)

# Measure overlap between two variables
int_overlap(mushroom.int, "Pileus.Cap.Width", "Stipe.Length")
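
The basic quantities follow directly from the interval bounds. The sketch below uses plain numeric vectors rather than a symbolic_tbl, with invented bounds, to show the standard definitions of width, radius, and center.

```r
# Invented interval bounds for three observations of one variable.
lower <- c(1.0, 2.5, 4.0)
upper <- c(3.0, 6.0, 4.5)

width  <- upper - lower        # length of each interval
radius <- width / 2            # half-length
center <- (lower + upper) / 2  # midpoint

# Each interval can be rebuilt as center -/+ radius.
rebuilt <- cbind(lower = center - radius, upper = center + radius)
print(cbind(width, radius, center))
print(rebuilt)
```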

Position and Scale Measures for Interval Data

Description

Functions to compute position and scale statistics for interval-valued data.

Usage

int_median(x, var_name, method = "CM", ...)

int_quantile(x, var_name, probs = c(0.25, 0.5, 0.75), method = "CM", ...)

int_range(x, var_name, method = "CM", ...)

int_iqr(x, var_name, method = "CM", ...)

int_mad(x, var_name, method = "CM", ...)

int_mode(x, var_name, method = "CM", breaks = 30, ...)

Arguments

x

interval-valued data with symbolic_tbl class.

var_name

the variable name or the column location (multiple variables are allowed).

method

methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT.

...

additional parameters

probs

numeric vector of probabilities with values in [0,1].

breaks

number of histogram breaks for mode estimation (default: 30).

Details

These functions provide position and scale measures:

Value

A numeric matrix or value

Author(s)

Han-Ming Wu

See Also

int_mean int_var int_median int_quantile

Examples

data(mushroom.int)

# Calculate median
int_median(mushroom.int, var_name = "Pileus.Cap.Width")
int_median(mushroom.int, var_name = 2:3, method = c("CM", "EJD"))

# Calculate quantiles
int_quantile(mushroom.int, var_name = 2, probs = c(0.25, 0.5, 0.75))

# Calculate interquartile range
int_iqr(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"))

# Calculate MAD
int_mad(mushroom.int, var_name = 2:3, method = "CM")
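
One simple way to picture these measures is to reduce each interval to its midpoint and apply the usual base-R functions. This assumes that the CM method works on interval centers, which is not stated on this page; the other methods are not covered by the sketch, and the bounds are invented.

```r
# Invented interval bounds for five observations.
lower <- c(1, 2, 4, 6, 10)
upper <- c(3, 6, 5, 9, 12)
centers <- (lower + upper) / 2

med  <- median(centers)
qs   <- quantile(centers, probs = c(0.25, 0.5, 0.75))
iqr  <- IQR(centers)
# constant = 1 gives the raw median absolute deviation; R's default rescales it
# by 1.4826 for consistency with the normal standard deviation.
mad1 <- mad(centers, constant = 1)
print(list(median = med, quantiles = qs, iqr = iqr, mad = mad1))
```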

Robust Statistics for Interval Data

Description

Functions to compute robust statistics for interval-valued data.

Usage

int_trimmed_mean(x, var_name, trim = 0.1, method = "CM", ...)

int_winsorized_mean(x, var_name, trim = 0.1, method = "CM", ...)

int_trimmed_var(x, var_name, trim = 0.1, method = "CM", ...)

int_winsorized_var(x, var_name, trim = 0.1, method = "CM", ...)

Arguments

x

interval-valued data with symbolic_tbl class.

var_name

the variable name or the column location (multiple variables are allowed).

trim

the fraction (0 to 0.5) of observations to be trimmed from each end.

method

methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT.

...

additional parameters

Details

These functions provide robust alternatives to standard statistics:

Trimming vs Winsorizing:

Value

A numeric matrix

Author(s)

Han-Ming Wu

See Also

int_mean int_var int_trimmed_mean

Examples

data(mushroom.int)

# Trimmed mean (10% from each end)
int_trimmed_mean(mushroom.int, var_name = "Pileus.Cap.Width", trim = 0.1)

# Winsorized mean
int_winsorized_mean(mushroom.int, var_name = 2:3, trim = 0.05, method = "CM")

# Trimmed variance
int_trimmed_var(mushroom.int, var_name = c("Stipe.Length"), trim = 0.1)
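
The difference between trimming and winsorizing can be shown on a plain numeric vector. The sketch assumes the statistics are applied to interval midpoints; the helper winsorize is hypothetical and the values are invented, so it illustrates the idea rather than the package's code.

```r
# Invented interval midpoints; the last value is an outlier.
centers <- c(2, 4, 4.5, 7.5, 11, 40)
trim <- 0.2

# Trimmed mean: drop floor(n * trim) observations from each end, then average.
trimmed <- mean(centers, trim = trim)

# Winsorized mean: replace those same observations by the nearest retained
# value instead of dropping them, then average all n values.
winsorize <- function(v, trim) {
  s <- sort(v)
  k <- floor(length(s) * trim)
  if (k > 0) {
    s[seq_len(k)] <- s[k + 1]
    s[(length(s) - k + 1):length(s)] <- s[length(s) - k]
  }
  s
}
winsorized <- mean(winsorize(centers, trim))
print(c(raw = mean(centers), trimmed = trimmed, winsorized = winsorized))
```

Both versions limit the pull of the outlier; winsorizing keeps the sample size unchanged, while trimming reduces it.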

Distribution Shape Measures for Interval Data

Description

Functions to compute shape statistics (skewness, kurtosis) for interval-valued data.

Usage

int_skewness(x, var_name, method = "CM", ...)

int_kurtosis(x, var_name, method = "CM", ...)

int_symmetry(x, var_name, method = "CM", ...)

int_tailedness(x, var_name, method = "CM", ...)

Arguments

x

interval-valued data with symbolic_tbl class.

var_name

the variable name or the column location (multiple variables are allowed).

method

methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT.

...

additional parameters

Details

These functions measure distribution shape:

Skewness interpretation:

Kurtosis interpretation (excess kurtosis):

Value

A numeric matrix

Author(s)

Han-Ming Wu

See Also

int_mean int_var int_skewness int_kurtosis

Examples

data(mushroom.int)

# Calculate skewness
int_skewness(mushroom.int, var_name = "Pileus.Cap.Width")
int_skewness(mushroom.int, var_name = 2:3, method = c("CM", "EJD"))

# Calculate kurtosis
int_kurtosis(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"))

# Check symmetry
int_symmetry(mushroom.int, var_name = 2:4, method = "CM")

Similarity Measures for Interval Data

Description

Functions to compute similarity measures between interval-valued observations.

Usage

int_jaccard(x, var_name1, var_name2, ...)

int_dice(x, var_name1, var_name2, ...)

int_cosine(x, var_name1, var_name2, ...)

int_overlap_coefficient(x, var_name1, var_name2, ...)

int_tanimoto(x, var_name1, var_name2, ...)

int_similarity_matrix(x, method = "jaccard", ...)

Arguments

x

interval-valued data with symbolic_tbl class.

var_name1

the first variable name or column location.

var_name2

the second variable name or column location.

...

additional parameters

method

similarity method for int_similarity_matrix: "jaccard", "dice", or "overlap".

Details

These functions compute various similarity measures:

All similarity measures range from 0 (no similarity) to 1 (perfect similarity).

Value

A numeric matrix or value

Author(s)

Han-Ming Wu

See Also

int_dist int_cor int_jaccard

Examples

data(mushroom.int)

# Jaccard similarity
int_jaccard(mushroom.int, "Pileus.Cap.Width", "Stipe.Length")

# Dice coefficient
int_dice(mushroom.int, 2, 3)

# Cosine similarity
int_cosine(mushroom.int, 
           var_name1 = c("Pileus.Cap.Width"), 
           var_name2 = c("Stipe.Length", "Stipe.Thickness"))

# Overlap coefficient
int_overlap_coefficient(mushroom.int, 2, 3:4)
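
For two intervals, Jaccard and Dice similarity can be written in terms of the length of their intersection. The helpers below are a hypothetical base-R sketch of those textbook definitions on raw bounds; how the package combines them across observations and variables is not shown here.

```r
# Length of the intersection of [a_lo, a_hi] and [b_lo, b_hi] (0 if disjoint).
interval_intersection <- function(a_lo, a_hi, b_lo, b_hi) {
  pmax(0, pmin(a_hi, b_hi) - pmax(a_lo, b_lo))
}

# Jaccard: intersection length over union length.
interval_jaccard <- function(a_lo, a_hi, b_lo, b_hi) {
  inter <- interval_intersection(a_lo, a_hi, b_lo, b_hi)
  union <- (a_hi - a_lo) + (b_hi - b_lo) - inter
  ifelse(union > 0, inter / union, 0)
}

# Dice: twice the intersection length over the sum of the two lengths.
interval_dice <- function(a_lo, a_hi, b_lo, b_hi) {
  inter <- interval_intersection(a_lo, a_hi, b_lo, b_hi)
  total <- (a_hi - a_lo) + (b_hi - b_lo)
  ifelse(total > 0, 2 * inter / total, 0)
}

j_partial  <- interval_jaccard(1, 3, 2, 5)  # partially overlapping intervals
j_disjoint <- interval_jaccard(1, 2, 3, 4)  # disjoint intervals
j_same     <- interval_jaccard(1, 3, 1, 3)  # identical intervals
d_partial  <- interval_dice(1, 3, 2, 5)
print(c(j_partial, j_disjoint, j_same, d_partial))
```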

Statistics for Interval Data

Description

Functions to compute the mean, variance, covariance, and correlation of interval-valued data.

Usage

int_mean(x, var_name, method = "CM", ...)

int_var(x, var_name, method = "CM", ...)

int_cov(x, var_name1, var_name2, method = "CM", ...)

int_cor(x, var_name1, var_name2, method = "CM", ...)

Arguments

x

interval-valued data with symbolic_tbl class.

var_name

the variable name or the column location (multiple variables are allowed).

method

methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT.

...

additional parameters

var_name1

the variable name or the column location (multiple variables are allowed).

var_name2

the variable name or the column location (multiple variables are allowed).

Details

...

Value

A numeric value: the mean, variance, covariance, or correlation.

Author(s)

Han-Ming Wu

See Also

int_mean int_var int_cov int_cor

Examples

data(mushroom.int)
int_mean(mushroom.int, var_name = "Pileus.Cap.Width")
int_mean(mushroom.int, var_name = 2:3)

var_name <- c("Stipe.Length", "Stipe.Thickness")
method <- c("CM", "FV", "EJD")
int_mean(mushroom.int, var_name, method)
int_var(mushroom.int, var_name, method)

var_name1 <- "Pileus.Cap.Width"
var_name2 <- c("Stipe.Length", "Stipe.Thickness")
method <- c("CM", "VM", "EJD", "GQ", "SPT")
int_cov(mushroom.int, var_name1, var_name2, method)
int_cor(mushroom.int, var_name1, var_name2, method)
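
For orientation, the sketch below computes an interval sample mean and variance in base R using the formulas given by Billard and Diday, which treat values as uniformly distributed within each interval. Which method code (CM, VM, ...) corresponds to these formulas is not stated on this page, so the mapping is left open; the bounds are invented.

```r
# Invented interval bounds [a_i, b_i] for three observations.
a <- c(1, 2, 4)
b <- c(3, 6, 5)
n <- length(a)

# Symbolic sample mean: average of the interval midpoints.
m_sym <- sum(a + b) / (2 * n)

# Symbolic sample variance: includes the spread inside each interval.
v_sym <- sum(a^2 + a * b + b^2) / (3 * n) - m_sym^2

# For comparison: population variance of the midpoints alone.
centers <- (a + b) / 2
v_centers <- mean(centers^2) - mean(centers)^2
print(c(mean = m_sym, var_symbolic = v_sym, var_centers = v_centers))
```

The symbolic variance is never smaller than the midpoint variance, because it adds the within-interval variation that midpoints discard.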

Uncertainty and Variability Measures for Interval Data

Description

Functions to compute uncertainty and variability measures for interval-valued data.

Usage

int_entropy(x, var_name, method = "CM", base = 2, ...)

int_cv(x, var_name, method = "CM", ...)

int_dispersion(x, var_name, method = "CM", ...)

int_imprecision(x, var_name, ...)

int_granularity(x, var_name, ...)

int_uniformity(x, var_name, ...)

int_information_content(x, var_name, method = "CM", ...)

Arguments

x

interval-valued data with symbolic_tbl class.

var_name

the variable name or the column location (multiple variables are allowed).

method

methods to calculate statistics: CM (default), VM, QM, SE, FV, EJD, GQ, SPT.

base

logarithm base for entropy calculation (default: 2)

...

additional parameters

Details

These functions measure uncertainty and variability:

Value

A numeric matrix or value

Author(s)

Han-Ming Wu

See Also

int_var, int_entropy, int_cv

Examples

data(mushroom.int)

# Calculate entropy
int_entropy(mushroom.int, var_name = "Pileus.Cap.Width")

# Coefficient of variation
int_cv(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"), method = c("CM", "EJD"))

# Measure imprecision
int_imprecision(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"))

# Check data granularity
int_granularity(mushroom.int, var_name = 2:4)

Internal Utility Functions for Interval Data

Description

Internal functions for interval data transformation. These are used by the exported interval statistics functions (int_mean, int_var, int_cov, int_cor) and are not intended to be called directly.


Lack of Information Questionnaire Interval Dataset

Description

Interval-valued dataset from a lack-of-information questionnaire. Contains biographical data and responses to 5 items measuring perception of lack of information, collected via an interval-valued Likert scale.

Usage

data(lackinfo.int)

Format

A data frame with 50 observations and 8 variables:

Details

An educational innovation project was carried out for improving teaching-learning processes at the University of Oviedo (Spain) for the 2020/2021 academic year. A total of 50 students answered an online questionnaire about biographical data (sex and age) and their perception of lack of information by selecting the interval that best represents their level of agreement on a scale bounded between 1 (strongly disagree) and 7 (strongly agree).

The 5 items measuring perception of lack of information are:

Source

https://CRAN.R-project.org/package=IntervalQuestionStat

Examples

data(lackinfo.int)

Loans by Purpose Interval Dataset

Description

Interval-valued data for loan characteristics aggregated by loan purpose. The original microdata contain 887,383 loan records from Kaggle.

Usage

data(loans_by_purpose.int)

Format

A data frame with 14 observations and 4 interval-valued variables:

Source

https://CRAN.R-project.org/package=MAINT.Data

Examples

data(loans_by_purpose.int)

Lung Cancer Treatments by State Histogram-Valued Dataset

Description

Histogram-valued distribution of lung cancer treatment counts by US state.

Usage

data(lung_cancer.hist)

Format

An object of class data.frame with 2 rows and 2 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.20.

Examples

data(lung_cancer.hist)

Mushroom Species Dataset (Original Format)

Description

Interval-valued data for 23 mushroom species of the genus Agaricus with 3 morphological measurements from the Fungi of California Species.

Usage

data(mushroom)

Format

A data frame with 23 observations and 5 variables:

Details

Classic SDA dataset used for descriptive statistics, histogram construction, and clustering of interval-valued data.

Source

Billard, L. and Diday, E. (2006), Table 3.2.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester. Table 3.2.

Examples

data(mushroom)

Mushroom Species Interval Dataset

Description

Interval-valued version of the mushroom dataset. See mushroom.

Usage

data(mushroom.int)

Format

A data frame with 23 observations and interval-valued variables.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 3.2.

Examples

data(mushroom.int)

Mushroom Species Fuzzy/Symbolic Dataset

Description

Extended mushroom data with fuzzy stipe thickness (Small/Average/Large), numerical stipe length, interval cap size, and categorical cap colour for two Amanita species.

Usage

data(mushroom_fuzzy)

Format

An object of class data.frame with 4 rows and 9 columns.

References

Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Tables 1.14-1.16.

Examples

data(mushroom_fuzzy)

New York City Flights Interval Dataset

Description

Interval-valued dataset with 142 units and four interval-valued variables from the nycflights13 package, aggregated by month and carrier.

Usage

data(nycflights.int)

Format

A list containing FlightsDF, FlightsUnits, and FlightsIdt.

Source

https://CRAN.R-project.org/package=MAINT.Data

References

Duarte Silva, A.P., Brito, P., Filzmoser, P. and Dias, J.G. (2021). MAINT.Data: Modelling and Analysing Interval Data in R. R Journal, 13(2).

Examples

data(nycflights.int)

Occupation Salaries Dataset

Description

Salary ranges for different occupations.

Usage

data(occupations)

Format

An object of class data.frame with 9 rows and 11 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(occupations)

Occupation Salaries Modal-Valued Dataset

Description

Modal-valued version of the occupation salaries dataset.

Usage

data(occupations2)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 9 rows and 4 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(occupations2)

Ohio River Basin 30-Year Trimmed Mean Daily Temperatures Interval Dataset

Description

Interval-valued dataset of 30-year trimmed mean daily temperatures for the Ohio river basin. Intervals are defined by the mean daily maximum and minimum temperatures from January 1, 1988 to December 31, 2018.

Usage

data(ohtemp.int)

Format

A data frame with 161 rows and 7 variables:

Source

https://CRAN.R-project.org/package=intkrige

Examples

data(ohtemp.int)

Oils and Fats Interval Dataset

Description

Classic benchmark interval-valued data for 8 oils and fats described by 4 physico-chemical properties. Originally from Ichino (1988).

Usage

data(oils.int)

Format

A data frame with 8 observations and 4 interval-valued variables:

Details

The 8 samples are: Linseed oil, Perilla oil, Cottonseed oil, Sesame oil, Camellia oil, Olive oil, Beef tallow, Hog fat. The expected 3-cluster structure is: {Beef tallow, Hog fat}, {Cottonseed, Sesame, Camellia, Olive}, and {Linseed, Perilla}. Widely used for comparing clustering methods and distance measures in symbolic data analysis.

References

Ichino, M. (1988). General metrics for mixed features. Proc. IEEE Conf. Systems, Man, and Cybernetics, pp. 494-497.

Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 13.7, p.253.

Examples

data(oils.int)

Profession Work Salary Time Interval Dataset

Description

Interval-valued data for professional categories by salary and working time.

Usage

data(profession.int)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 15 rows and 4 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(profession.int)

Set Variable Format

Description

This function changes the format of the set variables in the data to conform to the RSDA format.

Usage

set_variable_format(data, location, var)

Arguments

data

A conventional data frame.

location

The location of the set variable in the data.

var

The name of the set variable in the data.

Value

Returns a data frame in which the set variable is converted to one-hot encoding.

Examples

data("mushroom")
mushroom.set <- set_variable_format(data = mushroom, location = 8, var = "Species")
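
To make the idea of one-hot encoding a set variable concrete, here is a minimal base R sketch on a made-up data frame. It is not the package's implementation: the data frame toy, its Colour column, and the naming of the indicator columns are assumptions for illustration, and the columns actually produced by set_variable_format may differ.

```r
# Toy data with one set-valued (categorical) column; values are made up.
toy <- data.frame(
  Size   = c(4.2, 5.1, 3.8),
  Colour = c("white", "brown", "white"),
  stringsAsFactors = FALSE
)

# One indicator column per distinct category (1 if the row has it, else 0).
categories <- sort(unique(toy$Colour))
one_hot <- sapply(categories, function(level) as.integer(toy$Colour == level))
colnames(one_hot) <- paste0("Colour_", categories)

# Replace the original column with its indicator columns.
encoded <- cbind(toy["Size"], one_hot)
```

Each row of one_hot sums to 1 here because every observation carries exactly one category; a genuinely multi-valued observation would have several indicators set.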

French Soccer Championship Bivariate Interval Dataset

Description

Interval-valued data for 20 teams from the French premier soccer championship. Contains ranges of Weight (response), Height and Age (explanatory variables).

Usage

data(soccer_bivar.int)

Format

A data frame with 20 rows and 3 interval-valued variables:

Source

https://CRAN.R-project.org/package=iRegression

References

Lima Neto, E. A., Cordeiro, G. and De Carvalho, F.A.T. (2011). Bivariate symbolic regression models for interval-valued variables. Journal of Statistical Computation and Simulation, 81, 1727-1744.

Examples

data(soccer_bivar.int)

Pickup League Teams Interval Dataset

Description

Interval-valued data for 5 teams in a local pickup league, classified by season performance. Each team is described by ranges of player age, weight, and speed.

Usage

data(teams.int)

Format

A data frame with 5 observations and 4 variables:

Details

The symbolic results are more informative than classical midpoint analyses: the Very Good team has homogeneous players, whereas the Poor team has players varying widely in age, weight, and speed. Used for symbolic principal component analysis.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.24, p.63.

Examples

data(teams.int)

World Cities Monthly Temperature Interval Dataset

Description

Interval-valued monthly temperatures for major cities worldwide. Benchmark dataset for comparing distance measures (Hausdorff, L2, Wasserstein) in dynamic clustering algorithms.

Usage

data(temperature_city.int)

Format

A data frame with city rows and 12 interval-valued monthly temperature variables (Jan-Dec), plus an expert class assignment.

Details

Expert partition into 4 classes: Class 1 (tropical/warm), Class 2 (temperate European and Asian), Class 3 (Mauritius), Class 4 (Tehran).

References

Verde, R. and Irpino, A. (2008). A new interval data distance based on the Wasserstein metric. Proc. COMPSTAT 2008, pp. 705-712.

Examples

data(temperature_city.int)

Tennis Court Types Interval Dataset

Description

Interval-valued data for tennis players aggregated by court type (Hard, Grass, Indoor, Clay) with weight, height, and racket tension.

Usage

data(tennis.int)

Format

A data frame with 4 observations and 4 variables:

Details

Clustering on weight and height separates grass courts from the rest (decision rule: Weight <= 74.75 kg). When all three variables are used, clustering separates by racket tension instead.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 2.25, p.64.

Examples

data(tennis.int)

Town Services Concatenated Mixed Symbolic Dataset

Description

Symbolic data for 3 towns (Paris, Lyon, Toulouse) combining school and hospital databases. Contains interval-valued, multi-valued, and modal-valued variables.

Usage

data(town_services.mix)

Format

A data frame with 3 observations and 5 symbolic variables:

References

Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 1.21, p.19.

Examples

data(town_services.mix)

Trivial and Non-Trivial Intervals Example Dataset

Description

Simple 5x3 example illustrating different interval types: full intervals (hyperrectangles), degenerate intervals (lines), and trivial intervals (points). Used for vertices PCA demonstration.

Usage

data(trivial_intervals.int)

Format

A data frame with 5 observations and 3 interval-valued variables.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley. Table 5.1, p.146.

Examples

data(trivial_intervals.int)
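
The distinction between the interval types can be sketched in base R by counting, for each observation, how many variables have a non-zero width. The data frame below is made up (it does not reproduce Table 5.1) and uses only two variables for brevity; the column names and the variables widths, n_free, and shape are hypothetical.

```r
# Made-up observations on two interval variables in _min/_max layout.
toy <- data.frame(
  x_min = c(1, 2, 3),
  x_max = c(4, 2, 3),
  y_min = c(0, 1, 5),
  y_max = c(2, 6, 5)
)

# Width of each interval; a zero width means the interval is a single point.
widths <- cbind(x = toy$x_max - toy$x_min, y = toy$y_max - toy$y_min)

# Number of variables with a genuine (non-zero width) interval per observation.
n_free <- rowSums(widths > 0)

# With two variables: 2 -> rectangle, 1 -> line segment, 0 -> point.
shape <- c("point", "line", "rectangle")[n_free + 1]
```

With three variables, as in trivial_intervals.int, the same count would range from 0 (a point) to 3 (a full hyperrectangle).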

Veterinary Interval Dataset

Description

Interval-valued veterinary dataset with animal measurements.

Usage

data(veterinary.int)

Format

An object of class symbolic_tbl (inherits from tbl_df, tbl, data.frame) with 10 rows and 3 columns.

References

Billard, L. and Diday, E. (2006). Symbolic Data Analysis. Wiley.

Examples

data(veterinary.int)

World Cup Soccer Teams Interval Dataset

Description

Interval-valued data for soccer teams grouped by World Cup qualification status. Includes age, weight, height ranges and covariance.

Usage

data(world_cup.int)

Format

A data frame with 2 observations and 5 variables.

References

Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008). Symbolic Data Analysis and the SODAS Software. Wiley. Table 1.9, p.13.

Examples

data(world_cup.int)

Write Symbolic Data Table

Description

This function writes (saves) a symbolic data table to a CSV file.

Usage

write_csv_table(data, file, output)

Arguments

data

The conventional data.

file

The name of the CSV file.

output

This is an experimental argument, with default TRUE, and can be ignored by most users.

Value

Writes the symbolic data table to a CSV file.

Examples

data(mushroom)
mushroom.set <- set_variable_format(data = mushroom, location = 8, var = "Species")
mushroom.tmp <- RSDA_format(data = mushroom.set, sym_type1 = c("I", "S"),
                            location = c(25, 31), sym_type2 = c("S", "I", "I"),
                            var = c("Species", "Stipe.Length_min", "Stipe.Thickness_min"))
mushroom.clean <- clean_colnames(data = mushroom.tmp)
# Save the table as a CSV file in RSDA format:
write_csv_table(data = mushroom.clean, file = "mushroom_interval.csv", output = FALSE)