How to describe and interpret a categorical variable automatically?

When the categorical variable is explicit

This section shows what {NaileR} can do when the categorical variable is explicit, that is, directly measured. We use Fisher's well-known iris dataset. Suppose we are interested in the variable Species and want to know how the other variables in the dataset describe it. As mentioned above, {NaileR} is partly based on {FactoMineR} functions, in particular catdes() and condes(). The first, catdes(), automatically describes a categorical variable; the second, condes(), does the same for a continuous variable.
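To see what these two {FactoMineR} functions return before any LLM is involved, they can be called directly on iris. This is a minimal sketch, assuming {FactoMineR} is installed; the object names res_catdes and res_condes are only illustrative.

```r
library(FactoMineR)
data(iris)

# Describe the categorical variable Species (column 5) using the other variables
res_catdes <- catdes(iris, num.var = 5)
# Quantitative variables that characterise the setosa category (with v-tests)
res_catdes$quanti$setosa

# Describe the continuous variable Sepal.Length (column 1) using the other variables
res_condes <- condes(iris, num.var = 1)
# Correlations between Sepal.Length and the other quantitative variables
res_condes$quanti
```

These raw descriptions are the kind of output that nail_catdes() and nail_condes() hand over to the LLM for interpretation.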

The {NaileR} package has two similar functions, both extended with an LLM: nail_catdes() and nail_condes(). The parameters of nail_catdes() are partly the same as those of catdes(): you must specify the dataset and the index of the column that holds the categorical (qualitative) variable to be interpreted.

To get useful results from {NaileR}, it is essential to fill in two parameters: introduction and request. Together they are used to build a prompt that is tailored to the dataset and the variables of interest.
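As a rough illustration of how an introduction and a request can frame a data description into a single prompt, here is a minimal base R sketch. The helper build_prompt() is hypothetical and is not part of {NaileR}; the prompt template the package actually uses may differ. The description is a simple per-species mean, used as a stand-in for the catdes() output.

```r
# Hypothetical helper (NOT a NaileR function): glue the three parts together
build_prompt <- function(introduction, description, request) {
  paste(introduction, description, request, sep = "\n\n")
}

data(iris)

intro <- "A study measured various parts of iris flowers from 3 species."
req <- "Please explain what makes each species distinct."

# Stand-in description: mean petal length per species, computed from the data
means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
desc <- paste0(means$Species, ": mean petal length ",
               round(means$Petal.Length, 2), collapse = "; ")

prompt <- build_prompt(intro, desc, req)
cat(prompt)
```

The introduction gives the model its context, the description carries the statistics, and the request states the task; changing either of the two free-text parts changes the answer without touching the data.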

library(NaileR)
data(iris)

intro_iris <- "A study measured various parts of iris flowers
from 3 different species: setosa, versicolor and virginica.
I will give you the results from this study.
You will have to identify what sets these flowers apart."
intro_iris <- gsub('\n', ' ', intro_iris) |>
  stringr::str_squish()

req_iris <- "Please explain what makes each species distinct.
Also, tell me which species has the biggest flowers,
and which species has the smallest. Is there any biological reason for this?"
req_iris <- gsub('\n', ' ', req_iris) |>
  stringr::str_squish()

# Describe Species (column 5) and ask the llama3.1 model to interpret the result
res_iris <- nail_catdes(iris,
                        num.var = 5,
                        model = "llama3.1",
                        introduction = intro_iris,
                        request = req_iris,
                        generate = TRUE)

# Load the stored result shipped with the package
res_iris <- readRDS(system.file("extdata", "res_iris.rds", package = "NaileR"))
formatted_text <- strwrap(res_iris$response, width = 80)
print(formatted_text)
#>  [1] "A classic problem in classification!"                                           
#>  [2] ""                                                                               
#>  [3] "**What makes each species distinct?**"                                          
#>  [4] ""                                                                               
#>  [5] "Based on the data, we can identify the following differences between the three" 
#>  [6] "species:"                                                                       
#>  [7] ""                                                                               
#>  [8] "1. **Setosa**: Characterized by high sepal width and low sepal length, petal"   
#>  [9] "length. 2. **Versicolor**: Marked by high petal length and low sepal width. 3." 
#> [10] "**Virginica**: Distinguished by high values across multiple variables: petal"   
#> [11] "width, petal length, and sepal length."                                         
#> [12] ""                                                                               
#> [13] "**Which species has the biggest flowers?**"                                     
#> [14] ""                                                                               
#> [15] "From the data, it appears that **virginica** has the largest flowers, as it has"
#> [16] "high values for both petal width and length, as well as sepal length. This"     
#> [17] "suggests that virginica flowers have wider and longer petals, and possibly"     
#> [18] "longer sepals, compared to the other two species."                              
#> [19] ""                                                                               
#> [20] "**Which species has the smallest flowers?**"                                    
#> [21] ""                                                                               
#> [22] "Based on the data, **setosa** seems to have the smallest flowers, with low"     
#> [23] "values for sepal length, petal length, and sepal width. This implies that"      
#> [24] "setosa flowers have narrower petals and shorter sepals compared to the other"   
#> [25] "two species."                                                                   
#> [26] ""                                                                               
#> [27] "**Biological reason:**"                                                         
#> [28] ""                                                                               
#> [29] "In biological terms, these differences might be related to the evolutionary"    
#> [30] "pressures and adaptations of each species. For example:"                        
#> [31] ""                                                                               
#> [32] "* Larger flowers (virginica) may indicate a stronger need for pollinators to"   
#> [33] "attract more visitors to increase reproductive success. * Smaller flowers"      
#> [34] "(setosa) could be an adaptation to conserve resources or reduce energy"         
#> [35] "expenditure."                                                                   
#> [36] ""                                                                               
#> [37] "Keep in mind that this is a simplified analysis, and there might be other"      
#> [38] "factors influencing these differences. However, based on the provided data, we" 
#> [39] "can make some educated inferences about the distinct characteristics of each"   
#> [40] "iris species."

When the categorical variable is latent

library(NaileR)
library(FactoMineR)
data(waste)
waste <- waste[-14]    # no variability on this question

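set.seed(1)
# MCA on the survey answers; columns 1, 2 and 50 to 76 are supplementary
res_mca_waste <- MCA(waste, quali.sup = c(1,2,50:76),
                     ncp = 35, level.ventil = 0.05, graph = FALSE)
plot.MCA(res_mca_waste, choix = "ind",
         invisible = c("var", "quali.sup"), label = "none")

# Hierarchical clustering on the MCA components, cut into 3 clusters
res_hcpc_waste <- HCPC(res_mca_waste, nb.clust = 3, graph = FALSE)

# data.clust holds the original data plus the cluster variable as last column
don_clust_waste <- res_hcpc_waste$data.clust
# Rerun the MCA with the cluster variable (column 77) as supplementary,
# then colour the individuals by cluster
res_mca_waste <- MCA(don_clust_waste, quali.sup = c(1,2,50:77),
                     ncp = 35, level.ventil = 0.05, graph = FALSE)
plot.MCA(res_mca_waste, choix = "ind",
         invisible = c("var", "quali.sup"), label = "none", habillage = 77)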
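The clustering step is what turns a latent grouping into an explicit column that can then be described like any other categorical variable. The base R sketch below illustrates that idea on iris, assuming nothing beyond base R; it uses hclust() and cutree() as a simple stand-in for the MCA and HCPC() pipeline, and the names iris_clust and clust_col are only illustrative.

```r
data(iris)

# Standardise the four measurements so that each one weighs equally
measurements <- scale(iris[, 1:4])

# Ward hierarchical clustering, cut into 3 groups (stand-in for HCPC)
tree <- hclust(dist(measurements), method = "ward.D2")

# Store the latent grouping as a factor in the last column of the dataset
iris_clust <- iris[, 1:4]
iris_clust$clust <- factor(cutree(tree, k = 3))

# Index of the cluster column, i.e. the variable to be described
clust_col <- ncol(iris_clust)
table(iris_clust$clust)
```

A column built this way can be passed as num.var, in the same way that the waste example passes ncol(don_clust_waste).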

intro_waste <- 'These data were collected
after a survey on food waste,
with participants describing their habits.'
intro_waste <- gsub('\n', ' ', intro_waste) |>
  stringr::str_squish()

req_waste <- 'Please summarize the characteristics of each group.
Then, give each group a new name, based on your conclusions.
Finally, give each group a grade between 0 and 10,
based on how wasteful they are with food:
0 being "not at all", 10 being "absolutely".'
req_waste <- gsub('\n', ' ', req_waste) |>
  stringr::str_squish()

# Describe the cluster variable (last column) and ask the LLM to interpret it
res_waste <- nail_catdes(don_clust_waste,
                         num.var = ncol(don_clust_waste),
                         introduction = intro_waste,
                         request = req_waste,
                         model = "llama3.1",
                         drop.negative = TRUE,
                         generate = TRUE)

# Load the stored result shipped with the package
res_waste <- readRDS(system.file("extdata", "res_waste.rds", package = "NaileR"))
formatted_text <- strwrap(res_waste$response, width = 80)
print(formatted_text)
#>  [1] "**Summary of Group Characteristics**"                                           
#>  [2] ""                                                                               
#>  [3] "### Group 1:"                                                                   
#>  [4] ""                                                                               
#>  [5] "* Never throw away fruits, vegetables, or dairy products * Buy discounted"      
#>  [6] "products with short shelf life * Do not often throw away any type of food"      
#>  [7] "product * Have a careful approach to food waste, only throwing away damaged or" 
#>  [8] "expired items"                                                                  
#>  [9] ""                                                                               
#> [10] "### Group 2:"                                                                   
#> [11] ""                                                                               
#> [12] "* Never throw away dry goods * Rarely throw away fruits and vegetables (but"    
#> [13] "have thrown them away for being rotten) * Use \"best before\" dates as a guide" 
#> [14] "for discarding products * Do not often throw away dairy products or meat/fish *"
#> [15] "Have a moderate approach to food waste, throwing away items that are damaged or"
#> [16] "no longer desirable"                                                            
#> [17] ""                                                                               
#> [18] "### Group 3:"                                                                   
#> [19] ""                                                                               
#> [20] "* Throw away dry goods and other types of food due to various reasons"          
#> [21] "(expiration date passed, damaged, loss of taste quality) * Often throw away"    
#> [22] "fruits and vegetables for being rotten or not meeting expectations * Have"      
#> [23] "thrown away dairy products and meat/fish for expiration date passing or being"  
#> [24] "damaged * Have a more lax approach to food waste, throwing away items that are" 
#> [25] "slightly past their prime or no longer desirable"                               
#> [26] ""                                                                               
#> [27] "**New Group Names**"                                                            
#> [28] ""                                                                               
#> [29] "### Group 1: \"Conscious Consumers\""                                           
#> [30] ""                                                                               
#> [31] "### Group 2: \"Moderate Savers\""                                               
#> [32] ""                                                                               
#> [33] "### Group 3: \"Impulsive Discarders\""                                          
#> [34] ""                                                                               
#> [35] "**Wastefulness Grade (0-10)**"                                                  
#> [36] ""                                                                               
#> [37] "* Group 1: **4** - Not wasteful at all, takes a careful approach to food waste."
#> [38] "* Group 2: **5** - Moderately wasteful, throws away some items but still has a" 
#> [39] "moderate approach. * Group 3: **8** - Highly wasteful, often discards food due" 
#> [40] "to various reasons."                                                            
#> [41] ""                                                                               
#> [42] "Note that these grades are subjective and based on my interpretation of the"    
#> [43] "data."

An introduction to the NaileR package

Sébastien Lê

2025-07-13
