{NaileR} is a small R package initially designed for interpreting
continuous or categorical latent variables: typically, dimensions from
an exploratory multivariate method, or a class variable from an
unsupervised clustering algorithm. Since a tool that can interpret
latent variables can also interpret observed ones, {NaileR} can also
describe explicitly measured variables with respect to the other
variables of the data set. The rationale behind {NaileR} is to link
{FactoMineR} on the one hand and {ollamar} on the other:
{FactoMineR} provides dimension reduction methods, such as PCA, CA, MCA
and MFA, as well as hierarchical clustering; {ollamar} enables R to
query an open-source large language model (LLM) installed locally
through Ollama (https://ollama.com). To do so, {NaileR} recodes relevant
numerical indicators from {FactoMineR} into qualitative indicators,
which are then used to generate prompts directly usable by the LLM. This
recoding is carried out through the prism of the statistical
individuals.
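To make the idea of recoding numerical indicators into qualitative ones more concrete, here is a minimal base R sketch. It is an illustration only and not {NaileR}'s internal code: the helper `recode_group()` is hypothetical, and it assumes a deliberately simple rule (group mean compared with the overall mean), with no significance testing.

```r
# Illustrative only: NOT NaileR's internal code.
data(iris)

# Mean of each measurement within each species, and over all flowers
group_means   <- aggregate(. ~ Species, data = iris, FUN = mean)
overall_means <- colMeans(iris[, 1:4])

# Hypothetical helper: recode each group mean as a qualitative label
# ("higher" or "lower") relative to the overall mean
recode_group <- function(species) {
  row    <- group_means[group_means$Species == species, -1]
  labels <- ifelse(unlist(row) > overall_means, "higher", "lower")
  paste0(names(overall_means), " is ", labels, " than average")
}

sentences <- recode_group("setosa")
print(sentences)
```

For setosa, only the sepal width comes out as higher than average. Short sentences of this kind are much easier for an LLM to work with than a raw table of means.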
In the following section we present an example of what {NaileR} can do
with an explicit, and therefore measured, categorical variable. The
dataset we will use is the famous Fisher’s Iris dataset. Let’s say we
are interested in the variable Species and we want to know how this
categorical variable can be described by the other variables in the
dataset. As mentioned above, {NaileR} is partly based on {FactoMineR}
functions, and in particular on two very important functions of that
package, namely catdes() and condes(). The first function, catdes(), is
designed to automatically describe a categorical variable, while the
second, condes(), is designed to describe a continuous variable.
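As a point of reference, the two {FactoMineR} functions can be called directly on the iris data. This is a minimal sketch that assumes {FactoMineR} is installed; the choice of Sepal.Length as the continuous variable is arbitrary and only serves the illustration.

```r
library(FactoMineR)
data(iris)

# Describe the categorical variable Species (column 5) by the other columns
res_catdes <- catdes(iris, num.var = 5)
# Quantitative variables that characterise the setosa category
print(res_catdes$quanti$setosa)

# Describe the continuous variable Sepal.Length (column 1) by the other columns
res_condes <- condes(iris, num.var = 1)
# Correlations between Sepal.Length and the other quantitative variables
print(res_condes$quanti)
```

These numerical tables are the kind of output that then has to be turned into text before an LLM can interpret it.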
The {NaileR} package has two similar functions, both of which are
extended with an LLM: nail_catdes() and nail_condes(). For example, the
parameters of the nail_catdes() function are partly the same as those of
the catdes() function: you must specify the dataset and the index of the
column (num.var) holding the categorical/qualitative variable to be
interpreted.
To get interesting results from {NaileR}, it is essential to fill in two
parameters: the introduction and the request. These two parameters are
important because they allow us to build a prompt that is operational
and adapted to the data set and the variables of interest.
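The role of these two parameters can be pictured with a small base R sketch. It is only a sketch under assumptions: `build_prompt()` is a hypothetical helper, and the ordering used here (introduction, then data description, then request) is assumed for illustration, not taken from {NaileR}'s source.

```r
# Hypothetical helper, for illustration only: NaileR builds its prompts internally
build_prompt <- function(introduction, description, request) {
  # Collapse line breaks and repeated spaces so each part is one clean string
  squish <- function(x) trimws(gsub("\\s+", " ", x))
  paste(squish(introduction), squish(description), squish(request),
        sep = "\n\n")
}

prompt <- build_prompt(
  introduction = "A study measured various parts of iris flowers
                  from 3 different species.",
  description  = "Group setosa: Sepal.Width is higher than average.",
  request      = "Please explain what makes each species distinct."
)
cat(prompt)
```

The introduction gives the model the context of the study, and the request tells it what to do with the description of the data placed between them.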
library(NaileR)
data(iris)
intro_iris <- "A study measured various parts of iris flowers
from 3 different species: setosa, versicolor and virginica.
I will give you the results from this study.
You will have to identify what sets these flowers apart."
intro_iris <- gsub('\n', ' ', intro_iris) |>
  stringr::str_squish()
req_iris <- "Please explain what makes each species distinct.
Also, tell me which species has the biggest flowers,
and which species has the smallest. Is there any biological reason for this?"
req_iris <- gsub('\n', ' ', req_iris) |>
  stringr::str_squish()
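# generate = TRUE sends the prompt to the local LLM, so this call needs a
# running Ollama installation with the "llama3.1" model available
res_iris <- nail_catdes(iris,
  num.var = 5,
  model = "llama3.1",
  introduction = intro_iris,
  request = req_iris,
  generate = TRUE)
# Load the result stored in the package (extdata/res_iris.rds)
res_iris <- readRDS(system.file("extdata", "res_iris.rds", package = "NaileR"))
# Wrap the response at 80 characters for display
formatted_text <- strwrap(res_iris$response, width = 80)
print(formatted_text)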
#> [1] "A classic problem in classification!"
#> [2] ""
#> [3] "**What makes each species distinct?**"
#> [4] ""
#> [5] "Based on the data, we can identify the following differences between the three"
#> [6] "species:"
#> [7] ""
#> [8] "1. **Setosa**: Characterized by high sepal width and low sepal length, petal"
#> [9] "length. 2. **Versicolor**: Marked by high petal length and low sepal width. 3."
#> [10] "**Virginica**: Distinguished by high values across multiple variables: petal"
#> [11] "width, petal length, and sepal length."
#> [12] ""
#> [13] "**Which species has the biggest flowers?**"
#> [14] ""
#> [15] "From the data, it appears that **virginica** has the largest flowers, as it has"
#> [16] "high values for both petal width and length, as well as sepal length. This"
#> [17] "suggests that virginica flowers have wider and longer petals, and possibly"
#> [18] "longer sepals, compared to the other two species."
#> [19] ""
#> [20] "**Which species has the smallest flowers?**"
#> [21] ""
#> [22] "Based on the data, **setosa** seems to have the smallest flowers, with low"
#> [23] "values for sepal length, petal length, and sepal width. This implies that"
#> [24] "setosa flowers have narrower petals and shorter sepals compared to the other"
#> [25] "two species."
#> [26] ""
#> [27] "**Biological reason:**"
#> [28] ""
#> [29] "In biological terms, these differences might be related to the evolutionary"
#> [30] "pressures and adaptations of each species. For example:"
#> [31] ""
#> [32] "* Larger flowers (virginica) may indicate a stronger need for pollinators to"
#> [33] "attract more visitors to increase reproductive success. * Smaller flowers"
#> [34] "(setosa) could be an adaptation to conserve resources or reduce energy"
#> [35] "expenditure."
#> [36] ""
#> [37] "Keep in mind that this is a simplified analysis, and there might be other"
#> [38] "factors influencing these differences. However, based on the provided data, we"
#> [39] "can make some educated inferences about the distinct characteristics of each"
#> [40] "iris species."
library(NaileR)
library(FactoMineR)
data(waste)
waste <- waste[-14] # no variability on this question
set.seed(1)
res_mca_waste <- MCA(waste, quali.sup = c(1,2,50:76),
  ncp = 35, level.ventil = 0.05, graph = FALSE)
plot.MCA(res_mca_waste, choix = "ind",
  invisible = c("var", "quali.sup"), label = "none")
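# Cluster the individuals on the MCA coordinates; nb.clust = 3 is an
# assumption, chosen to match the three groups in the output below
res_hcpc_waste <- HCPC(res_mca_waste, nb.clust = 3, graph = FALSE)
don_clust_waste <- res_hcpc_waste$data.clust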
res_mca_waste <- MCA(don_clust_waste, quali.sup = c(1,2,50:77),
  ncp = 35, level.ventil = 0.05, graph = FALSE)
plot.MCA(res_mca_waste, choix = "ind",
  invisible = c("var", "quali.sup"), label = "none", habillage = 77)
intro_waste <- 'These data were collected
after a survey on food waste,
with participants describing their habits.'
intro_waste <- gsub('\n', ' ', intro_waste) |>
  stringr::str_squish()
req_waste <- 'Please summarize the characteristics of each group.
Then, give each group a new name, based on your conclusions.
Finally, give each group a grade between 0 and 10,
based on how wasteful they are with food:
0 being "not at all", 10 being "absolutely".'
req_waste <- gsub('\n', ' ', req_waste) |>
  stringr::str_squish()
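# The cluster variable is the last column of don_clust_waste
res_waste <- nail_catdes(don_clust_waste,
  num.var = ncol(don_clust_waste),
  introduction = intro_waste,
  request = req_waste,
  model = "llama3.1",
  drop.negative = TRUE,
  generate = TRUE)
# Load the result stored in the package (extdata/res_waste.rds)
res_waste <- readRDS(system.file("extdata", "res_waste.rds", package = "NaileR"))
# Wrap the response at 80 characters for display
formatted_text <- strwrap(res_waste$response, width = 80)
print(formatted_text)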
#> [1] "**Summary of Group Characteristics**"
#> [2] ""
#> [3] "### Group 1:"
#> [4] ""
#> [5] "* Never throw away fruits, vegetables, or dairy products * Buy discounted"
#> [6] "products with short shelf life * Do not often throw away any type of food"
#> [7] "product * Have a careful approach to food waste, only throwing away damaged or"
#> [8] "expired items"
#> [9] ""
#> [10] "### Group 2:"
#> [11] ""
#> [12] "* Never throw away dry goods * Rarely throw away fruits and vegetables (but"
#> [13] "have thrown them away for being rotten) * Use \"best before\" dates as a guide"
#> [14] "for discarding products * Do not often throw away dairy products or meat/fish *"
#> [15] "Have a moderate approach to food waste, throwing away items that are damaged or"
#> [16] "no longer desirable"
#> [17] ""
#> [18] "### Group 3:"
#> [19] ""
#> [20] "* Throw away dry goods and other types of food due to various reasons"
#> [21] "(expiration date passed, damaged, loss of taste quality) * Often throw away"
#> [22] "fruits and vegetables for being rotten or not meeting expectations * Have"
#> [23] "thrown away dairy products and meat/fish for expiration date passing or being"
#> [24] "damaged * Have a more lax approach to food waste, throwing away items that are"
#> [25] "slightly past their prime or no longer desirable"
#> [26] ""
#> [27] "**New Group Names**"
#> [28] ""
#> [29] "### Group 1: \"Conscious Consumers\""
#> [30] ""
#> [31] "### Group 2: \"Moderate Savers\""
#> [32] ""
#> [33] "### Group 3: \"Impulsive Discarders\""
#> [34] ""
#> [35] "**Wastefulness Grade (0-10)**"
#> [36] ""
#> [37] "* Group 1: **4** - Not wasteful at all, takes a careful approach to food waste."
#> [38] "* Group 2: **5** - Moderately wasteful, throws away some items but still has a"
#> [39] "moderate approach. * Group 3: **8** - Highly wasteful, often discards food due"
#> [40] "to various reasons."
#> [41] ""
#> [42] "Note that these grades are subjective and based on my interpretation of the"
#> [43] "data."