The task
You have values in a continuous variable that you want to classify based on more than two criteria.
The Chat Edition of the tl;dr
In this article, the author discusses the process of classifying values in a continuous variable based on multiple criteria. The author demonstrates the use of R's `dplyr` package and the `if_else()` and `case_when()` functions to achieve this classification. Additionally, the author introduces a custom function, `bin_mass()`, to improve the process and highlights the advantages of using a functional language like R. Finally, the author briefly touches upon how the same task can be performed in Julia, another programming language, and advises caution when categorizing continuous variables.
The example
Stats and R has a recent post on {dplyr} that got me started on this post. They use the penguins dataset from {palmerpenguins} to walk through the Way of Tidy approach to data processing.
library(dplyr)
library(palmerpenguins)
dat <- penguins
dat |>
  mutate(
    body_mass_cat = if_else(body_mass_g >= 4000, # condition
      "High", # output if condition is true
      "Low" # output if condition is false
    )
  )
This works well enough. Note the use of `if_else()`, the {dplyr} edition of {base} `ifelse()`. This handles `NA` values implicitly (a missing condition simply comes back as `NA` unless the `missing` argument is set), one of the reasons I don't use {dplyr}. Everything should be explicit.
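For anyone who stays with {dplyr}, the `NA` handling can at least be spelled out through the `missing` argument of `if_else()`. A minimal sketch on a toy vector; the values and the "Unknown" label are made up for illustration and are not from the penguins data:

```r
library(dplyr)

# toy body masses in grams; the NA stands in for an unmeasured bird
mass <- c(3000, NA, 5000)

# `missing` states outright what an NA condition should become
mass_cat <- if_else(mass >= 4000, "High", "Low", missing = "Unknown")
mass_cat
```

Whether a missing mass deserves its own label or should stay `NA` is a judgment call; the point is only that the choice is visible in the code.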
The three-body problem
It gets ugly moving beyond two categories due to the need to begin nesting `if_else()`. {dplyr} again silently takes care of `NA` cases.
# nested if else
dat |>
  mutate(
    body_mass_cat = if_else(body_mass_g < 3500, # first condition
      "Low", # output if first condition is true
      if_else(body_mass_g > 4750, # second condition when first condition is false
        "High", # output when second condition is true
        "Medium" # output when second condition is false
      )
    )
  )
The reason for the ugliness is punctuation, specifically balanced parentheses, the same barrier to LISP. It's easy to flub. Besides, residual algebraic trauma syndrome from school days doesn't help.
case_when() to the rescue
dat |>
  mutate(
    body_mass_cat = case_when(
      body_mass_g < 3500 ~ "Low",
      body_mass_g >= 3500 & body_mass_g <= 4750 ~ "Medium",
      body_mass_g > 4750 ~ "High"
    )
  )
This is cleaner and easier on the eye. However, `case_when()` silently handles the `NA` case and this code uses two magic numbers. These are intended to represent the breakpoints for the 25th and 75th percentiles of this variable in this dataset and would have to be changed, manually, any time the snippet is used for other data.
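Both complaints can be addressed without leaving `case_when()`. The sketch below computes the breakpoints from the data with `quantile()` and gives `NA` its own explicit branch. The toy vector `mass` and the names `lo` and `hi` are assumptions for illustration, not part of the original snippet:

```r
library(dplyr)

# toy body masses in grams, including one missing value
mass <- c(2900, 3550, NA, 4050, 4750, 6000)

# derive the breakpoints from the data instead of hard-coding them
breaks <- quantile(mass, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
lo <- breaks[1]
hi <- breaks[2]

mass_cat <- case_when(
  is.na(mass) ~ NA_character_, # the NA case, stated explicitly
  mass < lo ~ "Low",
  mass > hi ~ "High",
  TRUE ~ "Medium" # everything from lo to hi inclusive
)
mass_cat
```

The same code now works unchanged on other data, since the breakpoints travel with the variable.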
Use the function, Luke
R is a crappy procedural/imperative language and grief awaits anyone using it who expects otherwise. R is also a nifty functional language. Not great, but its heart is in the right place.
Apply this function object to that data object to get this new object using zero or more of these arguments.
That is what virtually every `help(function)` page directs.
So, let's follow the Way of (y) = f(x)
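library(palmerpenguins)
dat <- as.data.frame(penguins)
# categorize according to the interquartile range
# x: the data frame; y: the numeric variable to classify
bin_mass <- function(x, y) {
  x$body_mass_cat <- NA # placeholder; rows where y is NA keep this value
  lo <- fivenum(y)[2] # lower hinge, roughly the 25th percentile
  hi <- fivenum(y)[4] # upper hinge, roughly the 75th percentile
  the_na <- which(is.na(y))
  the_lo <- which(y < lo)
  the_hi <- which(y > hi)
  # whatever is not NA, low, or high is in the middle
  mids <- setdiff(seq_len(nrow(x)), c(the_na, the_lo, the_hi))
  x$body_mass_cat[the_lo] <- "Low"
  x$body_mass_cat[the_hi] <- "High"
  x$body_mass_cat[mids] <- "Medium"
  return(x$body_mass_cat)
}
dat$body_mass_cat <- bin_mass(dat, dat$body_mass_g)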
Why the HELL should I do THAT?
Because it's like eating your veggies; it's good for you.
The teardown
`penguins` arrives as a tibble for no good reason. It does need to be a data frame because of the mix of factor and numeric variables. If the data were all numeric, I would have converted it to a matrix. Linear algebra makes short work of problems encased in a matrix.
`bin_mass()` takes two arguments, a data frame and the variable to be classified. When working for my own use, I'd make the `y` argument an integer corresponding to its column index. The data frame has `dim()` of `344 8`, not that hard to deal with. Magical Number 7±2.
`x`, the data frame, does two things:
Avoids having to extract the variable as a vector
Assures that the return value will have as many rows as the data frame it is to be added to
The first step is to create a new variable to hold the results. It is initially set to `NA` as a placeholder, and as a cheat: rows where the source is missing are simply left at that default. This provides one check against the source variable because the counts of `NA` should be identical.
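That check is a one-liner. A sketch with made-up values, borrowing the 3500 and 4750 breakpoints from the earlier snippets purely for illustration; `src` and `cat_out` are hypothetical names:

```r
# toy source variable with two missing values
src <- c(3200, NA, 4100, 5200, NA)

# start from NA placeholders and fill in only the rows we can classify
cat_out <- rep(NA_character_, length(src))
cat_out[which(src < 3500)] <- "Low"
cat_out[which(src > 4750)] <- "High"
cat_out[which(src >= 3500 & src <= 4750)] <- "Medium"

# the sanity check: the result has exactly as many NA as the source
na_match <- sum(is.na(cat_out)) == sum(is.na(src))
na_match
```

If the two counts ever disagree, some row either fell through every condition or was classified despite being missing.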
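`fivenum()` is like `quantile()`, sort of. It's a bit of geekery: it returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum), and the hinges are close to the quartiles without always being identical to them. See `help(boxplot.stats)`. The idea is to sort the values of the variable into buckets according to where they rank. Not a big deal.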
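To see the "sort of" in action, here is a small base R comparison on the integers 1 to 4, chosen only because the two functions disagree on them. On larger samples the difference is usually small:

```r
x <- 1:4

# Tukey's hinges: elements 2 and 4 of the five-number summary
hinges <- fivenum(x)[c(2, 4)]

# default (type 7) quartiles from quantile()
quarts <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

hinges # 1.5 3.5
quarts # 1.75 3.25
```

Note also that `fivenum()` drops `NA` values by default, while `quantile()` stops with an error unless `na.rm = TRUE` is supplied.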
The three `which()` lines create vectors of row indexes. One is for `NA` values, one for those below the first quartile (25th percentile) and one for those above the third quartile (75th percentile). To create the index for the values falling between, we just create a sequence of the same length as the number of rows and then remove the three `which()` return values with `setdiff()`.
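On a toy vector the index bookkeeping looks like this. The values are invented so the result is easy to eyeball; the variable names follow the function above:

```r
y <- c(10, NA, 20, 30, 40, 50)

lo <- fivenum(y)[2] # 20
hi <- fivenum(y)[4] # 40

the_na <- which(is.na(y)) # 2
the_lo <- which(y < lo)   # 1
the_hi <- which(y > hi)   # 6

# all row numbers, minus the ones already spoken for
mids <- setdiff(seq_along(y), c(the_na, the_lo, the_hi)) # 3 4 5
```

`which()` quietly drops the `NA` comparisons, which is why the missing row needs its own index vector before the `setdiff()`.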
So much for the windup, here's the pitch. We now know the specific rows to change from `NA` to one of the three categories and we can proceed to do that directly. And, if there are five categories or six or sixty, the pattern is the same.
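Here is one way the pattern might stretch to any number of categories. This is a sketch, not the function above: `bin_by_breaks()` is a hypothetical helper, and it assumes buckets that are open on the left and closed on the right, which differs slightly from the three-way version where a value equal to a breakpoint lands in "Medium".

```r
# y: numeric vector; breaks: sorted cut points; labels: one more than breaks
bin_by_breaks <- function(y, breaks, labels) {
  stopifnot(length(labels) == length(breaks) + 1)
  out <- rep(NA_character_, length(y)) # NA placeholders, as before
  edges <- c(-Inf, breaks, Inf)
  for (i in seq_along(labels)) {
    # rows falling in bucket i; which() drops the NA comparisons
    idx <- which(y > edges[i] & y <= edges[i + 1])
    out[idx] <- labels[i]
  }
  out
}

y <- c(1, 5, NA, 10, 15, 20)
four_way <- bin_by_breaks(y, breaks = c(5, 10, 15), labels = c("A", "B", "C", "D"))
four_way
```

Sixty categories would need sixty labels and fifty-nine breakpoints, but not one more line of logic.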
What would this look like in Julia?
Julia is a best-of-both-worlds language:
Write and test interactively and then let run wild in compiled form to achieve execution speeds of up to 1 petabyte/second (about 2^50 bytes, which is roughly 1,000 times bigger than a terabyte, which is 1,000 times bigger than a gigabyte, which is 1,000 times bigger than a megabyte, which is 1,000 times bigger than a kilobyte)
Use dynamic (*duck-typed*) variables or declare their types
Choice of mixing procedural and functional approaches
Ability to integrate foreign languages, including R
using RCall
using DataFrames
using Statistics
# load the package in the embedded R session, then copy the data frame into Julia
R"library(palmerpenguins)"
d = rcopy(DataFrame, R"as.data.frame(penguins)")
v = d.body_mass_g
# everything to this point was in aid of bringing in the data from R
# it would be more direct bringing in from CSV
q = quantile(skipmissing(v), [0.25, 0.75])
# missing stays missing; everything else falls into one of three buckets
new_v = [ismissing(x) ? missing : (x < q[1] ? "Low" : (x > q[2] ? "High" : "Medium")) for x in v]
Cognitive take
Data analysts do, well, analysis. Analysis is breaking stuff down into its smallest components. It's hard. It's tedious. It's error-prone. It's unnatural. We would much rather use our intuition, which has awesome powers of solving analog problems. Not so much digital.
Use categorization with care
Sometimes, removing detail helps. Sometimes it doesn't. After all, detail can be noise, but it can also be signal.