Factors in R
The factor is a foundational data type in R. Factors are generally used to represent categorical variables, which may be intrinsically unordered (nominal) or ordered (ordinal). While the underlying data is often character, factors can be built on numerics as well. Factor variables are stored as integers pointing to unique values of underlying attribute labels. Thus, in a dataframe of a million records with 50 unique states, the character labels of a factor attribute for state are stored just once, accessed by a vector of integer pointers.
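The integer-pointer layout described above can be seen directly in a small base R sketch. The names `states_chr` and `states_fct` are mine, chosen for illustration, and the sizes printed will vary a little by platform; the sketch assumes a 64-bit build of R.

```r
# One million state codes drawn from R's built-in vector of 50 abbreviations
set.seed(42)
states_chr <- sample(state.abb, 1e6, replace = TRUE)
states_fct <- factor(states_chr)

# The labels are stored once, in the "levels" attribute
nlevels(states_fct)

# The data themselves are integer codes that index into the levels
typeof(states_fct)
as.integer(states_fct)[1:6]

# Mapping the codes back through the levels recovers the original strings
recovered <- levels(states_fct)[as.integer(states_fct)]
identical(recovered, states_chr)

# Compare the in-memory footprint of the two representations
print(object.size(states_chr), units = "MB")
print(object.size(states_fct), units = "MB")
```

On a 64-bit build each character element is an 8-byte pointer while each factor code is a 4-byte integer, so the factor version should come out at roughly half the size.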
Factors are both critical and convenient for ordering and modeling in R. Yet there are two sides to the question of whether to routinely load strings as factors or as character vars when building dataframes/data.tables. On the one hand, factors should be more storage-efficient when the number of unique labels is small relative to the number of rows, and R modeling and graphics functions often require factors to represent non-numeric data. On the other hand, it may be wasteful to store high-cardinality attributes as factors, and factors can suffer update anomalies that make them tricky to maintain. So there’s no absolute consensus, though more R analysts are on the stringsAsFactors side.
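Two of the maintenance pitfalls alluded to above are easy to reproduce. This is only a sketch with made-up vectors (`f`, `g`, `z` are illustrative names), not an exhaustive list of factor anomalies.

```r
# Pitfall 1: assigning a label that is not already a level silently yields NA
# (R issues a warning, suppressed here to keep the output clean)
f <- factor(c("MD", "VA", "MD"))
g <- f
suppressWarnings(g[2] <- "DC")
is.na(g[2])                        # the update was lost

# The usual remedy is to extend the levels before assigning
levels(f) <- c(levels(f), "DC")
f[2] <- "DC"
as.character(f)

# Pitfall 2: converting a numeric-looking factor returns the codes, not the values
z <- factor(c("10", "20", "10"))
codes  <- as.integer(z)                # integer codes
values <- as.integer(as.character(z))  # the values the labels spell out
codes
values
```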
I’ve been working with a large data set recently that provides an opportunity to examine how differences in storage between factor and character representations impact data.table size and load performance in R. The data consist of annual Medicare provider utilization from 2012–2017 (and counting). At this point, there are almost 57M records of 30 attributes in the combined data; about half the attributes are strings that can be stored as either character or unordered factor.
After adding the latest available raw data file (2017) to my notebook, I set out to examine differences in memory requirements and load performance between the two representations of strings. The remainder of this blog details those tests, as I create two Medicare data.tables: one with strings stored as character vars, the other with strings as factors. I then compare load performance and data size between the two, as well as timings/sizings writing rds, feather, and fst portable files. Finally, I contrast performance reading back the just-written feather files. In my tests one side wins in a rout. Find the results below.
The technology used is Windows 10 with JupyterLab 0.35.4, and R 3.6.0, along with R packages data.table 1.12.2, tidyverse 1.2.1, feather 0.3.3, and fst 0.9.0.
Set options and load pertinent R libraries.
In [1]:
options(warn=-1)
options(scipen=20)
options(datatable.print.topn=100)
options(datatable.showProgress=FALSE)
options(stringsAsFactors=TRUE)
suppressMessages(library(tidyverse))
suppressMessages(library(data.table))
suppressMessages(library(pryr))
suppressMessages(library(fst))
suppressMessages(library(feather))
cat("\n\n")
Define two support functions.
In [2]:
blanks <- function(howmany)
{
    bl = rep("\n",howmany)
    cat(bl)
}
blanks(2)
In [3]:
rmeta <- function(dt,data=TRUE,dict=FALSE)
{
    print(dim(dt))
    print(object_size(dt))
    blanks(1)
    if (data)
    {
        print(head(dt))
        blanks(1)
        print(tail(dt))
    }
    if (dict==TRUE) print(str(dt))
    cat("\n\n")
}
blanks(2)
Migrate to the working directory.
In [4]:
mdir <- "c:/bigdata/raw/medicare"
setwd(mdir)
GIG <- 1024^3
blanks(2)
Define the basic loader function to be invoked for each input csv file. saf=TRUE loads stringsAsFactors. The first line of the tab-delimited files contains attribute names, some of which change from year to year. The second lines of each file are copyright notices to be discarded; “good” data for every file begins in line 3. The “year” attribute is constructed from the input file name. load_dta is applied to each input file in a functional loop.
In [5]:
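load_dta <- function(fname,howmany=Inf,saf=TRUE)
{
    # line 1 holds the attribute names; lowercase them
    header <- tolower(fread(fname,nrows=1,header=FALSE))
    # skip the header and the copyright line; data begin on line 3
    slug <- fread(fname,header=FALSE,skip=2,nrows=howmany,stringsAsFactors=saf)
    names(slug) <- header
    # the 4-digit year follows "_CY" in the file name
    start <- str_locate(fname,"_CY")[2]+1
    slug$year <- as.integer(substring(fname,start,start+3))
    return(slug)
}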
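For readers who want to try the loader's logic without the Medicare files, here is a minimal base R sketch of the same three steps: take the names from line 1, discard line 2, and derive the year from the file name. The file name, the column subset, and the notice text are invented for illustration, and `read.delim` stands in for `fread` so that the sketch needs no packages.

```r
# Write a tiny tab-delimited file with the layout described above:
# line 1 = attribute names, line 2 = a notice to discard, data from line 3.
# The file name is hypothetical; only its "_CY<year>" suffix matters.
tmp <- file.path(tempdir(), "Example_Provider_File_CY2017.txt")
writeLines(c("NPI\tNPPES_PROVIDER_STATE",
             "0000000001\tnotice line to discard",
             "1003000126\tMD",
             "1003000134\tIL"), tmp)

# Attribute names come from the first line, lowercased
header <- tolower(unlist(strsplit(readLines(tmp, n = 1), "\t", fixed = TRUE)))

# Skip the first two lines and attach the names; strings load as factors
slug <- read.delim(tmp, header = FALSE, skip = 2, col.names = header,
                   stringsAsFactors = TRUE)

# The four digits after "_CY" in the file name give the year
pos   <- regexpr("_CY", basename(tmp), fixed = TRUE)
start <- as.integer(pos) + attr(pos, "match.length")
slug$year <- as.integer(substring(basename(tmp), start, start + 3))

str(slug)
```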
Load the data with strings as character vars. The 56.8M-record, 30-attribute data.table occupies 12.8G.
In [6]:
start <- proc.time()
medicares <- rbindlist(lapply(list.files(path = ".", pattern="*.txt"), load_dta,saf=FALSE),
                       use.names=TRUE, fill=TRUE)
rmeta(medicares,data=FALSE,dict=TRUE)
end <- proc.time()
print(end-start)
blanks(2)
[1] 56817685 30
12.8 GB
Classes 'data.table' and 'data.frame': 56817685 obs. of 30 variables:
$ npi : int 1003000126 1003000126 1003000126 1003000126 1003000126 1003000126 1003000126 1003000134 1003000134 1003000134 ...
$ nppes_provider_last_org_name : chr "ENKESHAFI" "ENKESHAFI" "ENKESHAFI" "ENKESHAFI" ...
$ nppes_provider_first_name : chr "ARDALAN" "ARDALAN" "ARDALAN" "ARDALAN" ...
$ nppes_provider_mi : chr "" "" "" "" ...
$ nppes_credentials : chr "M.D." "M.D." "M.D." "M.D." ...
$ nppes_provider_gender : chr "M" "M" "M" "M" ...
$ nppes_entity_code : chr "I" "I" "I" "I" ...
$ nppes_provider_street1 : chr "900 SETON DR" "900 SETON DR" "900 SETON DR" "900 SETON DR" ...
$ nppes_provider_street2 : chr "" "" "" "" ...
$ nppes_provider_city : chr "CUMBERLAND" "CUMBERLAND" "CUMBERLAND" "CUMBERLAND" ...
$ nppes_provider_zip : chr "215021854" "215021854" "215021854" "215021854" ...
$ nppes_provider_state : chr "MD" "MD" "MD" "MD" ...
$ nppes_provider_country : chr "US" "US" "US" "US" ...
$ provider_type : chr "Internal Medicine" "Internal Medicine" "Internal Medicine" "Internal Medicine" ...
$ medicare_participation_indicator: chr "Y" "Y" "Y" "Y" ...
$ place_of_service : chr "F" "F" "F" "F" ...
$ hcpcs_code : chr "99222" "99223" "99231" "99232" ...
$ hcpcs_description : chr "Initial hospital inpatient care, typically 50 minutes per day" "Initial hospital inpatient care, typically 70 minutes per day" "Subsequent hospital inpatient care, typically 15 minutes per day" "Subsequent hospital inpatient care, typically 25 minutes per day" ...
$ hcpcs_drug_indicator : chr "N" "N" "N" "N" ...
$ line_srvc_cnt : num 115 93 111 544 75 95 191 226 6070 13 ...
$ bene_unique_cnt : int 112 88 83 295 55 95 185 207 3624 13 ...
$ bene_day_srvc_cnt : int 115 93 111 544 75 95 191 209 4416 13 ...
$ average_medicare_allowed_amt : num 135.2 198.6 38.8 71 101.7 ...
$ stdev_medicare_allowed_amt : num 0 0 0 0 0 ...
$ average_submitted_chrg_amt : num 199 291 58 105 150 104 153 115 170 39 ...
$ stdev_submitted_chrg_amt : num 0 9.59 0 0 0 ...
$ average_medicare_payment_amt : num 108.1 158.9 30.7 56.7 81.4 ...
$ stdev_medicare_payment_amt : num 0.901 0 2.929 2.431 0 ...
$ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ average_medicare_standard_amt : num NA NA NA NA NA NA NA NA NA NA ...
- attr(*, ".internal.selfref")=<externalptr>
NULL
user system elapsed
255.94 15.03 221.39
Now load the same data with strings as factors. Note the reduction in data.table size from 12.8G to 8.7G with factor. That’s a big difference! Also note the elapsed times: 3+ minutes (221 seconds) for character vs 2+ minutes (134 seconds) for factor. Clear advantage to factor.
In [7]:
start <- proc.time()
medicaref <- rbindlist(lapply(list.files(path = ".", pattern="*.txt"), load_dta),
                       use.names=TRUE, fill=TRUE)
rmeta(medicaref,data=FALSE,dict=TRUE)
end <- proc.time()
print(end-start)
blanks(2)
[1] 56817685 30
8.74 GB
Classes 'data.table' and 'data.frame': 56817685 obs. of 30 variables:
$ npi : int 1003000126 1003000126 1003000126 1003000126 1003000126 1003000126 1003000126 1003000134 1003000134 1003000134 ...
$ nppes_provider_last_org_name : Factor w/ 303991 levels "10 33 AMBULANCE SERVICE LIMITED",..: 60738 60738 60738 60738 60738 60738 60738 37340 37340 37340 ...
$ nppes_provider_first_name : Factor w/ 77227 levels "","(BARBARA)",..: 3265 3265 3265 3265 3265 3265 3265 48840 48840 48840 ...
$ nppes_provider_mi : Factor w/ 34 levels "","'","(","-",..: 1 1 1 1 1 1 1 19 19 19 ...
$ nppes_credentials : Factor w/ 21331 levels "","(D.C.) CHIROPRACTIC",..: 5088 5088 5088 5088 5088 5088 5088 5088 5088 5088 ...
$ nppes_provider_gender : Factor w/ 3 levels "","F","M": 3 3 3 3 3 3 3 3 3 3 ...
$ nppes_entity_code : Factor w/ 2 levels "I","O": 1 1 1 1 1 1 1 1 1 1 ...
$ nppes_provider_street1 : Factor w/ 471135 levels "# 1 BURDICK EXPY W",..: 295066 295066 295066 295066 295066 295066 295066 141349 141349 141349 ...
$ nppes_provider_street2 : Factor w/ 96134 levels "","# 0987","# 1",..: 1 1 1 1 1 1 1 23355 23355 23355 ...
$ nppes_provider_city : Factor w/ 14439 levels "/WALDPORT","29 PALMS",..: 2503 2503 2503 2503 2503 2503 2503 3449 3449 3449 ...
$ nppes_provider_zip : Factor w/ 364963 levels "0000","00000",..: 61809 61809 61809 61809 61809 61809 61809 164058 164058 164058 ...
$ nppes_provider_state : Factor w/ 61 levels "AA","AE","AK",..: 26 26 26 26 26 26 26 20 20 20 ...
$ nppes_provider_country : Factor w/ 37 levels "AE","AR","AU",..: 23 23 23 23 23 23 23 23 23 23 ...
$ provider_type : Factor w/ 122 levels "Addiction Medicine",..: 37 37 37 37 37 37 37 61 61 61 ...
$ medicare_participation_indicator: Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
$ place_of_service : Factor w/ 2 levels "F","O": 1 1 1 1 1 1 1 1 1 1 ...
$ hcpcs_code : Factor w/ 7527 levels "00100","00102",..: 5230 5231 5235 5236 5237 5241 5242 4518 4519 4522 ...
$ hcpcs_description : Factor w/ 8098 levels "17-hydroxypregnenolone (hormone) level",..: 2056 2057 4731 4732 4733 1702 1703 3353 3350 3464 ...
$ hcpcs_drug_indicator : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
$ line_srvc_cnt : num 115 93 111 544 75 95 191 226 6070 13 ...
$ bene_unique_cnt : int 112 88 83 295 55 95 185 207 3624 13 ...
$ bene_day_srvc_cnt : int 115 93 111 544 75 95 191 209 4416 13 ...
$ average_medicare_allowed_amt : num 135.2 198.6 38.8 71 101.7 ...
$ stdev_medicare_allowed_amt : num 0 0 0 0 0 ...
$ average_submitted_chrg_amt : num 199 291 58 105 150 104 153 115 170 39 ...
$ stdev_submitted_chrg_amt : num 0 9.59 0 0 0 ...
$ average_medicare_payment_amt : num 108.1 158.9 30.7 56.7 81.4 ...
$ stdev_medicare_payment_amt : num 0.901 0 2.929 2.431 0 ...
$ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ average_medicare_standard_amt : num NA NA NA NA NA NA NA NA NA NA ...
- attr(*, ".internal.selfref")=<externalptr>
NULL
user system elapsed
169.07 28.75 134.28
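A rough back-of-envelope check suggests where most of the memory saving comes from. It assumes a 64-bit build of R, where each element of a character vector is an 8-byte pointer and each factor code is a 4-byte integer, and it ignores the space taken by the level labels themselves. The count of 18 string columns is read off the two str() listings above.

```r
# Rough estimate of the memory saved by storing string columns as factors
n_rows        <- 56817685   # rows reported by dim()
n_string_cols <- 18         # chr/Factor columns in the str() listings
bytes_chr     <- 8          # pointer per character element (64-bit R assumed)
bytes_fct     <- 4          # integer code per factor element

estimated_saving_gb <- n_rows * n_string_cols * (bytes_chr - bytes_fct) / 1e9
observed_saving_gb  <- 12.8 - 8.74   # sizes printed by object_size() above

round(c(estimated = estimated_saving_gb, observed = observed_saving_gb), 2)
```

The estimate lands at about 4.09 GB against an observed difference of about 4.06 GB, which is consistent with the pointer-versus-integer explanation.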
Write both the factor and character versions of medicare to portable rds files. Factor dramatically “rules” here in both size and elapsed time: 8.1G vs 19.7G, and 16 seconds vs 157. Incidentally, with each test, I get similar results when I juggle the order of the runs.
In [8]:
start <- proc.time()
mrds <- "medicares.rds"
saveRDS(medicares,mrds,compress=FALSE)
cat(file.info(mrds)$size/GIG)
blanks(2)
end <- proc.time()
print(end-start)
blanks(2)
19.70455
user system elapsed
129.40 25.68 157.20
In [9]:
start <- proc.time()
mrds <- "medicaref.rds"
saveRDS(medicaref,mrds,compress=FALSE)
cat(file.info(mrds)$size/GIG)
blanks(2)
end <- proc.time()
print(end-start)
blanks(2)
8.070449
user system elapsed
7.00 6.14 16.39
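The same comparison can be reproduced in miniature without the Medicare data. The sketch below uses only base R; the data frame, its size, and the helper `write_and_measure` are invented for illustration, and timings at this scale are too small to be meaningful, so the file sizes are the thing to look at.

```r
# Two copies of a small synthetic table: state as character vs as factor
set.seed(1)
n <- 200000
df_chr <- data.frame(state = sample(state.abb, n, replace = TRUE),
                     value = runif(n), stringsAsFactors = FALSE)
df_fct <- df_chr
df_fct$state <- factor(df_fct$state)

# Hypothetical helper: write an uncompressed rds file, report time and size
write_and_measure <- function(obj, path) {
  elapsed <- system.time(saveRDS(obj, path, compress = FALSE))[["elapsed"]]
  c(seconds = elapsed, megabytes = file.info(path)$size / 1024^2)
}

path_chr <- tempfile(fileext = ".rds")
path_fct <- tempfile(fileext = ".rds")
results <- rbind(character = write_and_measure(df_chr, path_chr),
                 factor    = write_and_measure(df_fct, path_fct))
print(results)

# Reading the factor file back returns the same object
roundtrip_ok <- identical(readRDS(path_fct), df_fct)
unlink(c(path_chr, path_fct))
```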
Same for feather files, both of which can be subsequently read with Python-Pandas.
In [10]:
start <- proc.time()
mfth <- "medicares.feather"
write_feather(medicares,mfth)
cat(file.info(mfth)$size/GIG)
blanks(2)
end <- proc.time()
print(end-start)
blanks(2)
15.96077
user system elapsed
18.25 16.55 43.31
In [11]:
start <- proc.time()
mfth <- "medicaref.feather"
write_feather(medicaref,mfth)
cat(file.info(mfth)$size/GIG)
blanks(2)
end <- proc.time()
print(end-start)
blanks(2)
8.131509
user system elapsed
5.24 5.53 15.71
Now compare the results of writing files using R’s splendid fst package. Fast and small with medicares; faster and smaller with medicaref.
In [12]:
start <- proc.time()
mfst <- "medicares.fst"
write_fst(medicares,mfst)
cat(file.info(mfst)$size/GIG)
blanks(2)
end <- proc.time()
print(end-start)
blanks(2)
9.128304
user system elapsed
19.55 8.83 28.29
In [13]:
start <- proc.time()
mfst <- "medicaref.fst"
write_fst(medicaref,mfst)
cat(file.info(mfst)$size/GIG)
blanks(2)
end <- proc.time()
print(end-start)
blanks(2)
3.241613
user system elapsed
9.52 2.32 5.43
Finally, contrast file size and elapsed time when reading back the just-produced feather files. The character file is almost twice as large as the factor file (16.0G vs 8.1G) and took about 4 times as long to read (40 seconds vs 10).
In [14]:
start <- proc.time()
mfth <- "medicaref.feather"
medicaref <- read_feather(mfth)
cat(file.info(mfth)$size/GIG)
blanks(2)
end <- proc.time()
print(end-start)
blanks(2)
8.131509
user system elapsed
5.54 4.93 10.47
In [15]:
start <- proc.time()
mfth <- "medicares.feather"
medicares <- read_feather(mfth)
cat(file.info(mfth)$size/GIG)
blanks(2)
end <- proc.time()
print(end-start)
blanks(2)
15.96077
user system elapsed
33.86 6.56 40.43
While factor attributes in R occupy an important role in ordering and modeling, it hadn’t been clear to me how heavily size and read/write performance are affected by the parameter stringsAsFactors=TRUE. The outcomes presented in this blog are consistent and unmistakable: on tests of a 57M-record data.table of 30 attributes, about half of which can be stored as either character vectors or factors, factor dominates character on both data.table size and load performance.
My take is that with small data, the choice of stringsAsFactors may be open, but for large data like presented here, storing character attributes as factors is a no-brainer and should be the norm in an R data load methodology.
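One way to act on this while still respecting the earlier caveat about high-cardinality attributes is to convert only the character columns whose labels repeat often. The helper below is a hypothetical sketch of mine, not something used in the tests above; `chr_to_factor` and its `max_unique_share` cutoff are invented names, and the 0.5 default is arbitrary. It is written for a plain data.frame.

```r
# Hypothetical helper: convert a character column to factor only when its
# unique values make up at most max_unique_share of the rows
chr_to_factor <- function(df, max_unique_share = 0.5) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.character(col) &&
        length(unique(col)) <= max_unique_share * length(col)) {
      df[[nm]] <- factor(col)
    }
  }
  df
}

demo <- data.frame(
  id    = sprintf("row%04d", 1:1000),           # every value unique
  state = rep(c("MD", "IL", "VA", "TX"), 250),  # four repeating labels
  stringsAsFactors = FALSE
)
converted <- chr_to_factor(demo)
sapply(converted, class)   # id stays character, state becomes factor
```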