The naflex R package provides additional flexibility for handling missing values in summary functions beyond the existing options (na.rm = TRUE/FALSE) available in base R.
Most summary functions in base R, e.g. mean, provide two extreme options for handling missing values: remove all missing values before calculating (na.rm = TRUE), or return NA if any value is missing (na.rm = FALSE).

In many cases, something in between these two extremes is more appropriate. For example, you may wish to give a summary statistic only if less than 5% of values are missing.
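naflex provides helper functions to facilitate this flexibility. It allows missing values to be omitted conditionally, using four types of checks: the proportion of missing values, the number of missing values, the number of consecutive missing values, and the number of non-missing values.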
The motivating application for this package was the calculation of Climate Normals: long-term averages of surface meteorological measurements, e.g. total rainfall and mean temperature, that provide benchmark information about the climate at specific locations. The World Meteorological Organization (WMO) Guidelines on the Calculation of Climate Normals [1] provide recommendations to standardise these calculations across countries, including the handling of missing values.
For example, it recommends that a monthly mean value calculated from daily values should only be calculated when there are no more than 10 missing values in the month and no more than 4 days of consecutive missing values. Adhering to such rules using base R requires further calculations, which increases the complexity and length of the code. The aim of naflex is to make it easier to apply such rules routinely and efficiently as part of calculations.
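To make the extra work concrete, here is a minimal base R sketch of the monthly-mean rule quoted above. The function name wmo_monthly_mean, its argument names and the daily values are hypothetical and used only for illustration; they are not part of naflex or of the WMO guidelines.

```r
# Hypothetical helper (not part of naflex): monthly mean under the
# thresholds quoted above (at most 10 missing, at most 4 consecutive).
wmo_monthly_mean <- function(daily, max_missing = 10, max_consec = 4) {
  missing <- is.na(daily)
  # Run-length encoding gives the length of each run of TRUE/FALSE values
  runs <- rle(missing)
  longest_run <- if (any(runs$values)) max(runs$lengths[runs$values]) else 0
  if (sum(missing) <= max_missing && longest_run <= max_consec) {
    mean(daily, na.rm = TRUE)
  } else {
    NA_real_
  }
}

# Made-up daily temperatures for a 30-day month
daily <- rep(c(24, 26), 15)
daily[c(3, 4, 5, 12, 20)] <- NA  # 5 missing days, longest run of 3
wmo_monthly_mean(daily)          # both rules satisfied, so a mean is returned

daily[6:7] <- NA                 # days 3 to 7 are now missing (run of 5)
wmo_monthly_mean(daily)          # consecutive rule violated, so NA is returned
```

Even for a single rule set, the run-length logic and the branching have to be written out by hand each time.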
Install the current release from CRAN:
install.packages("naflex")
The latest development version is available from the package's GitHub repository.
The main function in naflex is na_omit_if. When wrapped around a vector in a summary function, na_omit_if ensures that the summary value is calculated when the checks pass, and that the result is NA if not. The example below shows how to calculate the mean, conditionally on the proportion of missing values.
library(naflex)
x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)
# Calculate if 30% or less missing values
mean(na_omit_if(x, prop = 0.3))
## [1] 4.142857
# Calculate if 20% or less missing values
mean(na_omit_if(x, prop = 0.2))
## [1] NA
Four types of checks are available:
- prop: the maximum proportion (0 to 1) of missing values allowed
- n: the maximum number of missing values allowed
- consec: the maximum number of consecutive missing values allowed, and
- n_non: the minimum number of non-missing values required.

If multiple checks are specified, all checks must pass for missing values to be removed. For example, although there are fewer than 4 missing values in x, there are two consecutive missing values, so the consec = 1 check fails and the result in the example below is NA.
# Calculate if 4 or less missing values and 1 or less consecutive missing values
mean(na_omit_if(x, n = 4, consec = 1))
## [1] NA
The %>% ("pipe") operator from magrittr can be used to make the code look clearer and more familiar. The beginning of the line is now the same as standard R, and na_omit_if moves after x, where it appears more like an option within the function, like na.rm, which is how you might think about na_omit_if conceptually in this case.
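A sketch of the piped form, assuming the same x and the same checks (n = 4, consec = 1) as in the previous example:

```r
library(naflex)
require(magrittr)

x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# x is piped into na_omit_if, so the line still begins with mean(x ...
mean(x %>% na_omit_if(n = 4, consec = 1))
```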
## Loading required package: magrittr
## [1] NA
Note that you should not use na_omit_if together with na.rm = TRUE in the summary function, since na.rm = TRUE always removes missing values and the checks are then effectively ignored.
How naflex works & more details

na_omit_if works by removing the missing values from x if the checks pass, and leaving x unmodified otherwise.
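As a sketch, calling na_omit_if directly on the x defined earlier shows the vector that the summary function receives when the checks pass. The threshold prop = 0.3 is an illustrative choice taken from the earlier example:

```r
library(naflex)

x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# 3 of 10 values are missing, so the prop = 0.3 check passes
# and the missing values are dropped
na_omit_if(x, prop = 0.3)
```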
## [1] 1 3 3 2 5 8 7
## attr(,"na.action")
## [1] 3 4 7
## attr(,"class")
## [1] "omit"
na_omit_if can be thought of as an extension of stats::na.omit: if missing values are removed, an na.action attribute and omit class are added for consistency with stats::na.omit.
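When the checks fail, by contrast, x is returned unchanged. A sketch, assuming the same x and an illustrative stricter threshold of prop = 0.2:

```r
library(naflex)

x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# 30% of values are missing, so the prop = 0.2 check fails
# and x comes back with its missing values still in place
na_omit_if(x, prop = 0.2)
```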
## [1] 1 3 NA NA 3 2 NA 5 8 7
A further set of four na_omit_if_* functions is provided for doing the same thing restricted to a single check, e.g. na_omit_if_n(x, 2).
na_check has the same parameters as na_omit_if but returns a logical indicating whether the checks pass. It is used internally in na_omit_if and may also be a useful helper function.
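As a sketch of how na_check might be used on its own, assuming the x defined earlier and the illustrative thresholds n = 4 and consec = 1, the logical result can drive an explicit branch:

```r
library(naflex)

x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# na_check returns TRUE only if every specified check passes
if (na_check(x, n = 4, consec = 1)) {
  mean(x, na.rm = TRUE)
} else {
  "NA checks fail"
}
```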
## [1] "NA checks fail"
A set of four na_check_* functions is also provided for doing the same thing restricted to a single check, e.g. na_check_prop(x, 0.2).

Finally, naflex provides a set of helper functions for calculating the missing value properties used in these checks.
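For intuition, the four properties can also be computed directly in base R. The following is a sketch of the quantities themselves, not of the naflex helpers, using the x defined earlier:

```r
x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# Proportion of missing values
mean(is.na(x))

# Number of missing values
sum(is.na(x))

# Longest run of consecutive missing values (assumes x has at least one NA)
runs <- rle(is.na(x))
max(runs$lengths[runs$values])

# Number of non-missing values
sum(!is.na(x))
```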
## [1] 0.3
## [1] 3
## [1] 2
## [1] 7
In base R, this functionality can often be achieved using a combination of ifelse, is.na, rle and the option na.rm = TRUE. naflex aims to simplify, shorten and standardise this process for users.
For example, a naflex call whose check fails and its base R equivalent both return the same value:
## [1] NA
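As a sketch of such a pair, assume a proportion check with the illustrative threshold prop = 0.2 on the x defined earlier. The naflex form is shown as a comment so that the block runs with base R alone:

```r
x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# naflex form of the rule: mean(na_omit_if(x, prop = 0.2))

# Base R form of the same rule: compute the proportion missing,
# then branch explicitly
base_result <- if (mean(is.na(x)) <= 0.2) mean(x, na.rm = TRUE) else NA
base_result
```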
The check for the longest sequence of consecutive missing values is more complex in base R and requires careful use of the rle function. When the check passes, the naflex call and its base R equivalent again give the same value:
## [1] 4.142857
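A sketch of the consecutive check in base R, assuming the x defined earlier and an illustrative threshold of consec = 2. The variable names are hypothetical, and the naflex form appears only as a comment:

```r
x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# naflex form of the rule: mean(na_omit_if(x, consec = 2))

# Base R form: run-length encode the missingness indicator and
# take the longest run of TRUE values (0 if nothing is missing)
runs <- rle(is.na(x))
longest_na_run <- if (any(runs$values)) max(runs$lengths[runs$values]) else 0

base_result <- if (longest_na_run <= 2) mean(x, na.rm = TRUE) else NA
base_result
```

The guard on any(runs$values) matters: without it, max() would be called on an empty vector whenever x has no missing values.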