The function varselect()
in the leaps package
can be used for variable selection. Available approaches are forward,
backward, and exhaustive selection. The DAAG package has the
functions bestsetNoise()
and bsnVaryNvar()
that are designed to give insight on the sampling properties of output
from the function lm()
, when one of these variable
selection approaches has been used to choose the explanatory variables
that appear in the model.
The function bestsetNoise()
(DAAG) can be used
to experiment with the behaviour of various variable selection
techniques with data that is purely noise. Maindonald and Braun (2011), Section 6.5, pp.~197-198, gives
examples of the use of this function. For example, try:
The analyses will typically yield a model that, if assessed using
output from the R function lm()
, appears to have highly
(but spuriously) statistically significant explanatory power, with one
or more coefficients that appear (again spuriously) significant at a
level of around p=0.01 or
less.
The function bestsetNoise()
has provision to specify the
model matrix. Model matrices with uncorrelated columns of independent
Normal data, which is the default, are not a good match to most
practical situations.
As above, datasets of random normal data were created, always with
100 observations and with the number of variables varying between 3 and
50. For three variables, there was no selection, while in other cases
the
`best'' three variables were selected, by exhaustive search. Figure \@ref(fig:exhaust) plots the p-values for the 3 variables that were selected against the total number of variables. The fitted line estimates the median $p$-value, as a function of
nvar. The function
bsnVaryNvar()that is used for the calculations makes repeated calls to
bestsetNoise()`.
Similar results will be obtained from use of forward or backward
selection.
cap1
## Estimating learning rate. Each dot corresponds to a loss evaluation.
## qu = 0.5........done
Code is:
## Code
suppressPackageStartupMessages(library(qgam, quietly=TRUE))
set.seed(37) # Use to reproduce graph that is shown
bsnVaryNvar(m=100, nvar=3:50, nvmax=3)
When all 3 variables are taken, the p-values are expected to average 0.5. Notice that, for selection of the best 3 variables out of 10, the median p-value has reduced to about 0.1.