Cubist is an R port of the Cubist GPL C code released by RuleQuest at http://rulequest.com/cubist-info.html. See the last section of this document for information on the porting. The other parts describe the functionality of the R package.
Cubist is a rule-based model that is an extension of Quinlan’s M5 model tree. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.
This is explained better in Quinlan (1992). Wang and Witten (1997) attempted to recreate this model using a “rational reconstruction” of Quinlan (1992) that is the basis for the M5P model in Weka (and the R package RWeka).
An example of a model tree can be illustrated using the Ames housing data in the modeldata package.
library(Cubist)
data(ames, package = "modeldata")
# model the data on the log10 scale
ames$Sale_Price <- log10(ames$Sale_Price)
set.seed(11)
in_train_set <- sample(1:nrow(ames), floor(.8*nrow(ames)))
predictors <-
c("Lot_Area", "Alley", "Lot_Shape", "Neighborhood", "Bldg_Type",
"Year_Built", "Total_Bsmt_SF", "Central_Air", "Gr_Liv_Area",
"Bsmt_Full_Bath", "Bsmt_Half_Bath", "Full_Bath", "Half_Bath",
"TotRms_AbvGrd", "Year_Sold", "Longitude", "Latitude")
train_pred <- ames[ in_train_set, predictors]
test_pred <- ames[-in_train_set, predictors]
train_resp <- ames$Sale_Price[ in_train_set]
test_resp <- ames$Sale_Price[-in_train_set]
model_tree <- cubist(x = train_pred, y = train_resp)
model_tree
##
## Call:
## cubist.default(x = train_pred, y = train_resp)
##
## Number of samples: 2344
## Number of predictors: 17
##
## Number of committees: 1
## Number of rules: 14
##
summary(model_tree)
## Call:
## cubist.default(x = train_pred, y = train_resp)
##
##
## Cubist [Release 2.07 GPL Edition] Mon Jul 1 21:23:48 2024
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 2344 cases (18 attributes) from undefined.data
##
## Model:
##
## Rule 1: [48 cases, mean 4.859286, range 4.106837 to 5.176091, est err 0.114526]
##
## if
## Neighborhood in {Old_Town, Sawyer_West, Iowa_DOT_and_Rail_Road}
## Year_Built <= 1952
## Central_Air = N
## Gr_Liv_Area <= 1692
## then
## outcome = -352.195658 + 10.59 Latitude + 0.000362 Gr_Liv_Area
## - 0.044 Year_Sold
##
## Rule 2: [59 cases, mean 4.981545, range 4.594393 to 5.311754, est err 0.066423]
##
## if
## Neighborhood in {North_Ames, Edwards, Sawyer, Brookside, Crawford,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1952
## Central_Air = N
## Gr_Liv_Area <= 1692
## then
## outcome = 3.738655 + 0.000281 Gr_Liv_Area + 8e-05 Total_Bsmt_SF
## + 0.00047 Year_Built
##
## Rule 3: [333 cases, mean 5.089450, range 4.69897 to 5.676693, est err 0.063499]
##
## if
## Neighborhood in {Old_Town, Edwards, Gilbert, Sawyer, Sawyer_West,
## Brookside, Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1952
## Central_Air = Y
## then
## outcome = -36.367569 + 0.000242 Gr_Liv_Area + 5.9e-05 Total_Bsmt_SF
## + 0.00085 Year_Built + 0.94 Latitude - 0.01 TotRms_AbvGrd
## + 0.036 Bsmt_Half_Bath + 0.006 Bsmt_Full_Bath + 5e-07 Lot_Area
##
## Rule 4: [176 cases, mean 5.137743, range 4.788875 to 5.332842, est err 0.034310]
##
## if
## Neighborhood in {College_Creek, Edwards, Somerset, Northridge_Heights,
## Gilbert, Sawyer_West, Mitchell, Iowa_DOT_and_Rail_Road,
## Timberland, Clear_Creek, Meadow_Village, Briardale,
## Blueste, Landmark}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = -28.661356 + 0.00333 Year_Built + 0.000188 Gr_Liv_Area
## + 7.4e-05 Total_Bsmt_SF + 5.6e-06 Lot_Area + 0.64 Latitude
## - 0.007 TotRms_AbvGrd
##
## Rule 5: [35 cases, mean 5.150411, range 4.954243 to 5.361728, est err 0.050057]
##
## if
## Neighborhood in {North_Ames, Old_Town, Sawyer, Northwest_Ames, Crawford,
## Green_Hills}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = 87.372832 + 0.000171 Total_Bsmt_SF + 8.5e-05 Gr_Liv_Area
## + 0.00127 Year_Built - 2.02 Latitude + 3.8e-06 Lot_Area
## - 0.005 TotRms_AbvGrd
##
## Rule 6: [76 cases, mean 5.157218, range 4.973128 to 5.363612, est err 0.045520]
##
## if
## Neighborhood in {North_Ames, Crawford}
## Year_Built <= 1952
## Central_Air = Y
## Gr_Liv_Area <= 1692
## then
## outcome = 141.774591 + 0.000129 Gr_Liv_Area - 3.26 Latitude
## + 4.2e-06 Lot_Area + 0.037 Bsmt_Full_Bath + 9e-05 Year_Built
## + 6e-06 Total_Bsmt_SF
##
## Rule 7: [54 cases, mean 5.171031, range 4.79588 to 5.44832, est err 0.051527]
##
## if
## Bldg_Type in {Duplex, Twnhs}
## Gr_Liv_Area > 1692
## then
## outcome = 2.31639 + 0.000232 Gr_Liv_Area + 0.00125 Year_Built
## - 0.018 TotRms_AbvGrd + 6.1e-05 Total_Bsmt_SF + 9e-07 Lot_Area
## + 0.01 Half_Bath + 0.007 Bsmt_Full_Bath
##
## Rule 8: [473 cases, mean 5.192192, range 4.826075 to 5.509555, est err 0.037515]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Brookside, Iowa_DOT_and_Rail_Road, Timberland,
## Meadow_Village, Bloomington_Heights, Northpark_Villa}
## Total_Bsmt_SF > 765
## Gr_Liv_Area <= 1692
## Bsmt_Full_Bath > 0
## then
## outcome = 23.436857 + 0.00245 Year_Built + 0.000128 Gr_Liv_Area
## + 0.000103 Total_Bsmt_SF + 2.5e-06 Lot_Area + 0.015 Full_Bath
## + 0.25 Longitude + 0.003 TotRms_AbvGrd
##
## Rule 9: [480 cases, mean 5.198712, range 4.778151 to 5.463805, est err 0.040757]
##
## if
## Year_Built > 1952
## Total_Bsmt_SF > 765
## Gr_Liv_Area <= 1692
## Bsmt_Full_Bath <= 0
## then
## outcome = 18.268736 + 0.000174 Gr_Liv_Area + 0.00279 Year_Built
## + 8.9e-05 Total_Bsmt_SF + 3.3e-06 Lot_Area
## - 0.011 TotRms_AbvGrd + 0.016 Bsmt_Full_Bath + 0.26 Longitude
## + 0.13 Latitude
##
## Rule 10: [315 cases, mean 5.297982, range 4.905256 to 5.676693, est err 0.051186]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University,
## Meadow_Village, Veenker}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2004
## Gr_Liv_Area > 1692
## then
## outcome = -28.023112 + 0.000157 Gr_Liv_Area + 0.0015 Year_Built
## + 8.4e-05 Total_Bsmt_SF - 0.015 TotRms_AbvGrd + 0.03 Full_Bath
## + 0.029 Half_Bath + 2.3e-06 Lot_Area - 0.32 Longitude
## + 0.015 Bsmt_Full_Bath
##
## Rule 11: [161 cases, mean 5.445366, range 5.142662 to 5.872156, est err 0.052062]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Brookside, Crawford,
## Northridge, Stone_Brook, Clear_Creek}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2004
## Gr_Liv_Area > 1692
## then
## outcome = 2.344921 + 0.000151 Gr_Liv_Area + 0.00134 Year_Built
## + 8.3e-05 Total_Bsmt_SF + 0.034 Bsmt_Full_Bath
## - 0.004 TotRms_AbvGrd + 0.007 Half_Bath + 6e-07 Lot_Area
##
## Rule 12: [275 cases, mean 5.452962, range 5.136721 to 5.872156, est err 0.051166]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Crawford, Northridge,
## Stone_Brook, Clear_Creek, Veenker, Blueste, Greens,
## Green_Hills}
## Year_Built > 1952
## Bsmt_Full_Bath > 0
## then
## outcome = 19.156714 + 0.000178 Gr_Liv_Area + 0.000159 Total_Bsmt_SF
## + 0.00176 Year_Built + 1.4e-06 Lot_Area + 0.19 Longitude
## + 0.007 Bsmt_Full_Bath - 0.002 TotRms_AbvGrd
##
## Rule 13: [113 cases, mean 5.491452, range 5.281034 to 5.765619, est err 0.039038]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF <= 1907
## Gr_Liv_Area > 1692
## then
## outcome = 25.674216 + 0.0097 Year_Built + 0.000152 Gr_Liv_Area
## + 0.000109 Total_Bsmt_SF + 0.057 Bsmt_Full_Bath
## + 0.68 Longitude + 0.56 Latitude
##
## Rule 14: [35 cases, mean 5.602593, range 5.20412 to 5.786508, est err 0.077426]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF > 1907
## then
## outcome = -0.069641 - 9.9e-05 Gr_Liv_Area + 0.008 Bsmt_Full_Bath
## + 0.14 Latitude + 0.001 TotRms_AbvGrd
##
##
## Evaluation on training data (2344 cases):
##
## Average |error| 0.053913
## Relative |error| 0.39
## Correlation coefficient 0.90
##
##
## Attribute usage:
## Conds Model
##
## 80% 97% Year_Built
## 76% 100% Gr_Liv_Area
## 74% Neighborhood
## 50% 97% Total_Bsmt_SF
## 47% 70% Bsmt_Full_Bath
## 20% Bldg_Type
## 20% Central_Air
## 90% Lot_Area
## 89% TotRms_AbvGrd
## 63% Longitude
## 49% Latitude
## 30% Full_Bath
## 20% Half_Bath
## 13% Bsmt_Half_Bath
## 2% Year_Sold
##
##
## Time: 0.0 secs
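To make the rule output above concrete, the following base R sketch applies Rule 2 by hand to a single house. The house values are hypothetical (they are not a row of the Ames data), and the sketch ignores smoothing and any further processing Cubist applies, for example when more than one rule covers a sample, so it will not necessarily match the value returned by predict().

```r
# Hypothetical house; values chosen only to satisfy the conditions of Rule 2
house <- data.frame(
  Neighborhood  = "Edwards",
  Year_Built    = 1940,
  Central_Air   = "N",
  Gr_Liv_Area   = 1000,
  Total_Bsmt_SF = 800
)

# Conditions of Rule 2, copied from the summary() output
rule_2_neighborhoods <- c("North_Ames", "Edwards", "Sawyer", "Brookside",
                          "Crawford", "South_and_West_of_Iowa_State_University")
covered <- house$Neighborhood %in% rule_2_neighborhoods &
  house$Year_Built <= 1952 &
  house$Central_Air == "N" &
  house$Gr_Liv_Area <= 1692

# Linear model of Rule 2 (the outcome is log10 sale price)
rule_2_pred <- 3.738655 +
  0.000281 * house$Gr_Liv_Area +
  8e-05    * house$Total_Bsmt_SF +
  0.00047  * house$Year_Built

covered         # TRUE: every condition of the rule holds
rule_2_pred     # about 4.995 on the log10 scale
10^rule_2_pred  # the same value back on the original price scale
```

Reading a rule this way is mainly useful for checking one's understanding of the printed output.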
There is no formula method for cubist(); the predictors are specified as a matrix or data frame, and the outcome is a numeric vector.
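Because there is no formula interface, a model that might otherwise be written as mpg ~ wt + hp + disp is specified by passing the predictor columns and the outcome separately. A minimal sketch, using the built-in mtcars data purely for illustration (it is not part of this document's analysis, and the object names are hypothetical):

```r
library(Cubist)

# Predictors as a data frame, outcome as a numeric vector
x <- mtcars[, c("wt", "hp", "disp")]
y <- mtcars$mpg

small_fit <- cubist(x = x, y = y)

# Predictions come back as a numeric vector, one value per row of newdata
small_pred <- predict(small_fit, x)
length(small_pred)
```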
There is a predict method for the model:
model_tree_pred <- predict(model_tree, test_pred)
## Test set RMSE
sqrt(mean((model_tree_pred - test_resp)^2))
## [1] 0.0751
## Test set R^2
cor(model_tree_pred, test_resp)^2
## [1] 0.82
The Cubist model can also use a boosting-like scheme called committees, in which model trees are created in sequence. The first tree follows the procedure described in the previous section. Subsequent trees are created using adjusted versions of the training set outcome: if the model over-predicted a value, the response is adjusted downward for the next model (and so on; see this blog post). Unlike traditional boosting, stage weights for each committee are not used to average the predictions from each model tree; the final prediction is a simple average of the predictions from each model tree.
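The two ideas in this paragraph can be sketched in a few lines of base R. The numbers are made up, and the adjustment formula used here (shifting the target by the previous error, y - (pred - y)) is an assumption taken from common descriptions of Cubist rather than from this document; the C code may differ in detail.

```r
# Toy outcome values and first-committee predictions (hypothetical numbers)
y      <- c(5.10, 5.30, 4.90)
pred_1 <- c(5.20, 5.25, 4.90)

# Assumed adjustment: over-prediction lowers the next target,
# under-prediction raises it, and an exact prediction leaves it unchanged
y_adj <- y - (pred_1 - y)
y_adj

# Predictions from three hypothetical committee members for two new samples
member_preds <- cbind(m1 = c(5.20, 5.25),
                      m2 = c(5.05, 5.33),
                      m3 = c(5.11, 5.29))

# The final prediction is the unweighted mean across members
final_pred <- rowMeans(member_preds)
final_pred
```

The first sample was over-predicted (5.20 versus 5.10), so its adjusted target drops to 5.00; the second was under-predicted, so its target rises to 5.35.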
The committees argument can be used to control the number of model trees:
com_model <- cubist(x = train_pred, y = train_resp, committees = 3)
summary(com_model)
##
## Call:
## cubist.default(x = train_pred, y = train_resp, committees = 3)
##
##
## Cubist [Release 2.07 GPL Edition] Mon Jul 1 21:23:49 2024
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 2344 cases (18 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [48 cases, mean 4.859286, range 4.106837 to 5.176091, est err 0.114526]
##
## if
## Neighborhood in {Old_Town, Sawyer_West, Iowa_DOT_and_Rail_Road}
## Year_Built <= 1952
## Central_Air = N
## Gr_Liv_Area <= 1692
## then
## outcome = -352.195658 + 10.59 Latitude + 0.000362 Gr_Liv_Area
## - 0.044 Year_Sold
##
## Rule 1/2: [59 cases, mean 4.981545, range 4.594393 to 5.311754, est err 0.066423]
##
## if
## Neighborhood in {North_Ames, Edwards, Sawyer, Brookside, Crawford,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1952
## Central_Air = N
## Gr_Liv_Area <= 1692
## then
## outcome = 3.738655 + 0.000281 Gr_Liv_Area + 8e-05 Total_Bsmt_SF
## + 0.00047 Year_Built
##
## Rule 1/3: [333 cases, mean 5.089450, range 4.69897 to 5.676693, est err 0.063499]
##
## if
## Neighborhood in {Old_Town, Edwards, Gilbert, Sawyer, Sawyer_West,
## Brookside, Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1952
## Central_Air = Y
## then
## outcome = -36.367569 + 0.000242 Gr_Liv_Area + 5.9e-05 Total_Bsmt_SF
## + 0.00085 Year_Built + 0.94 Latitude - 0.01 TotRms_AbvGrd
## + 0.036 Bsmt_Half_Bath + 0.006 Bsmt_Full_Bath + 5e-07 Lot_Area
##
## Rule 1/4: [176 cases, mean 5.137743, range 4.788875 to 5.332842, est err 0.034310]
##
## if
## Neighborhood in {College_Creek, Edwards, Somerset, Northridge_Heights,
## Gilbert, Sawyer_West, Mitchell, Iowa_DOT_and_Rail_Road,
## Timberland, Clear_Creek, Meadow_Village, Briardale,
## Blueste, Landmark}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = -28.661356 + 0.00333 Year_Built + 0.000188 Gr_Liv_Area
## + 7.4e-05 Total_Bsmt_SF + 5.6e-06 Lot_Area + 0.64 Latitude
## - 0.007 TotRms_AbvGrd
##
## Rule 1/5: [35 cases, mean 5.150411, range 4.954243 to 5.361728, est err 0.050057]
##
## if
## Neighborhood in {North_Ames, Old_Town, Sawyer, Northwest_Ames, Crawford,
## Green_Hills}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = 87.372832 + 0.000171 Total_Bsmt_SF + 8.5e-05 Gr_Liv_Area
## + 0.00127 Year_Built - 2.02 Latitude + 3.8e-06 Lot_Area
## - 0.005 TotRms_AbvGrd
##
## Rule 1/6: [76 cases, mean 5.157218, range 4.973128 to 5.363612, est err 0.045520]
##
## if
## Neighborhood in {North_Ames, Crawford}
## Year_Built <= 1952
## Central_Air = Y
## Gr_Liv_Area <= 1692
## then
## outcome = 141.774591 + 0.000129 Gr_Liv_Area - 3.26 Latitude
## + 4.2e-06 Lot_Area + 0.037 Bsmt_Full_Bath + 9e-05 Year_Built
## + 6e-06 Total_Bsmt_SF
##
## Rule 1/7: [54 cases, mean 5.171031, range 4.79588 to 5.44832, est err 0.051527]
##
## if
## Bldg_Type in {Duplex, Twnhs}
## Gr_Liv_Area > 1692
## then
## outcome = 2.31639 + 0.000232 Gr_Liv_Area + 0.00125 Year_Built
## - 0.018 TotRms_AbvGrd + 6.1e-05 Total_Bsmt_SF + 9e-07 Lot_Area
## + 0.01 Half_Bath + 0.007 Bsmt_Full_Bath
##
## Rule 1/8: [473 cases, mean 5.192192, range 4.826075 to 5.509555, est err 0.037515]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Brookside, Iowa_DOT_and_Rail_Road, Timberland,
## Meadow_Village, Bloomington_Heights, Northpark_Villa}
## Total_Bsmt_SF > 765
## Gr_Liv_Area <= 1692
## Bsmt_Full_Bath > 0
## then
## outcome = 23.436857 + 0.00245 Year_Built + 0.000128 Gr_Liv_Area
## + 0.000103 Total_Bsmt_SF + 2.5e-06 Lot_Area + 0.015 Full_Bath
## + 0.25 Longitude + 0.003 TotRms_AbvGrd
##
## Rule 1/9: [480 cases, mean 5.198712, range 4.778151 to 5.463805, est err 0.040757]
##
## if
## Year_Built > 1952
## Total_Bsmt_SF > 765
## Gr_Liv_Area <= 1692
## Bsmt_Full_Bath <= 0
## then
## outcome = 18.268736 + 0.000174 Gr_Liv_Area + 0.00279 Year_Built
## + 8.9e-05 Total_Bsmt_SF + 3.3e-06 Lot_Area
## - 0.011 TotRms_AbvGrd + 0.016 Bsmt_Full_Bath + 0.26 Longitude
## + 0.13 Latitude
##
## Rule 1/10: [315 cases, mean 5.297982, range 4.905256 to 5.676693, est err 0.051186]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University,
## Meadow_Village, Veenker}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2004
## Gr_Liv_Area > 1692
## then
## outcome = -28.023112 + 0.000157 Gr_Liv_Area + 0.0015 Year_Built
## + 8.4e-05 Total_Bsmt_SF - 0.015 TotRms_AbvGrd + 0.03 Full_Bath
## + 0.029 Half_Bath + 2.3e-06 Lot_Area - 0.32 Longitude
## + 0.015 Bsmt_Full_Bath
##
## Rule 1/11: [161 cases, mean 5.445366, range 5.142662 to 5.872156, est err 0.052062]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Brookside, Crawford,
## Northridge, Stone_Brook, Clear_Creek}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2004
## Gr_Liv_Area > 1692
## then
## outcome = 2.344921 + 0.000151 Gr_Liv_Area + 0.00134 Year_Built
## + 8.3e-05 Total_Bsmt_SF + 0.034 Bsmt_Full_Bath
## - 0.004 TotRms_AbvGrd + 0.007 Half_Bath + 6e-07 Lot_Area
##
## Rule 1/12: [275 cases, mean 5.452962, range 5.136721 to 5.872156, est err 0.051166]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Crawford, Northridge,
## Stone_Brook, Clear_Creek, Veenker, Blueste, Greens,
## Green_Hills}
## Year_Built > 1952
## Bsmt_Full_Bath > 0
## then
## outcome = 19.156714 + 0.000178 Gr_Liv_Area + 0.000159 Total_Bsmt_SF
## + 0.00176 Year_Built + 1.4e-06 Lot_Area + 0.19 Longitude
## + 0.007 Bsmt_Full_Bath - 0.002 TotRms_AbvGrd
##
## Rule 1/13: [113 cases, mean 5.491452, range 5.281034 to 5.765619, est err 0.039038]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF <= 1907
## Gr_Liv_Area > 1692
## then
## outcome = 25.674216 + 0.0097 Year_Built + 0.000152 Gr_Liv_Area
## + 0.000109 Total_Bsmt_SF + 0.057 Bsmt_Full_Bath
## + 0.68 Longitude + 0.56 Latitude
##
## Rule 1/14: [35 cases, mean 5.602593, range 5.20412 to 5.786508, est err 0.077426]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF > 1907
## then
## outcome = -0.069641 - 9.9e-05 Gr_Liv_Area + 0.008 Bsmt_Full_Bath
## + 0.14 Latitude + 0.001 TotRms_AbvGrd
##
## Model 2:
##
## Rule 2/1: [66 cases, mean 4.924712, range 4.106837 to 5.266937, est err 0.114951]
##
## if
## Central_Air = N
## Gr_Liv_Area <= 2035
## Longitude > -93.62571
## then
## outcome = -580.059188 + 13.91 Latitude + 0.000299 Gr_Liv_Area
##
## Rule 2/2: [144 cases, mean 4.960665, range 4.106837 to 5.311754, est err 0.099679]
##
## if
## Central_Air = N
## Gr_Liv_Area <= 2035
## then
## outcome = 66.952321 + 0.000172 Gr_Liv_Area + 0.000122 Total_Bsmt_SF
## - 0.031 Year_Sold
##
## Rule 2/3: [314 cases, mean 5.090793, range 4.69897 to 5.676693, est err 0.062296]
##
## if
## Neighborhood in {Old_Town, Edwards, Gilbert, Sawyer_West,
## Iowa_DOT_and_Rail_Road,
## South_and_West_of_Iowa_State_University,
## Meadow_Village, Briardale, Northpark_Villa}
## Year_Built <= 1968
## Central_Air = Y
## then
## outcome = -67.722399 + 0.000207 Gr_Liv_Area - 0.01 TotRms_AbvGrd
## + 0.78 Latitude + 0.024 Bsmt_Full_Bath - 0.42 Longitude
## + 0.00024 Year_Built + 1.2e-05 Total_Bsmt_SF + 6e-07 Lot_Area
## + 0.012 Bsmt_Half_Bath
##
## Rule 2/4: [86 cases, mean 5.111280, range 4.788875 to 5.311754, est err 0.075818]
##
## if
## Bldg_Type = Duplex
## Year_Built <= 1990
## then
## outcome = 3.017972 + 0.000162 Gr_Liv_Area + 0.0009 Year_Built
## + 4.9e-05 Total_Bsmt_SF + 0.034 Bsmt_Full_Bath
## - 0.011 TotRms_AbvGrd + 2.3e-06 Lot_Area
## + 0.049 Bsmt_Half_Bath
##
## Rule 2/5: [116 cases, mean 5.121052, range 4.851258 to 5.50515, est err 0.041411]
##
## if
## Neighborhood in {Old_Town, Edwards, Gilbert, Sawyer_West,
## Iowa_DOT_and_Rail_Road,
## South_and_West_of_Iowa_State_University,
## Meadow_Village, Briardale, Northpark_Villa}
## Year_Built > 1968
## Year_Built <= 1990
## then
## outcome = -112.729989 + 0.00697 Year_Built + 0.000177 Gr_Liv_Area
## + 0.000109 Total_Bsmt_SF - 0.74 Longitude + 0.82 Latitude
## - 0.002 TotRms_AbvGrd + 0.005 Bsmt_Full_Bath
##
## Rule 2/6: [1037 cases, mean 5.200723, range 4.594393 to 5.676693, est err 0.047478]
##
## if
## Neighborhood in {North_Ames, College_Creek, Sawyer, Northwest_Ames,
## Mitchell, Brookside, Timberland, Blueste}
## then
## outcome = -11.037934 + 0.000128 Gr_Liv_Area + 8e-05 Total_Bsmt_SF
## + 0.00095 Year_Built + 0.032 Bsmt_Full_Bath + 0.022 Full_Bath
## + 0.046 Bsmt_Half_Bath + 1e-06 Lot_Area + 0.01 Half_Bath
## - 0.15 Longitude
##
## Rule 2/7: [265 cases, mean 5.260237, range 5.069113 to 5.449479, est err 0.025156]
##
## if
## Year_Built > 1990
## Total_Bsmt_SF <= 952
## Gr_Liv_Area <= 2035
## then
## outcome = -0.65428 + 0.000208 Gr_Liv_Area + 0.00279 Year_Built
## + 0.03 Bsmt_Full_Bath
##
## Rule 2/8: [141 cases, mean 5.292914, range 4.955928 to 5.585461, est err 0.064230]
##
## if
## Neighborhood in {Crawford, Stone_Brook, Clear_Creek, Veenker, Greens,
## Green_Hills}
## Year_Built <= 1990
## then
## outcome = -13.943719 + 0.000192 Gr_Liv_Area + 0.00106 Year_Built
## + 3.3e-05 Total_Bsmt_SF - 0.023 Full_Bath
## + 0.019 Bsmt_Full_Bath + 0.026 Bsmt_Half_Bath
## - 0.004 TotRms_AbvGrd + 0.012 Half_Bath + 9e-07 Lot_Area
## - 0.18 Longitude
##
## Rule 2/9: [133 cases, mean 5.323634, range 4.79588 to 5.676693, est err 0.094539]
##
## if
## Neighborhood in {North_Ames, Old_Town, Edwards, Gilbert, Sawyer,
## Northwest_Ames, Sawyer_West, Mitchell, Brookside,
## Iowa_DOT_and_Rail_Road,
## South_and_West_of_Iowa_State_University, Clear_Creek,
## Meadow_Village, Veenker}
## Gr_Liv_Area > 2035
## then
## outcome = -114.880872 + 2.8 Latitude + 0.00119 Year_Built
## + 0.048 Full_Bath + 1.1e-05 Gr_Liv_Area + 8e-06 Total_Bsmt_SF
## + 0.006 Bsmt_Full_Bath
##
## Rule 2/10: [457 cases, mean 5.357634, range 4.926857 to 5.672098, est err 0.043722]
##
## if
## Year_Built > 1990
## Total_Bsmt_SF > 952
## Gr_Liv_Area <= 2035
## then
## outcome = 31.082771 + 0.00743 Year_Built + 0.000272 Gr_Liv_Area
## + 0.000144 Total_Bsmt_SF + 0.059 Bsmt_Full_Bath
## - 0.015 TotRms_AbvGrd + 3.9e-06 Lot_Area + 0.44 Longitude
##
## Rule 2/11: [82 cases, mean 5.482795, range 5.281034 to 5.872156, est err 0.050959]
##
## if
## Neighborhood in {College_Creek, Somerset, Northridge_Heights, Crawford,
## Timberland, Northridge, Stone_Brook}
## Year_Built <= 2002
## Gr_Liv_Area > 2035
## then
## outcome = -1.964556 + 0.000126 Gr_Liv_Area + 3.8e-06 Lot_Area
## + 0.00018 Year_Built + 0.16 Latitude + 0.004 Bsmt_Full_Bath
##
## Rule 2/12: [78 cases, mean 5.585224, range 5.380211 to 5.788875, est err 0.048668]
##
## if
## Neighborhood in {College_Creek, Somerset, Northridge_Heights, Crawford,
## Timberland, Northridge, Stone_Brook}
## Year_Built > 2002
## Gr_Liv_Area > 2035
## then
## outcome = 137.506637 + 0.01229 Year_Built + 9.6e-05 Gr_Liv_Area
## + 1.91 Longitude + 0.000107 Total_Bsmt_SF
## + 0.044 Bsmt_Full_Bath + 0.52 Latitude + 9e-07 Lot_Area
##
## Model 3:
##
## Rule 3/1: [33 cases, mean 4.765631, range 4.106837 to 5.041393, est err 0.159903]
##
## if
## Central_Air = N
## Gr_Liv_Area <= 845
## then
## outcome = -391.954369 + 0.000768 Gr_Liv_Area - 4.23 Longitude
## + 7e-05 Year_Built
##
## Rule 3/2: [104 cases, mean 4.928864, range 4.106837 to 5.175802, est err 0.091334]
##
## if
## Gr_Liv_Area <= 845
## then
## outcome = -1.229338 + 0.000359 Total_Bsmt_SF + 2.31e-05 Lot_Area
## + 0.00297 Year_Built + 5.1e-05 Gr_Liv_Area
##
## Rule 3/3: [215 cases, mean 5.123382, range 4.740363 to 5.44832, est err 0.057230]
##
## if
## Bldg_Type in {TwoFmCon, Duplex, Twnhs}
## Year_Built <= 2005
## Gr_Liv_Area > 845
## then
## outcome = 41.243193 + 0.000147 Gr_Liv_Area + 0.00201 Year_Built
## + 0.056 Bsmt_Full_Bath + 0.48 Longitude
## + 2.6e-05 Total_Bsmt_SF - 0.006 TotRms_AbvGrd + 8e-07 Lot_Area
## + 0.11 Latitude + 0.004 Half_Bath
##
## Rule 3/4: [248 cases, mean 5.136273, range 4.79588 to 5.676693, est err 0.074758]
##
## if
## Year_Built <= 1951
## Total_Bsmt_SF > 800
## Gr_Liv_Area > 845
## then
## outcome = 4.685451 + 0.000248 Gr_Liv_Area + 0.114 Half_Bath
## - 0.029 TotRms_AbvGrd + 5e-06 Lot_Area + 0.0001 Year_Built
## + 6e-06 Total_Bsmt_SF
##
## Rule 3/5: [438 cases, mean 5.150761, range 4.60206 to 5.580925, est err 0.057916]
##
## if
## Bldg_Type in {OneFam, TwnhsE}
## Total_Bsmt_SF <= 800
## Gr_Liv_Area > 845
## then
## outcome = 2.032241 + 0.000165 Gr_Liv_Area + 0.000116 Total_Bsmt_SF
## + 0.00142 Year_Built + 0.033 Full_Bath - 0.003 TotRms_AbvGrd
## + 0.008 Bsmt_Full_Bath + 6e-07 Lot_Area + 0.006 Half_Bath
##
## Rule 3/6: [1358 cases, mean 5.252755, range 4.117271 to 5.872156, est err 0.042818]
##
## if
## Bldg_Type in {OneFam, TwnhsE}
## Year_Built > 1951
## Year_Built <= 2005
## then
## outcome = 40.010366 + 0.000187 Gr_Liv_Area + 0.00307 Year_Built
## + 9.9e-05 Total_Bsmt_SF + 0.038 Bsmt_Full_Bath
## - 0.011 TotRms_AbvGrd + 2.5e-06 Lot_Area + 0.44 Longitude
##
## Rule 3/7: [235 cases, mean 5.400186, range 4.926857 to 5.765619, est err 0.047093]
##
## if
## Year_Built > 2005
## Total_Bsmt_SF <= 2006
## then
## outcome = -3.631741 + 0.00424 Year_Built + 0.000131 Gr_Liv_Area
## + 0.000125 Total_Bsmt_SF + 0.075 Bsmt_Full_Bath
## + 5.4e-06 Lot_Area + 0.011 TotRms_AbvGrd
##
## Rule 3/8: [20 cases, mean 5.583736, range 5.20412 to 5.786508, est err 0.103229]
##
## if
## Year_Built > 2005
## Total_Bsmt_SF > 2006
## then
## outcome = 4.772929 - 0.000158 Gr_Liv_Area + 0.00059 Year_Built
## + 0.002 TotRms_AbvGrd + 4e-06 Total_Bsmt_SF
##
##
## Evaluation on training data (2344 cases):
##
## Average |error| 0.051966
## Relative |error| 0.38
## Correlation coefficient 0.91
##
##
## Attribute usage:
## Conds Model
##
## 70% 96% Year_Built
## 52% 100% Gr_Liv_Area
## 47% Neighborhood
## 36% 94% Total_Bsmt_SF
## 32% Bldg_Type
## 15% 83% Bsmt_Full_Bath
## 13% Central_Air
## 66% Longitude
## 87% Lot_Area
## 73% TotRms_AbvGrd
## 32% Half_Bath
## 31% Full_Bath
## 28% Latitude
## 23% Bsmt_Half_Bath
## 2% Year_Sold
##
##
## Time: 0.1 secs
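For this model:
com_pred <- predict(com_model, test_pred)
## RMSE
sqrt(mean((com_pred - test_resp)^2))
## [1] 0.0708
## R^2
cor(com_pred, test_resp)^2
## [1] 0.839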
Another innovation in Cubist is the use of nearest neighbors to adjust the predictions from the rule-based model. First, a model tree (with or without committees) is created. Once a sample is predicted by this model, Cubist can find its nearest neighbors and determine the average of these training set points. See Quinlan (1993a) for the details of the adjustment as well as this blog post.
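As a rough illustration of the idea, the base R sketch below adjusts a rule-based prediction using the K nearest training points. It follows one common description of the method, in which each neighbor contributes its observed outcome shifted by the difference between the model's prediction for the new sample and for that neighbor. The unweighted average, the distance on a single predictor, and all of the numbers and object names are simplifying assumptions; Cubist's own distance and weighting scheme are not reproduced here.

```r
# Toy training set: one predictor, observed outcomes, and the
# rule-based model's predictions for those same points (all hypothetical)
toy_x   <- c(1.0, 2.0, 3.0, 4.0, 5.0)
toy_y   <- c(1.2, 1.9, 3.3, 3.9, 5.2)
toy_fit <- c(1.0, 2.0, 3.0, 4.0, 5.0)

# New sample and its rule-based prediction
new_x   <- 2.6
new_fit <- 2.6

# Indices of the K nearest training points
K <- 2
nearest <- order(abs(toy_x - new_x))[seq_len(K)]

# Each neighbor's observed outcome, shifted by the difference in model predictions
neighbor_estimates <- toy_y[nearest] + (new_fit - toy_fit[nearest])

# Adjusted prediction: simple average over the neighbors
adjusted_pred <- mean(neighbor_estimates)
adjusted_pred
```

Here the two nearest points (x = 3 and x = 2) give estimates of 2.9 and 2.5, so the adjusted prediction moves from 2.6 to 2.7.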
The development of rules and committees is independent of the choice of using instances. The original C code allowed the user to choose whether to use instances, not use them, or let the program decide. Our approach is to build a model with the cubist() function that is agnostic to the decision about instances. When samples are predicted, the argument neighbors can be used to adjust the rule-based model predictions (or not).
We can add instances to the previously fit committee model:
inst_pred <- predict(com_model, test_pred, neighbors = 5)
## RMSE
sqrt(mean((inst_pred - test_resp)^2))
## [1] 0.0688
## R^2
cor(inst_pred, test_resp)^2
## [1] 0.848
Note that the previous models used the implicit default of neighbors = 0 for their predictions.
It may also be useful to see how the different models fit a single predictor. Here is the test set data for a model with one predictor (Gr_Liv_Area), 100 committees, and various values of neighbors:
After the initial use of the instance-based correction, there is very little change in the mainstream of the data.
R modeling packages such as caret, tidymodels, and mlr3 can be used to tune the model. See the examples here for more details.
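Those packages handle resampling and grid construction. To show what is being tuned, here is a hand-rolled sketch that compares a few values of committees and neighbors on a single holdout split of the built-in mtcars data. The data set, the split, the grid, and the object names are illustrative assumptions, and a single small holdout is far noisier than the resampling those packages provide.

```r
library(Cubist)

set.seed(1)
holdout <- sample(seq_len(nrow(mtcars)), 8)
x_cols  <- c("wt", "hp", "disp", "qsec")

x_fit <- mtcars[-holdout, x_cols]
y_fit <- mtcars$mpg[-holdout]
x_new <- mtcars[holdout, x_cols]
y_new <- mtcars$mpg[holdout]

# Candidate tuning values (neighbors is limited to the range 0 to 9)
grid <- expand.grid(committees = c(1, 5, 10), neighbors = c(0, 3, 5))
grid$rmse <- NA_real_

for (i in seq_len(nrow(grid))) {
  # Refit for every row to keep the loop simple; committees is a fitting
  # argument, while neighbors is only used at prediction time
  fit  <- cubist(x = x_fit, y = y_fit, committees = grid$committees[i])
  pred <- predict(fit, x_new, neighbors = grid$neighbors[i])
  grid$rmse[i] <- sqrt(mean((pred - y_new)^2))
}

# Smallest holdout RMSE first
grid[order(grid$rmse), ]
```

Because neighbors is applied only when predicting, a more efficient version would fit once per value of committees and reuse that fit across the neighbors values.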
It should be noted that this variable importance measure does not capture the influence of the predictors when using the instance-based correction.
Rules from a Cubist model can be viewed using summary() as follows:
summary(model_tree)
##
## Call:
## cubist.default(x = train_pred, y = train_resp)
##
##
## Cubist [Release 2.07 GPL Edition] Mon Jul 1 21:23:48 2024
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 2344 cases (18 attributes) from undefined.data
##
## Model:
##
## Rule 1: [48 cases, mean 4.859286, range 4.106837 to 5.176091, est err 0.114526]
##
## if
## Neighborhood in {Old_Town, Sawyer_West, Iowa_DOT_and_Rail_Road}
## Year_Built <= 1952
## Central_Air = N
## Gr_Liv_Area <= 1692
## then
## outcome = -352.195658 + 10.59 Latitude + 0.000362 Gr_Liv_Area
## - 0.044 Year_Sold
##
## Rule 2: [59 cases, mean 4.981545, range 4.594393 to 5.311754, est err 0.066423]
##
## if
## Neighborhood in {North_Ames, Edwards, Sawyer, Brookside, Crawford,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1952
## Central_Air = N
## Gr_Liv_Area <= 1692
## then
## outcome = 3.738655 + 0.000281 Gr_Liv_Area + 8e-05 Total_Bsmt_SF
## + 0.00047 Year_Built
##
## Rule 3: [333 cases, mean 5.089450, range 4.69897 to 5.676693, est err 0.063499]
##
## if
## Neighborhood in {Old_Town, Edwards, Gilbert, Sawyer, Sawyer_West,
## Brookside, Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University}
## Year_Built <= 1952
## Central_Air = Y
## then
## outcome = -36.367569 + 0.000242 Gr_Liv_Area + 5.9e-05 Total_Bsmt_SF
## + 0.00085 Year_Built + 0.94 Latitude - 0.01 TotRms_AbvGrd
## + 0.036 Bsmt_Half_Bath + 0.006 Bsmt_Full_Bath + 5e-07 Lot_Area
##
## Rule 4: [176 cases, mean 5.137743, range 4.788875 to 5.332842, est err 0.034310]
##
## if
## Neighborhood in {College_Creek, Edwards, Somerset, Northridge_Heights,
## Gilbert, Sawyer_West, Mitchell, Iowa_DOT_and_Rail_Road,
## Timberland, Clear_Creek, Meadow_Village, Briardale,
## Blueste, Landmark}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = -28.661356 + 0.00333 Year_Built + 0.000188 Gr_Liv_Area
## + 7.4e-05 Total_Bsmt_SF + 5.6e-06 Lot_Area + 0.64 Latitude
## - 0.007 TotRms_AbvGrd
##
## Rule 5: [35 cases, mean 5.150411, range 4.954243 to 5.361728, est err 0.050057]
##
## if
## Neighborhood in {North_Ames, Old_Town, Sawyer, Northwest_Ames, Crawford,
## Green_Hills}
## Year_Built > 1952
## Total_Bsmt_SF <= 765
## Gr_Liv_Area <= 1692
## then
## outcome = 87.372832 + 0.000171 Total_Bsmt_SF + 8.5e-05 Gr_Liv_Area
## + 0.00127 Year_Built - 2.02 Latitude + 3.8e-06 Lot_Area
## - 0.005 TotRms_AbvGrd
##
## Rule 6: [76 cases, mean 5.157218, range 4.973128 to 5.363612, est err 0.045520]
##
## if
## Neighborhood in {North_Ames, Crawford}
## Year_Built <= 1952
## Central_Air = Y
## Gr_Liv_Area <= 1692
## then
## outcome = 141.774591 + 0.000129 Gr_Liv_Area - 3.26 Latitude
## + 4.2e-06 Lot_Area + 0.037 Bsmt_Full_Bath + 9e-05 Year_Built
## + 6e-06 Total_Bsmt_SF
##
## Rule 7: [54 cases, mean 5.171031, range 4.79588 to 5.44832, est err 0.051527]
##
## if
## Bldg_Type in {Duplex, Twnhs}
## Gr_Liv_Area > 1692
## then
## outcome = 2.31639 + 0.000232 Gr_Liv_Area + 0.00125 Year_Built
## - 0.018 TotRms_AbvGrd + 6.1e-05 Total_Bsmt_SF + 9e-07 Lot_Area
## + 0.01 Half_Bath + 0.007 Bsmt_Full_Bath
##
## Rule 8: [473 cases, mean 5.192192, range 4.826075 to 5.509555, est err 0.037515]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Brookside, Iowa_DOT_and_Rail_Road, Timberland,
## Meadow_Village, Bloomington_Heights, Northpark_Villa}
## Total_Bsmt_SF > 765
## Gr_Liv_Area <= 1692
## Bsmt_Full_Bath > 0
## then
## outcome = 23.436857 + 0.00245 Year_Built + 0.000128 Gr_Liv_Area
## + 0.000103 Total_Bsmt_SF + 2.5e-06 Lot_Area + 0.015 Full_Bath
## + 0.25 Longitude + 0.003 TotRms_AbvGrd
##
## Rule 9: [480 cases, mean 5.198712, range 4.778151 to 5.463805, est err 0.040757]
##
## if
## Year_Built > 1952
## Total_Bsmt_SF > 765
## Gr_Liv_Area <= 1692
## Bsmt_Full_Bath <= 0
## then
## outcome = 18.268736 + 0.000174 Gr_Liv_Area + 0.00279 Year_Built
## + 8.9e-05 Total_Bsmt_SF + 3.3e-06 Lot_Area
## - 0.011 TotRms_AbvGrd + 0.016 Bsmt_Full_Bath + 0.26 Longitude
## + 0.13 Latitude
##
## Rule 10: [315 cases, mean 5.297982, range 4.905256 to 5.676693, est err 0.051186]
##
## if
## Neighborhood in {North_Ames, College_Creek, Old_Town, Edwards, Gilbert,
## Sawyer, Northwest_Ames, Sawyer_West, Mitchell,
## Iowa_DOT_and_Rail_Road, Timberland,
## South_and_West_of_Iowa_State_University,
## Meadow_Village, Veenker}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2004
## Gr_Liv_Area > 1692
## then
## outcome = -28.023112 + 0.000157 Gr_Liv_Area + 0.0015 Year_Built
## + 8.4e-05 Total_Bsmt_SF - 0.015 TotRms_AbvGrd + 0.03 Full_Bath
## + 0.029 Half_Bath + 2.3e-06 Lot_Area - 0.32 Longitude
## + 0.015 Bsmt_Full_Bath
##
## Rule 11: [161 cases, mean 5.445366, range 5.142662 to 5.872156, est err 0.052062]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Brookside, Crawford,
## Northridge, Stone_Brook, Clear_Creek}
## Bldg_Type in {OneFam, TwoFmCon, TwnhsE}
## Year_Built <= 2004
## Gr_Liv_Area > 1692
## then
## outcome = 2.344921 + 0.000151 Gr_Liv_Area + 0.00134 Year_Built
## + 8.3e-05 Total_Bsmt_SF + 0.034 Bsmt_Full_Bath
## - 0.004 TotRms_AbvGrd + 0.007 Half_Bath + 6e-07 Lot_Area
##
## Rule 12: [275 cases, mean 5.452962, range 5.136721 to 5.872156, est err 0.051166]
##
## if
## Neighborhood in {Somerset, Northridge_Heights, Crawford, Northridge,
## Stone_Brook, Clear_Creek, Veenker, Blueste, Greens,
## Green_Hills}
## Year_Built > 1952
## Bsmt_Full_Bath > 0
## then
## outcome = 19.156714 + 0.000178 Gr_Liv_Area + 0.000159 Total_Bsmt_SF
## + 0.00176 Year_Built + 1.4e-06 Lot_Area + 0.19 Longitude
## + 0.007 Bsmt_Full_Bath - 0.002 TotRms_AbvGrd
##
## Rule 13: [113 cases, mean 5.491452, range 5.281034 to 5.765619, est err 0.039038]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF <= 1907
## Gr_Liv_Area > 1692
## then
## outcome = 25.674216 + 0.0097 Year_Built + 0.000152 Gr_Liv_Area
## + 0.000109 Total_Bsmt_SF + 0.057 Bsmt_Full_Bath
## + 0.68 Longitude + 0.56 Latitude
##
## Rule 14: [35 cases, mean 5.602593, range 5.20412 to 5.786508, est err 0.077426]
##
## if
## Year_Built > 2004
## Total_Bsmt_SF > 1907
## then
## outcome = -0.069641 - 9.9e-05 Gr_Liv_Area + 0.008 Bsmt_Full_Bath
## + 0.14 Latitude + 0.001 TotRms_AbvGrd
##
##
## Evaluation on training data (2344 cases):
##
## Average |error| 0.053913
## Relative |error| 0.39
## Correlation coefficient 0.90
##
##
## Attribute usage:
## Conds Model
##
## 80% 97% Year_Built
## 76% 100% Gr_Liv_Area
## 74% Neighborhood
## 50% 97% Total_Bsmt_SF
## 47% 70% Bsmt_Full_Bath
## 20% Bldg_Type
## 20% Central_Air
## 90% Lot_Area
## 89% TotRms_AbvGrd
## 63% Longitude
## 49% Latitude
## 30% Full_Bath
## 20% Half_Bath
## 13% Bsmt_Half_Bath
## 2% Year_Sold
##
##
## Time: 0.0 secs
The tidy() function in the rules package returns rules in a tibble (an
extension of data frames) with one row per rule. The tibble provides
information about each rule and can be used to programmatically extract
data from the model. For example:
## # A tibble: 14 × 5
## committee rule_num rule estimate statistic
## <int> <int> <chr> <list> <list>
## 1 1 1 ( Central_Air == 'N' ) & ( Neighborhoo… <tibble> <tibble>
## 2 1 2 ( Central_Air == 'N' ) & ( Neighborhoo… <tibble> <tibble>
## 3 1 3 ( Year_Built <= 1952 ) & ( Central_Air… <tibble> <tibble>
## 4 1 4 ( Total_Bsmt_SF <= 765 ) & ( Year_Buil… <tibble> <tibble>
## 5 1 5 ( Total_Bsmt_SF <= 765 ) & ( Neighborh… <tibble> <tibble>
## 6 1 6 ( Neighborhood %in% c( 'North_Ames','… <tibble> <tibble>
## 7 1 7 ( Bldg_Type %in% c( 'Duplex','Twnhs' … <tibble> <tibble>
## 8 1 8 ( Bsmt_Full_Bath > 0 ) & ( Gr_Liv_Area… <tibble> <tibble>
## 9 1 9 ( Bsmt_Full_Bath <= 0 ) & ( Gr_Liv_Are… <tibble> <tibble>
## 10 1 10 ( Gr_Liv_Area > 1692 ) & ( Neighborhoo… <tibble> <tibble>
## 11 1 11 ( Neighborhood %in% c( 'Somerset','No… <tibble> <tibble>
## 12 1 12 ( Neighborhood %in% c( 'Somerset','No… <tibble> <tibble>
## 13 1 13 ( Year_Built > 2004 ) & ( Gr_Liv_Area … <tibble> <tibble>
## 14 1 14 ( Total_Bsmt_SF > 1907 ) & ( Year_Buil… <tibble> <tibble>
## # A tibble: 4 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) -352.
## 2 Gr_Liv_Area 0.000362
## 3 Year_Sold -0.044
## 4 Latitude 10.6
## # A tibble: 1 × 6
## num_conditions coverage mean min max error
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 48 4.86 4.11 5.18 0.115
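The three printouts above appear to come from calling tidy() on the fitted model and then inspecting the list columns of the first row. A minimal sketch, assuming `model_tree` is the fit created at the start of this document; the `mtcars` fallback is a hypothetical stand-in so the code also runs outside this session:

```r
library(Cubist)
library(rules)  # provides the tidy() method for cubist objects

# Reuse the document's `model_tree` if it exists; otherwise fit a small
# stand-in model on mtcars (hypothetical, only so the sketch is runnable).
if (!exists("model_tree")) {
  model_tree <- cubist(x = mtcars[, -1], y = mtcars$mpg)
}

rule_df <- tidy(model_tree)
rule_df                   # one row per rule

# The list columns hold one tibble per rule
rule_df$estimate[[1]]     # coefficients of the first rule's linear model
rule_df$statistic[[1]]    # coverage and error summaries for the first rule
```

Keeping the coefficients and statistics as nested tibbles means each rule stays on a single row while the per-rule details remain accessible with `[[`.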
The rule column can be converted to an R expression that can be used to
select the rows of data covered by that rule. For example, for the
seventh rule:
# Text
rule_7 <- rule_df$rule[7]
# Convert to an expression
rule_7 <- rlang::parse_expr(rule_7)
rule_7
## (Bldg_Type %in% c("Duplex", "Twnhs")) & (Gr_Liv_Area > 1692)
## [1] 2344
## [1] 54
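The two counts printed above look like the number of training rows followed by the number of rows that satisfy the rule. The sketch below shows the same idea on a small made-up data frame: `toy_pred` and its values are hypothetical, and base R's `str2lang()` is used in place of `rlang::parse_expr()` so that no extra packages are needed.

```r
# Toy stand-in for the training predictors (hypothetical values)
toy_pred <- data.frame(
  Bldg_Type   = c("Duplex", "OneFam", "Twnhs", "Twnhs", "Duplex"),
  Gr_Liv_Area = c(1800, 2000, 1500, 1750, 1692),
  stringsAsFactors = FALSE
)

# Rule text written in the same form as the `rule` column
rule_text <- "( Bldg_Type %in% c( 'Duplex','Twnhs' ) ) & ( Gr_Liv_Area > 1692 )"

# Parse the text into a single unevaluated expression
rule_expr <- str2lang(rule_text)

# Evaluate the expression with the data frame's columns in scope;
# the result is one logical value per row
in_rule <- eval(rule_expr, envir = toy_pred)

nrow(toy_pred)                           # rows available
nrow(toy_pred[in_rule, , drop = FALSE])  # rows covered by the rule
```

With the document's objects, the same pattern would be `eval(rule_7, envir = train_pred)`, or `dplyr::filter(train_pred, !!rule_7)` for readers who prefer dplyr.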
The summary() method for Cubist shows the usage of each variable in
either the rule conditions or the (terminal) linear model. In actuality,
many more linear models are used in prediction than are shown in the
output. Because of this, the variable usage statistics shown at the end
of the output of the summary() function will probably be inconsistent
with the rules also shown in the output.
At each split of the tree, Cubist saves a linear model (after feature
selection) that is allowed to have terms for each variable used in the
current split or any split above it. Quinlan (1992) discusses a
smoothing algorithm where each model prediction is a linear combination
of the parent and child models along the tree. As such, the final
prediction is a function of all the linear models from the initial node
to the terminal node. The percentages shown in the Cubist output
reflect all the models involved in prediction (as opposed to only the
terminal models shown in the output).
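One way to picture the recursive smoothing is the M5 rule described in Quinlan (1992), where the value passed up from a child node is blended with the parent model's prediction using the child's training-set size and a constant k. The sketch below applies that rule along a single path. The function name, the example numbers, and k = 15 are illustrative assumptions, and the weighting that Cubist uses internally may differ.

```r
# M5-style smoothing along one root-to-leaf path (illustrative only).
# preds:   raw linear-model predictions, ordered root -> leaf
# n_below: number of training cases in the child node on the path,
#          one entry per non-leaf node (so one shorter than `preds`)
# k:       smoothing constant (assumed value)
smooth_path <- function(preds, n_below, k = 15) {
  stopifnot(length(preds) == length(n_below) + 1)
  preds <- unname(preds)
  # Start from the terminal node's own model
  smoothed <- preds[length(preds)]
  # Walk back up the tree, blending with each ancestor's model
  for (i in rev(seq_along(n_below))) {
    smoothed <- (n_below[i] * smoothed + k * preds[i]) / (n_below[i] + k)
  }
  smoothed
}

# Hypothetical path: root, one intermediate node, and a leaf
path_preds <- c(root = 5.0, mid = 5.2, leaf = 5.4)
path_n     <- c(100, 20)  # cases in `mid` and in `leaf`
smooth_path(path_preds, path_n)
```

The smoothed value lies between the root and leaf predictions, which illustrates why variables that appear only in models higher up the tree can still influence the final prediction.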
The raw usage statistics are contained in a data frame called usage in
the cubist object.
The caret and vip packages have general variable importance functions
caret::varImp() and vip::vi(). When these functions are used on a cubist
object, the variable importance is a linear combination of the usage in
the rule conditions and the model.
For example, to compute the scores:
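As a rough illustration of that linear combination, the sketch below rebuilds a few rows of the attribute usage table printed earlier and weights the two columns equally. The data frame is typed in by hand rather than taken from the model, and the equal weights are an assumption, not a statement of what either package does by default.

```r
# A few rows copied from the "Attribute usage" table above; a blank
# entry in the printed table is treated as 0
usage <- data.frame(
  Variable   = c("Year_Built", "Gr_Liv_Area", "Neighborhood", "Lot_Area"),
  Conditions = c(80, 76, 74, 0),
  Model      = c(97, 100, 0, 90)
)

# Assumed weights for the two kinds of usage
weights <- c(conditions = 0.5, model = 0.5)

# Importance as a weighted sum of condition usage and model usage
usage$Overall <- weights[["conditions"]] * usage$Conditions +
  weights[["model"]] * usage$Model

# Rank the variables from most to least important
usage[order(usage$Overall, decreasing = TRUE), c("Variable", "Overall")]
```

With the packages installed, the calls named in the text, `caret::varImp(model_tree)` or `vip::vi(model_tree)`, would compute the scores from the fitted model directly.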
As previously mentioned, this code is a port of the command-line C code.
To run the C code, the training set data must be converted to a specific
file format as detailed on the RuleQuest website. Two files are created.
The file.data file is a header-less, comma-delimited version of the data
(the file part is a name given by the user). The file.names file
provides information about the columns (e.g., levels for categorical
data and so on). After running the C program, another text file called
file.model is created, which contains the information needed for
prediction.
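To make the two input files more concrete, the sketch below writes a tiny made-up data set in roughly that layout using base R only. The data frame, the file stem, and the exact wording of the `.names` lines are assumptions based on a general reading of the format; the RuleQuest documentation remains the authority on the details.

```r
# Hypothetical three-row data set with the outcome in the first column
toy <- data.frame(
  outcome     = c(5.1, 5.3, 4.9),
  Gr_Liv_Area = c(1500, 2100, 900),
  Central_Air = factor(c("Y", "Y", "N"))
)
stem <- file.path(tempdir(), "toy")

# toy.data: header-less, comma-delimited rows
write.table(toy, paste0(stem, ".data"), sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# toy.names: the target name first, then one line per column giving
# either "continuous" or the comma-separated factor levels (assumed layout)
describe <- function(x) {
  if (is.factor(x)) paste(levels(x), collapse = ",") else "continuous"
}
names_lines <- c(
  "outcome.",
  "",
  paste0(names(toy), ": ", vapply(toy, describe, character(1)), ".")
)
writeLines(names_lines, paste0(stem, ".names"))

# Inspect what was written
readLines(paste0(stem, ".data"))
readLines(paste0(stem, ".names"))
```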
Once a model has been built with the R Cubist package, the
exportCubistFiles() function can be used to create the .data, .names and
.model files so that the same model can be run at the command-line.
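A short sketch of such an export follows. It fits a small model on the built-in `mtcars` data so that it runs on its own; the file prefix `mtcars_demo` and the use of a temporary directory are arbitrary choices, and the `path` and `prefix` argument names should be checked against the help page for your installed version.

```r
library(Cubist)

# Small stand-in model (hypothetical; any fitted cubist object would do)
fit <- cubist(x = mtcars[, -1], y = mtcars$mpg)

# Write the files into a temporary directory under a chosen prefix
out_dir <- tempdir()
exportCubistFiles(fit, path = out_dir, prefix = "mtcars_demo")

# List the files that were created for this prefix
list.files(out_dir, pattern = "^mtcars_demo\\.")
```

Writing to `tempdir()` keeps the example from leaving files in the working directory; for real use, point `path` at the folder where the command-line program will be run.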
There are a few features in the C code that are not yet operational in
the R package:
The C code can decide whether or not to use instances. The choice is more explicit in this package.
The C code supports binning of predictors.