The C1C2: A framework for simultaneous model selection and assessment


There has been recent concern regarding the inability of predictive modeling approaches to generalize to new data. Some of the problems can be attributed to improper methods for model selection and assessment.

Here, we address this issue by introducing a novel and general framework, the C1C2, for simultaneous model selection and assessment. The framework partitions the data so that the observations used for model choice are kept separate from those used for model assessment.
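The idea of keeping selection data and assessment data apart can be sketched as follows. This is a generic illustration only, not the authors' C1C2 procedure, whose details are not given in this summary. The function names, the single-variable penalized fit, and the 30% assessment fraction are all assumptions made for the example.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Penalized least-squares slope for y ~ b*x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(xs, ys, b):
    """Mean squared prediction error of the slope b."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / len(ys)

def select_then_assess(X, y, lambdas, assess_frac=0.3, seed=0):
    """Choose (variable, penalty) on one part of the data, assess on the rest."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    n_assess = max(1, int(len(idx) * assess_frac))
    assess, select = idx[:n_assess], idx[n_assess:]
    # The selection part is split again into training and validation rows.
    half = len(select) // 2
    train, valid = select[:half], select[half:]

    def col(j, rows):
        return [X[i][j] for i in rows]

    def tgt(rows):
        return [y[i] for i in rows]

    best = None
    for j in range(len(X[0])):          # candidate variable
        for lam in lambdas:             # candidate penalty
            b = fit_ridge_1d(col(j, train), tgt(train), lam)
            err = mse(col(j, valid), tgt(valid), b)
            if best is None or err < best[0]:
                best = (err, j, lam)

    _, j, lam = best
    # Refit on all selection rows, then score on rows never used for the choice.
    b = fit_ridge_1d(col(j, select), tgt(select), lam)
    return {
        "variable": j,
        "penalty": lam,
        "gen_error": mse(col(j, assess), tgt(assess), b),
        "select_rows": select,
        "assess_rows": assess,
    }
```

Because the assessment rows play no part in choosing the variable or the penalty, the reported error is not biased by the search itself.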

Since the number of conceivable models is in general vast, we also investigated two automatic search methods for model choice: a genetic algorithm and a brute-force method. As a demonstration, the C1C2 was applied to simulated and real-world datasets.

A penalized linear model was assumed to reasonably approximate the true relation between the dependent and independent variables, thus reducing the model choice problem to a matter of variable selection and choice of penalizing parameter. We also studied the impact of assuming prior knowledge about the number of relevant variables on model choice and generalization error estimates.
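To make the size of this search space concrete, the sketch below enumerates every pairing of a non-empty variable subset with a penalty value, which is what a brute-force search would have to visit. The function name `candidate_models`, the penalty grid, and the optional `max_size` cap (standing in for prior knowledge about the number of relevant variables) are assumptions for illustration and are not taken from the paper.

```python
from itertools import combinations

def candidate_models(n_vars, lambdas, max_size=None):
    """Yield (subset, lam) for every non-empty variable subset and penalty.

    max_size optionally caps the subset size, mimicking prior knowledge
    about how many variables are relevant.
    """
    top = n_vars if max_size is None else min(max_size, n_vars)
    for k in range(1, top + 1):
        for subset in combinations(range(n_vars), k):
            for lam in lambdas:
                yield subset, lam
```

With p variables there are 2^p - 1 non-empty subsets, so 20 variables already give 1,048,575 subsets per penalty value. That growth is the usual motivation for a heuristic search such as a genetic algorithm.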

The results obtained with the C1C2 were compared to those obtained by employing repeated K-fold cross-validation for choosing and assessing a model.
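For reference, repeated K-fold cross-validation reshuffles the observations several times and, within each repetition, holds out each of K folds in turn. The generator below is a minimal sketch of that splitting scheme; the name `repeated_kfold` and its parameters are assumptions, and the paper's exact settings are not stated in this summary.

```python
import random

def repeated_kfold(n, k, repeats, seed=0):
    """Yield (repetition, train_rows, test_rows) for repeated K-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                      # new random partition each repetition
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield r, train, test
```

When the same cross-validated errors are used both to pick the model and to report its error, every observation influences the choice, so no data remain that are fully independent of the selection step.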

Results: The C1C2 framework performed well at finding the true model in terms of choosing the correct variable subset and producing reasonable choices for the penalizing parameter, even in situations when the independent variables were highly correlated and when the number of observations was less than the number of variables. The C1C2 framework was also found to give accurate estimates of the generalization error.

Prior information about the number of important independent variables improved the variable subset choice but reduced the accuracy of generalization error estimates. Using the genetic algorithm worsened the model choice significantly, but not the generalization error estimates.

The results obtained with repeated K-fold cross-validation were similar to those produced by the C1C2 in terms of model choice; however, the accuracy of the generalization error estimates was significantly lower.

Conclusions: The C1C2 framework was demonstrated to work well for finding the true model within a penalized linear model class and accurately assessing its generalization error, even for datasets with many highly correlated independent variables, a low observation-to-variable ratio, and deviations from the model assumptions.

Completely separating the data used for model choice from the data used for model assessment improves the estimates of the generalization error.

Authors: Martin Eklund, Ola Spjuth and Jarl ES Wikberg
Credits/Source: BMC Bioinformatics 2008, 9:360



Published on: 2008-09-02
