mlens.model_selection package
Module contents
author: Sebastian Flennerhag
copyright: 2017
licence: MIT
class mlens.model_selection.Evaluator(scorer, cv=2, shuffle=True, random_state=None, backend=None, error_score=None, metrics=None, array_check=2, n_jobs=-1, verbose=False)
Bases: object
Model selection across several estimators and preprocessing pipelines.
The Evaluator allows users to evaluate several models in one call across a set of preprocessing pipelines. The class is useful for comparing a set of estimators, especially when several preprocessing pipelines are to be evaluated. By pre-making all folds and iteratively fitting estimators with different parameter settings, array slicing and preprocessing are kept to a minimum. This can greatly reduce fit time compared to creating pipeline classes for each estimator and pipeline and fitting them one at a time in a Scikit-learn sklearn.model_selection.GridSearchCV class.
Preprocessing can be done before making any evaluation, and several evaluations can be made on the pre-made folds. The current implementation relies on a randomized grid search, so parameter grids must be specified as SciPy distributions (or any object that provides an rvs method).
Parameters:
- scorer (function) –
a scoring function that follows the Scikit-learn API:
score = scorer(estimator, y_true, y_pred)
A user-defined scoring function,
score = f(y_true, y_pred)
can be made into a scorer by calling the ML-Ensemble implementation of Scikit-learn’s make_scorer. NOTE: do not use Scikit-learn’s make_scorer if the Evaluator is to be pickled.
from mlens.metrics import make_scorer
scorer = make_scorer(scoring_function, **kwargs)
- error_score (int, optional) – score to assign when fitting an estimator fails. If None, the evaluator will raise an error.
- cv (int or obj (default = 2)) – cross-validation folds to use. Either pass the number of folds as an integer, or a KFold class that obeys the Scikit-learn API.
- metrics (list, optional) – list of aggregation metrics to calculate on scores. Default is mean and standard deviation.
- shuffle (bool (default = True)) – whether to shuffle input data before creating cv folds.
- random_state (int, optional) – seed for creating folds (if shuffled) and for parameter draws.
- array_check (int (default = 2)) –
level of strictness in checking input arrays.
array_check = 0 will not check X or y.
array_check = 1 will check X and y for inconsistencies and warn when the format looks suspicious, but retain the original format.
array_check = 2 will impose Scikit-learn array checks, which convert X and y to NumPy arrays and raise an error if conversion fails.
- n_jobs (int (default = -1)) – number of CPU cores to use.
- verbose (bool or int (default = False)) – level of printed messages.
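As an illustration of the score = f(y_true, y_pred) form described for the scorer parameter, the following is a minimal sketch of a user-defined scoring function in plain Python. The name rmse is hypothetical and not part of ML-Ensemble. The wrapping step is shown only as a comment, because the keyword arguments accepted by make_scorer are not listed here.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# With ML-Ensemble installed, the function would then be wrapped as
# described in the scorer entry:
#   from mlens.metrics import make_scorer
#   scorer = make_scorer(rmse)
```

Because a lower RMSE is better, any keyword controlling the score direction would have to be checked against the make_scorer documentation before use.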
summary
dict – Summary output that shows data for the best mean test scores, such as test and train scores, std, fit times, and params.
cv_results
dict – a nested dict of data from each fit. Includes the mean and std of test and train scores and fit times, as well as the param draw index and parameters.
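The default aggregation (mean and standard deviation per parameter draw) can be pictured with a small sketch. The data layout, the helper names aggregate and best_draw, and the use of the population standard deviation are all assumptions made for illustration. They do not reproduce the actual structure of cv_results or summary, and the sketch assumes that a higher score is better.

```python
from statistics import mean, pstdev

# Hypothetical per-fold test scores, keyed by parameter draw index.
fold_scores = {
    0: [0.81, 0.79],
    1: [0.90, 0.86],
    2: [0.70, 0.74],
}


def aggregate(scores_by_draw):
    """Return {draw: {'mean': ..., 'std': ...}} for each parameter draw."""
    return {
        draw: {"mean": mean(scores), "std": pstdev(scores)}
        for draw, scores in scores_by_draw.items()
    }


def best_draw(aggregated):
    """Return the draw index with the highest mean score."""
    return max(aggregated, key=lambda draw: aggregated[draw]["mean"])


aggregated = aggregate(fold_scores)
best = best_draw(aggregated)
```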
evaluate(X, y, estimators, param_dicts, n_iter=2)
Evaluate set of estimators.
Function for evaluating a set of estimators using cross validation. Similar to a randomized grid search, but applies the grid search to all specified preprocessing pipelines.
Parameters:
- X (array-like, shape=[n_samples, n_features]) – input data to preprocess and create folds from.
- y (array-like, shape=[n_samples, ]) – training labels.
- estimators (list or dict) –
set of estimators to use. If no preprocessing is desired, or if only one preprocessing pipeline should apply to all estimators, pass a list of estimators. The list can contain named tuples (i.e. ('my_name', my_est)).
If different estimators should be mapped to preprocessing cases, pass a dictionary that maps estimators to each case: {'case_a': list_of_est, ...}.
- param_dicts (dict) –
parameter distribution mapping for estimators. The current implementation only supports randomized grid search. Each distribution object passed must have an rvs method. See scipy.stats for details.
There is considerable flexibility in specifying param_dicts. If there is no preprocessing, or if all estimators are fitted on all preprocessing cases, param_dicts should have keys matching the names of the estimators.
estimators = [('name', est), est]
param_dicts = {'name': {'param-1': some_distribution},
               'est': {'param-1': some_distribution}}
It is possible to specify different distributions for some or all preprocessing cases:
preprocessing = {'case-1': transformer_list,
                 'case-2': transformer_list}
estimators = [('name', est), est]
param_dicts = {'name': {'param-1': some_distribution},
               ('case-1', 'est'): {'param-1': some_distribution},
               ('case-2', 'est'): {'param-1': some_distribution,
                                   'param-2': some_distribution}}
If estimators are mapped on a per-preprocessing-case basis as a dictionary, param_dicts must have key entries of the form (case_name, est_name).
- n_iter (int) – number of parameter draws to evaluate.
Returns: self – class instance with stored estimator evaluation results.
Return type: instance
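The requirement that each distribution object has an rvs method, together with n_iter parameter draws, can be sketched in plain Python. The class UniformInt and the helper draw_params are hypothetical and exist only for this illustration. The random_state keyword is an assumption that mirrors the scipy.stats convention, and how ML-Ensemble itself calls rvs is not shown here.

```python
import random


class UniformInt:
    """Hypothetical distribution: integers drawn uniformly from [low, high]."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def rvs(self, random_state=None):
        # Assumption: accept either a seed or a random.Random instance,
        # loosely mirroring scipy.stats' rvs(random_state=...).
        if isinstance(random_state, random.Random):
            rng = random_state
        else:
            rng = random.Random(random_state)
        return rng.randint(self.low, self.high)


def draw_params(param_dists, n_iter, seed=0):
    """Draw n_iter parameter settings from a {param: distribution} mapping."""
    rng = random.Random(seed)  # one generator shared by all draws
    return [
        {name: dist.rvs(random_state=rng) for name, dist in param_dists.items()}
        for _ in range(n_iter)
    ]


param_dicts = {"name": {"param-1": UniformInt(1, 10)}}
draws = draw_params(param_dicts["name"], n_iter=2)
```

Each element of draws is one candidate parameter setting. In a randomized search, every setting would then be fitted and scored on each fold.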
fit(X, y, estimators, param_dicts, n_iter=2, preprocessing=None)
Fit the Evaluator to given data, estimators and preprocessing.
Utility function that calls preprocess and evaluate. The following are equivalent:
# Explicitly calling preprocess and evaluate
evaluator.preprocess(X, y, preprocessing)
evaluator.evaluate(X, y, estimators, param_dicts, n_iter)
# Calling fit
evaluator.fit(X, y, estimators, param_dicts, n_iter, preprocessing)
Parameters:
- X (array-like, shape=[n_samples, n_features]) – input data to preprocess and create folds from.
- y (array-like, shape=[n_samples, ]) – training labels.
- estimators (list or dict) –
set of estimators to use. If no preprocessing is desired, or if only one preprocessing pipeline should apply to all estimators, pass a list of estimators. The list can contain named tuples (i.e. ('my_name', my_est)).
If different estimators should be mapped to preprocessing cases, pass a dictionary that maps estimators to each case: {'case_a': list_of_est, ...}.
- param_dicts (dict) –
parameter distribution mapping for estimators. The current implementation only supports randomized grid search. Each distribution object passed must have an rvs method. See scipy.stats for details.
There is considerable flexibility in specifying param_dicts. If there is no preprocessing, or if all estimators are fitted on all preprocessing cases, param_dicts should have keys matching the names of the estimators.
estimators = [('name', est), est]
param_dicts = {'name': {'param-1': some_distribution},
               'est': {'param-1': some_distribution}}
It is possible to specify different distributions for some or all preprocessing cases:
preprocessing = {'case-1': transformer_list,
                 'case-2': transformer_list}
estimators = [('name', est), est]
param_dicts = {'name': {'param-1': some_distribution},
               ('case-1', 'est'): {'param-1': some_distribution},
               ('case-2', 'est'): {'param-1': some_distribution,
                                   'param-2': some_distribution}}
If estimators are mapped on a per-preprocessing-case basis as a dictionary, param_dicts must have key entries of the form (case_name, est_name).
- n_iter (int) – number of parameter draws to evaluate.
- preprocessing (dict, optional) –
preprocessing cases to consider. Pass a dictionary mapping a case name to a preprocessing pipeline.
preprocessing = {'case_name': transformer_list,}
Returns: self – class instance with stored estimator evaluation results.
Return type: instance
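One possible reading of the param_dicts key rules is that a case-specific key (case_name, est_name) takes precedence over a plain estimator name. The sketch below encodes that reading. The function resolve_param_dists is hypothetical, strings stand in for distribution objects, and the lookup order inside ML-Ensemble may differ.

```python
def resolve_param_dists(param_dicts, case_name, est_name):
    """Return the distributions for an estimator in a preprocessing case.

    Assumed lookup order: the (case_name, est_name) tuple key first,
    then the plain estimator name; None if neither key is present.
    """
    if (case_name, est_name) in param_dicts:
        return param_dicts[(case_name, est_name)]
    return param_dicts.get(est_name)


# Placeholder strings stand in for objects with an rvs method.
param_dicts = {
    "name": {"param-1": "dist-a"},
    ("case-1", "est"): {"param-1": "dist-b"},
    ("case-2", "est"): {"param-1": "dist-c", "param-2": "dist-d"},
}

shared = resolve_param_dists(param_dicts, "case-1", "name")
specific = resolve_param_dists(param_dicts, "case-2", "est")
missing = resolve_param_dists(param_dicts, "case-3", "est")
```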
preprocess(X, y, preprocessing=None)
Preprocess folds.
Method for preprocessing data separately from the evaluation method. Helpful if preprocessing is costly relative to estimator fitting and several evaluate calls might be desired.
Parameters:
- X (array-like, shape=[n_samples, n_features]) – input data to preprocess and create folds from.
- y (array-like, shape=[n_samples, ]) – training labels.
- preprocessing (list or dict, optional) –
preprocessing cases to consider. Pass a dictionary mapping a case name to a preprocessing pipeline.
preprocessing = {'case_name': transformer_list,}
Returns: self – class instance with stored estimator evaluation results.
Return type: instance
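The idea of preprocessing folds once and reusing them across several evaluate calls can be sketched without any library. Everything here is a simplified stand-in: kfold_indices, premake_folds, center and identity are hypothetical names, the folds are contiguous and unshuffled, and each preprocessing case is a function that is fitted on the training part of a fold and returns a transform.

```python
def kfold_indices(n_samples, n_splits=2):
    """Split range(n_samples) into contiguous (train, test) index lists."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        stop = start + base + (1 if i < extra else 0)
        test = list(range(start, stop))
        train = [j for j in range(n_samples) if j < start or j >= stop]
        folds.append((train, test))
        start = stop
    return folds


def center(train_values):
    """Fit on the training values; return a transform that subtracts their mean."""
    mu = sum(train_values) / len(train_values)
    return lambda values: [v - mu for v in values]


def identity(train_values):
    """No preprocessing: return the values unchanged."""
    return lambda values: list(values)


def premake_folds(X, preprocessing, n_splits=2):
    """Preprocess every fold once per case and cache the results."""
    cache = {}
    for case, fit_transform in preprocessing.items():
        cache[case] = []
        for train_idx, test_idx in kfold_indices(len(X), n_splits):
            x_train = [X[i] for i in train_idx]
            x_test = [X[i] for i in test_idx]
            transform = fit_transform(x_train)  # fitted on training data only
            cache[case].append((transform(x_train), transform(x_test)))
    return cache


folds = premake_folds([1.0, 2.0, 3.0, 4.0], {"center": center, "none": identity})
```

Any number of estimators or parameter draws can then be scored against the cached folds without repeating the slicing or the preprocessing.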
- scorer (function) –