mlens.model_selection package

Module contents

author: Sebastian Flennerhag
copyright: 2017
licence: MIT
class mlens.model_selection.Evaluator(scorer, cv=2, shuffle=True, random_state=None, backend=None, error_score=None, metrics=None, array_check=2, n_jobs=-1, verbose=False)[source]

Bases: object

Model selection across several estimators and preprocessing pipelines.

The Evaluator allows users to evaluate several models in one call across a set of preprocessing pipelines. The class is useful for comparing a set of estimators, especially when several preprocessing pipelines are to be evaluated. By pre-making all folds and iteratively fitting estimators with different parameter settings, array slicing and preprocessing are kept to a minimum. This can greatly reduce fit time compared to creating a pipeline for each estimator and preprocessing case and fitting them one at a time in a Scikit-learn sklearn.model_selection.GridSearchCV class.

Preprocessing can be done before making any evaluation, and several evaluations can be made on the pre-made folds. The current implementation relies on a randomized grid search, so parameter grids must be specified as SciPy distributions (or a class that implements an rvs method).
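For orientation, the sketch below shows what qualifies as a distribution: SciPy's frozen distributions work directly, while any duck-typed object exposing an rvs method can stand in for one. The Choice helper is hypothetical and not part of ML-Ensemble.

from scipy.stats import randint, uniform
from sklearn.utils import check_random_state

# SciPy frozen distributions expose ``rvs`` and can be used as-is:
some_distribution = randint(2, 10)         # integer draws in [2, 10)
another_distribution = uniform(0.01, 0.3)  # float draws in [0.01, 0.31)


class Choice(object):
    """Hypothetical duck-typed distribution: draw from a fixed set."""

    def __init__(self, options):
        self.options = options

    def rvs(self, random_state=None):
        rng = check_random_state(random_state)
        return self.options[rng.randint(len(self.options))]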

Parameters:
  • scorer (function) –

    a scoring function that follows the Scikit-learn API:

    score = scorer(estimator, X, y)
    

    A user-defined scoring function, score = f(y_true, y_pred), can be made into a scorer by calling the ML-Ensemble implementation of Scikit-learn's make_scorer, as shown below. NOTE: do not use Scikit-learn's make_scorer if the Evaluator is to be pickled. A fuller construction sketch follows this parameter list.

    from mlens.metrics import make_scorer
    scorer = make_scorer(scoring_function, **kwargs)
    
  • error_score (int, optional) – score to assign when fitting an estimator fails. If None, the evaluator will raise an error.
  • cv (int or obj (default = 2)) – cross validation folds to use. Either pass an integer number of folds, or a KFold class that obeys the Scikit-learn API.
  • metrics (list, optional) – list of aggregation metrics to calculate on scores. Default is mean and standard deviation.
  • shuffle (bool (default = True)) – whether to shuffle input data before creating cv folds.
  • random_state (int, optional) – seed for creating folds (if shuffled) and parameter draws
  • array_check (int (default = 2)) –

    level of strictness in checking input arrays.

    • array_check = 0 will not check X or y
    • array_check = 1 will check X and y for inconsistencies and warn when format looks suspicious, but retain original format.
    • array_check = 2 will impose Scikit-learn array checks, which converts X and y to numpy arrays and raises an error if conversion fails.
  • n_jobs (int (default = -1)) – number of CPU cores to use.
  • verbose (bool or int (default = False)) – level of printed messages.
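As noted under the scorer parameter, a minimal construction sketch might look as follows; the scoring function and settings are illustrative assumptions rather than prescriptions.

from sklearn.metrics import f1_score

from mlens.metrics import make_scorer
from mlens.model_selection import Evaluator

# Wrap a plain score = f(y_true, y_pred) function; using the
# ML-Ensemble make_scorer keeps the Evaluator picklable.
scorer = make_scorer(f1_score, average='micro', greater_is_better=True)

evaluator = Evaluator(scorer, cv=4, shuffle=True,
                      random_state=42, n_jobs=-1, verbose=True)
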
summary

dict – Summary output that shows data for best mean test scores, such as test and train scores, std, fit times, and params.

cv_results

dict – a nested dict of data from each fit. Includes mean and std of test and train scores and fit times, as well as param draw index and parameters.
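After a call to fit or evaluate, both attributes can be read off the instance; a minimal sketch, assuming the layout described above:

# Inspect aggregated results after an evaluation has completed:
print(evaluator.summary)     # data for best mean test scores
print(evaluator.cv_results)  # per-draw scores, fit times and parameters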

evaluate(X, y, estimators, param_dicts, n_iter=2)[source]

Evaluate set of estimators.

Function for evaluating a set of estimators using cross validation. Similar to a randomized grid search, but applies the grid search to all specified preprocessing pipelines.

Parameters:
  • X (array-like, shape=[n_samples, n_features]) – input data to preprocess and create folds from.
  • y (array-like, shape=[n_samples, ]) – training labels.
  • estimators (list or dict) –

    set of estimators to use. If no preprocessing is desired, or if a single preprocessing pipeline should apply to all estimators, pass a list of estimators. The list can contain named tuples of the form ('my_name', my_est).

    If different estimators should be mapped to preprocessing cases, a dictionary that maps estimators to each case should be passed: {'case_a': list_of_est, ...}.

  • param_dicts (dict) –

    parameter distribution mapping for estimators. The current implementation only supports randomized grid search, so passed distribution objects must have an rvs method. See scipy.stats for details.

    param_dicts can be specified with considerable flexibility. If there is no preprocessing, or if all estimators are fitted on all preprocessing cases, param_dicts should have keys matching the names of the estimators, as in the usage sketch after this method description.

    estimators = [('name', est), est]
    
    param_dicts = {'name': {'param-1': some_distribution},
                   'est': {'param-1': some_distribution}
                  }
    

    It is possible to specify different distributions for some or all preprocessing cases:

    preprocessing = {'case-1': transformer_list,
                     'case-2': transformer_list}
    
    estimators = [('name', est), est]
    
    param_dicts = {'name':
                       {'param-1': some_distribution},
                   ('case-1', 'est'):
                        {'param-1': some_distribution},
                   ('case-2', 'est'):
                       {'param-1': some_distribution,
                        'param-2': some_distribution}
                  }
    

    If estimators are mapped on a per-preprocessing case basis as a dictionary, param_dicts must have key entries of the form (case_name, est_name).

  • n_iter (int) – number of parameter draws to evaluate.
Returns:

self – class instance with stored estimator evaluation results.

Return type:

instance
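A hedged end-to-end sketch of a call to evaluate; the data set, estimators and distributions are illustrative assumptions.

from scipy.stats import randint, uniform
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

estimators = [('rf', RandomForestClassifier(random_state=0)),
              ('lr', LogisticRegression())]

# No preprocessing cases here, so keys match the estimator names.
param_dicts = {'rf': {'max_depth': randint(2, 8)},
               'lr': {'C': uniform(0.01, 10)}}

evaluator.evaluate(X, y, estimators, param_dicts, n_iter=10)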

fit(X, y, estimators, param_dicts, n_iter=2, preprocessing=None)[source]

Fit the Evaluator to given data, estimators and preprocessing.

Utility function that calls preprocess and evaluate. The following calls are equivalent:

# Explicitly calling preprocess and evaluate
evaluator.preprocess(X, y, preprocessing)
evaluator.evaluate(X, y, estimators, param_dicts, n_iter)

# Calling fit
evaluator.fit(X, y, estimators, param_dicts, n_iter, preprocessing)
Parameters:
  • X (array-like, shape=[n_samples, n_features]) – input data to preprocess and create folds from.
  • y (array-like, shape=[n_samples, ]) – training labels.
  • estimators (list or dict) –

    set of estimators to use. If no preprocessing is desired, or if a single preprocessing pipeline should apply to all estimators, pass a list of estimators. The list can contain named tuples of the form ('my_name', my_est).

    If different estimators should be mapped to preprocessing cases, a dictionary that maps estimators to each case should be passed: {'case_a': list_of_est, ...}.

  • param_dicts (dict) –

    parameter distribution mapping for estimators. The current implementation only supports randomized grid search, so passed distribution objects must have an rvs method. See scipy.stats for details.

    param_dicts can be specified with considerable flexibility. If there is no preprocessing, or if all estimators are fitted on all preprocessing cases, param_dicts should have keys matching the names of the estimators.

    estimators = [('name', est), est]
    
    param_dicts = {'name': {'param-1': some_distribution},
                   'est': {'param-1': some_distribution}
                  }
    

    It is possible to specify different distributions for some or all preprocessing cases:

    preprocessing = {'case-1': transformer_list,
                     'case-2': transformer_list}
    
    estimators = [('name', est), est]
    
    param_dicts = {'name':
                       {'param-1': some_distribution},
                   ('case-1', 'est'):
                        {'param-1': some_distribution},
                   ('case-2', 'est'):
                       {'param-1': some_distribution,
                        'param-2': some_distribution}
                  }
    

    If estimators are mapped on a per-preprocessing case basis as a dictionary, param_dicts must have key entries of the form (case_name, est_name).

  • n_iter (int) – number of parameter draws to evaluate.
  • preprocessing (dict, optional) –

    preprocessing cases to consider. Pass a dictionary mapping a case name to a preprocessing pipeline. An illustrative sketch follows this method description.

    preprocessing = {'case_name': transformer_list,}
    
Returns:

self – class instance with stored estimator evaluation results.

Return type:

instance
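To illustrate the preprocessing mapping, a sketch with two hypothetical cases; the case names and transformers are assumptions for the example.

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Each case maps a name to a transformer list.
preprocessing = {'std': [StandardScaler()],
                 'minmax': [MinMaxScaler()]}

# With estimators passed as a list, every estimator is fitted on
# every case, so param_dicts can stay keyed by estimator name.
evaluator.fit(X, y, estimators, param_dicts, n_iter=10,
              preprocessing=preprocessing)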

initialize(X, y)[source]

Set up ParallelEvaluation job manager.

preprocess(X, y, preprocessing=None)[source]

Preprocess folds.

Method for preprocessing data separately from the evaluation method. Helpful if preprocessing is costly relative to estimator fitting and several calls to evaluate are desired; see the sketch at the end of this method description.

Parameters:
  • X (array-like, shape=[n_samples, n_features]) – input data to preprocess and create folds from.
  • y (array-like, shape=[n_samples, ]) – training labels.
  • preprocessing (list or dict, optional) –

    preprocessing cases to consider. Pass a dictionary mapping a case name to a preprocessing pipeline.

    preprocessing = {'case_name': transformer_list,}
    
Returns:

self – class instance with preprocessed folds stored.

Return type:

instance
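As a sketch of the intended workflow, folds can be preprocessed once and reused across several evaluations; the estimator batches and their param_dicts below are hypothetical placeholders.

# Pre-make and preprocess the folds once ...
evaluator.preprocess(X, y, preprocessing)

# ... then run several evaluations against the same folds.
evaluator.evaluate(X, y, tree_estimators, tree_param_dicts, n_iter=10)
evaluator.evaluate(X, y, linear_estimators, linear_param_dicts, n_iter=10)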

terminate()[source]

Terminate evaluation job.