mlens.preprocessing package

Module contents

author: Sebastian Flennerhag
copyright: 2017
licence: MIT
class mlens.preprocessing.Subset(subset=None)[source]

Bases: mlens.externals.sklearn.base.BaseEstimator, mlens.externals.sklearn.base.TransformerMixin

Select a subset of features.

The Subset class acts as a transformer that reduces the feature set to a subset specified by the user.

Parameters:subset (list) – list of column indexes to select the subset with. Indexes can be of type str if the data accepts slicing on a list of strings; otherwise the list should be of type int.
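
Examples

A minimal usage sketch (the expected output assumes Subset slices the columns of a numpy array, i.e. X[:, subset]):

>>> import numpy as np
>>> from mlens.preprocessing import Subset
>>> X = np.arange(12).reshape(4, 3)
>>> sub = Subset([0, 2])
>>> print(sub.fit_transform(X))
[[ 0  2]
 [ 3  5]
 [ 6  8]
 [ 9 11]]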
fit(X, y=None)[source]

Learn what format the data is stored in.

Parameters:
  • X (array-like of shape = [n_samples, n_features]) – The array whose type will be inferred.
  • y (array-like of shape = [n_samples, n_features]) – pass-through for Scikit-learn pipeline compatibility.
transform(X, y=None, copy=False)[source]

Return specified subset of X.

Parameters:
  • X (array-like of shape = [n_samples, n_features]) – The array whose type will be inferred.
  • y (array-like of shape = [n_samples, n_features]) – pass-through for Scikit-learn pipeline compatibility.
  • copy (bool (default = False)) – whether to copy X before transforming.
class mlens.preprocessing.Shift(s)[source]

Bases: mlens.externals.sklearn.base.BaseEstimator, mlens.externals.sklearn.base.TransformerMixin

Lag operator.

Shift an input array \(X\) with \(s\) steps, i.e. for some time series \(\mathbf{X} = (X_t, X_{t-1}, ..., X_{0})\),

\[L^{s} \mathbf{X} = (X_{t-s}, X_{t-1-s}, ..., X_{0})\]
Parameters:s (int) – number of lags to generate

Examples

>>> import numpy as np
>>> from mlens.preprocessing import Shift
>>> X = np.arange(10)
>>> L = Shift(2)
>>> Z = L.fit_transform(X)
>>> print("X : {}".format(X[2:]))
>>> print("Z : {}".format(Z))
X : [2 3 4 5 6 7 8 9]
Z : [0 1 2 3 4 5 6 7]
fit(X, y=None)[source]

Pass-through for compatibility.

transform(X)[source]

Return lagged dataset.

class mlens.preprocessing.EnsembleTransformer(shuffle=False, random_state=None, scorer=None, raise_on_exception=True, array_check=2, verbose=False, n_jobs=1, layers=None, backend=None, sample_dim=10)[source]

Bases: mlens.ensemble.base.BaseEnsemble

Ensemble Transformer class.

The Ensemble class allows users to build layers of an ensemble through a transformer API. The transformer is closely related to SequentialEnsemble, in that any accepted type of layer can be added. The transformer differs fundamentally in one significant aspect: when fitted, it stores a random sample of the training set together with the training dimensions. If, in a call to transform, the data to be transformed corresponds to the training set, the transformer recreates the prediction matrix from the fit call. In contrast, a fitted ensemble will only use the base learners fitted on the full dataset, so predicting the training set will not reproduce the predictions from the fit call.

The EnsembleTransformer is a powerful tool to use as a preprocessing pipeline in an Evaluator instance, as it would faithfully recreate the prediction matrix a potential meta learner would face. Hence, a user can ‘preprocess’ the training data with the EnsembleTransformer to generate k-fold base learner predictions, and then fit different meta learners (or higher-order layers) in a call to evaluate.

See also

SequentialEnsemble, Evaluator

Parameters:
  • shuffle (bool (default = False)) – whether to shuffle data before generating folds.
  • random_state (int (default = None)) – random seed if shuffling inputs.
  • scorer (object (default = None)) – scoring function. If a function is provided, base estimators will be scored on the training set assembled for fitting the meta estimator. Since those predictions are out-of-sample, the scores represent valid test scores. The scorer should be a function that accepts an array of true values and an array of predictions: score = f(y_true, y_pred).
  • raise_on_exception (bool (default = True)) – whether to issue warnings on soft exceptions or raise error. Examples include lack of layers, bad inputs, and failed fit of an estimator in a layer. If set to False, warnings are issued instead but estimation continues unless exception is fatal. Note that this can result in unexpected behavior unless the exception is anticipated.
  • sample_dim (int (default = 10)) – dimensionality of training set to sample. During a call to fit, a random sample of size [sample_dim, sample_dim] will be sampled from the training data, along with the dimensions of the training data. If in a call to transform, sampling the same indices on the array to transform gives the same sample matrix, the transformer will reproduce the predictions from the call to fit, as opposed to using the base learners fitted on the full training data.
  • array_check (int (default = 2)) –

    level of strictness in checking input arrays.

    • array_check = 0 will not check X or y
    • array_check = 1 will check X and y for inconsistencies and warn when format looks suspicious, but retain original format.
    • array_check = 2 will impose Scikit-learn array checks, which converts X and y to numpy arrays and raises an error if conversion fails.
  • verbose (int or bool (default = False)) –

    level of verbosity.

    • verbose = 0 silent (same as verbose = False)
    • verbose = 1 messages at start and finish (same as verbose = True)
    • verbose = 2 messages for each layer

    If verbose >= 50 prints to sys.stdout, else sys.stderr. For verbosity in the layers themselves, use fit_params.

  • n_jobs (int (default = 1)) – number of CPU cores to use for fitting and prediction.
scores_

dict – if a scorer was passed to the instance, scores_ contains a dictionary of cross-validated scores assembled during the fit call. The fold structure used for scoring is determined by folds.

Examples

>>> from mlens.preprocessing import EnsembleTransformer
>>> from mlens.model_selection import Evaluator
>>> from mlens.metrics.metrics import rmse
>>> from sklearn.datasets import load_boston
>>> from sklearn.linear_model import Lasso
>>> from sklearn.svm import SVR
>>> from scipy.stats import uniform
>>> from pandas import DataFrame
>>>
>>> X, y = load_boston(True)
>>>
>>> ensemble = EnsembleTransformer()
>>>
>>> ensemble.add('stack', [SVR(), Lasso()])
>>>
>>> evl = Evaluator(scorer=rmse, random_state=10)
>>>
>>> evl.preprocess(X, y, [('scale', ensemble)])
>>>
>>> draws = {(None, 'svr'): {'C': uniform(10,  100)},
...          (None, 'lasso'): {'alpha': uniform(0.01, 0.1)}}
>>>
>>> evl.evaluate(X, y, [SVR(), Lasso()], draws, n_iter=10)
>>>
>>> DataFrame(evl.summary)
       fit_time_mean  fit_time_std  test_score_mean  test_score_std  \
lasso       0.000818      0.000362         7.514181        0.827578
svr         0.009790      0.000596        10.949149        0.577554
       train_score_mean  train_score_std                      params
lasso          6.228287         0.949872  {'alpha': 0.0871320643267}
svr            5.794856         1.348409        {'C': 12.0751949359}
add(cls, estimators, preprocessing=None, **kwargs)[source]

Add layer to ensemble transformer.

Parameters:
  • cls (str) –

    layer class. Accepted types are:

    • ‘blend’ : blend ensemble
    • ‘subset’ : subsemble
    • ‘stack’ : super learner
  • estimators (dict of lists or list or instance) –

    estimators constituting the layer. If preprocessing is None and the layer is meant to be the meta estimator, it is permissible to pass a single instantiated estimator. If preprocessing is None or a list, estimators should be a list. The list can either contain estimator instances, named tuples of estimator instances, or a combination of both.

    option_1 = [estimator_1, estimator_2]
    option_2 = [("est-1", estimator_1), ("est-2", estimator_2)]
    option_3 = [estimator_1, ("est-2", estimator_2)]
    

    If different preprocessing pipelines are desired, a dictionary that maps estimators to preprocessing pipelines must be passed. The names of the preprocessing dictionary must correspond to the names of the estimator dictionary.

    preprocessing_cases = {"case-1": [trans_1, trans_2],
                           "case-2": [alt_trans_1, alt_trans_2]}
    
    estimators = {"case-1": [est_a, est_b],
                  "case-2": [est_c, est_d]}
    

    The lists for each dictionary entry can be any of option_1, option_2 and option_3.

  • preprocessing (dict of lists or list, optional (default = None)) –

    preprocessing pipelines for given layer. If the same preprocessing applies to all estimators, preprocessing should be a list of transformer instances. The list can contain the instances directly, named tuples of transformers, or a combination of both.

    option_1 = [transformer_1, transformer_2]
    option_2 = [("trans-1", transformer_1),
                ("trans-2", transformer_2)]
    option_3 = [transformer_1, ("trans-2", transformer_2)]
    

    If different preprocessing pipelines are desired, a dictionary that maps estimator cases to preprocessing pipelines must be passed. The names of the preprocessing dictionary must correspond to the names of the estimator dictionary.

    preprocessing_cases = {"case-1": [trans_1, trans_2],
                           "case-2": [alt_trans_1, alt_trans_2]}
    
    estimators = {"case-1": [est_a, est_b],
                  "case-2": [est_c, est_d]}
    

    The lists for each dictionary entry can be any of option_1, option_2 and option_3.

  • **kwargs (optional) – optional keyword arguments to instantiate layer with. See respective ensemble for further details.
Returns:

self – ensemble instance with layer instantiated.

Return type:

instance
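
A hypothetical sketch of adding a stacked layer with case-specific preprocessing, following the dictionary mapping described above (the transformer and estimator choices are illustrative only):

>>> from mlens.preprocessing import EnsembleTransformer
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import Lasso
>>> from sklearn.svm import SVR
>>>
>>> # map each preprocessing case to the estimators that should use it
>>> preprocessing = {"case-1": [StandardScaler()],
...                  "case-2": []}
>>> estimators = {"case-1": [SVR()],
...               "case-2": [("lasso", Lasso())]}
>>>
>>> ensemble = EnsembleTransformer()
>>> ensemble.add("stack", estimators, preprocessing)

Since add returns the ensemble instance itself, calls can be chained to build multiple layers.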

fit(X, y=None)[source]

Fit the transformer.

Same as the fit method on an ensemble, except that a sample of X is stored for future comparison.

predict(X)[source]

Generate predictions for X. Same as transform.

transform(X, y=None)[source]

Transform input \(X\) into a prediction matrix \(Z\).

If \(X\) is the training set, the transformer will reproduce the \(Z\) from the call to fit. If \(X\) is another data set, \(Z\) will be produced using base learners fitted on the full training data (equivalent to calling predict on an ensemble).
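
A minimal sketch of the two behaviors (X and y continue the example above; X_new is a hypothetical held-out array with the same number of features):

>>> ensemble.fit(X, y)                 # stores a random sample of X alongside the fit
>>> Z_train = ensemble.transform(X)    # X matches the stored sample: reproduces Z from fit
>>> Z_new = ensemble.transform(X_new)  # new data: uses base learners fitted on all of X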