mlens.preprocessing package
Module contents
author: Sebastian Flennerhag
copyright: 2017
licence: MIT
-
class mlens.preprocessing.Subset(subset=None)
Bases: mlens.externals.sklearn.base.BaseEstimator, mlens.externals.sklearn.base.TransformerMixin
Select a subset of features.
The Subset class acts as a transformer that reduces the feature set to a subset specified by the user.
Parameters: subset (list) – list of column indexes to select. Indexes can be of type str if the data accepts slicing on a list of strings; otherwise the list should be of type int.
-
fit(X, y=None)
Learn what format the data is stored in.
Parameters:
- X (array-like of shape = [n_samples, n_features]) – the data whose type will be inferred.
- y (array-like of shape = [n_samples, n_features]) – pass-through for Scikit-learn pipeline compatibility.
-
transform(X, y=None, copy=False)
Return specified subset of X.
Parameters:
- X (array-like of shape = [n_samples, n_features]) – the data to select the subset from.
- y (array-like of shape = [n_samples, n_features]) – pass-through for Scikit-learn pipeline compatibility.
- copy (bool (default = False)) – whether to copy X before transforming.
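The column selection described for Subset can be sketched with plain NumPy. This is a minimal illustration, not the mlens implementation: the helper name `select_columns` is hypothetical, and only integer indexes on a NumPy array are handled.

```python
import numpy as np


def select_columns(X, subset=None, copy=False):
    """Return the columns of X listed in `subset` (hypothetical helper, not mlens code)."""
    if subset is None:
        # No subset given: pass the data through, optionally as a copy.
        return X.copy() if copy else X
    # A list of integer positions indexes the feature axis of a NumPy array.
    # Indexing with a list always produces a new array, so X is left untouched.
    return X[:, subset]


X = np.arange(12).reshape(3, 4)
Z = select_columns(X, subset=[0, 2])
print(Z)
```

For data that accepts slicing on a list of strings, such as a pandas DataFrame, the equivalent selection would be `X[subset]` with `subset` holding column names.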
-
-
class mlens.preprocessing.Shift(s)
Bases: mlens.externals.sklearn.base.BaseEstimator, mlens.externals.sklearn.base.TransformerMixin
Lag operator.
Shift an input array \(X\) by \(s\) steps, i.e. for some time series \(\mathbf{X} = (X_t, X_{t-1}, \ldots, X_{0})\),
\[L^{s} \mathbf{X} = (X_{t-s}, X_{t-1-s}, \ldots, X_{s-s})\]
Parameters: s (int) – number of lags to generate
Examples
>>> import numpy as np
>>> from mlens.preprocessing import Shift
>>> X = np.arange(10)
>>> L = Shift(2)
>>> Z = L.fit_transform(X)
>>> print("X : {}".format(X[2:]))
>>> print("Z : {}".format(Z))
X : [2 3 4 5 6 7 8 9]
Z : [0 1 2 3 4 5 6 7]
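The alignment in the example can be reproduced without mlens. The sketch below assumes the array is ordered from oldest to newest observation, as in the example. The function name `lag` is hypothetical and the code is an illustration of the lag operator, not the library's code.

```python
import numpy as np


def lag(X, s):
    """Drop the last s observations so that element i lines up with X[s:][i]."""
    if s <= 0:
        raise ValueError("s must be a positive integer")
    return X[:-s]


X = np.arange(10)
s = 2
Z = lag(X, s)   # lagged values, length len(X) - s
y = X[s:]       # the observations the lagged values line up with

# Each observation in y is paired with the value seen s steps earlier.
pairs = list(zip(y.tolist(), Z.tolist()))
print(pairs[:3])
```

Both arrays have length `len(X) - s`, which is why the example prints `X[2:]` next to `Z`.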
-
class mlens.preprocessing.EnsembleTransformer(shuffle=False, random_state=None, scorer=None, raise_on_exception=True, array_check=2, verbose=False, n_jobs=1, layers=None, backend=None, sample_dim=10)
Bases: mlens.ensemble.base.BaseEnsemble
Ensemble Transformer class.
The EnsembleTransformer class allows users to build layers of an ensemble through a transformer API. The transformer is closely related to SequentialEnsemble, in that any accepted type of layer can be added. It differs in one significant respect: when fitted, it stores a random sample of the training set together with the training dimensions. If, in a call to transform, the data to be transformed corresponds to the training set, the transformer recreates the prediction matrix from the fit call. In contrast, a fitted ensemble only uses the base learners fitted on the full dataset, so predicting the training set will not reproduce the predictions from the fit call.
The EnsembleTransformer is a powerful tool to use as a preprocessing pipeline in an Evaluator instance, as it faithfully recreates the prediction matrix a potential meta learner would face. Hence, a user can ‘preprocess’ the training data with the EnsembleTransformer to generate k-fold base learner predictions, and then fit different meta learners (or higher-order layers) in a call to evaluate.
See also
SequentialEnsemble, Evaluator
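The difference between k-fold base learner predictions and predictions from a learner fitted on the full training set can be illustrated with NumPy alone. This is a sketch under stated assumptions: the base learner is a plain least-squares fit, the helpers `fit_linear`, `predict_linear` and `out_of_fold_predictions` are hypothetical, and the number of folds is chosen arbitrarily. None of it is taken from the mlens source.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Predict with coefficients produced by fit_linear."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def out_of_fold_predictions(X, y, n_folds=2):
    """Predict every row with a model that never saw that row during fitting."""
    n = len(X)
    folds = np.array_split(np.arange(n), n_folds)
    z = np.empty(n)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        coef = fit_linear(X[train_idx], y[train_idx])
        z[test_idx] = predict_linear(coef, X[test_idx])
    return z


rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(40)

# Column a meta learner would be trained on: out-of-fold predictions.
z_oof = out_of_fold_predictions(X, y)
# Predictions from a learner fitted on all rows, applied to the same rows.
z_full = predict_linear(fit_linear(X, y), X)

print(np.allclose(z_oof, z_full))
```

The two columns differ because the full-data learner has already seen every row it predicts. Reproducing the first column for the training set is the behaviour the transformer is described as providing.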
Parameters:
- shuffle (bool (default = False)) – whether to shuffle data before generating folds.
- random_state (int (default = None)) – random seed if shuffling inputs.
- scorer (object (default = None)) – scoring function. If a function is provided, base estimators will be scored on the training set assembled for fitting the meta estimator. Since those predictions are out-of-sample, the scores represent valid test scores. The scorer should be a function that accepts an array of true values and an array of predictions: score = f(y_true, y_pred).
- raise_on_exception (bool (default = True)) – whether to raise an error on soft exceptions or issue warnings instead. Examples include lack of layers, bad inputs, and failed fit of an estimator in a layer. If set to False, warnings are issued and estimation continues unless the exception is fatal. Note that this can result in unexpected behavior unless the exception is anticipated.
- sample_dim (int (default = 10)) – dimensionality of training set to sample. During a call to fit, a random sample of size [sample_dim, sample_dim] will be sampled from the training data, along with the dimensions of the training data. If, in a call to transform, sampling the same indices on the array to transform gives the same sample matrix, the transformer will reproduce the predictions from the call to fit, as opposed to using the base learners fitted on the full training data.
- array_check (int (default = 2)) – level of strictness in checking input arrays.
  - array_check = 0 will not check X or y.
  - array_check = 1 will check X and y for inconsistencies and warn when the format looks suspicious, but retain the original format.
  - array_check = 2 will impose Scikit-learn array checks, which convert X and y to numpy arrays and raise an error if conversion fails.
- verbose (int or bool (default = False)) – level of verbosity.
  - verbose = 0: silent (same as verbose = False)
  - verbose = 1: messages at start and finish (same as verbose = True)
  - verbose = 2: messages for each layer
  If verbose >= 50, prints to sys.stdout, else sys.stderr. For verbosity in the layers themselves, use fit_params.
- n_jobs (int (default = 1)) – number of CPU cores to use for fitting and prediction.
-
scores_
dict – if scorer was passed to the instance, scores_ contains a dictionary with cross-validated scores assembled during the fit call. The fold structure used for scoring is determined by folds.
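A scorer with the signature score = f(y_true, y_pred) can be as small as the sketch below. The function name `rmse_scorer` is hypothetical and the code is only an illustration of the expected call shape. mlens ships its own `rmse` in `mlens.metrics`, which the example that follows imports.

```python
import numpy as np


def rmse_scorer(y_true, y_pred):
    """Root mean squared error, following the score = f(y_true, y_pred) convention."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Two exact predictions and one that is off by 3: mean squared error is 3.
score = rmse_scorer([3.0, 5.0, 7.0], [3.0, 5.0, 10.0])
print(score)
```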
Examples
>>> from mlens.preprocessing import EnsembleTransformer
>>> from mlens.model_selection import Evaluator
>>> from mlens.metrics.metrics import rmse
>>> from sklearn.datasets import load_boston
>>> from sklearn.linear_model import Lasso
>>> from sklearn.svm import SVR
>>> from scipy.stats import uniform
>>> from pandas import DataFrame
>>>
>>> X, y = load_boston(True)
>>>
>>> ensemble = EnsembleTransformer()
>>>
>>> ensemble.add('stack', [SVR(), Lasso()])
>>>
>>> evl = Evaluator(scorer=rmse, random_state=10)
>>>
>>> evl.preprocess(X, y, [('scale', ensemble)])
>>>
>>> draws = {(None, 'svr'): {'C': uniform(10, 100)},
...          (None, 'lasso'): {'alpha': uniform(0.01, 0.1)}}
>>>
>>> evl.evaluate(X, y, [SVR(), Lasso()], draws, n_iter=10)
>>>
>>> DataFrame(evl.summary)
       fit_time_mean  fit_time_std  test_score_mean  test_score_std  \
lasso       0.000818      0.000362         7.514181        0.827578
svr         0.009790      0.000596        10.949149        0.577554

       train_score_mean  train_score_std                      params
lasso          6.228287         0.949872  {'alpha': 0.0871320643267}
svr            5.794856         1.348409        {'C': 12.0751949359}
-
add(cls, estimators, preprocessing=None, **kwargs)
Add layer to ensemble transformer.
Parameters:
- cls (str) – layer class. Accepted types are:
  - ‘blend’ : blend ensemble
  - ‘subset’ : subsemble
  - ‘stack’ : super learner
- estimators (dict of lists or list or instance) – estimators constituting the layer. If preprocessing is None and the layer is meant to be the meta estimator, it is permissible to pass a single instantiated estimator. If preprocessing is None or a list, estimators should be a list. The list can contain estimator instances, named tuples of estimator instances, or a combination of both.

      option_1 = [estimator_1, estimator_2]
      option_2 = [("est-1", estimator_1), ("est-2", estimator_2)]
      option_3 = [estimator_1, ("est-2", estimator_2)]

  If different preprocessing pipelines are desired, a dictionary that maps case names to estimators must be passed. The names of the estimator dictionary must correspond to the names of the preprocessing dictionary.

      preprocessing_cases = {"case-1": [trans_1, trans_2],
                             "case-2": [alt_trans_1, alt_trans_2]}
      estimators = {"case-1": [est_a, est_b],
                    "case-2": [est_c, est_d]}

  The lists for each dictionary entry can be any of option_1, option_2 and option_3.
- preprocessing (dict of lists or list, optional (default = None)) – preprocessing pipelines for the given layer. If the same preprocessing applies to all estimators, preprocessing should be a list of transformer instances. The list can contain the instances directly, named tuples of transformers, or a combination of both.

      option_1 = [transformer_1, transformer_2]
      option_2 = [("trans-1", transformer_1), ("trans-2", transformer_2)]
      option_3 = [transformer_1, ("trans-2", transformer_2)]

  If different preprocessing pipelines are desired, a dictionary that maps case names to preprocessing pipelines must be passed. The names of the preprocessing dictionary must correspond to the names of the estimator dictionary.

      preprocessing_cases = {"case-1": [trans_1, trans_2],
                             "case-2": [alt_trans_1, alt_trans_2]}
      estimators = {"case-1": [est_a, est_b],
                    "case-2": [est_c, est_d]}

  The lists for each dictionary entry can be any of option_1, option_2 and option_3.
- **kwargs (optional) – optional keyword arguments to instantiate the layer with. See the respective ensemble for further details.
Returns: self – ensemble instance with layer instantiated.
Return type: instance
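The accepted input shapes for estimators and preprocessing can be made concrete with a small validation sketch. Everything here is hypothetical: `normalize_named` and `check_cases` are illustrative helpers, the stand-in classes replace real estimators and transformers, and the rule of naming unnamed instances by lower-cased class name is an assumption of this sketch, not confirmed library behaviour.

```python
def normalize_named(items):
    """Turn a mixed list of instances and (name, instance) tuples into (name, instance) pairs."""
    named = []
    for item in items:
        if isinstance(item, tuple):
            name, obj = item
        else:
            # Assumed naming rule for this sketch: lower-cased class name.
            name, obj = type(item).__name__.lower(), item
        named.append((name, obj))
    return named


def check_cases(estimators, preprocessing):
    """Raise if the two case dictionaries do not share exactly the same keys."""
    unmatched = set(estimators) ^ set(preprocessing)
    if unmatched:
        raise ValueError("Unmatched case names: {}".format(sorted(unmatched)))


class Scaler(object):
    """Stand-in for a transformer."""


class Lasso(object):
    """Stand-in for an estimator."""


class SVR(object):
    """Stand-in for an estimator."""


preprocessing = {"case-1": [Scaler()], "case-2": [("scale", Scaler())]}
estimators = {"case-1": [Lasso(), ("svr", SVR())], "case-2": [SVR()]}

check_cases(estimators, preprocessing)  # passes: both use case-1 and case-2
names = [name for name, _ in normalize_named(estimators["case-1"])]
print(names)
```

Checking the case names before building a layer surfaces a mismatched dictionary key early, instead of leaving an estimator without a preprocessing pipeline.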
-
fit(X, y=None)
Fit the transformer.
Same as the fit method on an ensemble, except that a sample of X is stored for future comparison.
-
transform(X, y=None)
Transform input \(X\) into a prediction matrix \(Z\).
If \(X\) is the training set, the transformer will reproduce the \(Z\) from the call to fit. If \(X\) is another data set, \(Z\) will be produced using base learners fitted on the full training data (equivalent to calling predict on an ensemble).
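The training-set check that fit and transform rely on can be sketched as follows. This is a minimal sketch of the idea described for sample_dim, assuming a two-dimensional NumPy input. The class `TrainSetDetector` and its method names are hypothetical and do not reflect the mlens internals.

```python
import numpy as np


class TrainSetDetector(object):
    """Remember a small sample of the training data and recognise it later (hypothetical)."""

    def __init__(self, sample_dim=10, random_state=None):
        self.sample_dim = sample_dim
        self.random_state = random_state

    def fit(self, X):
        rng = np.random.RandomState(self.random_state)
        n_rows, n_cols = X.shape
        # Sample at most sample_dim row and column indices without replacement.
        self.rows_ = rng.choice(n_rows, min(self.sample_dim, n_rows), replace=False)
        self.cols_ = rng.choice(n_cols, min(self.sample_dim, n_cols), replace=False)
        # Store the training dimensions and the sampled sub-matrix.
        self.shape_ = X.shape
        self.sample_ = X[np.ix_(self.rows_, self.cols_)].copy()
        return self

    def is_train(self, X):
        # Different dimensions: cannot be the training set.
        if X.shape != self.shape_:
            return False
        # Same dimensions: compare the values at the stored indices.
        return bool(np.array_equal(X[np.ix_(self.rows_, self.cols_)], self.sample_))


rng = np.random.RandomState(1)
X_train = rng.rand(50, 4)
X_other = rng.rand(50, 4)

detector = TrainSetDetector(sample_dim=10, random_state=0).fit(X_train)
print(detector.is_train(X_train), detector.is_train(X_other))
```

Comparing a small stored sample keeps the check cheap, at the cost of a small chance that a different array matching on the sampled entries is treated as the training set.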