mlens.base package
Module contents
ML-ENSEMBLE
author: Sebastian Flennerhag
copyright: 2017
licence: MIT
Base modules
class mlens.base.IdTrain(size=10)
Bases: mlens.externals.sklearn.base.BaseEstimator
Container to identify training set.
Samples a random subset from the set passed to the fit method, so that the training set can be identified in a transform or predict method.
Parameters: size (int) – size to sample. A random subset of size [size, size] will be stored in the instance.
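The idea of identifying a training set from a stored random sample can be sketched as follows. This is a minimal illustration on nested lists, not the mlens implementation: the class name `TrainingSetTag`, the `seed` argument and the `is_train` method are assumptions made for the example.

```python
import random


class TrainingSetTag:
    """Hypothetical stand-in for the idea behind IdTrain (not mlens code)."""

    def __init__(self, size=10, seed=0):
        self.size = size
        self.seed = seed
        self.shape_ = None
        self.sample_ = None

    def fit(self, X):
        """Store the values of a random subset of cells of the 2D list X."""
        rng = random.Random(self.seed)
        n_rows, n_cols = len(X), len(X[0])
        rows = rng.sample(range(n_rows), min(self.size, n_rows))
        cols = rng.sample(range(n_cols), min(self.size, n_cols))
        self.shape_ = (n_rows, n_cols)
        # Remember positions and values so they can be compared later.
        self.sample_ = {(r, c): X[r][c] for r in rows for c in cols}
        return self

    def is_train(self, X):
        """Return True if X matches the stored shape and sampled values."""
        if self.sample_ is None or not X:
            return False
        if (len(X), len(X[0])) != self.shape_:
            return False
        return all(X[r][c] == v for (r, c), v in self.sample_.items())


X_train = [[i * 3 + j for j in range(3)] for i in range(20)]
X_other = [[0] * 3 for _ in range(20)]

tag = TrainingSetTag(size=5).fit(X_train)
print(tag.is_train(X_train))  # True
print(tag.is_train(X_other))  # False
```

Only a small sample is stored, so the check is cheap; the trade-off is that a different array agreeing on every sampled cell would be misidentified as the training set.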
class mlens.base.BlendIndex(test_size=0.5, train_size=None, X=None, raise_on_exception=True)
Bases: mlens.base.indexer.BaseIndex
Indexer that generates two non-overlapping subsets of X.
Iterator that generates one training fold and one test fold that are non-overlapping and that may or may not partition all of X, depending on the user's specification.
BlendIndex creates a singleton generator (it has one iteration) that yields two tuples of (start, stop) integers that can be used for numpy array slicing (i.e. X[start:stop]). If a full array index is desired, this can easily be achieved with:
    for train_tup, test_tup in self.generate():
        train_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in train_tup])
        test_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in test_tup])
Parameters:
- test_size (int or float (default = 0.5)) – Size of the test set. If float, assumed to be a proportion of the full data set.
- train_size (int or float, optional) – Size of the train set. If not specified (i.e. train_size = None), train_size is equal to n_samples - test_size. If float, assumed to be a proportion of the full data set. If train_size + test_size amount to less than the observations in the full data set, a subset of the specified size will be used.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices. If False, a warning is issued instead.
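The size rules above can be sketched in plain Python. This is an illustrative helper rather than the BlendIndex code: the name `blend_bounds` is hypothetical, and rounding float proportions down is an assumption inferred from the examples on this page.

```python
def blend_bounds(n_samples, test_size=0.5, train_size=None):
    """Return ((train_start, train_stop), (test_start, test_stop)).

    Hypothetical helper: floats are treated as proportions and rounded
    down, which is an assumption inferred from the documented examples.
    """
    def to_count(size):
        return int(size * n_samples) if isinstance(size, float) else size

    n_test = to_count(test_size)
    n_train = n_samples - n_test if train_size is None else to_count(train_size)
    if n_test < 1 or n_train < 1 or n_train + n_test > n_samples:
        raise ValueError("sizes do not fit the data: train=%d, test=%d, n=%d"
                         % (n_train, n_test, n_samples))
    # The train block comes first; the test block follows immediately.
    return (0, n_train), (n_train, n_train + n_test)


print(blend_bounds(8, 3))           # ((0, 5), (5, 8))
print(blend_bounds(8, 3, 4))        # ((0, 4), (4, 7))
print(blend_bounds(8, 0.25, 0.45))  # ((0, 3), (3, 5))
```

When the two sizes sum to less than `n_samples`, the trailing observations are simply left out of both sets.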
Examples
Selecting an absolute test size, with train size as the remainder
>>> import numpy as np
>>> from mlens.base.indexer import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3)
>>> print('Test size: 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 3
TEST (idx | array): (5, 8) | array([5, 6, 7])
TRAIN (idx | array): (0, 5) | array([0, 1, 2, 3, 4])
Selecting a test and train size less than the total
>>> import numpy as np
>>> from mlens.base.indexer import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3, 4, X)
>>> print('Test size: 3')
>>> print('Train size: 4')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 3
Train size: 4
TEST (idx | array): (4, 7) | array([4, 5, 6])
TRAIN (idx | array): (0, 4) | array([0, 1, 2, 3])
Selecting a percentage of observations as test and train set
>>> import numpy as np
>>> from mlens.base.indexer import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(0.25, 0.45, X)
>>> print('Test size: 25% * 8 = 2')
>>> print('Train size: 45% * 8 < 4 -> 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST (idx | array): (%i, %i) | %r ' % (tei[0], tei[1],
...                                                   X[tei[0]:tei[1]]))
...     print('TRAIN (idx | array): (%i, %i) | %r ' % (tri[0], tri[1],
...                                                    X[tri[0]:tri[1]]))
Test size: 25% * 8 = 2
Train size: 45% * 8 < 4 -> 3
TEST (idx | array): (3, 5) | array([3, 4])
TRAIN (idx | array): (0, 3) | array([0, 1, 2])
Rebasing the test set to be 0-indexed
>>> import numpy as np
>>> from mlens.base.indexer import BlendIndex
>>> X = np.arange(8)
>>> idx = BlendIndex(3, rebase=True)
>>> print('Test size: 3')
>>> for tri, tei in idx.generate(X):
...     print('TEST tuple: (%i, %i) | array: %r' % (tei[0], tei[1],
...                                                 np.arange(tei[0],
...                                                           tei[1])))
Test size: 3
TEST tuple: (0, 3) | array: array([0, 1, 2])
class mlens.base.FoldIndex(n_splits=2, X=None, raise_on_exception=True)
Bases: mlens.base.indexer.BaseIndex
Indexer that generates the full size of X.
K-Fold iterator that generates fold index tuples.
FoldIndex creates a generator that returns tuples of start and stop positions to be used for numpy array slicing [start:stop]. Note that slicing works well for the test set, but for the training set it is recommended to concatenate the index for training data that comes before the current test set with the index for the training data that comes after. This can easily be achieved with:
    for train_tup, test_tup in self.generate():
        train_slice = numpy.hstack([numpy.arange(t0, t1) for t0, t1 in train_tup])
        xtrain, xtest = X[train_slice], X[test_tup[0]:test_tup[1]]
Warning
Simple slicing (i.e. X[start:stop]) generally does not work for the train set, which often requires concatenating the train index range below the current test set and the train index range above the current test set. To build a training index, use hstack([np.arange(t0, t1) for t0, t1 in train_index_tuples]).
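Why the train set needs concatenation can be seen in a small pure-Python sketch of K-Fold (start, stop) ranges. The helpers `fold_bounds` and `to_index` are hypothetical names, and giving the first `n_samples % n_splits` folds one extra sample is an assumption that matches the fold sizes in the FoldIndex example on this page.

```python
def fold_bounds(n_samples, n_splits):
    """Yield (train_tuples, test_tuple) of (start, stop) ranges per fold.

    Hypothetical helper: the first n_samples % n_splits folds get one
    extra sample, which is an assumption that matches the fold sizes
    shown in the FoldIndex example.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + base + (1 if k < extra else 0)
        # Train ranges lie below and above the test range; skip empty ones.
        train = [(a, b) for a, b in ((0, start), (stop, n_samples)) if a < b]
        yield train, (start, stop)
        start = stop


def to_index(ranges):
    """Concatenate (start, stop) ranges into one flat list of indices."""
    return [i for a, b in ranges for i in range(a, b)]


for train, test in fold_bounds(10, 4):
    print(to_index(train), to_index([test]))
```

For every fold except the first and the last, the train set consists of two ranges, which is why a single slice cannot express it.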
Examples
Creating arrays of folds and checking overlap
>>> import numpy as np
>>> from mlens.base.indexer import FoldIndex
>>> X = np.arange(10)
>>> print("Data set: %r" % X)
>>> print()
>>>
>>> idx = FoldIndex(4, X)
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN IDX: %32r | TEST IDX: %16r' % (train, test))
>>>
>>> print()
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN SET: %32r | TEST SET: %16r' % (X[train], X[test]))
>>>
>>> for train_idx, test_idx in idx.generate(as_array=True):
...     assert not any([i in X[test_idx] for i in X[train_idx]])
>>>
>>> print()
>>>
>>> print("No overlap between train set and test set.")
Data set: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

TRAIN IDX: array([3, 4, 5, 6, 7, 8, 9]) | TEST IDX: array([0, 1, 2])
TRAIN IDX: array([0, 1, 2, 6, 7, 8, 9]) | TEST IDX: array([3, 4, 5])
TRAIN IDX: array([0, 1, 2, 3, 4, 5, 8, 9]) | TEST IDX: array([6, 7])
TRAIN IDX: array([0, 1, 2, 3, 4, 5, 6, 7]) | TEST IDX: array([8, 9])

TRAIN SET: array([3, 4, 5, 6, 7, 8, 9]) | TEST SET: array([0, 1, 2])
TRAIN SET: array([0, 1, 2, 6, 7, 8, 9]) | TEST SET: array([3, 4, 5])
TRAIN SET: array([0, 1, 2, 3, 4, 5, 8, 9]) | TEST SET: array([6, 7])
TRAIN SET: array([0, 1, 2, 3, 4, 5, 6, 7]) | TEST SET: array([8, 9])

No overlap between train set and test set.
Passing n_splits = 1 without raising an exception.
>>> import numpy as np
>>> from mlens.base.indexer import FoldIndex
>>> X = np.arange(3)
>>> print("Data set: %r" % X)
>>> print()
>>>
>>> idx = FoldIndex(1, X, raise_on_exception=False)
>>>
>>> for train, test in idx.generate(as_array=True):
...     print('TRAIN IDX: %10r | TEST IDX: %10r' % (train, test))
/../mlens/base/indexer.py:167: UserWarning: 'n_splits' is 1, will return full index as both training set and test set.
  warnings.warn("'n_splits' is 1, will return full index as "

Data set: array([0, 1, 2])
TRAIN IDX: array([0, 1, 2]) | TEST IDX: array([0, 1, 2])
class mlens.base.SubsetIndex(n_partitions=2, n_splits=2, X=None, raise_on_exception=True)
Bases: mlens.base.indexer.BaseIndex
Subsample index generator.
Generates the cross-validation folds used to create J partitions of the data and v folds on each partition, as per [1]:
1. Split X into J partitions
2. For each partition:
   - For each fold v, create train index of all idx not in v
3. Concatenate all the fold v indices into a test index for fold v that spans all partitions
Setting J = 1 is equivalent to the FoldIndex, which returns standard K-Fold train and test set indices.
See also
FoldIndex, BlendIndex, Subsemble
References
[1] Sapp, S., van der Laan, M. J., & Canny, J. (2014). Subsemble: an ensemble method for combining subset-specific algorithm fits. Journal of Applied Statistics, 41(6), 1247-1259. http://doi.org/10.1080/02664763.2013.864263
Parameters:
- n_partitions (int, list (default = 2)) – Number of partitions to split the data into. If n_partitions=1, SubsetIndex reduces to standard K-Fold.
- n_splits (int (default = 2)) – Number of splits to create in each partition. n_splits cannot be 1 if n_partitions > 1. Note that if n_splits = 1, both the train and test set will index the full data.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices. If False, a warning is issued instead.
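The partition-and-fold scheme described for SubsetIndex can be sketched with plain lists. The helpers `split_range` and `subset_folds` are hypothetical names, and the rule that earlier partitions and folds absorb any remainder is an assumption chosen to agree with the example output on this page.

```python
def split_range(start, stop, n_parts):
    """Split [start, stop) into n_parts contiguous (start, stop) ranges."""
    base, extra = divmod(stop - start, n_parts)
    out = []
    for k in range(n_parts):
        size = base + (1 if k < extra else 0)
        out.append((start, start + size))
        start += size
    return out


def subset_folds(n_samples, n_partitions=2, n_splits=2):
    """Yield (partition, fold, train_index, test_index) as plain lists."""
    partitions = split_range(0, n_samples, n_partitions)
    folds = [split_range(a, b, n_splits) for a, b in partitions]
    for j, (p_start, p_stop) in enumerate(partitions):
        for v, (f_start, f_stop) in enumerate(folds[j]):
            # Train: indices in partition j that are not in fold v.
            train = [i for i in range(p_start, p_stop)
                     if not f_start <= i < f_stop]
            # Test: fold v of every partition, concatenated.
            test = [i for part in folds for i in range(*part[v])]
            yield j + 1, v + 1, train, test


for part, fold, train, test in subset_folds(10, n_partitions=3):
    print("J = %i | f = %i | train: %r | test: %r" % (part, fold, train, test))
```

Each estimator is trained only on its own partition, while the test index for fold v is shared by all partitions, so every estimator predicts on the same held-out rows.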
Examples
>>> import numpy as np
>>> from mlens.base import SubsetIndex
>>> X = np.arange(10)
>>> idx = SubsetIndex(3, X=X)
>>>
>>> print('Expected partitions of X:')
>>> print('J = 1: {!r}'.format(X[0:4]))
>>> print('J = 2: {!r}'.format(X[4:7]))
>>> print('J = 3: {!r}'.format(X[7:10]))
>>> print('SubsetIndexer partitions:')
>>> for i, part in enumerate(idx.partition(as_array=True)):
...     print('J = {}: {!r}'.format(i + 1, part))
>>> print('SubsetIndexer folds on partitions:')
>>> for i, (tri, tei) in enumerate(idx.generate()):
...     fold = i % 2 + 1
...     part = i // 2 + 1
...     train = np.hstack([np.arange(t0, t1) for t0, t1 in tri])
...     test = np.hstack([np.arange(t0, t1) for t0, t1 in tei])
...     print("J = %i | f = %i | "
...           "train: %15r | test: %r" % (part, fold, train, test))
Expected partitions of X:
J = 1: array([0, 1, 2, 3])
J = 2: array([4, 5, 6])
J = 3: array([7, 8, 9])
SubsetIndexer partitions:
J = 1: array([0, 1, 2, 3])
J = 2: array([4, 5, 6])
J = 3: array([7, 8, 9])
SubsetIndexer folds on partitions:
J = 1 | f = 1 | train: array([2, 3]) | test: array([0, 1, 4, 5, 7, 8])
J = 1 | f = 2 | train: array([0, 1]) | test: array([2, 3, 6, 9])
J = 2 | f = 1 | train: array([6]) | test: array([0, 1, 4, 5, 7, 8])
J = 2 | f = 2 | train: array([4, 5]) | test: array([2, 3, 6, 9])
J = 3 | f = 1 | train: array([9]) | test: array([0, 1, 4, 5, 7, 8])
J = 3 | f = 2 | train: array([7, 8]) | test: array([2, 3, 6, 9])
fit(X, y=None, job=None)
Method for storing array data.
Parameters:
- X (array-like of shape [n_samples, optional]) – array to collect dimension data from.
- y (None) – for compatibility
- job (None) – for compatibility
Returns: indexer with stored sample size data.
Return type: instance
partition(X=None, as_array=False)
Get partition indices for training full subset estimators.
Returns the index range for each partition of X.
Parameters:
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- as_array (bool (default = False)) – whether to return partition as an index array. Otherwise tuples of (start, stop) indices are returned.
class mlens.base.FullIndex(X=None)
Bases: mlens.base.indexer.BaseIndex
Vacuous indexer to be used with final layers.
FullIndex is a compatibility class to be used with meta layers. It stores the sample size to be predicted for use with the ParallelProcessing job manager, and yields a None, None index when generate is called. However, for transparency and maintainability, it is preferable to build code that avoids calling the generate method when the indexer is known to be an instance of FullIndex.
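The recommendation to branch explicitly rather than iterate over a (None, None) index might look like the sketch below. Both `VacuousIndex` and `rows_to_predict` are hypothetical stand-ins written for this illustration; they do not reproduce the mlens classes or their attributes.

```python
class VacuousIndex:
    """Hypothetical stand-in for a FullIndex-like indexer (not mlens code)."""

    def __init__(self, n_samples):
        self.n_samples = n_samples

    def generate(self):
        # A single (None, None) pair signals "no train/test split".
        yield None, None


def rows_to_predict(indexer, X):
    """Return the rows a final layer should predict on."""
    if isinstance(indexer, VacuousIndex):
        # Explicit branch: use all of X without calling generate().
        return X
    # Assumes each test index is a single (start, stop) tuple.
    out = []
    for _, (start, stop) in indexer.generate():
        out.extend(X[start:stop])
    return out


print(rows_to_predict(VacuousIndex(4), [10, 11, 12, 13]))  # [10, 11, 12, 13]
```

The explicit `isinstance` check makes it obvious to a reader that no splitting takes place, instead of relying on downstream code to interpret a pair of None values.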
class mlens.base.ClusteredSubsetIndex(estimator, n_partitions=2, n_splits=2, X=None, y=None, fit_estimator=True, attr='predict', partition_on='X', raise_on_exception=True)
Bases: mlens.base.indexer.BaseIndex
Clustered Subsample index generator.
Generates the cross-validation folds used to create J partitions of the data and v folds on each partition, as per [2]:
1. Split X into J partitions
2. For each partition:
   - For each fold v, create train index of all idx not in v
3. Concatenate all the fold v indices into a test index for fold v that spans all partitions
Setting J = 1 is equivalent to the FoldIndex, which returns standard K-Fold train and test set indices.
ClusteredSubsetIndex uses a user-provided estimator to partition the data, in contrast to the SubsetIndex generator, which partitions the data randomly into equal sizes.
References
[2] Sapp, S., van der Laan, M. J., & Canny, J. (2014). Subsemble: an ensemble method for combining subset-specific algorithm fits. Journal of Applied Statistics, 41(6), 1247-1259. http://doi.org/10.1080/02664763.2013.864263
Parameters:
- estimator (instance) – Estimator to use for clustering.
- n_partitions (int) – Number of partitions the estimator will create.
- n_splits (int (default = 2)) – Number of folds to create in each partition. n_splits cannot be 1 if n_partitions > 1. Note that if n_splits = 1, both the train and test set will index the full data.
- fit_estimator (bool (default = True)) – whether to fit the estimator separately before generating labels.
- attr (str (default = 'predict')) – the attribute to use for generating cluster membership labels.
- X (array-like of shape [n_samples,], optional) – the training set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- raise_on_exception (bool (default = True)) – whether to raise an error on suspicious slices. If False, a warning is issued instead.
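Partitioning by cluster membership amounts to grouping sample indices by the label the estimator assigns. The sketch below uses a hypothetical helper, `partition_by_label`, and takes its labels from the example output on this page instead of fitting a clustering estimator.

```python
from collections import defaultdict


def partition_by_label(labels):
    """Group sample indices by cluster label, ordered by label."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return [groups[label] for label in sorted(groups)]


# Cluster labels copied from the ClusteredSubsetIndex example output.
labels = [0, 2, 2, 1, 2, 0, 0, 1, 1, 2, 0, 1]
for j, part in enumerate(partition_by_label(labels)):
    print("partition (%i) index: %r" % (j + 1, part))
```

Unlike contiguous (start, stop) ranges, these partitions are scattered index sets, so they can only be expressed as index arrays.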
Examples
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from mlens.base.indexer import ClusteredSubsetIndex
>>>
>>> km = KMeans(3, random_state=0)
>>> X = np.arange(12).reshape(-1, 1); np.random.shuffle(X)
>>> print("Data: {}".format(X.ravel()))
>>>
>>> s = ClusteredSubsetIndex(km)
>>> s.fit(X)
>>>
>>> P = s.estimator.predict(X)
>>> print("cluster labels: {}".format(P))
>>>
>>> for j, i in enumerate(s.partition(as_array=True)):
...     print("partition ({}) index: {}, cluster labels: {}".format(j + 1, i, P[i]))
>>>
>>> for i in s.generate(as_array=True):
...     print("train fold index: {}, cluster labels: {}".format(i[0], P[i[0]]))
Data: [ 8 7 5 2 4 10 11 1 3 6 9 0]
cluster labels: [0 2 2 1 2 0 0 1 1 2 0 1]
partition (1) index: [ 0 5 6 10], cluster labels: [0 0 0 0]
partition (2) index: [ 3 7 8 11], cluster labels: [1 1 1 1]
partition (3) index: [1 2 4 9], cluster labels: [2 2 2 2]
train fold index: [0 3 5], cluster labels: [0 0 0]
train fold index: [ 6 10], cluster labels: [0 0]
train fold index: [2 7], cluster labels: [1 1]
train fold index: [ 9 11], cluster labels: [1 1]
train fold index: [1 4], cluster labels: [2 2]
train fold index: [8], cluster labels: [2]
fit(X, y=None, job='fit')
Method for storing array data.
Parameters:
- X (array-like of shape [n_samples, n_features]) – input array.
- y (array-like of shape [n_samples, ]) – labels.
- job (str, ['fit', 'predict'] (default='fit')) – type of estimation job. If 'fit', the indexer will be fitted, which involves fitting the estimator. Otherwise, the indexer will not be fitted (since it is not used for prediction).
Returns: indexer with stored sample size data.
Return type: instance
partition(X=None, y=None, as_array=False)
Get partition indices for training full subset estimators.
Returns the index range for each partition of X.
Parameters:
- X (array-like of shape [n_samples, n_features], optional) – the set to partition. The training label array is also accepted, as only the first dimension is used. If X is not passed at instantiation, the fit method must be called before generate, or X must be passed as an argument of generate.
- y (array-like of shape [n_samples,], optional) – the labels of the set to partition.
- as_array (bool (default = False)) – whether to return partition as an index array. Otherwise tuples of (start, stop) indices are returned.