skorch.helper ¶

Helper functions and classes for users.

They should not be used in skorch directly.

class


    skorch.helper.


    DataFrameTransformer

( treat_int_as_categorical=False , float_dtype=<class 'numpy.float32'> , int_dtype=<class 'numpy.int64'> ) [source] ¶

Transform a DataFrame into a dict useful for working with skorch.

Transforms cardinal data to floats and categorical data to vectors of ints so that they can be embedded.

Although skorch can deal with pandas DataFrames, the default behavior is often not very useful. Use this transformer to transform the DataFrame into a dict with all float columns concatenated using the key “X” and all categorical values encoded as integers, using their respective column names as keys.

Your module must have a matching signature for this to work. It must accept an argument X for all cardinal values. Additionally, for all categorical values, it must accept an argument with the same name as the corresponding column (see example below). If you need help with the required signature, use the describe_signature method of this class and pass it your data.

You can choose whether you want to treat int columns the same as float columns (default) or as categorical values.

To one-hot encode categorical features, initialize their corresponding embedding layers using the identity matrix.

Parameters:

treat_int_as_categorical : bool (default=False)

Whether to treat integers as categorical values or as cardinal values, i.e. the same as floats.

float_dtype : numpy dtype or None (default=np.float32)

The dtype to cast the cardinal values to. If None, don’t change them.

int_dtype : numpy dtype or None (default=np.int64)

The dtype to cast the categorical values to. If None, don’t change them. If you do this, it can happen that the categorical values will have different dtypes, reflecting the number of unique categories.

>>> df = pd.DataFrame({
...     'col_floats': np.linspace(0, 1, 12),
...     'col_ints': [11, 11, 10] * 4,
...     'col_cats': ['a', 'b', 'a'] * 4,
... })
>>> # cast to category dtype to later learn embeddings
>>> df['col_cats'] = df['col_cats'].astype('category')
>>> y = np.asarray([0, 1, 0] * 4)
>>> class MyModule(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.reset_params()
>>>     def reset_params(self):
...         self.embedding = nn.Embedding(2, 10)
...         self.linear = nn.Linear(2, 10)
...         self.out = nn.Linear(20, 2)
...         self.nonlin = nn.Softmax(dim=-1)
>>>     def forward(self, X, col_cats):
...         # "X" contains the values from col_floats and col_ints
...         # "col_cats" contains the values from "col_cats"
...         X_lin = self.linear(X)
...         X_cat = self.embedding(col_cats)
...         X_concat = torch.cat((X_lin, X_cat), dim=1)
...         return self.nonlin(self.out(X_concat))
>>> net = NeuralNetClassifier(MyModule)
>>> pipe = Pipeline([
...     ('transform', DataFrameTransformer()),
...     ('net', net),
... ])
>>> pipe.fit(df, y)
Methods
describe_signature(df)[source]¶
Describe the signature required for the given data.
Pass the DataFrame to receive a description of the signature
required for the module’s forward method. The description
consists of three parts:
1. The names of the arguments that the forward method
needs.
2. The dtypes of the torch tensors passed to forward.
3. The number of input units that are required for the
corresponding argument. For the float parameter, this is just
the number of dimensions of the tensor. For categorical
parameters, it is the number of unique elements.
signature : dict
Returns a dict with each key corresponding to one key
required for the forward method. The values are dictionaries
of two elements. The key “dtype” describes the torch dtype
of the resulting tensor, the key “input_units” describes the
required number of input units.
X_dict: dict
Dictionary with all floats concatenated using the key “X”
and all categorical values encoded as integers, using their
respective column names as keys.
class skorch.helper.SliceDataset(dataset, idx=0, indices=None)[source]¶
Helper class that wraps a torch dataset to make it work with
sklearn.
Sometimes, sklearn will touch the input data, e.g. when splitting
the data for a grid search. This will fail when the input data is
a torch dataset. To prevent this, use this wrapper class for your
dataset.
Note: This class will only return the X value by default (i.e. the
first value returned by indexing the original dataset). Sklearn,
and hence skorch, always require 2 values, X and y. Therefore, you
still need to provide the y data separately.
Note: This class behaves similarly to a PyTorch
Subset when it is indexed by a slice or
numpy array: It will return another SliceDataset that
references the subset instead of the actual values. Only when it
is indexed by an int does it return the actual values. The reason
for this is to avoid loading all data into memory when sklearn,
for instance, creates a train/validation split on the
dataset. Data will only be loaded in batches during the fit loop.
idx : int (default=0)
Indicates which element of the dataset should be
returned. Typically, the dataset returns both X and y
values. SliceDataset can only return 1 value. If you want to
get X, choose idx=0 (default), if you want y, choose idx=1.
indices : list, np.ndarray, or None (default=None)
If you only want to return a subset of the dataset, indicate
which subset that is by passing this argument. Typically, this
can be left to be None, which returns all the data. See also
Subset.
>>> X = MyCustomDataset()
>>> search = GridSearchCV(net, params, ...)
>>> search.fit(X, y)  # raises error
>>> ds = SliceDataset(X)
>>> search.fit(ds, y)  # works
transform(data)[source]¶
Additional transformations on data.
Note: If you use this in conjuction with PyTorch
DataLoader, the latter will call
the dataset for each row separately, which means that the
incoming data is a single rows.
class skorch.helper.SliceDict(**kwargs)[source]¶
Wrapper for Python dict that makes it sliceable across values.
Use this if your input data is a dictionary and you have problems
with sklearn not being able to slice it. Wrap your dict with
SliceDict and it should usually work.
Note:
SliceDict cannot be indexed by integers, if you want one row,
say row 3, use [3:4].
SliceDict accepts numpy arrays and torch tensors as values.
Examples
>>> X = {'key0': val0, 'key1': val1}
>>> search = GridSearchCV(net, params, ...)
>>> search.fit(X, y)  # raises error
>>> Xs = SliceDict(key0=val0, key1=val1)  # or Xs = SliceDict(**X)
>>> search.fit(Xs, y)  # works
fromkeys(*args, **kwargs)
fromkeys method makes no sense with SliceDict and is thus not supported.
get($self, key[, default])
Return the value for key if key is in the dictionary, else default.
items()
keys()
pop(k[,d])
If key is not found, d is returned if given, otherwise KeyError is raised
popitem()
2-tuple; but raise KeyError if D is empty.
setdefault($self, key[, default])
Insert key with a value of default if key is not in the dictionary.
update([E, ]**F)
If E is present and has a .keys() method, then does:  for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does:  for k, v in E: D[k] = v In either case, this is followed by: for k in F:  D[k] = F[k]
values()
fromkeys(*args, **kwargs)[source]¶
fromkeys method makes no sense with SliceDict and is thus not
supported.
update([E, ]**F) → None.  Update D from dict/iterable E and F.[source]¶
If E is present and has a .keys() method, then does:  for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does:  for k, v in E: D[k] = v
In either case, this is followed by: for k in F:  D[k] = F[k]