I'm having trouble normalizing a dask.dataframe.core.DataFrame with dask_ml.preprocessing.MinMaxScaler. I can do the scaling with sklearn.preprocessing.MinMaxScaler, but I'd like to use Dask so it scales out.
Minimal, reproducible example:
import dask.dataframe as dd

# Get data
ddf = dd.read_csv('test.csv')  # See below
ddf = ddf.set_index('index')

# Pivot
ddf = ddf.categorize(columns=['item', 'name'])
ddf_p = ddf.pivot_table(index='item', columns='name', values='value', aggfunc='mean')
col = ddf_p.columns.to_list()

# sklearn version
from sklearn.preprocessing import MinMaxScaler
scaler_s = MinMaxScaler()
scaled_ddf_s = scaler_s.fit_transform(ddf_p[col])  # Works!

# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])  # Doesn't work
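As a side note (not part of the failing example), this is how I would look at what pivot_table produced; the comments describe what I expect to see and are assumptions on my part, not verified output:

# Inspect the pivoted frame lazily, without computing it.
print(ddf_p.dtypes)    # dtypes of the pivoted value columns
print(ddf_p.columns)   # column labels -- presumably a CategoricalIndex built from 'name'
print(ddf_p.index)     # the 'item' index -- presumably categorical too, from categorize()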
Error message:
TypeError: Categorical is not ordered for operation min you can use .as_ordered() to change the Categorical to an ordered one
I'm not sure which Categorical in the pivoted table the error refers to, but I tried .as_ordered() on the index:
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p = ddf_p.index.cat.as_ordered()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])
but that gives me the error:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
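Re-reading that attempt, I suspect the second error is just because I replaced ddf_p with its index, so ddf_p[col] no longer selects DataFrame columns. What I was actually aiming for is roughly the sketch below (untested; the map_partitions/set_index pattern for making the index ordered is my own assumption, not something I've confirmed dask supports this way):

from dask_ml.preprocessing import MinMaxScaler

# Hypothetical variant: keep ddf_p as a DataFrame and only make its
# categorical index ordered, partition by partition.
ddf_p = ddf_p.map_partitions(lambda pdf: pdf.set_index(pdf.index.as_ordered()))

scaler_d = MinMaxScaler()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])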
Additional information
test.csv
index,item,name,value
2015-01-01,item_1,A,1
2015-01-01,item_1,B,2
2015-01-01,item_1,C,3
2015-01-01,item_1,D,4
2015-01-01,item_1,E,5
2015-01-02,item_2,A,10
2015-01-02,item_2,B,20
2015-01-02,item_2,C,30
2015-01-02,item_2,D,40