[ Gradient Boosting with Sklearn ]

I want to use Sklearn's GradientBoostingRegressor class to predict values for a target variable in a regression problem. The features I have are of mixed type - some are continuous numeric, some are boolean, two are categorical, and one is a vector of continuous numbers. I am choosing gradient boosting trees specifically because the data is of mixed data types. An example of a feature vector would be:

['Category1', 41.93655, -87.642079, 0, 0, <1x822 sparse matrix of type '' with 4 stored elements in Compressed Sparse Row format>, 'mobile_app', 'NA']

However, when I try to train the GradientBoostingRegressor with fit(), I get an error saying:

ValueError: could not convert string to float: Category1

This feature's values are implemented with an enum. I just have a method:

def enum(self, **enums):
    return type('Enum', (), enums)

Then when I create my categories, I do it like this:

categories = self.enum(Category1='Category1', Category2='Category2', ...)

I guess the problem is that it is still returning the actual value as a string. But if I change the values to 0, 1, 2, etc, that would make some categories "closer" to others when they should be equidistant from all the other categories.

So does this object actually handle data of mixed type, or does everything have to be numeric? If it all has to be numeric, can anyone who has handled categorical data with this object shed light on the best way to represent the categories? Any help is appreciated.

Answer 1


Every feature must be numerical. Since gradient boosting is based on decision trees, and decision trees work based on feature splits rather than distances, the "0, 1, 2, etc." representation should actually work just fine as long as you set the max_depth parameter appropriately (grid-search it to be sure).
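To make that concrete, here is a minimal sketch under made-up assumptions: the data is synthetic, the column names and category labels are hypothetical, and the target is deliberately tied to the "middle" integer code so that isolating it takes two threshold splits. It maps the strings to 0, 1, 2 by hand and grid-searches max_depth, as suggested above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)

# Hypothetical data: one categorical column with three levels, one continuous column.
categories = np.array(["Category1", "Category2", "Category3"])
cat = rng.choice(categories, size=300)
x_num = rng.uniform(-1.0, 1.0, size=300)

# Map each category string to an integer code (0, 1, 2).
code_of = {name: i for i, name in enumerate(categories)}
cat_codes = np.array([code_of[c] for c in cat])

# The target depends on the middle code only, so a tree has to split
# on both sides of code 1 to separate it from codes 0 and 2.
y = np.where(cat == "Category2", 5.0, 0.0) + x_num + rng.normal(0, 0.1, size=300)

# All-numeric feature matrix: [integer category code, continuous feature].
X = np.column_stack([cat_codes, x_num])

# Grid-search max_depth rather than guessing it.
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [1, 2, 3]},
    cv=3,
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
train_r2 = search.score(X, y)  # R^2 of the refitted best model on the training data
print(best_depth, round(train_r2, 3))
```

The ordering of the codes is arbitrary here; the trees only need enough splits to carve out whichever code matters. With many categories that can take a lot of splits, which is why the depth is worth tuning.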

Answer 2


As Fred Foo wrote, every feature must be numerical, because the gradient boosting algorithm sorts the values of each attribute when searching for the best split.

You can convert categorical attributes into a binary representation or into integer codes. sklearn has ready-made implementations for both: sklearn.preprocessing.LabelEncoder (integer codes) and sklearn.preprocessing.LabelBinarizer (binary indicator columns).
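A small sketch of both encoders, assuming a toy column of category strings and two invented numeric columns (the values loosely echo the question's example but are otherwise hypothetical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Hypothetical categorical column and numeric columns.
values = ["Category1", "Category2", "Category3", "Category1"]
numeric = np.array([
    [41.93, -87.64],
    [40.71, -74.00],
    [34.05, -118.24],
    [41.88, -87.63],
])
y = np.array([1.0, 2.0, 3.0, 1.5])

# Integer codes: one column, classes numbered in sorted order.
codes = LabelEncoder().fit_transform(values)

# Binary representation: one indicator column per category,
# so no category ends up "closer" to another.
one_hot = LabelBinarizer().fit_transform(values)

# Stack the indicator columns next to the numeric ones and fit as usual.
X = np.hstack([one_hot, numeric])
model = GradientBoostingRegressor(n_estimators=10, random_state=0).fit(X, y)
predictions = model.predict(X)
```

Note that LabelBinarizer returns a single 0/1 column when there are only two classes. Both classes are documented as encoders for target labels; for feature columns, sklearn.preprocessing.OneHotEncoder is the feature-oriented alternative and is probably the better fit when several categorical columns need encoding at once.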