
[ Dummy variables from levels of other data frame ]

I'd like to be able to do one-hot encoding on a data frame based on the levels found in another data frame. For instance, in the example below, data provides the levels for two variables. I want to create dummy variables in data2 based on those levels only, so a level that appears in data2 but not in data (such as C) gets no column.

How can I go about this?

import pandas as pd

# Unique levels (A, B for VAR1 and X, Y, Z for VAR2) in this dataset
# determine the possible levels for the following dataset
data = {'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
        'VAR2': ['X', 'Y', 'Y', 'Y', 'X', 'Z']}

frame = pd.DataFrame(data)

# data2 contains the same variables as data, but might or might not
# contain the same levels
data2 = {'VAR1': ['A', 'C'],
         'VAR2': ['X', 'Y']}

frame2 = pd.DataFrame(data2)

# After applying one-hot encoding to data2, this is what it should look like
data_final = {
    'A': ['1', '0'],
    'B': ['0', '0'],
    'X': ['1', '0'],
    'Y': ['0', '1'],
    'Z': ['0', '0'],
}

frame_final = pd.DataFrame(data_final)

Answer 1


There are probably a lot of ways to achieve this. For whatever reason I'm drawn to this approach:

In [74]: part = pd.concat([pd.get_dummies(frame2[x]) for x in frame2], axis=1)

In [75]: part
Out[75]: 
   A  C  X  Y
0  1  0  1  0
1  0  1  0  1

You can see we are almost there already: the only missing columns are those that never show up in frame2, B and Z. Again, there are multiple ways to add them (I'd be curious to hear of any you think are more suitable), but I wanted to use the reindex_axis method. To use it, we need another index containing all the possible values (np here is NumPy, imported with import numpy as np).

In [76]: idx = pd.Index(np.ravel(frame.values)).unique()

In [77]: idx
Out[77]: array(['A', 'X', 'Y', 'B', 'Z'], dtype=object)

Finally, reindex the columns, which adds B and Z and drops C (it is not a level in frame), and fill the NaNs with 0:

In [78]: part.reindex_axis(idx, axis=1).fillna(0)
Out[78]: 
   A  X  Y  B  Z
0  1  1  0  0  0
1  0  0  1  0  0
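
On current pandas the session above will not run as written, because reindex_axis was deprecated in 0.21 and removed in 1.0. A minimal sketch of the same steps using reindex(columns=...) follows. The names levels and result and the dtype=int argument are my own additions rather than part of the original answer, and the sketch assumes that no level name is shared between VAR1 and VAR2 (reindex needs unique column labels).

```python
import pandas as pd

frame = pd.DataFrame({'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
                      'VAR2': ['X', 'Y', 'Y', 'Y', 'X', 'Z']})
frame2 = pd.DataFrame({'VAR1': ['A', 'C'],
                       'VAR2': ['X', 'Y']})

# Dummy columns for whatever levels happen to occur in frame2 (A, C, X, Y).
# dtype=int keeps 0/1 integers instead of the boolean default of newer pandas.
part = pd.concat([pd.get_dummies(frame2[col], dtype=int) for col in frame2],
                 axis=1)

# Every level seen anywhere in the reference frame, in order of first
# appearance when reading the values row by row.
levels = pd.unique(frame.to_numpy().ravel())

# Reindexing the columns adds the missing levels (filled with 0) and drops
# columns such as C that are not levels of the reference frame.
result = part.reindex(columns=levels, fill_value=0)
print(result)
```

Passing fill_value=0 avoids the intermediate NaNs, so the columns stay integer typed and no fillna step is needed.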

You can sort the columns afterwards if necessary.
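
Another option, which also yields sorted columns directly, is to declare the allowed levels up front with a categorical dtype and let get_dummies create one column per declared category. This is only a sketch of an alternative, not the approach used in the answer; ref_levels, typed and encoded are hypothetical names, and the empty prefix assumes that level names do not collide across variables.

```python
import pandas as pd

frame = pd.DataFrame({'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
                      'VAR2': ['X', 'Y', 'Y', 'Y', 'X', 'Z']})
frame2 = pd.DataFrame({'VAR1': ['A', 'C'],
                       'VAR2': ['X', 'Y']})

# Allowed levels per variable, taken from the reference frame only.
ref_levels = {col: sorted(frame[col].unique()) for col in frame}

# Cast each column of frame2 to a categorical restricted to those levels.
# Values outside the set (here 'C') become NaN.
typed = frame2.astype({col: pd.CategoricalDtype(categories=lv)
                       for col, lv in ref_levels.items()})

# get_dummies emits one column per declared category, including categories
# that never occur in frame2; NaN rows get zeros in every column of that
# variable. The empty prefix keeps the bare level names as column labels.
encoded = pd.get_dummies(typed, prefix='', prefix_sep='', dtype=int)
print(encoded)
```

If two variables can share a level name, dropping the prefix and prefix_sep arguments gives labels such as VAR1_A and VAR2_X instead, which keeps the columns unambiguous.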