[ Dummy variables from levels of other data frame ]
I'd like to one-hot encode a data frame based on levels taken from another data frame. For instance, in the example below, data provides the levels for two variables. Based on those levels only, I want to create dummy variables in data2.
How can I go about this?
import pandas as pd
#unique levels (A,B for VAR1, and X,Y,Z for VAR2) in
#this dataset determine the possible levels for the following dataset
data = {'VAR1': ['A', 'A', 'A', 'A','B', 'B'],
'VAR2': ['X', 'Y', 'Y', 'Y','X', 'Z']}
frame = pd.DataFrame(data)
#data2 contains same variables as data, but might or might not
#contain same levels
data2 = {'VAR1': ['A', 'C'],
'VAR2': ['X', 'Y']}
frame2 = pd.DataFrame(data2)
#after applying one hot encoding to data2, this is what it should look like
data_final = {
'A': ['1', '0'],
'B': ['0', '0'],
'X': ['1', '0'],
'Y': ['0', '1'],
'Z': ['0', '0'],
}
frame_final = pd.DataFrame(data_final)
Answer 1
There are probably a lot of ways to achieve this. For whatever reason, I'm drawn to this approach:
In [74]: part = pd.concat([pd.get_dummies(frame2[x]) for x in frame2], axis=1)
In [75]: part
Out[75]:
A C X Y
0 1 0 1 0
1 0 1 0 1
You can see we are already almost there. The only missing columns are those that don't show up anywhere in frame2: B and Z. Again, there would be multiple ways to get these added in (I'd be curious to hear of any you think are more suitable), but I wanted to use the reindex_axis method. (On pandas 1.0 and later reindex_axis has been removed; reindex(columns=...) does the same job.) To use this, we need another index containing all the possible values. The line below assumes numpy has been imported as np.
In [76]: idx = pd.Index(np.ravel(frame.values)).unique()
In [77]: idx
Out[77]: array(['A', 'X', 'Y', 'B', 'Z'], dtype=object)
Finally, reindex and fill the NaNs with 0. Reindexing also drops C, which is not a level in frame:
In [78]: part.reindex_axis(idx, axis=1).fillna(0)
Out[78]:
A X Y B Z
0 1 1 0 0 0
1 0 0 1 0 0
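For anyone on a recent pandas, here is a minimal sketch of the same steps as a plain script rather than an IPython session. It assumes reindex(columns=...) as the replacement for reindex_axis, and the names levels and result are mine, not part of the question.

```python
import pandas as pd

# Reference frame: its values define the allowed levels.
frame = pd.DataFrame({'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
                      'VAR2': ['X', 'Y', 'Y', 'Y', 'X', 'Z']})
# Frame to encode: may contain unseen levels (C) and miss known ones (B, Z).
frame2 = pd.DataFrame({'VAR1': ['A', 'C'],
                       'VAR2': ['X', 'Y']})

# One block of dummy columns per variable, placed side by side.
part = pd.concat([pd.get_dummies(frame2[col]) for col in frame2], axis=1)

# Every value seen in the reference frame, in order of first appearance.
# Note: this flattens all columns together, so it assumes no two
# variables share a level name.
levels = pd.unique(frame.to_numpy().ravel())

# Keep only known levels (drops C) and add missing ones (B, Z) as zeros.
result = part.reindex(columns=levels, fill_value=0).astype(int)
print(result)
```

Passing fill_value=0 avoids the intermediate NaNs, so the separate fillna step is not needed here.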
You can sort the columns afterward if you need a particular order.
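One of the other ways mentioned above, sketched under the assumption that you are happy to handle each variable separately: turn each column of frame2 into a categorical whose categories come from frame, then let get_dummies expand every category. The names encoded and dummies are only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
                      'VAR2': ['X', 'Y', 'Y', 'Y', 'X', 'Z']})
frame2 = pd.DataFrame({'VAR1': ['A', 'C'],
                       'VAR2': ['X', 'Y']})

encoded = frame2.copy()
for col in frame.columns:
    # Categories come from the reference frame only, so a value that
    # is not among them (C here) becomes NaN.
    encoded[col] = pd.Categorical(frame2[col],
                                  categories=sorted(frame[col].unique()))

# get_dummies emits a column for every category, observed or not;
# a NaN row gets zeros across that variable's columns.
# An empty prefix and separator keep the bare level names as column names.
dummies = pd.get_dummies(encoded, prefix='', prefix_sep='').astype(int)
print(dummies)
```

This keeps the levels of each variable apart, so it still works if two variables happen to share a level name (though you would then want to keep the default prefixes to tell the columns apart).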