[ Denormalise the data using Pandas ]
If I have data of the following format (stored in a pandas dataframe), essentially a normalised form of categories and wares to a slug:
pandas.DataFrame:
categories slug wares
0 [developer, mac, web] alex.payne [macbook-pro, cinema-display, readynas-nv-plus...
1 [mac, musician] jona.bechtolt [audio-kontrol-1, powershot-sd1000, live, mda-...
2 [game, suit, windows] gabe.newell [oa-desk, beyond-tv, windows-xp, office, visua...
3 [developer, mac, software] steven.frank [mac-pro, macbook-air, apple-tv, itunes, addre...
And my intension is to plot graphs of categories correlated with wares, I'd need the data in a denormalised format, in some such format:
categories wares slug
0 developer macbook-pro alex.payne
1 mac macbook-pro alex.payne
2 web macbook-pro alex.payne
3 developer cinema-display alex.payne
4 mac cinema-display alex.payne
5 web cinema-display alex.payne
6 developer readynas-nv-plus alex.payne
What is the best way to convert the data from the format above to the one below, preferably the one, that also leverages the internals of numpy, so it is fast.
My approach was to this was rather a naive one, looping through each row in the data frame, maintaining a list of tuples and then passing it over to the pandas.DataFrame constructor. Any suggestion of yours will probably end up being faster and better, so suggest away!
I am also thinking about alternative representations of such a data in the pandas DataFrame, specifically, a sparse matrix. But I think this would be better particularly for groupby queries. If there are other formats or if sparse matrix renders it better for such aggregation queries, suggest how to go about it.
Here is the entire thing, for those interested: http://j.mp/lp-usesthis I ended up not doing the denormalisation the way I originally intended, instead looped over column of interest only. But any ability to denormalise better would make it better.
Answer 1
Firstly I strongly recommend that you don't store you data like this in the first place, pandas isn't really meant for general objects like lists.
Here's one way to extract the data (using joins, similar to Dan Allen's answer).
def denormalise(df, *colnames):
df1 = df.copy() # optional, but means we're not changing df globally
cols = [(colname, df1.pop(colname).apply(pd.Series).stack()) for colname in colnames]
for colname, c in cols:
c.index = c.index.droplevel(-1)
c.name = colname
df1 = df1.join(c)
return df1
# optionally .reindex_axis(df.columns, axis=1) # reorder columns
# and .reset_index(drop=True) # 0,1,...n index
In use:
In [11]: denormalise(df1, 'wares')
Out[11]:
categories slug wares
0 [developer, mac, web] alex.payne macbook-pro
0 [developer, mac, web] alex.payne cinema-display
0 [developer, mac, web] alex.payne readynas-nv-plus
1 [mac, musician] jona.bechtolt audio-kontrol-1
1 [mac, musician] jona.bechtolt powershot-sd1000
1 [mac, musician] jona.bechtolt live
Pandas allows you to add this as a DataFrame method for convenience:
In [12]: pd.DataFrame.denormalise = denormalise
In [13]: df1.denormalise('wares', 'categories')
Out[13]:
slug wares categories
0 alex.payne macbook-pro developer
0 alex.payne macbook-pro mac
0 alex.payne macbook-pro web
0 alex.payne cinema-display developer
0 alex.payne cinema-display mac
0 alex.payne cinema-display web
0 alex.payne readynas-nv-plus developer
0 alex.payne readynas-nv-plus mac
0 alex.payne readynas-nv-plus web
1 jona.bechtolt audio-kontrol-1 mac
1 jona.bechtolt audio-kontrol-1 musician
1 jona.bechtolt powershot-sd1000 mac
1 jona.bechtolt powershot-sd1000 musician
1 jona.bechtolt live mac
1 jona.bechtolt live musician
In [14]: df1.denormalise('wares', 'categories').reset_index(drop=True)
Out[14]:
slug wares categories
0 alex.payne macbook-pro developer
1 alex.payne macbook-pro mac
2 alex.payne macbook-pro web
3 alex.payne cinema-display developer
4 alex.payne cinema-display mac
5 alex.payne cinema-display web
6 alex.payne readynas-nv-plus developer
7 alex.payne readynas-nv-plus mac
8 alex.payne readynas-nv-plus web
9 jona.bechtolt audio-kontrol-1 mac
10 jona.bechtolt audio-kontrol-1 musician
11 jona.bechtolt powershot-sd1000 mac
12 jona.bechtolt powershot-sd1000 musician
13 jona.bechtolt live mac
14 jona.bechtolt live musician