[ Pandas Label Duplicates ]
Given the following data frame:
import pandas as pd
d = pd.DataFrame({'label': [1, 2, 2, 2, 3, 4, 4],
                  'values': [3, 5, 7, 2, 5, 8, 3]})
d
   label  values
0      1       3
1      2       5
2      2       7
3      2       2
4      3       5
5      4       8
6      4       3
I know how to count the number of rows for each label like this:
d['dup']=d.groupby('label')['label'].transform('count')
Which results in:
   label  values  dup
0      1       3    1
1      2       5    3
2      2       7    3
3      2       2    3
4      3       5    1
5      4       8    2
6      4       3    2
But what I would like is a column with the following values: 1 if the label occurs in only one row, 2 if the label has duplicates and the row in question is the first of them, and 0 if the row is a duplicate of an earlier row. Like this:
   label  values  dup  status
0      1       3    1       1
1      2       5    3       2
2      2       7    3       0
3      2       2    3       0
4      3       5    1       1
5      4       8    2       2
6      4       3    2       0
Thanks in advance!
Answer 1
I think you can use loc with a condition created by the function duplicated:
d['status'] = 2
d.loc[d.dup == 1, 'status'] = 1
d.loc[d.label.duplicated(), 'status'] = 0
print (d)
   label  values  dup  status
0      1       3    1       1
1      2       5    3       2
2      2       7    3       0
3      2       2    3       0
4      3       5    1       1
5      4       8    2       2
6      4       3    2       0
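For reference, here is a self-contained sketch of the same loc idea that does not need the dup column. It assumes duplicated(keep=False), which flags every row whose label occurs more than once; the names has_dups and is_repeat are only illustrative.

```python
import pandas as pd

d = pd.DataFrame({'label': [1, 2, 2, 2, 3, 4, 4],
                  'values': [3, 5, 7, 2, 5, 8, 3]})

# True for every row whose label occurs more than once
has_dups = d['label'].duplicated(keep=False)
# True for every occurrence of a label after its first one
is_repeat = d['label'].duplicated()

d['status'] = 2                    # default: first row of a duplicated label
d.loc[~has_dups, 'status'] = 1     # label occurs exactly once
d.loc[is_repeat, 'status'] = 0     # later occurrence of a label
print(d)
```

The order of the assignments does not matter here, because a row with a unique label is never flagged by duplicated().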
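Or a nested numpy.where:
import numpy as np
# 1 for unique labels, 0 for repeated rows, 2 for the first row of a duplicated label
d['status1'] = np.where(d.dup == 1, 1,
                        np.where(d.label.duplicated(), 0, 2))
print(d)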
   label  values  dup  status  status1
0      1       3    1       1        1
1      2       5    3       2        2
2      2       7    3       0        0
3      2       2    3       0        0
4      3       5    1       1        1
5      4       8    2       2        2
6      4       3    2       0        0
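If the nesting gets hard to read, numpy.select is a possible alternative (this is a different function from the one used above, not part of the original answer). A minimal sketch, with counts, conditions and choices as illustrative names:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'label': [1, 2, 2, 2, 3, 4, 4],
                  'values': [3, 5, 7, 2, 5, 8, 3]})

# Number of rows sharing each row's label
counts = d.groupby('label')['label'].transform('count')

# Conditions are checked in order; the first one that is True wins
conditions = [counts == 1, d['label'].duplicated()]
choices = [1, 0]

# Rows matching neither condition are the first row of a duplicated label
d['status'] = np.select(conditions, choices, default=2)
print(d)
```

With more than two or three cases this tends to scale better than stacking np.where calls, since each condition sits next to its value.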
Answer 2
Another option is to clip the count column at 2, then subtract 2 times duplicated. Since duplicated uses keep='first' by default, every row except the first one for a duplicated label is reduced to zero.
d['status'] = d['dup'].clip(upper=2) - 2*d.duplicated(subset='label')
The resulting output:
   label  values  dup  status
0      1       3    1       1
1      2       5    3       2
2      2       7    3       0
3      2       2    3       0
4      3       5    1       1
5      4       8    2       2
6      4       3    2       0
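To see why the arithmetic works, it can help to look at the two intermediate Series. This is only a sketch: it spells the clipping as Series.clip(upper=2), which is what current pandas provides, and clipped and repeat are illustrative names.

```python
import pandas as pd

d = pd.DataFrame({'label': [1, 2, 2, 2, 3, 4, 4],
                  'values': [3, 5, 7, 2, 5, 8, 3]})
d['dup'] = d.groupby('label')['label'].transform('count')

# Counts capped at 2: 1 for unique labels, 2 for any duplicated label
clipped = d['dup'].clip(upper=2)
# True for every row after the first one with the same label
repeat = d.duplicated(subset='label')

# Booleans act as 0/1, so repeated rows lose 2 and end up at 0
d['status'] = clipped - 2 * repeat
print(pd.DataFrame({'clipped': clipped, 'repeat': repeat, 'status': d['status']}))
```

A repeated row always has a count of at least 2, so its clipped value is exactly 2 and the subtraction never goes negative.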