I wrote this bit of code so I can group any Pandas DataFrame and get the group size and a sample row of the dataframe as a quick result.
It works well, with one problem:
The name of the new column/index, "Size", is hard-coded, because as far as I can tell .assign( ... ) takes the column name as a keyword argument rather than a variable. So if my DataFrame already has a column named "Size", that column gets overwritten.
My plan is to check whether a column named "Size" exists and, if it does,
use a different name for the index. Can I use assign with a variable
for the column name instead of a fixed keyword?
I'd like to avoid a hacky solution such as renaming columns back and forth.
import pandas as pd

try:
    from pandas.api.extensions import register_dataframe_accessor
except ImportError:
    raise ImportError('Pandas 0.24 or better needed')


@register_dataframe_accessor("cgrp")
class CustomGrouper:
    """Extra methods for dataframes."""

    def __init__(self, df):
        self._df = df

    def group_sample(self, by, subset=None):
        result = (
            self._df.groupby(by)
            .apply(lambda x: x.sample(1).assign(Size=len(x)))
            .set_index('Size')
            .sort_index(ascending=False)
        )
        return result
I can call it like this:
df.cgrp.group_sample(by=['column1', ... ])
and get a result whose index is named "Size".
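
To make the plan concrete, here is a rough sketch of the behaviour I'm after. It relies on the fact that assign accepts arbitrary keyword arguments, so a dict with a computed key can be unpacked into it with `**`. The helper names `pick_free_name` and `group_sample` (as a plain function rather than the accessor method) are made up for illustration, and it uses `groupby(...).sample(1)` plus `transform("size")` instead of my original `apply`, which assumes pandas 1.1 or newer.

```python
import pandas as pd


def pick_free_name(df, base="Size"):
    """Return `base`, extended with underscores until it is not a column of df."""
    name = base
    while name in df.columns:
        name += "_"
    return name


def group_sample(df, by, size_name=None):
    """Hypothetical standalone variant: one sample row per group, indexed by group size."""
    keys = [by] if isinstance(by, str) else list(by)
    name = size_name or pick_free_name(df)
    # Group size for every row, aligned with df's index.
    sizes = df.groupby(keys)[keys[0]].transform("size")
    return (
        df.assign(**{name: sizes})  # column name comes from a variable
        .groupby(keys)
        .sample(1)                  # one random row per group
        .set_index(name)
        .sort_index(ascending=False)
    )


df = pd.DataFrame({"column1": ["a", "a", "b"], "Size": [10, 20, 30]})
out = group_sample(df, by=["column1"])
print(out)
```

With this input the index ends up named "Size_" and the original "Size" column is kept, which is the outcome I want without any renaming afterwards.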