I wrote this bit of code so I can group any Pandas DataFrame and get the group size and a sample row of the dataframe as a quick result.

It works great, with one problem: the name of the new column/index, "Size", is fixed, because I pass it to .assign( ... ) as a literal keyword and that doesn't seem to take variables. So if my DataFrame already has a column named "Size", that column is overwritten and lost.

My plan is to check whether a column named "Size" exists and, if it does, use a different name for the index. Can I use the assign command with a variable for the field name instead of fixed text?

I'd like to avoid a hacky solution such as renaming columns several times.

import pandas as pd
try:
    from pandas.api.extensions import register_dataframe_accessor
except ImportError:
    raise ImportError('Pandas 0.24 or better needed')

@register_dataframe_accessor("cgrp")
class CustomGrouper:
    """Extra methods for dataframes."""

    def __init__(self, df):
        self._df = df

    def group_sample(self, by, subset=None):
        # One random row per group, indexed by group size (largest first)
        result = (
            self._df.groupby(by)
            .apply(lambda x: x.sample(1).assign(Size=len(x)))
            .set_index('Size')
            .sort_index(ascending=False)
        )
        return result

I can call it like this:

df.cgrp.group_sample(by=['column1', ... ])

and get a result indexed by "Size".

576i

1 Answer

The basic idea is to use dictionary unpacking. Instead of hard coding a name in the assign function:

.assign(Size = len(x))

You can use dictionary unpacking to specify a variable name:

.assign(**{col_name: len(x)})
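
To make the difference concrete, here is a minimal sketch on a toy DataFrame. The frame and the name "GroupSize" are made up for illustration and are not part of the question's code:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [10, 20, 30]})

col_name = "GroupSize"  # hypothetical name chosen at runtime

# Keyword form: the new column is literally called "col_name"
fixed = df.assign(col_name=len(df))

# Unpacked dict: the new column is called whatever col_name holds
dynamic = df.assign(**{col_name: len(df)})

print(list(fixed.columns))    # ['A', 'B', 'col_name']
print(list(dynamic.columns))  # ['A', 'B', 'GroupSize']
```

This works because assign accepts arbitrary keyword arguments, and `**` turns each dictionary key into a keyword at call time.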

I took the liberty of modifying your group_sample function to add two features: the user can specify a custom name, and if they don't, one is chosen from a list of defaults:

def group_sample(self, by, subset=None, col_name=None):
    _col_name = None

    if col_name is not None:
        # If the user specifies a column name, use it
        # Raise error if the column already exists
        if col_name in self._df.columns:
            raise ValueError(f"Dataframe already has column '{col_name}'")
        else:
            _col_name = col_name
    else:
        # Choose from a list of default names
        _col_name = next((name for name in ['Size', 'Size_', 'Size__'] if name not in self._df.columns), None)

        if _col_name is None:
            raise ValueError('Cannot determine a default name for the size column. Please specify one manually')

    # One random row per group, indexed by group size (largest first)
    result = (
        self._df.groupby(by)
        .apply(lambda x: x.sample(1).assign(**{_col_name: len(x)}))
        .set_index(_col_name)
        .sort_index(ascending=False)
    )
    return result

Usage:

import numpy as np

df1 = pd.DataFrame(np.random.randint(1, 5, (3, 2)), columns=['A','B'])
df1.cgrp.group_sample(by=['A'])     # the column name is Size

df2 = pd.DataFrame(np.random.randint(1, 5, (3, 2)), columns=['A','Size'])
df2.cgrp.group_sample(by=['A'])     # the column name is Size_

df3 = pd.DataFrame(np.random.randint(1, 5, (3, 2)), columns=['A','B'])
df3.cgrp.group_sample(by=['A'], col_name='B')  # error, B already exists

df4 = pd.DataFrame(np.random.randint(1, 5, (3, 2)), columns=['A','B'])
df4.cgrp.group_sample(by=['A'], col_name='MySize')  # custom column name
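
The default lookup in group_sample tries only three fixed candidates ('Size', 'Size_', 'Size__') and raises if all are taken. If that limit matters, one possible variation is to keep appending underscores until a free name turns up. The helper name unique_column_name below is hypothetical, a sketch rather than part of the function above:

```python
import itertools

import pandas as pd


def unique_column_name(df, base="Size"):
    """Return base, extended with trailing underscores until it is not a column of df."""
    # df has finitely many columns, so some candidate is always free
    for n in itertools.count():
        candidate = base + "_" * n
        if candidate not in df.columns:
            return candidate


df = pd.DataFrame({"A": [1, 2], "Size": [3, 4], "Size_": [5, 6]})
print(unique_column_name(df))  # Size__
```

Inside group_sample, the result could replace the next(...) expression, which would also make the second ValueError unnecessary.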
Code Different