
Summary

I am iterating over a large dataframe: I identify unique groups based on several columns and, for each group, replace another column with its mean over the rows in that group. My current approach is very slow when iterating over a large dataset and averaging across many columns. Is there a way I can do this more efficiently?

Example

Here's an example of the problem. I want to find unique combinations of ['A', 'B', 'C']. For each unique combination, I want the sum of column 'D' divided by the number of rows in the group, i.e. the group mean.

Edit: The resulting dataframe should preserve the duplicated groups, but with column 'D' replaced by the group mean.
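To make the goal concrete, here is a toy illustration (the data and numbers are mine, not from the original post): every row of a duplicated ('A', 'B', 'C') group is kept, but its 'D' value is replaced by the group mean.

import pandas as pd

toy = pd.DataFrame({'A': [1, 1, 2], 'B': [0, 0, 1], 'C': [2, 2, 3], 'D': [4, 6, 7]})
# Rows 0 and 1 share the same (A, B, C), so both should end up with D = (4 + 6) / 2 = 5.
# Row 2 is alone in its group; in the loop below it is simply skipped (num_solutions == 1).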

import pandas as pd
import numpy as np
import datetime

def time_mean_rows():
    # Generate some random data
    A = np.random.randint(0, 5, 1000)
    B = np.random.randint(0, 5, 1000)
    C = np.random.randint(0, 5, 1000)
    D = np.random.randint(0, 10, 1000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D]).T
    df.columns = ['A', 'B', 'C', 'D']


    tstart = datetime.datetime.now()

    # Get unique combinations of A, B, C
    unique_groups = df[['A', 'B', 'C']].drop_duplicates().reset_index()

    # Iterate unique groups
    normalised_solutions = []
    for idx, row in unique_groups.iterrows():
        # Subset dataframe to the unique group
        sub_df = df[
            (df['A'] == row['A']) &
            (df['B'] == row['B']) & 
            (df['C'] == row['C'])
            ]

        # If more than one solution, get mean of column D
        num_solutions = len(sub_df)        
        if num_solutions > 1:
            sub_df.loc[:, 'D'] = sub_df.loc[:,'D'].values.sum(axis=0) / num_solutions
            normalised_solutions.append(sub_df)

    # Concatenate results
    res = pd.concat(normalised_solutions)

    tend = datetime.datetime.now()
    time_elapsed = (tend - tstart).total_seconds()
    print(time_elapsed)

I know the section causing the slowdown is the num_solutions > 1 branch. How can I do this more efficiently?

Behzad

3 Answers


Hm, why don't you use groupby?

df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index() 
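For context, a minimal usage sketch (the seed and toy data here are assumed, not part of the original answer); the one-liner returns one row per unique ('A', 'B', 'C') combination, with 'D' holding the group mean:

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(1000, 4)), columns=['A', 'B', 'C', 'D'])

df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index()
print(df_res.head())  # one row per unique (A, B, C); 'D' holds the group mean

Note that this collapses each group to a single row, so by itself it does not preserve the duplicated rows requested in the question's edit; the answers below handle that part.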

AT_asks

This is a complement to AT_asks's answer, which only gave the first part of the solution.

Once we have df.groupby(['A', 'B', 'C'])['D'].mean(), we can use it to change the value of column 'D' in a copy of the original dataframe, provided we work on a dataframe sharing the same index. The global solution is then:

res = df.set_index(['A', 'B', 'C']).assign(
    D=df.groupby(['A', 'B', 'C'])['D'].mean()).reset_index()

This will contain the same rows (even if in a different order) as the res dataframe from the OP's question.
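The reason this works is index alignment: after set_index(['A', 'B', 'C']) the frame is indexed by the group keys, and the grouped Series passed to assign is indexed by the same keys, so each per-group mean is broadcast onto every matching row. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [0, 0, 1], 'C': [2, 2, 3], 'D': [4, 6, 7]})
means = df.groupby(['A', 'B', 'C'])['D'].mean()                    # Series indexed by (A, B, C)
res = df.set_index(['A', 'B', 'C']).assign(D=means).reset_index()  # aligns on that index
# All three rows are kept; the first two now both have D = 5.0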

Serge Ballesta
  • Thanks Serge, this is a great solution to my question. I found this post helpful for expanding this to iterating over multiple columns: https://stackoverflow.com/a/41759638/1470354 – Behzad May 23 '19 at 11:07

Here's a solution I found

I use groupby as suggested by AT_asks, then merge the result back onto the original df and drop the original ['D', 'E'] columns. Nice speedup!

from timeit import default_timer as timer  # assuming the standard timeit timer for `timer()` below

def time_mean_rows():
    # Generate some random data
    np.random.seed(seed=42)
    A = np.random.randint(0, 10, 10000)
    B = np.random.randint(0, 10, 10000)
    C = np.random.randint(0, 10, 10000)
    D = np.random.randint(0, 10, 10000)
    E = np.random.randint(0, 10, 10000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D, E]).T
    df.columns = ['A', 'B', 'C', 'D', 'E']

    tstart_grpby = timer()
    cols = ['D', 'E']

    group_df = df.groupby(['A', 'B', 'C'])[cols].mean().reset_index()

    # Merge df
    df = pd.merge(df, group_df, how='left', on=['A', 'B', 'C'], suffixes=('_left', ''))

    # Get left columns (have not been normalised) and drop
    drop_cols = [x for x in df.columns if x.endswith('_left')]
    df.drop(drop_cols, inplace=True, axis='columns')

    tend_grpby = timer()
    time_elapsed_grpby = tend_grpby - tstart_grpby  # default_timer already returns seconds
    print(time_elapsed_grpby)
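As an aside (this sketch is mine, not part of the original answer), groupby(...).transform('mean') broadcasts the group means back onto the original rows directly, which avoids the merge and the column clean-up:

cols = ['D', 'E']
df[cols] = df.groupby(['A', 'B', 'C'])[cols].transform('mean')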
Behzad