
Summary

I am iterating over a large dataframe: I identify unique groups based on several columns and, for each group, replace another column with its mean over the rows in that group. My current approach is very slow when iterating over a large dataset and averaging across many columns. Is there a way I can do this more efficiently?

Example

Here's an example of the problem. I want to find unique combinations of ['A', 'B', 'C']. For each unique combination, I want the sum of column 'D' divided by the number of rows in the group, i.e. the group mean.

Edit: The resulting dataframe should preserve the duplicated groups, but with column 'D' replaced by the group mean.
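To make the goal concrete, here is a toy illustration (the data and numbers are mine, not from the original post): every row of a duplicated ('A', 'B', 'C') group is kept, but its 'D' value is replaced by the group mean.

import pandas as pd

toy = pd.DataFrame({'A': [1, 1, 2], 'B': [0, 0, 1], 'C': [2, 2, 3], 'D': [4, 6, 7]})
# Rows 0 and 1 share the same (A, B, C), so both should end up with D = (4 + 6) / 2 = 5.
# Row 2 is alone in its group; in the loop below it is simply skipped (num_solutions == 1).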

import pandas as pd
import numpy as np
import datetime

def time_mean_rows():
    # Generate some random data
    A = np.random.randint(0, 5, 1000)
    B = np.random.randint(0, 5, 1000)
    C = np.random.randint(0, 5, 1000)
    D = np.random.randint(0, 10, 1000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D]).T
    df.columns = ['A', 'B', 'C', 'D']


    tstart = datetime.datetime.now()

    # Get unique combinations of A, B, C
    unique_groups = df[['A', 'B', 'C']].drop_duplicates().reset_index()

    # Iterate unique groups
    normalised_solutions = []
    for idx, row in unique_groups.iterrows():
        # Subset dataframe to the unique group
        sub_df = df[
            (df['A'] == row['A']) &
            (df['B'] == row['B']) & 
            (df['C'] == row['C'])
            ]

        # If more than one solution, get mean of column D
        num_solutions = len(sub_df)        
        if num_solutions > 1:
            sub_df.loc[:, 'D'] = sub_df.loc[:,'D'].values.sum(axis=0) / num_solutions
            normalised_solutions.append(sub_df)

    # Concatenate results
    res = pd.concat(normalised_solutions)

    tend = datetime.datetime.now()
    time_elapsed = (tend - tstart).total_seconds()
    print(time_elapsed)

I know the section causing the slowdown is the num_solutions > 1 branch. How can I do this more efficiently?

Behzad

3 Answers


Hm, why don't you use groupby?

df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index() 
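For context, a minimal usage sketch (the seed and toy data here are assumed, not part of the original answer); the one-liner returns one row per unique ('A', 'B', 'C') combination, with 'D' holding the group mean:

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(1000, 4)), columns=['A', 'B', 'C', 'D'])

df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index()
print(df_res.head())  # one row per unique (A, B, C); 'D' holds the group mean

Note that this collapses each group to a single row, so by itself it does not preserve the duplicated rows requested in the question's edit; the answers below handle that part.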

AT_asks

This is a complement to AT_asks's answer, which only gave the first part of the solution.

Once we have df.groupby(['A', 'B', 'C'])['D'].mean(), we can use it to change the value of column 'D' in a copy of the original dataframe, provided we work on a dataframe sharing the same index. The global solution is then:

res = df.set_index(['A', 'B', 'C']).assign(
    D=df.groupby(['A', 'B', 'C'])['D'].mean()).reset_index()

This will contain the same rows (even if in a different order) as the res dataframe from the OP's question.
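The reason this works is index alignment: after set_index(['A', 'B', 'C']) the frame is indexed by the group keys, and the grouped Series passed to assign is indexed by the same keys, so each per-group mean is broadcast onto every matching row. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [0, 0, 1], 'C': [2, 2, 3], 'D': [4, 6, 7]})
means = df.groupby(['A', 'B', 'C'])['D'].mean()                    # Series indexed by (A, B, C)
res = df.set_index(['A', 'B', 'C']).assign(D=means).reset_index()  # aligns on that index
# All three rows are kept; the first two now both have D = 5.0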

Serge Ballesta
  • Thanks Serge, this is a great solution to my question. I found this post helpful for expanding this to iterating over multiple columns: https://stackoverflow.com/a/41759638/1470354 – Behzad May 23 '19 at 11:07

Here's a solution I found

I use groupby as suggested by AT_asks, then merge the result back onto the original df and drop the original ['D', 'E'] columns. Nice speedup!

from timeit import default_timer as timer  # assuming the standard timeit timer for `timer()` below

def time_mean_rows():
    # Generate some random data
    np.random.seed(seed=42)
    A = np.random.randint(0, 10, 10000)
    B = np.random.randint(0, 10, 10000)
    C = np.random.randint(0, 10, 10000)
    D = np.random.randint(0, 10, 10000)
    E = np.random.randint(0, 10, 10000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D, E]).T
    df.columns = ['A', 'B', 'C', 'D', 'E']

    tstart_grpby = timer()
    cols = ['D', 'E']

    group_df = df.groupby(['A', 'B', 'C'])[cols].mean().reset_index()

    # Merge df
    df = pd.merge(df, group_df, how='left', on=['A', 'B', 'C'], suffixes=('_left', ''))

    # Get left columns (have not been normalised) and drop
    drop_cols = [x for x in df.columns if x.endswith('_left')]
    df.drop(drop_cols, inplace=True, axis='columns')

    tend_grpby = timer()
    time_elapsed_grpby = tend_grpby - tstart_grpby  # default_timer already returns seconds
    print(time_elapsed_grpby)
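As an aside (this sketch is mine, not part of the original answer), groupby(...).transform('mean') broadcasts the group means back onto the original rows directly, which avoids the merge and the column clean-up:

cols = ['D', 'E']
df[cols] = df.groupby(['A', 'B', 'C'])[cols].transform('mean')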
Behzad