Summary
I am trying to iterate over a large dataframe: identify unique groups based on several columns, then replace another column with the sum of its values divided by the number of rows in each group. My current approach is very slow when iterating over a large dataset and applying the average across many columns. Is there a way I can do this more efficiently?
Example
Here's an example of the problem. I want to find unique combinations of ['A', 'B', 'C']. For each unique combination, I want to replace column 'D' with the sum of 'D' divided by the number of rows in the group (i.e. the group mean of 'D').
Edit: The resulting dataframe should preserve the duplicated groups, but with column 'D' replaced as described above.
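To make the expected output concrete, here is a small made-up frame (the values are purely illustrative) and the result I am after: every row is kept, but 'D' holds the per-group mean.

import pandas as pd

# Toy input: the first two rows share the group (A=1, B=1, C=1),
# the third row is a group on its own.
toy = pd.DataFrame({'A': [1, 1, 2],
                    'B': [1, 1, 3],
                    'C': [1, 1, 4],
                    'D': [2, 4, 7]})

# Desired output: same three rows, 'D' replaced by the mean of 'D' within its group.
#    A  B  C    D
# 0  1  1  1  3.0   # mean of [2, 4]
# 1  1  1  1  3.0
# 2  2  3  4  7.0   # single-row group, value unchanged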
import pandas as pd
import numpy as np
import datetime
def time_mean_rows():
    # Generate some random data
    A = np.random.randint(0, 5, 1000)
    B = np.random.randint(0, 5, 1000)
    C = np.random.randint(0, 5, 1000)
    D = np.random.randint(0, 10, 1000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D]).T
    df.columns = ['A', 'B', 'C', 'D']

    tstart = datetime.datetime.now()

    # Get unique combinations of A, B, C
    unique_groups = df[['A', 'B', 'C']].drop_duplicates().reset_index()

    # Iterate unique groups
    normalised_solutions = []
    for idx, row in unique_groups.iterrows():
        # Subset dataframe to the unique group
        sub_df = df[
            (df['A'] == row['A']) &
            (df['B'] == row['B']) &
            (df['C'] == row['C'])
        ].copy()  # copy so the assignment below edits this subset, not a view of df

        # If more than one solution, get mean of column D
        num_solutions = len(sub_df)
        if num_solutions > 1:
            sub_df.loc[:, 'D'] = sub_df.loc[:, 'D'].values.sum(axis=0) / num_solutions
        normalised_solutions.append(sub_df)
    # Concatenate results
    res = pd.concat(normalised_solutions)

    tend = datetime.datetime.now()
    time_elapsed = (tend - tstart).total_seconds()
    print(time_elapsed)
I know the section causing the slowdown is the num_solutions > 1 block. How can I do this more efficiently?
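For reference, here is a sketch of the vectorised direction I have been considering, using groupby with transform('mean'). I have not verified it against the loop above, and it differs in two small ways: it keeps the original row order (the loop concatenates group by group) and it turns 'D' into a float column.

import numpy as np
import pandas as pd

# Same random setup as in the example above.
df = pd.DataFrame({'A': np.random.randint(0, 5, 1000),
                   'B': np.random.randint(0, 5, 1000),
                   'C': np.random.randint(0, 5, 1000),
                   'D': np.random.randint(0, 10, 1000)})

# transform('mean') broadcasts each group's mean back to every row of the group,
# so duplicated groups are preserved and only 'D' changes.
res = df.copy()
res['D'] = res.groupby(['A', 'B', 'C'])['D'].transform('mean')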