2

I am attempting to iterate over a dict of dataframes, modify them with a function, and then assign the returned dfs to their global vars. I would expect any value in a key-value pair of a dict to be a pointer to the variable passed. Instead, it appears to be updating only the value in the data dict. This is unexpected. What am I misunderstanding about identifiers? I found this question asked the same thing in the second half, but I did not understand the accepted answer.

Please see my demonstration below:

import pandas as pd

bids = pd.read_csv('data/as_bid_aggregated_data.csv')
plans = pd.read_csv('data/as_plan.csv')
energy_prices = pd.read_csv('data/as_bid_aggregated_data.csv')
price_vol = pd.read_csv('data/as_price_vol.csv')
generation = pd.read_csv('data/generation.csv')

data = {'bids':bids,
        'plans':plans,
        'energy_prices':energy_prices,
        'price_vol':price_vol,
        'generation':generation,
       }

I evaluate bids to demonstrate what it looks initially like after import:

bids.head().to_clipboard()
 OUTPUT: 
 note the index, date, and hr_beg cols. These should be modified for all dfs in data after processing. 
 V  V          V
    date          hr_beg    OFFNS_Unweighted Average Price  OFFNS_Max Price OFFNS_Min Price OFFNS_Total Quantity    OFFNS_Number of Bids    OFFNS_Weighted Avg Price    ONNS_Unweighted Average Price   ONNS_Max Price  ONNS_Min Price  ONNS_Total Quantity ONNS_Number of Bids ONNS_Weighted Avg Price REGDN_Unweighted Average Price  REGDN_Max Price REGDN_Min Price REGDN_Total Quantity    REGDN_Number of Bids    REGDN_Weighted Avg Price    REGUP_Unweighted Average Price  REGUP_Max Price REGUP_Min Price REGUP_Total Quantity    REGUP_Number of Bids    REGUP_Weighted Avg Price    RRSGN_Unweighted Average Price  RRSGN_Max Price RRSGN_Min Price RRSGN_Total Quantity    RRSGN_Number of Bids    RRSGN_Weighted Avg Price    RRSNC_Unweighted Average Price  RRSNC_Max Price RRSNC_Min Price RRSNC_Total Quantity    RRSNC_Number of Bids    RRSNC_Weighted Avg Price
# 0 2014-01-01  0   43.3190909090909    300.01  0.01    38144.7 22  59.51279016481975   22.016969696969696  250.0   1.0 32531.499999999985  33  36.74238980680264   20.669076923076922  500.0   0.92    71971.59999999992   65  26.577483215601717  19.744255319148944  500.0   0.01    56916.80000000003   47  27.33264099527731   20.85708333333334   500.0   0.01    107723.6    48  30.19552034094665   1.5 3.0 0.0 2236.8  2   1.5996512875536482
# 1 2014-01-01  1   43.342727272727274  300.01  0.01    38216.4 22  59.505340220428934  20.93514285714285   250.0   1.0 34781.19999999998   35  34.95683860821363   21.764761904761905  500.0   0.8 70412.39999999994   63  27.92263442234607   18.834375000000012  500.0   0.01    50201.80000000002   48  28.87979570453649   19.6692 500.0   0.01    107145.0    50  30.00068717158991   1.5 3.0 0.0 2235.8  2   1.599695858305752
# 2 2014-01-01  2   43.34818181818181   300.01  0.01    38336.9 22  59.49848289767822   20.97   250.0   1.0 34741.39999999999   35  35.091575987150776  21.836461538461545  500.0   0.58    72212.29999999992   65  28.27043938498013   18.856041666666666  500.0   0.01    50769.90000000001   48  28.61359006025224   19.5252 500.0   0.01    105503.8    50  30.27549695840339   1.5 3.0 0.0 2236.2  2   1.5996780252213578
# 3 2014-01-01  3   43.35000000000001   300.01  0.01    38374.5 22  59.492013316134425  21.00257142857142   250.0   1.0 34761.399999999994  35  35.11167079001421   22.38730158730159   500.0   0.53    70801.39999999994   63  28.66950969896075   18.854583333333334  500.0   0.01    50313.10000000001   48  28.865852233314985  19.5298 500.0   0.01    105024.0    50  30.41884454981718   1.5 3.0 0.0 2238.2  2   1.5995889554105982
# 4 2014-01-01  4   46.431  300.01  0.01    33460.8 20  64.00475684980633   20.75628571428571   250.0   1.0 34829.29999999999   35  34.791386648597594  21.684531250000006  500.0   0.7 71841.29999999992   64  27.846364904309922  19.238510638297864  500.0   0.01    50767.90000000001   47  28.70213516808849   19.801836734693875  500.0   0.01    104199.79999999996  49  30.477332029428077  1.5 3.0 0.0 2242.4  2   1.5994024259721726

I then create a function to modify a given dataframe. It combines cols to create a single datetime index, and replace the index with that, I've redacted the logic for simplicity.

def create_dt(input_df):
    '''create a dataframe with a datetime index from multiple cols
    '''
    df = input_df.copy()
    #modify the df
    df = df.set_index(dt_index)
    df = df.drop(columns=[date_col,hr_col])
    return df

I then attempt to unpack data, pass them into create_dt() and assign the results. I would expect this to update the global var for each df, via the pointer in the dict.

for key, df in data.items():
    data[key] = create_dt(data[key],'date','hr_beg')

I evaluate bids global, post function call. It remains the same.

# OUTPUT:
bids.head().to_clipboard()
# note the index, date, and hr_beg cols. Same as initial value
# V  V          V
#       date    hr_beg  OFFNS_Unweighted Average Price  OFFNS_Max Price OFFNS_Min Price OFFNS_Total Quantity    OFFNS_Number of Bids    OFFNS_Weighted Avg Price    ONNS_Unweighted Average Price   ONNS_Max Price  ONNS_Min Price  ONNS_Total Quantity ONNS_Number of Bids ONNS_Weighted Avg Price REGDN_Unweighted Average Price  REGDN_Max Price REGDN_Min Price REGDN_Total Quantity    REGDN_Number of Bids    REGDN_Weighted Avg Price    REGUP_Unweighted Average Price  REGUP_Max Price REGUP_Min Price REGUP_Total Quantity    REGUP_Number of Bids    REGUP_Weighted Avg Price    RRSGN_Unweighted Average Price  RRSGN_Max Price RRSGN_Min Price RRSGN_Total Quantity    RRSGN_Number of Bids    RRSGN_Weighted Avg Price    RRSNC_Unweighted Average Price  RRSNC_Max Price RRSNC_Min Price RRSNC_Total Quantity    RRSNC_Number of Bids    RRSNC_Weighted Avg Price
# 0 2014-01-01  0   43.3190909090909    300.01  0.01    38144.7 22  59.51279016481975   22.016969696969696  250.0   1.0 32531.499999999985  33  36.74238980680264   20.669076923076922  500.0   0.92    71971.59999999992   65  26.577483215601717  19.744255319148944  500.0   0.01    56916.80000000003   47  27.33264099527731   20.85708333333334   500.0   0.01    107723.6    48  30.19552034094665   1.5 3.0 0.0 2236.8  2   1.5996512875536482
# 1 2014-01-01  1   43.342727272727274  300.01  0.01    38216.4 22  59.505340220428934  20.93514285714285   250.0   1.0 34781.19999999998   35  34.95683860821363   21.764761904761905  500.0   0.8 70412.39999999994   63  27.92263442234607   18.834375000000012  500.0   0.01    50201.80000000002   48  28.87979570453649   19.6692 500.0   0.01    107145.0    50  30.00068717158991   1.5 3.0 0.0 2235.8  2   1.599695858305752
# 2 2014-01-01  2   43.34818181818181   300.01  0.01    38336.9 22  59.49848289767822   20.97   250.0   1.0 34741.39999999999   35  35.091575987150776  21.836461538461545  500.0   0.58    72212.29999999992   65  28.27043938498013   18.856041666666666  500.0   0.01    50769.90000000001   48  28.61359006025224   19.5252 500.0   0.01    105503.8    50  30.27549695840339   1.5 3.0 0.0 2236.2  2   1.5996780252213578
# 3 2014-01-01  3   43.35000000000001   300.01  0.01    38374.5 22  59.492013316134425  21.00257142857142   250.0   1.0 34761.399999999994  35  35.11167079001421   22.38730158730159   500.0   0.53    70801.39999999994   63  28.66950969896075   18.854583333333334  500.0   0.01    50313.10000000001   48  28.865852233314985  19.5298 500.0   0.01    105024.0    50  30.41884454981718   1.5 3.0 0.0 2238.2  2   1.5995889554105982
# 4 2014-01-01  4   46.431  300.01  0.01    33460.8 20  64.00475684980633   20.75628571428571   250.0   1.0 34829.29999999999   35  34.791386648597594  21.684531250000006  500.0   0.7 71841.29999999992   64  27.846364904309922  19.238510638297864  500.0   0.01    50767.90000000001   47  28.70213516808849   19.801836734693875  500.0   0.01    104199.79999999996  49  30.477332029428077  1.5 3.0 0.0 2242.4  2   1.5994024259721726

I then evaluate the bids dataframe k-v pair in data. It is successfully modified.

data['bids'].head().to_clipboard()
#OUTPUT
# note datetime index, no date or hr_beg cols, see .columns() output one cell below. 
# V
#   OFFNS_Unweighted Average Price  OFFNS_Max Price OFFNS_Min Price OFFNS_Total Quantity    OFFNS_Number of Bids    OFFNS_Weighted Avg Price    ONNS_Unweighted Average Price   ONNS_Max Price  ONNS_Min Price  ONNS_Total Quantity ONNS_Number of Bids ONNS_Weighted Avg Price REGDN_Unweighted Average Price  REGDN_Max Price REGDN_Min Price REGDN_Total Quantity    REGDN_Number of Bids    REGDN_Weighted Avg Price    REGUP_Unweighted Average Price  REGUP_Max Price REGUP_Min Price REGUP_Total Quantity    REGUP_Number of Bids    REGUP_Weighted Avg Price    RRSGN_Unweighted Average Price  RRSGN_Max Price RRSGN_Min Price RRSGN_Total Quantity    RRSGN_Number of Bids    RRSGN_Weighted Avg Price    RRSNC_Unweighted Average Price  RRSNC_Max Price RRSNC_Min Price RRSNC_Total Quantity    RRSNC_Number of Bids    RRSNC_Weighted Avg Price
# 2014-01-01 00:00:00   43.3190909090909    300.01  0.01    38144.7 22  59.51279016481975   22.016969696969696  250.0   1.0 32531.499999999985  33  36.74238980680264   20.669076923076922  500.0   0.92    71971.59999999992   65  26.577483215601717  19.744255319148944  500.0   0.01    56916.80000000003   47  27.33264099527731   20.85708333333334   500.0   0.01    107723.6    48  30.19552034094665   1.5 3.0 0.0 2236.8  2   1.5996512875536482
# 2014-01-01 01:00:00   43.342727272727274  300.01  0.01    38216.4 22  59.505340220428934  20.93514285714285   250.0   1.0 34781.19999999998   35  34.95683860821363   21.764761904761905  500.0   0.8 70412.39999999994   63  27.92263442234607   18.834375000000012  500.0   0.01    50201.80000000002   48  28.87979570453649   19.6692 500.0   0.01    107145.0    50  30.00068717158991   1.5 3.0 0.0 2235.8  2   1.599695858305752
# 2014-01-01 02:00:00   43.34818181818181   300.01  0.01    38336.9 22  59.49848289767822   20.97   250.0   1.0 34741.39999999999   35  35.091575987150776  21.836461538461545  500.0   0.58    72212.29999999992   65  28.27043938498013   18.856041666666666  500.0   0.01    50769.90000000001   48  28.61359006025224   19.5252 500.0   0.01    105503.8    50  30.27549695840339   1.5 3.0 0.0 2236.2  2   1.5996780252213578
# 2014-01-01 03:00:00   43.35000000000001   300.01  0.01    38374.5 22  59.492013316134425  21.00257142857142   250.0   1.0 34761.399999999994  35  35.11167079001421   22.38730158730159   500.0   0.53    70801.39999999994   63  28.66950969896075   18.854583333333334  500.0   0.01    50313.10000000001   48  28.865852233314985  19.5298 500.0   0.01    105024.0    50  30.41884454981718   1.5 3.0 0.0 2238.2  2   1.5995889554105982
# 2014-01-01 04:00:00   46.431  300.01  0.01    33460.8 20  64.00475684980633   20.75628571428571   250.0   1.0 34829.29999999999   35  34.791386648597594  21.684531250000006  500.0   0.7 71841.29999999992   64  27.846364904309922  19.238510638297864  500.0   0.01    50767.90000000001   47  28.70213516808849   19.801836734693875  500.0   0.01    104199.79999999996  49  30.477332029428077  1.5 3.0 0.0 2242.4  2   1.5994024259721726

data['bids'].columns()

#OUTPUT:
# Index(['OFFNS_Unweighted Average Price', 'OFFNS_Max Price', 'OFFNS_Min Price',
#        'OFFNS_Total Quantity', 'OFFNS_Number of Bids',
#        'OFFNS_Weighted Avg Price', 'ONNS_Unweighted Average Price',
#        'ONNS_Max Price', 'ONNS_Min Price', 'ONNS_Total Quantity',
#        'ONNS_Number of Bids', 'ONNS_Weighted Avg Price',
#        'REGDN_Unweighted Average Price', 'REGDN_Max Price', 'REGDN_Min Price',
#        'REGDN_Total Quantity', 'REGDN_Number of Bids',
#        'REGDN_Weighted Avg Price', 'REGUP_Unweighted Average Price',
#        'REGUP_Max Price', 'REGUP_Min Price', 'REGUP_Total Quantity',
#        'REGUP_Number of Bids', 'REGUP_Weighted Avg Price',
#        'RRSGN_Unweighted Average Price', 'RRSGN_Max Price', 'RRSGN_Min Price',
#        'RRSGN_Total Quantity', 'RRSGN_Number of Bids',
#        'RRSGN_Weighted Avg Price', 'RRSNC_Unweighted Average Price',
#        'RRSNC_Max Price', 'RRSNC_Min Price', 'RRSNC_Total Quantity',
#        'RRSNC_Number of Bids', 'RRSNC_Weighted Avg Price'],
#       dtype='object')

  • humm, I stripped this down quite a bit to attempt to provide that with enough context. I do so further. – nickolasclarke Apr 23 '20 at 00:06
  • _I would expect this to update the global var for each df, via the pointer in the dict._ You expected the values of `bids`, `plans` etc. to change when you modify the dictionary? Have I understood things correctly? – AMC Apr 23 '20 at 00:07
  • yes, I would expect the values the global vars: `bids` and `plans`, etc to be updated when I modify the value in `data[key]` – nickolasclarke Apr 23 '20 at 00:10
  • Ah, well they shouldn't be. I'm trying to think of some decent resources which could explain things better... – AMC Apr 23 '20 at 00:13
  • Unrelated but in your code example, you provide three arguments into your call of `create_dt` but you only have one parameter in the func def? – Phillyclause89 Apr 23 '20 at 00:28
  • @Phillyclause89 sorry that was a mistake I left in when I was attempting to strip down the code for a simplified example. – nickolasclarke Apr 23 '20 at 00:30

2 Answers2

1

In the global scope you define a number of dataframes:

bids = pd.read_csv('data/as_bid_aggregated_data.csv')
plans = pd.read_csv('data/as_plan.csv')
energy_prices = pd.read_csv('data/as_bid_aggregated_data.csv')
price_vol = pd.read_csv('data/as_price_vol.csv')
generation = pd.read_csv('data/generation.csv')

Then you create a dictionary of the following keys and assign the values to the above dataframes:


data = {'bids':bids,
        'plans':plans,
        'energy_prices':energy_prices,
        'price_vol':price_vol,
        'generation':generation,
       }

At this point your keys point to the dataframes in the outer scope.

Then you call a function that creates a COPY of the source dataframe, modifies it and returns it.

def create_dt(input_df):
    '''create a dataframe with a datetime index from multiple cols
    '''
    df = input_df.copy()
    #modify the df
    df = df.set_index(dt_index)
    df = df.drop(columns=[date_col,hr_col])
    return df

for key, df in data.items():
    data[key] = create_dt(data[key],'date','hr_beg')

At this point the returned df from create_dt() is a different dataframe from the one given (you created a copy), and changed the reference in the dictionary, data. So there is no reason why the outer scope dataframes should be modified. (if you remove the input_df.copy() line it may work as your expecting)

In anycase, if this is all you are doing in the function there is no reason not do perform the drop columns in the loop from the outer scope, if you *want* the results to be modified.

monkut
  • 42,176
  • 24
  • 124
  • 155
  • Ah. I would have expected `data[key] = create_dt(data[key],'date','hr_beg')` to still update with whatever was returned from `create_dt`, even if not a dataframe. I actually create that copy explicitly so as to avoid any side affects and to perform everything within the context of the function, and explicitly assigning the results with the return at the global scope. edit: Oh, I am doing more, I just stripped it out for clarity. – nickolasclarke Apr 23 '20 at 00:22
  • Due to the copy the reference in the dictionary changes which leaves the outer-scoped dataframes unchanged. – monkut Apr 23 '20 at 00:23
  • In this case the dictionary keys are like pointers to referenced data structures in memory. You can use the `id()` function to see the reference of the object has changed. – monkut Apr 23 '20 at 00:26
  • Would there be a way do perform the operation in a "functional" style as I described above while still updating the global? – nickolasclarke Apr 23 '20 at 00:29
  • What's your concern here? If it's memory you can just delete the outer-scoped variables after processing them. Or even just define them directly on dictionary creation. I don't see a benefit to maintaining the outer-scoped dataframe references if you want them modified and already have them in the `data` dictionary. – monkut Apr 23 '20 at 00:52
  • as I mentioned, to maintain a functional style, with no side effects. By modifying the df in the manner suggested, it is modified outside of the scope of the function and breaks idempotence. But perhaps this should concern? Without deleting the globals, it would be easy to mistakenly call the wrong version of the given dataframe, especially if I am not interested in iterating through all the dataframes, but rather do something to just one. Anyhow, that was my rationale. Thanks to all for the explanations! – nickolasclarke Apr 23 '20 at 03:34
  • 1
    It really comes down to preference then. I'd probably l just keep the dataframes in the `data` dictionary, and use that as the source of truth. You may consider just passing the whole `data` dictionary to your function and overwritting the dictionary with the modified version. It's up to you. – monkut Apr 23 '20 at 07:34
1

If you want to change the pandas.DataFrame object and have it updated for all variables pointed to that object then you need to use the inplace=True agrument for all the df method calls you use. Step through these examples in python tutor to get a clearer idea of what objects are being pointed to by your variables:

import pandas as pd
'''Basicly What you were doing'''
def create_dt(input_df):
    df = input_df.copy()
    df = df.set_index(pd.Series(['i','j']))
    return df


x = pd.DataFrame({'a':[1,2],'b':[3,4]})
datax = {'x':x,}
for key, df in datax.items():
    datax[key] = create_dt(datax[key])
print(x)


'''Basicly What was recommended'''
def create_dt2(input_df):
    input_df = input_df.set_index(pd.Series(['i','j']))
    return input_df


y = pd.DataFrame({'a':[5,6],'b':[7,8]})
datay = {'y':y,}
for key, df in datay.items():
    datay[key] = create_dt2(datay[key])

print(y)

'''Using inplace = True is the only way to change the object'''
def modify_df(input_df):
    input_df.set_index(pd.Series(['i','j']), inplace=True)

z = pd.DataFrame({'a':[9,10],'b':[11,12]})
dataz = {'z':z,} 
for key, df in dataz.items():
    modify_df(dataz[key])

print(z)


Phillyclause89
  • 674
  • 4
  • 12