2

I am working in Pandas to create a series of data frames, each of which is an aggregated version of the previous one. (I have a statistic that has to be calculated recursively.) Let's assume that I have a series of variables that I'll be aggregating, saved in a dictionary called aggVars. The original data frame, df, has nested observations:

year area ind occ aggVars
2000 0001 001 001 ...
2000 0001 001 002 ...
2000 0001 002 001 ...
2000 0001 002 002 ...
2000 0002 001 001 ...
2000 0002 001 002 ...
2000 0002 002 001 ...
2000 0002 002 002 ...
2001 0001 001 001 ...
2001 0001 001 002 ...
.
.
.

I think you get the idea. Observations are nested in occupations, which are within industries, which are within areas, which are within years.

I can do this the dumb way:

# DUMB CODE BLOCK
df_job = df.groupby(['year', 'area', 'ind', 'occ'], as_index=False)
df_job = df_job.agg(aggVars)
df_work = df_job.groupby(['year', 'area', 'ind'], as_index=False)
df_work = df_work.agg(aggVars)
df_county = df_work.groupby(['year', 'area'], as_index=False)
df_county = df_county.agg(aggVars)
df_year = df_county.groupby(['year'], as_index=False)
df_year = df_year.agg(aggVars)

(I have intentionally used different names for the data frames, i.e., df_county instead of df_area, to reflect that the .groupby() variables in the real data do not map so neatly onto the hierarchical levels.)

I have tested this, and it works fine. But this is CLEARLY a stupid way to do this. This should be a loop of some sort. And here is where my troubles begin. I could specify a list of lists:

aggHierarchy = [['job', ['year', 'area', 'ind', 'occ']],
                ['work', ['year', 'area', 'ind']],
                ['county', ['year', 'area']],
                ['year', ['year']]]

And then loop over the list, something like this:

# BROKEN CODE BLOCK
old_df = 'df'
for level in aggHierarchy:
    new_df = 'df_%s' % level[0]
    new_df = old_df.groupby(level[1], as_index=False)
    new_df = new_df.agg(aggVars)
    old_df = new_df

The logic here would be to assign the new data frame's name based on the first sub-element of the aggHierarchy element, then group things using the second sub-element. But of course, this doesn't work. The loop I've written basically tries to assign a NAME for the new data frame using new_df = 'df_%s' % level[0], but all I've actually done there is create a string.

Furthermore, Stack Overflow is full of people pointing out that using lists to assign variable names in a loop is Considered Harmful. I get that--I can tell how janky this is. "Use a dictionary," I see people writing. But here's the thing: The aggregation of those data frames has to happen in a certain order, which (I believe) I can't specify with a dictionary. I am failing to grasp how I go from adding variable names to a dictionary, to calling them in some specified order in a loop.

Thus my question, which hopefully I've given enough background information to specify well: given a block of code like DUMB CODE BLOCK above, where I need to update variable names based on a list (or dictionary!) whose exact contents I might not know in advance...how can I create some sort of loop there?

JP Ferguson
  • 59
  • 1
  • 9
  • `collections.OrderedDict` might suit your usecase. – Josha Inglis Nov 03 '17 at 04:17
  • Are you only interested in the *end* result or do you actually need all of those intermediate DataFrames for something else? – wwii Nov 03 '17 at 04:25
  • It'd be helpful if you can provide a working minimal example, including specific names and values for `aggVars`, so it's possible to see the exact input and output. Also, what is your required output - do you need all of the intermediary data frames, or only the final one (e.g. `df_year` in your example)? – andrew_reece Nov 03 '17 at 04:25
  • I do need the intermediate data frames. I have realized that I can collapse most of that `dumb code block` in half, i.e., `df_year = df_county.groupby(['year'], as_index=False).agg(aggVars)`. – JP Ferguson Nov 03 '17 at 04:30
  • A minimum working example might have three variables over which I'm aggregating, say `white`, `black`, and `workers`. These hold counts. Then `aggVars = {'white': 'sum', 'black': 'sum', 'workers': 'sum'}`. Each of the aggregated data frames would then hold sums collapsed over the grouping variables. (In this case you can build ALL of the intermediate data frames from the original, but that doesn't hold for more complicated statistics.) – JP Ferguson Nov 03 '17 at 04:33

1 Answers1

0

Keep using your list but use the DataFrame name as a dictionary key to store all the intermediate DataFrames.

Assuming the original DataFrame is named df:

aggHierarchy = [['job', ['year', 'area', 'ind', 'occ']],
                ['work', ['year', 'area', 'ind']],
                ['county', ['year', 'area']],
                ['year', ['year']]]

data_frames = {}
for df_name, group in aggHierarchy:
    df = df.groupby(group, as_index=False)
    df = df.agg(aggVars)
    data_frames[df_name] = df

If you do not want to lose the original DataFrame:

data_frames = {}
new_df = df
for df_name, group in aggHierarchy:
    new_df = new_df.groupby(group, as_index=False)
    new_df = new_df.agg(aggVars)
    data_frames[df_name] = new_df

Your comment:

If, in aggHierarchy, I list the names as df_industry, df_area and so on, and then loop using for df_name, group in aggHierarchy, the resulting dictionary keys are industry, area and so on. That is, python seems to "absorb" the prefix when the prefix is included in the looping variable and the things looped over.

Surprised me when I saw it - I certainly don't see that behaviour

aggHierarchy = [['df_job', ['year', 'area', 'ind', 'occ']],
                ['df_work', ['year', 'area', 'ind']],
                ['df_county', ['year', 'area']],
                ['df_year', ['year']]]

d = {}
for df_name, thing in aggHierarchy:
    print(df_name, thing)
    d[df_name] =  thing

>>>
df_job ['year', 'area', 'ind', 'occ']
df_work ['year', 'area', 'ind']
df_county ['year', 'area']
df_year ['year']
>>>
>>> d
{'df_job': ['year', 'area', 'ind', 'occ'], 'df_work': ['year', 'area', 'ind'], 'df_county': ['year', 'area'], 'df_year': ['year']}
>>>
>>> d['df_year']
['year']
>>> d['year']
Traceback (most recent call last):
  File "<pyshell#206>", line 1, in <module>
    d['year']
KeyError: 'year'
>>>

In the loop, df_name is just a name. The for statement is implicitly(?) assigning the first item from each row in aggHierarchy to it. No magic going on.

wwii
  • 23,232
  • 7
  • 37
  • 77
  • Bravo. That does indeed produce the same results! Now I'm going to sit for a while and figure out WHY it works. (I came to python with >15 years of using Stata for data analysis, and Stata has no dictionary structure. Hence the intuition still escapes me.) Thanks for the quick, and useful, reply. – JP Ferguson Nov 03 '17 at 04:45
  • @JPFerguson, Anything specific that you are trying to figure out `WHY`? – wwii Nov 03 '17 at 04:49
  • Sometimes [Python Names](https://nedbatchelder.com/text/names.html) trip people up at first also. – wwii Nov 03 '17 at 04:55
  • 1
    At this point, no. I understand now that the dictionary is linking those names to data frames. Per one of the posts I linked to: "The thing all these have in common is trying to bridge the gap between two domains: the data in your program, and the names of data in your program. Any time this happens, it's a clear sign that you need to move up a level in your data modeling. Instead of 26 lists, you need one dictionary. Instead of N tables, you should have one table, with one more column in it." I see how this gives us a dictionary of data frames rather than a cluttered set of variables. – JP Ferguson Nov 03 '17 at 04:56
  • Actually @wwii, one specific question, if you don't mind. If, in aggHierarchy, I list the names as `df_industry`, `df_area` and so on, and then loop using `for df_name, group in aggHierarchy`, the resulting dictionary keys are `industry`, `area` and so on. That is, python seems to "absorb" the prefix when the prefix is included in the looping variable and the things looped over. If I keep the names like that and loop using `for name, group...`, that doesn't happen. What is this phenomenon called, so I can read up on it? It seems like a future landmine if I don't understand it. – JP Ferguson Nov 03 '17 at 17:00