
I'm doing some natural language processing, and I have a MultiIndexed DataFrame that looks something like this (except there are actually about 3,000 rows):

                             Title                                              N-grams
Period  Date                                                                                                                     
2015-01 2015-01-01 22:00:10  SIRF: Simultaneous Image Registration and Fusi...  [@SENTBEGIN paper, paper propose, propose nove...    
        2015-01-02 16:54:13  Generic construction of scale-invariantly coar...  [@SENTBEGIN encode, encode temporal, temporal ...
        2015-01-04 00:07:00  Understanding Trajectory Behavior: A Motion Pa...  [@SENTBEGIN mining, mining underlie, underlie ...
        2015-01-04 09:07:45  Hostile Intent Identification by Movement Patt...  [@SENTBEGIN the, the recent, recent year, year...
        2015-01-04 14:35:58  A New Method for Signal and Image Analysis: Th...  [@SENTBEGIN brief, brief review, review provid...

What I want to do is to count how many times each n-gram appears in each month (hence the first index, "Period"). Doing that is rather straightforward, if time-consuming (and because each cell in the "N-grams" column is a list, I'm not sure much could be done to speed it up). I create a new DataFrame to hold the counts, using this code:

from pandas import DataFrame

# Create the frequencies DataFrame.
period_index = ngrams.index.unique(level = "Period")
freqs = DataFrame(index = period_index)

# Count the n-grams in each period.
for period in period_index:
    for ngrams_list in ngrams.loc[period, "N-grams"]:
        for ngram in ngrams_list:
            if ngram not in freqs.columns:
                freqs[ngram] = 0
            freqs.loc[period, ngram] += 1

The logic is pretty simple: if the n-gram in question has been seen already (there's a column for it in the "freqs"), increment the count by 1. If it hasn't been seen, create a new column of 0's for that n-gram, and then increment as normal. In the vast majority of cases, this works fine, but for a tiny fraction of n-grams, I get this error when the loop hits the increment line:

KeyError: u'the label [7 85.40] is not in the [index]'

(Sorry for the lack of a proper stack trace--I'm doing this in a Zeppelin Notebook, and Zeppelin doesn't provide one.)

A little more debugging showed that, in these cases, the creation of the new column fails silently (that is, it doesn't work, but it doesn't raise an exception, either).

It might be worth noting that in an earlier version of the code, I was using "loc" to assign directly to a cell in a column that didn't exist yet (creating it implicitly), rather than creating the column first, like this:

if not ngram in freqs.columns:
    freqs.loc[period, ngram] = 1

I changed this because it caused problems by assigning NaN's for that n-gram to all the other periods, but the direct assignment choked on exactly the same n-grams as with the new code.
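To illustrate that NaN behavior, here's a minimal sketch on a toy frame (the labels are made up, not my real periods or n-grams):

import pandas as pd

# Assigning to a cell in a column that doesn't exist yet creates the column,
# but fills every other row with NaN.
toy = pd.DataFrame(index = ["2015-01", "2015-02", "2015-03"])
toy.loc["2015-01", "some ngram"] = 1
print(toy)
#          some ngram
# 2015-01         1.0
# 2015-02         NaN
# 2015-03         NaN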

By wrapping the increment line in a try/except block, I've discovered that the error is extremely rare: it occurs for about 20 out of a total of over 100,000 n-grams in the corpus. Here are some examples:

"7 85.40"
"2014 july"
"2010 3.4"
"and 77"
"1997 and"
"and 2014"
"6 2008"
"879 --"
"-- 894"
"2003 -"
"- 2014"

Most of the 20 include digits, but at least one is entirely letters (two words separated by a space; it's not in the list above because I re-ran the script while typing up this question and didn't get that far), and plenty of digits-only n-grams don't cause problems. Most of the problematic ones involve years, which, on the face of it, might suggest some sort of confusion with the DataFrame's DatetimeIndex (given that a DatetimeIndex accepts partial matches), but that wouldn't explain the non-dates, especially the ones that begin with letters.
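To illustrate the sort of partial matching I mean, here's a minimal sketch on a throwaway frame (not my actual data):

import pandas as pd

# Partial string indexing on a DatetimeIndex: the string "2015-01" matches
# every row that falls in January 2015.
ts = pd.DataFrame({"x": [1, 2, 3]},
                  index = pd.to_datetime(["2015-01-01", "2015-01-15", "2015-02-01"]))
print(ts.loc["2015-01"])   # returns the first two rows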

Despite the unlikelihood of the DatetimeIndex conflict, I tried a different method of creating each new column (as suggested by an answer to Adding new column to existing DataFrame in Python pandas), using "loc" to avoid any confusion between rows and columns:

freqs.loc[:, ngram] = Series(0, index = freqs.index)

...but that meets with exactly the same fate as my original code that created each new column implicitly by assigning to a non-existent column:

KeyError: u'7 85.40'

Next, I tried the DataFrame.assign method (suggested in the same answer cited above, though I needed to add a workaround suggested by an answer to pandas assign with new column name as string):

kwarg = {ngram: 0}
freqs = freqs.assign(**kwarg)

Alas, that produces exactly the same error.

Does anyone have any insights on why this might be happening? Given the rarity, I suppose I could just ignore the problematic n-grams, but it would be good to understand what's going on.

MTKnife

2 Answers


A nested for loop is neither recommended nor required here. You can use MultiLabelBinarizer from the sklearn.preprocessing library to one-hot encode the lists, then use groupby + sum on the result and join it back to your original dataframe.

Here's a demonstration:

# df as constructed in the Setup section below
df = df.set_index(['L1', 'L2'])

row_counts = df['values'].apply(pd.Series.value_counts).fillna(0).astype(int)

# alternative if above does not work
row_counts = df['values'].apply(lambda x: pd.Series(x).value_counts(sort=False))\
                         .fillna(0).astype(int)

row_counts_grouped = row_counts.groupby(level='L1').sum()

df = df.join(row_counts_grouped, how='inner')

print(df)

          values  a  b  c  d  e  g
L1 L2                             
1  1   [a, a, c]  3  2  2  1  1  0
   2   [b, c, d]  3  2  2  1  1  0
   3   [a, b, e]  3  2  2  1  1  0
2  1   [a, e, g]  1  2  1  2  2  1
   2   [b, d, d]  1  2  1  2  2  1
   3   [e, b, c]  1  2  1  2  2  1

Setup / original solution

Note that this solution does not count duplicate values within a row:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame([[1,1,['a','a','c']], [1,2,['b','c','d']], [1,3,['a','b','e']],
                   [2,1,['a','e','g']], [2,2,['b','d','d']], [2,3,['e','b','c']]],
                  columns=['L1', 'L2', 'values'])

df = df.set_index(['L1', 'L2'])

mlb = MultiLabelBinarizer()

onehot = pd.DataFrame(mlb.fit_transform(df['values']),
                      columns=mlb.classes_,
                      index=df.index.get_level_values('L1'))

onehot_grouped = onehot.groupby(level='L1').sum()

df = df.join(onehot_grouped, how='inner')

print(df)

          values  a  b  c  d  e  g
L1 L2                             
1  1   [a, a, c]  2  2  2  1  1  0
   2   [b, c, d]  2  2  2  1  1  0
   3   [a, b, e]  2  2  2  1  1  0
2  1   [a, e, g]  1  2  1  1  2  1
   2   [b, d, d]  1  2  1  1  2  1
   3   [e, b, c]  1  2  1  1  2  1
jpp
  • Thank you. That's close but doesn't *quite* do what I need to do. I'd like to get a document-term frequency matrix before doing the sums--that is, if "a" appears twice in the first document, the count in question needs to be incremented by 2, not 1. – MTKnife Nov 29 '18 at 17:02
  • OK, I had to modify that a little, because the `Series.value_counts` method doesn't like lists as input: `counts_by_doc = ngrams["N-grams"].apply(lambda ngrams_list: Series.value_counts(Series(ngrams_list), sort = False)).fillna(0).astype(int)`. If you'd like to make the edit, I can mark your answers as the accepted one. Thanks again! – MTKnife Nov 29 '18 at 18:07
  • @MTKnife, It worked for me (Pandas v0.19), but I've added your version anyway. – jpp Nov 29 '18 at 18:14
  • Thank you! Yes, I'm using 0.23.4. BTW, the reason I was using the `for` loops was that I wasn't aware that anything like `value_counts` existed. – MTKnife Nov 29 '18 at 18:19
  • I just caught something in the final answer. For the latest versions of pandas, that "alternative" line needs to convert "x" into a Series before using a Series method on it. The most straightforward way is `lambda x: pd.Series(x).value_counts(sort=False)`. I'll edit it. – MTKnife Nov 30 '18 at 20:04

For my original dataset of about 3,000 short documents, jpp's answer worked fine, and took about 10 minutes to run in Zeppelin on the server where I'm doing testing--which was an order of magnitude faster than the code I had been using (in addition to solving the KeyError problem). However, when I tried a bigger dataset of about 10,000 documents, the code hadn't finished after 18 hours of running--I suspect that has something to do with keeping everything in memory when using apply (see https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/ for some reflections on this issue).

Figuring that apply was just an elegant way of creating a nested loop, I decided to avoid the apparent memory issues by writing the loops explicitly, while still using the Series.value_counts method, which should be the source of the real efficiency gains. This was a bit tricky due to problems with data types, but here's the end result:

from pandas import DataFrame, Series

period_index = ngrams.index.unique(level = "Period")
freqs = DataFrame()

for period in period_index:
    # Build a per-document frequency matrix for this period with value_counts...
    period_ngrams = ngrams.loc[period]
    period_freqs = DataFrame(index = period_ngrams.index)
    for i, doc in period_ngrams.iterrows():
        period_freqs = period_freqs.append(Series(doc["N-grams"]). \
                           value_counts(sort = False), ignore_index = True)
    # ...then collapse it into a single row of totals for the period.
    period_sums = period_freqs.sum()
    period_sums.name = period
    freqs = freqs.append(period_sums)
    print("Processed period " + str(period) + ".")

freqs["Totals"] = freqs.sum(axis = 1)
freqs = freqs.fillna(0).astype(int)

It turns out that not only does this work with the larger dataset, but it's actually faster: 5 minutes for 10,000 documents.

MTKnife