
I have a pandas DataFrame, created this way:

import numpy as np
import pandas as pd

wb = pd.ExcelFile('/path/to/data.xlsx')
df = wb.parse(wb.sheet_names[0])

The resulting dataframe has about a dozen columns, all having exactly the same length (about 150K).

For most columns, the following operation is nearly instantaneous:

aset = set(df.acolumn)

But for some columns, the same operation, e.g.

aset = set(df.weirdcolumn)

takes > 10 minutes! (Or rather, the operation fails to complete before the 10-minute timeout period expires.) Same number of elements!

Stranger still:

In [106]: set([type(c) for c in df.weirdcolumn])
Out[106]: set([numpy.float64])

In [107]: df.weirdcolumn.value_counts()
Out[107]: []

It appears that the content of the column is all nans:

In [118]: all(np.isnan(df.weirdcolumn.values))
Out[118]: True

But this does not explain the slowdown mentioned before, because the following operation takes only a couple of seconds:

In [121]: set([np.nan for _ in range(len(df))])
Out[121]: set([nan])

I have run out of ways to find out the cause of the massive slowdown mentioned above. Suggestions welcome.

kjo

1 Answer


One weird thing about nans is that they don't compare as equal. This means that "different" nan objects will be inserted into a set separately:

>>> float('nan') == float('nan')
False
>>> float('nan') is float('nan')
False
>>> len(set([float('nan') for _ in range(1000)]))
1000
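
You can see the collision directly by hashing them. (This assumes CPython of the vintage current here; CPython 3.10 later changed nan's hash to depend on object identity, precisely to avoid this pathology.)

>>> hash(float('nan'))
0
>>> len(set([hash(float('nan')) for _ in range(1000)]))
1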

This doesn't happen for your test of np.nan, because it's the same object over and over:

>>> np.nan == np.nan
False
>>> np.nan is np.nan
True
>>> len(set([np.nan for _ in range(1000)]))
1

This is probably your problem: you're building a 150,000-element set where every single element has the exact same hash (hash(float('nan')) == 0) but no two elements compare equal. That means inserting a new nan into a set that already contains n nans has to probe past all n of them, taking at least O(n) time, so building a set of N nans takes at least O(N^2) time. 150k^2 is... big.
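
You can watch the quadratic blow-up with a rough timing sketch (illustrative only; absolute times will vary, and on interpreters where nan's hash is identity-based, as in CPython 3.10+, the growth is linear instead):

import timeit

for n in [1000, 2000, 4000, 8000]:
    nans = [float('nan') for _ in range(n)]  # n distinct nan objects
    t = timeit.timeit(lambda: set(nans), number=1)
    print("%6d nans: %.3fs" % (n, t))

Each doubling of n should roughly quadruple the time.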

So yeah, nans suck. You could work around this by doing something like

nan_idx = np.isnan(df.weirdcolumn)   # boolean mask of the nan entries
s = set(df.weirdcolumn[~nan_idx])    # build the set from the non-nan values only
if np.any(nan_idx):
    s.add(np.nan)                    # represent all nans by the single np.nan object
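
If your pandas version's Series.unique collapses nans at the C level (worth verifying on your build, but it does in every version I've checked), an equivalent shortcut would be:

s = set(df.weirdcolumn.unique())  # unique() deduplicates in C, so the set sees at most one nan

Either way you end up with a single np.nan in the set instead of 150k distinct objects fighting over one hash bucket.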
Danica
  • How peculiar. This would *kill* your performance. Since each `nan` would hash to the same value, this is the absolute worst case scenario for collision resolution in the hash table. I wonder if something like this could be exploited in python for nasty purposes ... – mgilson Feb 13 '13 at 04:17
  • It's a little weird that `np.nan` doesn't get repeated in the set. According to the glossary index for [hashable](http://docs.python.org/2/glossary.html), in order for an object to be hashable, all that is checked is `__eq__` (or `__cmp__`) and `__hash__`. – mgilson Feb 13 '13 at 04:24
  • [The docs say](http://docs.python.org/dev/reference/expressions.html#membership-test-details): "For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression `x in y` is equivalent to `any(x is e or x == e for e in y)`." Presumably it checks `is` to shortcut the `==` test, since for most anything except nan `a is b` implies `a == b`, but that seems to be part of the semantics for the case when they're not. – Danica Feb 13 '13 at 04:26
  • See also http://stackoverflow.com/questions/9904699/checking-for-nan-presence-in-a-container (where @MarkDickinson linked to the relevant quote), http://www.gossamer-threads.com/lists/python/python/922088, and http://bugs.python.org/issue11945. – Danica Feb 13 '13 at 04:28
  • Interesting. That line isn't in the 2.7 docs which is what I usually peruse. – mgilson Feb 13 '13 at 04:29
  • "So yeah, nans suck." I'd say it's rather panda's use of nans that sucks. numpy's nans are fine, for instance. – kjo Feb 13 '13 at 13:29
  • Eh, this isn't really pandas's fault. Numpy does the same thing in most cases; it just so happens that np.nan itself doesn't. – Danica Feb 13 '13 at 16:40
  • @kjo Just to illustrate that this really has nothing to do with pandas, note that `len(set(np.array([np.nan] * 10))) = 10`. – Danica Feb 14 '13 at 05:34