I have a pandas DataFrame, created this way:
import numpy as np   # used further down for np.isnan / np.nan
import pandas as pd

wb = pd.ExcelFile('/path/to/data.xlsx')
df = wb.parse(wb.sheet_names[0])
The resulting DataFrame has about a dozen columns, all of exactly the same length (about 150K rows).
For most columns, the following operation is nearly instantaneous:
aset = set(df.acolumn)
But for some columns, the same operation, e.g.
aset = set(df.weirdcolumn)
takes > 10 minutes! (Or rather, the operation fails to complete before the 10-minute timeout period expires.) Same number of elements!
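(In case it matters, nothing fancier than wall-clock timing around each call is needed to see the difference; roughly:)

import time

t0 = time.time()
aset = set(df.acolumn)       # near-instant for most columns
print('acolumn took %.1f s' % (time.time() - t0))

t0 = time.time()
aset = set(df.weirdcolumn)   # this is the call that blows past the 10-minute mark
print('weirdcolumn took %.1f s' % (time.time() - t0))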
Stranger still:
In [106]: set([type(c) for c in df.weirdcolumn])
Out[106]: set([numpy.float64])
In [107]: df.weirdcolumn.value_counts()
Out[107]: []
It appears that the content of the column is all NaNs:
In [118]: all(np.isnan(df.weirdcolumn.values))
Out[118]: True
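If it helps anyone reproduce this without my spreadsheet, I suspect (though I have not confirmed it) that a plain all-NaN float64 Series of the same length behaves the same way:

import numpy as np
import pandas as pd

# Hypothetical standalone reproduction: a float64 Series containing nothing but NaN,
# roughly the same length as weirdcolumn.
s = pd.Series(np.full(150000, np.nan))
aset = set(s)   # does this show the same pathological slowdown?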
But this does not explain the slowdown mentioned before, because the following operation takes only a couple of seconds:
In [121]: set([np.nan for _ in range(len(df))])
Out[121]: set([nan])
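One difference between the two cases that I can point at, though I don't know whether it is relevant: the list comprehension above reuses the single np.nan object, whereas iterating over the column hands back a fresh numpy.float64 for every row (variable names below are just for illustration):

sample_from_column = [c for c in df.weirdcolumn.head(5)]
sample_from_list = [np.nan for _ in range(5)]

# Distinct boxed float64 objects, one per row:
len(set([id(c) for c in sample_from_column]))   # -> 5
# Every element is literally the same np.nan object:
len(set([id(c) for c in sample_from_list]))     # -> 1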
I have run out of ideas for tracking down the cause of this massive slowdown. Suggestions welcome.