16

My first time posting here, so I hope I've asked my question in the right sort of way.

After adding an element to a Python dictionary, is it possible to get Python to tell you if adding that element caused a collision? (And how many locations the collision resolution strategy probed before finding a place to put the element?)

My problem is: I am using dictionaries as part of a larger project, and after extensive profiling, I have discovered that the slowest part of the code is dealing with a sparse distance matrix implemented using dictionaries.

The keys I'm using are IDs of Python objects, which are unique integers, so I know they all hash to different values. But putting them in a dictionary could still cause collisions in principle. I don't believe that dictionary collisions are the thing that's slowing my program down, but I want to eliminate them from my enquiries.
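
(For reference: in CPython 2.x a plain int hashes to itself, except that -1 is reserved, which is why distinct integer keys do give distinct hash values.)

>>> hash(17000000)
17000000
>>> hash(-1), hash(-2)
(-2, -2)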

So, for example, given the following dictionary:

import random

d = {}
for i in xrange(15000):
    d[random.randint(15000000, 18000000)] = 0

can you get Python to tell you how many collisions happened when creating it?

My actual code is tangled up with the application, but the above code makes a dictionary that looks very similar to the ones I am using.

To repeat: I don't think that collisions are what is slowing down my code, I just want to eliminate the possibility by showing that my dictionaries don't have many collisions.

Thanks for your help.

Edit: Some code to implement @Winston Ewert's solution:

n = 1500
collision_count = 0   # incremented by Foo.__eq__ below

class Foo():

    def __eq__(self, other):
        global collision_count
        collision_count += 1
        return id(self) == id(other)

    def __hash__(self):
        #return id(self) # @John Machin: yes, I know!
        return 1

objects = [Foo() for i in xrange(n)]

d = {}
for o in objects:
    d[o] = 1

print collision_count

Note that when you define __eq__ on a class, Python gives you a TypeError: unhashable instance if you don't also define __hash__. (This applies to old-style classes like Foo above; in Python 2, a new-style class that defines __eq__ still inherits object's default __hash__.)

It doesn't run quite as I expected. If you have the __hash__ function return 1, then you get loads of collisions, as expected (1125560 collisions for n=1500 on my system). But with return id(self), there are 0 collisions.

Anyone know why this is saying 0 collisions?

Edit: I might have figured this out.

Is it because __eq__ is only called if the __hash__ values of two objects are the same, not their "crunched version" (as @John Machin put it)?

Adam Nellis
  • You mean that you want to know if the internal dict algorithms did any hash table probing to find an element? Is that what you mean by "collision"? – S.Lott Feb 01 '11 at 17:02
  • Some semi-relevant info: `hash(-1)==hash(-2)`. Other than that, all ints x in the interval `-sys.maxint-1 <= x <= sys.maxint` have unique hashes. The algorithm for hashing long ints is described here: http://effbot.org/zone/python-hash.htm – unutbu Feb 01 '11 at 18:07
  • "The hash value -1 is reserved (it’s used to flag errors in the C implementation). If the hash algorithm generates this value, we simply use -2 instead." Ouch. – UncleZeiv Feb 01 '11 at 18:27
  • @unutbu: (1) the OP's keys are object ids, not long integers. Different story. (2) Urban myth: unique hash means no collisions. Is wrong. See my answer. – John Machin Feb 01 '11 at 20:33
  • @S.Lott: Yes, that's precisely what I meant. – Adam Nellis Feb 01 '11 at 23:05
  • looking at the code again, I see that it does compare the actual hash values and thus won't call '__eq__' unless they are actually the same hash. Thus my plan doesn't work. :( – Winston Ewert Feb 02 '11 at 15:24
  • @Adam Nellis, @Winston Ewert: Naturally it will compare hashes instead of the probe values before calling `__eq__`; there would be many cases where the probe values were the same and the hashes different. – John Machin Feb 02 '11 at 22:17

2 Answers

10

Short answer:

You can't simulate using object ids as dict keys by using random integers as dict keys. They have different hash functions.

Collisions do happen. "Having unique thingies means no collisions" is wrong for several values of "thingy".

You shouldn't be worrying about collisions.

Long answer:

Some explanations, derived from reading the source code:

A dict is implemented as a table of 2 ** i entries, where i is an integer.

dicts are no more than 2/3 full. Consequently for 15000 keys, i must be 15 and 2 ** i is 32768.

When o is an arbitrary instance of a class that doesn't define __hash__(), it is NOT true that hash(o) == id(o). As the address is likely to have zeroes in the low-order 3 or 4 bits, the hash is constructed by rotating the address right by 4 bits; see the source file Objects/object.c, function _Py_HashPointer.
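
A quick check of that rotation (a sketch assuming a 64-bit CPython 2.7 build; the BITS constant is an assumption, standing in for 8 * SIZEOF_VOID_P on your build):

class Foo(object):
    pass

o = Foo()
BITS = 64   # assumption: 64-bit build (8 * SIZEOF_VOID_P)
y = id(o)
y = ((y >> 4) | (y << (BITS - 4))) & ((1 << BITS) - 1)   # rotate right by 4 bits
if y >= 1 << (BITS - 1):    # reinterpret as a signed C long
    y -= 1 << BITS
print y == hash(o)       # True on CPython 2.7 (unless y == -1, which maps to -2)
print id(o) == hash(o)   # False on 2.7; True on 2.6, where the raw address is used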

It would be a problem if there were lots of zeroes in the low-order bits, because to access a table of size 2 ** i (e.g. 32768), the hash value (often much larger than that) must be crunched to fit, and this is done very simply and quickly by taking the low order i (e.g. 15) bits of the hash value.

Consequently collisions are inevitable.

However, this is not cause for panic. The remaining bits of the hash value are factored into the calculation of where the next probe will be. The likelihood of a 3rd (etc.) probe being needed should be rather small, especially as the dict is never more than 2/3 full. The cost of multiple probes is mitigated by the cheap cost of calculating the slot for the first and subsequent probes. (A sketch simulating this probe sequence appears after the interpreter session below.)

The interpreter session below is a simple experiment illustrating most of the above discussion. It presumes random accesses of the dict after it has reached its maximum size. With Python 2.7.1, it shows about 2000 collisions for 15000 objects (13.3%).

In any case the bottom line is that you should really divert your attention elsewhere. Collisions are not your problem unless you have achieved some extremely abnormal way of getting memory for your objects. You should look at how you are using the dicts, e.g. use k in d or try/except, not d.has_key(k). Consider one dict accessed as d[(x, y)] instead of two levels accessed as d[x][y]. If you need help with that, ask a separate question.
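
For instance, a sketch of those access patterns (x, y and distance are placeholder names, not from the question):

x, y, distance = 1, 2, 3.5      # hypothetical sparse-matrix entry

d = {}
d[(x, y)] = distance            # one dict keyed by a tuple,
                                # rather than two levels: d[x][y]

if (x, y) in d:                 # membership test: use `in` ...
    dist = d[(x, y)]

try:                            # ... or try/except, not d.has_key()
    dist = d[(x, y)]
except KeyError:
    dist = None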

Update after testing on Python 2.6:

Rotating the address was not introduced until Python 2.7; see this bug report for comprehensive discussion and benchmarks. The basic conclusions are IMHO still valid, and can be augmented by "Update if you can".

>>> n = 15000
>>> i = 0
>>> while 2 ** i / 1.5 < n:
...    i += 1
...
>>> print i, 2 ** i, int(2 ** i / 1.5)
15 32768 21845
>>> probe_mask = 2 ** i - 1
>>> print hex(probe_mask)
0x7fff
>>> class Foo(object):
...     pass
...
>>> olist = [Foo() for j in xrange(n)]
>>> hashes = [hash(o) for o in olist]
>>> print len(set(hashes))
15000
>>> probes = [h & probe_mask for h in hashes]
>>> print len(set(probes))
12997
>>>
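
As promised above, here is a rough simulation of that probe sequence, based on the recurrence described in the comments in Objects/dictobject.c for CPython 2.7 (i = i*5 + perturb + 1, with perturb starting as the hash and shifted right by 5 each step). It is a sketch, not an instrumentation of the real dict: it assumes a 64-bit build, fixes the table at its final size up front, and ignores resizing and deletions:

import random

PERTURB_SHIFT = 5
SIZE_T_MASK = 2 ** 64 - 1   # emulate C size_t wraparound (64-bit build assumed)

def probe_sequence(h, mask):
    # Yield the table slots CPython would examine for hash value h.
    i = h & mask
    yield i
    perturb = h & SIZE_T_MASK   # C casts the signed hash to size_t
    while True:
        i = (i * 5 + perturb + 1) & SIZE_T_MASK
        perturb >>= PERTURB_SHIFT
        yield i & mask

def count_extra_probes(keys, size):
    # Insert keys into a simulated table; count probes beyond the first slot.
    mask = size - 1
    table = [None] * size
    extra = 0
    for k in keys:
        for n_probes, slot in enumerate(probe_sequence(hash(k), mask)):
            if table[slot] is None or table[slot] == k:
                table[slot] = k
                extra += n_probes
                break
    return extra

keys = [random.randint(15000000, 18000000) for j in xrange(15000)]
print count_extra_probes(keys, 32768)   # table size from the session above
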
John Machin
  • This is very good - thanks! This is all really helpful. Ok, I have two questions/comments: (1) Rather than adding "o" to the dictionary (where o is an instance of an object), I am adding id(o). Presumably, this doesn't rotate the address right by 4 bits and might be giving me more collisions than would be expected. If so, I should use o rather than id(o). (2) I am using two levels of dicts: d[x][y], because for a given x, I want to iterate through all its neighbours (all y). Is this fast to do if you use d[(x, y)]? I can post this as a separate question, if that's more appropriate. – Adam Nellis Feb 01 '11 at 22:43
  • @Adam Nellis: (1) Using `id(o)` instead of `o` as a dict key is wasting a function call and getting a result that can't be better and is likely to be worse. (2) No. You would have to iterate over all items and ignore ones with non-interesting x values. – John Machin Feb 01 '11 at 22:57
  • @John Machin: Thanks for your help. From reading Objects/object.c, I believe you about hash(o) != id(o), but when I print out [hash(o) for o in olist] and [id(o) for o in olist], they are the same. Am I missing something? (I'm running Python 2.6.2) – Adam Nellis Feb 02 '11 at 12:08
  • @John Machin: Also, I think you have a slight typing mistake in your answer. You've got an extra order of magnitude in "12997", as this isn't 13.3% of 15000. Running your code, I get similar percentages of collisions (I get 6.8% collisions on my system). – Adam Nellis Feb 02 '11 at 12:11
  • @Adam Nellis: (1) The rotate-the-address enhancement isn't in 2.6; see my updated answer. (2) 12997 = number of unique probes; number of collisions = 15000 - 12997 i.e. 2003 i.e. 13.4% – John Machin Feb 02 '11 at 21:58
  • @John Machin: (2) Oops, yes, sorry - I'm an idiot! (1) Wow, yes - after running it on both Python 2.6 and Python 2.7 (and counting the collisions, rather than unique probes!) I get 93% collisions on Python 2.6 (or Python 2.7 using `id(o)`) but only 17% collisions on Python 2.7 (using `hash(o)`)! So yes, collisions were a problem. I've updated to 2.7 and re-written my code to hash object instances rather than their IDs. (It's still running too slowly though, but not as slowly as it was :)) – Adam Nellis Feb 03 '11 at 14:15
5

This idea doesn't actually work; see the discussion in the question.

A quick look at the C implementation of Python shows that the code for resolving collisions does not calculate or store the number of collisions.

However, it will invoke PyObject_RichCompareBool on the keys to check if they match. This means that __eq__ on the key will be invoked for every collision.

So:

Replace your keys with objects that define __eq__ and increment a counter when it is called. This will be slower because of the overhead involved in jumping into Python for the compare. However, it should give you an idea of how many collisions are happening.

Make sure you use different objects as the key, otherwise Python will take a shortcut because an object is always equal to itself. Also, make sure the objects hash to the same value as the original keys.
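
A minimal sketch of such a wrapper (the class name and counter are mine, not from the answer). Per the discussion in the question, bear in mind that CPython only calls __eq__ when two keys' full hash values match, so this counts hash matches rather than probe-slot collisions:

import random

eq_calls = 0

class CountingKey(object):
    # Wraps an int key: hashes like it, counts __eq__ invocations.
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return hash(self.value)   # same hash as the original integer key
    def __eq__(self, other):
        global eq_calls
        eq_calls += 1
        return self.value == other.value

d = {}
for i in xrange(15000):
    d[CountingKey(random.randint(15000000, 18000000))] = 0

print eq_calls   # here, only duplicate random ints trigger __eq__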

Winston Ewert