
I'm trying to register a dynamic backward hook on each neuron's weights in a network. By dynamic I mean that it will take a value and multiply the associated gradients by that value.

From here it seems like it's possible to register a hook on a tensor with a fixed value (though note that I need it to take a value that will change). From here it also seems like it's possible to register a hook on all of the parameters -- they use it to do gradient clipping (though note that I'm trying to only do it on each neuron's weights).

If my network is as follows:

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

        self.fc1 = nn.Linear(3,5)
        self.fc2 = nn.Linear(5,10)
        self.fc3 = nn.Linear(10,1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        return x 

The first layer has 5 neurons, each with 3 associated weights. Hence, this layer should have 5 hooks that modify (i.e. change the current gradient by multiplying it) the gradients of their 3 associated weights during the backward step.

Training pseudo-code example:

net = Model()
for epoch in epochs:
    out = net(data)
    loss = criterion(out, target)
    optimizer.zero_grad()
    loss.backward()
    for hook in list_of_hooks: #not sure if there's a more "pytorch" way of doing this without a for loop
        hook(random_value)
    optimizer.step()
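
To make the goal concrete, the effect I'm after would be roughly equivalent to this manual, hook-free version (the random values here are just placeholders):

values = torch.rand(net.fc1.out_features)  # one placeholder value per fc1 neuron
# after loss.backward(), scale each neuron's gradient row by its own value
for i in range(net.fc1.out_features):
    net.fc1.weight.grad[i] *= values[i]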
1 Answer


What about exploiting a lambda's closure over names?

A short example:

import torch

net_params = torch.rand(5, 3, requires_grad=True)

msg = "Hello!"

net_params.register_hook(lambda g: print(msg))


out1 = net_params * 2.

loss = out1.sum()
loss.backward()  # Activates the hook and prints "Hello!"


msg = "How are you?"  # The lambda is affected by this change

out2 = net_params ** 4.
loss2 = out2.sum()

loss2.backward()  # Activates the hook again and prints "How are you?"
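
This works because the lambda looks up `msg` by name when the hook runs, not when it is defined, so rebinding the name between backward passes changes what the hook sees.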

So a possible solution to your problem:

net = Model()
# Replace it with your computed values
rand_values = torch.rand(net.fc1.out_features, net.fc1.in_features)

net.fc1.weight.register_hook(lambda g: g * rand_values) 

for epoch in epochs:
    out = net(data)
    loss = criterion(out, target)
    optimizer.zero_grad()
    loss.backward()  # fc1 gradients are multiplied by rand_values
    optimizer.step()

    # Update rand_values. The lambda computation will change accordingly
    rand_values = torch.rand(net.fc1.out_features, net.fc1.in_features)

Edit

To make things clearer: if you specifically want to multiply each neuron `i`'s set of weights by a single value `vi`, you can exploit broadcasting semantics and define `values = torch.tensor([v0, v1, v2, v3, v4]).reshape(5, 1)`; the lambda then becomes `lambda g: g * values`.
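
As a minimal sketch of that per-neuron version, also extended to fc2 with its own value tensor (the random values and the training-loop placeholders are assumptions carried over from the question's pseudo-code):

net = Model()

# One value per out-neuron, shaped as a column so broadcasting scales
# each row of the (out_features, in_features) weight gradient
v = torch.rand(net.fc1.out_features, 1)  # (5, 1) for fc1
u = torch.rand(net.fc2.out_features, 1)  # (10, 1) for fc2

net.fc1.weight.register_hook(lambda g: g * v)
net.fc2.weight.register_hook(lambda g: g * u)

for epoch in epochs:
    out = net(data)
    loss = criterion(out, target)
    optimizer.zero_grad()
    loss.backward()  # row i of each weight gradient is multiplied by v[i] / u[i]
    optimizer.step()

    # Rebinding the names updates what the lambdas see on the next backward
    v = torch.rand(net.fc1.out_features, 1)
    u = torch.rand(net.fc2.out_features, 1)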

  • Interesting, thanks for the help! A few questions. When you say `fc1 weights are multiplied by rand_values`, do you mean the gradients are multiplied? Also, the `lambda` seems pretty cool, as it takes a value, so I think it might solve the dynamic part of the hook that I'm looking for. But it seems like you're applying it over all the neurons in `fc1`, whereas I need to apply a unique hook to each neuron – Axo Feb 20 '22 at 15:59
  • Appreciate the clarification! I still don't see how to register a hook on each neuron. Wouldn't `lambda g, i=i: g[:, i] * rand_values[:, i] for i in range(5)` just create 5 hooks in general on a specific layer? So it will just multiply the gradients 5 times. If you can edit the answer to show the complete `hook per neuron` solution, I'll accept it quickly! – Axo Feb 21 '22 at 14:20
  • Forget the previous comment, it's wrong and I'll delete it. The code in the answer multiplies the gradient weights of **each** neuron at the same time. Do you need a sequential approach (e.g. the gradient of neuron `i + 1` has to be multiplied by something dependent on the gradient of neuron `i`)? – aretor Feb 21 '22 at 16:38
  • Maybe the `torch.rand(net.fc1.out_features, net.fc1.in_features)` confuses me. It looks like it's 1 hook that multiplies each weight by a random value, where I need 1 hook per set of weights (the ones corresponding to a neuron's in-weights). So the idea is to multiply each neuron's set of weights by one value, and each neuron's weights will have a different value that they are multiplied by. Does this make more sense? – Axo Feb 21 '22 at 17:38
  • The point is that creating a hook per out-neuron weight set is quite cumbersome and inefficient. Instead, multiplying the out-neurons weight matrix by a vector containing your values is much better. So you can set a tensor `v = torch.tensor([v1, v2, v3, v4, v5]).reshape(5, 1)` (where `v1`, ..., `v5` are your out-neuron values) and then you perform `g * v`. This is equivalent to multiplying `g[i, :] * vi` (that is "multiplying each set of weights per neuron by one value"). Tell me if I got it right. – aretor Feb 22 '22 at 09:43
  • I think so? Just to make sure I understand: for `fc1` we will do `v = torch.tensor([v1, v2, v3, v4, v5]).reshape(5, 1)` and multiply `g * v`. Each `v_i` will take a different random value and multiply the neuron's weight gradients (there are 3 weights for each of the neurons in that layer) by that value? How does each `v_i` know which neuron it corresponds to? If that's the case, then for `fc2`, where we have 10 neurons with 5 weights each, we will do `v = torch.tensor([v1, v2,...,v10]).reshape(10, 1)`? Then in the training loop do I just change the values of each `v_i`? – Axo Feb 22 '22 at 15:17
  • Yes, the reasoning is correct; let me add some more details. (a) `v[i]` corresponds to the weight set of out-neuron `i`. (b) For `fc2` you can just define another hook and multiply its gradient by another tensor `u`. (c) Of course, `v` need not be random; you can set it with whatever values you need. (d) In the training loop you can either change the entire tensor `v` at once (more efficient) or loop through each value `v[i]` and change it (far less efficient for large layer sizes). – aretor Feb 22 '22 at 15:58
  • Okay, this is starting to make sense and look like what I want. For `(a)`, if I run `net.fc1.weight` I get a tensor of size `(5,3)`. Are you saying that `v[1]` will correspond to the first tensor (the 3 weights) and will multiply only these 3 weight gradients, and `v[2]` will multiply the next tensor? – Axo Feb 22 '22 at 16:09
  • Yes, you can experiment with `net.fc1.weight` or `net.fc1.weight.grad` in order to understand the mechanism better (have a look at [broadcasting semantics](https://pytorch.org/docs/stable/notes/broadcasting.html)). Just a detail, it is `v[0]` and `v[1]`, since Python indexing starts from 0. – aretor Feb 22 '22 at 17:00
  • Oops, I meant `v1` rather than `v[1]`. Okay great. So I think this is a solution. Can you edit your answer for completeness? i.e. have the 3 hooks (`v = torch.tensor([v1, v2, v3, v4, v5]).reshape(5, 1)`, `u = torch.tensor([u1,..., u10]).reshape(10, 1)`...) and their corresponding update values (can just be random). I'll accept it! – Axo Feb 22 '22 at 17:13
  • Also note that when I run `net.fc1.register_hook` I get `AttributeError: 'Linear' object has no attribute 'register_hook'`. Also, when I try to register a hook on each layer (I tried `register_backward_hook` because `register_hook` didn't work), I get `TypeError: <lambda>() takes 1 positional argument but 3 were given` – Axo Feb 23 '22 at 14:48
  • Does `net.fc1.weight.register_hook` work instead? – aretor Feb 23 '22 at 15:01
  • Yes. But I still get the `TypeError: <lambda>() takes 1 positional argument but 3 were given` for some reason – Axo Feb 23 '22 at 15:17
  • I tried the code with that last modification I mentioned (I updated the answer) and now it works fine. The error may be related to your specific implementation. – aretor Feb 23 '22 at 15:39
  • Got it. Looks good! Doing some experiments to see that this works and I'll accept it! Thanks a lot – Axo Feb 23 '22 at 23:28
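
As a quick sanity check of the broadcasting claim discussed in the comments above (shapes borrowed from `fc1`, values hypothetical):

import torch

g = torch.rand(5, 3)  # stand-in for fc1's weight gradient
v = torch.rand(5, 1)  # one value per out-neuron
# g * v scales row i of g by v[i], i.e. each neuron's weight set by one value
assert torch.allclose(g * v, torch.stack([g[i] * v[i] for i in range(5)]))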