*23 Jul 2015, 09:50 UTC*

The literature on Deep Learning is somewhat intractable for a beginner. One can get a feeling for it, but little more. Also, Deep Learning is a somewhat ad-hoc subject anyway.

After failing to understand the second long overview I read, I thought I'd try and figure it out myself.

Let's start with some axioms. These axioms are not found in any Deep Learning texts I could find, but seem reasonable given Stafford Beer's work on cybernetics. Here they are simply quoted; for some justification see e.g. "Brain of the Firm" (although Beer mostly relates these as theories that come from practical experience anyway):

- AS ABOVE, SO BELOW - every layer of the neural network works in the same way regarding teaching methods (although each layer has to be highly customizable in terms of actual function, the basic model of teaching must be the same at each level); this is justified by Beer's hierarchical/fractal model, where the connections between "manager" and "managee" take the same form, regardless of which hierarchical level we're looking at.
- LEARNING IS LAYER-LOCAL - there can be no global optimization pass; this doesn't (appear to) happen in the brain, and if our job is teaching a large distributed system (e.g. a multinational corporation) there may not be enough processing power in the world to optimize the behaviour of every low-level component at once. On the other hand, optimizing only individual neurons without regard to other neurons in that layer is never going to work.

These axioms come from cybernetics, so they would apply to any learning network, whatever underlying technology it is based on (we're looking at Neural Networks, but we could be looking at Bayesian networks or logic nets instead). They tell us that the learning algorithm at each stage can only act on:

- Local state of the layer
- (Data) signals coming in from lower layers
- (Teaching) signals coming in from higher layers

(Actually, higher layers may send other signals too - more on this later)

A DNN layer has some number of input values, N, and some number of output values, M.

In the literature, the input values are multiplied by an MxN matrix (M rows, one per output; N columns, one per input) to create M linear output values, which are then offset before going through some non-linearity. Examples of non-linearities used in DNNs are:

- log, exp, tanh, max, min, clamp

So a DNN layer might perform O = tanh(A.I + B) where A is an MxN matrix and B is an M-component vector.

Note that tanh is invertible, so we have A.I + B = tanh'(O) (here ' means the inverse, not the derivative). If A is invertible too, which requires M = N, then we have I = A'.(tanh'(O) - B). In other words, given a set of desired outputs we can calculate the exact desired inputs to achieve this goal. If the whole DNN is invertible, then, we can provide a desired output vector and generate a suitable input vector - i.e. we can synthesize input data from semantic data!
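Here is a minimal sketch of that round trip for a 2x2 layer, using only the Python standard library. The function names and the numbers in `A`, `B` and `I` are made up for illustration; `math.atanh` plays the role of the inverse of tanh.

```python
import math

def forward(A, B, I):
    """O = tanh(A.I + B) for a 2x2 matrix A and 2-vectors B, I."""
    return [math.tanh(A[r][0] * I[0] + A[r][1] * I[1] + B[r]) for r in range(2)]

def inverse(A, B, O):
    """I = A'.(atanh(O) - B), where A' is the matrix inverse of A."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("A is singular, so the layer cannot be inverted")
    lin = [math.atanh(O[r]) - B[r] for r in range(2)]  # undo tanh, then the offset
    # Closed-form inverse of a 2x2 matrix applied to lin.
    return [(A[1][1] * lin[0] - A[0][1] * lin[1]) / det,
            (A[0][0] * lin[1] - A[1][0] * lin[0]) / det]

A = [[0.5, -0.3], [0.2, 0.8]]  # illustrative weights (assumption)
B = [0.1, -0.2]                # illustrative offsets (assumption)
I = [0.4, -0.7]

O = forward(A, B, I)       # analysis: input -> output
I_back = inverse(A, B, O)  # synthesis: desired output -> input
print(O, I_back)
```

Feeding the output back through `inverse` recovers the original input up to floating-point rounding, which is the analysis/synthesis duality in miniature.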

(Have you ever made a new friend and found yourself talking like them? This suggests that learning someone's idiosyncrasies in the "understanding" part of your brain leads to using those idiosyncrasies in the "synthesis" part of your brain - i.e. that perhaps the *same neural circuits* are used to do both things! Also, how is it that we are capable of having reasonably coherent dreams with plots, where we can "see" the activity? Could it be that the same neural circuits that allow us to see can be run in reverse?)

This duality between analysis and synthesis seems important.

It's worth reflecting on this for a while. Suppose we have no non-linearity, then O = A.I + B. In other words, each neuron performs a + operation across an input vector. If elements of A are zero, these inputs are excluded from the operation. For a single neuron, fed by two neurons below, we have O = aX + bY + c.

Suppose we have conjugate non-linearities at the output of the previous layer, and the input of this layer. One such pair is log/exp. For a single neuron, fed by two neurons below, we have O = exp(a.log(X) + b.log(Y) + log(c)) = X^a.Y^b.c (for positive X, Y and c), so each neuron performs a * operation across an input vector. If elements of A are zero, these inputs are again excluded. The weights in this case raise the inputs to a given power before multiplying.

Suppose the values are probabilities. Then multiplying them is equivalent to saying "my output is only likely if all the inputs are likely" - i.e. it's a probability operation biased towards consensus, or a sort of fuzzy AND gate.
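A quick numerical check of the log/exp neuron, as a sketch only: the function name and the sample probabilities are my own choices, and the inputs must be positive for the logs to exist.

```python
import math

def product_neuron(X, Y, a, b, c):
    """exp(a.log(X) + b.log(Y) + log(c)); needs X, Y, c > 0."""
    return math.exp(a * math.log(X) + b * math.log(Y) + math.log(c))

def direct_product(X, Y, a, b, c):
    """The same quantity written directly as X^a.Y^b.c."""
    return (X ** a) * (Y ** b) * c

both_likely = product_neuron(0.9, 0.9, 1.0, 1.0, 1.0)   # about 0.81
one_unlikely = product_neuron(0.9, 0.1, 1.0, 1.0, 1.0)  # about 0.09
y_excluded = product_neuron(0.9, 0.1, 1.0, 0.0, 1.0)    # b = 0 drops Y: about 0.9
print(both_likely, one_unlikely, y_excluded)
```

The output is only high when both inputs are high, unless a zero weight removes an input from the vote.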

Now look at tanh. If we have O = tanh(aX + bY + c) and X,Y are probabilities, then, given large enough positive weights a and b, O will be about 1 if either X or Y is about 1. So this is a sort of fuzzy OR gate.

Or, if X,Y are in the scale [-1,1] then we have a kind of fuzzy majority vote - most voting yes means we vote yes; most voting no means we vote no; and a hung vote means we give an undecided output.
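Both readings of tanh can be tried out with a few lines. This is a sketch under assumptions: the gain of 3.0 for the OR case and the unit weights for the vote are values I picked, not anything prescribed above.

```python
import math

def tanh_neuron(inputs, weights, bias=0.0):
    """O = tanh(sum of weight * input + bias)."""
    return math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)

GAIN = 3.0  # assumed weight; larger values make the OR sharper

# Fuzzy OR on probabilities in [0, 1].
or_cases = {(x, y): tanh_neuron([x, y], [GAIN, GAIN])
            for x in (0.0, 1.0) for y in (0.0, 1.0)}

# Fuzzy majority vote on inputs in [-1, 1].
vote_yes = tanh_neuron([1, 1, -1], [1, 1, 1])   # two of three say yes
vote_no = tanh_neuron([-1, -1, 1], [1, 1, 1])   # two of three say no
vote_hung = tanh_neuron([1, -1], [1, 1])        # split vote

print(or_cases, vote_yes, vote_no, vote_hung)
```

With these settings the OR output is 0 when both inputs are 0 and close to 1 otherwise, while the vote output takes the sign of the majority and is exactly 0 for a split.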

Now look at log/exp in the other direction. O = log(a.exp(X) + b.exp(Y) + c). If elements of A are zero, again the input is excluded. For X,Y positive, exp(X) grows very quickly as X increases, so the overall output is approximately equal to the larger of X and Y, i.e. this is a fuzzy MAX operation.
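This form is usually called log-sum-exp. A small sketch (function name and sample values are mine) shows how close it gets to a true MAX:

```python
import math

def soft_max_neuron(X, Y, a=1.0, b=1.0, c=0.0):
    """O = log(a.exp(X) + b.exp(Y) + c)."""
    return math.log(a * math.exp(X) + b * math.exp(Y) + c)

far_apart = soft_max_neuron(10.0, 2.0)           # just above 10
tie = soft_max_neuron(5.0, 5.0)                  # 5 + log(2), the worst case
x_excluded = soft_max_neuron(10.0, 2.0, a=0.0)   # a = 0 drops X: about 2
print(far_apart, tie, x_excluded)
```

The approximation is best when one input clearly dominates; for a tie the output overshoots by log(2).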

Now look at sqr and sqrt. O = sqrt(a.sqr(X) + b.sqr(Y) + c). With a = b = 1 and c = 0 this calculates the L2 norm; other weights give a weighted norm.
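For completeness, the sqr/sqrt neuron as a short sketch (the function name and sample values are illustrative):

```python
import math

def norm_neuron(X, Y, a=1.0, b=1.0, c=0.0):
    """O = sqrt(a.sqr(X) + b.sqr(Y) + c)."""
    return math.sqrt(a * X * X + b * Y * Y + c)

plain = norm_neuron(3.0, 4.0)                   # unit weights: the 3-4-5 triangle
weighted = norm_neuron(3.0, 4.0, a=4.0, b=0.0)  # b = 0 drops Y, a = 4 doubles |X|
print(plain, weighted)
```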

So by using different non-linearities we can approximately synthesize +, *, AND, OR, MAX, and L2-NORM. It's easy to derive MIN, and by changing weights we can derive NAND, NOR, ANDN etc. and other norms.

The weird thing is, current DNNs seem to choose *one* operation and stick with it (not sure, maybe some don't) whereas surely we need to be able to *adapt* which operation we're doing according to requirements? A DNN that only has tanh for instance might turn out to be like me saying "well, I tried implementing this function using only AND gates, and I got a solution that's pretty good". Whereas the code might be both simpler and more reliable if we can use AND and OR gates. Certainly, programmers like to use all the operations, not just pick from a restricted set and do their best. Could this be holding DNN development back? (I'm aware that you can make OR gates from AND and NOT gates, but you can't make * very easily from gates if that's what you need and gates is all you have). Or maybe some non-linearities are general, and can effectively produce any primitive function given suitable weights. Such a non-linearity would be worth seeking.

But ... all of tanh, log, exp are invertible ... so we have the duality of analysis and synthesis as well.

So should we have both input and output non-linearities per layer, so they can be adjusted? And how adjustable can they be?

The last non-linearity is applied to a vector in parallel. Given a desired output, we run it through the inverse transform to get desired linear output. Each non-linearity can be tweaked, then, and they don't all have to be the same.

The first non-linearity, though, is applied N^2 times (N times in each of the N nodes, taking M = N as above). If these are different in different nodes, then the inverse matrix problem becomes intractable, as it's no longer linear (the non-linearity pops into the problem to be solved). The only way to prevent this is if the non-linearities are the same for each input in every neuron. But then only N distinct evaluations are needed (one per input value), so repeating them inside every neuron is inefficient. Thus, we should *not* have a tunable non-linearity at the input to each neuron, only at the outputs.

But that means the non-linearity performs two jobs - one is the post-conjugate operation from the previous layer, and one is the pre-conjugate operation for the next layer. So if we aim to tweak the operations of each layer, we will need a series of two non-linearities at each output node, where one can be tweaked by the layer and one can be tweaked by the next layer. Since we're running matrix code (we don't actually have different neurons) we *could* then move the pre-conjugate operation into the next layer to get better separation of concerns, so long as it's applied uniformly *as if* it were in the previous layer.

i.e.

```
Ilin = preNL(I)
Olin = A.Ilin + B
O = postNL(Olin)
```

The inverse being:

```
Olin = postNL'(O)
Ilin = A'.(Olin - B)
I = preNL'(Ilin)
```
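The two listings above translate fairly directly into code. The following is a sketch for a 2x2 layer only; the `Layer` class, the helper names, and the choice of log as `preNL` and tanh as `postNL` are my assumptions for the sake of a runnable example.

```python
import math

def matvec(A, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("A must be invertible")
    return [[d / det, -b / det], [-c / det, a / det]]

class Layer:
    """One layer: O = postNL(A.preNL(I) + B), with both NLs invertible."""

    def __init__(self, A, B, pre, pre_inv, post, post_inv):
        self.A, self.B = A, B
        self.pre, self.pre_inv = pre, pre_inv
        self.post, self.post_inv = post, post_inv

    def forward(self, I):
        Ilin = [self.pre(x) for x in I]
        Olin = [y + b for y, b in zip(matvec(self.A, Ilin), self.B)]
        return [self.post(y) for y in Olin]

    def inverse(self, O):
        Olin = [self.post_inv(y) for y in O]
        Ilin = matvec(inv2(self.A), [y - b for y, b in zip(Olin, self.B)])
        return [self.pre_inv(x) for x in Ilin]

# Illustrative weights; log needs positive inputs.
layer = Layer(A=[[1.0, 0.5], [-0.5, 1.0]], B=[0.1, -0.1],
              pre=math.log, pre_inv=math.exp,
              post=math.tanh, post_inv=math.atanh)

I = [0.8, 1.5]
O = layer.forward(I)
I_back = layer.inverse(O)
print(O, I_back)
```

Passing the non-linearities in as function pairs keeps each one tweakable per layer, while the matrix part stays a plain linear solve.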

How do we adjust weights? Well, if we want to synthesize, we need the matrix to be invertible. This suggests the matrix should be adjusted only using invertible operations. An invertible matrix can be written in SVD form A = VDW, where V and W are orthogonal and D is diagonal with non-zero entries. If we allow V and W to be operated on (pre- and post-) using simple orthogonal matrices e.g. small Givens rotations, and we allow D to be operated on using simple diagonal matrices (e.g. all elements 1 except one element that is close to, but not exactly, 1) then we preserve invertibility automatically. This suggests that the weight adjustment protocol must work like this.
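To make the idea concrete, here is a sketch that stores the weights as the three factors and only ever changes them with small plane rotations (Givens rotations) and positive diagonal scalings. The `FactoredWeights` class and its method names are hypothetical; how to *choose* the rotation angles and scale factors from a teaching signal is not addressed here.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def diag(d):
    return [[d[i] if i == j else 0.0 for j in range(len(d))] for i in range(len(d))]

def givens(n, i, j, theta):
    """n x n rotation by theta in the (i, j) plane; always orthogonal."""
    G = identity(n)
    c, s = math.cos(theta), math.sin(theta)
    G[i][i], G[j][j] = c, c
    G[i][j], G[j][i] = -s, s
    return G

class FactoredWeights:
    """Keeps A as V.D.W so that every update preserves invertibility."""

    def __init__(self, n):
        self.V, self.d, self.W = identity(n), [1.0] * n, identity(n)

    def rotate_V(self, i, j, theta):
        self.V = matmul(givens(len(self.d), i, j, theta), self.V)

    def rotate_W(self, i, j, theta):
        self.W = matmul(self.W, givens(len(self.d), i, j, theta))

    def scale(self, k, factor):
        if factor <= 0:
            raise ValueError("factor must be positive to keep D invertible")
        self.d[k] *= factor

    def matrix(self):
        return matmul(matmul(self.V, diag(self.d)), self.W)

    def inverse(self):
        # (V.D.W)' = W^T . D' . V^T because V and W are orthogonal.
        inv_d = [1.0 / x for x in self.d]
        return matmul(matmul(transpose(self.W), diag(inv_d)), transpose(self.V))

w = FactoredWeights(3)
w.rotate_V(0, 1, 0.1)
w.scale(2, 1.05)
w.rotate_W(1, 2, -0.2)
check = matmul(w.matrix(), w.inverse())  # should be the identity
print(check)
```

A side benefit of this representation is that the inverse never needs a general matrix inversion: it falls out of two transposes and a reciprocal per diagonal entry.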

NB: NO weight adjustment protocol I've seen works like this, although I don't understand ANY of them yet anyway.

What are the constraints we have?

- Our teaching signal comes from above
- We might want to learn without a teaching signal, to get information about input data structure
- When we receive a teaching signal we have two basic choices and we can do either or both: (a) modify our own weights (b) send a teaching signal to our child layers

What is the teaching signal? Well, at the highest level, it consists of the programmer/user saying "you should have output this, not that, for this case". Using the AS ABOVE SO BELOW axiom, this method should therefore be used in ALL layers. So the teaching signal is "this was the desired output".

But the layer is invertible so we can compute the desired input given the desired output. Then we can say to the layer below "you should have given me this".

How would this work in a management situation? Something goes wrong, so you think (a) how should I change my management to make this not occur in future, but you *also* think (b) it would have helped if my managee had given me this information instead of that information beforehand. So you *both* adjust your own weights *and* you instruct your managees. These two operations must be done in some blend. Presumably the limit is *how fast can you adapt?* i.e. you make *small* changes to your weights, and once this is done you compute the input you *wanted*, and then you send that downwards. Then that layer does the same operation. Finally, the lowest layer can only adjust its own weights. Once it does so it propagates back upwards, until you reach the initial layer again, now with adjusted input data. You can now repeat this process if you like, so in theory for a single example you can iterate until you are producing the exact required output (or is that over-learning?)
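One round of this down-then-up protocol can be sketched as follows. Big caveats: the weight update used here (a small delta-rule step) is purely my assumption, since the actual adjustment method is left open; the non-linearities are taken to be the identity to keep the arithmetic checkable; and all names and numbers are hypothetical.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve2(A, y):
    """Solve A.x = y for a 2x2 matrix A."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("layer is no longer invertible")
    return [(d * y[0] - b * y[1]) / det, (a * y[1] - c * y[0]) / det]

class Layer:
    def __init__(self, A, B, rate=0.1):
        self.A, self.B, self.rate = A, B, rate

    def forward(self, I):
        self.I = I  # local state: remember the last input seen
        return [y + b for y, b in zip(matvec(self.A, I), self.B)]

    def teach(self, desired_O):
        """Nudge own weights a little, then return the input we wish we had received."""
        O = self.forward(self.I)
        err = [t - o for t, o in zip(desired_O, O)]
        for r in range(2):  # ASSUMED update rule (delta rule), not from the text
            for col in range(2):
                self.A[r][col] += self.rate * err[r] * self.I[col]
            self.B[r] += self.rate * err[r]
        return solve2(self.A, [t - b for t, b in zip(desired_O, self.B)])

def run(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x

def teaching_round(layers, x, target):
    run(layers, x)                  # data signals travel upwards
    desired = target
    for layer in reversed(layers):  # teaching signals travel downwards
        desired = layer.teach(desired)
    return run(layers, x)           # back up again with the adjusted weights

def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

layers = [Layer([[1.0, 0.2], [0.0, 1.0]], [0.0, 0.0]),   # lowest layer
          Layer([[1.0, 0.0], [0.3, 1.0]], [0.1, -0.1])]  # highest layer
x, target = [0.5, -0.5], [0.2, 0.4]

before = distance(run(layers, x), target)
after_one = distance(teaching_round(layers, x, target), target)
print(before, after_one)
```

Each layer uses only its own state, the data from below and the desired output from above, so the update is layer-local. Note that the delta rule gives no guarantee that the matrix stays invertible; combining this protocol with an invertibility-preserving update is left open.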

The actual method for adjusting weights isn't something I've figured out yet. I keep reading about gradient descent etc., and it seems that in some cases you need the whole *set* of input/output data to optimize (batch learning), while in other cases you optimize a small amount for a single example, turn by turn (online or stochastic learning). In the latter case you might need to repeat the same training data over and over again until you "get it". Hmm, which is more like the human brain?
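The difference between the two styles shows up even when fitting a single weight. This is a toy sketch of plain gradient descent on a squared error, with data, learning rate and function names chosen by me; it is not a claim about how any particular DNN library does it.

```python
def batch_fit(xs, ys, rate=0.05, epochs=50):
    """Use the whole data set for every weight update."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= rate * grad
    return w

def online_fit(xs, ys, rate=0.05, epochs=50):
    """Update a little after each single example, then repeat the data."""
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w -= rate * 2 * (w * x - y) * x
    return w

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # generated by y = 2x
w_batch = batch_fit(xs, ys)
w_online = online_fit(xs, ys)
print(w_batch, w_online)
```

Both versions end up near w = 2 here; the online version gets there by seeing the same three examples over and over, which matches the "repeat until you get it" description.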

If you're having trouble understanding someone, what they're saying just doesn't compute. Then suddenly you figure it out, and the noises they're making suddenly *sound like* what they're saying, when before they didn't. This suggests that the high-level semantic layers are somehow able to tune the lower-level layers, perhaps dynamically/temporarily. Another example is this: the speech recognizer figures out "oh, this is Eddie talking". Then the high-level semantic layers know that Eddie talks about computers a lot, but rarely about modern art. This means that if he says a word that has a computer meaning and a modern art meaning, more than likely he's talking about computers. (The fact that the *current conversation* is about computers is also useful input data.) This helps resolve ambiguities at lower levels, so in some sense higher levels should be able to "focus" lower levels i.e. tell them to pay attention to some things and ignore other things (such as the possibility of Eddie using modern art words). Thinking about it, it seems the *number* of possible foci is finite and small - we might adapt to our partner's speech, but not to everybody's, and we might talk about a few subjects, but not all subjects. It also seems that each focus requires storage at each neuron, so they can change "mode". So perhaps we need weights which are in some sense conditional i.e. p(W1 = value | mode="computers").
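A very rough sketch of mode-dependent weights, with everything in it hypothetical: the `ModalNeuron` class, the two modes and the numbers are invented, and it uses a hard switch between stored weight sets rather than a real conditional probability distribution over weights.

```python
import math

class ModalNeuron:
    """A neuron that stores one (weights, bias) pair per mode, selected from above."""

    def __init__(self, weights_by_mode, default_mode):
        self.weights_by_mode = weights_by_mode
        self.mode = default_mode

    def focus(self, mode):
        # Only a small, fixed set of modes is stored at the neuron.
        if mode not in self.weights_by_mode:
            raise KeyError(f"no weights stored for mode {mode!r}")
        self.mode = mode

    def output(self, inputs):
        weights, bias = self.weights_by_mode[self.mode]
        return math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Output > 0 reads an ambiguous word in its computing sense, < 0 in its art sense.
neuron = ModalNeuron(
    {"computers": ([1.0, 1.0], 0.5), "modern art": ([1.0, 1.0], -0.5)},
    default_mode="computers",
)

ambiguous = [0.1, -0.1]  # low-level evidence that favours neither reading

neuron.focus("computers")
as_computers = neuron.output(ambiguous)

neuron.focus("modern art")
as_art = neuron.output(ambiguous)

print(as_computers, as_art)
```

The same ambiguous input tips one way or the other depending on the focus sent down from the higher layer, at the cost of storing one weight set per mode.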

I don't even know :)