Brian Castle
Computations


Self-organization in the broader sense refers to whole-brain wiring; the machine learning community, however, uses the term to refer specifically to the synaptic modeling of data. We reviewed some of the synaptic mechanisms for plasticity on the previous pages. Now we'll look at what we can do with them.


Memorization

Two of the great strengths of the human network are our ability to store and retrieve memory patterns after just one exposure, and our ability to infer optimal sequences of action from sparse data. The basic ability to memorize can be accomplished in two ways: either alter the synapses to change the system trajectories, or alter the neurons themselves by programming specific spike patterns. The second method is biologically implausible, because human memory capacity greatly exceeds the number of neurons (it has been estimated at the equivalent of 2.5 petabytes of digital memory).

The easiest way to get a network to memorize an image is simply to burn it in. From a correlation standpoint, when the same image is presented again the result will be 1; otherwise it will be 0.
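
As a minimal sketch of "burning in" (not any particular published model), we can store one bipolar pattern as a Hebbian outer-product weight matrix and score probes by their normalized correlation with the stored pattern; the pattern values here are made up for illustration:

```python
import numpy as np

def burn_in(pattern):
    """Hebbian 'burn-in': store one bipolar (+1/-1) pattern as an outer-product weight matrix."""
    p = pattern.reshape(-1, 1)
    return p @ p.T / len(pattern)

def match_score(weights, probe):
    """Normalized correlation of a probe with the stored pattern: 1.0 for a perfect match."""
    return float(probe @ weights @ probe) / len(probe)

stored = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = burn_in(stored)

same = match_score(W, stored)                             # same image -> 1.0
other = match_score(W, np.ones(8))                        # unrelated (orthogonal) probe -> 0.0
```

Presenting the stored image returns a score of 1; a probe with no overlap returns 0, just as the correlation picture above suggests.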

The machine learning community distinguishes between supervised and unsupervised learning, which uses different vocabulary from psychology's classical and operant conditioning. Unsupervised learning is akin to classical conditioning, where the repeated association of two stimuli leads to a mutual expectation. Supervised learning is more akin to operant conditioning, where there is a "reward" (or lack thereof) to indicate success or failure - in other words there is a "teacher" of sorts.

There are certainly many other forms of learning. One of the most important for humans is imitative learning: being able to repeat the movements of another simply by watching them. There appear to be specialized neurons in the brain that support this behavior; they go by names like "mirror neurons".


Feature Extraction

Once we are able to store an image so we can work on it, we'll be interested in extracting the high level features. For instance, chairs can be distinguished from tables on the basis that they have a back. When learning about chairs, we would like to give the network as many examples as possible, and each sample can be presented in several views. This way the network comes to learn the overall statistics of the feature set.

Image-processing machines use many motifs commonly found in biological neural networks. For example, the award-winning ResNet architecture is built from "residual" connections: feed-forward shortcuts that bypass layers. This architecture should be nothing new for neuroscientists; practically every set of connections in the brain has similar connectivity. The machine learning people, however, do the due diligence: they figure out exactly how bypassing contributes to better performance, whereas the neuroscientists tend to stop at the "look, it's possible" point.
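
The residual idea is small enough to sketch directly: the block's output is the input plus the bypassed layer's transformation, so the identity path is always available. The layer and weights below are toy stand-ins, not the actual ResNet layers:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))   # weights of the layer being bypassed

def layer(x):
    """The transformation being bypassed (here a single ReLU layer)."""
    return np.maximum(0.0, W @ x)

def residual_block(x):
    """Residual connection: the input skips around the layer and is added back."""
    return x + layer(x)

x = rng.normal(size=16)
y = residual_block(x)
# If the layer contributes nothing (all-zero weights), the block still
# passes the input through unchanged -- the identity comes "for free".
```

This is why deep stacks of such blocks train well: even an untrained layer cannot destroy the signal flowing through the shortcut.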

Nevertheless, machine learning has not yet explored some important aspects of neuroscience. The so-called "residual connections" are ubiquitous in biological architectures, yet they were only introduced into machine learning about 8 years ago. Another unexplored area is the multiple connectivity between two neurons: multiple neurotransmitters with different actions and time courses, and the fact that there are sometimes a hundred or more synapses between two biological neurons. What is the purpose of these, when one synapse would suffice to carry the weight information? The obvious answer is "timing", and the machine learning community is just now getting up to speed on the meaning of timing within spike trains.

Recently neuroscientists have mapped the entire neural connectome of larval Drosophila (the common fruit fly), and consistent with the theme of this outline, the entire network is organized on the basis of loops, with very few unidirectional pathways. There are some interesting caveats to this general plan; for instance, the learning part of the brain (Drosophila's associative memory) doesn't seem to have any loops, at least not in the larval stage.


Classification

A slightly more complex version of feature recognition is classification. You can show the network 100 pictures of cats and dogs and ask it to identify which ones are cats. We would like to obtain the category with the highest probability of fitting the data. For proper classification we need to look at the whole image, not just a subset of its features.
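
Picking "the category with the highest probability" is usually done with a softmax over the network's raw class scores; the scores below are invented for illustration:

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

classes = ["cat", "dog"]
scores = np.array([2.1, 0.4])            # hypothetical network outputs for one image
probs = softmax(scores)
best = classes[int(np.argmax(probs))]    # the image gets the highest-probability label
```

Here the "cat" score dominates, so the image is labeled "cat"; the probabilities always sum to 1 regardless of the raw scores.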

For example, the figure shows some people walking down the street, but there are also cars, dogs, strollers, and a variety of other visual information. What happens when we ask the network to identify all the "people" in the image?

Typically this question is answered in stages, which is what the convolutions in a CNN architecture are for. A person can be considered like a stick figure: two arms, two legs, a head, a torso... and people look different from dogs and cars. The separation of "people" from the rest of the image is a kind of classification problem, and we could even classify within the group of people: show me all the men, or all the women.
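
A single convolution stage can be sketched as sliding a small kernel over the image to build a feature map; the image and the edge-detecting kernel here are toy values, not a trained network:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((5, 5))
image[:, 2] = 1.0                       # a vertical stripe of "ink"
vertical_edge = np.array([[1., -1.],    # toy kernel that responds to vertical edges
                          [1., -1.]])
fmap = convolve2d(image, vertical_edge)
```

The feature map lights up positively along the stripe's left edge and negatively along its right edge; stacking such stages is how a CNN works up from edges to limbs to whole stick figures.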

Classification thus necessarily incorporates parts of the feature extraction paradigm. Successful classification occurs on the basis of the features in the image.


Pattern Association

Pattern association combines feature extraction with memorization. In the previous example we can extend the network's task by asking it to find all the men wearing NY Yankees hats. To do this, the network has to know that a hat goes on top of the head, and that dogs don't wear hats, and that we should ignore all the women wearing NY Yankees hats because we only want the men.

What are the combinations of features that will allow us to extract what we want? We need a surefire way of telling women apart from men. We could use long hair, but some men have long hair too (although a beard would certainly be a giveaway). We could also look for "fashion accessories" like purses (which men don't usually carry), and so on. In other words, it's the combination of features that will give us what we want. We are therefore correlating information within the image, instead of between images. To do it between images, we have to memorize the images first, although if the comparison occurs quickly enough we might be able to do it while the image still persists. Obviously, some short-term memory would be helpful, so we can recall previous views as they're needed.
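
The feature-combination idea can be sketched as a weighted vote over upstream detector outputs. The detectors, names, weights, and scores below are all invented for illustration; a real network would learn this combination rather than have it hand-coded:

```python
# Hypothetical per-person feature scores from upstream detectors (0..1 each).
people = [
    {"beard": 0.9, "long_hair": 0.2, "purse": 0.0},
    {"beard": 0.0, "long_hair": 0.8, "purse": 0.9},
    {"beard": 0.1, "long_hair": 0.7, "purse": 0.1},
]

def likely_man(f):
    """Combine features: beard is strong evidence, purse counter-evidence,
    long hair only weakly informative (some men have long hair too)."""
    score = f["beard"] - f["purse"] + 0.2 * (1.0 - f["long_hair"])
    return score > 0.3

men = [i for i, f in enumerate(people) if likely_man(f)]
```

No single feature decides the question; it's the weighted combination, exactly as the paragraph above argues.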


Sequence Learning

Of course, in the real world, we do not simply receive static images from the machine learning engineer. Our visual field is dynamic; everything is constantly moving. The movie playing across the retina is no different from the audio in spoken speech: it is a stream of data that conveys meaning. Thus, we are interested in knowing the sequence of data, perhaps within a scene, perhaps across scenes, perhaps the properties of a particular object across scenes.

The prototypical sequence-learning machine is the RNN (recurrent neural network). This is a layer with self-recurrent connectivity, so in real life it can exhibit dynamics that include multistability and chaos. There are a number of ways of limiting the weight range in such systems to avoid seizure-like activity.
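
A minimal sketch of such a layer (random toy weights, not a trained model): the hidden state is fed back through recurrent weights at every step, and one common way of taming the dynamics is to shrink the recurrent matrix's spectral radius below 1 so activity can't run away:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
W_rec = rng.normal(size=(n, n))
W_rec *= 0.9 / max(abs(np.linalg.eigvals(W_rec)))   # spectral radius < 1: no runaway activity
W_in = rng.normal(size=(n,))

def step(h, x):
    """One RNN time step: the new state depends on the old state and the current input."""
    return np.tanh(W_rec @ h + W_in * x)

h = np.zeros(n)
for x in [1.0, 0.0, 0.5, 0.0]:   # a short input sequence
    h = step(h, x)
```

Because the state carries forward, the final `h` depends on the whole sequence, not just the last input; that is the sense in which the layer "knows" the sequence.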

From a computational standpoint, an RNN can be "unrolled" in time. This process is shown in the figure. This helps us computationally from a machine learning standpoint, because we can back-propagate derivatives all the way to the beginning of time. However, it's also quite un-biological. We're just now beginning to discover how scene processing really works (with place and time maps in the hippocampus and so on); it's already clear that there is substantial encoding before the information goes into memory, while it's still entirely unclear how it's recalled. This is currently an area of intense research.


Inference

One of the most powerful applications of artificial neural networks is in the area of inference. If there is a belief related to a model with some parameters, the neural network can fit the parameters to the data, "inferring" their values based on joint and conditional probability distributions.

A complete treatment of inference is beyond the scope of this short survey. However, the landscape is pretty simple: the primary objective is to be able to navigate a graph of probabilities. The basic paradigm for machine learning is "Bayesian inference", where the posterior probability is updated based on incoming evidence. Fortunately the math around this is easy and straightforward. Bayes' Rule relates the posterior probability to the prior probability, as follows:

P(Θ|X) = P(X|Θ) · P(Θ) / P(X)

where X is the input (the "evidence"), Θ is the model (containing the model parameters, which is what we're trying to find), and P is of course a probability. This equation reads: "the posterior probability is equal to the prior probability multiplied by the likelihood, normalized to the evidence". The likelihood is the term P(X|Θ): the probability of observing the evidence under the model. The prior P(Θ) is the belief we held before the new evidence arrived.

Bayes' Rule is usually used to update belief in real time: if we have a time series X with points Xi arriving at times i=t, we usually recalculate the model at every time step i. However, we don't have to do that; we can wait till the very end and use the "ensemble" of the input to make the adjustment. Per-sample updating keeps the model current, but the estimate can thrash around the answer as each new point arrives; ensemble updating is smoother but leaves the model stale in between. In practice we have to balance responsiveness against stability.
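
As a concrete sketch, take a conjugate Beta-Bernoulli model of a coin's bias (the flips below are made up). In this exact-inference case the per-sample route and the ensemble route land on the same posterior; what differs is only the trajectory along the way:

```python
flips = [1, 0, 1, 1, 1, 0, 1]   # observed evidence (1 = heads)

# Sequential: update the Beta(a, b) posterior after every flip.
a, b = 1.0, 1.0                 # uniform prior over the coin's bias
for x in flips:
    a += x
    b += 1 - x

# Batch ("ensemble"): one update using the whole dataset at once.
a2 = 1.0 + sum(flips)
b2 = 1.0 + len(flips) - sum(flips)

posterior_mean = a / (a + b)    # expected bias of the coin after the evidence
```

With 5 heads in 7 flips and a uniform prior, both routes give Beta(6, 3), for a posterior mean of 2/3.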


Generative Networks

Once we have the parameters of a model, we can generate examples from it. In the Bayesian paradigm, once we have decided on the model parameters Θ,
we can create examples from the model distribution. This can be done in a systematic way, to explore the organization of the memory, or it can be done randomly using a Monte Carlo approach.

The generation of examples from a distribution is not the same as adding "noise" to the input or the memory. If we have a model Θ and we add noise to the parameters, we're actually changing the optimal configuration that was determined by our inference network. However, if we draw a sample first and then add noise, we're just creating a "noisy sample"; this can be used to test the network's response to noise, and in associative memory experiments to test auto-completion and overlap.
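
The distinction can be sketched with a toy Gaussian model Θ = (μ, σ), with parameter values assumed already inferred: corrupting a drawn sample leaves the model itself untouched, while perturbing the parameters changes every future sample.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 3.0, 0.5                           # model parameters Theta (assumed inferred)

sample = rng.normal(mu, sigma)                 # generate one example from the model
noisy_sample = sample + rng.normal(0.0, 0.1)   # "noisy sample": the model is unchanged

# Contrast: perturbing the parameters themselves alters the optimal
# configuration, and therefore every sample drawn afterwards.
mu_perturbed = mu + rng.normal(0.0, 0.1)
```
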

There is a special kind of generative network called "adversarial". This is a way of getting two networks to learn from each other's mistakes. In a generative adversarial network, the generator creates "examples" from its own model distribution, while a second network, the discriminator, looks at those examples and tries to tell them apart from real data, updating its own internal model as it goes; each side improves by exploiting the other's failures. A related adversarial idea, self-play between two copies of the same agent, is a way of quickly exploring very large decision trees, for example the game of Go, or chess.


Attention

The word "apple" is ambiguous. It could mean a fruit, or it could mean a technology company. If I say "that apple was delicious", you can infer that I'm talking about a fruit. On the other hand if I say "my Apple broke", it's more likely that I'm talking about a computer. "Attention" in the machine learning sense, is a way of assigning context to data. So a heads-up, this usage is different from the way it's typically used in psychology.

A neural network derives context from memory (from "previous scenes"). If the word "apple" appears, the machine must execute a memory search for the context related to "apple". In the information that is returned, the memory subspace will have two distinct attractors, one for "fruit" and another for "technology company". To search memory, one must extract the search key from the image and present it to the memory system "while" the image is still being processed. And the memory search cannot disrupt any other ongoing memory-related activities.
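
The machine-learning version of this lookup is dot-product attention: compare a query against stored keys, softmax the similarities, and return a weighted blend of the stored values. The two-dimensional "fruit" and "company" vectors below are toy stand-ins for real learned embeddings:

```python
import numpy as np

def attention(query, keys, values):
    """Scaled dot-product attention: softmaxed similarities weight the stored values."""
    scores = keys @ query / np.sqrt(len(query))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values, w

# Two stored contexts for "apple" (toy vectors).
keys   = np.array([[1.0, 0.0],    # key for the "fruit" context
                   [0.0, 1.0]])   # key for the "company" context
values = np.array([[1.0, 0.0],    # value: the fruit meaning
                   [0.0, 1.0]])   # value: the company meaning

query = np.array([0.9, 0.1])      # "that apple was delicious" leans toward fruit
blend, weights = attention(query, keys, values)
```

Because the query resembles the "fruit" key, that context dominates the blend; a query like "my Apple broke" would tip the weights the other way.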

One of the things we've learned from machine learning is that it's very difficult to extract a subspace of the energy surface and use it to derive context. More likely, an associative search through a Hopfield network will produce a sequential series of attractors; in other words, the individual attractors will present in sequence instead of all at once. The order they come out in is related to their covariance with the search key, so there is no guarantee they'll arrive in any particular order you want.


Next: Let's Model It!

Back to Cerebral Cortex

Back to the Console


(c) 2026 Brian Castle
All Rights Reserved
webmaster@briancastle.com