It seems we know quite a lot about the human visual system. Could we build a machine to do the same thing? Many have tried... the difficulty is that machines don't have dynamics (so far), and they don't have the genetic programming that codes for development of complex systems like the brain.
In the retina, for example, waves of electrical activity begin before the eyes even open, traveling from the periphery to the center. These waves help organize the connections with and within the superior colliculus, and those connections end up aligned with the direction of "optic flow" as the organism traverses the environment. The retina knows, on the basis of genetically programmed chemical markers, where the midline is along both the horizontal and vertical axes - this is how it separates the visual fibers at the level of the optic chiasm, and also how neurons map directional selectivity in both the retina and the SC.
Furthermore, the form and function of the many types of synaptic plasticity in the visual system are still largely unknown, as is the internal function of the (visual) cerebral cortex. Machines that try to "see" are limited by the available technology, which today is great in the digital world and not so great in the analog world. However, photonics is right around the corner, enabling dense computation with tiny amounts of energy, and the machine learning community is already creating layered systems with billions of simulated neurons.
In a real brain, the areas that we study, like the retina, LGN, and visual cortex, are embedded in much larger systems. If we disconnect the other parts, the behavior changes. For example, a simple lesion in the midbrain, very close to the pontine visual and oculomotor areas, can put a person to sleep, or even into a coma. The activity in the visual cortex is coordinated with other modalities in other areas of the cortex by rhythmic electrical activity, including alpha and theta waves, but also high-frequency components that were often filtered out in early EEG research.
Another big difference between humans and machines is how we handle noise. Everything in biology is noisy and stochastic; very little is deterministic. Our neural networks take advantage of this noise - they make use of it - whereas engineers have been taught to get rid of it. A constantly changing input is vital for humans; our vision stops entirely without it. Our retinas adapt within seconds to minutes, and our eye movements fatigue after holding an eccentric position for 15 to 20 seconds.
How Machine Vision Works
Other than advances in camera technology, the biggest part of machine vision relies on artificial neural networks. However, these networks are, for the most part, highly non-biological. In some cases the machines "emulate" humans, and in some cases they're specifically built to meet or surpass human capabilities. The machines tend to focus on visual data, because that's the only way they can be tested. Hence you'll see terms like recognition, classification, feature extraction, and pattern analysis used ubiquitously in the machine learning literature.
Convolutional Neural Networks
Artificial "convolutional" neural networks mimic an interesting aspect of human vision, spatial correlation, but they do it differently, in a way more suited to sequential digital processing than to parallel analog processing.
A digital convolution is like a small filter that's swept across the visual field (or the "input space"). The filter often resembles a "Mexican hat" function, which in turn looks a lot like the center-surround organization of a retinal ganglion cell. And, as in the retina, the filter is designed to perform a specific task, like contrast enhancement or edge detection.
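As a minimal sketch of this idea (assuming a NumPy environment; the kernel size and the center/surround widths are illustrative choices, not values from any particular retina model), a "Mexican hat" can be built as a difference of Gaussians and swept across an image:

```python
import numpy as np

def mexican_hat(size=9, sigma_c=1.0, sigma_s=2.0):
    """Difference-of-Gaussians kernel: excitatory center, inhibitory surround."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    center = np.exp(-r2 / (2 * sigma_c**2)) / (2 * np.pi * sigma_c**2)
    surround = np.exp(-r2 / (2 * sigma_s**2)) / (2 * np.pi * sigma_s**2)
    return center - surround

def convolve2d(image, kernel):
    """Sweep the same small filter sequentially across the image (valid region)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A step edge: the filter responds near the contrast boundary, not in flat regions
image = np.zeros((20, 20))
image[:, 10:] = 1.0
response = convolve2d(image, mexican_hat())
```

The inner loop makes the sequential sweep explicit: one filter, re-used at every position - exactly the re-use the text contrasts with the retina's hard-wired, parallel filters.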
There are three main differences between machine convolutions and biological systems. First, in the machines the filter is swept sequentially across the image, which means the same filter can be re-used over and over again. In biological systems, however, the filters are hard-wired in the form of neurons and everything occurs in parallel, so there is no need for Nvidia GPUs to perform matrix multiplications.
The second difference is that machines use feedforward architectures for computation and backpropagation for learning. A machine vision system is necessarily synchronized by a master clock that first sends the information in the forward direction through the network, then sends the errors back the other way using backpropagation. But there is no clock in the brain, and while there is synchronization in the visual system, it occurs in a different way. In humans, the pattern analysis portion of vision is interrupted by saccadic eye movements that occur about three times a second, and eye blinks that occur about once every three seconds. The brain uses this timing to coordinate pattern recognition through rhythmic activity involving alpha waves in the visual cortex and theta waves in areas like the hippocampus and the claustrum.
The third major difference between humans and machines is that humans are noisy, and machines don't like noise. Humans are stochastic and noisy: every biological process depends on molecular interactions that have an imperfect probability of occurring, of occurring within an allotted time frame, or of occurring unmolested. There are exceptions to the statement that "machines don't like noise": some algorithms and components have been built from the ground up around stochastic and statistical targets, and some algorithms certainly handle that aspect of the data better than others.
There are common elements too, between humans and machines. One of the things we learn from both is that there's always more than one way to do something. The field of machine learning is only a few years old; it's still in its infancy. It started out trying to mimic biological systems and rapidly evolved along functional lines into things like large language models that have commercial value. Neuroscientists sometimes take an unrealistic view of what they're looking at. For example, the matrix-like mapping of ocular dominance columns and orientation columns in the primary visual cortex turns out to be irregular and pinwheel-shaped, and this organization can be duplicated very precisely by a machine learning model that uses only the statistics of the data, without any inherent internal geometric organization. Both humans and machines are programmed by the data. Humans have meta-programming in the form of genetic instructions, whereas machines rely on the cleverness of the PyTorch or TensorFlow programmer.
==>   Predictive Coding   <==
So, how does all this jazz about neurons and synapses tie in with the timeline architecture and brain electrical potentials? Convolutional networks don't really cut it: they're non-biological, they don't have dynamics, and they require 10,000 presentations of a face before they can recognize it (whereas you can do it with just one prior glance).
The computational method that ties it all together is predictive coding, where "coding" is meant in the neural sense (not as in a computer program). The information ahead of T=0 on the timeline is prediction - that's exactly what it is. An intended motor behavior can be staged along the timeline, but there's no guarantee that it will materialize at T=0. It may be inhibited before it ever gets there, or transmission can simply fail along the way. The common element of information along T > 0 is its predictive character.
Since its popularization in the insightful paper by Rao and Ballard, the mechanisms of predictive coding have been extended by researchers like Karl Friston, who introduced the concepts of free energy and complexity into the basic paradigm. Free energy, in a neural network, is an indicator of surprisal. It has an accuracy term, related to errors in previous predictions, and a complexity term, related to the simplest possible description of the input.
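One standard variational form of this decomposition (a sketch, with x the sensory input, z the hidden causes, and q the network's approximate posterior over those causes) is:

```latex
F \;=\; \underbrace{D_{\mathrm{KL}}\!\left[\,q(z)\;\|\;p(z)\,\right]}_{\text{complexity}}
\;-\; \underbrace{\mathbb{E}_{q(z)}\!\left[\ln p(x \mid z)\right]}_{\text{accuracy}}
```

Minimizing F therefore trades prediction accuracy against the simplicity of the internal description - the two terms the text names.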
The basic paradigm of predictive coding is easy to understand. At time step T-1, the network will create a prediction of what it will see at T=0. Let us say that this prediction is in the form of a neuron firing rate. (We'll keep it simple for now, to dovetail the description more closely with that of traditional machine learning mechanisms - later however, we'll see how this approach becomes amazingly useful when we introduce dynamics). When the input arrives at T=0, it is compared with the prediction. The actual input is subtracted from the prediction to get an error signal. The error signal is then used to update the next prediction. (Engineers will notice this process is much like an adaptive Kalman filter).

A simple network that accomplishes this is shown in the figure.
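The prediction-error loop described above can also be sketched in a few lines (a scalar toy model, assuming NumPy; the gain constant and the 10 Hz target rate are illustrative choices). The error signal is the prediction minus the actual input, and a fraction of it updates the next prediction, much like the adaptive filter mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_coding(inputs, gain=0.2):
    """At each time step, compare the incoming value with the prediction;
    the resulting error signal updates the next prediction."""
    prediction = 0.0
    errors = []
    for x in inputs:
        error = prediction - x        # actual input subtracted from prediction
        prediction -= gain * error    # error signal updates the next prediction
        errors.append(error)
    return prediction, errors

# A steady firing rate of 10 Hz plus noise: the prediction converges,
# and only the residual (surprising) part of the input remains as error
inputs = 10.0 + rng.normal(0, 0.5, 200)
prediction, errors = predictive_coding(inputs)
```

After a few dozen steps the prediction settles near the true rate, and the errors shrink to the noise floor - the "surprisal" that predictive coding leaves behind.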

In a wonderful paper written from an evolutionary perspective, Pezzulo and colleagues describe how more complex architectures can be built up from this simple concept.
Over the last several years, many variations of predictive coding networks have been created, with considerable focus on the various types of behavior they can support.
Stochastic Behavior
In college, engineers learn about control systems theory. First it's presented in a linear context, which always draws a few groans from the back row because it involves solving differential equations. The volume of groaning increases when the system becomes nonlinear and the solutions become harder to find. The groans become an orchestra when students are asked to map dynamic behaviors into the phase plane - but then some magic happens, and everyone breathes a sigh of relief when it's discovered that complex systems can be approached statistically.
There is a difference between statistics and stochastic behavior. Statistics involves correlations, whereas stochastic behavior means noise. Correlative neural networks are driven by the data, which can be either noisy or not noisy. Stochastic networks are driven by the inherently noisy behavior of neurons, regardless of the composition of the data.
In a way, stochastic behavior is akin to the deliberate injection of noise. From the standpoint of control systems and the Kalman filters we discussed, noise is a way to overcome local minima, and the same applies to artificial neural networks. Stochastic behavior can be conceived in terms of random walks. The differential equations related to stabilizing a control system become stochastic differential equations, and can be solved using the methods of Wiener, Ito, Malliavin, and the like. Conversely, a noisy system may exhibit process noise, and when we measure it we may also add measurement noise. These sources can be separated and removed from the analysis, but in many cases they affect the system dynamics and the trajectories in the phase plane. (A Kalman filter usually relies on a sensor for measurement, and humans rely on proprioception, both of which may exhibit measurement noise.) In any control system we're interested in stability and convergence, and sometimes noise plays directly into the boundaries of stability.
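The separation of process noise from measurement noise can be made concrete with a scalar Kalman filter tracking a random walk (a minimal sketch, assuming NumPy; the noise variances q and r are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)

# A hidden state follows a random walk (process noise q);
# we observe it only through a noisy sensor (measurement noise r).
q, r = 0.01, 0.25
n = 500
state = np.cumsum(rng.normal(0, np.sqrt(q), n))   # true trajectory
obs = state + rng.normal(0, np.sqrt(r), n)        # noisy measurements

# Scalar Kalman filter: predict, then correct using the Kalman gain
x_hat, p = 0.0, 1.0
estimates = []
for z in obs:
    p += q                      # predict: uncertainty grows by the process noise
    k = p / (p + r)             # gain weighs the prediction against the sensor
    x_hat += k * (z - x_hat)    # correct with the innovation (the error signal)
    p *= (1 - k)                # uncertainty shrinks after each measurement
    estimates.append(x_hat)

estimates = np.array(estimates)
err_filtered = np.mean((estimates - state) ** 2)  # filtered tracking error
err_raw = np.mean((obs - state) ** 2)             # raw sensor error
```

Because the filter knows (or assumes) the two noise variances, its estimate tracks the hidden state with a much smaller error than the raw sensor - the same predict-and-correct cycle that predictive coding performs with neural error signals.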
Dynamic Behavior
We looked at a simple example of dynamic behavior with the Wilson-Cowan network, and in a real brain this behavior becomes considerably more complex. One can build a robust, controllable, and adaptable oscillator from just a few neurons, perhaps the same number one might find in a cortical mini-column (say, about 100, maybe less). When you have 100 million of these oscillators coupled together at different frequencies and phases, and they're all operating stochastically, even the Kuramoto approximation becomes inadequate to describe the resulting behaviors.
There are extensions of the Kuramoto model of coupled oscillators, for example the specific case of a bimodal frequency distribution has been solved exactly, for Lorentzian and Gaussian distributions. However the general case leads to a complex set of coupled differential equations that would be impossible to solve in real time with current technology. So we look for other avenues to approach questions like criticality, and one of the very promising avenues is nonlinear thermodynamics, which can be directly applied to neural networks with quantum computing.
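The basic Kuramoto model itself is easy to simulate directly (a sketch, assuming NumPy; the population size, frequency spread, and coupling strengths are illustrative, and the Gaussian frequency distribution is one of the cases mentioned above):

```python
import numpy as np

rng = np.random.default_rng(2)

def kuramoto(n=100, coupling=2.0, dt=0.01, steps=2000):
    """Euler integration of the Kuramoto model:
    dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    omega = rng.normal(1.0, 0.2, n)           # natural frequencies (Gaussian)
    theta = rng.uniform(0, 2 * np.pi, n)      # random initial phases
    for _ in range(steps):
        # mean-field form: r * e^{i psi} = (1/N) * sum_j e^{i theta_j}
        z = np.mean(np.exp(1j * theta))
        theta += dt * (omega + coupling * np.abs(z) * np.sin(np.angle(z) - theta))
    return np.abs(np.mean(np.exp(1j * theta)))  # order parameter r in [0, 1]

r_weak = kuramoto(coupling=0.1)    # below critical coupling: incoherent phases
r_strong = kuramoto(coupling=2.0)  # above critical coupling: synchronization
```

Below the critical coupling the order parameter hovers near zero (up to finite-size fluctuations); above it, the oscillators lock and r approaches 1 - the transition whose stochastic, multi-population generalizations the text says outrun exact analysis.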
Learning and Training
Both machine vision and human vision involve three things: energy functions, error minimization, and synaptic plasticity. Energy functions are often called "cost" or "objective" functions.
We covered synaptic plasticity at a basic level in the earlier section on synapses. Energy functions traditionally exist at the level of the network layer, although in non-layered systems like Hopfield networks and Boltzmann machines they may exist at the level of the whole network. In biological terms these might be equivalent, in some ways, to extracellular field potentials, and it is entirely plausible that they have meaning locally as well as globally.
The term "error" is ubiquitous in machine learning, and it means many different things. Depending on the type of learning (supervised learning, unsupervised learning, and so on), "error" is measured against either a correct answer provided by the experimenter, or an internal model generated by the network itself.
There are also auto-associative networks that don't use "error" at all, instead they rely on spatial and temporal correlations to form relationships between features of the data. These networks were explored by Kohonen in the context of simple threshold neurons, and have been extended in many ways to include spiking networks. No model to my knowledge has yet explored the embedding of an auto-associative network into an electrical syncytium (of the kind formed by gap junctions in astrocytes). That would be a worthy PhD thesis, maybe even a career.
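A minimal sketch of such a correlation-driven auto-associator (a Hopfield-style network of threshold neurons with Hebbian weights - an illustrative stand-in rather than Kohonen's specific model; the pattern count and size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

def train_hebbian(patterns):
    """Hebbian (pure correlation) weights: no error signal, no labels."""
    n = patterns.shape[1]
    w = patterns.T @ patterns / n
    np.fill_diagonal(w, 0.0)           # no self-connections
    return w

def recall(w, probe, steps=10):
    """Iterate threshold neurons until a stored pattern is retrieved."""
    s = probe.copy()
    for _ in range(steps):
        s = np.sign(w @ s)
        s[s == 0] = 1.0                # break ties toward +1
    return s

# Store three random +/-1 patterns, then recall one from a corrupted probe
patterns = rng.choice([-1.0, 1.0], size=(3, 64))
w = train_hebbian(patterns)
probe = patterns[0].copy()
flip = rng.choice(64, size=8, replace=False)
probe[flip] *= -1                      # corrupt 8 of the 64 bits
recovered = recall(w, probe)
```

Notice there is no error term anywhere: the weights simply record pairwise correlations between features, yet a partial or corrupted cue is enough to complete the stored pattern.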