What Is ChatGPT Doing … and Why Does It Work?


It’s Just Adding One Word at a Time

  • ChatGPT can automatically generate something that reads even superficially like human-written text
  • It looks for things that in a certain sense “match in meaning”.
  • The end result is that it produces a ranked list of words that might follow, together with probabilities
  • A “temperature” parameter determines how often lower-ranked words will be used (a sampling sketch follows this list)
  • For essay generation, it turns out that a “temperature” of 0.8 seems best.
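A minimal sketch of this sampling step, in Python with made-up candidate words and probabilities (invented values for illustration, not output from any real model):

```python
import numpy as np

# Illustrative ranked list of next-word candidates with probabilities
# (made-up values, not taken from ChatGPT).
words = ["learn", "predict", "make", "understand", "do"]
probs = np.array([0.35, 0.25, 0.20, 0.12, 0.08])

def sample_with_temperature(probs, temperature=0.8):
    """Re-weight the probabilities and pick one word at random.

    As temperature -> 0 the top-ranked word is almost always chosen;
    higher temperatures use lower-ranked words more often.
    """
    logits = np.log(probs) / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return np.random.choice(len(probs), p=weights)

print(words[sample_with_temperature(probs, temperature=0.8)])
```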

Where Do the Probabilities Come From?

Take a sample of English text, and calculate how often different letters occur in it.

  • Break this into words by adding in spaces as if they were letters with a certain probability, and generate words by forcing the distribution of “word lengths” to agree with what it is in English (a sketch of this appears after the list)
  • Generate sentences by choosing each word at random, with the same probability that it appears in the corpus
  • There are about 40,000 reasonably commonly used words in English, so we can estimate how common each word is.
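A toy version of this experiment, assuming only a short sample string in place of a real corpus: draw symbols (with the space treated as just another letter) according to their measured frequencies, so spaces break the stream into random “words”:

```python
from collections import Counter
import random

sample = ("take a sample of english text and calculate "
          "how often different letters occur in it")

# Empirical symbol frequencies, with the space counted like a letter.
counts = Counter(sample)
symbols = list(counts)
weights = [counts[s] for s in symbols]

# Draw 200 symbols with those probabilities; the spaces split the
# output into random "words" with roughly English-like lengths.
print("".join(random.choices(symbols, weights=weights, k=200)))
```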

“Surely a Network That’s Big Enough Can Do Anything!”

There are things that can be figured out by formal processes, but aren’t readily accessible to immediate human thinking.

  • Computational irreducibility means that we can never guarantee that the unexpected won’t happen-and it’s only by explicitly doing the computation that you can tell what actually happens in any particular case (illustrated by the sketch below)
  • There’s an ultimate tradeoff between capability and trainability: the more you want a system to make “true use” of its computational capabilities, the more it’s going to show computational irreducibility, and the less it’s going to be trainable.
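One standard illustration (drawn from Wolfram’s broader work, not spelled out in this summary) is the rule 30 cellular automaton: the update rule is trivial, yet no known shortcut predicts the pattern, so you simply have to run the computation to see what it does:

```python
# Rule 30: each cell becomes (left XOR (center OR right)).
def rule30_step(cells):
    n = len(cells)
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n])
            for i in range(n)]

cells = [0] * 31
cells[15] = 1                     # single "on" cell in the middle
for _ in range(15):
    print("".join("#" if c else "." for c in cells))
    cells = rule30_step(cells)
```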

Inside ChatGPT

The goal is to continue text in a reasonable way, based on what it’s seen from the training it’s had

It operates in three basic stages:

  • First, it takes the sequence of tokens that corresponds to the text so far
  • Second, it finds an embedding vector
  • Third, it operates on this embedding-in a “standard neural net way”, with values “rippling through” successive layers in a network-to produce a new embedding (i.e. a new array of numbers); a schematic sketch of the whole pipeline follows below
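The shapes, layer count, and weights below are toy stand-ins (random values, not trained ones), but the sketch shows how the three stages fit together:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_layers = 1000, 64, 4   # toy sizes

# Stage 1: the text so far, as a sequence of token ids (toy values).
tokens = np.array([17, 42, 256, 3])

# Stage 2: look up an embedding vector for each token.
embedding_table = rng.normal(size=(vocab_size, d_model))
x = embedding_table[tokens]                   # shape (seq_len, d_model)

# Stage 3: values "ripple through" successive layers to produce a new
# embedding. Real transformer layers are far more elaborate; this
# stand-in just applies a weight matrix and a nonlinearity per layer.
for _ in range(n_layers):
    W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    x = np.tanh(x @ W)

# Finally, project the last position's embedding to next-token
# probabilities.
W_out = rng.normal(size=(d_model, vocab_size)) / np.sqrt(d_model)
logits = x[-1] @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.argmax(), float(probs.max()))
```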

Every part of this pipeline is implemented by a neural network, whose weights are determined by end-to-end training of the network

The Training of ChatGPT

ChatGPT is the result of large-scale training, based on a huge corpus of text-on the web, in books, etc.-written by humans

A neural net with 175 billion weights can make a “reasonable model” of text humans write

Given all this data, how does one train a neural net from it?

  • You present a batch of examples, and then you adjust the weights in the network to minimize the error (a minimal batch loop is sketched below)
  • Current methods require one to do this batch by batch
  • How efficient is it going to be at implementing a model based on algorithmic content?
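A minimal batch-by-batch training loop, assuming the simplest possible “network” (a line with two weights) so the mechanics stay visible:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: targets from the line y = 3x + 0.5 that we want to learn.
x = rng.uniform(-1, 1, size=(256, 1))
y = 3.0 * x + 0.5

w, b = 0.0, 0.0                    # the weights we adjust
lr, batch_size = 0.1, 32

for epoch in range(50):
    for i in range(0, len(x), batch_size):
        xb, yb = x[i:i + batch_size], y[i:i + batch_size]
        err = (w * xb + b) - yb    # how wrong this batch's outputs are
        # Gradient of the mean-squared error with respect to w and b.
        w -= lr * (2 * err * xb).mean()
        b -= lr * (2 * err).mean()

print(w, b)                        # approaches 3.0 and 0.5
```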

Machine Learning, and the Training of Neural Nets

Numerical analysis provides a variety of techniques for finding the minimum in cases like this

The point is that the trained network “generalizes” from the particular examples it’s shown.

How does neural net training actually work?

Essentially what we’re always trying to do is to find weights that make the neural net successfully reproduce the examples we’ve given

The basic procedure is to find out “how far away we are” from getting the function we want-and then to update the weights in such a way as to get closer (a numerical sketch follows)
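A numerical sketch of that loop, assuming a tiny two-weight model: measure “how far away we are” with a mean-squared loss, then minimize it by finite-difference gradient descent (one of the numerical-analysis techniques alluded to above):

```python
import numpy as np

def loss(w, xs, ys, f):
    """Mean-squared distance between the model's outputs and the targets."""
    return np.mean((f(w, xs) - ys) ** 2)

def numerical_gradient(w, xs, ys, f, eps=1e-6):
    """Estimate d(loss)/d(weight) by nudging each weight slightly."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        bumped = w.copy()
        bumped[i] += eps
        grad[i] = (loss(bumped, xs, ys, f) - loss(w, xs, ys, f)) / eps
    return grad

f = lambda w, x: w[0] * x + w[1] * x ** 2   # a model with two weights
xs = np.linspace(-1, 1, 50)
ys = 2.0 * xs - 0.7 * xs ** 2               # the function we want

w = np.zeros(2)
for _ in range(500):
    w -= 0.1 * numerical_gradient(w, xs, ys, f)

print(w)                                     # approaches [2.0, -0.7]
```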

What Is a Model?

Any model you use has some particular underlying structure-then a certain set of “knobs you can turn” (i.e. parameters you can set) to fit your data.
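For instance (an illustration chosen for this sketch, not an example from the text): take a quadratic as the underlying structure and let its three coefficients be the knobs, set by a least-squares fit:

```python
import numpy as np

# Underlying structure: y = a*x^2 + b*x + c.
# The "knobs" are the coefficients a, b, c, turned to fit the data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 7.2, 12.8, 21.1])   # made-up observations

a, b, c = np.polyfit(x, y, deg=2)            # least-squares fit
print(a, b, c)

# Once the knobs are set, the model predicts points it was never shown.
print(a * 5.0**2 + b * 5.0 + c)
```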

The underlying structure of ChatGPT-with its 175 billion parameters-is sufficient to make a model that computes next-word probabilities “well enough” to give us reasonable essay-length pieces of text

The Practice and Lore of Neural Net Training

Over the past decade, there have been many advances in the art of training neural nets.

The tasks we’re trying to get neural nets to do are human-like ones-and neural nets can capture quite general “human-like processes.”

For example, in converting speech to text, it was thought that one should first analyze the audio of the speech, break it into phonemes, etc. But what was found is that it’s usually better just to train the neural net on the “end-to-end problem”.

The Concept of Embeddings

An embedding is a way to try to represent the “essence” of something by an array of numbers with the property that “nearby things” are represented by nearby numbers.

How can we construct such an embedding?

  • Look at large amounts of text and see “how similar” the environments are in which different words appear (a toy co-occurrence sketch follows this list)
  • Instead of directly trying to characterize “what image is near what other image”, we instead consider a well-defined task (in this case digit recognition) for which we can get explicit training data-then use the fact that in doing this task the neural net implicitly has to make what amount to “nearness decisions”.
  • We are just talking about the concrete question of what digit an image represents, and then we’re “leaving it to the neural network” to implicitly determine what that implies about the nearness of images.
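A toy version of the first idea, assuming a tiny hand-written corpus: embed each word by the counts of its immediate neighbors, so words used in similar environments get nearby vectors:

```python
import numpy as np
from collections import defaultdict

corpus = ("the cat sat on the mat the dog sat on the rug "
          "a cat and a dog ran in the park").split()

vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

# Each word's "embedding": counts of the words appearing next to it.
vectors = defaultdict(lambda: np.zeros(len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            vectors[w][index[corpus[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "cat" and "dog" appear in near-identical environments here,
# so their vectors come out much closer than "cat" and "park".
print(cosine(vectors["cat"], vectors["dog"]))   # higher
print(cosine(vectors["cat"], vectors["park"]))  # lower
```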

Models for Human-Like Tasks

For ChatGPT, we have to make a model of human-language text of the kind produced by a human brain.

  • This contrasts with making a model for numerical data that essentially comes from simple physics, where we’ve known for several centuries that “simple mathematics applies”.
  • Another human-like task would be recognizing images: feed in images of handwritten digits, and output the numbers these correspond to.

Neural Nets

A neural net is a connected collection of idealized “neurons”-usually arranged in layers-with a simple example being:

  • Each “neuron” is effectively set up to evaluate a simple numerical function.
  • The goal is to take an “input” corresponding to a position {x,y}-and then to “recognize” it as whichever of the three points it’s closest to.
  • To “use” the network, we feed numbers in at the top, and then read off the numbers that come out at the bottom (a minimal sketch follows this list)
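A minimal sketch of those mechanics, assuming random (untrained) weights just to show numbers flowing through the layers; training would set the weights so that the three outputs score the three candidate points correctly:

```python
import numpy as np

def layer(inputs, W, b):
    """One layer of idealized neurons: each computes a weighted sum of
    its inputs plus a bias, then a simple nonlinear function (ReLU)."""
    return np.maximum(0.0, W @ inputs + b)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)   # 8 hidden neurons
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)   # 3 outputs

position = np.array([0.3, -0.5])        # the input {x, y}
hidden = layer(position, W1, b1)
scores = W2 @ hidden + b2               # one score per candidate point
print(scores.argmax())                  # which point the net "recognizes"
```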
