Post #1: What is Generative Artificial Intelligence?
Let’s begin by examining each of the terms.
Artificial Intelligence
Let’s start by unpacking the term artificial intelligence. To do that, I think it helps to break down each word.
Intelligence
We ordinarily don’t think much about what the term “intelligence” means. When we do, we usually think of it in a judgmental sense, as in “I’m ‘more intelligent’ than that person,” without being specific about what we mean.
But think for a minute: what does it mean to be intelligent?
One way to approach the question is to consider what you use intelligence for. You can think this through yourself, but most people say you need intelligence to read, write, engage in a conversation, do math problems, make predictions, make judgments, reason, plan ahead, and have “common sense.” Being sentient or conscious, which means being aware of your own existence, also requires intelligence.
Artificial
But what does it mean for that intelligence to be “artificial”? Generally speaking, artificial intelligence is intelligence that is based in silicon (a computer), as opposed to carbon-based intelligence, which is embedded in the brains of biological beings like us. Eventually, we may no longer need silicon and may instead be able to embed that intelligence in synthetic biological structures, but for now it is embedded in silicon.
Some individuals (generally people who are opposed to AI) argue that computers are not yet “intelligent” because they do not currently have all of the ingredients of intelligence: they cannot yet fully reason or plan ahead, and they are not sentient or conscious.
Everyone agrees that computers are not yet fully intelligent. At best, they have very limited reasoning abilities. They certainly can’t plan ahead. They do not have common sense. Most leading AI scientists think AI does not understand what it is doing, though a few (Geoffrey Hinton, Ilya Sutskever) think it does. But while there are many things computers cannot do, they can engage in “natural language” conversation, a capability studied in the field of “natural language processing” (NLP), and the ability to hold such a conversation is an ingredient of intelligence. When they produce text as part of the conversation, they are generating output.
Generative
Historically, scientists doubted that machines could ever fully grasp the nuanced and complex nature of human language. Language involves intricate subtleties, such as context, tone, and cultural references, which are challenging for machines to interpret accurately. An example of this complexity is the sentence “I saw her duck,” which can mean either that I saw a bird belonging to her or that I saw her lower her head to avoid something. These nuances make language processing a formidable task for AI.
The breakthrough that made it possible for AI to handle these nuances came with the introduction of the Transformer model, detailed in the seminal 2017 paper Attention Is All You Need by Vaswani et al., which was written by a team of Google researchers, including one intern. The Transformer model’s ability to focus on different parts of an input sequence allowed it to better capture the nuances in language.
The attention mechanism in the model enables it to weigh the importance of each word in a sentence relative to the others, so it can use the surrounding context to determine the likely meaning of ambiguous words like “duck” in the sentence “I saw her duck.” For example, if the sentence continues with “under the table,” the model can infer that “duck” refers to the action of avoiding something. Unlike previous models that processed language sequentially, the attention mechanism allows the Transformer to dynamically focus on relevant parts of the text, regardless of their position in the sentence, capturing long-range dependencies and subtle contextual cues. Additionally, the Transformer processes all words in a sentence simultaneously, ensuring that the entire context is considered when interpreting ambiguous language. Through training on large datasets, the Transformer model learns to recognize patterns and nuances in language. AI models have now been trained on text drawn from much of the public internet, on the order of 10 trillion words.
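To make the idea of “weighing each word relative to the others” concrete, here is a minimal sketch of scaled dot-product attention, the core calculation described in the 2017 paper. Everything specific in it is an assumption made for illustration: the function names, the two-number word vectors (a made-up “animal-ness” and “motion-ness” score), and the shortcut of using the same vectors as queries, keys, and values. Real Transformers learn these vectors from data, use far larger ones, and run many attention “heads” at once.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: each query gets a weighted mix of the values."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical hand-made vectors: [animal-ness, motion-ness]. Not learned values.
words = ["I", "saw", "her", "duck", "under", "the", "table"]
vectors = [
    [0.1, 0.0],  # I
    [0.0, 0.2],  # saw
    [0.1, 0.1],  # her
    [1.0, 1.0],  # duck (ambiguous: bird or movement)
    [0.0, 2.0],  # under
    [0.0, 0.0],  # the
    [0.0, 1.0],  # table
]

# Simplification: the same vectors serve as queries, keys, and values.
outputs, weights = attention(vectors, vectors, vectors)

duck = words.index("duck")
duck_weights = dict(zip(words, weights[duck]))
duck_output = outputs[duck]

for word, weight in duck_weights.items():
    print(f"{word:>6}: {weight:.2f}")
print("duck after attention:", [round(x, 2) for x in duck_output])
```

In this toy setup, “duck” pays more attention to “under” than to “the,” and its blended vector ends up leaning toward the motion score, which is the kind of context-driven shift the paragraph above describes.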
OpenAI’s scientists picked up on the 2017 paper and worked hard to advance generative text models between 2017 and 2022, culminating in the release of ChatGPT in November 2022. ChatGPT, built on the GPT-3.5 model, demonstrated the practical application of these models by generating human-like text responses, showing that machines could engage in natural language conversations. These conversations can now also take place in audio, and OpenAI has led breakthroughs that allow the emotional nuance of conversations to come through.
ChatGPT and the similar models that followed (Gemini from Google; Llama from Meta; Claude from Anthropic; Mistral from France; GLM-4 from Zhipu in China) use this same basic architecture.
These models laid the foundation for AI that generates text and holds back-and-forth text conversations.
We also have models that generate images and video. They have become increasingly sophisticated, enabling the creation of high-quality visual content from simple text prompts. Earlier image generators were often built as Generative Adversarial Networks (GANs), while many current text-to-image systems are diffusion models; both use machine learning to learn the structure of visual data and generate new examples of it.
These models are trained on vast datasets of images, learning the various styles, concepts, and attributes present in the data. When given a text prompt, the model translates this input into a numerical representation that captures the semantic meaning and context. This representation guides the AI in generating an image that aligns with the prompt, allowing for the creation of unique and realistic visuals.
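To give a feel for what “a numerical representation” of a prompt means, here is a deliberately simple sketch that turns a prompt into a list of numbers by counting words from a tiny vocabulary. This is a toy bag-of-words vector, not how image generators actually encode prompts: real systems use a learned text encoder that captures meaning and word order. The vocabulary and the function names `embed` and `cosine` are hypothetical, chosen only for this example.

```python
import math
import re

# Hypothetical, hand-picked vocabulary; real models learn their representation from data.
VOCAB = ["cat", "dog", "beach", "sunset", "city", "night"]

def embed(prompt):
    """Count how often each vocabulary word appears in the prompt."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return [tokens.count(word) for word in VOCAB]

def cosine(a, b):
    """Cosine similarity: near 1 for vectors pointing the same way, 0 for no overlap."""
    product = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return product / norm if norm else 0.0

p1 = embed("A dog on the beach at sunset")
p2 = embed("Sunset over a beach with a dog")
p3 = embed("A city at night")

print("prompt 1 as numbers:", p1)
print("similarity of prompts 1 and 2:", round(cosine(p1, p2), 2))
print("similarity of prompts 1 and 3:", round(cosine(p1, p3), 2))
```

Even in this crude form, two prompts about the same scene land close together as vectors while an unrelated prompt does not, which is the property a generator relies on when it uses the representation to guide the image.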
In the past, AI models for text, images, and videos operated independently, each focusing on a single type of data input. Text models, like those used in natural language processing (NLP), were designed to understand and generate human language. Image models, such as those based on convolutional neural networks (CNNs), were developed to recognize and generate images. Video models, often leveraging both image and temporal data, aimed to process and generate video content. These models were effective within their domains but limited in their ability to integrate information across different types of data.
The advent of multimodal AI represents a significant shift in how these models operate. Multimodal AI systems can process and interpret information from various data types simultaneously, such as text, images, audio, and video. This integration allows these models to understand and generate content that is more aligned with how humans perceive the world, which is inherently multimodal.
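[Embedded video: a short film created with AI by one person. No one in the film is real. It was made with raw text-to-video (T2V) and image-to-video (I2V) output from Runway, Kling, LumaLabs, Magnific, ElevenLabs, HeyGen, and, in the creator’s words, “some secret source.”]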
By combining inputs from different modalities, these models can achieve a deeper understanding of context and nuance. For example, a customer service AI that analyzes both the text of a customer’s query and the emotional tone of their voice can provide more empathetic and accurate responses than one that only examines the text of their inquiry.
Multimodal AI opens up many applications, such as virtual assistants that understand spoken instructions and read facial expressions, or educational tools that combine text, images, and interactive elements for a more engaging learning experience.
So, we have artificial intelligence systems that can generate text, images, audio (including music), and video (including movies). Before generative AI, AI tools were already better than humans at making predictions (who will win the election) and perhaps at rendering judgments (you should watch this movie). As stated, they can’t yet engage in advanced reasoning or hierarchical planning, or use common sense, but it is likely that those capabilities will develop and that you will spend at least your adult lives in a world where machines are at least as intelligent as you!