A brief history of AI

News | September 29, 2025 | Artifice Prime

Alan Turing famously thought that the question of whether machines can think is “too meaningless” to deserve discussion. To better define “thinking machines” or artificial intelligence, Turing proposed “The Imitation Game,” now usually called “The Turing Test,” in which an interrogator has to determine which of two entities in another room is a person and which is a machine by asking them both questions.

In his 1950 paper about this game, Turing wrote:

I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70% chance of making the right identification after five minutes of questioning. … I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.

Turing also addressed potential objections to his claim that digital computers can think. These are discussed at some length in the Stanford Encyclopedia of Philosophy article on the Turing Test.

Spoiler: The Imitation Game wasn’t passed according to Turing’s criteria by 2000, and probably still hasn’t been passed as of 2025. Of course, there have been major advances in the field of artificial intelligence over the years, but the new goal is to achieve artificial general intelligence (AGI), which as we’ll see is much more ambitious.

Language models

Language models go back to Andrey Markov in 1913; that area of study is now called Markov chains, a special case of Markov models. Markov showed that in Russian, specifically in Pushkin’s Eugene Onegin, the probability of a character appearing depends on the previous character, and that, in general, consonants and vowels tend to alternate. Markov’s methods have since been generalized to words, to other languages, and to other language applications.

Markov’s work was extended by Claude Shannon in 1948 for communications theory, and again by Fred Jelinek and Robert Mercer of IBM in 1985 to produce a language model based on cross-validation (which they called deleted estimates) and applied to real-time, large-vocabulary speech recognition. Essentially, a statistical language model assigns probabilities to sequences of words.
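
To make the idea concrete, here is a minimal sketch (in Python) of a word-bigram model that assigns a probability to a sequence of words. It is illustrative only; real statistical language models add smoothing for unseen word pairs, which is essentially the problem Jelinek and Mercer’s deleted estimation addressed.

```python
# A minimal word-bigram language model: count adjacent word pairs in a
# training text, then assign a probability to a new sequence of words.
from collections import defaultdict

def train_bigram(text):
    unigrams = defaultdict(int)   # counts of each word as a "previous word"
    bigrams = defaultdict(int)    # counts of each (previous word, word) pair
    words = text.lower().split()
    for prev, word in zip(words, words[1:]):
        unigrams[prev] += 1
        bigrams[(prev, word)] += 1
    return unigrams, bigrams

def sequence_probability(words, unigrams, bigrams):
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0            # unseen context; real models smooth instead
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

corpus = "the cat sat on the mat the dog sat on the rug"
uni, bi = train_bigram(corpus)
print(sequence_probability("the cat sat".split(), uni, bi))   # 0.25
```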

To quickly see a language model in action, type a few words into Google Search, or a text message app on your phone, and allow it to offer auto-completion options.

In 2000 Yoshua Bengio et al published a paper on a neural probabilistic language model in which neural networks replace the probabilities in a statistical language model, bypassing the curse of dimensionality and improving the word predictions (based on previous words) over a smoothed trigram model (then the state of the art) by 20% to 35%. The idea of feed-forward, autoregressive neural network models of language is still used today, although the models now have billions of parameters and are trained on extensive corpora, hence the term “large language models.”
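
As an illustration of that idea, here is a toy Bengio-style feed-forward language model sketched in PyTorch: embed the previous few words, push the concatenated embeddings through a hidden layer, and predict a distribution over the next word. The sizes are toy values and the details are simplified relative to the original paper.

```python
# Sketch of a feed-forward, autoregressive neural language model.
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size=10_000, context=3, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context * embed_dim, hidden)
        self.output = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):            # (batch, context) word IDs
        e = self.embed(context_ids)            # (batch, context, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))
        return self.output(h)                  # logits over the next word

model = FeedForwardLM()
logits = model(torch.randint(0, 10_000, (8, 3)))        # batch of 8 contexts
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10_000, (8,)))
```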

Image recognition

While language models can be traced back to 1913, image models can only be traced back to newspaper printing in the 1920s, and even that’s a stretch. In 1962, Hubel and Wiesel published research on functional architecture in the cat’s visual cortex; ongoing research in the next two decades led to the invention of the Neocognitron in 1980, an early precursor of convolutional neural networks (CNNs).

LeNet (1989) was a CNN for digit recognition; LeNet-5 (1998) from Yann LeCun et al, at Bell Labs, was an improved seven-layer CNN. LeCun went on to head Meta’s Facebook AI Research (FAIR) and teach at the Courant Institute of New York University, and CNNs became the backbone of deep neural networks for image recognition.
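
For a sense of what such a network looks like, here is a LeNet-5-style CNN sketched in PyTorch. The layer shapes follow the general outline of the 1998 design, but this is an illustrative reconstruction, not the original code.

```python
# A LeNet-5-style convolutional network for 32x32 grayscale digit images.
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):                     # x: (batch, 1, 32, 32)
        return self.classifier(self.features(x))

logits = LeNet5()(torch.randn(4, 1, 32, 32))  # -> shape (4, 10)
```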

Text to and from speech

The history of text to speech (TTS) goes back at least to ~1000 AD, when a “brazen head” of Pope Silvester II was able to speak, or at least that’s the legend. (I have visions of a dwarf hidden in the base of the statue.)

More verifiably, there were attempts at “speech machines” in the late 18th century, the Bell Labs vocoder in the 1930s, and early computer-based speech synthesis in the 1960s. In 2001: A Space Odyssey, HAL 9000 sings “Daisy Bell (A Bicycle Built for Two)” thanks to a real-life IBM 704-based demo that writer Arthur C. Clarke heard at Bell Labs in 1961. Texas Instruments produced the Speak & Spell toy in 1978, using linear predictive coding (LPC) chips.

Currently, text to speech is, at its best, almost believably human, available in both male and female voices, and available in a range of accents and languages. Some models based on deep learning are able to vary their output based on the implied emotion of the words being spoken, although they aren’t exactly Gielgud or Brando.

Speech to text (STT) or automatic speech recognition (ASR) goes back to the early 1950s, when a Bell Labs system called Audrey was able to recognize digits spoken by a single speaker. By 1962, an IBM Shoebox system could recognize a vocabulary of 16 words from multiple speakers. In the late 1960s, Soviet researchers used a dynamic time warping algorithm to achieve recognition of a 200-word vocabulary.
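
Dynamic time warping is simple enough to sketch in a few lines: it finds the minimum-cost monotonic alignment between two sequences of different lengths, which is what let those early systems match a spoken word against stored templates. A minimal Python version, on made-up feature sequences:

```python
# Dynamic time warping (DTW) between two 1-D feature sequences.
def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

template = [1, 2, 3, 4, 3, 2, 1]         # stored reference "word"
utterance = [1, 1, 2, 3, 4, 4, 3, 2, 1]  # slower spoken version
print(dtw_distance(template, utterance))  # small cost = good match
```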

In the late 1970s, James and Janet Baker applied the hidden Markov model (HMM) to speech recognition at CMU; the Bakers founded Dragon Systems in 1982. At the time, Dragon was one of the few competitors to IBM in commercial speech recognition. IBM boasted a 20K-word vocabulary. Both systems required users to train them extensively to be able to achieve reasonable recognition rates.

In speech recognition, HMMs were long combined with Gaussian mixture models to model the acoustics, and from the late 2000s they were increasingly combined with feed-forward (and then deep) neural networks. Today, the speech recognition field is dominated by long short-term memory (LSTM) models, time delay neural networks (TDNNs), and transformers. Speech recognition systems rarely need speaker training and have vocabularies larger than most humans’.

Language translation

Automatic language translation has its roots in the work of Abu Yusuf Al-Kindi, a ninth-century Arab cryptographer who worked on cryptanalysis, frequency analysis, and probability and statistics. In the 1930s, Georges Artsrouni filed patents for an automatic bilingual dictionary based on paper tape. In 1949, Warren Weaver of the Rockefeller Foundation proposed computer-based machine translation based on information theory, code breaking, and theories about natural language.

In 1954 a collaboration of Georgetown University and IBM demonstrated a toy system using an IBM 701 to translate 60 Romanized Russian sentences into English. The system had six grammar rules and 250 lexical items (stems and endings) in its vocabulary, in addition to a word list slanted towards science and technology.

In the 1960s there was a lot of work on automating the Russian-English language pair, with little success. The 1966 US ALPAC report concluded that machine translation was not worth pursuing. Nevertheless, a few researchers persisted with rule-based mainframe machine translation systems, including Peter Toma, who produced SYSTRAN, and found customers in the US Air Force and the European Commission. SYSTRAN eventually became the basis for Google Language Tools, later named Google Translate.

Google Translate switched from statistical to neural machine translation in 2016, and immediately exhibited improved accuracy. At the time, Google claimed a 60% reduction in errors for some language pairs. Accuracy has only improved since then. Google has refined its translation algorithms to use a combination of long short-term memory (LSTM) and transformer blocks. Google Translate currently supports over 200 languages.

Google Translate has almost a dozen credible competitors at this point. Some of the most prominent are DeepL Translator, Microsoft Translator, and iTranslate.

Code generation

Code generation models are a subset of language models, but they have some differentiating features. First of all, code is less forgiving than natural language in that it either compiles/interprets and runs correctly or it doesn’t. Code generation also allows for an automatic feedback loop that isn’t really possible for natural language generation, either using a language server running in parallel with a code editor or an external build process.
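
Here is one way that feedback loop might look in practice, sketched in Python. The `generate_code` function is a hypothetical placeholder for whatever code-generation model or API you actually use; the loop simply runs the generated script and feeds any error output back into the next prompt.

```python
# Sketch of an automatic feedback loop for code generation.
import subprocess
import tempfile

def generate_code(prompt: str) -> str:
    raise NotImplementedError("call your code-generation model here")

def generate_with_feedback(task: str, attempts: int = 3):
    prompt = task
    for _ in range(attempts):
        code = generate_code(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        result = subprocess.run(["python", f.name],
                                capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code        # it ran without error; still review and test it!
        # Append the error so the model can try to fix its own mistake.
        prompt = f"{task}\n\nPrevious attempt failed with:\n{result.stderr}"
    return None
```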

While several general large language models can be used for code generation as released, it helps if they are fine-tuned on some code, typically training on free open-source software to avoid overt copyright violation. That doesn’t mean that nobody will complain about unfair use, but as of now the court cases are not settled.

Even though new, better code generation models seem to drop on a weekly basis, they still can’t be trusted. It’s incumbent on the programmer to review, debug, and test any code he or she develops, whether it was generated by a model or written by a person. Given the unreliability of large language models and their tendency to hallucinate believably, I treat AI code generators as though they are smart junior programmers with a drinking problem.

Approaches to AI

Artificial intelligence as a field has a checkered history. Early work was directed at game playing (checkers and chess) and theorem proving, then the emphasis moved on to natural language processing, backward chaining, forward chaining, and neural networks. After the “AI winter” of the 1970s, expert systems became commercially viable in the 1980s, although the companies behind them didn’t last long.

In the 1990s, the DART scheduling application deployed in the first Gulf War paid back DARPA’s 30-year investment in AI, and IBM’s Deep Blue defeated chess grand master Garry Kasparov. In the 2000s, autonomous robots became viable for remote exploration (Nomad, Spirit, and Opportunity) and household cleaning (Roomba). In the 2010s, we saw a viable vision-based gaming system (Microsoft Kinect), self-driving cars (Google Self-Driving Car Project, now Waymo), IBM Watson defeating two past Jeopardy! champions, and a Go-playing victory against a ninth-Dan ranked Go champion (Google DeepMind’s AlphaGo).

Kinds of machine learning

Machine learning can solve non-numeric classification problems (e.g., “predict whether this applicant will default on his loan”) and numeric regression problems (e.g., “predict the sales of food processors in our retail locations for the next three months”), both of which are primarily trained using supervised learning (the training data has already been tagged with the answers). Tagging training data sets can be expensive and time-consuming, so supervised learning is often enhanced with semi-supervised learning: apply the supervised model trained on a small tagged data set to a larger untagged data set, and add whatever predictions have a high probability of being correct to the training data for further predictions. Semi-supervised learning can sometimes go off the rails, so you can improve the process with human-in-the-loop (HITL) review of questionable predictions.
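
Here is a minimal sketch of that self-training loop using scikit-learn. The classifier and the 0.95 confidence threshold are arbitrary choices for illustration; in practice, the rows that fail the threshold are exactly the ones you would route to a human reviewer.

```python
# Semi-supervised self-training: fit on labeled data, pseudo-label the rest.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95):
    # Step 1: train on the small labeled set.
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    # Step 2: pseudo-label the unlabeled pool, keeping only confident rows.
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    pseudo_labels = model.predict(X_unlabeled)[confident]
    # Rows below the threshold are candidates for HITL review.
    X_grown = np.vstack([X_labeled, X_unlabeled[confident]])
    y_grown = np.concatenate([y_labeled, pseudo_labels])
    # Step 3: retrain on the grown data set.
    return LogisticRegression(max_iter=1000).fit(X_grown, y_grown)
```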

While the biggest problem with supervised learning is the expense of labeling the training data, the biggest problem with unsupervised learning (where the data is not labeled) is that it often doesn’t work very well. Nevertheless, unsupervised learning does have its uses. It can sometimes be good for reducing the dimensionality of a data set, exploring the data’s patterns and structure, finding groups of similar objects, and detecting outliers and other noise in the data.
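
A few lines of scikit-learn illustrate those uses on made-up data; the specific techniques shown here (PCA, k-means, isolation forest) are common choices, not the only ones.

```python
# Unsupervised learning: dimensionality reduction, clustering, outlier detection.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

X = np.random.rand(500, 20)                   # unlabeled data, 20 features

X_2d = PCA(n_components=2).fit_transform(X)   # reduce to 2 dimensions
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X_2d)    # group similar rows
outliers = IsolationForest().fit_predict(X) == -1               # flag anomalies
```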

The potential of an agent that learns for the sake of learning is far greater than a system that reduces complex pictures to a binary decision (e.g., dog or cat). Uncovering patterns rather than carrying out a pre-defined task can yield surprising and useful results, as demonstrated when researchers at Lawrence Berkeley National Laboratory ran a text processing algorithm (Word2vec) on several million material science abstracts to predict discoveries of new thermoelectric materials.

Reinforcement learning trains an actor or agent to respond to an environment in a way that maximizes some value, usually by trial and error. That’s different from supervised and unsupervised learning, but reinforcement learning is often combined with them. It has proven useful for training computers to play games and for training robots to perform tasks.
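
Tabular Q-learning on a toy problem shows that trial-and-error loop in miniature. Everything here (the one-dimensional corridor environment, the hyperparameters) is illustrative, and real systems replace the table with a neural network once the state space gets large.

```python
# Tabular Q-learning: the agent starts mid-corridor and is rewarded at the right end.
import random

N_STATES = 6                      # a 1-D corridor; the goal is the last cell
ACTIONS = (-1, +1)                # move left or move right
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state = N_STATES // 2
    while state != N_STATES - 1:
        if random.random() < epsilon:
            action = random.randrange(2)                        # explore
        else:
            action = max(range(2), key=lambda a: Q[state][a])   # exploit
        next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Q-learning update: nudge toward reward + discounted future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print([round(max(q), 2) for q in Q])   # values grow as states approach the goal
```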

Neural networks, which were originally inspired by the architecture of the biological visual cortex, consist of a collection of connected units, called artificial neurons, organized in layers. The artificial neurons often use sigmoid or ReLU (rectified linear unit) activation functions, as opposed to the step functions used for the early perceptrons. Neural networks are usually trained with supervised learning.

Deep learning uses neural networks that have a large number of “hidden” layers to identify features. Hidden layers come between the input and output layers. The more layers in the model, the more features can be identified. At the same time, the more layers in the model, the longer it takes to train. Hardware accelerators for neural networks include GPUs, TPUs, and FPGAs.

Fine-tuning can speed up the customization of models significantly by training a few final layers on new tagged data without modifying the weights of the rest of the layers. Models that lend themselves to fine-tuning are called base models or foundation models.
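
In code, fine-tuning can be as simple as freezing a pretrained backbone and training only a new final layer. The sketch below uses torchvision’s ResNet-18 purely as a convenient stand-in for a foundation model; the five-class head is an arbitrary example.

```python
# Fine-tuning sketch: freeze the pretrained weights, train only a new head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # a pretrained "foundation"
for param in model.parameters():
    param.requires_grad = False                    # freeze existing weights

model.fc = nn.Linear(model.fc.in_features, 5)      # new final layer, 5 classes
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3)
# ...then train as usual; only the new layer's weights are updated.
```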

Vision models often use deep convolutional neural networks. Vision models can identify the elements of photographs and video frames, and are usually trained on very large photographic data sets.

Language models sometimes use convolutional neural networks, but more recently tend to use recurrent neural networks, long short-term memory, or transformers. Language models can be constructed to translate from one language to another, to analyze grammar, to summarize text, to analyze sentiment, and to generate text. Language models are usually trained on very large language data sets.

AI application areas

Artificial intelligence can be used in many application areas, although how effective it is for any given use is another issue. For example, in healthcare, AI has been applied to diagnosis and treatment, to drug discovery, to surgical robotics, and to clinical documentation. While the results in some of these areas are promising, AI is not yet replacing doctors, not even overworked radiologists and pathologists.

In business, AI has been applied to customer service, with success as long as there’s a path to loop in a human; to data analytics, essentially as an assistant; to supply chain optimization; and to marketing, often for personalization. In technology, AI enables computer vision, i.e., identifying and/or locating objects in digital images and videos, and natural language processing, i.e., understanding written and spoken input and generating written and spoken output. Thus AI helps with autonomous vehicles, as long as they have multi-band sensors; with robotics, as long as there are hardware-based safety measures; and with software development, as long as you treat it like a junior developer with a drinking problem. Other application areas include education, gaming, agriculture, cybersecurity, and finance.

In manufacturing, custom vision models can detect quality deviations. In plant management, custom sound models can detect impending machine failures, and predictive models can replace parts before they actually wear out.

Large language models

Language models have a history going back to the early 20th century, but large language models (LLMs) emerged with a vengeance after improvements from the application of neural networks in 2000 and, in particular, the introduction of the transformer deep neural network architecture in 2017. LLMs can be useful for a variety of tasks, including text generation from a descriptive prompt, code generation and code completion in various programming languages, text summarization, translation between languages, text to speech, and speech to text.

LLMs often have drawbacks, at least in their current stage of development. Generated text is usually mediocre, and sometimes comically bad and/or wrong. LLMs can invent facts that sound reasonable if you don’t know better; in the trade, these inventions are called hallucinations. Automatic translations are rarely 100% accurate unless they’ve been vetted by native speakers, which most often happens for common phrases. Generated code often has bugs, and sometimes doesn’t even have a hope of running. While LLMs are usually fine-tuned to avoid making controversial statements or recommending illegal acts, these guardrails can be breached by malicious prompts.

Training LLMs requires at least one large corpus of text. Examples for text generation training include the 1B Word Benchmark, Wikipedia, the Toronto Book Corpus, the Common Crawl data set and, for code, the public open-source GitHub repositories. There are (at least) two potential problems with large text data sets: copyright infringement and garbage. Copyright infringement is an unresolved issue that’s currently the subject of multiple lawsuits. Garbage can be cleaned up. For example, the Colossal Clean Crawled Corpus (C4) is an 800 GB, cleaned-up data set based on the Common Crawl data set.

Along with at least one large training corpus, LLMs require large numbers of parameters (weights). The number of parameters grew over the years, until it didn’t. ELMo (2018) has 93.6 M (million) parameters; BERT (2018) was released in 100 M and 340 M parameter sizes; GPT-1 (2018) uses 117 M parameters. GPT-2 (2019) has 1.6 B (billion) parameters; T5 (2020) has 220 M parameters; GPT-3 (2020) has 175 B parameters; and PaLM (2022) has 540 B parameters. GPT-4 (2023) has 1.76 T (trillion) parameters.

Small language models

More parameters make a model more accurate, but also make the model require more memory and run more slowly. In 2023, we started to see some smaller models released at multiple sizes. For example, Meta FAIR’s Llama 2 comes in 7B, 13B, and 70B parameter sizes, while Anthropic’s Claude 2 has 93B and 137B parameter sizes.

One of the motivations for this trend is that smaller generic models trained on more tokens are easier and cheaper to use as foundations for retraining and fine-tuning specialized models than huge models. Another motivation is that smaller models can run on a single GPU or even locally.

Meta FAIR has introduced a bunch of improved small language models since 2023, with the latest numbered Llama 3.1, 3.2, and 3.3. Llama 3.1 has multilingual models in 8B, 70B, and 405B sizes (text in/text out). The Llama 3.2 multilingual large language models comprise a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out); there are also quantized versions of these models. The Llama 3.2 models are smaller and less capable derivatives of Llama 3.1.

The Llama 3.2-Vision collection of multimodal large language models comprises pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.3 multilingual large language model is a pretrained and instruction-tuned generative model in a 70B size (text in/text out).

Many other vendors have joined the small language model party, for example Alibaba with the Qwen series and QwQ; Mistral AI with Mistral, Mixtral, and Nemo models; the Allen Institute with Tülu; Microsoft with Phi; Cohere with Command R and Command A; IBM with Granite; Google with Gemma; Stability AI with Stable LM Zephyr; Hugging Face with SmolLM; Nvidia with Nemotron; DeepSeek with DeepSeek-V3 and DeepSeek-R1; and Manus AI with Manus. Many of these models are available to run locally in Ollama.
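
For example, assuming Ollama is installed and a small model has been pulled (e.g., `ollama pull llama3.2`), you can call it from Python through Ollama’s local REST API; swap in whatever model name you actually run.

```python
# Query a small language model running locally under Ollama (default port 11434).
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",       # whatever model you've pulled locally
        "prompt": "Explain Markov chains in one sentence.",
        "stream": False,
    },
    timeout=120,
)
print(response.json()["response"])
```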

Image generators

Image generators can start with text prompts and produce images; start with an image and text prompt to produce other images; edit and retouch photographs; and create videos from text prompts and images. While there have been several algorithms for image generation in the past, the current dominant method is to use diffusion models.

Services that use diffusion models include Stable Diffusion, Midjourney, Dall-E, Adobe Firefly, and Leonardo AI. Each of these has a different model, trained on different collections of images, and has a different user interface.

In general, these models train on large collections of labeled images. The training process adds Gaussian noise to each image, iteratively, and then tries to recreate the original image using a neural network. The difference between the original image and the recreated image defines the loss of the neural network.

To generate a new image from a prompt, the method starts with random noise, and iteratively uses a diffusion process controlled by the trained model and the prompt. You can keep running the diffusion process until you arrive at the desired level of detail.
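
The following PyTorch sketch compresses that recipe (DDPM style) into a few functions: add noise at a random timestep, train a denoiser to predict that noise, then generate by starting from pure noise and denoising step by step. The tiny network, the 64-dimensional stand-in “images,” and the schedule are all toy values, and real text-to-image models also condition the denoiser on the prompt.

```python
# Schematic diffusion model: noising, denoiser training, and sampling.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

# Placeholder denoiser: input is a noisy "image" plus the timestep.
denoiser = nn.Sequential(nn.Linear(64 + 1, 128), nn.ReLU(), nn.Linear(128, 64))

def training_step(x0):                         # x0: (batch, 64) clean samples
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise          # add noise
    pred = denoiser(torch.cat([x_t, t.unsqueeze(1).float() / T], dim=1))
    return nn.functional.mse_loss(pred, noise)              # learn to predict it

def sample(batch=1):                           # start from noise, denoise stepwise
    x = torch.randn(batch, 64)
    for t in reversed(range(T)):
        t_in = torch.full((batch, 1), t / T)
        eps = denoiser(torch.cat([x, t_in], dim=1))
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```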

Diffusion-based image generators currently tend to fall down when you ask them to produce complicated images with multiple subjects. They also have trouble generating the correct number of fingers on people, and tend to generate lips that are unrealistically smooth.

RAG, agents, and MCP

Retrieval-augmented generation (RAG) is a technique used to “ground” large language models with specific data sources, often sources that weren’t included in the models’ original training. RAG’s three steps are retrieval from a specified source, augmentation of the prompt with the context retrieved from the source, and then generation using the model and the augmented prompt.
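
Stripped to its essentials, RAG looks something like the sketch below. The `embed` and `llm` functions are hypothetical placeholders for whatever embedding model and LLM you actually call, and real systems use a vector database rather than brute-force cosine similarity.

```python
# The three RAG steps in miniature: retrieve, augment, generate.
import numpy as np

def embed(text: str) -> np.ndarray: ...        # placeholder: returns a vector
def llm(prompt: str) -> str: ...               # placeholder: returns generated text

def rag_answer(question: str, documents: list, k: int = 3) -> str:
    doc_vectors = np.stack([embed(d) for d in documents])
    q = embed(question)
    # 1. Retrieval: rank documents by cosine similarity to the question.
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(documents[i] for i in np.argsort(scores)[::-1][:k])
    # 2. Augmentation: prepend the retrieved context to the prompt.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generation: let the model answer from the augmented prompt.
    return llm(prompt)
```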

At one point, RAG seemed like it would be the answer to everything that’s wrong with LLMs. While RAG can help, it isn’t a magical fix. In addition, RAG can introduce its own issues. Finally, as LLMs get better, adding larger context windows and better search integrations, RAG is becoming less necessary for many use cases.

Meanwhile, several new, improved kinds of RAG architectures have been introduced. One example combines RAG with a graph database. The combination can make the results more accurate and relevant, particularly when relationships and semantic content are important. Another example, agentic RAG, expands the resources available to the LLM to include tools and functions as well as external knowledge sources, such as text databases.

Agentic RAG, often called agents or AI assistants, is not at all the same as the agents of the late 1990s. Modern AI agents rely on other programs to provide context to assist them in generating correct answers to queries. The catch here is that other programs have no standard, universal interface or API.

In 2024, Anthropic open-sourced the Model Context Protocol (MCP), which allows all models and external programs that support it to communicate easily. I wouldn’t normally expect other companies to support something like MCP, as it normally takes years of acrimonious meetings and negotiations to establish an industry standard. Nevertheless, there are some encouraging mitigating factors:

  • There’s an open-source repository of MCP servers.
  • Anthropic has shared pre-built MCP servers for popular enterprise systems, such as Google Drive, Slack, GitHub, Git, Postgres, and Puppeteer.
  • Claude 3.5 Sonnet is adept at quickly building MCP server implementations.

While no-one can promise wide adoption of MCP, Anthropic seems to have removed the technical barriers to adoption. If only removing the political barriers were as easy.

Ways of speeding up training and inference

Slow training and inference have been serious problems ever since we started using neural networks, and only got worse with the advent of deep learning models, never mind large language models. Nvidia made a fortune supplying GPU hardware to accelerate training and inference, and there are several other hardware accelerators to consider. But throwing hardware at the problem isn’t the only way to solve it, and I’ve written about several of the software techniques, such as model quantization.

I asked Google to search for “ways of speeding up training and inference,” and it came up with a generative AI summary with web links (below) that’s pretty good. I was surprised at how good, but then I remembered that Google recently released an improved version of its Gemini model. Historically, I have largely ignored Google’s AI summaries of searches, since they have tended to mash up different domains into a single response that verges on nonsense; this one seems to be OK.

To speed up both training and inference, you can employ techniques like model quantization, pruning, knowledge distillation, hardware acceleration, and optimizing hyperparameters, data preprocessing, and model architecture. [1, 2, 3]

Here’s a more detailed breakdown of these techniques: [1, 2, 4]

Model optimization: [1, 2, 4]

  • Model quantization: Reducing the precision of model parameters (e.g., from 32-bit floating point to 8-bit integers) can significantly decrease model size and computational requirements, leading to faster inference. [1, 2, 4]
  • Model pruning: Removing less important weights or connections in a model can simplify it, reducing its size and computational cost without a significant loss in accuracy. [1, 2, 3]
  • Knowledge distillation: Training a smaller, faster model (the “student”) to mimic the behavior of a larger, more complex model (the “teacher”) allows for faster inference with comparable accuracy. [1, 2, 3, 5]

Hardware and software optimization: [1, 3]

  • Hardware acceleration: Utilizing specialized hardware like GPUs, TPUs, or FPGAs can significantly accelerate training and inference, especially for computationally intensive tasks. [1, 3]
  • Distributed training: Training models across multiple machines or GPUs can drastically reduce training time, especially for large datasets and models. [6]
  • Mixed precision training: Using a combination of 16-bit and 32-bit floating-point operations during training can lead to faster training speeds on modern GPUs without sacrificing accuracy. [7]
  • Efficient attention mechanisms (for transformer models): Optimizing attention mechanisms, such as using sparse attention or efficient attention kernels, can speed up training and inference of transformer models. [8]
  • Compile models: Using model compilation tools like torch.compile in PyTorch can optimize models for faster execution. [9]

Data and hyperparameter optimization: [10]

  • Data preprocessing: Efficiently preprocessing data can reduce the computational burden during training and inference. [10]
  • Hyperparameter optimization: Tuning hyperparameters like learning rate, batch size, and optimizer settings can significantly impact training speed and model performance. [11]
  • Learning rate scheduling: Using a learning rate schedule that dynamically adjusts the learning rate during training can improve convergence speed and model performance. [12]
  • Batch size optimization: Carefully tuning the batch size can have a significant impact on training efficiency. [12]
  • Early stopping and checkpointing: Implementing early stopping and checkpointing can help prevent overfitting and save computational resources during training. [13]
  • Model architecture optimization: Choosing an efficient model architecture tailored to the specific task can reduce computational complexity and improve performance. [3]
  • Data caching: Storing intermediate results or frequently accessed data in memory to reduce the need for repeated calculations, thereby speeding up model training and inference. [14]

Generative AI is experimental.

[1] https://www.run.ai/guides/cloud-deep-learning/ai-inference

[2] https://nebius.com/blog/posts/inference-optimization-techniques-solutions

[3] https://www.linkedin.com/pulse/how-improve-inference-performance-your-ai-applications-deciai

[4] https://towardsdatascience.com/inference-optimization-for-convolutional-neural-networks-e63b51b0b519/

[5] https://www.signitysolutions.com/tech-insights/speeding-up-inference-openai-models-optimization-techniques

[6] https://news.presearch.io/presearch-ai-269c94259a27

[7] https://medium.com/better-programming/speed-up-llm-inference-83653aa24c47

[8] https://www.linkedin.com/advice/0/how-do-you-optimize-training-inference-speed-transformer

[9] https://lightning.ai/courses/deep-learning-fundamentals/9.0-overview-techniques-for-speeding-up-model-training/

[10] https://www.mathworks.com/help/deeplearning/ug/speed-up-deep-neural-network-training.html

[11] https://www.mdpi.com/2079-9292/14/6/1184

[12] https://www.quora.com/What-methods-can-be-used-to-optimize-training-models-for-faster-and-more-efficient-learning-of-deep-neural-networks

[13] https://www.linkedin.com/advice/1/how-can-you-speed-up-training-ann-skills-machine-learning-l0gtf

[14] https://www.markovml.com/glossary/model-scaling
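
To make one of those techniques concrete, here is what post-training dynamic quantization looks like in PyTorch: the weights of the Linear layers are converted to 8-bit integers, shrinking the model and typically speeding up CPU inference. The model below is a stand-in, and the actual gains vary by architecture and hardware.

```python
# Post-training dynamic quantization of Linear layers with PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)    # same interface, smaller and faster on CPU
```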

Artificial general intelligence

The new goal for the cool kids in the AI space is to achieve artificial general intelligence (AGI), which is defined to require a lot more in the way of smarts and generalization ability than Turing’s imitation game. Google Cloud defines AGI this way:

Artificial general intelligence (AGI) refers to the hypothetical intelligence of a machine that possesses the ability to understand or learn any intellectual task that a human being can. It is a type of artificial intelligence (AI) that aims to mimic the cognitive abilities of the human brain.

In addition to the core characteristics mentioned earlier, AGI systems also possess certain key traits that distinguish them from other types of AI:

  • Generalization ability: AGI can transfer knowledge and skills learned in one domain to another, enabling it to adapt to new and unseen situations effectively.
  • Common sense knowledge: AGI has a vast repository of knowledge about the world, including facts, relationships, and social norms, allowing it to reason and make decisions based on this common understanding.

The pursuit of AGI involves interdisciplinary collaboration among fields such as computer science, neuroscience, and cognitive psychology. Advancements in these areas are continuously shaping our understanding and the development of AGI. Currently, AGI remains largely a concept and a goal that researchers and engineers are working towards.

The obvious next question is how you might identify an AGI system. As it happens, a new suite of benchmarks designed to answer that very question, ARC-AGI-2, was recently released. The ARC-AGI-2 announcement reads:

Today we’re excited to launch ARC-AGI-2 to challenge the new frontier. ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans. Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve only single-digit percentage scores. In contrast, every task in ARC-AGI-2 has been solved by at least two humans in under two attempts.

By the way, the comparison is to ARC-AGI-1, which was released in 2019.

The other interesting initial finding of ARC-AGI-2 is the cost efficiency of each system, including human panels (see below). CoT means chain of thought, which is a technique for making LLMs think things through. The asterisks flag preliminary numbers.

System                          | ARC-AGI-1 | ARC-AGI-2 | Efficiency (cost/task)
Human panel (at least 2 humans) | 98%       | 100%      | $17
Human panel (average)           | 64.2%     | 60%       | $17
o3-low (CoT + Search/Synthesis) | 75.7%     | 4%*       | $200
o1-pro (CoT + Search/Synthesis) | ~50%      | 1%*       | $200*
ARChitects (Kaggle 2024 Winner) | 53.5%     | 3%        | $0.25
o3-mini-high (Single CoT)       | 35%       | 0.0%      | $0.41
r1 and r1-zero (Single CoT)     | 15.8%     | 0.3%      | $0.08
gpt-4.5 (Pure LLM)              | 10.3%     | 0.0%      | $0.29

By the way, there’s a competition with $1 million in prizes.

Some tentative conclusions (September 2025)

Right now, generative AI seems to be a few years away from production quality for most application areas. For example, the best LLMs can currently do a fair to good job of summarizing text, but do a lousy job of writing essays. Students who depend on LLMs to write their papers can expect C’s at best, and F’s if their teachers or professors recognize the tells and quirks of the models used.

Along the same lines, there’s a common description of articles and books generated by LLMs: “AI slop.” AI slop not only powers a race to the bottom in publishing, but it also opens the possibility that future LLMs that train on corpora contaminated by AI slop will be worse than today’s models.

There is research that says that heavy use of AI (to the point of over-reliance) tends to diminish users’ abilities to think critically, solve problems, and express creativity. On the other hand, there is research that says that using AI for guidance or as a supportive tool actually boosts cognitive development.

Generative AI for code completion and code generation is a special case, because code checkers, compilers, and test suites can often expose errors made by the model. If you use AI code generators as a faster way to write code that you could have written yourself, they can sometimes deliver a net gain in productivity. On the other hand, if you are a novice attempting “vibe coding,” the chances are good that all you are producing is technical debt that would take longer to fix than a good programmer would take to write the code from scratch.

Self-driving using AI is currently a mixed bag. Waymo AI, which originated as the Google Self-Driving Car Project, uses lidar, cameras, and radar to synthesize a better image of the real world than human eyes can manage. On the other hand, Tesla Full Self-Driving (FSD), which relies only on cameras, is perceived as error-prone and “a mess” by many users and reviewers.

Meanwhile, AGI seems to be a decade away, if not more. Yes, the CEOs of the major LLM companies publicly predict AGI within five years, but they’re not exactly unbiased, given that their jobs depend on achieving AGI. The models and reasoning systems will certainly keep improving on benchmarks, but benchmarks rarely reflect the real world, no matter how hard the benchmark authors try. And the real world is what matters.

Original Link:https://www.infoworld.com/article/4061121/a-brief-history-of-ai.html
Originally Posted: Mon, 29 Sep 2025 09:00:00 +0000


Artifice Prime

Artifice Prime is an AI enthusiast with over 25 years of experience as a Linux sysadmin. They have an interest in artificial intelligence, its use as a tool to further humankind, and its impact on society.
