Smoothing out AI’s rough edges
Follow the usual AI suspects on X—Andrew Ng, Paige Bailey, Demis Hassabis, Thom Wolf, Santiago Valdarrama, etc.—and you start to discern patterns in emerging AI challenges and how developers are solving them. Right now, these prominent practitioners expose at least two forces confronting developers: amazing capability gains beset by the all-too-familiar (and stubborn) software problems. Models keep getting smarter; apps keep breaking in the same places. The gap between demo and durable product remains the place where most engineering happens.
How are development teams breaking the impasse? By getting back to basics.
Things (agents) fall apart
Andrew Ng has been hammering home a point many builders have learned through hard experience: “When data agents fail, they often fail silently—giving confident-sounding answers that are wrong, and it can be hard to figure out what caused the failure.” He emphasizes systematic evaluation and observability for each step an agent takes, not just end-to-end accuracy. We may like the term “vibe coding,” but smart developers are imposing the rigor of unit tests, traces, and health checks on agent plans, tools, and memory.
In other words, they’re treating agents like distributed systems. You instrument every step with OpenTelemetry, you keep small “golden” data sets for repeatable evals, and you run regressions on plans and tools the same way you do for APIs. This becomes critical as we move beyond toy apps and start architecting agentic systems, where Ng notes that agents themselves are being used to write and run tests to keep other agents honest. It’s meta, but it works when the test harness is treated like real software: versioned, reviewed, and measured.
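To make that concrete, here is a minimal sketch of step-level instrumentation plus a golden-set regression check, using the OpenTelemetry Python SDK. The agent steps, the run_agent_step stub, and the golden set are invented placeholders, not anyone's production harness.

```python
# Sketch: wrap each agent step in an OpenTelemetry span and replay a small,
# versioned "golden" set of questions as a regression test.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.demo")

GOLDEN_SET = [  # tiny eval set checked into the repo and reviewed like code
    {"question": "What is our refund window?", "expected": "30 days"},
]

def run_agent_step(step_name: str, payload: str) -> str:
    """Placeholder for a real planner, tool, or memory call."""
    return f"{step_name} handled: {payload}"

def run_agent(question: str) -> str:
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.question", question)
        result = ""
        for step in ("plan", "retrieve", "answer"):
            with tracer.start_as_current_span(f"agent.step.{step}") as step_span:
                result = run_agent_step(step, question)
                step_span.set_attribute("agent.step.output_chars", len(result))
        return result

def test_golden_set():
    for case in GOLDEN_SET:
        answer = run_agent(case["question"])
        # A real harness would score the answer against case["expected"].
        assert answer, f"empty answer for {case['question']}"

if __name__ == "__main__":
    test_golden_set()
```

The point is less the specific spans than the habit: every plan, tool call, and memory read leaves a trace, and the eval set is treated as versioned test data.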
Santiago Valdarrama echoes the same caution, sometimes suggesting a massive step back. His guidance is refreshingly unglamorous: Resist the urge to turn everything into an agent. Although it can be “really tempting to add complexity for no reason,” it pays to sidestep that temptation. If a plain function will do, use a plain function because, as he says, “regular functions almost always win.”
Fix the data, not just the model
Before you even think about tweaking your model, you need to fix retrieval. As Ng suggests, most “bad answers” from RAG (retrieval-augmented generation) systems are self-inflicted—the result of sloppy chunking, missing metadata, or a disorganized knowledge base. It’s not a model problem; it’s a data problem.
The teams that win treat knowledge as a product. They build structured corpora, sometimes using agents to lift entities and relations into a lightweight graph. They grade their RAG systems like search engines: on freshness, coverage, and hit rate against a golden set of questions. Chunking isn’t just a library default; it’s an interface that needs to be designed with named hierarchies, titles, and stable IDs.
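As a rough illustration of that mindset, the sketch below treats chunks as a designed interface (stable IDs, titles, a named hierarchy) and grades retrieval by hit rate against a golden question set. The corpus, the naive keyword retriever, and the metric are all assumptions for demonstration.

```python
# Sketch: chunks with stable IDs and hierarchy, graded by hit rate at k
# against a golden question set. The retriever is a deliberately naive stub.
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str      # stable ID so evals and citations survive re-indexing
    title: str
    section_path: str  # named hierarchy, e.g. "Handbook > Billing > Refunds"
    text: str

CORPUS = [
    Chunk("billing-refunds-001", "Refund policy", "Handbook > Billing > Refunds",
          "Refunds are available within 30 days of purchase."),
    Chunk("billing-invoices-001", "Invoices", "Handbook > Billing > Invoices",
          "Invoices are issued on the first business day of each month."),
]

GOLDEN_QUESTIONS = [  # question -> chunk that should come back
    ("How long do customers have to request a refund?", "billing-refunds-001"),
    ("When are invoices sent out?", "billing-invoices-001"),
]

def retrieve(question: str, k: int = 3) -> list[Chunk]:
    """Placeholder retriever: keyword overlap instead of a real vector index."""
    def score(chunk: Chunk) -> int:
        return len(set(question.lower().split()) & set(chunk.text.lower().split()))
    return sorted(CORPUS, key=score, reverse=True)[:k]

def hit_rate_at_k(k: int = 3) -> float:
    hits = sum(
        1 for question, expected_id in GOLDEN_QUESTIONS
        if expected_id in {c.chunk_id for c in retrieve(question, k)}
    )
    return hits / len(GOLDEN_QUESTIONS)

if __name__ == "__main__":
    print(f"hit rate @3: {hit_rate_at_k():.0%}")
```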
And don’t forget JSON. Teams are increasingly moving from “free-text and pray” to schema-first prompts with strict validators at the boundary. It feels boring until your parsers stop breaking and your tools stop misfiring. Constrained output turns LLMs from chatty interns into services that can safely call other services.
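Here is what that boundary can look like, sketched with Pydantic (any strict validator would do). The SupportTicket schema and the raw model outputs are invented for illustration.

```python
# Sketch: validate LLM output against a schema at the boundary instead of
# parsing free text. Assumes Pydantic v2; the schema is an invented example.
from pydantic import BaseModel, ValidationError

class SupportTicket(BaseModel):
    customer_id: str
    severity: int          # e.g. 1 (critical) to 4 (low)
    summary: str
    needs_human: bool

def parse_ticket(raw_llm_output: str) -> SupportTicket | None:
    """Accept the model's answer only if it matches the contract."""
    try:
        return SupportTicket.model_validate_json(raw_llm_output)
    except ValidationError as err:
        # Reject (or re-prompt) rather than letting malformed output reach downstream tools.
        print(f"rejected model output: {err.error_count()} validation error(s)")
        return None

good = '{"customer_id": "c-123", "severity": 2, "summary": "Login loops", "needs_human": false}'
bad = '{"customer_id": "c-123", "severity": "urgent", "summary": "Login loops"}'

print(parse_ticket(good))
print(parse_ticket(bad))
```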
Put coding copilots on guardrails
OpenAI’s latest push around GPT-5-Codex is less “autocomplete” and more AI “robots” that read your repo, point out mistakes, and open a pull request, suggests OpenAI cofounder Greg Brockman. On that note, he has been highlighting automatic code review in the Codex CLI, with successful runs even when pointed at the “wrong” repo (it found its way), and general availability of GPT-5-Codex in the Responses API. That’s a new level of repo-aware competence.
It’s not without complications, though, and there’s a risk of too much delegation. As Valdarrama quips, “letting AI write all of my code is like paying a sommelier to drink all of my wine.” In other words, use the machine to accelerate code you’d be willing to own; don’t outsource judgment. In practice, this means developers must tighten the loop between AI-suggested diffs and their CI (continuous integration) and enforce tests on any AI-generated changes, blocking merges on red builds (something I wrote about recently).
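One low-tech way to enforce that loop is a merge gate that refuses to pass unless the suite is green. The sketch below is an assumption-heavy illustration (it presumes pytest and a git checkout are available in the CI environment), not a prescription for any particular CI system.

```python
# Sketch: a pre-merge gate for AI-generated branches. Run it as a required
# CI step; a nonzero exit code blocks the merge.
import subprocess
import sys

def current_branch() -> str:
    return subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def main() -> int:
    branch = current_branch()
    result = subprocess.run(["pytest", "-q"])  # the same tests humans are held to
    if result.returncode != 0:
        print(f"red build on {branch}: blocking merge")
        return result.returncode
    print(f"green build on {branch}: ok to merge")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```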
All of this points to yet another reminder that we’re nowhere near hitting autopilot mode with genAI. For example, Google DeepMind has been showcasing stronger, long-horizon “thinking” with Gemini 2.5 Deep Think. That matters for developers who need models to chain through multistep logic without constant babysitting. But it doesn’t erase the reliability gap between a leaderboard and your uptime service-level objective.
All that advice is good for code, but there’s also a budget equation involved, as Tomasz Tunguz has argued. It’s easy to forget, but the meter is always running on API calls to frontier models, and a feature that seems brilliant in a demo can become a financial black hole at scale. At the same time, latency-sensitive applications can’t wait for a slow, expensive model like GPT-4 to generate a simple response.
This has given rise to a new class of AI engineering focused on cost-performance optimization. The smartest teams are treating this as a first-class architectural concern, not an afterthought. They’re building intelligent routers or “model cascades” that send simple queries to cheaper, faster models (like Haiku or Gemini Flash), and they’re reserving the expensive, high-horsepower models for complex reasoning tasks. This approach requires robust classification of user intent upfront—a classic engineering problem now applied to LLM orchestration.

Furthermore, teams are moving beyond basic Redis for caching. The new frontier is semantic caching, where systems key the cache on the meaning of a prompt rather than its exact text, allowing them to serve a cached response to semantically similar future queries. This turns cost optimization into a core, disciplined practice.
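A stripped-down sketch of that pattern appears below: a crude intent check routes prompts between a cheap and an expensive model, and a semantic cache short-circuits repeat questions. The model names, the toy embedding, and the 0.9 similarity threshold are all illustrative assumptions.

```python
# Sketch: route by a crude intent check and short-circuit with a semantic cache.
import math

CHEAP_MODEL, EXPENSIVE_MODEL = "fast-small-model", "slow-frontier-model"

def embed(text: str) -> list[float]:
    """Placeholder embedding (letter frequencies); swap in a real embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.entries: list[tuple[list[float], str]] = []
        self.threshold = threshold

    def get(self, prompt: str) -> str | None:
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt: str, answer: str) -> None:
        self.entries.append((embed(prompt), answer))

def choose_model(prompt: str) -> str:
    """Crude intent classifier: long or reasoning-heavy prompts go to the big model."""
    reasoning_markers = ("why", "plan", "compare", "step by step")
    if len(prompt) > 400 or any(m in prompt.lower() for m in reasoning_markers):
        return EXPENSIVE_MODEL
    return CHEAP_MODEL

def answer(prompt: str, cache: SemanticCache) -> str:
    if (hit := cache.get(prompt)) is not None:
        return f"(cache) {hit}"
    model = choose_model(prompt)
    response = f"[{model}] response to: {prompt}"  # stand-in for a real API call
    cache.put(prompt, response)
    return response

cache = SemanticCache()
print(answer("What's our refund window?", cache))
print(answer("What is our refund window?", cache))   # semantically similar: cache hit
print(answer("Compare three rollout plans step by step.", cache))
```

The point isn’t the toy embedding; it’s that routing and caching decisions live in ordinary, testable code rather than inside the prompt.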
A supermassive black hole: Security
And then there’s security, which in the age of generative AI has taken on a surreal new dimension. The same guardrails we put on AI-generated code must be applied to user input, because every prompt should be treated as potentially hostile.
We’re not just talking about traditional vulnerabilities. We’re talking about prompt injection, where a malicious user tricks an LLM into ignoring its instructions and executing hidden commands. This isn’t a theoretical risk; it’s happening, and developers are now grappling with the OWASP Top 10 for Large Language Model Applications.
The solutions are a blend of old and new security hygiene. It means rigorously sandboxing the tools an agent can use and enforcing least privilege. It means implementing strict output validation and, more importantly, intent validation before executing any LLM-generated commands. This isn’t just about sanitizing strings anymore; it’s about building a perimeter around the model’s powerful but dangerously pliable reasoning.
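In code, that perimeter can start as something as plain as an allowlist plus an intent check in front of any LLM-generated command. The allowlist, the forbidden tokens, and the confirm_intent policy below are simplified illustrations, not a complete defense against prompt injection.

```python
# Sketch: an allowlist-plus-intent gate in front of LLM-generated shell commands.
import shlex

ALLOWED_COMMANDS = {"ls", "cat", "grep"}          # least privilege: read-only tools
FORBIDDEN_TOKENS = {"rm", "curl", "sudo", ";", "&&", "|", ">"}

def validate_command(command: str) -> list[str]:
    tokens = shlex.split(command)
    if not tokens:
        raise ValueError("empty command")
    if tokens[0] not in ALLOWED_COMMANDS:
        raise ValueError(f"command not in allowlist: {tokens[0]}")
    if any(tok in FORBIDDEN_TOKENS for tok in tokens):
        raise ValueError("command contains forbidden token")
    return tokens

def confirm_intent(user_request: str, command: str) -> bool:
    """Crude intent check: did the user actually ask for what this command does?"""
    verbs = {"ls": "list", "cat": "read", "grep": "search"}
    action = verbs.get(shlex.split(command)[0], "")
    return action in user_request.lower()

def execute_if_safe(user_request: str, llm_command: str) -> str:
    tokens = validate_command(llm_command)
    if not confirm_intent(user_request, llm_command):
        raise PermissionError("command does not match the user's stated intent")
    return f"would run (sandboxed): {tokens}"   # hand off to a real sandbox here

print(execute_if_safe("Please list the files in this folder", "ls -la"))
```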
Standardization on its way?
One of the quieter wins of the past year has been the continued march of the Model Context Protocol (MCP) and similar efforts toward becoming a standard way to expose tools and data to models. MCP isn’t sexy, but that’s what makes it so useful. It promises common interfaces with fewer glue scripts. In an industry where everything changes daily, the fact that MCP has stuck around for more than a year without being superseded is a quiet feat.
This also gives us a chance to formalize least-privilege access for AI. Treat an agent’s tools like production APIs: Give them scopes, quotas, and audit logs, and require explicit approvals for sensitive actions. Define tight tool contracts and rotate credentials like you would for any other service account. It’s old-school discipline for a new-school problem.
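Sketched in code, that discipline might look like a small gateway that checks scopes, enforces quotas, and writes an audit log before any tool runs. The scope names, quota numbers, and approval flag below are invented examples, not part of MCP itself.

```python
# Sketch: treat an agent's tools like production APIs, with scopes, quotas,
# approvals, and an audit trail.
import time
from dataclasses import dataclass, field

@dataclass
class ToolContract:
    name: str
    required_scope: str
    calls_per_minute: int
    requires_approval: bool = False

@dataclass
class ToolGateway:
    granted_scopes: set[str]
    audit_log: list[dict] = field(default_factory=list)
    _calls: dict[str, list[float]] = field(default_factory=dict)

    def call(self, contract: ToolContract, approved: bool = False, **kwargs):
        now = time.time()
        recent = [t for t in self._calls.get(contract.name, []) if now - t < 60]
        if contract.required_scope not in self.granted_scopes:
            raise PermissionError(f"missing scope: {contract.required_scope}")
        if len(recent) >= contract.calls_per_minute:
            raise RuntimeError(f"quota exceeded for {contract.name}")
        if contract.requires_approval and not approved:
            raise PermissionError(f"{contract.name} needs explicit human approval")
        self._calls[contract.name] = recent + [now]
        self.audit_log.append({"tool": contract.name, "ts": now, "args": kwargs})
        return f"{contract.name} executed with {kwargs}"

read_docs = ToolContract("read_docs", "docs:read", calls_per_minute=30)
send_refund = ToolContract("send_refund", "payments:write", calls_per_minute=2,
                           requires_approval=True)

gateway = ToolGateway(granted_scopes={"docs:read"})
print(gateway.call(read_docs, query="refund policy"))
# gateway.call(send_refund, amount=100)  # raises: missing scope and no approval
```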
In fact, it’s the staid pragmatism of these emerging best practices that points to the larger meta-trend. Whether we’re talking about agent testing, model routing, prompt validation, or tool standardization, the underlying theme is the same: The AI industry is finally getting down to the serious, often unglamorous work of turning dazzling capabilities into durable software. It’s the great professionalization of a once-niche discipline.
The hype cycle will continue to chase after ever-larger context windows and novel reasoning skills, and that’s fine; that’s the science. But the actual business value is being unlocked by teams applying the hard-won lessons from decades of software engineering. They’re treating data like a product, APIs like a contract, security like a prerequisite, and budgets like they’re real. The future of building with AI, it turns out, looks a lot less like a magic show and a lot more like a well-run software project. And that’s where the real money is.
Original link: https://www.infoworld.com/article/4064367/smoothing-out-ais-rough-edges.html
Originally posted: September 29, 2025