Just slap an LLM on the front end—how hard can it be?

My last post mentioned one of my favorite new-to-me bloggers of late: Ed Zitron, who’s staked out a position for himself as the AI skeptic’s AI skeptic. His recent hits include “The Generative AI Con”, “OpenAI is a Bad Business”, and “Sam Altman is Full of Shit”. Even without clicking any of those links, you’ve probably picked up what Ed is laying down.

Zitron certainly isn’t alone in his pessimism, but he’s an outlier among most of the tech pundits and podcasters I follow and listen to. Even as everyone seems to recognize that what Microsoft and Apple and Google are furiously cramming into every available inch of UI real estate isn’t particularly good, there’s an omnipresent vibe of “yeah, but it will get better” and “they’re still working out the kinks”.

And, sure, in the general case, that’s the story of all successful technologies. Lots of them start out kind of sucky and get better over time—sometimes so much better, you couldn’t have reasonably predicted how good they’d get from where they began. This is Amara’s Law:

We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.

Now let me introduce you to Watts’s Corollary to Amara’s Law:

If the technology is fundamentally incapable of reliably doing what you’re promising, it doesn’t matter how long the damn run is.

Oh, God, is he talking about “hallucination” again? Sorry, yes. Yes, I am. To quote from one of Zitron’s earlier pieces, “Godot Isn’t Coming”:

Separate to any lack of a core value proposition, training data drought, or unsustainable economics, generative AI is a dead end due to the limitations of probabilistic models that hallucinate, where they authoritatively state things that aren’t true. The hallucination problem is one that is nowhere closer to being solved—and, at least with the current technology—may never go away.

As I explained last year, LLMs “reason” by performing lexical analysis and interpolation between words represented as points in multidimensional vector space, and deriving how words commonly relate to one another is not the same as deriving what words mean. LLMs are amazing at creating technically original, syntactically correct sentences, but a good text adventure game from the 1980s understands locks, containers, light sources, being open or closed, and so on at a symbolic level that the most powerful OpenAI model does not. The current trend of “reasoning models” that essentially ask LLMs to check their own work doesn’t address this concern, because symbolic logic is fundamentally different. You can’t get there just by LLM-ing harder and faster.
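To make that distinction concrete, here’s a deliberately silly toy contrast of my own. It’s an illustration of the idea, not a claim about how any real model (or any real game engine) is built: the vector half can tell you that “key” and “lock” live near each other, but only the symbolic half knows what a key is for.

```python
import math

# Hand-made toy "embeddings"; relatedness here is nothing but geometry.
# (Numbers invented for illustration. Real models learn thousands of dimensions.)
vectors = {
    "key":    [0.9, 0.1, 0.3],
    "lock":   [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vectors["key"], vectors["lock"]))    # high: the words hang out together
print(cosine(vectors["key"], vectors["banana"]))  # low: they don't

# A 1980s-style symbolic world model: explicit state and rules, not statistics.
world = {"chest": {"locked": True, "open": False}, "inventory": ["brass key"]}

def open_chest(world):
    chest = world["chest"]
    if chest["locked"] and "brass key" not in world["inventory"]:
        return "The chest is locked."
    chest["locked"] = False   # here, the key actually *means* something
    chest["open"] = True
    return "You open the chest."

print(open_chest(world))
```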

But here’s an idea: use LLMs not for generating final output, but as front ends to other systems. LLMs are arguably the best technology we have now for natural language processing (NLP), and it’s not just an incremental improvement over what came before; it’s a quantum leap. I can easily imagine things I’d put this to use for right now. Natural language queries that search my own projects (“what color are Autumn’s eyes”) are obvious, but it wouldn’t be a stretch to imagine this being a front end for automating actions that touch existing pieces in Apple’s ecosystem.
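Here’s the shape of what I mean, as a hand-wavy sketch rather than anyone’s actual API; the call_llm stub, the notes index, and the eye color are all stand-ins I’m inventing for illustration. The model’s only job is to turn the question into a structured query, and boring deterministic code does the lookup:

```python
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns what we'd hope the LLM produces.
    return '{"character": "Autumn", "attribute": "eye color"}'

# Built ahead of time from my own project files by ordinary, deterministic indexing.
NOTES_INDEX = {
    ("Autumn", "eye color"): "green",   # value invented for the example
}

def answer(question: str) -> str:
    prompt = ("Turn this question into JSON with keys 'character' and 'attribute'.\n"
              f"Question: {question}")
    query = json.loads(call_llm(prompt))
    key = (query.get("character"), query.get("attribute"))
    if key not in NOTES_INDEX:
        # The model can still hallucinate a query for something that doesn't exist,
        # so the deterministic side has to be the part that says no.
        return "I couldn't find that in your notes."
    return NOTES_INDEX[key]

print(answer("what color are Autumn's eyes"))
```

The appeal is that the part doing the answering never makes anything up; the catch is that the part doing the asking still can.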

And, behold! The most interesting parts of “Apple Intelligence” promised last year are exactly that, on steroids. Imagine the ability to say “give me directions to the restaurant I’m meeting Agatha at for lunch and text her I’m on my way” and have Siri understand that it needs to find Agatha’s contact information, search recent text messages and email messages from her until it finds one that mentions a restaurant and a lunch meeting, feed that restaurant into Apple Maps for driving directions, and send Agatha a text message. Amazing!

Well, keep imagining it, bucko, because you can’t do it yet, and you may not be able to do it for a long, long time. Apple recently announced that “personalized Siri” won’t be ready “until the coming year,” i.e., 2026. John Gruber of Daring Fireball is genuinely angry at himself for not seeing this coming:

What Apple showed regarding the upcoming “personalized Siri” at WWDC was not a demo. It was a concept video. Concept videos are bullshit, and a sign of a company in disarray, if not crisis.

He’s right, and I had to think for a moment about why I wasn’t particularly angry: because I wasn’t particularly surprised. (By the delay. Apple being in disarray—which I also agree with—is another post.)

“LLM as front-end to other systems”, it turns out, is way more work than “LLM does everything”. In the front-end model, the work gets done by the non-LLM parts of the system. So far, so good, at least in theory. But not only are hallucinations still a problem in this scenario, they’re now a compounding problem.

When the LLM takes user input and spits out, say, JSON API calls on the back end, nothing stops it from hallucinating incorrect parameter values, or even imaginary parameters. In certain circumstances, this may be fine. But in the examples Apple promised—ones that take the user’s input, break it down into multiple steps, possibly generate hidden intermediary prompts, and end in actions like “give me driving directions to where you’ve deduced I mean”—all of the steps have to be correctly inferred. Also, any auto-generated internal prompt offers another chance for a hallucination. If any one of those steps produces a bad answer, the whole chain will go off the rails dramatically—or worse, go off the rails subtly. Being less wrong is not good enough; the end result has to be correct, which means every step does, too.
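To put a made-up number on it: suppose each inferred step is independently right 95% of the time, a figure I’m inventing purely for illustration. Chain five of those steps together and the arithmetic stops being friendly:

```python
# Back-of-the-envelope only: the per-step accuracy is an assumption, not a
# measurement of any real model or assistant.
per_step = 0.95
steps = 5   # e.g., parse intent, find contact, search messages, pick restaurant, send text
print(per_step ** steps)   # about 0.77, so roughly one request in four goes wrong somewhere
```

And that’s the generous version, where the steps fail independently instead of one bad inference quietly poisoning everything downstream.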

So, we don’t have that yet. The parts of Apple Intelligence we have are the parts that create anodyne, occasionally misleading summaries of an email thread, rewrite a business letter so it has all the personality of cafeteria mashed potatoes, and synthesize images of eldritch horror in the style of a Memoji. You remember the old Steve Jobs mantra of over-promise and under-deliver, don’t you? Wait, is that how it goes? Let’s ask Siri.

“But Amazon’s recently announced Alexa Plus can do all this!” No, it can’t. They were explicit that it won’t be able to do everything it promised when it starts shipping, either. Amazon’s Head of Devices and Services says, “The engineering team had to hook [Alexa’s] API to the LLM and make sure it can do it all without hallucinating”, and I’m going to make two predictions here: one, they haven’t actually gotten that to work yet; two, they’re going to discover that, no matter how hard they think it’s going to be to make it absolutely completely without a doubt work, they are still underestimating the problem. And while Google’s “Gemini with Personalization” is being touted by some as Google beating Apple to the punch, it can’t do what Personalized Siri promises, either (it can utilize your Google search history “to provide relevant and personalized responses”, but right now, that’s it).

Here’s the question we’re dancing around: what if it’s impossible to get LLMs to do this reliably? So far, Ed “Pissing on Sam Altman’s Parade” Zitron’s critiques have mostly been right. There are specific, constrained use cases that I know people insist generative AI helps them with, and even when I have my doubts, I’m not going to insist they’re wrong. But all uses of LLMs are so piggishly resource-intensive that OpenAI, Anthropic, and friends are poster children for the “we lose money on everything we sell, but we’ll make it up in volume” joke. All major LLM training corpuses incorporate copyrighted material that isn’t licensed for free commercial reuse, and everyone’s just gambling that courts won’t tell them they have to pay up.

Yet those issues, we can at least pay lip service to addressing. The issue of what if you just can’t get there from here, the possibility that the entire computing industry is frantically shoveling money into a technology that might be less “next Internet” than “next Minidisc”, is just too much to even contemplate. So we’re not.

I know the common wisdom is that the functionality Apple described with Super Siri will inevitably be on not just Apple’s platforms but Android and Windows, too; I’m not so sanguine. Maybe there’s a workable solution involving giving the LLMs as little to do as possible—i.e., letting them handle the NLP and only the NLP, passing the rest off to a fully deterministic “AI” like Bixby’s planner. (You laugh, but parts of Bixby—mostly, the parts that were Viv—were way ahead of their time.) There’s a rough sketch of that division of labor at the end of this post. But I’m not positive even that will do it. Hallucinations will still happen. Errors will still accumulate in multi-step actions. And when Super Siri sends you to meet Agatha at the Red Robin in San Bruno instead of Birdsong in San Francisco, you will have a lot of time to contemplate the difference between “less wrong” and “correct” as you sit, sad and alone, with your barbecue bacon cheeseburger.
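Postscript, for the terminally curious: this is roughly the division of labor I mean by “the NLP and only the NLP.” Every name in it is hypothetical, it’s a sketch rather than anyone’s shipping architecture, and the rejection branches are exactly where I’d expect hallucinated parameters to pile up:

```python
import json

# The model's only job: map an utterance onto one of a fixed set of intents.
# Everything it emits gets checked against a strict schema before any
# deterministic planner (contacts lookup, Maps, Messages) is allowed to act.
ALLOWED_INTENTS = {
    "directions":   {"destination"},
    "send_message": {"recipient", "body"},
}

def call_llm(prompt: str) -> str:
    # Stand-in for the real model call.
    return '{"intent": "directions", "slots": {"destination": "Birdsong, San Francisco"}}'

def parse_intent(utterance: str) -> dict:
    raw = json.loads(call_llm(f"Extract the intent from: {utterance}"))
    intent, slots = raw.get("intent"), raw.get("slots", {})
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")                 # hallucinated action: refuse
    if set(slots) != ALLOWED_INTENTS[intent]:
        raise ValueError(f"bad slots for {intent}: {sorted(slots)}")    # hallucinated parameters: refuse
    return {"intent": intent, "slots": slots}

print(parse_intent("give me directions to the restaurant I'm meeting Agatha at"))
```

The planner never sees anything that didn’t survive validation, which is the whole point; whether that’s enough to keep you out of the Red Robin in San Bruno is the part I’m not sure anyone can promise yet.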
