Beware the Goldfish: Three Major AI Blind Spots
LLMs today are like precocious interns: book-smart, but not context-aware, street-wise, or careful
More and more people have been freaking out lately about how AI is going to destroy white-collar and artistic jobs in the coming years. New articles about this appear almost every day as increasing numbers of people have their first mind-blowing brush with ChatGPT and feel the same mix of wonder, awe, and discomforting fear we all felt the first time we saw a chatbot do something that, until roughly 2020, most of us never imagined it would be capable of in our lifetimes.
How worried should we actually be about this?
The pace of development is undoubtedly unsettling. New tools, applications, and technological advances in AI pop up almost weekly that threaten what many people see as their core competencies. Less than three years ago, AI could barely write coherent paragraphs or render badly-drawn cartoon dogs; today it can generate stunning artwork, deliver complete working code, compose beautiful poetry, and even render full-motion video. It's dizzying for even domain experts to try to keep up with all of these advances, and I should know.
Part of my day job is AI product research and development, specifically technical and production operations applications for LLMs. I've been experimenting with tools like GPT and Midjourney across text, code, and media for years now, and I spend a lot of time thinking about how to integrate AI into existing production workflows. I'm well-versed in prompt engineering and LLM API wrapper end-user product applications, and I have decades of multi-disciplinary experience with some of the common operational challenges AI might be used to solve.
I'm not going to tell you that you shouldn't worry about AI taking your job. With how rapidly new capabilities are developing, that kind of assurance seems more likely to be hubris than not (and whole- or hybrid-AI applications are already putting some competitive pressure on writing and art jobs).
But I will tell you about three key problem areas where AI stumbles today, each of which is a direct barrier to moving from the "gee whiz, what a cool tool" stage to actual enterprise-level workflow replacement.
I. High-Context Problems
When you boil it down, there are basically two things people use LLMs for:
Generating New Content: stories, articles, code, images, videos, music, etc.
Answering Questions: summarizing and interpreting data, structuring plans, providing historical context, recommending approaches or solutions
They've gotten impressively good at both of these things for specialized low-context applications, but they still struggle with high-context problems: issues that require synthesizing specialized knowledge from multiple sources, relevant to your particular work or product use case.
Most white-collar roles involve high-context problems that require regularly synthesizing a lot of unique data, and without that context-awareness it's hard for one-shot generated output to be useful (even with tuning):
Marketing and sales copy need to be written with a strong awareness of your particular user demographics, product functions, and current trends.
Code needs to be written to integrate with the rest of your particular code base, SQL tables and field data, and custom methods.
Art needs to be generated in a style that meshes with the rest of your design aesthetic, for your particular use case.
Longform writing needs to retain awareness of the entire text body that precedes it in order to not wander away from the point (or lose the plot).
Today, the context data that can be provided to inform questions is limited by model prompt sizes, and this has been one of the biggest barriers to making LLMs more broadly useful at work. You can't easily hand an LLM your entire code base and ask it to write a script that instantly fits with the rest of it, just as you can't give an LLM the first three chapters of a novel and ask it to finish it for you. Current model prompt sizes (often only 2,000-8,000 tokens) force you to summarize problems and context when asking questions, when often much more nuanced, high-context information is required for good work output.
This is a well-understood problem that many people are working on solving, and it's ultimately an information compression and processing power issue. ChatGPT, for example, demonstrates a representational summary awareness of your chat history, but it struggles to get specific details from earlier in the conversation right, and it's hard to make an LLM strike the right balance between relying on its training data and referencing new summary information to answer high-context questions the way a skilled human would.
While this space is advancing quickly with increasing prompt sizes and p-tuning methods (32k-token models for GPT-4 have been released), it's still a major limitation for direct application. For reference, a standard-length 240-page novel runs to roughly 60,000-90,000 words, on the order of 100k tokens. Mid-to-large code bases can be significantly bigger, and that's before you add in relevant reports and design information about the problem you're trying to solve. Even at 32k (or 128k) tokens, the problem of how to provide enough context for targeted output that directly meets your business needs is tricky.
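To make that arithmetic concrete, here's a minimal sketch of the budget check this forces on you, written in Python against OpenAI's tiktoken tokenizer library. The project path and the 32k budget are illustrative assumptions, not a recommendation:

```python
# A rough budget check: does this code base even fit in the context window?
# The path and the 32k budget are illustrative assumptions.
from pathlib import Path

import tiktoken  # OpenAI's tokenizer library: pip install tiktoken

CONTEXT_BUDGET = 32_000  # hypothetical 32k-token window
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

def count_tokens(text: str) -> int:
    """Count tokens the way the model's own tokenizer would."""
    return len(enc.encode(text))

# Tally everything we'd want the model to "know" about.
total = sum(
    count_tokens(path.read_text(errors="ignore"))
    for path in Path("my_project/src").rglob("*.py")  # hypothetical project
)

print(f"~{total:,} tokens of source against a {CONTEXT_BUDGET:,}-token budget")
if total > CONTEXT_BUDGET:
    print("Doesn't fit: back to summarizing and cherry-picking context.")
```

Even this crude tally flatters the problem: it counts only source files, before you add the reports, tickets, and design discussions that actually define what "fits with the rest of it" means.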
Additionally, some of the information synthesized in high-context workflows isn't easily provided to an LLM at all. If you're using graphical metrics charts, third-party data tools, and verbal direction to inform your work, as most of us do, that's not something you can easily point a software app at to generate nuanced, high-context prompts, even as prompt sizes increase.
This problem will be solved eventually, but until that works easily and consistently across a wide variety of scenarios, LLMs will function better as a base tool for humans to draw on, cooperate with, or begin their work from than as a source of direct-replacement work for most business applications.
II. Low-Information Problems
Another class of related problems that people commonly run into at work is what I'll call "low-information" problems. These tend to be niche methods or small issues specific to a new domain, where hands-on experience is the best way to answer the question.
These often look like classic "unknown unknowns," where you don't know enough about the problem to even frame the question correctly, and they tend to involve information that isn't present in LLM training data or in any scannable documents a model could theoretically access for tuning. Sometimes they're simple functionality questions that people learning a new tool have, but that are so basic for veterans that nobody bothers writing the answers down.
This is one of the biggest values of actual expertise: you not only know what's technically correct and written down, but you're also familiar with the host of weird day-to-day problems that come up and the solutions that never get written down. The hard-to-explain stuff they don't teach you in school, the stuff that would take a pedantic multi-volume encyclopedia to document, is often what matters most for doing a job well.
A simple example: a friend of mine was struggling with the blockquote function in Substack the other day and couldn't figure out how to end the quote without removing the formatting from the whole paragraph. This is trivial to do, but frustrating to work out if you don't already know how.
Let’s see how ChatGPT 3.5 handles the problem:
Prompt 1: If I'm typing in a blockquote on Substack, how do I end the blockquote and return to normal paragraph styling without pressing the blockquote button?
ChatGPT: If you're typing in a blockquote on Substack and you want to end the blockquote and return to normal paragraph styling without using the blockquote button, you can use HTML tags to achieve this. Here's how:
Type your text within the blockquote, as you normally would.
When you're ready to end the blockquote and switch back to normal paragraph styling, type the closing HTML tag "</blockquote>" on a new line.
After typing the closing tag, press Enter to move to the next line.
Your text should now be outside of the blockquote and formatted as a normal paragraph.
But maybe the issue here is just that I wasn’t specific enough with my prompt? Let’s try again.
Prompt 2: I'm writing an article in Substack and I have a block quote I can't get out of. If I press "Enter" the blockquote just continues, but if I press the blockquote button on the toolbar it removes the quote from the whole paragraph even on my new line. How do I end the quote and keep typing without removing it?
ChatGPT: If you're using Substack and want to end a block quote but keep typing without removing the quote formatting from the entire paragraph, you can follow these steps:
Place your cursor at the end of the block quote within the paragraph.
Press the "Enter" key to create a new line.
On the new line, press the "Backspace" or "Delete" key to remove any unwanted indentation or block quote formatting that may carry over from the previous line.
Start typing your new text.
By creating a new line and removing any extra indentation or formatting, you should be able to continue typing without the block quote style being applied to the entire paragraph.
Much better! And so confident! Unfortunately, it’s also wrong. Those methods don’t work. What you have to do is press Enter a second time on the new line to escape the quoted section. Super simple, unless you don’t already know that.
It’s worth noting that I pitched the same question a few times at GPT-4 and it got it right two out of three times (although the third time it told me this was impossible). Better, but still inconsistent, and undoubtedly frustrating for someone who gets the bad answers while they’re trying to do a very simple thing in a new space.
There are a million examples, both small and much larger, of things like this that we all need to tackle in our jobs every single day, and it's difficult for any LLM to reproduce these capabilities when it has nothing to train on that would let it solve these problems.
III. High-Accuracy Problems
You might have noticed in my last example that GPT not only answered my question wrong, but did so very confidently. This is dangerous in a work setting, and it paradoxically gets worse as the accuracy and reliability of LLM output improves: if a tool gives you correct answers even 95% of the time, the remaining 5% is going to catch you off-guard in potentially very embarrassing ways, precisely because you've learned to trust it.
We're not anywhere near 95% consistent accuracy and reliability yet, especially given how much user prompting skill affects outputs. It looks a lot more reliable than it actually is, and the cautionary "check my work" warnings that OpenAI has started appending to many types of GPT responses do about as much to curb the human tendency to take shortcuts as telling college students not to rely on Wikipedia for their essays.
LLMs not only routinely provide bad instructions or over-confident solutions that don’t fit the problem—they also still make things up as part of their solutions. This recently burned a careless lawyer badly:
Roberto Mata sued Avianca airlines for injuries he says he sustained from a serving cart while on the airline in 2019, claiming negligence by an employee. Steven Schwartz, an attorney with Levidow, Levidow & Oberman and licensed in New York for over three decades, handled Mata’s representation.
But at least six of the submitted cases by Schwartz as research for a brief “appear to be bogus judicial decisions with bogus quotes and bogus internal citations,” said Judge Kevin Castel of the Southern District of New York in an order.
The fake cases source? ChatGPT.
GPT, while extremely useful, still sends me in the wrong direction or gives me bad information all the time. We all like the magic of having someone else hand us an answer to the problem we're working on, but the natural human inclination is to skip the critical step of actually validating that answer. That makes an over-confident assistant who is usually, but not always, right incredibly challenging to bring into a workplace setting.
Human nature is such that we feel responsible for the answers we provide. If someone else provides those answers, it’s all too easy to take them at face value, especially in job environments where there’s a lot of time pressure to deliver. “I gave you bad information,” is always an uncomfortable message to send someone, but it’s a lot easier for most people to be careless with that if they’re reliant on a tool that their job is telling them to use and they have a convenient scapegoat to blame.
This "trust but verify" validation step is necessary, but in some cases it can end up creating more work and being less useful than simply solving the problem via more conventional methods. This is especially true in software development: if GPT is helping you code something complicated and uses a pattern you're unfamiliar with that doesn't work (or worse, introduces subtle, hard-to-troubleshoot bugs), it can take hours of debugging to realize the whole approach was wrong.
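To illustrate what even a lightweight validation control can look like, here's a minimal Python sketch of a gate that refuses LLM-suggested code until it passes cases with known-correct answers. Everything here is an illustrative assumption: llm_generated_slugify stands in for whatever function the model handed you, and the test table is the kind of thing only someone with domain knowledge can write.

```python
# A minimal "trust but verify" gate: don't accept LLM-generated code
# until it passes tests you already know the answers to.

def llm_generated_slugify(title: str) -> str:
    """Stand-in for a function an LLM handed you (hypothetical)."""
    return title.lower().replace(" ", "-")

# Cases with known-correct answers, written by a human who knows the domain.
KNOWN_CASES = [
    ("Hello World", "hello-world"),
    ("  Leading spaces", "leading-spaces"),  # the LLM version fails this one
    ("Mixed   Spacing", "mixed-spacing"),    # ...and this one
]

def verify(candidate, cases) -> bool:
    """Run the candidate against known cases; reject on the first miss."""
    for arg, expected in cases:
        got = candidate(arg)
        if got != expected:
            print(f"REJECTED: {arg!r} -> {got!r}, expected {expected!r}")
            return False
    return True

if verify(llm_generated_slugify, KNOWN_CASES):
    print("Accepted: safe to wire into the workflow.")
else:
    print("Back to the human: the confident answer failed its tests.")
```

Notice the catch: writing good known cases requires exactly the domain knowledge and time the tool was supposed to save, which is how the verification step can quietly eat the productivity gain.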
This is another flavor of a high-context problem, where you need to know enough about what you’re being given and have enough time to be able to assess whether it’s correct or not. There are cases where LLMs can dramatically speed up your workflow, but over-reliance on their output can be both personally and financially damaging for you and your business. This makes them dangerous to operationalize too heavily without tight validation controls, especially for low-skill workers or people asking questions about problems outside of their domain expertise (which of course is one of the cases where you most want an LLM to step in and help you get answers).
Unfortunately, output validation controls are still in their infancy. Because this is a new class of problem and there are so many possible use cases and edge cases to think about, it's going to be a while before we have the kind of LLM output we can confidently drop into most business settings without oversight. This is especially true in fields where it really matters that the guidance is correct and accurate: law, medicine, finance, education, and… well, most industries, to be honest. Accuracy almost always matters in operations.
Human nature is fighting against us here, and people are right to be cautious about slotting too much LLM-reliance into key business processes until we crack this problem.
IV. In Summary: Be Cautious of the Goldfish
When I think about practical business applications for LLMs today, dozens of potential examples come to mind. They're quite obviously useful for any number of brainstorming, templating, ideation, bootstrapping, and summarizing tasks. They can be great for pointing you in the right direction, helping you understand a document, or speeding up a complex content generation task. For individuals who have the skill to check the output and are good with prompting, they can be a massive productivity accelerator.
But until these practical problems of operational integration have clear, consistently reliable methods to provide more context, access the right context, and validate the output, businesses will struggle to integrate LLMs directly into their core work functions at scale. Most jobs, even ones accelerated by AI, are likely secure in the near-term 5-10 year window. Change takes time, and trust takes even longer to build, even in an industry advancing as rapidly as AI.
Despite the advancements of the last few years, AI still practically operates like an unmedicated, ADHD-infovore intern fresh out of college who knows a lot less than they think they do: lots of information and impressive speed and energy, but sloppy on reliability, accuracy, and domain knowledge; smart but not wise, overconfident in its answers, and demonstrating the approximate working memory of a goldfish.
You might feel comfortable giving this person a large variety of possible tasks, but you’d make sure you had someone more responsible in charge of their work output when it mattered. That’s exactly why I don’t anticipate near-term widespread job elimination, even as we see more and more AI tooling making its way into our work lives.
Don’t breathe a sigh of relief, go back to your day job, and rest on your laurels, though. These problems do have solutions, and AI will continue to advance, possibly even faster than I’m anticipating here. If there’s one thing I’ve learned about this space, it’s that you should expect to be surprised. Yesterday’s intern is today’s manager and tomorrow’s CEO, after all.
As with the commercial boom of the internet two decades ago, AI is going to dramatically overhaul the way we work and the way we relate to our work in the coming decades, and adjustments will be required from everyone. The legality and ownership of training content are also major open questions that I haven't even touched on here, but most businesses see the obvious potential for more efficient workstreams and faster development of new products.
It’s just going to take a while for us to collectively figure out how to actually do that in a way that’s consistent, ethical, safe, and reliable.