Saturday, June 14, 2025

Flying Without a Net: Requirements to Code (via Codex)

 


In my previous blog post about Jira automation and Claude, I created a sample application and was able to investigate the code on GitHub to compare it to the PRD that I wrote.  Based on that analysis, I had Claude write several Jira epics.  One of them was pretty generic about implementing a user authentication and profile system.  Because this functionality is required for almost any application you might write, I decided to start there.


Here is a screenshot of the epic that Claude wrote:



Note that this epic is HUGE.  If a junior PM had written this epic, I would advise them to trim it down a bit.  Separate things like Google OAuth from things like RBAC, for example.  This could be done by making smaller child stories against this epic or by breaking it into multiple epics.


However, for the purpose of this test, I just passed the output of Claude into Codex. Codex is the relatively new GenAI-based coding tool from OpenAI. Here is the summary Codex produced after attempting to implement the epic:

Codex purports to be really good at building code directly from requirements.  In the demos they give, you can just toss complex requirements at Codex and it will do the heavy lifting.  Notice, though, that tossing Codex a random GitHub repo may also mean that it cannot run tests because it doesn’t understand the code base.  In this example, lint failed because of a missing dependency.  Codex didn’t detect and fix that automatically, nor did it suggest a way to fix it.


In addition, as you can see from the output in the example, Codex only focused on part of the epic.  It addressed just the unsafe handling of API keys and such, which Lovable had hard-coded into the source code.  To be fair, this is a pretty important issue and definitely should be addressed in the code as soon as possible, but it’s a very small subset of the epic.  Codex didn’t come back to me and say, “Hey, this epic is way too big, please make it smaller.”


Again and again we see this failure mode in GenAI systems.  They are enthusiastic but not experienced.  If you compare this to people, a very junior dev might just follow instructions, not knowing how bad things are or how epics should be written.  A more senior dev would go back to PM and tell us that we need to break this work down into smaller chunks.  A principal-level dev would just fix the epic themselves and tell us that they fixed it.


Please note that I’m looking at this from the product management perspective.  I won’t evaluate the quality of the code coming out of these systems.  I’m simply investigating how functional they are, the same way I would evaluate any eng partner I work with.


In the end, a feature team only needs two things to be successful: quality and velocity.


If you are delivering on epics at a very high quality and doing so very quickly, almost any other problem can be addressed by PM.  Assuming PM is doing their job, this means that we are building the correct features and that the product is solving problems for the target persona.  The same thing goes for AI.  We know that GenAI-based systems like Codex are much faster than traditional coding methods, but are they executing at high quality?


So far, the answer is no.  They require close human supervision to make them work correctly.  


Going back to our junior employee example, this shouldn’t be surprising.  If you hired a dozen new college grads and let them loose on your code base, what do you think would happen?  Yes, chaos would ensue.  At the moment, the same is true for AI-based toolchains.  You can get them to do an amazing amount of work for you, but you do need to supervise them and monitor their progress to ensure quality work.


GenAI: Eager, fast, well educated.  Not experienced or self-critical.


Why Autonomy Doesn’t Matter (Yet)



As I’ve discussed before, I am not terribly concerned about how autonomous my AI agents are.  Most of what you read online focuses on the autonomy aspect of Agentic AI and I really think that’s the wrong approach.


My background is in enterprise-class software. Specifically, enterprise infrastructure.  I’ve been working on enterprise-grade automation since 1996.  In the end, an agent is simply an automation platform.  You are asking the agent to do work for you.  The advent of LLMs means that there are entire classes of work that computers can do now that we couldn’t dream of in 1996, but the core business problem of ensuring that the computer does the work for you remains.


If you think about any automation project, the first question is always the same.  Will the system be accurate?  That is to say, will it achieve the business result?  


The very first production system I developed and deployed was a system that automated email accounts.  The business result was that everyone who worked for the company had to have a working email address and that email address had to be mapped to the correct server where their mail was provisioned.  Simple to say, but difficult to do for 100,000 people.  Later, I built a system that provisioned Windows Servers at scale.  Automated provisioning wasn’t really a thing back then and we had to build a complete running Windows Server host from bare metal in just an hour.  This used to be manual work.


As a PM, I worked on systems like DRS, which automatically places VMs inside an ESXi cluster, and HashiCorp cloud, which automatically deploys customer environments.


Etc. Etc. Etc.


Over time, technologies change.  The techniques we use change.  But the business goals, the process and the underlying issues remain evergreen.  The system must solve the problem, and it must solve the correct problem at the correct time.  An agent, by implementing a business process, is simply another, more modern, automation platform.  It’s no different in concept than software that deploys servers or places VMs correctly.  Thus, the underlying problems are the same even though the implementation is completely different. 


For a modern LLM-based agent, there are two primary concerns:


  • Context.  The agent must have the correct context.  When solving a business problem for the user, the context of that problem is critical.

  • Accuracy.  If the agent claims to have solved the problem, that problem must be solved for the user a significant percentage of the time (probably 95% or better).
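To make the accuracy point concrete, here is a minimal sketch of the kind of evaluation gate I mean.  It’s written in Python with a hypothetical run_agent function standing in for whatever agent you’re testing; the 95% threshold mirrors the bar above.

from typing import Callable

def passes_accuracy_gate(
    run_agent: Callable[[str], str],
    test_cases: list[tuple[str, str]],   # (input, expected business result)
    threshold: float = 0.95,
) -> bool:
    """Return True only if the agent solves enough known cases to be trusted."""
    solved = sum(
        1 for prompt, expected in test_cases
        if run_agent(prompt).strip() == expected.strip()
    )
    accuracy = solved / len(test_cases)
    print(f"Accuracy: {accuracy:.1%} across {len(test_cases)} cases")
    return accuracy >= threshold

Real evaluators are fuzzier than exact string matches, of course, but the shape is the same: measure the agent against known-good answers before you trust it to act.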


Yes, but what about autonomy?  Does the agent solve problems on its own?


It turns out that autonomy is a byproduct of context and accuracy.  If the agent is very accurate and has the proper context, then you will allow the agent to solve the problem.  However, this only occurs AFTER you have confidence in the accuracy and context of the solution.


Let’s take a hypothetical example.  Let’s say you are running a business and you decide to buy an agent that approves home loans.  The purpose of this software is to evaluate each loan, apply the company’s loan standards and either approve or reject this loan.  There are two vendors who have loan approval agents; you have to decide which one to buy.


  • Company A has a “master agent” loan system that takes each loan and automatically approves or rejects the loan.  You give it a document describing your policies and it takes all further action.

  • Company B has a “loan automation” system that investigates your current process, documents it where necessary and then makes loan recommendations.  Those recommendations can either be manually approved by a loan officer or automatically approved.  The default is manual approval.


Which company do you hire?


Of course, you hire company B.  Company A has too much risk and there is no way to manually intervene.  Company A may have an amazing system, but you don’t know for sure how well it will work in your environment.  On the other hand, Company B allows you to start out manual and then automate later.  Company B also has a way to discover your actual process, which may be different from what’s documented.
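In code terms, Company B’s design is little more than a flag that defaults to “a human decides.”  Here is a minimal sketch with hypothetical names; it assumes nothing about either vendor’s actual implementation.

from dataclasses import dataclass

@dataclass
class LoanRecommendation:
    applicant: str
    decision: str      # "approve" or "reject"
    rationale: str

def process_loan(rec: LoanRecommendation, auto_approve: bool = False) -> str:
    """Default to manual review; automation is opt-in once you trust the system."""
    if auto_approve:
        return f"{rec.decision.upper()} (automatic): {rec.rationale}"
    # Manual mode: surface the recommendation to a loan officer and wait for sign-off.
    return f"PENDING REVIEW: system recommends '{rec.decision}' because {rec.rationale}"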


And here’s the thing.  When I was a vSphere PM working on the DRS feature, we had the EXACT SAME PROBLEM.  When DRS was initially released, we were very confident that the VM placement decisions that the system made were correct.  We had done YEARS of testing and we knew that we were better at placing VMs using this system than when humans placed VMs.  We had papers about this, we had patents—all kinds of stuff.


And what happened?  Customers balked.  They didn’t know what was happening, so they didn’t turn the system on.  So, we led with a “Manual” mode where the system would make recommendations but not actually make changes.  Today, there are actually three modes: “Partial” for initial placement only, “Full” for fully automated placement and “Manual” for recommendations only.  The vast majority of customers start with Manual and most of them eventually move to Full (automated).  DRS today is one of the most widely adopted vSphere features.  vSphere also introduced the idea of VM overrides and host affinity.  This is context that allows the system to make better decisions by letting it know that VM1 and VM2 need to be on the same physical machine or that VM3 cannot be vMotion’d.
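This is not vSphere’s actual API, just a rough Python sketch of the shape of that context: a cluster-wide automation level plus per-VM overrides and affinity rules that the placement engine must respect before it recommends or makes a move.

cluster_context = {
    "automation_level": "manual",            # "manual", "partial", or "full"
    "vm_overrides": {
        "VM3": {"allow_migration": False},   # e.g., VM3 must never be vMotion'd
    },
    "affinity_rules": [
        {"type": "keep_together", "vms": ["VM1", "VM2"]},  # must share a host
    ],
}

def migration_allowed(vm: str, context: dict) -> bool:
    """A placement engine checks per-VM context before even recommending a move."""
    override = context["vm_overrides"].get(vm, {})
    return override.get("allow_migration", True)

def act_on_recommendation(context: dict) -> bool:
    """Only act automatically when the customer has opted into full automation."""
    return context["automation_level"] == "full"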


The details of how vSphere works aren’t terribly important here.  The point is that these types of accuracy and context issues have been around for a very long time.  We can look back at these systems and understand how they used context to improve accuracy and how those two factors led to customer adoption.  It’s easy to think “GenAI changes everything” and just ignore the last thirty years of enterprise automation, but that would probably be a mistake.  We know how to solve these problems; we just need to look at them in the abstract and pay less attention to the implementation details, which change over time.


This takes us to context.


The lesson of the last 30 years of automation is that context is king.  If the system knows what is happening and it knows what’s supposed to happen, the odds are higher that the system will take the correct action.  Yes, context leads to accuracy which leads to autonomy.  This is yet another software virtuous circle.


As you plan your AI agents, think about context.  Does the context that the agent needs exist already in an online system?  Is that context correct?  Are there secret rules that your business actually uses that aren’t written down?  Start there.  If I am a very junior employee and I know nothing, will I do the right thing if I just follow the documentation?  If not, your agents don’t have the correct context and won’t reach the correct result.


Agents Don’t Think—That’s OK


Recently, we have seen quite a bit of discussion online about computers becoming intelligent.  In fact, some people have claimed that the holy grail of AI—the ability of a computer to truly think and reason—is now possible.  This is referred to as “Artificial General Intelligence” (AGI).  Sam Altman (CEO of OpenAI) for example said:


We are now confident we know how to build AGI as we have traditionally understood it.


https://www.forbes.com/sites/johnkoetsier/2025/01/06/openai-ceo-sam-altman-we-know-how-to-build-agi/


However, others are quite doubtful.  Some have said AGI is simply not possible.  Apple recently released a paper pointing out that current LLM-based solutions don’t actually think at all:


https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf


For those of us working in the field, that’s not much of a surprise.  An LLM is simply using a very large statistical model (hence the term Large Language Model) to predict what the next word of its response should be.  Thus, we are seeing an amazing display of math, not thinking.
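If you want a feel for that math: at every step, the model assigns a probability to every possible next token and then picks from that distribution.  Here is a toy sketch in Python; the candidate words and probabilities are made up, as if the prompt so far were “The capital of California is”.

import random

# The model has already scored every candidate next token; these numbers are invented.
next_token_probs = {"Sacramento": 0.80, "Oakland": 0.12, "San": 0.05, "Los": 0.03}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick the next token in proportion to its predicted probability."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(next_token_probs))  # usually "Sacramento", occasionally not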


But, that’s OK.  There is nothing wrong with really amazing math.


Let’s talk about why LLMs are so amazing, and more specifically about why LLM-based agents are so transformational.


They’re amazing because they deal well with ambiguity.  They have enough context to largely figure out what we mean even if we ourselves are not being precise.


For example, let’s say I gave an instruction to a traditionally-coded microservice.  I wanted to find all entries in a database that contained the word “California.”  However, I don’t know if the database uses “California” or “CA” or perhaps even the older “Calif”.  What to do?  Well, what we normally do is list out all the possible values.  So I would say something like:


SELECT * FROM your_table_name WHERE state = 'CA' OR state = 'California' OR state = 'Calif';


Notice that I need to be specific and I need to handle all cases.  You could also write some regex or use some other techniques, but in the end, it’s up to the programmer to deal with the ambiguity. 


But what if I forget one?  What if other people refer to California as Cali?  Then my query would be wrong.


However, for an LLM, this isn’t an issue.  For example, you can use Perplexity to answer this question even if you use “Calif” instead of the more common CA or California in your query:


It’s a trivial example, but notice that even though nobody says “Calif” any more, it understood me and correctly answered.  I could have said CA (although CA also means Canada) or any other variation.  Of course, it only knows what’s in the database.  Notice the error for the A’s who used to play in Oakland but currently play in Sacramento.  LLMs are not perfect; they simply reflect the data they’ve been given.
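If you wanted that same tolerance for messy input inside your own application, the pattern today is to let a model do the normalization rather than enumerating every variant yourself.  Here is a rough sketch, assuming an OpenAI-style chat client; the model name and prompt are placeholders, not a recommendation.

from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-capable LLM works

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def normalize_state(raw_value: str) -> str:
    """Ask the model to map messy state strings ('Calif', 'Cali', 'CA') to one canonical name."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Return only the full US state name for this value: {raw_value}",
        }],
    )
    return response.choices[0].message.content.strip()

# normalize_state("Calif") -> "California" (most of the time; it's still a statistical guess)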


You may not be impressed by this—after all Google has been using similar techniques for many years to make search work.  However, doing this yourself for an internal application used to be really, really hard.  Google had the size to do this, but your four-person feature team?  Probably not.


With LLM-based agents, you get the capability basically for free.  That’s amazing. 


Well, it’s not free.  Nothing in software is free.


While being able to handle ambiguity is amazing, it comes with a cost: uncertainty.


Because the model is guessing at the correct answer, the model can be wrong (as we saw with the previous example).  It could also go down a completely incorrect path.  This is so common that there’s a standard term for it: hallucination.  Hallucinations are very common in LLM-based software, and those of us building agents spend a huge proportion of our time building gates and evaluators to try to avoid these tangents.
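A gate can be as simple as a pre-check that classifies each request before the main agent ever sees it.  Here is a minimal sketch; classify and answer are hypothetical helpers, and in practice they are often just more LLM calls themselves.

REFUSAL = "Sorry, I can only help with insurance questions."

def is_on_topic(question: str, classify) -> bool:
    """classify is a hypothetical helper that returns a topic label such as 'insurance' or 'other'."""
    return classify(question) == "insurance"

def insurance_agent(question: str, classify, answer) -> str:
    # Gate first: refuse anything the classifier flags as off-topic.
    if not is_on_topic(question, classify):
        return REFUSAL
    # Only on-topic questions ever reach the main model.
    return answer(question)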


However, the gates we build to avoid these tangents are not perfect.


My personal test for agent control is the “History of the Peloponnesian War” test.  Whenever I build an agent that’s supposed to do something really specific, I always check by asking it to discuss the history of the war.  Ideally, it should not do this.  It’s been instructed not to, but sometimes LLMs (like puppies and junior PMs) just can’t help themselves.  They get excited and go off track.


So, let’s use an example.  We can create an agent that is specifically instructed to only discuss insurance:


The above example is a custom GPT I instructed to ignore non-insurance topics.  Notice the correct “I can’t help you” response to the Peloponnesian War question.  All good, so far.  But will it remember those instructions all the time?  Will it persist and ONLY answer insurance questions?


Remember that LLMs are smart, but inexperienced.  They’re easy to fool as I demonstrate here:



Whoops.  I did get it to tell me why the war happened by asking a hypothetical.  A human agent would know that I’m just punking them and would probably refuse to answer or just laugh at the silly question.  If you are an insurance company, you probably don’t want your agents giving theoretical advice to soldiers in the Peloponnesian War.  


Once again, we see the issue around managing and coaching.  Just like a junior employee, you need to watch them a bit, give them coaching and help them gain experience in a safe way.


Monday, June 9, 2025

Is Claude Smarter than your Intern?


As part of my investigation into how GenAI will change the way PMs work, I’ve done some experiments with LLMs and their ability to integrate with Jira.  Because Jira has a query-based interface and a native MCP-based integration with Claude, it’s pretty simple to connect Claude to your Jira environment:



Jira is one of the pre-built integrations.  You can see the full list here:


https://support.anthropic.com/en/articles/11176164-pre-built-integrations-using-remote-mcp


I was able to connect Claude to my Jira account pretty easily and after doing so was able to ask it questions about my Jira project.  As an example, I dumped a list of possible future blog posts into the project and then asked Claude to stack rank them based on a business outcome:



Notice that Claude made up specific factors that would “drive readership”. (It also ignored my misspelling of Label and correctly figured out what I meant.)  I could then re-prompt with different criteria, but I think this exercise shows that Claude is capable of using fairly detailed business criteria to make decisions.  So that part is good, and it’s a win for the “quality” check mark that we discussed in my previous blog about agents.  Does Claude deliver answers at a high quality?  Yes.


However, we are still missing context.  The examples above were completely divorced from the real world.  Let’s say I had an actual application and I wanted to compare my possible epics against the code base.  Is that possible?


Well, in theory, yes.  Claude also has a GitHub connector.  I set that up and asked it to review a private repo. This was the result:

Foiled.  It turns out that even though Claude can access my private GDrive, it cannot access private GitHub repos.  That’s pretty disappointing.  I hope the Anthropic team will remedy that.


So, let’s try a public repo.  The AGNTCY ACP repo already has several open issues—could Claude automatically write Jira stories to resolve them?  Using the Jira integration for Claude, I pointed it at that repo:


Aaand no.  When it failed to read the actual issues page, it just made some stuff up.  That’s actually worse than doing nothing.  Back to zero on the correctness scale.  Anthropic folks, if you’re listening, please don’t make stuff up.


Changing my approach, I went back to my original private repo and manually imported the code into Claude using the built-in “Add from GitHub” function.  Strangely, this worked fine even though Claude had refused to read the repo via the chat interface.  As the next step, I asked it to compare the current code to a PRD that I wrote and received this result:



I used this flow to analyze the sample PRD and compare it to the actual running code of a sample app that I built with Lovable.  It did a decent job.


I then had it upload those items into Jira.  It correctly created Jira epics and assigned them based on criteria that I gave it.  


So, interesting.  Overall, I would say that Claude was capable of acting as a very junior PM on my team.  I was able to issue specific instructions around how to write Jira epics and how to prioritize, and it was able to investigate the current code base to compare requirements to the implementation.  The latter is interesting because most junior PMs can’t just look at a repo and understand the code well enough to do that analysis.  So, in some ways, this is superior to the work that a junior PM would do.


Wednesday, June 4, 2025

Avoid AI Regret



We are starting to see the results of early AI experiments in the enterprise and, not surprisingly, they’re not amazing.  A recent study by Orgvue, for example, says that 55% of enterprises regret decisions to replace people with AI.  Of course, by some estimates up to 65% of all IT projects fail, so AI is not actually worse than other technology areas.  It’s not much better either.  Disappointingly, there is no magic in the world.  Every single tech decision—every piece of software, every hardware upgrade, every shiny new AI tool—needs to start with a specific problem, aim for a defined outcome, be measured, and then iterated upon.


This isn't just my soapbox; we've seen the same pattern before with things like cloud adoption. And we're seeing them again with artificial intelligence, including the emerging concept of agentic AI, where software acts on your behalf. The hype is incredible, but the rubber hits the road when you ask: "What is this actually doing for the business?" and “How do we measure success?”


Remember that report from Orgvue on AI and workforce transformation we talked about? It laid out some pretty stark realities that underscore exactly this point. The initial, perhaps overly enthusiastic, dive into AI by some organizations is leading to some hard lessons learned.


Here's what those findings tell us, and they all scream the need for a solid business plan:


  • The Regret of Rushing Redundancies: The report highlighted that a significant chunk of leaders decided AI made some employees redundant, only for more than half of those to regret the decision. This isn't just an HR issue, it's a business failure. It means they didn't fully understand how AI would integrate, what roles were actually impacted, or what the ripple effects on productivity, morale, and institutional knowledge would be. That's a direct consequence of not starting with a clear business benefit and a detailed plan for achieving it. Were those redundancies genuinely necessary to achieve a quantifiable business goal, or were they based on a premature assumption?

  • The Skills Gap Isn't Closing Itself: Despite pouring money into AI, organizations are realizing they don't have the internal skills to make it work effectively. Leaders are boosting training budgets and seeking external help. This proves that deploying the tech is only step one. The business benefit doesn't magically appear; it requires people who know how to leverage the AI to improve workflows, analyze data, or interact with customers. If your AI strategy doesn't include a workforce strategy focused on skill development, you won't capture the value.

  • Lack of Clarity on Impact: Many leaders simply don't have a good grasp of how AI will truly affect their business or specific roles. They can't identify which jobs will benefit most or which jobs are most susceptible to automation. This is particularly true for more complex applications like agentic AI, where many leaders admit they don't know how to implement it effectively. Without understanding the how, you can't define a meaningful business goal or measure success. It's like buying a complex piece of machinery without knowing what product it's supposed to help you make. 


Think about an AI agent. At its core, it's a tool designed to perform specific tasks. Successfully integrating it is akin to hiring a very specialized employee. You wouldn't just hire someone and tell them to "go be productive." You'd give them clear objectives, define their responsibilities, provide training, and set up ways to measure their performance. Does the agent writing first drafts save your team time (a business benefit)? Does the agent managing customer inquiries actually improve satisfaction scores?  Is the quality as good or better than if a human did the work?  How do you know that?  How often have you reviewed and iterated on the solution?


The trends show that AI is being adopted and delivering value in specific areas. We see examples like virtual assistants handling billions of customer interactions, content creation tools being used hundreds of millions or billions of times, and specialized AI software driving significant revenue growth in industries like financial services. These successes likely stem from identifying specific problems these tools can solve and measuring the results across multiple iterations. 


The Orgvue findings are a cautionary tale against the alternative—deploying AI blindly, chasing the hype without a grounded business case. Businesses focused purely on simplistic cost-cutting through premature layoffs, without a deep understanding of AI's role and the necessary workforce adjustments, are encountering regret.


The real, sustainable value from AI, or any technology, comes from strategic integration tied directly to achieving specific business outcomes. Every single technology investment must pass the "So what?" test. So you have a new AI tool? So what does it do for the business? Does it increase revenue? Reduce costs? Improve efficiency? Enhance customer experience? Enable innovation?


Stop treading water with generic deployments that offer little competitive advantage. Instead, focus on identifiable business problems where AI offers a unique solution that traditional software can't provide. Work with small, focused teams, tackle manageable problems, define your desired outcome upfront, and measure everything.


The technology landscape is constantly evolving, with new models and techniques emerging rapidly. Relying solely on rigid, centralized evaluations might mean you miss opportunities. Flexibility, focused experimentation, and quick iteration based on measured business results are key.

And I'll say it one more time, because it's that critical: Start with the business goal and work backwards towards technology. If you don't, you're not investing in the future; you're just setting money on fire.


Lovable Micro Review



Following up from my prior blog post about AI toolchains, I’ve continued the experiment of giving AI a “one-shot” prompt.  As I discussed in my prior blog, this isn’t an attempt to actually make a production-grade SaaS product; it’s a test of the platform’s ability to produce running code somewhat independently.  This means that this is something of a “worst case” example, but it shows the limits of what these platforms can do on their own.

For this attempt, I used Lovable.  At this point, Lovable, Replit and V0 are in a pretty tight market battle as shown by Google Trends data:



Lovable has made big strides over the past six months or so, passing V0 and challenging Replit.  This also matches my personal experience.  I am hearing about Lovable more and more from developers and non-technical founders alike as a tool that they have had success with.  I gave it the same prompt as before, “Build an app that stack ranks issues in a GitHub repo based on user-defined criteria.”


This is what it came up with:

The first prompt was very promising.  Notice that it assumes that I need to enter a GitHub API token and just puts it right there in the UI.  No need to set an environment variable, no hard coding.  Just copy/paste your token.
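For contrast with hard coding, here is roughly the pattern in plain Python.  This is not Lovable’s generated code; it’s just a sketch of the idea: take the token at runtime, keep it out of the source, and rank issues by whatever weights the user supplies.

import requests  # a sketch, not Lovable's generated code

def rank_issues(owner: str, repo: str, token: str, weights: dict[str, float]) -> list[tuple[float, str]]:
    """Fetch open issues and score them with user-defined weights.
    The token is passed in at runtime (e.g., pasted into the UI), never hard-coded."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        params={"state": "open"},
        timeout=30,
    )
    resp.raise_for_status()
    scored = []
    for issue in resp.json():
        # Toy criteria: weight comment count and label matches; real criteria are user-defined.
        score = weights.get("comments", 1.0) * issue.get("comments", 0)
        score += sum(weights.get(label["name"], 0.0) for label in issue.get("labels", []))
        scored.append((score, issue["title"]))
    return sorted(scored, reverse=True)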



It also correctly masks the API token once entered which is great.

And then, SHAZAM!  It just works.


First time, one shot, working app.  Pretty impressive.  


No, the UX isn’t the most amazing UX I’ve ever seen and no, this is not a production-ready application.  However, it actually works and you can click on it.  From a one-shot prompt.


Like V0, it also allows me to sync with GitHub or run it on Lovable’s infra:



So, what does this all mean for PMs?


It means that you need to stop writing complex requirements.  Stop writing PRDs.

Instead, write user stories and go directly to prototype.  Do not mess around with lengthy planning/debating/arguing over theoretical features or products.  Just go build a prototype and discuss that working prototype with the team. 


I cannot tell you how much easier it is to debate a feature or a product when you can actually see the thing running.  Why mess around writing a long complex document?  Just build the thing and play with it.  Your team will probably hate it.  That’s fine.  Take that input and change the prototype.  Fix it.  Iterate.  


Yes, you still need a designer.  There is no way the application above would pass muster with a software company like Cisco.  You will have to design a real UX that looks correct, uses the design language correctly, etc.  You will also want to have a deeper conversation about user journey which I haven’t talked about at all yet.


Yes, you still need engineering.  The application will need to be run in production.  It will need to be secure, it will still need all the things that a proper SaaS application needs.  If you are just fooling around with a passion project, sure, go ahead.  But if you are actually running a business or building a business you’ll need someone to ensure that the app is written correctly, can be supported in production, can scale, is secure, etc. etc. etc.  All the important things that eng does still need to be done.


However.


You don’t need a massive team.


Is there such a thing as a single-person unicorn startup?  No.  Will there be single-person startups in the future?  Yes, I think so.  Can you do this now?  Doesn’t seem like it.


Going forward, I am going to continue in this theme and see how far I get.  Where will I run into walls?  What functions are easily automatable?  What works today?  What is just hype?


Let’s find out.


v0 Micro Review



Following up from my prior blog post about AI toolchains, I’ve continued the experiment of giving AI a “one-shot” prompt.  As I discussed in my prior blog, this isn’t an attempt to actually make a production-grade application; it’s a test of the platform’s ability to produce running code somewhat independently.  This means that this test is something of a “worst case” example, but it shows the limits of what these platforms can do on their own.


This time I used v0 from Vercel.  I used the same one-shot prompt, “Build an app that stack ranks issues in a GitHub repo based on user-defined criteria.”  Unlike Replit, v0 assumed I was building a SaaS application and gave me a login screen:


That’s pretty nice because you’re going to need this with any real app unless you’re running it on your laptop.  So that was a nice surprise that I didn’t ask for.


The actual app is pretty similar to the one that Replit built:


A bit boring but functional.  I would give the nod to Replit here; the Replit UX was more colorful and an overall better visual experience, but v0 is completely competent and rational.


Unlike Replit, v0 figured out on its own that I couldn’t actually use this thing until I connected it to GitHub:



Which is pretty impressive.  Not only did it figure out that it couldn’t talk to GitHub, it suggested how to fix the problem.  Also note that it plumbed in the GitHub call on its own and just said, “Do this to make it actually work.”  Very nice.


However, then this happened:




Yeah, the solution that v0 suggested doesn’t actually work because v0 can’t set the environment variable for whatever reason (and no, this isn’t actually my token).  


As I pointed out in my previous blog, this is not entirely unexpected.  AI-based toolchains often struggle with security and identity issues.  I assume this is because there is so much variability across the industry that they can’t just hard-code the solution.
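For reference, the pattern v0 was reaching for is the standard one: keep the secret out of the source entirely and read it from the environment at runtime.  A minimal sketch; the variable name is just an example, not what v0 actually generated.

import os

# Read the token from the environment instead of pasting it into the source.
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN")
if not GITHUB_TOKEN:
    raise RuntimeError("Set GITHUB_TOKEN before running; never hard-code secrets.")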


Of course Vercel is primarily a hosting company so the next step of their flow is to publish to Vercel and not to create your own code repo to deploy on your own.


That’s what happens if you hit the “Deploy” button.  However, they have a new beta feature to sync to GitHub which is also quite nice.  My preferred option is always to push it to a GitHub repo because then I can fool around with it using whatever other tools I want.  The reality is that you’re not going to get a simple prompt like this to create a real application, so you’ll need to have a code repo (GitHub or other) anyway.


To summarize, so far we have found that getting AI to build you a clickable prototype is pretty easy.  Getting AI to build a working app takes a bit more work.