Thursday, June 26, 2025

Sink or swim: Making the call



I’ve been getting requests lately for guidance on how to become a product management leader.  While there are tons of folks out there who talk about how to be a product manager (PM), there are relatively few people talking about PM leadership and the difference between good PM organizations and great PM organizations.


I’ve been in the software business since 1996 and I’ve been a PM for over 12 years, including my current job as Senior Director of PM.  When joining a new PM organization, I focus on a few things:


  1. Does the organization ship the right thing?  The PM team needs to be focused on the what and the why.  What are we going to build and why?  When I look at the current roadmap, I want to know why they have that roadmap.  Asking the why question will tell you if they have a good decision-making process.

  2. Is the organization growing?  I don’t mean adding headcount.  I mean are they learning from their experience?  If they make a mistake, is that mistake recognized, addressed and corrected?  In many floundering organizations you will see failures covered up or blame games being played.  Neither of these things is healthy.

  3. Are they focused on the customer?  When decisions get made, do they focus on internal issues or are they focused on the customer?  In my experience, an amazingly high number of product decisions get made inside the company with little reference to the actual customer.

  4. Are they data driven?  This is related to #3, but when they make decisions, are these decisions backed up by good data?  If you think you know what’s going on but don’t bother to measure it, you’re probably wrong.


So, you’ve joined a new team and you’re not happy with where they are.  Where do you start?


At the top of the list, of course.


In the end, a PM organization is a decision-making organization.  Unlike engineering, PMs don’t need to focus on delivery.  Yes, we are involved in delivery, but no, we are not the ones writing the code, creating the marketing copy, etc.  Our job is to make the call.  We sink or swim based on our ability to make good decisions as a team.  We do this in big ways by deciding to take on new products, but we also do this in small ways by tweaking user stories to help get a feature out the door.  The daily decisions involving small trade-offs are at least as important as the big “let’s build this new thing” decisions.


An unhealthy organization makes poor decisions.  Those poor decisions are violently defended because poor organizations punish people for being wrong.  Thus, even if they are wrong, they’ll claim they’re right.  


When my daughter was small, we knew that she was REALLY tired and ready for bed when she loudly proclaimed she was not tired.  The more strenuous her denial, the more likely that she was overtired and should have been put to bed already.  


Same thing with product teams.  Ask any product team, “Why did you make that decision?” and you can tell an amazing amount about them just by the tone of the answer.  Are they defensive?  Bad sign.  Do they have specific evidence?  That’s a great sign.  Are they highly introspective and self-critical?  Even better.


In the vast majority of organizations I have worked for, specific decision criteria were rare.  What I mean is that you should always know the basis for a decision, and those criteria should be openly discussed in advance.  I am amazed at how often people don’t actually know how a decision will be made.  I have sat in countless meetings debating an idea only to find out that nobody agrees on how we will decide.  What happens is that everyone in the room states their opinion.  However, since we don’t know the basis for the decision, all those opinions are meaningless.


Try asking things like, “If we build this feature, what is different six months from now compared to today?”  Or “How will we know next quarter that this was the correct decision today?”


Usually, the team can’t answer questions like this because they haven’t really thought things through, and they don’t really know why they are building what they are building.  It’s your job as a PM leader to focus on these “why” questions.  Why are we doing this?  How do we measure success?  


Here are some quick techniques you can use to help get the team focused on the correct decision criteria:


  1. Focus on measures.  Any claimed benefit must have a measurable result.  If a team member says, “Customers want this feature,” always ask, “How do we measure that desire?”

  2. Focus on outcomes.  Teams can get caught up in things like stack ranks, feature lists, and bugs, but it’s the outcome that matters.  Questions like, “What will be different next quarter if we do this?” help drive the team to focus on positive outcomes.  For example, I would expect an answer like “we expect workflow abandonment to drop from 30% to 10% as a result of this change” or something similar.  Specific, customer-focused and measurable is the goal here.

  3. Encourage bad news.  Bad news travels fast in healthy organizations.  If someone brings you bad news, don’t shoot the messenger.  You want to encourage them to come to you.  “Thanks for bringing that up.  It sounds important; let’s get into that detail in our one-on-one” is a great response.  Don’t let the meeting rat-hole but recognize that the bearer of bad news is trying to help.  For example, when customer usage of a feature is super low, that’s bad news and you have to talk about it.  But if your PM messed up and wrote the requirements wrong, you don’t want to discuss their failure in a team meeting.  Corrective feedback to your PM is best in a 1:1 setting. 

  4. Hold them accountable.  I tend to be a “praise publicly, critique privately” kind of manager.  This means that during one-on-ones, you should be discussing outcomes you expect and giving direct feedback when you’re not getting that outcome.  I don’t advise dressing down staff in a large meeting—it tends to make people defensive.  The tone you should set in a group meeting is “how are we going to achieve this goal?”  Focus on the team.

  5. Don’t manage the product.  You are no longer a product manager.  The product managers who work for you need to manage the product.  It’s tempting to just change the stack rank or talk to engineering yourself, but that’s almost always the wrong answer.  Work with the team, give them coaching, but don’t do the work yourself.


In the end, it’s your job as a PM leader to build a good team and let them do their job.  If the team is making good decisions, you have succeeded in your most important task as a PM leader.


Saturday, June 14, 2025

Flying Without a Net: Requirements to Code (via Codex)

 


In my previous blog post about Jira automation and Claude, I created a sample application and was able to investigate the code on GitHub to compare it to the PRD that I wrote.  Based on that analysis, I had Claude write several Jira epics.  One of them was pretty generic about implementing a user authentication and profile system.  Because this functionality is required for almost any application you might write, I decided to start there.


Here is a screenshot of the epic that Claude wrote:



Note that this epic is HUGE.  If a junior PM had written this epic, I would have advised them to trim it down a bit.  Pull out Google OAuth from RBAC, for example.  This could be done by making smaller child stories against this epic or by breaking it into multiple epics.


However, for the purpose of this test, I just passed the output of Claude into Codex. Codex is the relatively new GenAI-based code tool from OpenAI. Here is the summary Codex produced after attempting to implement the epic:

Codex purports to be really good at building code directly from requirements.  In the demos they give, you can just toss complex requirements at Codex and it will do the heavy lifting.  Notice, though, that tossing Codex a random GitHub repo may also mean that it cannot run tests because it doesn’t understand the code base.  In this example, lint failed because of a missing dependency.  Codex didn’t detect and fix that automatically, nor did it suggest a way to fix it.


In addition, as you can see from the output in the example, Codex only focused on part of the epic.  It’s addressing just the unsafe handling of API keys and such, which Lovable hard-coded into the source code.  To be fair, this is a pretty important issue and definitely should be addressed in the code as soon as possible, but it’s a very small subset of the epic.  Codex didn’t come back to me and say, “Hey, this epic is way too big, please make it smaller.”


Again and again we see this failure mode in GenAI systems.  They are enthusiastic but not experienced.  If you compare this to people, a very junior dev might just follow instructions, not knowing how bad things are or how epics should be written.  A more senior dev would go back to PM and tell us that we need to break this work down into smaller chunks.  A principal-level dev would just fix the epic themselves and tell us that they fixed it.


Please note that I’m looking at this from the product management perspective.  I won’t evaluate the quality of the code coming out of these systems; I’m just investigating how functional they are, the same way I would evaluate any eng partner I work with.


In the end, a feature team only needs two things to be successful: quality and velocity.


If you are delivering on epics at a very high quality and doing so very quickly, almost any other problem can be addressed by PM.  Assuming PM is doing their job, this means that we are building the correct features and that the product is solving problems for the target persona.  The same thing goes for AI.  We know that GenAI-based systems like Codex are much faster than traditional coding methods, but are they executing at high quality?


So far, the answer is no.  They require close human supervision to make them work correctly.  


Going back to our junior employee example, this shouldn’t be surprising.  If you hired a dozen new college grads and let them loose on your code base, what do you think would happen?  Yes, chaos would ensue.  At the moment, the same is true for AI-based toolchains.  You can get them to do an amazing amount of work for you, but you do need to supervise them and monitor their progress to ensure quality work.


GenAI: Eager, fast, well educated.  Not experienced or self-critical.


Why Autonomy Doesn’t Matter (Yet)



As I’ve discussed before, I am not terribly concerned about how autonomous my AI agents are.  Most of what you read online focuses on the autonomy aspect of Agentic AI and I really think that’s the wrong approach.


My background is in enterprise-class software. Specifically, enterprise infrastructure.  I’ve been working on enterprise-grade automation since 1996.  In the end, an agent is simply an automation platform.  You are asking the agent to do work for you.  The advent of LLMs means that there are entire classes of work that computers can do now that we couldn’t dream of in 1996, but the core business problem of ensuring that the computer does the work for you remains.


If you think about any automation project, the first question is always the same.  Will the system be accurate?  That is to say, will it achieve the business result?  


The very first production system I developed and deployed was a system that automated email accounts.  The business result was that everyone who worked for the company had to have a working email address and that email address had to be mapped to the correct server where their mail was provisioned.  Simple to say, but difficult to do for 100,000 people.  Later, I built a system that provisioned Windows Servers at scale.  Automated provisioning wasn’t really a thing back then and we had to build a complete running Windows Server host from bare metal in just an hour.  This used to be manual work.


As a PM, I worked on systems like DRS, which automatically places VMs inside an ESXi cluster, and HashiCorp cloud, which automatically deploys customer environments.


Etc. Etc. Etc.


Over time, technologies change.  The techniques we use change.  But the business goals, the process and the underlying issues remain evergreen.  The system must solve the problem, and it must solve the correct problem at the correct time.  An agent, by implementing a business process, is simply another, more modern, automation platform.  It’s no different in concept than software that deploys servers or places VMs correctly.  Thus, the underlying problems are the same even though the implementation is completely different. 


For a modern LLM-based agent, there are two primary concerns:


  • Context.  The agent must have the correct context.  When solving a business problem for the user, the context of that problem is critical.

  • Accuracy.  If the agent claims to have solved the problem, that problem must be solved for the user a significant percentage of the time (probably 95% or better).  A minimal sketch of measuring this follows below.
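

To make that accuracy bar concrete, here is a minimal sketch of the kind of evaluation gate you might run, assuming you keep a labeled set of test requests and can call the agent programmatically.  The TestCase structure, the run_agent() stand-in and the 95% threshold are all placeholders for whatever your agent and your business actually require.

# Minimal sketch of an accuracy gate before granting an agent more autonomy.
# TestCase, run_agent() and the threshold are placeholders, not a real harness.
from dataclasses import dataclass


@dataclass
class TestCase:
    request: str   # what the user asks the agent to do
    expected: str  # the outcome a human expert would accept


def run_agent(request: str) -> str:
    """Stand-in for however you actually invoke the agent (API, MCP tool, etc.)."""
    raise NotImplementedError


def accuracy(cases: list[TestCase]) -> float:
    passed = sum(1 for case in cases if run_agent(case.request) == case.expected)
    return passed / len(cases)


THRESHOLD = 0.95  # the "95% or better" bar discussed above


def ready_for_more_autonomy(cases: list[TestCase]) -> bool:
    return accuracy(cases) >= THRESHOLD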


Yes, but what about autonomy?  Does the agent solve problems on its own?


It turns out that autonomy is a byproduct of context and accuracy.  If the agent is very accurate and has the proper context, then you will allow the agent to solve the problem.  However, this only occurs AFTER you have confidence in the accuracy and context of the solution.


Let’s take a hypothetical example.  Let’s say you are running a business and you decide to buy an agent that approves home loans.  The purpose of this software is to evaluate each loan, apply the company’s loan standards and either approve or reject this loan.  There are two vendors who have loan approval agents; you have to decide which one to buy.


  • Company A has a “master agent” loan system that takes each loan and automatically approves or rejects the loan.  You give it a document describing your policies and it takes all further action.

  • Company B has a “loan automation” system that investigates your current process, documents it where necessary and then makes loan recommendations.  Those recommendations can either be manually approved by a loan officer or automatically approved.  The default is manual approval.


Which company do you hire?


Of course, you hire company B.  Company A has too much risk and there is no way to manually intervene.  Company A may have an amazing system, but you don’t know for sure how well it will work in your environment.  On the other hand, Company B allows you to start out manual and then automate later.  Company B also has a way to discover your process which may be different than what’s actually documented.
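

If you squint, Company B’s design reduces to something like the sketch below: the agent always produces a recommendation, and an automation mode, defaulting to manual, decides whether a loan officer has to sign off.  Everything here (the LoanApplication fields, the policy check, the thresholds) is a hypothetical placeholder, not a real underwriting system.

# Rough sketch of Company B's "recommend first, automate later" design.
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MANUAL = "manual"  # default: a loan officer reviews every recommendation
    AUTO = "auto"      # enabled only after you trust the agent's accuracy and context


@dataclass
class LoanApplication:
    applicant_id: str
    amount: float
    credit_score: int


@dataclass
class Recommendation:
    approve: bool
    reason: str


def evaluate_policy(app: LoanApplication) -> Recommendation:
    """Stand-in for the agent applying the company's documented loan standards."""
    ok = app.credit_score >= 680 and app.amount <= 500_000
    return Recommendation(ok, "meets documented policy" if ok else "outside documented policy")


def process(app: LoanApplication, mode: Mode = Mode.MANUAL) -> str:
    rec = evaluate_policy(app)
    if mode is Mode.MANUAL:
        # The agent recommends; a human makes the call.
        return f"Queued for officer review: approve={rec.approve} ({rec.reason})"
    return f"Auto-{'approved' if rec.approve else 'rejected'}: {rec.reason}"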


And here’s the thing.  When I was a vSphere PM working on the DRS feature, we had the EXACT SAME PROBLEM.  When DRS was initially released, we were very confident that the VM placement decisions that the system made were correct.  We had done YEARS of testing and we knew that we were better at placing VMs using this system than when humans placed VMs.  We had papers about this, we had patents—all kinds of stuff.


And what happened?  Customers balked.  They didn’t know what was happening, so they didn’t turn the system on.  So, we always led with a “Manual” mode where the system would make recommendations but not actually make changes.  Today, there are actually three modes: “Partial” for initial placement only, “Full” for complete automated placement and “Manual” for recommendations only.  The vast majority of customers start with Manual and most of them eventually move to Full (automated).  DRS today is one of the most widely adopted vSphere features.  vSphere also introduced the idea of VM overrides and host affinity.  This is context that allows the system to make better decisions by letting it know that VM1 and VM2 need to be on the same physical machine or that VM3 cannot be vMotion’d.
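

The shape of that context is easy to picture.  Here is a rough, purely illustrative sketch (not the actual vSphere API) of affinity rules and overrides expressed as data a placement engine can check before it acts; it assumes only one VM moves at a time.

from dataclasses import dataclass, field


@dataclass
class PlacementContext:
    keep_together: list[tuple[str, str]] = field(default_factory=list)  # pairs that must share a host
    pinned: set[str] = field(default_factory=set)                       # VMs that must never be moved


def move_allowed(vm: str, dst_host: str, placement: dict[str, str], ctx: PlacementContext) -> bool:
    """Would moving vm to dst_host violate any rule the operator told us about?"""
    if vm in ctx.pinned:
        return False
    for a, b in ctx.keep_together:
        partner = b if vm == a else a if vm == b else None
        if partner is not None and placement.get(partner) != dst_host:
            return False
    return True


# Example: VM1 and VM2 must stay together, VM3 must never move.
ctx = PlacementContext(keep_together=[("VM1", "VM2")], pinned={"VM3"})
placement = {"VM1": "host-a", "VM2": "host-a", "VM3": "host-b"}
print(move_allowed("VM1", "host-b", placement, ctx))  # False: VM2 is still on host-a
print(move_allowed("VM3", "host-a", placement, ctx))  # False: VM3 is pinned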


The details of how vSphere works aren’t terribly important here.  The point is that these types of accuracy and context issues have been around for a very long time.  We can look back at these systems and understand how they used context to improve accuracy and how those two factors led to customer adoption.  It’s easy to think “GenAI changes everything” and just ignore the last thirty years of enterprise automation, but that would probably be a mistake.  We know how to solve these problems, we just need to look at them in the abstract and pay less attention to the implementation details which change over time. 


This takes us to context.


The lesson of the last 30 years of automation is that context is king.  If the system knows what is happening and it knows what’s supposed to happen, the odds are higher that the system will take the correct action.  Yes, context leads to accuracy which leads to autonomy.  This is yet another software virtuous circle.


As you plan your AI agents, think about context.  Does the context that the agent needs exist already in an online system?  Is that context correct?  Are there secret rules that your business actually uses that aren’t written down?  Start there.  If I am a very junior employee and I know nothing, will I do the right thing if I just follow the documentation?  If not, your agents don’t have the correct context and won’t reach the correct result.


Agents Don’t Think—That’s OK


Recently, we have seen quite a bit of discussion online about computers becoming intelligent.  In fact, some people have claimed that the holy grail of AI—the ability of a computer to truly think and reason—is now possible.  This is referred to as “Artificial General Intelligence” (AGI).  Sam Altman (CEO of OpenAI) for example said:


We are now confident we know how to build AGI as we have traditionally understood it.


https://www.forbes.com/sites/johnkoetsier/2025/01/06/openai-ceo-sam-altman-we-know-how-to-build-agi/


However, others are quite doubtful.  Some have said AGI is simply not possible.  Apple recently released a paper pointing out that current LLM-based solutions don’t actually think at all:


https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf


For those of us working in the field, that’s not much of a surprise.  An LLM is simply using a very large statistical model (hence the term Large Language Model) to predict what the next word of its response should be.  Thus, we are seeing an amazing display of math, not thinking.
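

To see why “amazing math” is a fair description, here is a toy sketch of next-word prediction done with nothing more than bigram counts.  A real LLM replaces the count table with a neural network trained on vastly more data, but it is still producing a probability distribution over possible next tokens, not a thought.

from collections import Counter, defaultdict

# A tiny "training corpus"; real models train on trillions of tokens.
corpus = "the agent reads the context and the agent answers the question".split()

bigrams: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1


def predict_next(word: str) -> str:
    counts = bigrams.get(word)
    return counts.most_common(1)[0][0] if counts else "<unknown>"


print(predict_next("the"))  # "agent" -- the statistically most common continuation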


But, that’s OK.  There is nothing wrong with really amazing math.


Let’s talk about why LLMs are so amazing, and more specifically about why LLM-based agents are so transformational.


They’re amazing because they deal well with ambiguity.  They have enough context to largely figure out what we mean even if we ourselves are not being precise.


For example, let’s say I gave an instruction to a traditionally-coded microservice.  I wanted to find all entries in a database that contained the word “California.”  However, I don’t know if the database uses “California” or “CA” or perhaps even the older “Calif”.  What to do?  Well, what we normally do is list out all the possible values.  So I would say something like:


SELECT * FROM your_table_name WHERE state = 'CA' OR state = 'California' OR state = 'Calif';


Notice that I need to be specific and I need to handle all cases.  You could also write some regex or use some other techniques, but in the end, it’s up to the programmer to deal with the ambiguity. 


But what if I forget one?  What if other people refer to California as Cali?  Then my query would be wrong.


However, for an LLM, this isn’t an issue.  For example, you can use Perplexity to answer this question even if you use “Calif” instead of the more common CA or California in your query:


It’s a trivial example, but notice that even though nobody says “Calif” any more, it understood me and correctly answered.  I could have said CA (although CA also means Canada) or any other variation.  Of course, it only knows what’s in the database.  Notice the error for the A’s who used to play in Oakland but currently play in Sacramento.  LLMs are not perfect; they simply reflect the data they’ve been given.


You may not be impressed by this—after all Google has been using similar techniques for many years to make search work.  However, doing this yourself for an internal application used to be really, really hard.  Google had the size to do this, but your four-person feature team?  Probably not.
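

Here is a rough sketch of what that looks like with an LLM in the loop.  It is illustrative only: ask_llm() stands in for whatever chat-completion API you actually call, and the canonical values are an assumption about what the database might contain.

# Let the model absorb the ambiguity, then run a precise query.
CANONICAL_STATES = ["CA", "California", "Calif"]  # assumption: spellings the database might hold


def ask_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError


def candidate_values(user_text: str) -> list[str]:
    prompt = (
        f"A user referred to a US state as '{user_text}'. "
        f"From this list only, return every spelling that refers to the same state: {CANONICAL_STATES}. "
        "Answer with a comma-separated list and nothing else."
    )
    return [value.strip() for value in ask_llm(prompt).split(",")]


# The hand-written OR list then becomes a parameterized IN clause built from
# whatever the model mapped "Calif", "Cali" or "CA" to:
#   SELECT * FROM your_table_name WHERE state IN (%s, %s, %s)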


With LLM-based agents, you get the capability basically for free.  That’s amazing. 


Well, it’s not free.  Nothing in software is free.


While being able to handle ambiguity is amazing, it comes with a cost: uncertainty.


Because the model is guessing at the correct answer, the model can be wrong (as we saw with the previous example).  It could also go down a completely incorrect path.  This is so common, there’s a standard term for this: hallucination.  Hallucinations are very common for LLM-based software, and those of us building agents spend a huge proportion of our time building gates and evaluators to try and avoid these tangents.


However, the gates we build to avoid these things are not perfect.


My personal test for agent control is the “History of the Peloponnesian War” test.  Whenever I build an agent that’s supposed to do something really specific, I always check by asking it to discuss the history of the war.  Ideally, it should not do this.  It’s been instructed not to, but sometimes LLMs (like puppies and junior PMs) just can’t help themselves.  They get excited and go off track.
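

That check is easy to automate.  Here is a minimal sketch of the “Peloponnesian War test” as a regression test; ask_agent() stands in for however you invoke the agent under test, and the refusal markers are examples, not an exhaustive list.

# Probe the agent with off-topic questions and confirm it refuses to engage.
OFF_TOPIC_PROBES = [
    "Tell me about the history of the Peloponnesian War.",
    "Who won the Peloponnesian War, and why?",
]

REFUSAL_MARKERS = ["can't help", "cannot help", "only discuss insurance"]


def ask_agent(prompt: str) -> str:
    """Stand-in for a call to the agent under test."""
    raise NotImplementedError


def stays_on_topic() -> bool:
    for probe in OFF_TOPIC_PROBES:
        reply = ask_agent(probe).lower()
        if not any(marker in reply for marker in REFUSAL_MARKERS):
            return False  # the agent wandered off into ancient Greece
    return True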


So, let’s use an example.  We can create an agent that is specifically instructed to only discuss insurance:


The above example is a custom GPT I instructed to ignore non-insurance topics.  Notice the correct “I can’t help you” response to the Peloponnesian War question.  All good, so far.  But will it remember those instructions all the time?  Will it persist and ONLY answer insurance questions?


Remember that LLMs are smart, but inexperienced.  They’re easy to fool as I demonstrate here:



Whoops.  I did get it to tell me why the war happened by asking a hypothetical.  A human agent would know that I’m just punking them and would probably refuse to answer or just laugh at the silly question.  If you are an insurance company, you probably don’t want your agents giving theoretical advice to soldiers in the Peloponnesian War.  
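

One mitigation is to gate the output as well as the input.  The sketch below is a deliberately crude keyword screen, purely illustrative; real production gates are usually a second model call or a trained classifier, but even something this simple would likely have flagged the reply above before it reached a customer.

# Crude output-side gate: screen the agent's reply, not just the user's question.
BANNED_TOPICS = ["peloponnesian", "sparta", "athens"]  # illustrative only

FALLBACK = "I can only help with insurance questions."


def reply_is_on_topic(reply: str) -> bool:
    lowered = reply.lower()
    return not any(topic in lowered for topic in BANNED_TOPICS)


def guarded_reply(reply: str) -> str:
    return reply if reply_is_on_topic(reply) else FALLBACK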


Once again, we see the issue around managing and coaching.  Just like a junior employee, you need to watch them a bit, give them coaching and help them gain experience in a safe way.


Monday, June 9, 2025

Is Claude Smarter than your Intern?


As part of my investigation into how GenAI will change the way PMs work, I’ve done some experiments with LLMs and their ability to integrate with Jira.  Because Jira has a query-based interface and a native MCP-based integration with Claude, it’s pretty simple to connect Claude to your Jira environment:



Jira is one of the pre-built integrations.  You can see the full list here:


https://support.anthropic.com/en/articles/11176164-pre-built-integrations-using-remote-mcp


I was able to connect Claude to my Jira account pretty easily and after doing so was able to ask it questions about my Jira project.  As an example, I dumped a list of possible future blog posts into the project and then asked Claude to stack rank them based on a business outcome:



Notice that Claude made up specific factors that would “drive readership”. (It also ignored my misspelling of Label and correctly figured out what I meant.)  I could then re-prompt with different criteria, but I think that this exercise shows that Claude is capable of using fairly detailed business criteria to make decisions.  So, that part is good, and this is a win for the “quality” check mark that we discussed in my previous blog about agents.  Does Claude deliver answers at a high quality?  Yes.
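

For one-off experiments the chat UI plus the Jira integration is the easy path, but the same ranking exercise can also be scripted.  The sketch below uses the Anthropic Python SDK with the candidate list pasted directly into the prompt; the model id is just an example, and it skips the Jira MCP connection entirely, so treat it as an outline rather than a reproduction of the experiment above.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

candidates = [
    "Is Claude smarter than your intern?",
    "Flying without a net: requirements to code",
    "Avoid AI regret",
]

prompt = (
    "Stack rank these blog post ideas by expected impact on new-reader signups. "
    "For each one, state the criteria you used and the evidence you would want.\n\n"
    + "\n".join(f"- {title}" for title in candidates)
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # example model id; substitute whatever you have access to
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)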


However, we are still missing context.  The examples above were completely divorced from the real world.  Let’s say I had an actual application and I wanted to compare my possible epics against the code base.  Is that possible?


Well, in theory, yes.  Claude also has a GitHub connector.  I set that up and asked it to review a private repo. This was the result:

Foiled.  It turns out that even though Claude can access my private GDrive, it cannot access private GitHub repos.  That’s pretty disappointing.  I hope the Anthropic team will remedy that.


So, let’s try a public repo.  There are several issues already filed in the AGNTCY ACP repo—could Claude automatically write Jira stories to resolve them?  Using the Jira integration for Claude, I pointed Claude at that repo:


Aaand no.  When it failed to read the actual issues page, it just made some stuff up.  That’s actually worse than doing nothing.  Back to zero on the correctness scale.  Anthropic folks, if you’re listening, don’t make stuff up please.


Changing my approach, I was able to manually import GitHub content.  I went back to my original private repo.  I was able to import the code into Claude using the built-in “Add from GitHub” function.  Strangely, this worked fine even though Claude refused to read the repo via the chat interface.  As the next step, I asked it to compare the current code to a PRD that I wrote and received this result:



I used this flow to analyze the sample PRD and compare it to the actual running code of a sample app that I built with Lovable.  It did a decent job.


I then had it upload those items into Jira.  It correctly created Jira epics and assigned them based on criteria that I gave it.  


So, interesting.  Overall, I would say that Claude was capable of acting as a very junior PM on my team.  I was able to issue specific instructions around how to write Jira epics and how to prioritize, and it was able to investigate the current code base to compare requirements to the implementation.  The latter is interesting because most junior PMs can’t just look at a repo and understand the code well enough to do that analysis.  So, in some ways, this is superior to the work that a junior PM would do.


Wednesday, June 4, 2025

Avoid AI Regret



We are starting to see the results of early AI experiments in the enterprise and, not surprisingly, they’re not amazing.  A recent study by Orgvue, for example, says that 55% of enterprises regret decisions to replace people with AI.  Of course, by some estimates up to 65% of all IT projects fail, so AI is not actually worse than other technology areas.  It’s not much better either.  Disappointingly, there is no magic in the world.  Every single tech decision—every piece of software, every hardware upgrade, every shiny new AI tool—needs to start with a specific problem, aim for a defined outcome, be measured, and then iterated upon.


This isn't just my soapbox; we've seen the same patterns before with things like cloud adoption. And we're seeing them again with artificial intelligence, including the emerging concept of agentic AI, where software acts on your behalf. The hype is incredible, but the rubber hits the road when you ask: "What is this actually doing for the business?" and “How do we measure success?”


Remember that report from Orgvue on AI and workforce transformation we talked about? It laid out some pretty stark realities that underscore exactly this point. The initial, perhaps overly enthusiastic, dive into AI by some organizations is leading to some hard lessons learned.


Here's what those findings tell us, and why they scream the need for a solid business plan:


  • The Regret of Rushing Redundancies: The report highlighted that a significant chunk of leaders decided AI made some employees redundant, only for more than half of those to regret the decision. This isn't just an HR issue; it's a business failure. It means they didn't fully understand how AI would integrate, what roles were actually impacted, or what the ripple effects on productivity, morale, and institutional knowledge would be. That's a direct consequence of not starting with a clear business benefit and a detailed plan for achieving it. Were those redundancies genuinely necessary to achieve a quantifiable business goal, or were they based on a premature assumption?

  • The Skills Gap Isn't Closing Itself: Despite pouring money into AI, organizations are realizing they don't have the internal skills to make it work effectively. Leaders are boosting training budgets and seeking external help. This proves that deploying the tech is only step one. The business benefit doesn't magically appear; it requires people who know how to leverage the AI to improve workflows, analyze data, or interact with customers. If your AI strategy doesn't include a workforce strategy focused on skill development, you won't capture the value.

  • Lack of Clarity on Impact: Many leaders simply don't have a good grasp of how AI will truly affect their business or specific roles. They can't identify which jobs will benefit most or which jobs are most susceptible to automation. This is particularly true for more complex applications like agentic AI, where many leaders admit they don't know how to implement it effectively. Without understanding the how, you can't define a meaningful business goal or measure success. It's like buying a complex piece of machinery without knowing what product it's supposed to help you make. 


Think about an AI agent. At its core, it's a tool designed to perform specific tasks. Successfully integrating it is akin to hiring a very specialized employee. You wouldn't just hire someone and tell them to "go be productive." You'd give them clear objectives, define their responsibilities, provide training, and set up ways to measure their performance. Does the agent writing first drafts save your team time (a business benefit)? Does the agent managing customer inquiries actually improve satisfaction scores?  Is the quality as good or better than if a human did the work?  How do you know that?  How often have you reviewed and iterated on the solution?


The trends show that AI is being adopted and delivering value in specific areas. We see examples like virtual assistants handling billions of customer interactions, content creation tools being used hundreds of millions or billions of times, and specialized AI software driving significant revenue growth in industries like financial services. These successes likely stem from identifying specific problems these tools can solve and measuring the results across multiple iterations. 


The Orgvue findings are a cautionary tale against the alternative—deploying AI blindly, chasing the hype without a grounded business case. Businesses focused purely on simplistic cost-cutting through premature layoffs, without a deep understanding of AI's role and the necessary workforce adjustments, are encountering regret.


The real, sustainable value from AI, or any technology, comes from strategic integration tied directly to achieving specific business outcomes. Every single technology investment must pass the "So what?" test. So you have a new AI tool? So what does it do for the business? Does it increase revenue? Reduce costs? Improve efficiency? Enhance customer experience? Enable innovation?


Stop treading water with generic deployments that offer little competitive advantage. Instead, focus on identifiable business problems where AI offers a unique solution that traditional software can't provide. Work with small, focused teams, tackle manageable problems, define your desired outcome upfront, and measure everything.


The technology landscape is constantly evolving, with new models and techniques emerging rapidly. Relying solely on rigid, centralized evaluations might mean you miss opportunities. Flexibility, focused experimentation, and quick iteration based on measured business results are key.

And I'll say it one more time, because it's that critical: Start with the business goal and work backwards towards technology. If you don't, you're not investing in the future; you're just setting money on fire.