In my previous blog post about Jira automation and Claude, I created a sample application and investigated its code on GitHub to compare it against the PRD I had written. Based on that analysis, I had Claude write several Jira epics. One of them was fairly generic: implement a user authentication and profile system. Because this functionality is required for almost any application you might write, I decided to start there.
Here is a screenshot of the epic that Claude wrote:
Note that this epic is HUGE. If a junior PM had written this epic, I would advise them to trim it down a bit: pull work like Google OAuth apart from work like RBAC, for example. This could be done by creating smaller child stories under this epic or by breaking it into multiple epics.
However, for the purpose of this test, I passed Claude's output straight into Codex, the relatively new GenAI-based coding tool from OpenAI. Here is the summary Codex produced after attempting to implement the epic:
Codex purports to be really good at building code directly from requirements. In OpenAI's demos, you can just toss complex requirements at Codex and it will do the heavy lifting. But tossing Codex a random GitHub repo may also mean that it cannot run tests, because it doesn't understand the code base. In this example, lint failed because of a missing dependency. Codex neither detected and fixed that automatically nor suggested a way to fix it.
In addition, as you can see from the output in the example, Codex focused on only part of the epic. It implemented just the unsafe handling of API keys, which Lovable had hard-coded into the source code. To be fair, this is a pretty important issue and definitely should be addressed in the code as soon as possible, but it's a very small subset of the epic. Codex didn't come back to me and say, "Hey, this epic is way too big, please make it smaller."
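For readers less familiar with the issue, here is a minimal TypeScript sketch of the kind of fix this involves. This is my own illustration, not Codex's actual diff, and the variable and key names are hypothetical:

```typescript
// Before: the secret is committed to the repository, so anyone with read
// access to the repo (or to the shipped bundle) can extract it.
// const API_KEY = "sk-live-abc123"; // hard-coded secret -- never do this

// After: read the secret from the environment at startup and fail fast if
// it is missing. API_KEY is a hypothetical name used for illustration.
const apiKey = process.env.API_KEY;
if (!apiKey) {
  throw new Error("API_KEY is not set; define it in the environment, not in source");
}

export const config = { apiKey };
```

The point of the pattern is that the secret lives in deployment configuration (a .env file excluded from version control, or the host's secret store) rather than in the repo itself.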
Again and again we see this failure mode in GenAI systems. They are enthusiastic but not experienced. If you compare this to people, a very junior dev might just follow instructions, not knowing how bad things are or how epics should be written. A more senior dev would go back to PM and tell us that we need to break this work down into smaller chunks. A principal-level dev would just fix the epic themselves and tell us that they fixed it.
Please note that I'm looking at this from the product management perspective. I won't evaluate the quality of the code coming out of these systems; I'm just investigating how functional they are, the same way I would evaluate any eng partner I work with.
In the end, a feature team only needs two things to be successful: quality and velocity.
If you are delivering on epics at very high quality and doing so very quickly, almost any other problem can be addressed by PM. Assuming PM is doing their job, this means we are building the correct features and the product is solving problems for the target persona. The same goes for AI. We know that GenAI-based systems like Codex are much faster than traditional coding methods, but are they executing at high quality?
So far, the answer is no. They require close human supervision to make them work correctly.
Going back to our junior-employee example, this shouldn't be surprising. If you hired a dozen new college grads and let them loose on your code base, what do you think would happen? Yes, chaos would ensue. At the moment, the same is true for AI-based toolchains. You can get them to do an amazing amount of work for you, but you do need to supervise them and monitor their progress to ensure quality work.
GenAI: Eager, fast, well educated. Not experienced or self-critical.