# Exploring Generative AI

Generative AI and particularly LLMs (Large Language Models) have exploded
into the public consciousness. Like many software developers I am intrigued
by the possibilities, but unsure what exactly it will mean for our profession
in the long run. I have now taken on a role at Thoughtworks to coordinate our
work on how this technology will affect software delivery practices.
I’ll be posting various memos here to describe what my colleagues and I are
learning and thinking.

## Latest Memo: Median – A tale in three functions

27 July 2023

This is a little story of generating a median function, and how it illustrates the usefulness and limitations of LLM-assisted coding.

I needed to calculate the median of a list of numbers in a TypeScript codebase. Instead of doing what I usually would, which is to start an internet search for “median function javascript”, I tried to get GitHub Copilot to assist me.

``````
// calculate the median of a list of numbers
``````

Copilot first generated a proper method signature for me:

``````
public static median(values:number[]): number {
``````

And then suggested three different implementations of that function.

### Suggestion 1

``````
public static median(values:number[]): number {
    if(values.length === 0) return 0;

    values.sort((a,b) => a-b);

    var half = Math.floor(values.length / 2);

    if (values.length % 2)
        return values[half];

    return (values[half - 1] + values[half]) / 2.0;
}
``````

This implementation works fine. However, `sort` sorts the array in place, so the input parameter `values` is modified as a side effect of calling the function. When I showed this example to some colleagues, I got mixed responses. One said “that’s pretty bad, something we would usually forget to write a test about, and then it can become an error that’s really hard to debug”. Another said, “it’s not necessarily always a bad thing, it depends”. I wondered whether a static code analysis tool in the IDE would mark this as a problem, but neither ESLint nor Sonar flagged it.
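The side effect is easy to miss when reading the call site. Here is a minimal sketch of it, assuming the body of Suggestion 1 is moved into a standalone function; the name `medianInPlace` and the sample data are made up for this illustration.

```typescript
// Same logic as Suggestion 1, written as a standalone function.
// `medianInPlace` is a hypothetical name used only for this illustration.
function medianInPlace(values: number[]): number {
  if (values.length === 0) return 0;

  values.sort((a, b) => a - b); // sorts the caller's array in place

  const half = Math.floor(values.length / 2);
  if (values.length % 2) return values[half];
  return (values[half - 1] + values[half]) / 2.0;
}

const readings = [5, 1, 4];
const result = medianInPlace(readings);

console.log(result);   // 4
console.log(readings); // [1, 4, 5]: the original order is gone
```

Whether that matters depends on the caller: if `readings` is used afterwards in its original order, the bug surfaces far away from the `median` call.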

### Suggestion 2

``````
function median2(values: number[]) {
    const sorted = values.slice().sort((a, b) => a - b);
    const middle = Math.floor(sorted.length / 2);
    if (sorted.length % 2 === 0) {
        return (sorted[middle - 1] + sorted[middle]) / 2;
    } else {
        return sorted[middle];
    }
}
``````

This one works, and does not have the problem of changing the input parameter, as `slice()` returns a copy of the input array.

### Suggestion 3

``````
function median3(values: number[]) {
    const sorted = values.slice().sort((a, b) => a - b);
    const middle = Math.floor(sorted.length / 2);
    return sorted[middle];
}
``````

At first sight, this looks better than the others: it’s shorter and has no branches at all (the lowest possible cyclomatic complexity), which we tend to think of as “more elegant”. The test I wrote for it also passed, great!

But I later realised that it is important to test it with both an even and an odd number of values. The test I had written used an odd-length input, but this implementation fails for an even-length input: it returns the upper of the two middle values instead of their average. This example illustrates well that it is still important to understand what we are doing, so that we can write reasonable tests for the generated code. We need to know how median works to write tests that cover the full functionality.
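The failure is easy to reproduce. The sketch below repeats Suggestion 3 (with an explicit return type added) so that it runs on its own; the two sample inputs are my own choice, not the ones from my original test.

```typescript
// Suggestion 3 again, copied so this snippet runs on its own.
function median3(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const middle = Math.floor(sorted.length / 2);
  return sorted[middle];
}

const oddResult = median3([3, 1, 2]);     // sorted: [1, 2, 3]
const evenResult = median3([4, 1, 3, 2]); // sorted: [1, 2, 3, 4]

console.log(oddResult);  // 2, the correct median
console.log(evenResult); // 3, but the median of 1, 2, 3, 4 is (2 + 3) / 2 = 2.5
```

A test suite that only contains the first call would be green, which is exactly what happened to me.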

### Isn’t this the same though as copying from the internet?

A use case like this is a perfect example of one of the benefits people see in this “autocomplete on steroids”: you don’t have to leave your flow and toolchain to get answers to questions like this. Otherwise we’d copy and paste the code from somewhere anyway, and would then have to review it thoroughly and write the tests as well. So it’s the same risk, right?

The only difference is that with Copilot, we don’t know the source of the code. With Stack Overflow, for example, we have an additional data point about the quality of a snippet: the number of upvotes.

Incidentally, “Suggestion 1” is almost exactly the code suggested by the highest-voted response to a Stack Overflow question on the topic, in spite of the little flaw. A user does call out the mutation of the input parameter in the comments, though.

### Generate the tests, or the code? Or both?

What about the other way around, then? What if I had asked Copilot to generate the tests for me first? I tried that with Copilot Chat, and it gave me a very nice set of tests, including one that fails for “Suggestion 3” with an even length.

``````
it("should return the median of an array of odd length", () => ...

it("should return the median of an array of even length", () => ...

it("should return the median of an array with negative numbers", () => ...

it("should return the median of an array with duplicate values", () => ...
``````

In this particular case of a very common and small function like median, I would even consider using generated code for both the tests and the function. The tests were quite readable and it was easy for me to reason about their coverage, plus they would have helped me remember that I need to look at both even and odd lengths of input. However, for more complex functions with more custom code, I would consider writing the tests myself, as a means of quality control. Especially with larger functions, I would want to think through my test cases in a structured way from scratch, instead of getting partial scenarios from a tool and then having to fill in the missing ones.
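To show how little it takes to reason about coverage for a function this small, here is a sketch of the four scenarios written as a plain table of inputs and expected values, run against Suggestion 2. The inputs and expected values are my own illustrative assumptions and not the test bodies Copilot Chat generated; the snippet also avoids a test framework so that it runs on its own.

```typescript
// Suggestion 2, repeated here so the snippet is self-contained.
function median2(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const middle = Math.floor(sorted.length / 2);
  if (sorted.length % 2 === 0) {
    return (sorted[middle - 1] + sorted[middle]) / 2;
  }
  return sorted[middle];
}

// Hypothetical scenarios mirroring the four test titles; the inputs and
// expected values are illustrative assumptions, not Copilot's output.
const scenarios: { title: string; input: number[]; expected: number }[] = [
  { title: "odd length", input: [3, 1, 2], expected: 2 },
  { title: "even length", input: [4, 1, 3, 2], expected: 2.5 },
  { title: "negative numbers", input: [-5, -1, -3], expected: -3 },
  { title: "duplicate values", input: [2, 2, 1, 3], expected: 2 },
];

// Collect the scenarios whose actual result differs from the expected one.
const failures = scenarios.filter((s) => median2(s.input) !== s.expected);
console.log(failures.map((s) => s.title)); // empty when every scenario passes
```

Laid out like this, it is quick to see which cases are covered and that the empty list, for example, is not among them.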

### Could the tool itself help me fix the flaws with the generated code?

I asked Copilot Chat to refactor “Suggestion 1” so that it does not change the input parameter, and it gave me a reasonable fix. The question implies, though, that I already know what I want to improve in the code.
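For illustration, one possible shape of such a fix is sketched below. This is my own assumption of what a non-mutating variant could look like, not the answer Copilot Chat actually produced; the function name `medianWithoutMutation` is hypothetical.

```typescript
// Hypothetical non-mutating variant of Suggestion 1; not Copilot Chat's actual answer.
function medianWithoutMutation(values: number[]): number {
  if (values.length === 0) return 0; // same empty-input behaviour as Suggestion 1

  const sorted = [...values].sort((a, b) => a - b); // sort a copy, leave the input alone
  const half = Math.floor(sorted.length / 2);

  if (sorted.length % 2) return sorted[half];
  return (sorted[half - 1] + sorted[half]) / 2;
}

const input = [5, 1, 4, 2];
console.log(medianWithoutMutation(input)); // 3
console.log(input);                        // [5, 1, 4, 2], unchanged
```

The only change is that the sort runs on a copy, which costs one extra array allocation per call.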

I also asked ChatGPT, more broadly, what is wrong or could be improved with “Suggestion 3”. It did tell me that it does not work for an even length of input.

### Conclusions

• You have to know what you’re doing to judge the generated suggestions. In this case, I needed to understand how median calculation works to be able to write reasonable tests for the generated code.
• The tool itself might have the answer to what’s wrong or could be improved in the generated code. Is that a path to making it better in the future, or are we doomed to have circular conversations with our AI tools?
• I’ve been skeptical about generating tests as well as implementations, for quality control reasons. But generating tests could give me ideas for test scenarios I missed, even if I discard the code afterwards. And depending on the complexity of the function, I might consider using generated tests as well, if it’s easy to reason about the scenarios.

Thanks to Aleksei Bekh-Ivanov and Erik Doernenburg for their insights.