Investigating the relationship between estimation accuracy and task size
Yesterday on StackOverflow Johannes Hansen asked
What is the acceptable upper limit of time allocated to a single development task?
I answered with
If you track your estimate/actual history, you can probably plot hours by accuracy and figure out exactly what number is appropriate for your team.
My advice sounded so good I thought I’d try it myself. So I opened the bug tracker where I keep track of my probable and actual times and exported my closed bugs to Excel. I cleaned up a bit by removing any rows with a 0 for either the probable or the actual time, then created a chart.
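For anyone who would rather repeat the cleanup step outside Excel, here is a minimal Python sketch of it. The field names (`probable`, `actual`) and the sample rows are made up for illustration, and defining accuracy as actual time divided by probable time is an assumption; the tracker export may compute it differently.

```python
# Hypothetical export rows; the field names are assumptions, not the tracker's real columns.
export_rows = [
    {"id": 1, "probable": 2.0, "actual": 3.0},
    {"id": 2, "probable": 0.0, "actual": 1.5},  # no estimate recorded
    {"id": 3, "probable": 4.0, "actual": 0.0},  # no actual time recorded
    {"id": 4, "probable": 8.0, "actual": 4.0},
]


def clean_rows(rows):
    """Drop rows where the probable or the actual time is 0 (or missing)."""
    return [r for r in rows if r.get("probable") and r.get("actual")]


def accuracy_pct(probable, actual):
    """Assumed definition: actual time as a percentage of the estimate (100 = spot on)."""
    return actual / probable * 100.0


# (estimated hours, accuracy %) pairs, ready to feed into a scatter chart.
chart_points = [
    (r["probable"], accuracy_pct(r["probable"], r["actual"]))
    for r in clean_rows(export_rows)
]
print(chart_points)
```

Under this definition, anything above 100% is an overrun and anything below it is a task that came in under the estimate.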
Now when I conceived of this idea, I was expecting something like
Well I wasn’t expecting the plots to be that dense, or to accelerate above 200% so fast, but let’s just say, that general look would have been pleasing to my eye.
Now, I’ve got to say, this is NOT what I was expecting at all. You can kind of see a very dense block under 4 hours and 100%, but it doesn’t tell us very much about the relationship between estimation accuracy and task size. So I then threw a Linear Regression Trendline on the chart, hoping it would reveal an ascending trend. Instead it contradicted my assumptions by declining, suggesting that the larger the task, the more accurate I am … which isn’t true at all.
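A linear trendline is an ordinary least-squares fit of a straight line through the points, and the sign of its slope is what "ascending" or "declining" refers to. Here is a rough Python sketch of that calculation; the function name and the sample points are hypothetical, not my real data.

```python
def linear_trend(points):
    """Ordinary least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (estimated hours, accuracy %) points that rise by 10% per hour.
demo_points = [(1.0, 100.0), (2.0, 110.0), (3.0, 120.0)]
slope, intercept = linear_trend(demo_points)
print(slope, intercept)
```

A positive slope means the accuracy percentage grows with task size, a negative slope means it shrinks, and a slope near zero means the line shows no relationship at all.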
Maybe it’s the outliers. Maybe a handful of abnormal values are what’s making it look so horrible. So I sorted the data by the accuracy percentage, dropped the top and bottom 5 percent, redrew the chart and got this.
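The trimming step can be sketched in Python as follows. This is only an approximation of what I did by hand in Excel: the function name is hypothetical, and rounding the 5 percent down to a whole number of rows is an assumption.

```python
def trim_by_accuracy(points, percent=5):
    """Sort (hours, accuracy %) points by accuracy and drop the top and bottom `percent`."""
    ordered = sorted(points, key=lambda p: p[1])
    k = len(ordered) * percent // 100  # rows to drop at each end, rounded down
    return ordered[k:len(ordered) - k]


# Twenty made-up points with accuracies of 10%, 20%, ..., 200%.
sample_points = [(1.0, float(a)) for a in range(10, 210, 10)]
trimmed = trim_by_accuracy(sample_points)
print(len(trimmed), trimmed[0], trimmed[-1])
```

With 20 rows, 5 percent is a single row at each end, so the 10% and 200% points are dropped and 18 rows remain.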
Still no obvious relationship between the estimated task size and estimation accuracy. But at least my trendline is no longer declining. By flatlining, it’s now suggesting there is no relationship between estimation accuracy and task size.
… hmmm … bugs are included in my data. I wonder if that could be having an effect? I’ve been estimating approximate times bugs will take to resolve for my manager. Most of these bugs have been estimated before even investigating the cause, so that’s not really the same as estimating a defined task. What if I remove them?
I went back to my original data dump, removed all bugs, tickets, and questions so I was left with only new tasks and changes. I again removed the bottom & top 5% and recharted.
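Filtering by item type is a one-liner once the data is out of Excel. In this Python sketch the `type` field and its labels are assumptions; use whatever your tracker calls them.

```python
# Item types treated as planned work; these labels are hypothetical.
KEPT_TYPES = {"task", "change"}


def planned_work_only(rows, kept_types=KEPT_TYPES):
    """Keep new tasks and changes; drop bugs, tickets, and questions."""
    return [r for r in rows if r["type"].strip().lower() in kept_types]


# Made-up rows standing in for the original data dump.
items = [
    {"id": 1, "type": "Bug", "probable": 1.0, "actual": 3.0},
    {"id": 2, "type": "Task", "probable": 4.0, "actual": 5.0},
    {"id": 3, "type": "Question", "probable": 0.5, "actual": 0.5},
    {"id": 4, "type": "Change", "probable": 2.0, "actual": 2.5},
]
planned = planned_work_only(items)
print([r["id"] for r in planned])
```

The filter runs on the original dump, before trimming, so the 5 percent cut is recalculated on the smaller data set.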
Well, I’ve finally got an ascending trendline suggesting my estimates are weaker as tasks get bigger, which is what we expected to happen.
Conclusion: I’m still not very happy with the scatter chart. I still believe it should look closer to what I initially expected. This suggests to me that I need to take another look at my data collection if it’s going to be useful to me at all.
Feedback and constructive criticism welcome.