It doesn't know what it doesn't know
In my continuing exploration of the general capabilities of LLM-based AI, I keep finding a particularly vexing failure mode: confident bullshit.
I recently put a smart clock in my kid's room, so that he would know when it was an appropriate time to wake me up on the weekends. Since it was a smart clock, I decided I'd try to make a Google Home routine that would use Gemini to find the day's school lunch menu offering and announce it, so that he could decide whether or not to have me make his lunch.
First failure mode: silence. The routine threw an error, and absolutely nothing was announced by the smart device. So I started playing with Gemini directly.
Since I wanted this to be generalizable, I pointed Gemini to the school district page which links to the current month's menu. I asked Gemini to find the relevant menu and tell me what today's choice was. It was able to navigate the page, find the right menu, and apparently access the menu PDF. But it reported a different, random meal choice every time. I tried many ways of prompting to help it ("find the white square marked with today's day-of-the-month, report the text in that square", etc). None of it really helped. It just picked a random entree from somewhere in the document.
So I asked Gemini how I could prompt it to answer more correctly. Only then, for the first time, did it say that PDFs are difficult to parse, and that text near the right day-of-the-month in the document's internal structure might not be the text that appears near it on the rendered page. Okay. So I asked it to render the PDF, then use visual reasoning to answer the question. It still failed, still giving a random day's menu choices.
I download the PDF myself. Take a screenshot of the rendered page, and save it as a PNG. Upload the PNG and ask Gemini what's for lunch today. Boom. No problems. Ask it for the rest of the week, no problems. Ask it which days a picky child who doesn't like sauces might want school lunch. Boom. No problems. So it's clearly able to answer these questions when using visual reasoning.
I upload the PDF itself to Gemini, ask it to render then use visual reasoning to answer the questions. Perfect, no notes.
But, if I ask it to find the PDF from the menus page, render it, then use visual reasoning, it won't. It claims to, but answers randomly again.
When it's wrong, it's not just wrong; it's confidently wrong. Its wrong answers are structured the same way as its right ones. It was even able to identify parsing the PDF directly as the problem (when asked; woulda been nice to volunteer that)! But it won't follow the instructions to answer visually based on a document it has downloaded. Perhaps that's one level of indirection too far.
It has all the capabilities necessary to accomplish the task I set it. But it won't put them together on its own. I tried the thinking model too; it failed as well.
How could I trust this to develop good code? Not just code that compiles. Not just code that passes tests (tests that, in all honesty, it probably wrote itself). How can I have any confidence that it will solve the right problems with the right tools, instead of just claiming to have?
When I had Gemini help me write my GADDAG implementation, it did something similar. When trying to USE the GADDAG to find words that fit a board constraint, it completely missed the point of the GADDAG and used it as a plain ol' trie. I rewrote the crucial code to take advantage of the reversed-prefix data. Had I trusted the AI, my game's computer opponent would have been needlessly inefficient, AND taken longer to initialize.
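To make the distinction concrete, here's a minimal sketch (not my actual implementation; all names are illustrative) of what a GADDAG buys you. For each word, you insert every rotation REV(prefix) + separator + suffix, so a search can start at an anchor letter already on the board, walk LEFT through reversed prefix letters, then cross the separator and extend RIGHT. A plain trie only supports the left-to-right half of that.

```python
# Illustrative GADDAG sketch. For each word we store every rotation:
# reversed prefix, then a separator, then the suffix. That lets a search
# start AT an anchor letter and grow in both directions -- the capability
# a plain trie lacks.

SEP = '>'   # separator between reversed prefix and suffix
END = '$'   # terminal marker; maps to the stored word

def build_gaddag(words):
    root = {}
    for word in words:
        for i in range(1, len(word) + 1):
            # e.g. for "cat": "c>at", "ac>t", "tac>"
            path = word[:i][::-1] + SEP + word[i:]
            node = root
            for ch in path:
                node = node.setdefault(ch, {})
            node[END] = word
    return root

def words_through(root, anchor, left_rack, right_rack):
    """Words containing `anchor`, with the remaining letters drawn from
    the left/right racks (a simplification of real board constraints)."""
    results = set()

    def walk(node, left, right, crossed):
        if END in node:
            results.add(node[END])
        for ch, child in node.items():
            if ch == END:
                continue
            if ch == SEP:
                if not crossed:           # switch from leftward to rightward
                    walk(child, left, right, True)
            elif not crossed and ch in left:
                rest = list(left); rest.remove(ch)
                walk(child, rest, right, False)
            elif crossed and ch in right:
                rest = list(right); rest.remove(ch)
                walk(child, left, rest, True)

    start = root.get(anchor)
    if start:
        walk(start, list(left_rack), list(right_rack), False)
    return results
```

With `build_gaddag(["cat", "at"])`, asking for words through the anchor letter `a` finds both words in one traversal, without scanning every starting position the way a trie-based search would.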