=================
== Snappyl.com ==
=================
Welcome to my corner of the internet!

Artificial Intelligence Review

Recently I decided to dive into the literature on the current-ish state of Artificial Intelligence (AI), and more specifically large language models (LLMs), to see what the potential pitfalls may be. I was interested in the accuracy of LLM outputs and what effects that accuracy, or lack thereof, may have. In my review I’ve seen a few interesting things that I’m going to summarize here. I’m not a writer or journalist, just a software developer, so bear with me.

Interesting Subject Patterns

In my brief review, it seems that AI is more accurate on topics related to health and biology and less accurate in other topic areas. News and especially business questions seem to produce much more inaccurate answers than anything else. In software development in particular, accuracy varies with the language: Verilog, for example, was not as good a candidate for LLM assistance, while C fared much better.

Speaking to Software Vulnerabilities in Particular

Depending on the language, an LLM seems, at best, to produce code that is as vulnerable as a human’s. Additionally, and something I did not expect: even when monitored and reviewed by a human, the LLM-human system produces less secure code than the human alone. As a software developer, this is good information for me for sure. I will need to monitor the situation on code quality from LLMs, probably avoid them in my own work, and implement mitigation strategies for any code I need to review.
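
To make that concrete, here’s a hypothetical sketch (my own, in Python, not something from the literature) of the kind of subtle flaw that can slip past review. The two functions below differ by one line, but the first is open to SQL injection while the second is not:

    import sqlite3

    def find_user_unsafe(conn, username):
        # Vulnerable: the input is interpolated straight into the SQL,
        # so a username like "x' OR '1'='1" rewrites the query itself.
        cursor = conn.execute(
            f"SELECT id, name FROM users WHERE name = '{username}'")
        return cursor.fetchall()

    def find_user_safe(conn, username):
        # Safer: a parameterized query keeps the input as data, not SQL.
        cursor = conn.execute(
            "SELECT id, name FROM users WHERE name = ?", (username,))
        return cursor.fetchall()

A reviewer skimming generated code can easily miss a one-line difference like that, which is exactly the failure mode the studies describe.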

So I Asked the AI from an AI Company How Bad AI Is

Out of curiosity, I also signed up for Gemini Advanced and asked Gemini 1.5 Pro with Deep Research for its opinion:

What are the harms and benefits of large language systems? I am interested in finding out specifically what effects their accuracy, or lack thereof, has on the conclusions people draw. Also what harms or benefits those conclusions result in.

I did make one modification to its suggested research plan. Originally it had a few topics of research, including a review of harms related to the law. I suggested it modify its law review to specifically analyze judgments, rates of incarceration, and lengths of incarceration.

Overall, I think Gemini produced an ok summary of the current pros and cons of LLMs. That said, it doesn’t go too deep into the details. What I might do is continue querying it to see how it behaves and what responses I can get. In this case, and for this document, it does seem to produce results that are in line with other research I have done, so accuracy-wise I’d say it’s “high”.

I also noticed that the sources it selected were what I would call “ok” in terms of quality. In some YouTube reviews of Gemini Deep Research, I saw the source selection tend toward things like Reddit threads and YouTube videos, which I would call “low quality”. So I was pleasantly surprised by that.

If you want to read the output I got from Gemini, see this link: Google Doc

Conclusion

I think this is hard actually. I think these systems are kind of neat and fun to play with. For example, last week I was playing with Phi4 and asked it questions about ice cream and had it respond as a pirate. That resulted in me getting a recipe for rum-flavored ice cream with a golden caramel topping. Truly a treasure worthy of any pirate, yarr.

However, it’s also hard to ignore the fact that we are seeing at best 92% accuracy in some disciplines and complete inaccuracy in others. Additionally, seasoned professionals acting in a review role are unable to catch these inaccuracies. Given these two facts, it seems like poor judgment to use any of these systems in critical applications. Just ask yourself: “can I accept an 8% failure rate in my process?” If the answer is “no”, then you simply cannot use LLM system output as part of your process.
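
As a back-of-the-envelope illustration (my own arithmetic, assuming the 92% figure above and that errors are independent), that failure rate compounds quickly once answers depend on each other:

    # If each LLM answer is right 92% of the time, a chain of n answers
    # that all need to be right succeeds about 0.92**n of the time
    # (assuming independence, which is probably generous).
    for n in (1, 5, 10, 20):
        print(f"{n:2d} chained answers: {0.92 ** n:.0%} chance all correct")

Ten dependent answers already drop you below a coin flip.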

However, as an assistant for “rubber ducking”, I have personally found value in an LLM. Rather than treating LLM output as the final product, you might use the LLM for advice or direction. Ask it questions about your direction or process rather than asking it for direct output. I didn’t check whether there was any literature on this topic specifically, though, so I can’t make any definitive conclusions; I can only say how I feel.

But as with all things computery, advances are being made daily, so things may get better or they may get worse. And I say “worse” is an option because there is also literature saying things have gotten worse in some respects, on some topics, over the last few years. We can only monitor and see how things develop.

And, obviously, the greatest harm is the replacement of us as workers. I feel like I can’t do anything about fixing this problem so I guess I’ll just suffer like everyone else. Fight for more social safety nets, I guess? Otherwise we’ll all be out of jobs because Gemini can do this analysis as good as I can and then we don’t get money because capitalism and jobs and how it all works and ugh….

Now, if you’ll excuse me, I have some ice cream to make.

Sources

  1. https://blog.getbind.co/2025/02/16/is-deep-research-useful-comparing-gemini-vs-chatgpt-vs-perplexity/
  2. https://pmc.ncbi.nlm.nih.gov/articles/PMC11128619/
  3. https://www.nature.com/articles/s41598-024-83575-1
  4. https://arxiv.org/abs/2409.12183

I have other sources in an Obsidian workbook here I want to review as well, but that takes time.