When should you use GenAI? Insights from an AI Engineer.
An evaluation checklist from a GenAI engineer to help you avoid bad projects.
After working on dozens of projects, from early prototypes to full-scale generative AI deployments, we at Firebird have seen what works and what doesn’t.
If you’re exploring GenAI for your product or business, this guide will help you cut through the hype.
There’s a lot of pressure right now. Companies feel the need to adopt AI fast, fearing they’ll be left behind. But moving too quickly, or using GenAI just for the sake of it, often leads to wasted time and money.
It’s great to stay ahead of the curve, but the generative AI space is also full of overpromises, half-baked solutions, and use cases where GenAI simply doesn’t fit.
Before committing to adding a Large Language Model, run your idea through this evaluation checklist. It will help you avoid bad use cases and build more responsible, effective applications.
Checklist question # 1
Does a deterministic solution already exist?
A simple yet often neglected question. Large Language Models are ultimately predicting the next token based on a learned probability distribution. If you ask an LLM what the weather is like right now, it will give you a different answer every time, and none of them will be grounded in real data.
Ideally, don’t use an LLM for things that have a clear, straightforward answer. If you need weather data, for example, it’s better to use a trusted API or just check outside yourself.
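For example, a few lines against a free weather API get you real data deterministically. This sketch uses Open-Meteo; the endpoint and parameter names are taken from its public docs, so verify them before relying on this:

```python
import requests

# Open-Meteo is a free, no-API-key weather service; parameters per its docs.
resp = requests.get(
    "https://api.open-meteo.com/v1/forecast",
    params={
        "latitude": 52.52,        # Berlin, used here purely as an example
        "longitude": 13.41,
        "current_weather": "true",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["current_weather"]["temperature"])  # e.g. 18.3 (degrees C)
```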
Building on that idea, here are some examples where clear rule-based solutions already exist, but people still use LLMs even though they are not really needed.
Calculations
Using an LLM for arithmetic is unreliable; a calculator or a single line of code is exact every time.
Data Mapping
If you have a known list of key-value pairs, like names and salaries, a deterministic lookup function is the better choice; an LLM adds unnecessary complexity and the risk of incorrect mapping (see the sketch after these examples).
Validating Information
LLMs cannot reliably validate phone numbers, email addresses, physical locations, or business registrations. They are also not dependable for verifying historical facts unless connected to a web search tool, and even then, they can still make things up.
Cleaning and Parsing Information
When you need to clean a string or fix formatting errors, use tools like regular expressions. If the issue follows a predictable pattern, a defined set of rules will work better than asking an LLM to guess (also shown in the sketch below).
Regressing or Predicting Numbers
For predicting numeric values or trends, a simple linear regression model is often more accurate and transparent than using an LLM.
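To make the mapping and cleaning cases concrete, here is a minimal sketch; the salary table and date format are invented for illustration:

```python
import re

# Data mapping: a known key-value table beats asking an LLM to "remember" it.
salaries = {"Alice": 95_000, "Bob": 72_000}

def lookup_salary(name: str) -> int:
    # Raises KeyError on unknown names instead of inventing a number.
    return salaries[name]

# Cleaning/parsing: a predictable formatting error is a one-line regex fix.
def normalize_date(s: str) -> str:
    """Turn 'DD/MM/YYYY' into ISO 'YYYY-MM-DD'."""
    return re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", s)

print(lookup_salary("Alice"))        # 95000
print(normalize_date("31/12/2024"))  # 2024-12-31
```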
Note: Some readers might be confused, because a chat interface like ChatGPT handles many of these tasks with high accuracy. The answer is that the chatbot calls deterministic tools behind the scenes and then asks the LLM to relay the result. ChatGPT calling a calculator to answer calculations is very different from asking the LLM to do the calculation itself.
The difference is subtle: when you develop your own solution by calling an LLM API, you need to make sure such requests are relayed to a deterministic function, not left to the LLM.
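Here is a minimal sketch of that routing. It is provider-agnostic: `call_llm` is a hypothetical stand-in for whichever LLM API you actually use, and the arithmetic detector is deliberately simple:

```python
import ast
import operator as op
import re

# Map AST operator nodes to real arithmetic functions.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calculate(expr: str) -> float:
    """Safely evaluate a pure-arithmetic expression without eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def call_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for your LLM API call")

def answer(user_request: str) -> str:
    # Anything that looks like pure arithmetic takes the deterministic path;
    # everything else is relayed to the LLM.
    if re.fullmatch(r"[0-9\s.+\-*/()]+", user_request):
        return str(calculate(user_request))
    return call_llm(user_request)

print(answer("12.5 * (3 + 4)"))  # 87.5 -- computed, not predicted
```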
Looking for experts to help identify and implement the correct solution for you?
Reach out here: https://tally.so/r/3x9bgo
Checklist question # 2
What is the cost? Is this economically viable? Are cheaper solutions available?
If your idea clears question one, there is no deterministic solution to your problem. The next important question to ask is whether Large Language Models are economical for it.
One example of a non-deterministic problem where an LLM might not be economically feasible is classification. Many people use LLMs for text classification; in certain cases that makes sense, but in many it absolutely does not. An obvious way to evaluate it is to look at how many additional dollars you are paying for each point of improvement in classification accuracy.
Most traditional classifiers incur some compute cost, but it is small enough that we can approximate it as zero.
After figuring out the cost per point of improvement you are paying, the next step is evaluating the dollar benefit you get per point of accuracy.
If the dollar benefit per point of improvement from the LLM is greater than the dollar cost per point of improvement, you can use an LLM for that task; if not, it is economically unviable.
Note: You can replace accuracy with any metric that captures the performance of your task. Calculating the dollar benefit can also be tough, so it is not uncommon to use an estimate; ideally, use a number that keeps you inside your budget. We at Firebird are specialists in figuring this out for you: reach out for a consultation!
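A toy version of that arithmetic, with every number invented purely for illustration:

```python
# Illustrative numbers only -- substitute your own measurements.
baseline_acc, llm_acc = 0.90, 0.94   # traditional classifier vs. LLM accuracy
items_per_month = 100_000
llm_cost_per_item = 0.002            # assumed $ per LLM-classified item
benefit_per_correct = 0.05           # assumed $ value of one correct label

extra_correct = (llm_acc - baseline_acc) * items_per_month  # 4,000 more right
extra_benefit = extra_correct * benefit_per_correct         # $200/month
extra_cost = items_per_month * llm_cost_per_item            # $200/month (baseline ~$0)

print(f"benefit ${extra_benefit:,.0f} vs. cost ${extra_cost:,.0f} per month")
# benefit $200 vs. cost $200 per month -> break-even at best, so the LLM
# brings no economic upside for this task.
```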
Here are some other examples where LLMs are being used, but where it would be wise to figure out whether they are cost-effective:
OCR and parsing images: If you have black-and-white text in an image, traditional OCR tools can be much more precise and cost-effective, since the vision capabilities of LLMs are not immune to hallucination.
Recommendation Algorithms: LLMs can be used to recommend items if you give them text descriptions, but whether the benefit (if any) is worth the additional cost needs to be calculated.
Sentiment Analysis: For most sentiment analysis use cases an LLM will likely be more ‘accurate’ than other NLP techniques, but is the additional accuracy worth it? This check is often neglected because the benefit might be small while the cost is high.
Categorization: Assigning tags and categories based on text descriptions is common nowadays, but if you need to do this in bulk, it is advisable to use a simpler solution or a mixed one where an LLM is only used on hard examples (sketched right after this list).
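A sketch of that mixed approach, assuming a scikit-learn style classifier; the confidence threshold and the `classify_with_llm` helper are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Cheap first-pass classifier trained on a handful of labeled examples.
texts = ["refund please", "love this product", "item arrived broken", "great value"]
labels = ["support", "praise", "support", "praise"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

def classify_with_llm(text: str) -> str:
    raise NotImplementedError("hypothetical stand-in for an LLM API call")

def categorize(text: str, threshold: float = 0.8) -> str:
    probs = clf.predict_proba([text])[0]
    if probs.max() >= threshold:
        return clf.classes_[probs.argmax()]  # confident: stay on the cheap path
    return classify_with_llm(text)           # hard example: escalate to the LLM
```

Tune the threshold so the LLM only sees the share of traffic your budget allows.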
Checklist question # 3
What is the risk & damage to my brand/business in the case of hallucination?
The last question is a risk assessment question, similar to the financial concept of value at risk: in simple terms, before making an investment, it is advisable to look at the damage or loss in the worst-case scenario.
Similarly, we should assume that whatever application we are building will malfunction due to the LLM hallucinating. In the absolute worst-case scenario, what is the potential damage to the company brand?
In certain cases, your company and brand might be liable for damages if the application malfunctions, so you should definitely come up with a framework to assess them.
Here are some obvious areas where you should be careful:
Data Accuracy: If your business relies on providing factually correct information and even a small error could trigger damages, you should definitely be careful with using LLMs. This is common in the finance industry.
Legal: If you are a lawyer and you rely heavily on AI for preparing legal documents, you might be held liable for hallucinations.
Content generation: Character.AI was sued over the suicide of a minor because its AI allowed users to generate NSFW content that may be damaging to children. If you work with adult or sensitive content, you must ensure the AI does not produce harmful material.
Hiring and HR: Many countries have laws against discrimination in employment; AI is not inherently fair and can be discriminatory, which could result in legal penalties.
The important thing to remember is that you should assume the AI will hallucinate and cause your product to malfunction. Even after adding guardrails, you cannot be completely sure that your application will not fail. If the risk is too high, you should avoid or limit AI use.
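For context, a guardrail can be as simple as a post-hoc filter on the model’s output. This minimal sketch (the patterns are illustrative assumptions, not a real safety policy) shows both the idea and its limits: a block list only catches what you thought to list.

```python
import re

# Post-hoc output guardrail: block obviously problematic responses before
# they reach the user. A pattern list can never catch everything, which is
# why guardrails reduce risk but never eliminate it.
BLOCKED_PATTERNS = [
    re.compile(r"\bguaranteed returns?\b", re.I),             # risky financial claims
    re.compile(r"\b(?:ssn|social security number)\b", re.I),  # sensitive data
]

def guard(llm_output: str) -> str:
    if any(p.search(llm_output) for p in BLOCKED_PATTERNS):
        return "Sorry, I can't help with that. A human will follow up."
    return llm_output
```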
Hopefully you found the information in this blog post helpful. Please follow us on Medium, Substack & LinkedIn to stay up to date on AI.