Turning Data into AI-Powered Insights: My Journey with an Insurance Premium Estimator Bot

LLMs excel in conversation, not in complex decisions: prioritize data and preprocessing for true success in data science. Find out how we at LumenData are helping our clients get their data ready for LLM/AI projects.

What You'll Learn

Home insurance agents spend a considerable amount of time gathering information about their clients. If the client isn't prepared with the necessary details, it can lead to wasted time during the interview process. However, automating the intake process with a chatbot can ensure that all required information is collected upfront, streamlining policy quote generation.

To help facilitate this, let's build a conversational insurance quote advising bot. The main building blocks were the OpenAI API for the conversation, the Google Search API and Zillow data for real-time context, and Pydantic for structuring the output.

I wanted it to be a conversational input-retrieval system that makes it convenient for the user to provide their details through a chat interface, and that turned out to be the right choice.

After we gather the user's data, the bot enriches it with valuable real-time data and then generates the quotation prices.

The bot’s output also contains the quotation price, the reasoning behind the suggested price, and several price comparisons.

This is my first time building a "chatbot." Chatbots are a widely sought-after tool in the field of AI, and with LLMs at an inflection point and OpenAI's rapid adoption within the industry, combining those two technologies was a logical decision.

Personally, I not only had to learn what these terms mean but also had to understand the level of research that has already been done in the field.

Researching how an insurance quote is generated, I learned that it boils down to a formula. My intuition was that a black-box model would not be a good fit for calculating the premium: the more opaque the formula is to the agent or the client, the more questions arise, which reduces the value of the chatbot in helping to generate the quote.
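To make that concrete, here is a minimal sketch of the kind of transparent, rule-based formula the bot relies on. The base rate and adjustment factors below are illustrative assumptions, not the actual rates used in the project; the point is that every adjustment stays visible to the agent and the client.

```python
# Minimal sketch of a transparent, rule-based premium formula.
# The base rate and adjustment factors are illustrative assumptions,
# not the actual rates used in the project.

def estimate_premium(home_value: float, crime_rate_pct: float,
                     ownership_status: str, marital_status: str) -> float:
    """Return an estimated annual premium in dollars."""
    base_rate = 0.005                      # assumed rate per dollar of home value
    base_price = home_value * base_rate

    # Assumed adjustment factors, keyed by the inputs the chatbot collects.
    ownership_factor = {"owner": 1.00, "renter": 1.10}.get(ownership_status, 1.05)
    marital_factor = {"married": 0.95, "single": 1.05}.get(marital_status, 1.00)
    crime_factor = 1.0 + (crime_rate_pct / 100) * 0.5   # higher crime, higher premium

    return round(base_price * ownership_factor * marital_factor * crime_factor, 2)

# Example: a $149,108 home in an area with an 87% crime rate.
print(estimate_premium(149_108, 87, "owner", "married"))
```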

Getting Started

Before I could even get to the implementation stage, I had to solve several data-related issues.

Data Dilemmas

Problem 1: Finding Suitable Data

I hoped to get datasets from the web and some real-time data using APIs. After much searching and many dead ends, I had to find alternative solutions to obtain the data.

I chose to get real-time data using Google search. I also found real-time housing price data on Zillow, a real-estate marketplace company's website.

Working with the OpenAI API

Problem A: Understanding Documentation

I wanted to make a conversational bot using the OpenAI API. OpenAI uses "threads" to build conversational loops, and this implementation is the basis of chatbots like ChatGPT, but the available documentation on the website only gave examples of how to ask a single question and get an answer. Retaining the data between steps is critical, since we need all of the information from the chat to build the quote.

The resources available on the web used outdated examples and required a lot of trial and error to bridge the gap. To solve the problem, I created a Python loop within the chat interface to simulate a conversational thread.
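As a rough illustration of that workaround, the sketch below keeps the full message history in a Python list and resends it on every call, which is enough to make the exchange feel like a single continuous thread. The model name and system prompt are assumptions for the example, not the project's exact configuration.

```python
# Sketch: a conversational loop that retains context by resending the
# accumulated message history on every call. The model name and system
# prompt are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "system",
     "content": "You are an insurance intake assistant. Collect the home value, "
                "zip code, ownership status, and marital status one question at a time."}
]

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"quit", "exit"}:
        break

    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumed model for this sketch
        messages=messages,       # the full history keeps the conversation stateful
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print("Bot:", reply)
```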

Problem B: Crafting Effective Prompts

Writing good prompts is an art. My initial prompts were either too vague or too complex, leading to unsatisfactory results, and it took several attempts to arrive at the most effective one. The clarity and specificity of the second prompt below made it far more effective than the first, which highlights how much prompt engineering matters for the quality of AI-generated responses. Here's a look at my evolution:

  • First Prompt: Too broad, yielding irrelevant responses.

“give the output of the insurance quote suggested. give me the reasoning behind it and how you came up with it as a short summary.”

Response:

The insurance quote suggested for the home value of $149,108 in St. Louis, MO with a crime rate of 87% and marital status as married is $1390.67.

The reasoning behind this quote is as follows:

  1. The base price for the quote is determined by the home value and the base rate.
  2. The ownership status is owner, which leads to an additional ownership factor being applied to the quote.
  3. The high crime rate in St. Louis contributes to a significant crime rate factor being added to the quote.
  4. The marital status factor for being married is also factored into the quote calculation.

Comparing this quote to nearby zip codes (47403 and 47404) and considering different ownership status (renter), the quote for the given scenario appears to be justified based on the higher crime rate and ownership status factors in St. Louis, MO. The quote takes into account the specific factors of the location, crime rate, and ownership status to arrive at a reasonable insurance cost for the given home value.

  • Second Prompt: More specific and gets the job done.

“Give the output of the insurance quote suggested. Provide the reasoning behind it and how you came up with it as a short summary. Do not explain the formula. Explain how much the quote is for other nearby zip codes as well in the summary. Also, explain the current ownership status and provide the quote for other ownership statuses while suggesting the quote for the other ownership. Just explain the story behind the quotation in a polite way. Mention the area name where the data is in the summary as well.”

Response:

The insurance quotation for your home, with a value of $149108, comes to $1390.67. This quote takes into account the high crime rate in your area of St. Louis, MO, which stands at 87%. Given that St. Louis is notorious for its elevated crime index, including both violent and property crimes, the quote reflects the risk associated with insuring a property in this location.

Comparing your quote with nearby zip codes like 47403 and 47404, which have quotes of $1744.00 and $1681.00 respectively, it is evident that your quote is comparatively lower due to the specific factors of your location and home value. The current ownership status as an owner also impacts the quote, as indicated by the $1390.67 price. If you were to be a renter instead, the quote would be $1457.00, illustrating the difference ownership status makes in the insurance premium.

Overall, the quote for your home in St. Louis, MO, is justified based on the home value, high crime rate, and ownership status, all of which influence the insurance cost in this area.
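In the bot, a refined prompt like the second one is combined with the data gathered during the conversation before being sent to the model. The sketch below shows one plausible way to assemble that message; the variable and field names are assumptions for illustration, not the project's actual code.

```python
# Sketch: combine the gathered user data and computed quotes with the
# refined instruction prompt. Field names are illustrative assumptions.
QUOTE_INSTRUCTIONS = (
    "Give the output of the insurance quote suggested. Provide the reasoning "
    "behind it and how you came up with it as a short summary. Do not explain "
    "the formula. Explain how much the quote is for other nearby zip codes as "
    "well in the summary. Also, explain the current ownership status and provide "
    "the quote for other ownership statuses. Mention the area name in the summary."
)

def build_quote_message(user_data: dict, quotes: dict) -> str:
    """Return the final prompt: gathered data, computed quotes, then instructions."""
    context = (
        f"Home value: ${user_data['home_value']}, zip code: {user_data['zip_code']}, "
        f"ownership status: {user_data['ownership']}, marital status: "
        f"{user_data['marital_status']}, crime rate: {user_data['crime_rate']}%. "
        f"Computed quotes: {quotes}."
    )
    return context + "\n\n" + QUOTE_INSTRUCTIONS
```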

Integrating Crime Data

Problem C: Google Search API Limitations

Using Google Search API for crime data presented its own set of challenges. A straightforward search such as “Crime report in 47401” didn’t yield useful results. A better search query is “Crime rate in Bloomington, Indiana.”

The data had to be mapped with zip codes, counties, towns, and states to refine the search queries effectively.

I ended up using an additional dataset to bridge this gap, which allowed for more accurate and relevant searches. This additional dataset included each zip code and its corresponding town, county, city, and state name.
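As a small sketch of that lookup, the idea is to resolve a bare zip code into its town and state before building the search query. The CSV file name and column names below are assumptions for illustration.

```python
# Sketch: turn a bare zip code into a Google-friendly crime-rate query.
# The CSV file name and column names are assumptions for illustration.
import pandas as pd

zip_lookup = pd.read_csv("zip_to_place.csv", dtype={"zip": str})
# assumed columns: zip, town, county, city, state

def crime_query(zip_code: str) -> str:
    """Build a search query such as 'Crime rate in Bloomington, Indiana'."""
    row = zip_lookup.loc[zip_lookup["zip"] == zip_code].iloc[0]
    return f"Crime rate in {row['town']}, {row['state']}"
```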

Problem D: Data Structuring and Validation

Once I scraped the data from the websites, I wanted the following:

  1. The crime rate in the particular area, as a percentage.

  2. A short summary of the crime rate in the area.

I used the OpenAI API again to obtain the summary and to better organize the data for later processing. The problem I faced was parsing the output into a float and a string, since the crime rate is a float data type and the summary is a string data type.

To improve the output, I used the Pydantic library, which helped in structuring and validating the data I was working with. Pydantic ensured that the responses were well-formatted and adhered to the expected schema, making the chatbot’s output more reliable and easier to work with.
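A minimal sketch of that validation step, assuming the Pydantic v2 API and two fields matching the description above (the field names themselves are illustrative):

```python
# Sketch: validate the model's crime-rate output with Pydantic so the
# percentage and summary come back correctly typed. Pydantic v2 API;
# field names are illustrative assumptions.
from pydantic import BaseModel, Field

class CrimeInfo(BaseModel):
    crime_rate: float = Field(ge=0, le=100)  # crime rate as a percentage
    summary: str                             # short text summary of crime in the area

# Example: parse a JSON string returned by the model.
raw = '{"crime_rate": "87", "summary": "St. Louis has a high crime index."}'
info = CrimeInfo.model_validate_json(raw)
print(info.crime_rate + 1.0)  # the rate now behaves as a float
```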

Scoping and Final Thoughts

Building this insurance premium estimator bot was a learning experience in many ways. It taught me the importance of good data, the intricacies of working with APIs, and the art of crafting effective prompts. These insights are crucial for anyone looking to bring their LLM applications into production.

The main takeaway is the critical importance of having good data in any data science or machine learning project. High-quality and well-prepared data is foundational to the success of these projects. Most of the work in data science involves data engineering and preprocessing. This includes tasks such as cleaning and transforming raw data, handling missing values, normalizing data formats, and integrating data from various sources. These steps are crucial because they ensure the data is accurate, consistent, and suitable for analysis. Without proper attention to these tasks, even the most advanced algorithms and sophisticated models may fail to produce useful results. 

Therefore, investing time and effort in data preparation is essential for achieving successful outcomes in data-driven projects.

While LLMs have become increasingly powerful, their strengths lie in summarizing information and engaging in human-like conversations, rather than performing complex mathematical tasks or making critical decisions like suggesting insurance rates. For tasks requiring high accuracy and expertise, relying on LLMs is not advisable at this stage. The emphasis should remain on the foundational importance of good data and effective preprocessing, which are the true drivers of success in any data science project.

About LumenData

LumenData is a leading provider of Enterprise Data Management, Cloud, and Analytics solutions and helps businesses handle data silos, discover their potential, and prepare for end-to-end digital transformation. Founded in 2008, the company is headquartered in Santa Clara, California, with locations in India.

With 150+ Technical and Functional Consultants, LumenData forms strong client partnerships to drive high-quality outcomes. Their work across multiple industries and with prestigious clients like Versant Health, Boston Consulting Group, FDA, Department of Labor, Kroger, Nissan, Autodesk, Bayer, Bausch & Lomb, Citibank, Credit Suisse, Cummins, Gilead, HP, Nintendo, PC Connection, Starbucks, University of Colorado, Weight Watchers, KAO, HealthEdge, Amylyx, Brinks, Xylem, Clara Analytics, and Royal Caribbean Group, speaks to their capabilities.

For media inquiries, please contact: marketing@lumendata.com.

Authors

Aathirai Thamaraiselvi

Data Science Intern

Andrew Crider

Director, Analytics