How to run RAG projects for better data analytics results

News | October 13, 2025 | Artifice Prime

The arrival of generative AI-enhanced business intelligence (GenBI) for enterprise data analytics has opened up access to insights, while also increasing the speed, relevance and accuracy of those insights.

But that’s in best-case scenarios. Often, AI-powered analytics leads data teams to the same challenges: hallucinations, security and governance snafus, outdated or incorrect answers, low familiarity with niche areas of expertise, and an inability to deliver answers grounded in proprietary data. Many of these challenges stem from a single factor: the LLMs that form the foundation for GenBI can only draw on their training data for answers, and this training data is largely static and inflexible.

While retrieval-augmented generation (RAG) offers a solution, it isn’t always implemented in ways that yield exemplary results. Some experts are extremely skeptical about the technology, estimating that real-world RAG implementations only produce successful outputs 25% of the time.

A recent research paper from Google in partnership with the University of Southern California found that RAG-enhanced model output only included direct answers to users’ questions 30% of the time, with problematic output most commonly attributed to perceived conflicts between internal information and retrieved information.

When done correctly, RAG enhances LLM knowledge by augmenting it with data retrieved from external sources, including internal knowledge bases, proprietary databases and documentation repositories. This allows AI analytics engines to enhance generic market knowledge with precise, up-to-date data about the topic at hand.
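As a rough sketch, the core RAG loop can be expressed in a few lines of Python. The embed, vector_store and llm objects below are hypothetical placeholders for whatever embedding model, vector store and LLM client a team actually uses; this illustrates the pattern, not a reference implementation.

# Minimal sketch of the RAG loop. embed(), vector_store and llm are
# hypothetical placeholders; real projects would swap in their own
# embedding model, vector store client and LLM API.

def answer_with_rag(question: str, embed, vector_store, llm, top_k: int = 5) -> str:
    # 1. Turn the user's question into a vector embedding.
    query_vector = embed(question)

    # 2. Retrieve the most relevant proprietary chunks from the vector store.
    chunks = vector_store.search(query_vector, top_k=top_k)

    # 3. Ground the LLM by putting the retrieved context into the prompt.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below. "
        "Cite the source of each fact.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 4. Generate an answer that blends broad LLM knowledge with retrieved data.
    return llm.generate(prompt)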

The challenge of RAG GenBI applications

The primary reason data analytics teams adopt RAG is to enhance the accuracy of insights, which is particularly crucial in self-service scenarios, where line-of-business users lack the human gatekeepers who would otherwise verify output quality.

Yet, as the figures above suggest, RAG accuracy itself can be limited, a significant drawback for a technology adopted precisely to improve accuracy.

On top of that, teams can also struggle to:

  • Ensure data privacy, data governance and regulatory compliance when connecting LLMs to proprietary, sensitive data.
  • Find data across complex landscapes.
  • Prevent drift in the RAG architecture.

That said, there are many success stories. For example, pharma researchers evaluated PaperQA, a RAG connector that draws on scientific papers to answer biomedical questions. They found an 86.3% accuracy rate, compared with 57.9% for GPT-4. Another resource, tested on more challenging questions, achieved 69.5% accuracy and roughly 87.9% precision with zero hallucinations, performance on par with biomedical domain experts. In contrast, non-RAG LLMs reported hallucination rates of 40% to 60%.

What RAG success looks like

The combination of broad LLM knowledge and targeted RAG data is crucial for many industries.

For example, healthcare and finance both demand high accuracy based on inputs focused on one individual, as compared to relevant cohorts. Scientific research, legal and compliance companies search enormous volumes of literature to answer complex queries. Manufacturing and supply chain companies must interpret company-specific situations in light of global data signals, which can vary significantly.

As a result, RAG is quickly becoming the non-negotiable foundation for data analytics. “RAG is basically the business of taking proprietary corporate data sets, let’s call it information, and being able to query it and use it as part of the LLM flow,” explains Avi Perez, CTO at Pyramid Analytics, in an interview with Dave Sobel for The Business of Tech.

“Let’s assume I wanted to ask a question about my company, Acme Corporation. It’s not like the LLM has the details on Acme. What’s happened is that someone has taken all of Acme’s documents, put them into a graph database and then applied RAG techniques. When I ask the question as an end user, it takes a piece of the database, vectorizes it, and the LLM merges it together and gives me an intelligent answer.”

But RAG itself is hardly a magic wand. There are many reports of low accuracy rates even for RAG-based answers, and data analytics teams often struggle to set up RAG processes in ways that deliver ROI.

Here are five clear tips to help you triumph.

1. Get your data house in order.

“A RAG system is only as good as the knowledge library,” says Dr. Judy W. Gichoya, of the Emory University School of Medicine, about her department’s adoption of RAG for radiology analytics. “If that library overrepresents certain topics and underrepresents others, the system’s performance will reflect that bias.”

RAG accuracy is determined by the data it draws on, but much of that data tends to be siloed, scattered, housed in legacy infrastructure and, in some cases, laden with biases.

There’s a tendency to dump enterprise information signals into a data lake and leave them unstructured, unstandardized, and poorly labeled and indexed until they’re needed. In these situations, the RAG architecture is unable to interpret this ambiguous data and cannot make sense of the information it needs to generate coherent analyses.

As with all AI projects, when implementing RAG for analytics, it’s vital to first tidy all your data up:

  • Identify the most valuable data sources.
  • Clean out irrelevant or out-of-date information.
  • Standardize text formats.
  • Verify and clean up metadata.
  • Enhance and augment as necessary.

Then, you must turn all this into a repeatable data preparation pipeline, because more data will continue to arrive, existing data will lose relevance and old data will need to be removed.

“The quality of your data is foundational to your RAG pipeline’s success,” advises Ryan Siegler, data scientist at KX. “By preprocessing, you can proactively eliminate or mitigate potential issues.” 
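A minimal sketch of one such preparation step, in Python, might look like the following. The field names and the one-year staleness cutoff are illustrative assumptions, not recommendations.

from datetime import datetime, timedelta, timezone

# Illustrative document-preparation step for a RAG knowledge library.
# Field names (text, updated_at, source) are assumptions for this sketch.

MAX_AGE = timedelta(days=365)  # assumed staleness cutoff

def prepare_documents(raw_docs: list[dict]) -> list[dict]:
    prepared = []
    now = datetime.now(timezone.utc)
    for doc in raw_docs:
        text = (doc.get("text") or "").strip()
        if not text:
            continue  # drop empty or irrelevant records

        updated_at = doc.get("updated_at")
        if updated_at and now - updated_at > MAX_AGE:
            continue  # clean out out-of-date information

        prepared.append({
            "text": " ".join(text.split()),          # standardize text format
            "source": doc.get("source", "unknown"),  # verify and normalize metadata
            "updated_at": updated_at,
        })
    return prepared

Running this on a schedule, rather than once, is what turns a one-off cleanup into the repeatable pipeline described above.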

2. Take vectorization seriously

Vectorization is what powers the RAG process, converting complex data into numerical vectors called vector embeddings to make searches more precise and swift.

The RAG architecture works with these vectors at the chunk level, retrieving the most relevant chunks for the LLM to use in its responses. With effective vectorization, proprietary sources can be merged more efficiently with the broader LLM data.

This means that your vectorization choices can be critical for RAG success. The main options are:

  • A vector database, which stores document embeddings, scales quickly and supports distributed storage for advanced indexing and vector querying.
  • A vector library, which is a faster, lighter way to hold vector embeddings.
  • Vector support integrated into the existing database to store vector embeddings and support querying.

The best choice depends on your specific circumstances. For example, a vector-native database is the most robust method, but it’s too expensive and resource-heavy to be practical for smaller organizations. A vector library is faster and best for times when latency is the enemy, while integrating vector capabilities is easiest but doesn’t scale well enough for heavy enterprise needs.
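To make the vector library option concrete, here is a minimal in-memory index sketched in Python with NumPy cosine similarity. The embed callable is a placeholder for a real embedding model, and a production system would rely on a purpose-built library or database with proper indexing.

import numpy as np

# Tiny in-memory "vector library" sketch. embed() is a placeholder for a real
# embedding model; production systems would use a vector database or library.

class TinyVectorIndex:
    def __init__(self, embed):
        self.embed = embed
        self.vectors: list[np.ndarray] = []
        self.texts: list[str] = []

    def add(self, texts: list[str]) -> None:
        for text in texts:
            vec = np.asarray(self.embed(text), dtype=float)
            self.vectors.append(vec / np.linalg.norm(vec))  # normalize once at index time
            self.texts.append(text)

    def search(self, query: str, top_k: int = 5) -> list[str]:
        q = np.asarray(self.embed(query), dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.array([v @ q for v in self.vectors])  # cosine similarity
        best = np.argsort(scores)[::-1][:top_k]           # highest-scoring chunks first
        return [self.texts[i] for i in best]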

3. Build a solid retrieval process.

It’s right there in the name – RAG is all about retrieving the right data to build accurate responses. However, you can’t simply point your RAG infrastructure at data sources and expect it to retrieve the best answers. You need to teach RAG systems how to retrieve relevant information, with a strong emphasis on relevance. Too often, RAG systems over-collect data, resulting in excessive noise and confusion.

“Experimental research showed that retrieval quality matters significantly more than quantity, with RAG systems that retrieve fewer but more relevant documents outperforming in most cases those that try to retrieve as much context as possible, resulting in an overabundance of information, much of which might not be sufficiently relevant,” observes Iván Palomares Carrascosa, a deep learning and LLM project advisor.

Best practices for RAG retrieval include:

  • Using hierarchical retrieval and dynamic context compression to optimize processes.
  • Setting up metadata filtering pipelines to automatically flag and/or exclude questionable content.
  • Adding validation layers between retrieval and querying.
  • Managing chunking carefully, including contextual chunking and late chunking (chunks that are too short lack context, while chunks that are too long introduce too much noise).
  • Investing in hand-labeled training datasets to teach the ranking algorithms to recognize relevance.
  • Evaluating retrieval performance metrics such as precision, recall and F1-score to drive continuous improvement (see the sketch after this list).
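As a concrete illustration of that last point, precision, recall and F1 for a single query can be computed directly from the IDs of retrieved versus known-relevant chunks. The IDs below are hypothetical.

# Evaluating retrieval quality for one query: precision, recall and F1.
# The ID sets below are hypothetical examples.

def retrieval_metrics(retrieved_ids: list[str], relevant_ids: set[str]) -> dict:
    retrieved = set(retrieved_ids)
    hits = len(retrieved & relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 3 of the 5 retrieved chunks were actually relevant.
print(retrieval_metrics(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5", "d9"}))
# {'precision': 0.6, 'recall': 0.75, 'f1': 0.666...}

Tracking these numbers per query, against a hand-labeled set of relevant documents, is what makes the continuous-improvement loop measurable.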

4. Bake in control

Whether or not RAG is involved, data privacy and security, data governance and regulatory compliance can spell death to any data project, especially when it involves connecting multiple systems.

RAG connects LLMs, which can be external-facing, with proprietary data, thereby increasing the risk of exposure to sensitive information. Some regulatory standards outright forbid this step for PII data, while others require complex guardrails and protections to achieve compliance.

At the same time, restrictions are vital to ensure data governance and prevent information from becoming outdated or biased. Control over data intake, access, management and storage practices is crucial both to protect data from leaks and to defend it from contamination.

Best practices for RAG-related data controls include:

  • Role-based access management (see the sketch after this list).
  • Tracking data lineage to verify data sources.
  • Keeping an audit trail (logging) of all queries, responses and retrieved sources.
  • Setting guardrails around access to sensitive data.
  • Establishing and enforcing clear policies for data like PII.
  • Actively uncovering and addressing potential biases in responses.
  • Implementing accuracy measures specific to GenAI.
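To illustrate how some of these controls fit into the retrieval path, the sketch below applies role-based filtering and a simple PII guardrail to retrieved chunks before anything reaches the LLM. The role names, clearance levels and metadata fields are assumptions for illustration only.

# Role-based filtering of retrieved chunks before they reach the LLM.
# Role names, clearance levels and metadata fields are illustrative assumptions.

ROLE_CLEARANCE = {
    "analyst": {"public", "internal"},
    "finance_lead": {"public", "internal", "confidential"},
}

def filter_chunks_by_role(chunks: list[dict], role: str) -> list[dict]:
    allowed = ROLE_CLEARANCE.get(role, {"public"})
    permitted = []
    for chunk in chunks:
        if chunk.get("classification", "confidential") not in allowed:
            continue  # guardrail: never pass restricted content to the LLM
        if chunk.get("contains_pii", False):
            continue  # enforce the PII policy at retrieval time
        permitted.append(chunk)
    return permitted  # log the permitted chunk IDs here to keep an audit trail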

5. Give prompt engineering the respect it deserves

Data engineers and analytics project leaders often overlook the need for prompt engineering skills, considering it something that LOB users have to learn for themselves. This can be a serious oversight, because well-designed prompts are crucial to the accuracy of any BI query, especially when RAG is involved.

Confusing or ambiguous prompts can lead LLMs to draw on incorrect information, misunderstand the context and provide the wrong answer. Users who don’t know how to write good prompts may blame the GenBI setup and ultimately the RAG process, when the fault lies in the prompt itself.

“Don’t treat prompt engineering like an end-user issue,” warns Donald Farmer from TreeHive Strategy. “Well-designed prompts help the LLM properly contextualize retrieved information and generate appropriate responses,” he explains.

Best practices for prompt engineering include:

  • Standardizing prompt templates for the most common use cases (a sketch follows this list).
  • Developing clear instructions for source citation.
  • Testing and iterating prompt templates.
  • Offering training for employees.
  • Specifying the level of detail and format for responses.
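A minimal standardized template that covers several of these practices might look like the sketch below. The placeholder names, citation format and response format are assumptions, not a fixed standard.

# A standardized RAG prompt template. Placeholder names and the citation
# format are illustrative assumptions.

ANALYTICS_PROMPT = """You are a business-intelligence assistant.
Answer the question using only the retrieved context below.
Cite each fact as [source_id]. If the context does not contain the answer,
say so explicitly instead of guessing.

Respond with a one-sentence summary followed by up to five bullet points.

Context:
{context}

Question: {question}
"""

def build_prompt(question: str, chunks: list[dict]) -> str:
    # Prefix each chunk with its source ID so the model can cite it.
    context = "\n\n".join(f"[{c['source_id']}] {c['text']}" for c in chunks)
    return ANALYTICS_PROMPT.format(context=context, question=question)

Templates like this can then be tested and iterated centrally, instead of leaving each end user to improvise their own wording.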

The right approach to RAG underpins impactful analytics projects.

As AI-powered data analytics becomes increasingly critical for enterprise decision-making, ensuring the accuracy, relevance and timeliness of insights is all the more crucial. RAG holds the key, but only when it’s implemented with care and knowledge. Taking the right approach to data processing and management, retrieval and prompt engineering lays the foundations for success.

This article is published as part of the Foundry Expert Contributor Network.

Original Link:https://www.infoworld.com/article/4070498/how-to-run-rag-projects-for-better-data-analytics-results.html
Originally Posted: Mon, 13 Oct 2025 09:03:00 +0000


Artifice Prime

Artifice Prime is an AI enthusiast with over 25 years of experience as a Linux sysadmin. They have an interest in artificial intelligence, its use as a tool to further humankind, and its impact on society.
