Chat with data the easy way in R or Python

News | September 4, 2025 | Artifice Prime

Even if you love to code, there probably are times when you’d rather ask a question like, “What topics generated the highest reader interest this year?” than write an SQL query with phrases like STRFTIME('%Y', Date) = STRFTIME('%Y', 'now'). And, if your data set has dozens of columns, it’s nice to be able to avoid looking up or remembering the exact name of each one.

Generative AI and large language models (LLMs) ought to be able to help you “talk” to your data, since they excel at understanding language. But as any serious genAI user knows by now, you can’t necessarily trust a chatbot’s results—especially when it’s doing calculations.

One way around that is to ask a model to write the code to query a data set. A human can then check to make sure the code is doing what it’s supposed to do. There are various ways to get an LLM to write code for data analysis. One of the easiest for R and Python users is with the querychat package available for both languages.

Also see my quick look at Databot, an AI assistant for querying and analyzing data in R and Python.

Natural language querying with querychat

Querychat is a chatbot data-analysis component for the Shiny web framework. It translates a plain-language request into SQL, runs that SQL on your data, then displays both its result and the SQL code that generated it. Showing raw SQL code might not be ideal for the average user, but it’s great for programming-literate users who want to verify a model’s response.

A key advantage of the querychat workflow is that you never have to send your data to an LLM, or to the cloud. Instead, you define your data structure, the LLM writes SQL code based on your definition of the data’s columns, and then that code is run on your machine and displayed for you to review. That keeps your data private. Another benefit is that this can work equally well whether your data set has 20 rows or 2 million—the limit is what your local system can handle, not the LLM’s context window.
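To make that workflow concrete, here is a minimal Python sketch of the same pattern outside querychat. The schema string and the SQL are illustrative stand-ins: in a real app, only the schema description would be sent to the LLM, which would return the SQL string, and the query would then run locally (here with sqlite3).

```python
import sqlite3

# The schema description is what would go to the LLM --
# the rows themselves never leave your machine.
schema = "games(team_won TEXT, home_score INTEGER, away_score INTEGER)"

# Pretend this SQL string came back from the model in response to
# "Which team won the highest-scoring game?"
llm_sql = """
SELECT team_won, home_score + away_score AS total_points
FROM games
ORDER BY total_points DESC
LIMIT 1
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (team_won TEXT, home_score INTEGER, away_score INTEGER)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?)",
    [("KC", 27, 20), ("PHI", 34, 31), ("SF", 17, 10)],
)

# Reading the SQL before (or after) running it is the human verification step.
print(llm_sql.strip())
print(conn.execute(llm_sql).fetchone())  # -> ('PHI', 65)
```

Querychat automates this loop for you and displays the generated SQL alongside the result, so the verification step stays easy.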

The querychat web app can answer questions about the data or filter the data display, all based on plain-language requests.

I’ll demonstrate how to use querychat with public data; namely, National Football League game results from last season (and this year when available). You can see examples for both R and Python. Let’s get started!

R with querychat

The first thing we’ll do is get and process the NFL data. For this example, I’ll use the nflverse suite of R packages to import the NFL game data into R. You can install it from CRAN with install.packages("nflverse").

In the code below, I load the nflreadr package from the nflverse as well as dplyr for basic data wrangling and feather for a file format used by both R and Python. You should also install these packages if you want to follow along and they’re not already on your system. Also, install querychat for R from GitHub with pak::pak("posit-dev/querychat/pkg-r").

I use nflreadr’s load_schedules() function to import data for the 2024 and 2025 NFL seasons, removing all games where the result is NA (representing games that are on the schedule but haven’t been played yet):

library(nflreadr)
library(dplyr)
library(feather)
game_data_all <- load_schedules(seasons = c(2024, 2025)) |>
  filter(!is.na(result))

Note that the nflverse game data is also available in CSV format for seasons since 1999. So, you could import it with your favorite CSV import function instead of using nflreadr. For example:

game_data_all <- rio::import("https://raw.githubusercontent.com/nflverse/nfldata/refs/heads/master/data/games.csv") |>
  filter(season %in% c(2024, 2025) & !is.na(result))

The load_schedules() function returns a data frame with 46 variables for metrics including game time, temperature, wind, playing surface, outdoor or dome, point spreads, and more. Run print(dictionary_schedules) to see a data frame with a data dictionary of all the fields.

Now that I have the data, I need to process it. I’m going to remove some ID fields I know I don’t want and keep everything else:

cols_to_remove <- c("old_game_id", "gsis", "nfl_detail_id", "pfr", "pff", 
                    "espn", "ftn", "away_qb_id", "home_qb_id", "stadium_id")

games <- game_data_all |>
  select(-all_of(cols_to_remove))

Although it’s obvious from the scores which teams won and lost, there aren’t actually columns for the winning and losing teams. In my tests, the LLM didn’t always write appropriate SQL when I asked about winning percentages. Adding team_won and team_lost columns makes that clearer for a model and simplifies the SQL queries needed. Then, I save the results to a feather file, a fast format for either R or Python:

games <- games |>
  mutate(
    team_won = case_when(
      home_score > away_score ~ home_team,
      away_score > home_score ~ away_team,
      .default = NA
    ),
    team_lost = case_when(
      home_score > away_score ~ away_team,
      away_score > home_score ~ home_team,
      .default = NA
    )
  )

write_feather(games, "games.feather")

Querying your data with querychat in R

You’ll likely get more accurate querychat results if you add a couple of optional arguments when setting up your chat, such as a description of your data columns. It’s better to explain what each of your data fields represents than have the model try to guess—the same as when you’re working with human colleagues!

The data description can be a Markdown or a text file, and you can use your judgment on how to structure the text. The advice from the documentation is, “Just put whatever information, in whatever format, you think a human would find helpful.”

Since nflreadr comes with a data dictionary for this data set, I started with that built-in dictionary_schedules data frame. I then deleted definitions of columns I removed from my data, converted the data frame into plain text, and saved that to a data_dictionary.txt file:

data_dictionary <- dictionary_schedules |>
  filter(!(field %in% cols_to_remove) )

text_for_file <- paste0(data_dictionary$field, " (", data_dictionary$data_type, "): ", data_dictionary$description)

cat(text_for_file, sep = "\n", file = "data_dictionary.txt")

Next, I opened and edited that file manually to add “This is data about recent National Football League games, one row per game” and definitions for my team_won and team_lost columns:

team_won (character): Name of winning team
team_lost (character): Name of losing team

You should also include a greetings file with an initial greeting for the chatbot and a few sample questions. While that’s not required, greetings and sample questions will be generated for you if you don’t provide them, which costs both time and tokens. In fact, I received a warning in the console when I created an app without a greeting file: “For faster startup, lower cost, and determinism, please save a greeting and pass it to querychat_init().”

Sample questions use the format <span class="suggestion">Your question here</span>.

This is my greeting.md file, which includes how I want my chatbot to greet users and what sample questions I want to display:

I can analyze data from the 2024 NFL season (and soon 2025). When asking about teams, you can use 2- or 3-letter abbreviations such as NYG and SF. You can ask me questions such as

- **Filter and Sort Table**
    - Show only SF games
    - Show only PHI home games
    - Sort table by largest difference in home & away scores    

- **Answer Questions**
    - How did KC do in night games?
    - What was MIA's won-loss record in games under 40 degrees?

The sample questions are formatted so that if you click on them in the app, the text pops into the chat text input.

Data to query, a data dictionary file, and a greeting file are enough to create a working chat infrastructure. The R code below uses querychat in a simple Shiny web app that lets you ask plain-language questions about the NFL data. Install any additional needed libraries not already on your system, such as shiny, bslib, and DT. It doesn’t look very pretty, but this version only needs approximately 20 lines of code:

library(shiny)
library(bslib) # Standard if you want updated Bootstrap 
library(querychat)
library(DT)
library(feather)

games <- read_feather("games.feather")

# Create querychat configuration object 
querychat_config <- querychat_init(
  games,
  data_description = readLines("data_dictionary.txt"),
  greeting = readLines("greeting.md"),
  # This is the syntax for setting a specific model:
  create_chat_func = purrr::partial(ellmer::chat_openai, model = "gpt-4.1") )

ui <- page_sidebar(title = "NFL Game Data 🏈",
  sidebar = querychat_sidebar("chat"),
  DT::DTOutput("dt") )

server <- function(input, output, session) {
  # Creates a querychat object using the configuration above
  querychat <- querychat_server("chat", querychat_config)
  
  output$dt <- DT::renderDT({
    #    querychat$df() is the filtered/sorted reactive data built into querychat
    DT::datatable(querychat$df(), filter = 'none') # Set filter = 'top' if you want to filter table manually
  })
}

shinyApp(ui, server)

Basic querychat results in R. (Image: Sharon Machlis)

The app typically either filters/sorts the data or answers questions, but you can ask it to do both with a query like, “First show all PHI home games. Then tell me PHI home winning percentage.”

As with any Shiny app, you can customize the look and feel with cards, CSS, and more.

I preferred GPT-4.1 for this app compared with early tests using GPT-5. GPT-5 was slower and sometimes gave errors saying my account needed to be verified to use the model, even though it had already been approved.

Python with querychat

As we did in the previous example, the first step is to get and process the NFL data. If you work in both R and Python, you can use the feather data file from the previous example in the Python app, too. If you’d rather pull the data with Python, there’s a Python package, nfl-data-py, that imports nflverse data.

Below, I import data for the 2024 and 2025 seasons, drop ID columns I know I don’t want, delete rows without result values (scheduled but not played yet), and add columns for team_won and team_lost. The data wrangling script needs the nfl_data_py, pandas, numpy, and pyarrow packages. The app also needs chatlas, querychat, shiny, and python-dotenv.

During my testing, I occasionally had trouble installing querychat with uv add querychat, but uv pip install git+https://github.com/posit-dev/querychat.git was reliable.

This is the Python data wrangling code:

import nfl_data_py as nfl
import pandas as pd
import numpy as np

game_data_all_py = nfl.import_schedules(years=[2024, 2025])

columns_to_drop = [
    'old_game_id', 'gsis', 'nfl_detail_id', 'pfr', 'pff', 'espn', 'ftn',
    'away_qb_id', 'home_qb_id', 'stadium_id'
]
games_py = game_data_all_py.drop(columns=columns_to_drop)
games_py = games_py.dropna(subset=['result'])

conditions = [
    games_py['home_score'] > games_py['away_score'],
    games_py['away_score'] > games_py['home_score']
]

# Define the outcomes for the 'team_won' column
winner_outcomes = [
    games_py['home_team'],
    games_py['away_team']
]

# Define the outcomes for the 'team_lost' column
loser_outcomes = [
    games_py['away_team'],
    games_py['home_team']
]

# Create the new columns based on the just-defined conditions.
games_py['team_won'] = np.select(conditions, winner_outcomes, default=None)
games_py['team_lost'] = np.select(conditions, loser_outcomes, default=None)

games_py.reset_index(drop=True).to_feather("games_py.feather")
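Since the wrangled frame feeds everything downstream, it’s worth a quick sanity check that the np.select logic labels winners correctly and leaves ties missing in both columns. The toy scores below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy scores covering the three cases: home win, away win, and a tie.
toy = pd.DataFrame({
    "home_team": ["PHI", "NYG", "SF"],
    "away_team": ["DAL", "WAS", "LA"],
    "home_score": [30, 20, 17],
    "away_score": [13, 27, 17],
})

conditions = [
    toy["home_score"] > toy["away_score"],
    toy["away_score"] > toy["home_score"],
]

# Same np.select pattern as the wrangling script above.
toy["team_won"] = np.select(conditions, [toy["home_team"], toy["away_team"]], default=None)
toy["team_lost"] = np.select(conditions, [toy["away_team"], toy["home_team"]], default=None)

print(toy[["team_won", "team_lost"]].to_dict("list"))
# -> {'team_won': ['PHI', 'WAS', None], 'team_lost': ['DAL', 'NYG', None]}
```

The tie row comes out as None in both columns, matching how the R code’s case_when .default behaves.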

There’s nothing specific to the programming language about the greeting.md file; you can either use the one I wrote for the R example or create your own.

Querying your data using querychat and Python

As mentioned previously, you’ll get better results if you provide greeting and data-dictionary text files to querychat.

Whereas R’s NFL package has a function to print out a data dictionary, I didn’t see one in the Python version. However, you can access the R function’s raw CSV dictionary data on GitHub. Here’s the Python code to generate a data dictionary text file from that CSV:

dictionary_url = "https://raw.githubusercontent.com/nflverse/nflreadr/1f23027a27ec565f1272345a80a208b8f529f0fc/data-raw/dictionary_schedules.csv"
dictionary_df = pd.read_csv(dictionary_url)

# Drop dictionary rows for the columns removed from the data frame
filtered_dictionary_df = dictionary_df[~dictionary_df['field'].isin(columns_to_drop)].copy()

# Add entries for the new team_won and team_lost columns
new_rows_data = {
    'field': ['team_won', 'team_lost'],
    'type': ['character', 'character'],
    'description': ['Name of winning team', 'Name of losing team']
}
new_rows_df = pd.DataFrame(new_rows_data)
final_dictionary_df = pd.concat([filtered_dictionary_df, new_rows_df], ignore_index=True)

lines_for_file = [
    f"{row['field']} ({row['type']}): {row['description']}"
    for index, row in final_dictionary_df.iterrows()
]

with open("data_dictionary_py.txt", "w") as f:
    f.write("\n".join(lines_for_file))

Now we’re ready to use querychat in a Python Shiny app.

Below is code for a simple querychat Python Shiny app based on the querychat Python docs. I used an .env file to store my OpenAI API key.

Save the following code to app.py and launch the app with shiny run app.py. I ran it inside the Positron IDE with the Shiny extension installed; VS Code should work the same:


import chatlas
import querychat as qc
import pandas as pd
from shiny import App, render, ui
from dotenv import load_dotenv
from pathlib import Path

load_dotenv()

games = pd.read_feather("games_py.feather")


def use_openai_models(system_prompt: str) -> chatlas.Chat:
    return chatlas.ChatOpenAI(
        model="gpt-4.1",
        system_prompt=system_prompt,
    )

querychat_config = qc.init(
    data_source=games,
    table_name="games",
    greeting=Path("greeting.md"),
    data_description=Path("data_dictionary_py.txt"),
    create_chat_callback=use_openai_models
)

# Create UI
app_ui = ui.page_sidebar(
    qc.sidebar("chat"),
    ui.output_data_frame("data_table"),
)


# Shiny server logic
def server(input, output, session):
    # This creates a querychat object using the configuration above
    chat = qc.server("chat", querychat_config)

    #    chat.df() is the filtered/sorted reactive data frame
    @render.data_frame
    def data_table():
        return chat.df()

# Create Shiny app
app = App(app_ui, server)

You should now have a basic working Shiny app that answers questions about NFL data.

There’s a lot more you can do with these packages, including connecting to an SQL database and enhancing the Shiny application. Posit CTO Joe Cheng, the creator of querychat, demo’d an expanded Python version as a full-blown querychat dashboard app. There’s a template for it in the Shiny for Python gallery:

Screenshot of the Python sidebot for querychat. (Image: Sharon Machlis)

There is also a demo repo for an R sidebot you can run locally by cloning the R sidebot GitHub repository. Or, you can examine the app.R file in the repository to get ideas for how you might create one.

Rather than do the work myself, I had the Claude and Gemini chatbots build me a more robust version of the basic NFL Shiny app I’ve demo’d here. You can see the code for that application, shown below, in my querychat app repository.

Querychat sidebot for R. (Image: Sharon Machlis)

Original Link:https://www.infoworld.com/article/4040535/chat-with-your-data-the-easy-way-in-r-or-python.html
Originally Posted: Thu, 04 Sep 2025 09:00:00 +0000


Artifice Prime

Artifice Prime is an AI enthusiast with over 25 years of experience as a Linux sys admin. They are interested in artificial intelligence, its use as a tool to further humankind, and its impact on society.
