7 newer data science tools you should be using with Python
Python’s rich ecosystem of data science tools is a big draw for users. The only downside of such a broad and deep collection is that sometimes the best tools can get overlooked.
Here’s a rundown of some of the best newer or less-known data science projects available for Python. Some, like Polars, are getting more attention but still deserve wider notice. Others, like ConnectorX, are hidden gems.
ConnectorX
Most data sits in a database somewhere, but computation typically happens outside of it. Getting data to and from the database for actual work can be a bottleneck. ConnectorX loads data from databases into many common data-wrangling tools in Python, and it keeps things fast by minimizing the work required. Most of the data loading can be done in just a couple of lines of Python code and an SQL query.
Like Polars (which I’ll discuss shortly), ConnectorX uses a Rust library at its core. This allows for optimizations like being able to load from a data source in parallel with partitioning. Data in PostgreSQL, for instance, can be loaded this way by specifying a partition column.
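To make the partitioning idea concrete, here is a minimal sketch that uses only Python's standard library rather than ConnectorX itself. It splits the value range of a numeric column into contiguous chunks, runs one range query per chunk on its own connection, and stitches the results together. The function names, the throwaway SQLite file, and the items table are all assumptions made up for illustration, and the table and column names are trusted rather than sanitized.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import closing


def partition_bounds(lo, hi, parts):
    """Split the inclusive range [lo, hi] into at most `parts` contiguous ranges."""
    size = (hi - lo + 1 + parts - 1) // parts  # ceiling division
    return [(start, min(start + size - 1, hi)) for start in range(lo, hi + 1, size)]


def load_partitioned(db_path, table, column, parts):
    """Load `table` with one range query per partition, each on its own connection."""
    with closing(sqlite3.connect(db_path)) as con:
        lo, hi = con.execute(
            f"SELECT MIN({column}), MAX({column}) FROM {table}"
        ).fetchone()

    def fetch(bounds):
        # Each worker thread opens a separate connection to the same file.
        with closing(sqlite3.connect(db_path)) as con:
            query = f"SELECT * FROM {table} WHERE {column} BETWEEN ? AND ?"
            return con.execute(query, bounds).fetchall()

    with ThreadPoolExecutor(max_workers=parts) as pool:
        chunks = list(pool.map(fetch, partition_bounds(lo, hi, parts)))
    return [row for chunk in chunks for row in chunk]


# Usage: build a throwaway database, then load it in four partitions.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
with closing(sqlite3.connect(path)) as con:
    con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
    con.executemany(
        "INSERT INTO items VALUES (?, ?)", [(i, f"v{i}") for i in range(1, 101)]
    )
    con.commit()

rows = load_partitioned(path, "items", "id", parts=4)
print(len(rows))  # prints 100

# With ConnectorX, the same intent is expressed in a single call, along the lines of:
#   cx.read_sql(connection_url, "SELECT * FROM items", partition_on="id", partition_num=4)
```

The sketch assumes a non-empty table with an integer partition column. ConnectorX does this work in Rust and writes directly into the destination's memory, so treat the code above as an illustration of the splitting strategy only, not of its performance.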
Aside from PostgreSQL, ConnectorX also supports reading from MySQL/MariaDB, SQLite, Amazon Redshift, Microsoft SQL Server and Azure SQL, and Oracle. The results can be funneled into a Pandas DataFrame or a PyArrow Table, or into Modin or Dask (via Pandas), or Polars (via PyArrow). General support for reading from ODBC is a work in progress.
DuckDB
Data science folks who use Python ought to be aware of SQLite—a small, but powerful and speedy relational database packaged with Python. Since it runs as an in-process library, rather than a separate application, SQLite is lightweight and responsive.
DuckDB is a little like someone answered the question, “What if we made SQLite for OLAP?” Like other OLAP database engines, it uses a columnar datastore and is optimized for long-running analytical query workloads. But DuckDB gives you all the things you expect from a conventional database, like ACID transactions. And there’s no separate software suite to configure; you can get it running in a Python environment with a single pip install duckdb command.
DuckDB can directly ingest data in CSV, JSON, or Parquet format, as well as a slew of other common data sources. The resulting databases can also be partitioned into multiple physical files for efficiency, based on keys (e.g., by year and month). Querying works like any other SQL-powered relational database, but with additional built-in features like the ability to take random samples of data or use window functions.
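If window functions are new to you, the short example below shows what one does: it computes a running total per region while keeping every row. To avoid any extra install, the sketch runs against the standard library's sqlite3 module (which needs SQLite 3.25 or later for window functions). The query is standard SQL, and my assumption is that it runs unchanged through DuckDB's Python client via duckdb.connect() and execute(). The sales table and its values are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 100), ("east", 2, 150), ("west", 1, 80), ("west", 2, 60)],
)

# A window function adds an aggregate column without collapsing the rows:
# here, a running total of amount within each region, ordered by month.
running = con.execute(
    """
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
    ORDER BY region, month
    """
).fetchall()
con.close()

for row in running:
    print(row)
```

Each output row carries its own running_total, so the east rows end at 250 and the west rows at 140.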
DuckDB also has a small but useful collection of extensions, including full-text search, accelerated vector similarity search, Excel import/export, direct connections to SQLite and PostgreSQL, Parquet file export, and support for many common geospatial data formats and types.
Optimus
One of the least enviable jobs you can be stuck with is cleaning and preparing data for use in a DataFrame-centric project. Optimus is an all-in-one tool set for loading, exploring, cleansing, and writing data back out to a variety of data sources.
Optimus can use Pandas, Dask, cuDF (and Dask + cuDF), Vaex, or Spark as its underlying data engine. Data can be loaded in from and saved back out to Arrow, Parquet, Excel, a variety of common database sources, or flat-file formats like CSV and JSON.
The data manipulation API resembles Pandas, but adds .rows() and .cols() accessors that make it easy to sort a DataFrame, filter by column values, alter data according to criteria, or restrict operations to a subset of the data. Optimus also comes bundled with processors for handling common real-world data types like email addresses and URLs.
One possible issue with Optimus is that although it is still under active development, its last official release was in 2020. This means it might not be as current as other components in your stack.
Polars
If you spend much time working with DataFrames and you’re frustrated by the performance limits of Pandas, reach for Polars. This DataFrame library for Python offers a convenient syntax similar to Pandas.
Unlike Pandas, though, Polars uses a library written in Rust that takes maximum advantage of your hardware out of the box. You don’t need to use special syntax to take advantage of performance-enhancing features like parallel processing or SIMD; it’s all automatic. Even simple operations like reading from a CSV file are faster. Rust developers can craft their own Polars extensions using pyo3.
Polars provides eager and lazy execution modes, so queries can be executed immediately or deferred until needed. It also provides a streaming API for processing queries incrementally. Streaming isn’t available yet for many functions, although Polars can always fall back to the in-memory engine for such operations if need be. You can also plot the execution graph for a query, streaming or otherwise (via the external Graphviz library), if you want to get an idea of what its memory or CPU consumption will be like.
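The difference between eager and lazy execution is easier to see in miniature. The toy class below is not Polars and borrows nothing from its internals; it is a plain-Python sketch, with names I made up, of the general idea. Each method call only records a step, nothing touches the data until collect() is called, and rows then flow through the recorded steps one at a time. In Polars itself, the lazy entry points are calls such as pl.scan_csv() or DataFrame.lazy(), and collect() triggers execution.

```python
def _project(rows, columns):
    """Yield each row reduced to the requested columns."""
    for row in rows:
        yield {name: row[name] for name in columns}


class LazyQuery:
    """Records operations and runs them only when collect() is called."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Return a new query with one more recorded step; no data is read yet.
        return LazyQuery(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyQuery(self._rows, self._steps + (("select", columns),))

    def explain(self):
        """Describe the recorded plan without executing it."""
        return " -> ".join(["scan"] + [kind for kind, _ in self._steps])

    def collect(self):
        """Execute the plan, pulling rows through the steps one at a time."""
        rows = iter(self._rows)
        for kind, arg in self._steps:
            if kind == "filter":
                rows = filter(arg, rows)
            else:
                rows = _project(rows, arg)
        return list(rows)


data = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 9},
    {"name": "c", "score": 7},
]
query = LazyQuery(data).filter(lambda row: row["score"] > 5).select("name")
print(query.explain())  # scan -> filter -> select
print(query.collect())  # [{'name': 'b'}, {'name': 'c'}]
```

Because the whole plan is known before anything runs, a real engine can reorder or merge steps first, which is where much of the benefit of lazy mode comes from. This toy version skips that optimization step entirely.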
DVC
A major and pervasive issue with data science experiments is version control: not of the project’s code, but of its data. DVC, short for Data Version Control, lets you attach version descriptors to datasets, check them into Git as you would the rest of your code, and keep versions of data and code in sync.
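The general trick behind versioning data next to code is to keep the large file out of Git and commit a small pointer file that records its checksum. The sketch below illustrates that idea using only the standard library. It is not DVC's implementation or file format: the track() and restore() helpers, the .pointer.json suffix, and the use of SHA-256 are all my own assumptions for the example. With DVC, the corresponding steps are handled by commands such as dvc add and dvc checkout.

```python
import hashlib
import json
import os
import shutil
import tempfile


def track(data_path, cache_dir):
    """Copy a data file into a content-addressed cache and write a pointer file."""
    digest = hashlib.sha256()
    with open(data_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    checksum = digest.hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, checksum)
    if not os.path.exists(cached):
        shutil.copyfile(data_path, cached)
    pointer_path = data_path + ".pointer.json"
    with open(pointer_path, "w") as f:
        json.dump({"path": os.path.basename(data_path), "sha256": checksum}, f)
    return pointer_path


def restore(pointer_path, cache_dir):
    """Recreate the data file that a pointer file describes."""
    with open(pointer_path) as f:
        meta = json.load(f)
    target = os.path.join(os.path.dirname(pointer_path), meta["path"])
    shutil.copyfile(os.path.join(cache_dir, meta["sha256"]), target)
    return target


work = tempfile.mkdtemp()
cache = os.path.join(work, "cache")
data = os.path.join(work, "train.csv")

with open(data, "w") as f:
    f.write("id,label\n1,cat\n")
pointer = track(data, cache)
with open(pointer) as f:
    pointer_v1 = f.read()  # the small file a Git commit would preserve

with open(data, "w") as f:
    f.write("id,label\n1,dog\n")
track(data, cache)  # the pointer now describes version 2

with open(pointer, "w") as f:  # simulate checking out the older commit
    f.write(pointer_v1)
restore(pointer, cache)
with open(data) as f:
    restored = f.read()
print(restored)
```

After the final restore, train.csv holds the first version again, even though the data itself never went into Git. Only the pointer did.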
DVC can track almost any kind of dataset, as long as it can be expressed as a file, whether kept in local storage or in a remote storage service like an Amazon S3 bucket. You can describe how data models are managed and used by way of a “pipeline,” which DVC’s documentation describes as being like “a Makefile system for machine learning projects.”
The use cases for DVC are intended to be more than just allowing data to be versioned alongside code. It also works as a fast data cache for remotely hosted data, a methodology for tracking experiments conducted with data, and a registry or catalog for machine learning models created with the data. Visual Studio Code users can integrate DVC workflows into the editor by way of the DVC VS Code extension.
Cleanlab
Good machine learning datasets are hard to come by, because it’s expensive and time-consuming to create clean, properly labeled data. Sometimes, though, you have no choice but to use data that’s raw and inconsistent. Cleanlab (as in, “cleans labels”) was made for this scenario.
Cleanlab uses existing, high-quality machine learning datasets to analyze lower-quality, unlabeled (or poorly labeled) datasets. You create a model based on the original dataset, use Cleanlab to figure out what needs to be improved in the original dataset, then re-train using your automatically cleaned and adjusted dataset to see the difference.
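To give a feel for how a trained model can point at suspect labels, here is a heavily simplified, plain-Python sketch in the spirit of the confident learning approach that Cleanlab is built around. It is not Cleanlab's code. The flag_label_issues() name and the toy numbers are mine, and it assumes every class appears at least once and that the probabilities come from a model that did not train on the examples being scored. In Cleanlab itself, the comparable entry point is cleanlab.filter.find_label_issues(), which takes the given labels and the predicted probabilities.

```python
def flag_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with a confident prediction."""
    n_classes = len(pred_probs[0])

    # Per-class threshold: the average probability the model assigns to class k
    # among the examples that are currently labeled k.
    thresholds = []
    for k in range(n_classes):
        own = [probs[k] for probs, given in zip(pred_probs, labels) if given == k]
        thresholds.append(sum(own) / len(own))

    issues = []
    for i, (probs, given) in enumerate(zip(pred_probs, labels)):
        # Classes the model is at least as sure about as it typically is.
        confident = [k for k in range(n_classes) if probs[k] >= thresholds[k]]
        if confident:
            likely = max(confident, key=lambda k: probs[k])
            if likely != given:
                issues.append(i)
    return issues


labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but the model strongly prefers class 1
    [0.20, 0.80],
    [0.10, 0.90],
    [0.85, 0.15],  # labeled 1, but the model strongly prefers class 0
]
print(flag_label_issues(labels, pred_probs))  # [2, 5]
```

The flagged rows are candidates for review or removal before re-training, not proof of an error. A confident model can still be wrong.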
Cleanlab is data-model and data-framework agnostic, a powerful aspect of its design. It doesn’t matter if you’re running PyTorch, OpenAI, scikit-learn, or TensorFlow; Cleanlab can work with any classifier. It does, however, have specific workflows for common tasks like token classification, multi-labeling, regression, image segmentation and object detection, outlier detection, and so on. It’s worth perusing the example set to see for yourself how the process works and what results you can expect.
Snakemake
Data science workflows are hard to set up, and even harder to set up in a consistent, predictable way. Snakemake was created to automate the process, setting up data analysis workflows in ways that ensure everyone gets the same results. Many existing data science projects rely on Snakemake. The more moving parts you have in your data science workflow, the more likely you’ll benefit from automating that workflow with Snakemake.
Snakemake workflows resemble GNU Make workflows—you define the steps of the workflow with rules, which specify what they take in, what they put out, and what commands to execute to accomplish that. Workflow rules can be multithreaded (assuming that gives them any benefit), and configuration data can be piped in from JSON or YAML files. You can also define functions in your workflows to transform data used in rules, and write the actions taken at each step to logs.
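The Make-style model described above is simple enough to sketch in a few lines of plain Python. This is not Snakemake and does not use its rule syntax (real Snakefiles declare rules with directives such as input, output, and shell). The build() function, the rule dictionary layout, and the two toy steps are assumptions for illustration. Each rule maps an output file to its inputs and an action, and a rule runs only when its output is missing or older than one of its inputs.

```python
import os
import tempfile


def build(target, rules, log):
    """Bring `target` up to date by running the rule that produces it, inputs first."""
    rule = rules.get(target)
    if rule is None:
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule produces {target}")
        return  # a plain source file, nothing to do
    inputs, action = rule
    for path in inputs:
        build(path, rules, log)
    stale = not os.path.exists(target) or any(
        os.path.getmtime(path) > os.path.getmtime(target) for path in inputs
    )
    if stale:
        action(inputs, target)
        log.append(target)  # record each step that actually ran


def clean(inputs, output):
    """Lowercase the input lines and drop the blank ones."""
    with open(inputs[0]) as src, open(output, "w") as dst:
        dst.writelines(line.strip().lower() + "\n" for line in src if line.strip())


def count(inputs, output):
    """Write the number of lines in the input file."""
    with open(inputs[0]) as src, open(output, "w") as dst:
        dst.write(str(sum(1 for _ in src)))


work = tempfile.mkdtemp()
raw, cleaned, report = (
    os.path.join(work, name) for name in ("raw.txt", "clean.txt", "count.txt")
)
with open(raw, "w") as f:
    f.write("Alpha\n\nBeta\nGamma\n")

rules = {cleaned: ([raw], clean), report: ([cleaned], count)}
log = []
build(report, rules, log)  # runs clean, then count
build(report, rules, log)  # nothing is stale, so nothing runs
print([os.path.basename(path) for path in log])  # ['clean.txt', 'count.txt']
```

Asking for the final report is enough to pull in every upstream step, and the second call does no work because the outputs are already current. Snakemake layers threading, configuration files, logging, and environment management on top of this same core idea.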
Snakemake jobs are designed to be portable—they can be deployed on any Kubernetes-managed environment, or in specific cloud environments like Google Cloud Life Sciences or Tibanna on AWS. Workflows can be “frozen” to use a specific set of packages, and successfully executed workflows can have unit tests automatically generated and stored with them. And for long-term archiving, you can store the workflow as a tarball.
Original Link: https://www.infoworld.com/article/2338444/7-newer-data-science-tools-you-should-be-using-with-python.html
Originally Posted: Wed, 15 Oct 2025 09:00:00 +0000