6 essential Python tools for data science—now improved

SciPy, Numba, Cython, Dask, Vaex, and Intel SDC all have new versions that aid big data analytics and machine learning projects.

If you want to master, or even just use, data analysis, Python is the place to do it. Python is easy to learn, it has vast and deep support, and almost every data science library and machine learning framework out there has a Python interface.

Over the past few months, several data science projects for Python have released new versions with major feature updates. Some are about actual number-crunching; others make it easier for Pythonistas to write fast code optimized for those jobs.

Python data science essential: SciPy 1.7

Python users who want a fast and powerful math library can use NumPy, but NumPy by itself isn’t very task-focused. SciPy uses NumPy to provide libraries for common math- and science-oriented programming tasks, from linear algebra to statistical work to signal processing.
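
As a rough illustration of that task-focused layer, here is a minimal sketch that uses scipy.linalg to solve a small linear system and scipy.stats to fit a line. The numbers are made up for the example.

```python
import numpy as np
from scipy import linalg, stats

# Solve the linear system 3x + y = 9, x + 2y = 8 (illustrative values).
a = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(a, b)

# Fit a straight line to points generated from y = 2x + 1.
xs = np.arange(10.0)
ys = 2.0 * xs + 1.0
fit = stats.linregress(xs, ys)

print(solution, fit.slope, fit.intercept)
```

Both calls take and return ordinary NumPy arrays, which is what makes SciPy easy to drop into existing NumPy code.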

How SciPy helps with data science

SciPy has long been useful for providing convenient and widely used tools for working with math and statistics. But for the longest time, it didn’t have a proper 1.0 release, although it had strong backward compatibility across versions.

The trigger for bringing the SciPy project to version 1.0, according to core developer Ralf Gommers, was chiefly a consolidation of how the project was governed and managed. But it also included a process for continuous integration for the MacOS and Windows builds, as well as proper support for prebuilt Windows binaries. This last feature means Windows users can now use SciPy without having to jump through additional hoops.

Since the SciPy 1.0 release in 2017, the project has delivered seven major point releases, with many improvements along the way:

  • Deprecation of Python 2.7 support, and a subsequent modernization of the code base.
  • Constant improvements and updates to SciPy’s submodules, with more functionality, better documentation, and many new algorithms — e.g., a new fast Fourier transform module with better performance and modernized interfaces.
  • Better support for functions in LAPACK, a Fortran package for solving common linear equation problems.
  • Better compatibility with the alternative Python runtime PyPy, which includes a JIT compiler for faster long-running code.
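
The newer fast Fourier transform module mentioned above is scipy.fft. A minimal sketch of how it might be used to find the dominant frequency in a signal follows; the sample rate and the test signal are assumptions chosen for the example.

```python
import numpy as np
from scipy import fft

sample_rate = 1000  # samples per second (assumed for this example)
t = np.arange(sample_rate) / sample_rate  # one second of timestamps

# Synthetic signal: a strong 50 Hz tone plus a weaker 120 Hz tone.
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# rfft computes the transform for real-valued input;
# rfftfreq gives the frequency of each output bin.
spectrum = fft.rfft(signal)
freqs = fft.rfftfreq(signal.size, d=1.0 / sample_rate)

dominant = freqs[np.argmax(np.abs(spectrum))]
print(dominant)
```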

Where to download SciPy

SciPy binaries are available from the Python Package Index and can be installed by typing pip install scipy. Source code is available on GitHub.

Python data science essential: Numba 0.53.0

Numba lets Python functions or modules be compiled to machine code via the LLVM compiler framework. You can do this on the fly, whenever a Python program runs, or ahead of time. In that sense, Numba is like Cython. Numba is often more convenient to work with, although code accelerated with Cython is easier to distribute to third parties.

How Numba helps with data science

The most obvious way Numba helps data scientists is by speeding operations written in Python. You can prototype projects in pure Python, then annotate them with Numba to be fast enough for production use.

Numba can also target hardware built for machine learning and data science applications. Earlier versions of Numba supported compiling to CUDA-accelerated code, but the most recent versions sport a new, far more efficient GPU code reduction algorithm for faster compilation, as well as support for both Nvidia CUDA and AMD ROCm APIs.

Numba can also optimize JIT compiled functions for parallel execution across CPU cores whenever possible, although your code will need a little extra syntax to accomplish that properly.

Where to download Numba

Numba is available on the Python Package Index, and it can be installed by typing pip install numba from the command line. Prebuilt binaries are available for Windows, MacOS, and generic Linux. It’s also available as part of the Anaconda Python distribution, where it can be installed by typing conda install numba. Source code is available on GitHub.

Python data science essential: Cython 3.0 (beta)

Cython transforms Python code into C code that can run orders of magnitude faster. This transformation comes in most handy with code that is math-heavy or code that runs in tight loops, both of which are common in Python programs written for engineering, science, and machine learning.

How Cython helps with data science

Cython code is essentially Python code, with some additional syntax. Python code can be compiled to C with Cython, but the best performance improvements—on the order of tens to hundreds of times faster—come from using Cython’s type annotations.

Before Cython 3 came along, Cython sported a 0.xx version numbering scheme. With Cython 3, the language dropped support for Python 2 syntax. Despite Cython 3 still being in beta, Cython’s maintainers encourage people to use it in place of earlier versions. Cython 3 also emphasizes greater use of “pure Python” mode, in which many (although not all) of Cython’s functions can be made available using syntax that is 100% Python-compatible.
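
A rough sketch of what pure Python mode looks like: the type hints come from the cython module, so the file remains valid Python. It assumes the Cython package is installed so that import cython works; the function names are made up for the example.

```python
import cython

def f(x: cython.double) -> cython.double:
    return x * x

def integrate_f(a: cython.double, b: cython.double, n: cython.int) -> cython.double:
    # Typed locals let Cython generate a plain C loop when compiled.
    i: cython.int
    total: cython.double = 0.0
    dx: cython.double = (b - a) / n
    for i in range(n):
        total += f(a + i * dx)
    return total * dx

# Left Riemann sum approximating the integral of x**2 from 0 to 1.
area = integrate_f(0.0, 1.0, 1000)
print(area)
```

Run uncompiled, this behaves like ordinary Python. Compiled with Cython, the same file can use the annotations to produce typed C code.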

Cython also supports integration with IPython/Jupyter notebooks. Cython-compiled code can be used in Jupyter notebooks via inline annotations, as if Cython code were any other Python code.

You can also compile Cython modules for Jupyter with profile-guided optimization enabled. Modules built with this option are compiled and optimized based on profiling information generated for them, so they run faster. Note that this option is only available for Cython when used with the GCC compiler; MSVC support isn’t there yet.

Where to get Cython

Cython is available on the Python Package Index, and it can be installed with pip install cython from the command line. Binary versions for 32-bit and 64-bit Windows, generic Linux, and MacOS are included. Source code is on GitHub. Note that a C compiler must be present on your platform to use Cython.

Python data science essential: Dask 2021.07.0

Processing power is cheaper than ever, but it can be tricky to leverage it in the most powerful way—by breaking tasks across multiple CPU cores, physical processors, or compute nodes.

Dask takes a Python job and schedules it efficiently across multiple systems. And because the syntax used to launch Dask jobs is virtually the same as the syntax used to do other things in Python, taking advantage of Dask requires little reworking of existing code.

How Dask helps with data science

Dask provides its own versions of interfaces from many popular machine learning and scientific-computing libraries in Python. Its DataFrame object mirrors the one in the Pandas library; likewise, its Array object works much like NumPy’s. Thus Dask allows you to quickly parallelize existing code by changing only a few lines.
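
To give a sense of how few lines change, here is a minimal sketch using Dask's Array. The chunk size is an arbitrary choice for the example; DataFrames follow a similar pattern via dask.dataframe.

```python
import numpy as np
import dask.array as da

data = np.arange(1_000_000, dtype=np.float64)

# Wrap the NumPy array in a Dask array split into ten chunks.
lazy = da.from_array(data, chunks=100_000)

# This builds a task graph using NumPy-style syntax; nothing runs yet.
mean_of_squares = (lazy ** 2).mean()

# compute() executes the graph, processing chunks in parallel.
result = mean_of_squares.compute()
print(result)
```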

Dask can also be used to parallelize jobs written in pure Python, and it has object types (such as Bag) suited to optimizing operations like map, filter, and groupby on collections of generic Python objects.
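
A small sketch of Bag applied to a list of dictionaries follows. The records are invented for the example, and the threaded scheduler is chosen here only to keep the example simple.

```python
import dask.bag as db

# Generic Python objects: one dict per record (illustrative data).
records = [{"name": f"item{i}", "value": i} for i in range(1, 7)]

bag = db.from_sequence(records, npartitions=2)

# filter and map are lazy; they only describe the work to be done.
pipeline = (
    bag.filter(lambda r: r["value"] % 2 == 0)
       .map(lambda r: r["value"] * 10)
)

# Run the pipeline on the threaded scheduler.
result = pipeline.compute(scheduler="threads")
print(result)
```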

Where to download Dask

Dask is available on the Python Package Index, and can be installed via pip install dask. It’s also available via the Anaconda distribution of Python, by typing conda install dask. Source code is available on GitHub.

Python data science essential: Vaex 4.30 

Vaex allows users to perform lazy operations on big tabular datasets: essentially, dataframes in the style of Pandas. “Big” in this case means billions of rows, with all operations done as efficiently as possible, with zero copying of data, minimal memory usage, and built-in visualization tools.

How Vaex helps with data science

Working with large datasets in Python often involves a good deal of wasted memory or processing power, especially if the work only involves a subset of the data—e.g., one column from a table. Vaex performs computations on demand, when they’re actually needed, making the best use of available computing resources.

Where to download Vaex

Vaex is available on the Python Package Index, and can be installed with pip install vaex from the command line. Note that for best results, it’s recommended that you install Vaex in a virtual environment, or that you use the Anaconda distribution of Python.

Python data science essential: Intel SDC

Intel’s Scalable Dataframe Compiler (SDC), formerly the High Performance Analytics Toolkit (HPAT), is an experimental project for accelerating data analytics and machine learning on clusters. It compiles a subset of Python to code that is automatically parallelized across clusters using the mpirun utility from the Open MPI project.

How Intel SDC helps with data science

HPAT uses Numba, but unlike that project and Cython, it doesn’t compile Python as is. Instead, it takes a restricted subset of the Python language—chiefly, NumPy arrays and Pandas dataframes—and optimizes them to run across multiple nodes.

Like Numba, HPAT has the @jit decorator that can turn specific functions into their optimized counterparts. It also includes a native I/O module for reading from and writing to HDF5 (not HDFS) files.

Where to download Intel SDC

SDC is available only in source format at GitHub. Binaries are not provided.

Copyright © 2021 IDG Communications, Inc.