In my August 2020 post, “How to Choose a Cloud Machine Learning Platform,” my first recommendation for selecting a platform was to “Be close to your data.” Since the speed of light restricts transfer rates, keeping the code close to the data is needed to keep latency low. After all, machine learning, especially deep learning, has a habit of going through all of the data multiple times (each time through is called an epoch).
I said at the time that the best scenario for very large data sets is to construct the model where the data already exists, eliminating the need for mass data transfer. Several libraries, to varying degrees, accept this. The logical next question is, which databases help internal machine learning, and how? I’ll go over those databases alphabetically.
Amazon Redshift is a powered, petabyte-scale data warehouse solution built to make analysing all of your data from your existing business analytics software easy and cost-effective. It is designed for datasets ranging in size from a few hundred gigabytes to a petabyte or more, and it costs less than $1,000 each terabyte per year.
Amazon Redshift ML is designed to make it easy for SQL users to create, train, and deploy machine learning models using SQL commands. The CREATE MODEL command in Redshift SQL defines the data to use for training and the target column, then passes the data to Amazon SageMaker Autopilot for training via an encrypted Amazon S3 bucket in the same zone.
Redshift ML compiles the best model after AutoML training and registers it as a prediction SQL feature in your Redshift cluster. The model can then be used for inference by calling the predictor function inside a SELECT argument.
Summary: Redshift ML employs SageMaker Autopilot to generate prediction models automatically from data specified through a SQL statement and extracted to an S3 bucket. The best prediction function discovered is saved in the Redshift cluster.
BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem; it exists as an open-source project and a paid service. RAPIDS is a suite of open source software libraries and APIs, incubated by Nvidia, that uses CUDA and is based on the Apache Arrow columnar memory format. CuDF, part of RAPIDS, is a Pandas-like GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
Dask is a free and open-source platform for scaling Python packages across several machines. Dask can spread data and computation across several GPUs, either within the same device or across multiple nodes in a multi-node cluster. Dask works with RAPIDS cuDF, XGBoost, and RAPIDS cuML to allow GPU-accelerated data analytics and machine learning.
Summary: BlazingSQL can run GPU-accelerated queries on data lakes in Amazon S3, pass the resulting DataFrames to cuDF for data manipulation, and finally perform machine learning with RAPIDS XGBoost and cuML, and deep learning with PyTorch and TensorFlow.
Google Cloud BigQuery
BigQuery is Google Cloud’s managed, petabyte-scale data warehouse that lets you run analytics over vast amounts of data in near real time. BigQuery ML lets you create and execute machine learning models in BigQuery using SQL queries.
BigQuery ML facilitates forecasting with linear regression, grouping with binary and multi-class logistic regression, data segmentation with K-means clustering, and product recommendation systems with matrix factorization. time series for forecasting time series, including anomalies, seasonality, and holidays; Classification and regression models based on XGBoost; deep neural networks based on TensorFlow for classification and regression models; AutoML Tables; and TensorFlow model importing. You can train and forecast a model using data from several BigQuery datasets. The data from the data warehouse is not extracted by BigQuery ML. Using the TRANSFORM clause in your CREATE MODEL sentence, you can conduct function engineering with BigQuery ML.
Summary: BigQuery ML brings much of the power of Google Cloud Machine Learning into the BigQuery data warehouse with SQL syntax, without extracting the data from the data warehouse.
IBM Db2 Warehouse
IBM Db2 Warehouse on Cloud is a managed public cloud service. You can also set up IBM Db2 Warehouse on premises with your own hardware or in a private cloud. As a data warehouse, it includes features such as in-memory data processing and columnar tables for online analytical processing. Its Netezza technology provides a robust set of analytics that are designed to efficiently bring the query to the data. A range of libraries and functions help you get to the precise insight you need.
In-database deep learning in Python, R, and SQL is supported by Db2 Warehouse. The IDAX module contains analytical stored procedures such as analysis of variance, association rules, data transformation, decision trees, diagnostic measures, discretization and moments, K-means clustering, k-nearest neighbours, linear regression, metadata management, nave Bayes classification, principal component analysis, probability distributions, random sampling, regression trees, sequent analysis, and sequent analysis.
Summary: IBM Db2 Warehouse includes a wide set of in-database SQL analytics that includes some basic machine learning functionality, plus in-database support for R and Python.
Kinetica Streaming Data Warehouse combines historical and streaming data analysis with location intelligence and AI in a single platform, all accessible via API and SQL. Kinetica is a very fast, distributed, columnar, memory-first, GPU-accelerated database with filtering, visualization, and aggregation functionality.
Kinetica combines machine learning models and algorithms with your data to allow real-time predictive analytics at scale. It enables you to stream-calculate functionality and streamline your data streams and the lifecycle of your analytics, machine learning models, and data engineering. Kinetica offers a complete lifecycle approach for machine learning accelerated by GPUs, including guided Jupyter notebooks, RAPIDS algorithm training, and automatic model implementation and inference in the Kinetica platform.
Summary: Kinetica provides a full in-database lifecycle solution for machine learning accelerated by GPUs, and can calculate features from streaming data.
Microsoft SQL Server
Microsoft SQL Server Machine Learning Services supports R, Python, Java, the PREDICT T-SQL command, and the rx_Predict stored procedure in the SQL Server RDBMS, and SparkML in SQL Server Big Data Clusters. In the R and Python languages, Microsoft includes several packages and libraries for machine learning. You can store your trained models in the database or externally. Azure SQL Managed Instance supports Machine Learning Services for Python and R as a preview.
Microsoft R has extensions that allow it to process data both from disc and from memory. SQL Server has an extension module that allows R, Python, and Java code to access SQL Server data and functions. SQL Server Big Data Clusters use Kubernetes to run SQL Server, Spark, and HDFS. As SQL Server invokes Python code, Azure Machine Learning can be invoked and the resulting model saved in the database for use in predictions.
Summary: Current versions of SQL Server can train and infer machine learning models in multiple programming languages.
Oracle Cloud Infrastructure (OCI) Computer Science is a managed and serverless platform that enables data science teams to develop, train, and maintain machine learning models on Oracle Cloud Infrastructure, including Oracle Autonomous Database and Oracle Autonomous Data Warehouse. It includes open source Python-centric tools, libraries, and packages, as well as the Oracle Accelerated Data Science (ADS) Library, which facilitates the end-to-end lifecycle of predictive models:
- Data acquisition, profiling, preparation, and visualization
- Feature engineering
- Model training (including Oracle AutoML)
- Model evaluation, explanation, and interpretation (including Oracle MLX)
- Model deployment to Oracle Functions
OCI Data Science integrates with the rest of the Oracle Cloud Infrastructure stack, including Functions, Data Flow, Autonomous Data Warehouse, and Object Storage.
Models currently supported include:
ADS also supports machine learning explainability (MLX).
Summary: Oracle Cloud Infrastructure can host data science resources integrated with its data warehouse, object store, and functions, allowing for a full model development lifecycle.
Vertica Analytics Platform is a scalable columnar storage data warehouse. It runs in two modes: Enterprise, which stores data locally in the file system of nodes that make up the database, and EON, which stores data communally for all compute nodes.
Vertica handles petabytes of data with massively parallel computing and employs data parallelism for internal machine learning. It includes eight data preparation algorithms, three regression algorithms, four sorting algorithms, two clustering algorithms, multiple model management features, and the ability to import TensorFlow and PMML models that have been trained elsewhere. Once you’ve fitted or imported a sample, you can use it to make predictions. User-defined extensions written in C++, Java, Python, or R are also supported by Vertica. SQL syntax is used for both preparation and inference.
Summary: Vertica has a nice set of machine learning algorithms built-in, and can import TensorFlow and PMML models. It can do prediction from imported models as well as its own models.
If your database doesn’t already support internal machine learning, it’s likely that you can add that capability using MindsDB, which integrates with a half-dozen databases and five BI tools. Supported databases include MariaDB, MySQL, PostgreSQL, ClickHouse, Microsoft SQL Server, and Snowflake, with a MongoDB integration in the works and integrations with streaming databases promised later in 2021. Supported BI tools currently include SAS, Qlik Sense, Microsoft Power BI, Looker, and Domo.
MindsDB features AutoML, AI tables, and explainable AI (XAI). You can invoke AutoML training from MindsDB Studio, from a SQL INSERT statement, or from a Python API call. Training can optionally use GPUs, and can optionally create a time series model.
You can save the model as a database table and then access it with a SQL SELECT statement against the saved model, MindsDB Studio, or a Python API call. MindsDB Studio allows you to compare, illustrate, and imagine model quality.
MindsDB Studio and the Python API can both be linked to local and remote data sources. MindsDB also provides Lightwood, a streamlined deep learning system that runs on PyTorch.
Summary: MindsDB adds valuable machine learning features to a variety of databases that lack built-in machine learning support.
A increasing number of datasets allow for internal machine learning. The precise process varies, with some being more capable than others. If you have so many data that you’d otherwise have to fit models on a sampled subset, some of the eight databases mentioned above—and others with the assistance of MindsDB—might be able to help you create models from the whole dataset without incurring significant data export overhead.