The Apache Software Foundation (ASF) has released a new major version of Apache SystemDS. Version 3.0.0 of the open-source machine learning and data science software brings a range of improvements and new features, including a federated backend that enables multi-tenancy.
Higher performance and multi-tenancy
SystemDS, which is designed to work with the big data framework Apache Spark, now supports Spark 3.0 and has been updated to Java 11. Probably the most important innovation in the major release is its support for multi-tenancy. To this end, the development team has added a federated backend to SystemDS 3.0.0 so that different tenants can be isolated from one another. SystemDS users should also benefit from higher performance and stability, among other things through compression of workloads in the network and the use of a cost-based scheduler.
With the new release, data scientists have full access to all functions of the Top-K cleaning framework, which automatically creates data cleansing pipelines based on a top-k algorithm. SystemDS 3.0.0 also ships a new unified memory manager for the first time, although it can only be used to a limited extent so far. Performance-oriented improvements include compressed linear algebra and multi-threaded feature transformations. The release notes for SystemDS 3.0.0 and the changelog on the project page on GitHub provide an overview of all changes.
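The idea of picking cleansing pipelines by a top-k criterion can be illustrated with a small toy example. The sketch below is not the SystemDS implementation: the cleaning primitives, the scoring function and all names are hypothetical and chosen only for illustration. It enumerates short pipelines built from three primitives, scores each one against a reference mean, and keeps the k best candidates.

```python
import heapq
from itertools import permutations

# Hypothetical cleaning primitives (illustrative only, not SystemDS built-ins).
def drop_missing(values):
    return [v for v in values if v is not None]

def clip_outliers(values, low=0.0, high=20.0):
    return [v if v is None else max(low, min(v, high)) for v in values]

def impute_mean(values):
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present) if present else 0.0
    return [mean if v is None else v for v in values]

PRIMITIVES = {
    "drop_missing": drop_missing,
    "clip_outliers": clip_outliers,
    "impute_mean": impute_mean,
}

def run_pipeline(names, values):
    """Apply the named primitives to the data, in order."""
    for name in names:
        values = PRIMITIVES[name](values)
    return values

def score(cleaned, reference_mean):
    """Toy quality score: closer to the reference mean is better."""
    if not cleaned or any(v is None for v in cleaned):
        return float("-inf")  # pipelines that leave missing values are rejected
    return -abs(sum(cleaned) / len(cleaned) - reference_mean)

def top_k_pipelines(values, reference_mean, k):
    """Enumerate pipelines of length 1 and 2 and return the k best (score, pipeline) pairs."""
    candidates = [p for n in (1, 2) for p in permutations(PRIMITIVES, n)]
    scored = [(score(run_pipeline(p, values), reference_mean), p) for p in candidates]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

dirty = [10.0, None, 12.0, 500.0, 11.0]
best = top_k_pipelines(dirty, reference_mean=11.0, k=3)
for pipeline_score, pipeline in best:
    print(pipeline_score, " -> ".join(pipeline))
```

A real framework would search a much larger space and evaluate pipelines against a downstream model rather than a fixed reference value. The exhaustive enumeration here only works because the toy search space has nine candidates.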
From ML system to end-to-end data science lifecycle
The project dates back to SystemML, originally developed by IBM, which offered the DML (Declarative Machine Learning) language that allowed algorithms to be written in an R- or Python-like syntax. SystemML also required a MapReduce or Spark environment. IBM, as a supporter of the Spark community, initially released the machine learning system, which is comparable to Google’s TensorFlow, as open source in 2015 and later officially transferred it to the ASF. In just a year and a half, SystemML made the leap from the probationary phase in the incubator to a top-level project at the ASF.
Since then, the community has continuously developed the system and taken it from pure machine learning to software for the entire data science lifecycle, a shift that is also reflected in the renaming from SystemML to SystemDS as of version 2.x. Using a variety of declarative languages with syntax inspired by R, ML practitioners and data scientists can write scripts and pipelines for everything from data integration and cleansing through feature engineering and ML model training to deployment. These can then be executed via Apache Spark, for example in the Spark MLContext and Spark Batch modes.
SystemDS builds on DataTensors as the underlying data model. These are tensors (multidimensional arrays) whose first dimension can have a heterogeneous and nested schema. With this model, SystemDS aims to set itself apart from comparable ML systems that are limited to homogeneous tensors or 2D data sets.
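The difference between a homogeneous tensor and data with a heterogeneous, nested schema can be made concrete with a short sketch. It uses plain Python structures and is not the SystemDS DataTensor API. The schema, the field names and the helper functions are assumptions introduced for illustration, and they represent only one possible reading of the data model described above.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"sensor_id": str, "timestamp": int, "readings": list}

def conforms(record: dict[str, Any], schema: dict[str, type]) -> bool:
    """True if the record has exactly the schema's fields with matching types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )

def leaf_types(value: Any) -> set[type]:
    """Collect the types of all scalar leaves in a nested list/dict structure."""
    if isinstance(value, dict):
        return set().union(*(leaf_types(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(leaf_types(v) for v in value))
    return {type(value)}

# A homogeneous 2D tensor: every cell has the same type.
homogeneous = [[1.0, 2.0], [3.0, 4.0]]

# Heterogeneous records: mixed scalar types plus a nested numeric array per record.
records = [
    {"sensor_id": "a1", "timestamp": 1, "readings": [[0.5, 0.7], [0.1, 0.2]]},
    {"sensor_id": "b2", "timestamp": 2, "readings": [[0.9, 0.3], [0.4, 0.8]]},
]

print(leaf_types(homogeneous))                 # a single leaf type
print(leaf_types(records))                     # several leaf types
print(all(conforms(r, SCHEMA) for r in records))
```

The homogeneous structure fits an ordinary numeric array, while the records need a schema to describe their mixed and nested contents.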