Machine Learning: Apache Flink ML 2.1 brings ten new algorithms

The Apache Flink team at the Apache Software Foundation (ASF) has released Apache Flink ML 2.1. The second release of the year improves the infrastructure, implements ten more algorithms, and adds Python and Java example programs for each of the new algorithms.

One of the main goals of Apache Flink ML is to promote the development of online machine learning applications. Version 2.0 added the setModelData() and getModelData() APIs, which let users of online learning algorithms transmit and persist model data as unbounded data streams.

The current release builds on this with two prototype online learning algorithms, OnlineKMeans and OnlineLogisticRegression. They introduce concepts such as a global batch size and a model version, along with metrics and APIs for setting and getting those values. Tests covering the entire life cycle of the algorithms are also included. The prototypes have not yet been optimized for prediction accuracy; they are meant as a step toward best practices for building such algorithms in Apache Flink ML. The community is invited to participate in this process.
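The ideas of a global batch size and a model version can be illustrated with a small, framework-free sketch. It is not the Flink ML implementation and uses none of its APIs: the class `OnlineKMeansSketch` and all of its fields are hypothetical names chosen for illustration. The assumption here is that the model is updated once per batch of `global_batch_size` points and that `model_version` counts those updates.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Sequence


@dataclass
class OnlineKMeansSketch:
    """Toy online k-means (hypothetical; not the Flink ML OnlineKMeans)."""

    centroids: List[List[float]]
    global_batch_size: int      # number of points consumed per model update
    model_version: int = 0      # incremented once after every update
    weights: List[float] = field(default_factory=list)  # points seen per centroid

    def __post_init__(self) -> None:
        if not self.weights:
            self.weights = [0.0] * len(self.centroids)

    def _nearest(self, point: Sequence[float]) -> int:
        # Index of the centroid with the smallest squared Euclidean distance.
        return min(
            range(len(self.centroids)),
            key=lambda i: sum((c - p) ** 2 for c, p in zip(self.centroids[i], point)),
        )

    def update(self, batch: Sequence[Sequence[float]]) -> None:
        # Assign every point to the nearest centroid of the current model.
        sums: Dict[int, List[float]] = {}
        counts: Dict[int, int] = {}
        for point in batch:
            i = self._nearest(point)
            if i not in sums:
                sums[i] = [0.0] * len(point)
                counts[i] = 0
            sums[i] = [s + p for s, p in zip(sums[i], point)]
            counts[i] += 1

        # Move each touched centroid to the running mean of all its points.
        for i, count in counts.items():
            total = self.weights[i] + count
            self.centroids[i] = [
                (c * self.weights[i] + s) / total
                for c, s in zip(self.centroids[i], sums[i])
            ]
            self.weights[i] = total

        self.model_version += 1

    def fit_stream(self, stream: Iterable[Sequence[float]]) -> Iterator[int]:
        # Buffer points until a full batch is available, then update the model
        # and emit the new version. A trailing partial batch is not applied.
        buffer: List[Sequence[float]] = []
        for point in stream:
            buffer.append(point)
            if len(buffer) == self.global_batch_size:
                self.update(buffer)
                buffer = []
                yield self.model_version
```

A consumer of such a model could use the emitted version number to tell which model state a given prediction was made with, which is one plausible reason for exposing it as a metric.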

Version 2.1 adds a total of ten new algorithms to the library. Besides the online learning algorithms described above, they are intended to validate the functionality and performance of the Apache Flink ML infrastructure. The documentation is available at https://nightlies.apache.org/flink/flink-ml-docs-release-2.1/.

The new algorithms can be grouped into five categories: feature engineering (MinMaxScaler, StringIndexer, VectorAssembler, StandardScaler, Bucketizer), online learning (OnlineKMeans, OnlineLogisticRegression), regression (LinearRegression), classification (LinearSVC) and evaluation (BinaryClassificationEvaluator).
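To give a sense of what a feature engineering step such as min-max scaling computes, here is a minimal sketch in plain Python. It does not use the Flink ML API; the function names `fit_min_max` and `transform_min_max` are hypothetical, and mapping a constant column to the midpoint of the target range is an assumption made for this sketch.

```python
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Learn the (min, max) of every column from the training rows."""
    columns = zip(*rows)
    return [(min(column), max(column)) for column in columns]


def transform_min_max(
    rows: Sequence[Sequence[float]],
    stats: Sequence[Tuple[float, float]],
    lower: float = 0.0,
    upper: float = 1.0,
) -> List[List[float]]:
    """Rescale every column linearly into the range [lower, upper]."""
    scaled = []
    for row in rows:
        out = []
        for value, (col_min, col_max) in zip(row, stats):
            if col_max == col_min:
                # Constant column: no spread to rescale (assumption: use midpoint).
                out.append((lower + upper) / 2)
            else:
                ratio = (value - col_min) / (col_max - col_min)
                out.append(ratio * (upper - lower) + lower)
        scaled.append(out)
    return scaled
```

Splitting the work into a step that learns statistics and a step that applies them means the same statistics can be reused on data that arrives later.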

The project’s website shows example programs for using these algorithms with Python and Java to help users get started.

The Apache Flink ML team has revised the infrastructure in various areas. For example, the amount of managed memory that an operator can use can now be specified, and a new benchmark framework outputs benchmark results in JSON format, among other things. The data can be visualized using a script with Matplotlib.

In addition, a new feature in the Python SDK allows operators in the Python library to call the corresponding operators in the Java library. The Python operator is then a thin wrapper around the Java operator and offers the same performance during execution. This should make the Python and Java algorithm libraries easier to manage, since it is not necessary to implement the algorithms twice.
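The thin-wrapper idea can be sketched generically as follows. This is only an illustration of the delegation pattern, not the Flink ML Python SDK: `BackendScaler` is a hypothetical stand-in for an operator implemented in another runtime, and `ScalerWrapper` is a hypothetical Python-side class that forwards every call to it.

```python
from typing import Dict, List, Optional, Sequence


class BackendScaler:
    """Hypothetical stand-in for an operator implemented in another runtime."""

    def __init__(self) -> None:
        self.params: Dict[str, float] = {}

    def set_param(self, name: str, value: float) -> None:
        self.params[name] = value

    def transform(self, values: Sequence[float]) -> List[float]:
        # The actual algorithm lives only here.
        factor = self.params.get("factor", 1.0)
        return [value * factor for value in values]


class ScalerWrapper:
    """Thin wrapper: holds no algorithm logic and only delegates."""

    def __init__(self, backend: Optional[BackendScaler] = None) -> None:
        self._backend = backend if backend is not None else BackendScaler()

    def set_factor(self, factor: float) -> "ScalerWrapper":
        # Forward the parameter to the backend; return self to allow chaining.
        self._backend.set_param("factor", factor)
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        # Forward the call unchanged, so behavior matches the backend exactly.
        return self._backend.transform(values)
```

Because the wrapper contains no algorithm code of its own, a fix or optimization in the backend reaches both language bindings without a second implementation.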

The required Apache Flink version has increased from 1.14 to 1.15. The upgrade also brings with it the breaking changes in the stream-processing framework.

All further information about the new release of Apache Flink ML can be found in the Apache Flink blog.
