Apache Spark 1.6, which shipped on January 4, 2016, offers performance enhancements that range from faster processing of the Parquet data format to better overall performance for streaming state management.
Apache Spark is a large-scale data processing system. Version 1.6 improves memory management, and Spark has also untethered itself from the Hadoop platform, so it can be used against key-value stores and other types of databases. Still, Hadoop remains a large part of the Spark target ecosystem, so faster processing of data in the Apache Parquet format will speed up Apache Spark when working with Hadoop systems.
The 1.6 release of Apache Spark introduces a new Parquet reader that bypasses the existing parquet-mr record assembly routines, which had been consuming a large share of processing cycles. The change promises an almost 50% improvement in speed.
Version 1.6 is the first to include the new mapWithState API for Spark Streaming. Because it tracks the deltas instead of constantly rescanning the full dataset, its cost scales linearly with the number of updates rather than the total number of records.
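The difference between rescanning all state and tracking only the deltas can be illustrated with a toy model in plain Python. This is a conceptual sketch, not Spark code: the function names `full_scan_update` and `delta_update` are hypothetical, the state is an ordinary dictionary of running totals, and the "work" is simply counted as the number of keys each approach visits per batch.

```python
from typing import Dict, List, Tuple

Update = Tuple[str, int]  # (key, value) pair arriving in a batch


def full_scan_update(state: Dict[str, int], batch: List[Update]) -> int:
    """Toy model of a full rescan: every stored key is visited each batch."""
    # Sum the incoming values per key.
    incoming: Dict[str, int] = {}
    for key, value in batch:
        incoming[key] = incoming.get(key, 0) + value
    touched = 0
    # Visit every key already in the state, plus any new keys in the batch.
    for key in set(state) | set(incoming):
        state[key] = state.get(key, 0) + incoming.get(key, 0)
        touched += 1
    return touched


def delta_update(state: Dict[str, int], batch: List[Update]) -> int:
    """Toy model of delta tracking: only the updates in the batch are visited."""
    touched = 0
    for key, value in batch:
        state[key] = state.get(key, 0) + value
        touched += 1
    return touched


# 1,000 keys of existing state, but only three updates in this batch.
state_a = {f"user{i}": 0 for i in range(1000)}
state_b = dict(state_a)
batch = [("user1", 5), ("user2", 3), ("user1", 2)]

full_touched = full_scan_update(state_a, batch)
delta_touched = delta_update(state_b, batch)

# Both approaches end with the same state; only the work done differs.
print(full_touched, delta_touched, state_a == state_b)
```

In this model the full scan visits every stored key regardless of how small the batch is, while the delta approach visits only the three updates, which is the scaling behavior the article attributes to mapWithState.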
For data scientists, Apache Spark 1.6 improves the machine-learning pipeline: the Pipeline API can now save pipelines to persistent storage and reload them. The release also broadens machine-learning coverage, adding univariate and bivariate statistics, bisecting k-means clustering, online hypothesis testing, and survival analysis, along with support for non-standard JSON data.