If you are a Pandas or NumPy user and have ever tried to create a Spark DataFrame from local data, you might have noticed that it is an unbearably slow process. In fact, the time it takes usually makes it impractical for any data set of interesting size. Starting with Spark 2.3, the addition of SPARK-22216 enables creating a DataFrame from Pandas using Arrow, making this process much more efficient. You can now transfer large data sets to Spark from your local Pandas session almost…
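As a rough sketch of what that looks like in PySpark 2.3 (the Arrow path is off by default and must be enabled per session; the data here is made up for illustration):

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-create").getOrCreate()

# The Arrow code path is off by default in Spark 2.3; opt in per session
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# A local Pandas DataFrame of hypothetical data
pdf = pd.DataFrame(np.random.rand(100000, 3), columns=["a", "b", "c"])

# With Arrow enabled, this conversion sends columnar batches to the JVM
# instead of serializing one row at a time
df = spark.createDataFrame(pdf)
df.show(5)
```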
Tuning machine learning models in Spark involves selecting the best-performing parameters for a model using CrossValidator or TrainValidationSplit. This process uses a parameter grid, where a model is trained for each combination of parameters and evaluated according to a metric. Prior to Spark 2.3, running CrossValidator or TrainValidationSplit would train and evaluate one model at a time, in serial, until each combination in the parameter grid had been evaluated. Spark of course will perform distributed…
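A minimal sketch of parallel tuning, assuming a Spark 2.3 session; the grid values and the DataFrame name `train` are hypothetical, and `parallelism` controls how many models are evaluated at once:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-tuning").getOrCreate()

lr = LogisticRegression(maxIter=10)

# 2 x 3 = 6 parameter combinations to evaluate
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.1, 0.01])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

# parallelism (new in Spark 2.3) evaluates up to that many models
# concurrently instead of training each one serially
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    parallelism=4)

# cvModel = cv.fit(train)  # 'train' is an assumed DataFrame of labeled points
```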
With the introduction of Apache Arrow in Spark, it becomes possible to evaluate Python UDFs as vectorized functions. In addition to the performance benefits of vectorization, this also opens up more possibilities by allowing Pandas data structures for the input and output of the UDF. This post will show some details of the ongoing work I have been doing in this area and how to put it to use.
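A minimal scalar Pandas UDF in the Spark 2.3 style, as a sketch; the function `plus_one` and the sample data are just illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-example").getOrCreate()

# A scalar vectorized UDF: the function receives and returns a pandas.Series,
# processed one Arrow record batch at a time rather than row by row
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1.0

df = spark.range(0, 10)
df.select(plus_one(df["id"]).alias("id_plus_one")).show()
```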
Cross-validation with Apache Spark Pipelines is commonly used to tune the hyperparameters of stages in a PipelineModel. But what do you do if you want to evaluate more than one pipeline with different stages, e.g. using different types of classifiers? You would probably just run cross-validation on each pipeline separately and compare the results, which would generally work fine. What you might not know is that the stages are actually a parameter in the PipelineModel and can be evaluated just like any other parameter…
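A sketch of that technique, assuming a training DataFrame `train` that already has label and features columns; the key point is that `pipeline.stages` is itself a Param that ParamGridBuilder can vary:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-grid").getOrCreate()

lr = LogisticRegression()
dt = DecisionTreeClassifier()

# Leave the pipeline's stages unset; since 'stages' is itself a Param,
# the grid supplies a different stage list for each candidate pipeline
pipeline = Pipeline()
grid = (ParamGridBuilder()
        .addGrid(pipeline.stages, [[lr], [dt]])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

# bestModel = cv.fit(train).bestModel  # 'train' is an assumed labeled DataFrame
```

This way a single cross-validation run compares the classifiers under the same folds and metric, rather than running each pipeline separately.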
It’s probably the engineer in me, but I’d much rather be programming than writing a dumb blog. I generally don’t care for them and never had any interest in writing one. They’re mostly full of noise and fluff, or rehashed ideas, but… these days I’ve been working a lot in open source, and I’ve seen some great posts. I guess sometimes a blog is the best way to spread useful information or build up some interest around a good idea. Hopefully, this will accomplish that and give back…
The upcoming release of Apache Spark 2.3 will include Apache Arrow as a dependency. For those who do not know, Arrow is an in-memory columnar data format with APIs in Java, C++, and Python. Since Spark does a lot of data transfer between the JVM and Python, this is particularly useful and can really help optimize the performance of PySpark. In my post on the Arrow blog, I showed a basic example of how to enable Arrow for a much more efficient conversion of a Spark DataFrame to Pandas. Follow…
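As a sketch of that basic example, assuming Spark 2.3 with made-up data; the relevant config key in 2.3 is `spark.sql.execution.arrow.enabled`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()

# The Arrow code path is off by default in Spark 2.3; opt in per session
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1 << 20).selectExpr("id", "rand() AS value")

# With Arrow enabled, toPandas() transfers columnar record batches to Python
# instead of pickling the rows one at a time through the JVM
pdf = df.toPandas()
print(pdf.shape)
```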