The upcoming release of Apache Spark 2.3 will include Apache Arrow as a dependency. For those unfamiliar with it, Arrow is an in-memory columnar data format with APIs in Java, C++, and Python. Because Spark transfers a lot of data between the JVM and Python, Arrow is particularly useful here and can significantly improve the performance of PySpark. In my post on the Arrow blog, I showed a basic example of how to enable Arrow for a much more efficient conversion of a Spark DataFrame to Pandas.