If you are looking for a simple, easy-to-set-up way to automate, schedule, and monitor a 'small' API data pull in the cloud, serverless functions are a good option. In this post we cover what a serverless function can and cannot do, weigh its pros and cons, and walk through a simple API data pull project using AWS Lambda and AWS S3.| www.startdataengineering.com
There are many ways to submit an Apache Spark job to an AWS EMR cluster using Apache Airflow. In this post we go over the steps to create a temporary EMR cluster, submit jobs to it, wait for the jobs to complete, and terminate the cluster, the Airflow way.| www.startdataengineering.com
This post covers key techniques to optimize your Apache Spark code. You will learn what distributed data storage and distributed data processing systems are, how they operate, and how to use them efficiently. Go beyond the basic syntax and learn 3 strategies to improve the performance of your Apache Spark project.| www.startdataengineering.com
Installing from Source| git-scm.com