Setting up Spark locally is not easy, especially if you are trying to learn Spark at the same time. If you don't know how to start working with Spark locally, don't know what the recommended tools are to work with Spark (like which IDE or data storage table formats), or try and try and then give up, only to end up using one of the cloud providers or giving up altogether, this post is for you! You can have a fully functioning local Spark development environment with all the bells and whi...| www.startdataengineering.com
If you have worked at a company that moves fast (or claims to), you've inevitably had to deal with your pipelines breaking because the upstream team decided to change the data schema! If you are frequently in meetings fixing pipeline issues due to schema changes; stressed, unable to deliver quality work, and always in a hurry to put out the next fire; or working with teams who have to prioritize speed over everything, this post is for you. Constantly dealing with broken pipelines due to upstream...| www.startdataengineering.com
System design interviews are usually vague and depend on you (as the interviewee) to guide the interviewer. If you are thinking: "How do I prepare for data engineering system design interviews?"; "I struggle to think of questions you would ask in a system design interview for data engineering"; "I don't have enough interview experience to know what companies ask"; or "Is data engineering 'system design' more than choosing between technologies like Spark and Airflow?", then this post is for you! Imagine being a...| www.startdataengineering.com
You know Python is essential for a data engineer, but how much Python should you learn to become one? When you're in an interview with a hiring manager, how can you effectively demonstrate your Python proficiency? Imagine knowing exactly how to build resilient and stable data pipelines (using any language). Knowing the foundational ideas for data processing will ensure you can quickly adapt to the ever-changing tools landscape. In this post, we will review the concepts you n...| www.startdataengineering.com
Working on a large codebase without any tests can be nerve-wracking. One wrong line of code or an inconspicuous library update can bring down your whole production pipeline! Data pipelines start simple, so engineers skip tests, but the complexity increases rapidly, and the lack of tests can grind your feature delivery speed to a crawl. It can be especially tricky to start testing if you are working on a large legacy codebase with few to no tests. In long-running data pipelines, bad c...| www.startdataengineering.com
Are you part of an under-resourced team where adding time-saving dbt (data build tool) features takes a back seat to delivering new datasets? Do you want to incorporate time (& money) saving dbt processes but need more time? While focusing on delivery may help in the short term, the delivery speed will suffer without a proper workflow! A good workflow will save time, prevent bad data, and ensure high development speed! Imagine the time (& mental pressure) savings if you didn't have to validate ...| www.startdataengineering.com
Struggling to set up a local development environment for your Python data projects? Then this post is for you! In this post, you will learn how to set up a local development environment for data projects the right way using Docker. By the end, you will be able to improve developer ergonomics, increase development velocity, and reduce bugs.| www.startdataengineering.com
Struggling to come up with a data engineering project idea? Overwhelmed by all the setup necessary to start building a data engineering project? Don't know where to get data for your side project? Then this post is for you. We will go over the key components, and help you understand what you need to design and build your data projects. We will do this using a sample end-to-end data engineering project.| www.startdataengineering.com