System design interviews are usually vague and depend on you (as the interviewee) to guide the interviewer. If you are thinking: How do I prepare for data engineering system design interviews? I struggle to think of questions you would ask in a system design interview for data engineering; I don't have enough interview experience to know what companies ask. Is data engineering "system design" more than choosing between technologies like Spark and Airflow? This post is for you! Imagine being a...| www.startdataengineering.com
1. Introduction 2. Setup 3. Parts of data engineering 3.1. Requirements 3.1.1. Understand input datasets available 3.1.2. Define what the output dataset will look like 3.1.3. Define SLAs so stakeholders know what to expect 3.1.4. Define checks to ensure the output dataset is usable 3.2. Identify what tool to use to process data 3.3. Data flow architecture 3.| Start Data Engineering
Preparing for data engineering interviews can be stressful. There are so many things to learn. In this 'Data Engineering Interview Series', you will learn how to crack each section of the data engineering interview. If you have felt > That you need to practice 100s of Leetcode questions to crack the data engineering interview > That you have no idea where/how to start preparing for the data structures and algorithms interview > That you are not good enough to crack the data structures and alg...| www.startdataengineering.com
Data quality is such a broad topic. There are many ways to check the data quality of a dataset, but knowing what checks to run and when can be confusing and unclear. In this post, we will review the main types of data quality checks, where to use them, and what to do if a DQ check fails. By the end of this post, you will not only have a clear understanding of the different types of DQ checks and when to use them, but you'll also be equipped with the knowledge to prioritize which DQ checks to ...| www.startdataengineering.com
Do you use SQL or Python for data processing? Every data engineer will have their preference. Some will swear by Python, stating that it's a Turing-complete language. At the same time, the SQL camp will restate its performance, ease of understanding, etc. Not using the right tool for the job can lead to hard-to-maintain code and sleepless nights! Using the right tool for the job can help you progress the career ladder, but every advice online seems to be 'Just use Python' or 'Just use SQL.' U...| www.startdataengineering.com
You know Python is essential for a data engineer. Does anyone know how much one should learn to become a data engineer? When you're in an interview with a hiring manager, how can you effectively demonstrate your Python proficiency? Imagine knowing exactly how to build resilient and stable data pipelines (using any language). Knowing the foundational ideas for data processing will ensure you can quickly adapt to the ever-changing tools landscape. In this post, we will review the concepts you n...| www.startdataengineering.com
Do you need clarification about what Open Table Formats (OTF) are? Is it more than just a pointer to some metadata files that helps you sift through the data quickly? What is the difference between table formats (Apache Iceberg, Apache Hudi, Delta Lake) & file formats (Parquet, ORC)? How do OTFs work? Then this post is for you. Understanding the underlying principle behind open table formats will enable you to deeply understand what happens behind the scenes and make the right decisions when ...| www.startdataengineering.com
Stream processing differs from batch; one needs to be mindful of the system's memory, event order, and system recovery in case of failures. However, understanding the fundamental concepts of time attributes, cluster memory, time-bounded joins, and system monitoring will enable you to build resilient and efficient streaming pipelines. If you are looking for an end-to-end streaming tutorial or a project to understand the foundational skills required to build streaming pipelines, this post is fo...| www.startdataengineering.com
Frustrated trying to pry data pipeline requirements out of end users? Is scope creep preventing you from delivering data projects on time? You assume that the end-users know (and communicate) exactly what they want, but that is rarely the case! Adhoc feature/change requests throw off your delivery timelines. You want to deliver on time and make an impact, but you are interrupted constantly! Requirements gathering is rough, but it doesn't have to be! We go over five steps you can follow, to wo...| www.startdataengineering.com
Worried about introducing data pipeline bugs, regressions, or introducing breaking changes? Then this post is for you. In this post, you will learn what CI is, why it is crucial to have data tests as part of CI, and how to create a CI pipeline that automatically runs data tests on pull requests using Github Actions.| www.startdataengineering.com
Are you disappointed with online SQL tutorials that aren't deep enough? Are you frustrated knowing that you are missing SQL skills, but can't quite put your finger on it? This post is for you. In this post, we go over a few topics that can take your SQL skills to the next level and help you be a better data engineer.| www.startdataengineering.com
Wondering how to store a dimension table's history over time and how to join these historical dimension tables with fact tables for analytical querying ? Then this post is for you. In this post, we will go over a popular dimension modeling technique called SCD2, which preserves historical changes. We will also see how to join a fact table with an SCD2 table to get accurate point in time information.| www.startdataengineering.com
Wondering how to backfill an hourly SQL query in Apache Airflow ? Then, this post is for you. In this post we go over how to manipulate the execution_date to run backfills with any time granularity. We use an hourly DAG to explain execution_date and how you can manipulate them using Airflow macros.| www.startdataengineering.com
If you are trying to improve your data engineering skills or are the sole data person in your company, it can be hard to know how well your technical skills are developing. Questions like Am I building pipelines the right way? How do I measure up to DEs at bigger tech companies? How do I get feedback on my pipeline design? It can cause a lot of uncertainty in career development! Imagine if you know that your code is on par (or even better than) with pipelines at tech-forward companies and tha...| www.startdataengineering.com
Data pipelines built (and added on to) without a solid foundation will suffer from poor efficiency, slow development speed, long times to triage production issues, and hard testability. What if your data pipelines are elegant and enable you to deliver features quickly? An easy-to-maintain and extendable data pipeline significantly increase developer morale, stakeholder trust, and the business bottom line! Using the correct design pattern will increase feature delivery speed and developer valu...| www.startdataengineering.com
Unclear what a data warehouse is or when to use one? Then this post is for you. In this post, we go over what a data warehouse is, the need for it, and the differences between using an OLTP and OLAP database as a data warehouse.| www.startdataengineering.com