Over the past decade, every department has wanted to be data-driven, and data engineering teams are under more pressure than ever. If you have been an engineer for more than a few years, you will have seen your world change from a 'well-planned data model' to a 'dump everything in S3 and get some data for the end-user'. Data engineers are under a lot of stress caused by: > The Business is becoming too complex, and every department wants to become data-driven; thus, expectations from the data tea...| www.startdataengineering.com
As a data engineer, you will have spent hours trying to figure out the right place to make a change in your repository; I know I have. > You think, "Why is it so difficult to make a simple change?" > You push a simple change (with tests, by the way), and suddenly, production issues start popping up! > Dealing with on-call issues when your repository is spaghetti code with multiple layers of abstracted logic is a special hell that makes data engineers age in dog years! > Messy code leads to...| www.startdataengineering.com
If you've been in the data space long enough, you would have come across really long SQL scripts that someone had written years ago. However, no one dares to touch them, as they may be powering some important part of the data pipeline, and everyone is scared of accidentally breaking them. If you feel > Rough SQL is a good place to start, but it cannot scale after a certain limit > That dogmatic KISS approach leads to unmaintainable systems > The simplest solution that takes the shortest time ...| www.startdataengineering.com
If you’ve worked on a data team, you’ve likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist. These metric deviations often stem from putting data to use quickly without prioritizing long-term maintainability. Imagine this common scenario: a company hires its first data professional, who writes an ad-hoc SQL query to compute a metric. Over time, multiple teams build their own datasets us...| www.startdataengineering.com
System design interviews are usually vague and depend on you (as the interviewee) to guide the interviewer. If you are thinking: How do I prepare for data engineering system design interviews? I struggle to think of questions you would ask in a system design interview for data engineering; I don't have enough interview experience to know what companies ask. Is data engineering "system design" more than choosing between technologies like Spark and Airflow? This post is for you! Imagine being a...| www.startdataengineering.com
Data quality checks are critical for any production pipeline. While there are many ways to implement data quality checks, the great_expectations library is one of the popular ones. If you have wondered 1. How can you effectively use the great_expectations library? 2. Why is the great_expectations library so complex? 3. Why is the great_expectations library so clunky, with so many moving pieces? Then this post is for you. In this post, we will go over the key concepts you’ll need to get up and ru...| www.startdataengineering.com
Do you use SQL or Python for data processing? Every data engineer will have their preference. Some will swear by Python, stating that it's a Turing-complete language. At the same time, the SQL camp will point to its performance, ease of understanding, etc. Not using the right tool for the job can lead to hard-to-maintain code and sleepless nights! Using the right tool for the job can help you progress up the career ladder, but all the advice online seems to be 'Just use Python' or 'Just use SQL.' U...| www.startdataengineering.com
Are you a data engineer (or new to the data space) wondering why one may need to use Apache Airflow vs. just using cron? Does Apache Airflow feel like an over-optimized solution for a simple problem? Then this post is for you. Understanding the critical features necessary for a data pipelining system will ensure that your output is high quality! Imagine knowing exactly what a complex orchestration system brings to the table; you can make the right tradeoffs for your data architecture. This post wi...| www.startdataengineering.com
Imagine working for a company that processes a few GBs of data every day but spends hours configuring/debugging large-scale data processing systems! Whoever set up the data infrastructure copied it from some blog/talk by big tech. Now, the responsibility of managing the data team's expenses has fallen on your shoulders. You're under pressure to scrutinize every system expense, no matter how small, in an effort to save some money for the organization. It can be frustrating when data vendors ch...| www.startdataengineering.com
Imagine this scenario: You are on call when suddenly an obscure alert pops up. It just says that your pipeline failed but has no other information. The pipelines you inherited (or didn't build) seem like impenetrable black boxes. When they break, it's a mystery—why did it happen? Where did it go wrong? The feeling is palpable: frustration and anxiety mount as you scramble to resolve the issue swiftly. It's a common struggle, especially for new team members who have yet to unravel the system...| www.startdataengineering.com
Whether you are a new Data Engineer or someone with a few years of experience, you will inevitably have encountered messy data systems that seemed impossible to fix. Working at such a company usually comes with multiple pointless meetings, no clear work expectations, frustration, career stagnation, and ultimately no satisfaction from work! The reasons can be Managerial: Such as politics, red tape, cluelessness of management, influential people dictating the roadmap, etc., or Technical: Such as no ...| www.startdataengineering.com