We designed and implemented a scalable, cost-optimized backup system for S3 data warehouses that runs automatically on a monthly schedule. The system handles petabytes of data across multiple databases and uses a hybrid approach: AWS Lambda for small workloads and ECS Fargate for larger ones. At its core, the pipeline performs incremental backups, copying only new or changed parquet files while always preserving delta logs, dramatically reducing costs and runtime compared to full backups.| Scribd Technology
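The post itself includes no code, but the incremental step can be sketched with boto3. The bucket names, table prefix, and ETag-based change test below are assumptions for illustration, not Scribd's actual implementation.

```python
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "warehouse-data"    # hypothetical source bucket
BACKUP_BUCKET = "warehouse-backup"  # hypothetical backup bucket

def list_objects(bucket, prefix):
    """Yield (key, etag) for every object under the prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["ETag"]

def incremental_backup(table_prefix):
    # Snapshot what the previous backup run already copied.
    backed_up = dict(list_objects(BACKUP_BUCKET, table_prefix))

    for key, etag in list_objects(SOURCE_BUCKET, table_prefix):
        # Always re-copy the Delta transaction log; copy parquet files
        # only when they are new or their ETag has changed.
        is_delta_log = "/_delta_log/" in key
        if is_delta_log or backed_up.get(key) != etag:
            s3.copy_object(
                Bucket=BACKUP_BUCKET,
                Key=key,
                CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
            )

incremental_backup("warehouse/events/")  # hypothetical table prefix
```

Note that `copy_object` only handles objects up to 5 GB; larger files would need the client's managed `copy` method, which performs a multipart copy, and the post's Lambda-versus-Fargate split presumably routes the heavier tables to Fargate.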
Discover how to identify and resolve column mismatches between Delta Lake tables and the SQL Endpoint in Microsoft Fabric| Sandeep Pawar | Microsoft Fabric
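As a rough sketch of the detection step (not code from the article), one could diff the schema reported by the deltalake package against the columns the SQL Endpoint exposes through INFORMATION_SCHEMA.COLUMNS. The path, connection string, and table name below are hypothetical.

```python
import pyodbc
from deltalake import DeltaTable

# Hypothetical locations; substitute your lakehouse table and endpoint.
# (Storage auth options for OneLake are omitted here.)
DELTA_PATH = "abfss://ws@onelake.dfs.fabric.microsoft.com/lh.Lakehouse/Tables/sales"
CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<endpoint>;..."

# Columns as Delta Lake sees them.
delta_cols = {f.name for f in DeltaTable(DELTA_PATH).schema().fields}

# Columns as the SQL Endpoint exposes them.
with pyodbc.connect(CONN_STR) as conn:
    rows = conn.execute(
        "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS "
        "WHERE TABLE_NAME = ?",
        "sales",
    ).fetchall()
endpoint_cols = {r.COLUMN_NAME for r in rows}

# Columns present in Delta but missing from the endpoint, often caused
# by unsupported data types or a stale metadata sync.
print("missing from endpoint:", delta_cols - endpoint_cols)
```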
Oxbow is a project that turns an existing storage location containing Apache Parquet files into a Delta Lake table. It is intended to run either as an AWS Lambda function or as a command line application. We are excited to introduce terraform-oxbow, an open-source Terraform module that simplifies the deployment and management of AWS Lambda and its supporting components. Whether you’re working with AWS Glue, Kinesis Data Firehose, SQS, or DynamoDB, this module provides a streamlined approach to infrastructure...| Scribd Technology
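Oxbow itself is a Rust tool deployed via the Terraform module, but its core operation, writing a _delta_log that references existing parquet files in place, can be illustrated with the deltalake Python package's convert_to_deltalake helper. The S3 path is hypothetical and this is a conceptual stand-in, not oxbow's actual entry point.

```python
from deltalake import DeltaTable, convert_to_deltalake

# Hypothetical prefix that already contains plain parquet files.
# (AWS credentials are assumed to come from the environment.)
TABLE_URI = "s3://data-lake/events/"

# Create a _delta_log referencing the existing parquet files in
# place; no data is rewritten, mirroring what oxbow does.
convert_to_deltalake(TABLE_URI)

# The location is now readable as a Delta table.
dt = DeltaTable(TABLE_URI)
print(dt.version(), dt.files()[:5])
```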
One of the major themes for Infrastructure Engineering over the past couple of years has been higher reliability and better operational efficiency. In a recent session with the Delta Lake project I was able to share the work, led by Kuntal Basu and a number of other people, to dramatically improve the efficiency and reliability of our online data ingestion pipeline.| Scribd Technology
We brought a whole team to San Francisco to present and attend this year’s Data and AI Summit, and it was a blast! I would consider the event a success, both in the attendance at the Scribd-hosted talks and in the number of talks that discussed patterns we have adopted in our own data and ML platform. The three talks I wrote about previously were well received and have since been posted to YouTube along with hundreds of other talks.| Scribd Technology
We are very excited to be presenting and attending this year’s Data and AI Summit, which will be hosted virtually and in person in San Francisco from June 27th-30th. Throughout 2021 we completed a number of really interesting projects built around delta-rs and the Databricks platform, which we are thrilled to share with a broader audience. In addition to the presentations listed below, a number of Scribd engineers who are responsible for our data and ML platform, machine learning s...| Scribd Technology
Streaming data from Apache Kafka into Delta Lake is an integral part of Scribd’s data platform, but it has been challenging to manage and scale. We use Spark Structured Streaming jobs to read data from Kafka topics and write that data into Delta Lake tables. This approach gets the job done, but our experience in production has convinced us that a different approach is necessary to efficiently bring data from Kafka to Delta Lake. To serve this need, we created kafka-delta-ingest.| Scribd Technology
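A minimal version of the Spark Structured Streaming job the post describes (the pattern kafka-delta-ingest was built to replace) looks roughly like this; the broker, topic, and S3 paths are placeholders.

```python
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka and delta-spark packages on the classpath.
spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Read a Kafka topic as an unbounded stream.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka values arrive as bytes; keep the raw payload plus metadata.
records = stream.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "topic",
    "partition",
    "offset",
    "timestamp",
)

# Continuously append into a Delta table; the checkpoint directory
# gives the job exactly-once semantics across restarts.
query = (
    records.writeStream.format("delta")
    .option("checkpointLocation", "s3://checkpoints/events/")
    .outputMode("append")
    .start("s3://delta/events/")
)

query.awaitTermination()
```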
When your core business is selling tyres... 1. Build A Data Custodian Network and a Data Catalogue 🕸️ The first step in becoming Data Driven is to identify the experts in the data within your company. Those would generally be people within your IT organisation who have a good understanding of...| Michelin IT Engineering Blog