Oxbow is a project to take an existing storage location which contains Apache Parquet files into a Delta Lake table. It is intended to run both as an AWS Lambda or as a command line application. We are excited to introduce terraform-oxbow, an open-source Terraform module that simplifies the deployment and management of AWS Lambda and its supporting components. Whether you’re working with AWS Glue, Kinesis Data Firehose, SQS, or DynamoDB, this module provides a streamlined approach to infras...| Scribd Technology
One of the major themes for Infrastructure Engineering over the past couple years has been higher reliability and better operational efficiency. In a recent session with the Delta Lake project I was able to share the work led Kuntal Basu and a number of other people to dramatically improve the efficiency and reliability of our online data ingestion pipeline.| Scribd Technology
Machine Learning Platforms (ML Platforms) have the potential to be a key component in achieving production ML at scale without large technical debt, yet ML Platforms are not often understood. This document outlines the key concepts and paradigm shifts that led to the conceptualization of ML Platforms in an effort to increase an understanding of these platforms and how they can best be applied.| Scribd Technology
We brought a whole team to San Francisco to present and attend this year’s Data and AI Summit, and it was a blast! I would consider the event a success both in the attendance to the Scribd hosted talks and the number of talks which discussed patterns we have adopted in our own data and ML platform. The three talks I wrote about previously were well received and have since been posted to YouTube along with hundreds of other talks.| Scribd Technology
We recently migrated Looker to a Databricks SQL Serverless, improving our infrastructure cost and reducing the footprint of infrastructure we need to worry about! “Databricks SQL” which provides a single load balanced Warehouse for executing Spark SQL queries across multiple Spark clusters behind the scenes. “Serverless” is an evolution of that concept, rather than running a SQL Warehouse in our AWS infrastructure, the entirety of execution happens on the Databricks side. With a much ...| Scribd Technology
We are very excited to be presenting and attending this year’s Data and AI Summit which will be hosted virtually and physically in San Francisco from June 27th-30th. Throughout the course of 2021 we completed a number of really interesting projects built around delta-rs and the Databricks platform which we are thrilled to share with a broader audience. In addition to the presentations listed below, a number of Scribd engineers who are responsible for data and ML platform, machine learning s...| Scribd Technology
Armadillo is the fully featured audio player library Scribd uses to play and download all of its audiobooks and podcasts, which is now open source. It specializes in playing HLS or MP3 content that is broken down into chapters or tracks. It leverages Google’s Exoplayer library for its audio engine. Exoplayer wraps a variety of low level audio and video apis but has few opinions of its own for actually using audio in an Android app.| Scribd Technology
Scribd offers a variety of publisher and user-uploaded content to our users and while the publisher content is rich in metadata, user-uploaded content typically is not. Documents uploaded by the users have varied subjects and content types which can make it challenging to link them together. One way to connect content can be through a taxonomy - an important type of structured information widely used in various domains. In this series, we have already shared how we identify document types and...| Scribd Technology
Extracting metadata from our documents is an important part of our discovery and recommendation pipeline, but discerning useful and relevant details from text-heavy user-uploaded documents can be challenging. This is part 2 in a series of blog posts describing a multi-component machine learning system the Applied Research team built to extract metadata from our documents in order to enrich downstream discovery models. In this post, we present the challenges and limitations the team faced when...| Scribd Technology
Delta Lake is integral to our data platform which is why we have invested heavily in delta-rs to support our non-JVM Delta Lake needs. This year I had the opportunity to share the progress of delta-rs at Data and AI Summit. Delta-rs was originally started by my colleague QP just over a year ago and it has now grown to now a multi-company project with numerous contributors, and downstream projects such as kafka-delta-ingest.| Scribd Technology
User-uploaded documents have been a core component of Scribd’s business from the very beginning, understanding what is actually in the document corpus unlocks exciting new opportunities for discovery and recommendation. With Scribd anybody can upload and share documents, analogous to YouTube and videos. Over the years, our document corpus has become larger and more diverse which has made understanding it an ever-increasing challenge. Over the past year one of the missions of the Applied Res...| Scribd Technology
Streaming data from Apache Kafka into Delta Lake is an integral part of Scribd’s data platform, but has been challenging to manage and scale. We use Spark Structured Streaming jobs to read data from Kafka topics and write that data into Delta Lake tables. This approach gets the job done but in production our experience has convinced us that a different approach is necessary to efficiently bring data from Kafka to Delta Lake. To serve this need, we created kafka-delta-ingest.| Scribd Technology