Scribd offers a variety of publisher and user-uploaded content to our users and while the publisher content is rich in metadata, user-uploaded content typically is not. Documents uploaded by the users have varied subjects and content types which can make it challenging to link them together. One way to connect content can be through a taxonomy - an important type of structured information widely used in various domains. In this series, we have already shared how we identify document types and...| Scribd Technology
Extracting metadata from our documents is an important part of our discovery and recommendation pipeline, but discerning useful and relevant details from text-heavy user-uploaded documents can be challenging. This is part 2 in a series of blog posts describing a multi-component machine learning system the Applied Research team built to extract metadata from our documents in order to enrich downstream discovery models. In this post, we present the challenges and limitations the team faced when...| Scribd Technology
User-uploaded documents have been a core component of Scribd’s business from the very beginning, understanding what is actually in the document corpus unlocks exciting new opportunities for discovery and recommendation. With Scribd anybody can upload and share documents, analogous to YouTube and videos. Over the years, our document corpus has become larger and more diverse which has made understanding it an ever-increasing challenge. Over the past year one of the missions of the Applied Res...| Scribd Technology