We know that building data pipelines powerful enough to interpret large amounts of bioinformatic data has never been easy, especially when you're already underwater trying to figure out how to trace results all the way back to their raw input data.
The purpose of this webinar is to show you that your data pipelines can, in fact, be scalable from the start, and that everything, yes, everything (including raw data), can be versioned, tracked, and reconstructed at a moment's notice. Best of all, we'll demonstrate it from a technical viewpoint rather than boring you with a bunch of slides.
Pachyderm is “Git for Data Science”: a complete, enterprise-grade solution that brings scalability and full version control to data while giving your data science team the same first-class development tools software engineers enjoy. Pachyderm is ideal for building machine learning pipelines and ETL workflows because it tracks every model and output directly back to the raw input datasets that created it (a property known as provenance).
Since every pipeline step in Pachyderm runs in a container, data scientists can use any languages or libraries they want (e.g., Spark, R, Python, OpenCV) without any additional infrastructure overhead.
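To give a flavor of what this looks like in practice, here is a minimal sketch of a Pachyderm pipeline spec. The repo name, image, and script are hypothetical placeholders; the overall shape (a `pipeline` name, a `pfs` input, and a containerized `transform`) follows Pachyderm's pipeline specification:

```json
{
  "pipeline": { "name": "align-reads" },
  "input": {
    "pfs": {
      "repo": "raw-reads",
      "glob": "/*"
    }
  },
  "transform": {
    "image": "example/aligner:1.0",
    "cmd": ["python3", "/app/align.py", "/pfs/raw-reads", "/pfs/out"]
  }
}
```

Because the input repo is versioned, every file the pipeline writes to `/pfs/out` is automatically linked back to the exact commit of raw data that produced it; that is the provenance guarantee described above.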