The Data Science Incubation Program

Submit a proposal for Fall 2014
The goal of the Data Science Incubator is to enable new science by bringing together data scientists and domain scientists to work on focused, intensive, collaborative projects. Projects frequently, but not exclusively, involve a non-trivial software engineering component. Our team of data scientists can provide expertise in state-of-the-art technology and methods in large-scale data manipulation and analytics (e.g., Hadoop, GraphLab, Myria, SciDB), cloud and cluster computing, statistics and machine learning, and visualization to help researchers extract knowledge from large, complex, and noisy datasets.

Overview

To apply to the program, any faculty member, research staff member, or student (typically, but not exclusively, at UW) can submit a short project proposal (details below) describing the science goals, the relevant datasets, and the expected technical challenges.

Each project must include a project lead who is willing to physically co-locate with the incubator staff. We find that collaboration in a shared space is important for deeper technical engagement and provides opportunities for "cross-pollination" among multiple concurrent projects. For Fall 2014, the incubator will operate on Tuesdays and Thursdays, and the project lead should plan to be available for several hours on these days. The pilot program will operate out of the eScience space in Sieg 326, moving to a new Data Science Studio planned to open in November of 2014.

Incubator projects are not "for-hire" software jobs -- each project will be led by representatives of the applicant's team working in collaboration with the data scientists and the broader eScience community.

Areas of Focus

Each project will be different, but we emphasize projects in the following categories:
Scalable Analytics:
As data sizes continue to explode, parallel methods have become critical at every step. Scripts in Python and R are not natively parallel and are difficult to apply to datasets larger than main memory. Our team can help triage your problem and adapt it for use with parallel data manipulation and machine learning platforms such as Hadoop/MapReduce, parallel SQL databases, GraphLab, SciDB, and advanced research systems such as UW's own Myria. We also design and implement new parallel algorithms for large datasets independent of existing platforms.
Data Management and Automation:
Our collaborators report spending 90% of their time "handling" data as opposed to analyzing data: data discovery, acquisition, file format conversions, cleaning, restructuring, loading, sharing, etc. Leveraging technology from cloud providers and SQLShare, we aim to simplify or eliminate these data manipulation tasks and let researchers focus on the science.
Visualization:
We have experience building data-driven visualizations to help scientists make sense of data. We focus on web-enabled, interactive visualizations using platforms like D3.
Reproducibility and Open Science:
We can help you share your code, data, and results with collaborators and with the general public. We favor projects that emphasize open data and open source, allowing other researchers to recreate your results with minimal effort. We advocate alternative metrics and can help you maximize recognition and credit for ensuring reproducibility and open access. Suitable incubator projects may include organizing and uploading data into appropriate public repositories, reviewing and publishing your code on GitHub, identifying venues for publishing papers describing your data or code (they exist!), or migrating your application to a commercial cloud to improve access.
We structure our work according to agile methodologies, typically breaking large projects into multiple short-term sprints of a few weeks each.

Success Stories

Our team has a strong track record of building systems that get real use. Some of our previous collaborations are listed below. In the Spring 2014 pilot, we accepted six proposals from five departments around campus, led by students, postdocs, research staff, and faculty. You can review the full list of projects from Spring 2014.

How to Get Started

Do you have interesting data challenges? You can submit a project proposal for Fall quarter.

Proposals should include:

These proposals are prioritized based on the following criteria:

We expect that some good proposals will not meet every criterion.

Important Dates for the Fall 2014 Session

Additional Info

You can learn more by reviewing the slides from the information session from February 2014. In addition, you can review some frequently asked questions.