The goal of the Data Science Incubator is to enable new science by bringing together data scientists and domain scientists to work on focused, intensive, collaborative projects. Projects frequently, but not exclusively, involve a non-trivial software engineering component. Our team of data scientists can provide expertise in state-of-the-art technology and methods in large-scale data manipulation and analytics (e.g., Hadoop, GraphLab, Myria, SciDB), cloud and cluster computing, statistics and machine learning, and visualization to help researchers extract knowledge from large, complex, and noisy datasets.
To apply to the program, any faculty, research staff, or student
(typically, but not exclusively, at UW) can submit a short project
proposal (details below) describing the science goals, the relevant
datasets, and the expected technical challenges.
Each project must include a project lead who is willing to physically
co-locate with the incubator staff. We
find that collaboration in a shared space is important for deeper
technical engagement and provides opportunities for "cross-pollination"
among multiple concurrent projects.
For Fall 2014, the incubator will operate on Tuesdays and Thursdays, and
the project lead should plan to be available for several hours on these days.
The pilot program will operate out of the eScience space in Sieg 326, moving to a new Data Science Studio planned to open in November of 2014.
Incubator projects are not "for-hire" software jobs -- each project will
be led by representatives of the applicant's team working in collaboration
with the data scientists and the broader eScience community.
Areas of Focus
Each project will be different, but we emphasize projects in the following categories:
- Scalable Analytics:
- As data sizes continue to explode, parallel methods have become critical at every step. Scripts in Python and R are not natively parallel and are difficult to apply to datasets larger than main memory. Our team can help triage your problem and adapt it for use with parallel data manipulation and machine learning platforms such as Hadoop/MapReduce, parallel SQL databases, GraphLab, SciDB, and advanced research systems such as UW's own Myria. We also design and implement new parallel algorithms for large datasets independent of existing platforms.
- Data Management and Automation:
- Our collaborators report spending 90% of their time "handling" data as opposed to analyzing data: data discovery, acquisition, file format conversions, cleaning, restructuring, loading, sharing, etc. Leveraging technology from cloud providers and SQLShare, we aim to simplify or eliminate these data manipulation tasks and let researchers focus on the science.
- Visualization:
- We have experience building data-driven visualizations to help scientists make sense of data. We focus on web-enabled, interactive visualizations using platforms like D3.
- Reproducibility and Open Science:
- We can help you share your code, data, and results with collaborators and with the general public. We favor projects that emphasize open data and open source, allowing other researchers to recreate your results with minimal effort. We advocate alternative metrics and can help you maximize recognition and credit for ensuring reproducibility and open access. Suitable incubator projects may include organizing and uploading data into appropriate public repositories, reviewing and publishing your code on GitHub, identifying venues for publishing papers describing your data or code (they exist!), or migrating your application to a commercial cloud to improve access.
We structure our work according to agile methodologies, typically breaking large projects into multiple short-term sprints of a few weeks each.
Our team has a strong track record of building systems that get real use. Some of our previous collaborations are listed below.
In the Spring 2014 pilot, we accepted 6 proposals, led by students, postdocs, research staff, and faculty, from 5 different departments around campus.
You can review the full list of projects from Spring 2014.
How to Get Started
Do you have interesting data challenges? You can submit a project proposal for Fall quarter. Proposals should include:
- Contact information for the project lead -- the one who will join us in the studio and be responsible for carrying out the project.
- Project summary / objective (1 page).
- A description of your data, including at least its size, its formats, where it currently resides, and any privacy and access restrictions. We strongly favor projects that have already collected the relevant data rather than "preparatory" projects that involve building software in anticipation of future data collection activities.
- A list of the key science questions the data will help answer, and a discussion of the publications that you anticipate will result.
- A list of key technical challenges you face in answering these questions: Do you need new methods or algorithms? Do you need to scale up existing methods? Do you need to integrate data so it can be analyzed? Do you need to publish data and/or code to improve collaborative opportunities and reproducibility?
- The timeframe for your work.
- The names of those researchers who will be physically joining the team to lead the project.
These proposals are prioritized based on the following criteria:
- Good clustering between proposals; ideally, we seek a cohort of proposals with a common theme.
- Alignment with sponsor and program goals
- Participant availability and engagement
- Ability to answer fundamentally new research questions
- Clarity and shovel-readiness
- Capacity for measurable outcomes
- Capabilities of the incubator staff
We expect that some good proposals will not meet every criterion.
Important Dates for the Fall 2014 Session
- Sep 8: 1-hour information session, 11:00 am - 12:00 pm, Paul Allen Center, 403.
- Sep 18: 1-page proposals due
- Sep 24: Accepted proposals announced
- Sep 29: Kickoff meeting
You can learn more by reviewing the materials from the February 2014 information session.
In addition, you can review some frequently asked questions.