How Data Clinic is Building Open Source Tooling to Support Mission-Driven Organizations
Kaushik Mohan is a speaker for the ODSC East 2020 Virtual Conference. Be sure to check out his talk, “Open Source Tools for Social Impact,” there!
As the data-for-good arm of the investment manager Two Sigma, Data Clinic brings Two Sigma’s people, data science skills, and technological know-how to help nonprofits, government agencies, and academic institutions use data and tech more effectively. Volunteer teams of Two Sigma employees work closely with partner organizations on a project basis, tackling data science and tech challenges to deliver tailored solutions ranging from research insights to statistical models to small-scale engineering builds.
Over the last six years, Data Clinic has worked with organizations spanning multiple domains, geographies, and levels of technical sophistication. We’ve learned a lot from collaborating one-on-one with our partners on bespoke solutions. However, many of the challenges they face are not unique, and we started thinking about how we could scale this support to widen our impact.
A little bit of tooling could go a long way…
One overarching and significant challenge involves data. Before research questions can be tackled, we need to ensure that there is available, accessible, and context-appropriate data.
While many organizations collect some data in-house, this data often isn’t created for research purposes and may not be aligned with the specific questions of interest. Thankfully, over the last decade or so, the open data movement has led to many administrative datasets at different levels of government being made public. At Data Clinic, we love open data and have used this immense resource to help solve key challenges for our partner organizations.
But anyone familiar with open data knows that finding and working with these datasets isn’t exactly easy. First comes the challenge of identifying the right datasets to support your specific needs. Although the increase in the number of open datasets is encouraging, their variety and quantity make it hard to surface those that are thematically relevant and joinable via common identifiers. Meet scout.
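To make "joinable via common identifiers" concrete, here is a minimal sketch of one way to flag dataset pairs that share a column name. It is an illustration only and not a description of how scout works; the catalog, dataset names, and column names below are all hypothetical.

```python
from itertools import combinations

# Hypothetical catalog: dataset name -> column names (illustrative only).
catalog = {
    "school_locations": ["School ID", "borough", "zip_code"],
    "school_attendance": ["school_id", "year", "attendance_rate"],
    "tree_census": ["tree_id", "Zip-Code", "species"],
}


def normalize(column):
    """Lowercase and unify separators so 'Zip-Code' and 'zip_code' match."""
    return column.strip().lower().replace(" ", "_").replace("-", "_")


def joinable_pairs(catalog):
    """Map each pair of datasets to the normalized column names they share."""
    pairs = {}
    for (name_a, cols_a), (name_b, cols_b) in combinations(catalog.items(), 2):
        shared = {normalize(c) for c in cols_a} & {normalize(c) for c in cols_b}
        if shared:
            pairs[(name_a, name_b)] = sorted(shared)
    return pairs


for pair, columns in joinable_pairs(catalog).items():
    print(pair, columns)
```

A shared column name is only a hint that two datasets can be joined. In practice the values would also need to be checked, since two columns called `zip_code` may use different formats.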
Second, once you have gathered useful data, comes the part data scientists love to hate: data cleaning. It is well known that the majority of our time and effort goes into cleaning and transforming data in preparation for analysis. This effort requires substantial resources and bandwidth that mission-driven organizations might simply not have. While some solutions exist for processing numerical data, cleaning text data is a lengthy and painstaking process even for those who live and breathe regular expressions. Introducing smooshr.
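For a flavor of the text cleanup involved, here is a minimal sketch that groups spelling and punctuation variants of the same category under one normalized key. It is a hand-rolled illustration under simple assumptions (ASCII text, variants differing only in case, punctuation, and spacing), not smooshr's approach; the sample values are made up.

```python
import re
from collections import defaultdict


def normalize(value):
    """Lowercase, turn runs of non-alphanumeric characters into one space, trim."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9]+", " ", value)
    return value.strip()


def group_variants(values):
    """Group raw text values by their normalized form."""
    groups = defaultdict(list)
    for value in values:
        groups[normalize(value)].append(value)
    return dict(groups)


# Made-up free-text categories of the kind found in administrative data.
raw = [
    "Noise - Residential",
    "noise residential",
    " NOISE: Residential ",
    "Street Light Out",
]

for key, variants in group_variants(raw).items():
    print(key, variants)
```

Rules like these catch only the easy cases. Misspellings, abbreviations, and synonyms still need fuzzy matching or manual review, which is where most of the painstaking effort goes.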
Lastly, a lot of useful open data is not in tabular form (xls, csv, etc.). A prime example is geospatial datasets, which form a sizable portion of publicly available data. These datasets require specific knowledge and expertise to load, to visualize, and, especially, to analyze. Say hello to NewerHoods.
Over the past year, we have been developing these open source projects (scout, smooshr, and NewerHoods) to address these challenges and to empower organizations and individuals to use open data more effectively. Come check out our talk at the Open Data Science Conference in Boston on April 16th to find out more and to learn how you can get involved in these efforts. In the meantime, check out our GitHub and learn more about past Data Clinic projects here.
Kaushik Mohan is a data scientist at Data Clinic, where he develops data-driven applications and brings statistical analysis to help nonprofits answer research questions.
He has spent the last four years using data to solve social challenges, through stints at the Data Science for Social Good Fellowship and as the lead data scientist at m.Paani, a social start-up in India. Having worked on projects with partners ranging from governments to small businesses, he has experienced the impact data science can make at every level of society.