A Data Science Platform (DSP) for Scientists

Scientific advance, like civil society, is based on a shared understanding of objective facts. Hippocampus Analytics advocates for and develops solutions for scientific data sharing that will also address problems in the modern data sharing economy. According to a 2017 Forbes article, data scientists spend 80% of their time preparing data and only 20% of their time using that data to solve a problem. This is problematic because data scientists tend to come from computer science backgrounds, while scientists focus their training on employing the scientific method. Even freely available data sets can be difficult to use unless a scientist is working with a data scientist. Additionally, the amount of data available has only grown since the publication of that article. It is a mistake to leave scientists and non-programmers out of the data revolution. A system that will produce the maximum benefit from open data sets has the following features:


Features


The good news is that many solutions are now being developed. The most well-developed solutions focus on technical issues: moving data from one place to another and/or providing easy data manipulation. To be successful and sustainable, however, a solution must address all of the needs of the scientific community:

The scientific community requires a mechanism by which individuals can get credit for producing (and/or documenting) high-quality data. In a data economy, scientists can be thought of as the equivalent of the "content providers" found in social media. A system modeled after iTunes or TikTok would enable consumers to select which data sets they use, effectively "voting with their feet."

To be usable, many large scientific data sets require labor-intensive post-processing. However, only a small subset are currently post-processed. This may be due to funding or other issues. A system that gives credit and ownership to the data producers would provide an opportunity for small businesses and entrepreneurial labs to participate in the scientific endeavor. If the data collector uploads the raw data, a small business with the right expertise could transform the data for a fee.

Finally, it seems likely that some solutions will rely on recommender engine technology. Recommender engines were initially developed for this purpose and remain widely used in industries such as social media. However, they are unsuitable for use in the scientific endeavor because they rely on algorithms that are not transparent. This introduces users to an unknown unknown and enables confirmation bias: users do not know which data they are not seeing, and they do not know why. This is antithetical to the idea of a controlled experiment. We are now focused on developing efficient mechanisms for searching large data sets based on biological circuits that perform similar tasks. Our system is based on Kubernetes and can run in a hybrid cloud environment or on-premises. There is currently no API implementation, but we are happy to discuss the system and our other goals.


Hippocampus Analytics is a Woman Owned Small Business.