Lowering the barriers to hypothesis-driven data science
Subramanian, Krishna; Borchers, Jan Oliver (Thesis advisor); Cairns, Paul (Thesis advisor)
Aachen : RWTH Aachen University (2022, 2023)
Dissertation / PhD Thesis
Dissertation, RWTH Aachen University, 2022
Data science is a frequent task in academia and industry. One common use of data science is to validate hypotheses: the analyst uses significance-based hypothesis testing to draw insights about a population distribution from experimental data. Apart from data scientists, who are professionally trained in data science and have high skill levels, many non-professional analysts also carry out data analysis. These non-professionals, whom we refer to as data workers, are domain experts who lack expertise in data science, such as academic researchers, project managers, and sales managers. Through interviews, observations, online surveys, and content analyses, we aim to understand data workers’ workflows across important tasks in hypothesis testing: learning theoretical and practical statistics, selecting statistical procedures, using data science programming IDEs to experiment with ideas in source code, refining and refactoring source code, and disseminating findings from an analysis. We present our findings grouped into two steps when performing data science tasks:
1. Preparing to perform data science tasks: We discuss our findings about the impact of formal training on real-world statistical practice; trade-offs among information sources used for selecting statistical procedures; perceived complexity and uncertainty in statistical procedure selection; and reluctance among data workers to adopt alternative methods of analysis. Based on these findings, we present design recommendations and one artifact to improve data workers’ workflows. Our artifact, StatPlayground, is an interactive simulation tool for self-learning or teaching statistical concepts and statistical procedure selection.
2. Performing data science tasks: Our findings include an overview of data workers’ workflows when performing hypothesis testing using programming IDEs, which follow an exploratory programming style, and a comparison of existing interfaces for data science programming (computational notebooks, scripts, and consoles) with a discussion of how well they support the various steps in hypothesis testing. To improve data workers’ workflows when performing data science tasks, we contribute design recommendations and two artifacts: StatWire, an experimental hybrid-programming interface that encourages data workers to write high-quality source code, and Tractus, an interactive visualization that lowers the cost of working with experimental source code.
Based on our work, we present four takeaways that researchers, software developers, and educators can use to lower the barriers to hypothesis testing.
- Department of Computer Science 
- Chair of Computer Science 10 (Media Computing Group)