Abstract: Jupyter notebooks enable developers to interleave code snippets with rich text and in-line visualizations. Data scientists use Jupyter notebooks as the de facto standard for creating and sharing machine-learning based solutions, primarily written in Python. Recent studies have demonstrated, however, that a large portion of the Jupyter notebooks available on public platforms are undocumented and lack a narrative structure, which reduces their readability. To address this shortcoming, this paper presents HeaderGen, a novel tool-based approach that automatically annotates code cells with categorical markdown headers based on a taxonomy of ML operations, and classifies and displays function calls according to this taxonomy. To realize this functionality, HeaderGen enhances the existing call graph analysis in PyCG. To improve precision, HeaderGen extends PyCG's analysis with support for external library code and with flow-sensitivity. The former is realized by facilitating the resolution of function return types. The evaluation on 15 real-world Jupyter notebooks from Kaggle shows that HeaderGen's underlying call graph analysis yields high accuracy (95.6% precision and 95.3% recall). This is because HeaderGen can resolve return types of external libraries where existing type inference tools such as pytype (by Google), pyright (by Microsoft), and Jedi fall short. The header generation has a precision of 85.7% and a recall of 92.8%. In a user study, HeaderGen helps participants finish comprehension and navigation tasks faster. To further evaluate the type inference capability of tools, we introduce TypeEvalPy, a framework for evaluating type inference tools with a micro-benchmark containing 154 code snippets and 845 type annotations. Our comparative analysis of four tools revealed that HeaderGen outperforms the other tools in exact matches with the ground truth.
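The header annotation idea described in the abstract can be illustrated with a minimal sketch. It is not HeaderGen's implementation: it swaps the PyCG-based call graph analysis for a plain import-alias lookup over the Python `ast` module, and the `TAXONOMY` table, its category labels, and the helper names (`resolve_aliases`, `dotted_name`, `headers_for_cell`) are all hypothetical and chosen only for illustration.

```python
import ast

# Hypothetical taxonomy: fully qualified call name -> ML operation category.
# These entries are illustrative assumptions, not HeaderGen's actual taxonomy.
TAXONOMY = {
    "pandas.read_csv": "Data Loading",
    "sklearn.model_selection.train_test_split": "Data Preparation",
    "sklearn.linear_model.LinearRegression": "Model Building",
}


def resolve_aliases(tree):
    """Map local names to qualified names, using import statements only."""
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    aliases[alias.asname] = alias.name
                else:
                    # "import a.b" binds only the top-level name "a".
                    top = alias.name.split(".")[0]
                    aliases[top] = top
        elif isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                local = alias.asname or alias.name
                aliases[local] = f"{node.module}.{alias.name}"
    return aliases


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def headers_for_cell(source, aliases):
    """Return taxonomy categories for the calls found in one code cell.

    `aliases` is shared across cells so that imports made in an earlier
    cell stay visible in later ones, as they would in a running notebook.
    """
    tree = ast.parse(source)
    aliases.update(resolve_aliases(tree))
    categories = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name is None:
            continue
        head, _, rest = name.partition(".")
        qualified = aliases.get(head, head) + (f".{rest}" if rest else "")
        category = TAXONOMY.get(qualified)
        if category and category not in categories:
            categories.append(category)
    return categories
```

A call such as `df.head()` stays unclassified in this sketch, because nothing here knows that `df` holds the value returned by `pandas.read_csv`. Resolving that kind of call needs the return-type resolution for external libraries and the flow-sensitivity that the abstract attributes to HeaderGen.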

TypeEvalPy: A Micro-benchmarking Framework for Python Type Inference Tools

Static Type Analysis for Python

TIPICAL -- Type Inference for Python In Critical Accuracy Level

Generating Python Type Annotations from Type Inference: How Far Are We?

Type4Py: Practical Deep Similarity Learning-Based Type Inference for Python

DyPyBench: A Benchmark of Executable Python Software

iJTyper: An Iterative Type Inference Framework for Java by Integrating Constraint- and Statistically-based Methods

Static Analysis Driven Enhancements for Comprehension in Machine Learning Notebooks

TIGER: A Generating-Then-Ranking Framework for Practical Python Type Inference

Learning Type Inference for Enhanced Dataflow Analysis

Automated Return Type Annotation for Python

DLInfer: Deep Learning with Static Slicing for Python Type Inference

Optimizing and Evaluating Transient Gradual Typing

How Do Developers Use Type Inference: An Exploratory Study in Kotlin

LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation

BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies

How Well Static Type Checkers Work with Gradual Typing? A Case Study on Python

PoTo: A Hybrid Andersen's Points-to Analysis for Python

AmPyfier: Test Amplification in Python

Python Probabilistic Type Inference with Natural Language Support

Trace Typing: An Approach for Evaluating Retrofitted Type Systems (Extended Version)