Alternative to Massive Jupyter Notebooks

I used to have the problem of massive Jupyter notebooks. They were slow and I couldn't find what I was looking for. I was adding markdown headers and jump links... In retrospect, I was doing too much in one notebook and was not pulling out big functions into scripts.

Now I use a file structure, imported scripts, and more (but smaller) notebooks. Each notebook has one job. I moved the processing code out of the notebooks and import it from script files. This also pushed me to refactor those big functions into more appropriately sized ones, which made my code more reusable, extensible, and debuggable.

If you're looking to make a similar workflow improvement, this is a good opportunity to use git (or just copy your whole project folder first). Make a branch and refactor all your code there. Make a scripts folder and move boilerplate code into .py files (make sure to add __init__.py so the scripts can be imported).
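
A minimal sketch of what that can look like, assuming a layout where notebooks/ and scripts/ sit side by side under the project root. The folder names and the add_project_root helper are illustrative, not part of any library.

```python
# Hypothetical layout (names are placeholders):
#
#   project/
#   ├── notebooks/
#   │   └── analysis.ipynb
#   └── scripts/
#       ├── __init__.py          # marks scripts/ as an importable package
#       └── process_dataset.py

import sys
from pathlib import Path


def add_project_root(notebook_dir=None):
    """Put the parent of the notebook folder on sys.path.

    Assumes the notebook lives one level below the project root, so that
    `import scripts` resolves to the scripts/ folder next to notebooks/.
    """
    notebook_dir = Path(notebook_dir) if notebook_dir is not None else Path.cwd()
    root = str(notebook_dir.resolve().parent)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

In a notebook that lives in notebooks/, calling add_project_root() once in the import cell and then running `from scripts import process_dataset` should work, as long as Jupyter's working directory is the notebook's own folder.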

In my current practice, notebooks are for analysis and making figures. I still use notebooks to prototype and build my load, clean, and transform workflow, but once the functions are good to go, I move them to a script.

For each dataset I have a process_dataset.py file containing functions for the following tasks:

  • Load the data and return the unmodified dataset AND a dataset with selected columns.
  • Do specific column transformations that require complex conditional logic. These are essentially small functions that are applied to the dataset in the transform step.
  • Transform the data - apply specific column transformations and do basic adjustments, such as fixing data types, renaming, and filtering out rows.
  • Export dataset.
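
The four tasks above could be sketched roughly like this. To stay dependency-free it uses only the standard library csv module with rows as dictionaries; the same shape carries over to a DataFrame library. The column names (id, temp_f, status) and the function names are made up for illustration.

```python
import csv

# Hypothetical columns to keep for analysis.
SELECTED_COLUMNS = ["id", "temp_f", "status"]


def load(path, columns=SELECTED_COLUMNS):
    """Return (raw_rows, selected_rows): the untouched data and a trimmed copy."""
    with open(path, newline="") as f:
        raw = list(csv.DictReader(f))
    selected = [{col: row[col] for col in columns} for row in raw]
    return raw, selected


def temp_band(temp_c):
    """Column-specific transformation with conditional logic."""
    if temp_c < 0:
        return "freezing"
    if temp_c < 20:
        return "cool"
    return "warm"


def transform(rows):
    """Filter rows, fix data types, rename, and apply the column functions."""
    cleaned = []
    for row in rows:
        if row["status"] != "ok":  # filter out unwanted rows
            continue
        # CSV values arrive as strings: convert, then change units (F to C).
        temp_c = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
        cleaned.append(
            {"id": int(row["id"]), "temp_c": temp_c, "temp_band": temp_band(temp_c)}
        )
    return cleaned


def export(rows, path, fieldnames=None):
    """Write the cleaned rows back out as CSV."""
    fieldnames = fieldnames or (list(rows[0]) if rows else [])
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

A notebook then only needs something like `raw, selected = load(path)` followed by `clean = transform(selected)`, which keeps the analysis cells short.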

Make sure to look into the autoreload magic in Jupyter notebooks. It lets you modify and save a script that lives in another folder and is imported into the notebook, and the notebook picks up the saved changes the next time you run a cell, without restarting the kernel or re-importing by hand.

There is a lot online about how to use this if you search for the Jupyter autoreload extension.

Quick implementation - put this in your import cell.

%reload_ext autoreload
%autoreload 2
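
To see why this matters, here is a rough sketch of the manual step that autoreload saves you from. It writes a throwaway module (demo_helpers, a made-up name) to a temp folder, edits it, and reloads it with importlib.reload. This is only an analogy: the extension does more than a plain reload, and it does it for you before each cell runs.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical one-function module standing in for a script you are editing.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "demo_helpers.py"
module_file.write_text("def label():\n    return 'v1'\n")

sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import demo_helpers

first = demo_helpers.label()

# "Edit and save" the script. The new source has a different length, so
# cached bytecode is invalidated even if the edit lands in the same second.
module_file.write_text("def label():\n    return 'version two'\n")

stale = demo_helpers.label()      # the running session still has the old code
importlib.reload(demo_helpers)    # without autoreload, you reload by hand
fresh = demo_helpers.label()

print(first, stale, fresh)
```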