Pangeo: JupyterHub, Dask, and XArray on the Cloud
Key Insight: Democratizing access to distributed computing for scientists requires not just technical integration of tools like Dask and Kubernetes, but also cross-institutional collaboration between academics, engineers, and open-source communities.
Matt Rocklin announces Pangeo, a proof-of-concept deployment combining JupyterHub, Dask, and XArray on Google Container Engine to enable atmospheric and oceanographic scientists to analyze large datasets in the cloud. The system allows users to log in, launch Jupyter notebooks, spin up Dask clusters, and process data stored in cloud storage without managing their own infrastructure. The project was built in just a few weeks through collaboration across multiple institutions, mixing academics, staff, and professional software developers. While still immature and experimental, the deployment has revealed important lessons about cloud file formats, environment customization, and the need for cross-community collaboration.
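As a rough illustration of the kind of analysis this workflow targets, here is a minimal sketch that runs locally rather than on the Pangeo deployment. The synthetic array, its dimension names, and the chunk size are assumptions made for the example; a real session would open a dataset held in cloud storage and attach a Dask cluster instead.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a gridded climate variable (time, lat, lon).
# Hypothetical data: a real deployment would open a dataset from cloud storage.
rng = np.random.default_rng(0)
temperature = xr.DataArray(
    rng.standard_normal((120, 18, 36)),
    dims=("time", "lat", "lon"),
    name="temperature",
)

# Chunking along time backs the array with a lazy Dask array.
chunked = temperature.chunk({"time": 30})

# This only builds a task graph; nothing is computed yet.
lazy_mean = chunked.mean(dim="time")

# .compute() executes the graph. With a dask.distributed Client connected
# to a cluster, the same call would run on that cluster's workers.
time_mean = lazy_mean.compute()
```

The analysis code stays the same whether the chunks are processed by local threads or by workers on a remote cluster, which is what lets a hosted deployment take over the infrastructure without changing how scientists write their computations.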
The file formats used for this sort of data are pervasive, but not particularly convenient or efficient on cloud storage.
Notice the mix of academic and for-profit institutions. Also notice the mix of scientists, staff, and professional software developers. We believe that this mixture helps ensure th…
Libraries like Dask and XArray already solve this problem computationally if scientists have their own clusters, but we seek to expand access by deploying on cloud-based systems.
Dask, Distributed Computing