Securing Your Collaborative Jupyter Notebooks in the Cloud using Container and Load Balancing Services

Haw-minn Lu, Adrian Kwong, Jose Unpingco
DOI: https://doi.org/10.48550/arXiv.2006.01818
2020-06-03
Abstract: Jupyter has become the go-to platform for developing data applications, but data and security concerns, especially when dealing with healthcare, have become paramount for many institutions and applications dealing with sensitive information. How then can we continue to enjoy the data analysis and machine learning opportunities provided by Jupyter and the Python ecosystem while guaranteeing auditable compliance with security and privacy concerns? We will describe the architecture and implementation of a cloud-based platform built on Jupyter that integrates with Amazon Web Services (AWS) and uses containerized services without exposing the platform to the vulnerabilities present in Kubernetes and JupyterHub. This architecture addresses the HIPAA requirements to ensure both security and privacy of data. The architecture uses an AWS service to provide JSON Web Tokens (JWT) for authentication as well as network control. Furthermore, our architecture enables secure collaboration and sharing of Jupyter notebooks. Even though our platform is focused on Jupyter notebooks and JupyterLab, it also supports R-Studio and bespoke applications that share the same authentication mechanisms. Further, the platform can be extended to cloud services other than AWS.
Cryptography and Security, Networking and Internet Architecture
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address data security and privacy issues when using Jupyter Notebook for collaboration in a cloud computing environment, especially when handling sensitive information such as healthcare data. Specifically, the paper explores how to ensure compliance with security and privacy requirements while enjoying the data analysis and machine learning opportunities provided by the Jupyter and Python ecosystem.

### Main Issues Include:

1. **Data Security**: How to protect sensitive data stored in the cloud from unauthorized access and data breaches.
2. **Privacy Protection**: How to ensure the privacy of user data, particularly in compliance with HIPAA (Health Insurance Portability and Accountability Act) requirements.
3. **Authentication and Authorization**: How to implement robust user authentication mechanisms to ensure that only verified users can access specific resources.
4. **Encrypted Communication**: How to ensure the security of data during transmission, achieving end-to-end encryption.
5. **Audit and Monitoring**: How to provide audit functions to monitor the usage of critical resources, enabling timely detection and response to potential security threats.
6. **Security of Containerized Services**: How to avoid security vulnerabilities brought by technologies like Kubernetes and JupyterHub, ensuring the secure operation of containerized services.

### Solution Overview

The paper proposes a cloud architecture based on AWS (Amazon Web Services), integrating AWS Elastic Container Service (ECS) and Application Load Balancer (ALB), and implementing user authentication through JSON Web Tokens (JWT). Specific measures include:

- **Container Management**: Using ECS to manage container instances, ensuring that each user's container instance is isolated and secure.
- **Load Balancing**: Implementing load balancing of user requests through ALB and ensuring communication security via HTTPS.
- **Persistent Storage**: Using encrypted file systems like ObjectiveFS to ensure the security of data at rest.
- **Authentication Mechanism**: Implementing user authentication through OpenID Connect (OIDC) compatible identity providers (such as Okta) and passing authentication information to the application layer.
- **Auto Scaling and Recycling**: Automatically starting or stopping container instances based on user activity, reducing resource waste and enhancing security.

Through these measures, the paper proposes a secure, compliant, and efficient cloud platform architecture suitable for research and enterprise environments that need to handle sensitive data.
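
To make the authentication hand-off more concrete, the following is a minimal sketch of how an application behind the load balancer might read the claims carried in a forwarded JWT. It is not the paper's implementation. It assumes the token arrives in the `x-amzn-oidc-data` request header that ALB sets after OIDC authentication, and the helper names (`b64url_decode`, `decode_jwt_unverified`, `is_expired`, `make_sample_token`) are hypothetical. The sketch deliberately omits signature verification, which a real deployment would need to perform against the load balancer's public key before trusting any claim.

```python
import base64
import json
import time
from typing import Optional


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, whether or not '=' padding is present."""
    stripped = segment.rstrip("=")
    padded = stripped + "=" * (-len(stripped) % 4)
    return base64.urlsafe_b64decode(padded)


def decode_jwt_unverified(token: str):
    """Split a JWT and return (header, claims) WITHOUT checking the signature.

    Illustration only: the signature must be verified before the claims
    are trusted for any access decision.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    header = json.loads(b64url_decode(parts[0]))
    claims = json.loads(b64url_decode(parts[1]))
    return header, claims


def is_expired(claims: dict, now: Optional[float] = None) -> bool:
    """Treat a token with a missing or past 'exp' claim as expired."""
    current = time.time() if now is None else now
    return "exp" not in claims or claims["exp"] <= current


def make_sample_token(claims: dict) -> str:
    """Hypothetical helper that builds an unsigned sample token for the demo."""
    def encode(obj: dict) -> str:
        return base64.urlsafe_b64encode(json.dumps(obj).encode()).decode()
    return ".".join([encode({"alg": "ES256"}), encode(claims), "placeholder-signature"])


if __name__ == "__main__":
    # In a real request handler the token would come from the
    # 'x-amzn-oidc-data' header (assumed here, not taken from the paper).
    token = make_sample_token({"sub": "user-123", "exp": 2000000000})
    header, claims = decode_jwt_unverified(token)
    print(header["alg"], claims["sub"], is_expired(claims, now=1000000000))
```

Rejecting tokens that lack an `exp` claim is a conservative choice for this sketch; an application could also check the issuer and signer fields once the signature has been verified.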