Storage Options

Using Amazon EFS and Amazon FSx for Lustre with Kubeflow

This guide describes how to use Amazon EFS and Amazon FSx for Lustre with Kubeflow.

Amazon EFS

Amazon EFS is a managed NFS service in AWS. Amazon EFS supports the ReadWriteMany access mode, which means a volume can be mounted as read-write by many nodes. This makes it useful for creating a shared filesystem that can be mounted into pods such as Jupyter notebook servers; for example, a team can share datasets or models across all of its members. By default, the Amazon EFS CSI driver is not enabled, and you need to follow the steps below to install it.

Deploy the Amazon EFS CSI Plugin

  git clone https://github.com/kubeflow/manifests
  cd manifests/aws
  kubectl apply -k aws-efs-csi-driver/base
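
You can optionally confirm that the driver is installed. The exact pod names and namespace depend on the manifest version, but the upstream driver registers a CSIDriver object and runs a set of EFS CSI pods:

  kubectl get csidriver
  kubectl get pods --all-namespaces | grep efs-csi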

Static Provisioning

You can provision a new Amazon EFS file system in the Amazon EFS console. Choose the VPC, subnet IDs, and provisioning mode to use. In the security group configuration, ensure that inbound traffic to NFS port 2049 is allowed. Once the file system is created, retrieve its file system ID and use it to create PersistentVolume and PersistentVolumeClaim objects.

In this example, eksctl was used to provision the cluster, so you can choose the ClusterSharedNodeSecurityGroup.

Amazon EFS Create
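
If you prefer the CLI over the console, you can create the same file system and its mount targets with the aws CLI. The following is a minimal sketch; the creation token, subnet ID, and security group ID are placeholders to replace with your own values, and you need one mount target per subnet used by your nodes:

  # Create the file system and capture its ID
  EFS_ID=$(aws efs create-file-system \
    --creation-token kubeflow-efs \
    --performance-mode generalPurpose \
    --query 'FileSystemId' --output text)

  # Create a mount target in each subnet used by your node group,
  # using a security group that allows inbound NFS (port 2049)
  aws efs create-mount-target \
    --file-system-id ${EFS_ID} \
    --subnet-id <your_subnet_id> \
    --security-groups <your_security_group_id>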

If it’s not already in place, specify a storage class for EFS and create it with kubectl.

  cat << EOF > efs-sc.yaml
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: efs-sc
  provisioner: efs.csi.aws.com
  EOF
  kubectl apply -f efs-sc.yaml

To create a volume, replace <your_efs_id> with the ID of the file system you created above and create the volume with kubectl.

  EFS_ID=<your_efs_id>
  cat << EOF > efs-pv.yaml
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: efs-pv
  spec:
    capacity:
      storage: 5Gi
    volumeMode: Filesystem
    accessModes:
      - ReadWriteMany
    persistentVolumeReclaimPolicy: Retain
    storageClassName: efs-sc
    csi:
      driver: efs.csi.aws.com
      volumeHandle: ${EFS_ID}
  EOF
  kubectl apply -f efs-pv.yaml

Finally, create a claim on the volume for use. Replace <your_namespace> with your namespace and create the PVC with kubectl.

  NAMESPACE=<your_namespace>
  cat << EOF > efs-claim.yaml
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: efs-claim
    namespace: ${NAMESPACE}
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: efs-sc
    resources:
      requests:
        storage: 5Gi
  EOF
  kubectl apply -f efs-claim.yaml
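
Because the PersistentVolume was provisioned statically, the claim should bind to it almost immediately. You can optionally confirm that the claim is in the Bound state:

  kubectl get pvc efs-claim -n ${NAMESPACE}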

By default, new Amazon EFS file systems are owned by root:root, and only the root user (UID 0) has read-write-execute permissions. If your containers are not running as root, you must change the Amazon EFS file system permissions to allow other users to modify the file system.

In order to share EFS between notebooks, you can create a job as shown below to change the file system permissions. If you use EFS for other purposes (for example, sharing data across pipelines), you don’t need this step.

Replace <your_namespace> with your namespace and create the job with kubectl.

  NAMESPACE=<your_namespace>
  cat << EOF > job.yaml
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: set-permission
    namespace: ${NAMESPACE}
  spec:
    template:
      metadata:
        annotations:
          sidecar.istio.io/inject: "false"
      spec:
        restartPolicy: Never
        containers:
        - name: app
          image: centos
          command: ["/bin/sh"]
          args:
            - "-c"
            - "chmod 2775 /data && chown root:users /data"
          volumeMounts:
            - name: persistent-storage
              mountPath: /data
        volumes:
          - name: persistent-storage
            persistentVolumeClaim:
              claimName: efs-claim
  EOF
  kubectl apply -f job.yaml
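
The job mounts the claim at /data, makes the directory group-writable with the setgid bit (chmod 2775), and assigns group ownership to the users group. You can confirm that it completed and inspect its output with:

  kubectl get job set-permission -n ${NAMESPACE}
  kubectl logs job/set-permission -n ${NAMESPACE}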

To use Amazon EFS as a notebook volume when you create Jupyter notebooks, specify the PersistentVolumeClaim name.

Amazon EFS JupyterNotebook Volume
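
Under the hood, the notebook controller simply mounts the claim into the notebook pod. The sketch below shows the equivalent volume configuration on a plain pod; the pod name, container image, and mount path are illustrative placeholders, not values required by Kubeflow:

  cat << EOF > notebook-efs-pod.yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: efs-demo-pod
    namespace: ${NAMESPACE}
  spec:
    containers:
    - name: notebook
      image: jupyter/base-notebook
      volumeMounts:
        - name: efs-shared
          mountPath: /home/jovyan/shared
    volumes:
      - name: efs-shared
        persistentVolumeClaim:
          claimName: efs-claim
  EOF
  kubectl apply -f notebook-efs-pod.yaml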

Amazon FSx for Lustre

Amazon FSx for Lustre provides a high-performance file system optimized for fast processing of machine learning and high performance computing (HPC) workloads. The Amazon FSx for Lustre CSI driver helps Kubernetes users easily consume this service.

Lustre is another file system that supports ReadWriteMany. One difference between Amazon EFS and Lustre is that Lustre can be used to cache training data with direct connectivity to Amazon S3 as the backing store. With this configuration, you don’t need to transfer data to the file system before using the volume.

By default, the Amazon FSx CSI driver is not enabled and you need to follow steps to install it.

Deploy the Amazon FSx CSI Plugin

Ensure your driver will have the required IAM permissions. For details, refer to the project documentation.
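
One coarse-grained way to grant these permissions is to attach the AWS managed AmazonFSxFullAccess policy to the IAM role the driver runs under. The sketch below assumes the driver pods use the node group's instance role; replace <your_node_instance_role> with the role name for your cluster, and prefer a scoped-down policy or IAM roles for service accounts in production:

  aws iam attach-role-policy \
    --role-name <your_node_instance_role> \
    --policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess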

  git clone https://github.com/kubeflow/manifests
  cd manifests/aws
  kubectl apply -k aws-fsx-csi-driver/base

Static Provisioning

You can statically provision Amazon FSx for Lustre and then use the file system ID, DNS name and mount name to create PersistentVolume and PersistentVolumeClaim objects.

Amazon FSx for Lustre provides both scratch and persistent deployment options. Choose the deployment which best suits your needs. For more details on deployment options, see the documentation.

Amazon FSx Create Volume

Persistent FSx for Lustre file systems are replicated within a single Availability Zone. Select the appropriate subnet based on your cluster’s node group configuration.

Amazon FSx Network Settings

Once the file system is created, you can retrieve “File system ID”, “DNS name” and “Mount name” for the configuration steps below.


You can optionally retrieve this information with the aws CLI:

  aws fsx describe-file-systems

Retrieve the FileSystemId, DNSName, and MountName values.
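
For example, if you have a single Lustre file system in the region, you can extract the three values directly with the CLI's --query option. This is a sketch based on the shape of the describe-file-systems response, where MountName is nested under LustreConfiguration:

  FS_ID=$(aws fsx describe-file-systems \
    --query 'FileSystems[0].FileSystemId' --output text)
  DNS_NAME=$(aws fsx describe-file-systems \
    --query 'FileSystems[0].DNSName' --output text)
  MOUNT_NAME=$(aws fsx describe-file-systems \
    --query 'FileSystems[0].LustreConfiguration.MountName' --output text)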

To create a volume, replace <file_system_id>, <dns_name>, and <mount_name> with your values and create it with kubectl.

  FS_ID=<file_system_id>
  DNS_NAME=<dns_name>
  MOUNT_NAME=<mount_name>
  cat << EOF > fsx-pv.yaml
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: fsx-pv
  spec:
    capacity:
      storage: 1200Gi
    volumeMode: Filesystem
    accessModes:
      - ReadWriteMany
    mountOptions:
      - flock
    persistentVolumeReclaimPolicy: Recycle
    csi:
      driver: fsx.csi.aws.com
      volumeHandle: ${FS_ID}
      volumeAttributes:
        dnsname: ${DNS_NAME}
        mountname: ${MOUNT_NAME}
  EOF
  kubectl apply -f fsx-pv.yaml

Now you can create a claim on the volume for use. Replace <your_namespace> with your namespace and create the PVC with kubectl.

  NAMESPACE=<your_namespace>
  cat << EOF > fsx-pvc.yaml
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: fsx-claim
    namespace: ${NAMESPACE}
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: ""
    resources:
      requests:
        storage: 1200Gi
    volumeName: fsx-pv
  EOF
  kubectl apply -f fsx-pvc.yaml

Dynamic Provisioning

You can optionally provision Amazon FSx for Lustre file systems dynamically. The SecurityGroupId and SubnetId parameters are required. Amazon FSx for Lustre is an Availability Zone-based file system, and you can pass only one subnet in this dynamic configuration. This means you need to create the cluster in a single Availability Zone, which is usually acceptable for machine learning workloads.

If you already have a training dataset in Amazon S3, you can configure the bucket as a data repository for the Amazon FSx for Lustre file system, and the file system will be ready to use with the training dataset.

For dynamic provisioning, see the example in the Amazon FSx CSI driver project repository.
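
As a rough sketch of what that example looks like, dynamic provisioning replaces the static PersistentVolume with a StorageClass that the driver uses to create the file system on demand. The subnet ID, security group ID, bucket name, and deployment type below are placeholders, and the parameter names follow the driver's dynamic provisioning example; check the linked example for the full, version-specific list:

  cat << EOF > fsx-dynamic-sc.yaml
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: fsx-sc
  provisioner: fsx.csi.aws.com
  parameters:
    subnetId: <your_subnet_id>
    securityGroupIds: <your_security_group_id>
    deploymentType: SCRATCH_2
    s3ImportPath: s3://<your_training_data_bucket>
    s3ExportPath: s3://<your_training_data_bucket>/export
  EOF
  kubectl apply -f fsx-dynamic-sc.yaml

A claim that references this StorageClass then triggers creation of the file system:

  cat << EOF > fsx-dynamic-claim.yaml
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: fsx-dynamic-claim
    namespace: ${NAMESPACE}
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: fsx-sc
    resources:
      requests:
        storage: 1200Gi
  EOF
  kubectl apply -f fsx-dynamic-claim.yaml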
