Description
Is your feature request related to a problem? Please describe.
Recently our FSx volumes failed to mount in our application pods, and we could not ls the directory.
The cause was very unclear: in the AWS Console the FSx volume was not at capacity and showed no issues.
The CSI driver daemonset pod logged the following:
E0514 19:15:20.815598 1 driver.go:104] "GRPC error" err=<
rpc error: code = Internal desc = Could not mount "fs-<id>.fsx.us-west-2.amazonaws.com@tcp:/xmym3bev" at "/var/lib/kubelet/pods/b95c1daf-c469-4177-9113-0c73bab808b3/volumes/kubernetes.io~csi/<fsxname>/mount": mount failed: exit status 5
Mounting command: mount
Mounting arguments: -t lustre fs-<id>.fsx.us-west-2.amazonaws.com@tcp:/xmym3bev /var/lib/kubelet/pods/b95c1daf-c469-4177-9113-0c73bab808b3/volumes/kubernetes.io~csi/<fsxname>/mount
Output: mount.lustre: mount fs-<id>.fsx.us-west-2.amazonaws.com@tcp:/xmym3bev at /var/lib/kubelet/pods/b95c1daf-c469-4177-9113-0c73bab808b3/volumes/kubernetes.io~csi/<fsxname>/mount failed: Input/output error
Is the MGS running?
>
It would be great if the pod exposed a metric indicating that mounts were failing. With that metric, I could fire an alert to our SREs!
Eventually we rolled out the daemonset and deployment, which resolved the issue, but even that step wasn't in your troubleshooting guide.
Describe the solution you'd like in detail
Ideally, expose metrics that show health or problems through a Prometheus endpoint. We would like to build Prometheus queries and alerts that show whether the FSx CSI driver is healthy.
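To make the request concrete, here is a rough sketch of the kind of endpoint I have in mind. It is not based on the driver's actual code: the metric name fsx_csi_mount_failures_total and the helpers recordMountFailure, renderMetrics, and metricsHandler are all hypothetical, and it uses only the Go standard library rather than whatever metrics library the driver would actually choose.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// mountFailures counts failed mount attempts.
// Hypothetical: the driver does not necessarily track this today.
var mountFailures int64

// recordMountFailure would be called wherever a mount attempt returns an error.
func recordMountFailure() { atomic.AddInt64(&mountFailures, 1) }

// renderMetrics formats a counter value in the Prometheus text exposition format.
// The metric name is an assumption made for this sketch.
func renderMetrics(n int64) string {
	return "# HELP fsx_csi_mount_failures_total Failed volume mount attempts.\n" +
		"# TYPE fsx_csi_mount_failures_total counter\n" +
		fmt.Sprintf("fsx_csi_mount_failures_total %d\n", n)
}

// metricsHandler serves the current counter value to a Prometheus scraper.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	io.WriteString(w, renderMetrics(atomic.LoadInt64(&mountFailures)))
}

func main() {
	recordMountFailure() // simulate one failed mount like the one in the log above

	// httptest keeps this sketch self-terminating; a real driver would
	// register the handler on a long-lived listener instead.
	srv := httptest.NewServer(http.HandlerFunc(metricsHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```

With a counter like this, an alert could be built on an expression such as increase(fsx_csi_mount_failures_total[5m]) > 0, again assuming that metric name.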
Describe alternatives you've considered
I could also build a solution around parsing logs, but I would prefer to just have metrics. Prometheus metrics seem to be an industry standard.
Additional context
N/A