Is this a BUG REPORT or FEATURE REQUEST?:
This is a bug report, with a fix at #82.
What happened:
We began running into a situation where our ServiceAccount annotations would be set with eks.amazonaws.com/role-arn: "". After digging into the code, we discovered that the default case handler in the controller has this if statement that is problematic. I think its intent was to handle an intermittent failure in the READY handler, where the ServiceAccount was already setup properly... but what it fails to do is handle the situation where the IAMRole already exists.
Here is the rundown of events that can cause this failure:
- A new
Iamrole is created.. and the controller reconciliation starts.
- The
r.IAMClient.CreateRole function is called.
- The function sets
roleAlreadyExists=true and logs the error.
- Because
roleAlreadyExists, the code then skips populating the IAMRoleResponse object.
- The handler checks if
resp.RoleARN is set. It is not because we got an empty response object.
- The handler tries to use the previously known good state in
iamRole.Status.RoleARN ... but remember, this is a newly created Iamrole resource, so that is empty too.
- The
ServiceAccount is populated with an empty string annotation.
What you expected to happen:
I expect the code to realize that the IAM Role in AWS is indeed the one we are trying to use (likely previously created but perhaps the reconciliation loop did not complete, OR a fast create/delete was called on the Iamrole resource by a tool like ArgoCD, or any other number of intermittent situations). Either way, i expect eventual consistency to work out the issue.
How to reproduce it (as minimally and precisely as possible):
Pre-create an IAM Role... then try creating an Iamrole resource.
Anything else we need to know?:
While digging into this, I also noticed that there is no reconciliation that happens for the ServiceAccount annotation. I have a second PR (prepped at Nextdoor#4) that significantly revamps the k8s/rbac.go package to be more testable, and implements regular reconciliation of the ServiceAccount annotation. We've been running the combination of these two in production now for 2 weeks and we've had great success.
Is this a BUG REPORT or FEATURE REQUEST?:
This is a bug report, with a fix at #82.
What happened:
We began running into a situation where our
ServiceAccountannotations would be set witheks.amazonaws.com/role-arn: "". After digging into the code, we discovered that thedefaultcase handler in the controller has thisifstatement that is problematic. I think its intent was to handle an intermittent failure in theREADYhandler, where theServiceAccountwas already setup properly... but what it fails to do is handle the situation where the IAMRole already exists.Here is the rundown of events that can cause this failure:
Iamroleis created.. and the controller reconciliation starts.r.IAMClient.CreateRolefunction is called.roleAlreadyExists=trueand logs the error.roleAlreadyExists, the code then skips populating theIAMRoleResponseobject.resp.RoleARNis set. It is not because we got an empty response object.iamRole.Status.RoleARN... but remember, this is a newly createdIamroleresource, so that is empty too.ServiceAccountis populated with an empty string annotation.What you expected to happen:
I expect the code to realize that the IAM Role in AWS is indeed the one we are trying to use (likely previously created but perhaps the reconciliation loop did not complete, OR a fast create/delete was called on the Iamrole resource by a tool like ArgoCD, or any other number of intermittent situations). Either way, i expect eventual consistency to work out the issue.
How to reproduce it (as minimally and precisely as possible):
Pre-create an IAM Role... then try creating an Iamrole resource.
Anything else we need to know?:
While digging into this, I also noticed that there is no reconciliation that happens for the
ServiceAccountannotation. I have a second PR (prepped at Nextdoor#4) that significantly revamps thek8s/rbac.gopackage to be more testable, and implements regular reconciliation of theServiceAccountannotation. We've been running the combination of these two in production now for 2 weeks and we've had great success.