
kiam's People

Contributors

abursavich, ash2k, caiohasouza, cjbradfield, dewaldv, hoelzro, idiamond-stripe, integrii, joseph-irving, leosunmo, mattmb, max-lobur, mbarrien, michas2, mikesplain, mikhailadvani, moofish32, mwmix, njuettner, pingles, rbvigilante, rhysemmas, ripta, roboweaver, roffe, samb1729, sp-joseluis-ledesma, stefansedich, velothump, zytek


kiam's Issues

Record IAM errors as events

Although the Kiam server currently records when it's unable to retrieve credentials for a role (perhaps because the role doesn't exist, or isn't trusted by the node trying to AssumeRole), seeing those errors means digging through the Kiam logs.

It would be lovely to be able to see these errors reported as events for the Pod so that end-users can more easily track down issues.

'error warming credentials' but AWS CLI works

I've seen messages such as this one:

time="2017-06-02T12:21:36Z" level=error msg="error warming credentials: AccessDenied: Not authorized to perform sts:AssumeRole\n\tstatus code: 403, request id: 059aab77-478e-11e7-a540-2938881f4c35" generation.metadata=0 pod.iam.role="arn:aws:iam::XXXXXXXX:role/YYYYYY" pod.name=.... pod.namespace=.... pod.status.ip=10.2.28.39 pod.status.phase=Running resource.version=26173143 

I tried a simple aws sts assume-role --role-session-name test --role-arn arn:aws:iam::XXXXXXXX:role/YYYYYY on the node and in the kiam pod, and in both cases got credentials.

FWIW: I'm in the eu-west-1 region -- this might be due to an assumption somewhere about the region, but that's just a guess.

Support regexp/prefix match for `iam.amazonaws.com/role` annotation

Hi,

I'm using cloudformation to create my IAM roles. I can't give them stable names because cloudformation has to be able to dynamically generate the names if you want a way to update the roles (see docs).

Currently I need to apply my cfn template and copy the generated role names to my k8s manifests.

The generated names look like this: [stack-name]-[logical-resource-name]-[random-string], for example int-site-ExtDNSRole-1XGYNGL4GST7Y

It would be great if kiam supported prefix- or regexp-based matching, so I could use an annotation like this to allow my pod to assume this role: iam.amazonaws.com/role: "int-site-ExtDNSRole-.*"
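kiam doesn't support this today, but the agent-side check could be as small as an anchored regular-expression match. A minimal sketch in Go, assuming a hypothetical roleMatches helper (not kiam's actual API):

```go
package main

import (
	"fmt"
	"regexp"
)

// roleMatches reports whether a requested role satisfies the pattern
// from the annotation. The pattern is anchored so that
// "int-site-ExtDNSRole-.*" cannot match in the middle of a role name.
func roleMatches(pattern, role string) bool {
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return false // invalid pattern: fail closed
	}
	return re.MatchString(role)
}

func main() {
	fmt.Println(roleMatches("int-site-ExtDNSRole-.*", "int-site-ExtDNSRole-1XGYNGL4GST7Y")) // true
	fmt.Println(roleMatches("int-site-ExtDNSRole-.*", "some-other-role"))                   // false
}
```

Anchoring (and failing closed on a bad pattern) matters here, since a loose pattern would widen which roles a pod may assume.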

For kube-router users: interface is kube-bridge

Just putting this here; you might want to add it to the README: if you're using kube-router as the CNI, the default interface is kube-bridge.

Tested with kube-router on CoreOS container linux

Support for default roles

kube2iam supports assigning a default role to a pod that doesn't specify another role explicitly. From what I can see kiam is not able to do this.

My use case: I try to keep configuration to a minimum and rely a lot on default values. For that reason many of my pods that need no special permissions don't specify a role, and instead assume the environment will be set up nicely.
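A default-role fallback is straightforward to sketch. The resolveRole helper and the idea of a server-wide default below are assumptions for illustration, not existing kiam behaviour:

```go
package main

import "fmt"

const roleAnnotation = "iam.amazonaws.com/role"

// resolveRole returns the pod's annotated role, or the cluster-wide
// default when the annotation is absent or empty.
func resolveRole(annotations map[string]string, defaultRole string) string {
	if role, ok := annotations[roleAnnotation]; ok && role != "" {
		return role
	}
	return defaultRole
}

func main() {
	fmt.Println(resolveRole(map[string]string{roleAnnotation: "app-role"}, "default-role")) // app-role
	fmt.Println(resolveRole(nil, "default-role"))                                          // default-role
}
```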

High resource utilization when not using limits

We have kiam deployed on 3 clusters in AWS and didn't use resource limits. After a few days I started noticing that resource utilization was constantly increasing. Deleting the pods fixed the consumption for a few hours, but it then starts climbing again over time and never goes back down.

On 2 clusters we didn't use any roles yet, just deployed kiam, so no requests are made and the logs don't show anything unusual.
I'm now setting limits to see how it goes, but I don't understand why it needs this amount of CPU.

Here are two screenshots using kubectl top pods and top inside the container. [Two screenshots from 2018-05-25 omitted.]

Document / demo Kops deployment

Kops applies some taints to masters that change the default deployment manifests in ./deploy. It'd be nice to have an easier/simpler Kops example that people could run to demo kiam working.

Helm charts

It'd be nice to make it really easy for people to deploy with Helm. I suspect a prerequisite for this would be #29.

add rbac

The deploy files don't work on an RBAC-enabled cluster. From the debug logs, it looks like it mostly just needs to do GETs on pods and namespaces.

Will try and figure out the minimum set and PR a file, just trying to track the things I had to do to get this running. :)

Prefetching credentials logs errors for roles for which sts:AssumeRole is not given

This is more of a discussion issue than an actual bug (from my understanding).

What I have is a cluster with a couple of worker nodes, each of which has the "worker" role. Additionally there is a controlplane node which only runs the api server and some very few critical pods from the kube-system namespace. This node has the "controlplane" role. This is what https://github.com/kubernetes-incubator/kube-aws produces in a nutshell.

A kiam pod is put on all these nodes, which seems reasonable: it's in the kube-system NS, and it should also provide credentials for those systems pods if needed -- especially when we end up with a default role (see #1).

With all that what I see is a lot of errors in the kiam pod on the controlplane node, caused by missing permissions for the controlplane role to assume roles of pods that are supposed to only run on the worker nodes.

I'm wondering what the "story" here should be:

  1. Document the behavior somehow, and essentially ignore the errors -- after all these pods will never appear, and therefore the failures in prefetching are irrelevant
  2. Somehow avoid querying these roles
  3. Silence the errors in some way?

deploy/server.yaml healthchecks failing

Deployed to a 1.8 kops cluster. Generated Server and Client TLS certs using CFSSL with localhost and kiam-server as server hosts, and put these in as a secret with the ca.pem

Agent starts fine and I can see the health check in the agent's logs (a bit annoying, but it lets me know it thinks it's alive), but the server just starts up then shuts down without ever being marked as ready.

A describe on the pod tells me the shutdown is because of a liveness fail and this is spammed:

time="2017-12-21T11:32:24Z" level=warning msg="error checking health: rpc error: code = Unavailable desc = there is no connection available"

Removing the healthchecks appears to allow the server to run, and I see some logs about it finding annotations, but I've yet to achieve an assumed role, so it's possible I've done something wrong. :)

Kiam 2.7 doesn't build

Steps:
git clone git@github.com:uswitch/kiam.git --branch 2.7-deploy
cd kiam
make

When running make in the project root, this is the output:

# github.com/uswitch/kiam/pkg/server
pkg/server/server.go:187: cannot use server (type *KiamServer) as type kiam.KiamServiceServer in argument to ServerWithTelemetry:
	*KiamServer does not implement kiam.KiamServiceServer (wrong type for GetHealth method)
		have GetHealth("context".Context, *kiam.GetHealthRequest) (*kiam.HealthStatus, error)
		want GetHealth("github.com/uswitch/kiam/vendor/golang.org/x/net/context".Context, *kiam.GetHealthRequest) (*kiam.HealthStatus, error)
pkg/server/telemetry.go:30: cannot use TelemetryClient literal (type *TelemetryClient) as type kiam.KiamServiceClient in return argument:
	*TelemetryClient does not implement kiam.KiamServiceClient (wrong type for GetHealth method)
		have GetHealth("context".Context, *kiam.GetHealthRequest, ...grpc.CallOption) (*kiam.HealthStatus, error)
		want GetHealth("github.com/uswitch/kiam/vendor/golang.org/x/net/context".Context, *kiam.GetHealthRequest, ...grpc.CallOption) (*kiam.HealthStatus, error)
pkg/server/telemetry.go:71: cannot use TelemetryServer literal (type *TelemetryServer) as type kiam.KiamServiceServer in return argument:
	*TelemetryServer does not implement kiam.KiamServiceServer (wrong type for GetHealth method)
		have GetHealth("context".Context, *kiam.GetHealthRequest) (*kiam.HealthStatus, error)
		want GetHealth("github.com/uswitch/kiam/vendor/golang.org/x/net/context".Context, *kiam.GetHealthRequest) (*kiam.HealthStatus, error)
make: *** [bin/agent] Error 2

kiam-server failing with credential error

I have deployed the kiam server/agent setup in k8s and it is now failing with the error below. The kiam-server process is not running on the k8s masters; I changed the tolerations and nodeSelector to run it on a different set of node groups. IAM role permissions are also configured for the new node groups.

{"level":"info","msg":"starting server","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"will serve on 0.0.0.0:443","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"starting credential manager process 0","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"starting credential manager process 1","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"starting credential manager process 2","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"starting credential manager process 3","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"starting credential manager process 4","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"starting credential manager process 5","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"starting credential manager process 6","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"starting credential manager process 7","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"started cache controller","time":"2018-05-29T07:16:18Z"}
{"level":"info","msg":"started namespace cache controller","time":"2018-05-29T07:16:18Z"}
{"level":"error","msg":"error requesting credentials: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors","pod.iam.role":"arn:aws:iam::666578924526:role/demo_k8s_default_role","time":"2018-05-29T07:16:39Z"}
{"generation.metadata":0,"level":"error","msg":"error warming credentials: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors","pod.iam.role":"arn:aws:iam::666578924526:role/demo_k8s_default_role","pod.name":"external-dns-87d9c58b6-b66qg","pod.namespace":"kube-system","pod.status.ip":"10.34.0.19","pod.status.phase":"Running","resource.version":"11111090","time":"2018-05-29T07:16:39Z"}
{"level":"info","msg":"stopping server","time":"2018-05-29T07:16:48Z"}
{"level":"info","msg":"stopping prometheus metric listener","time":"2018-05-29T07:16:48Z"}
{"level":"info","msg":"stopped","time":"2018-05-29T07:16:48Z"}

Validate agent certificate common name in server

To make it easier to use cluster-wide CAs it would be good to check whether the common name of the agent certificate is authorised by the server.

I'm not super familiar with gRPC client libs but this appears to be the kind of thing we'd need to do (as well as adding a configuration flag to control what common-name to approve).

grpc/grpc-go#111 (comment)

I think this kind of thing would likely be a pre-requisite to using Vault, cert-manager or the Certificate API in Kubernetes to manage TLS (rather than what's described in docs/TLS.md currently).

Proxy unhandled APIs

When using kiam, the complete metadata service on 169.254.169.254 is replaced, which breaks anything that depends on it -- usually a lot.

Instead the server should proxy all unhandled requests through to the original service.
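With Go's standard library this is mostly httputil.ReverseProxy plus a path check. A sketch under assumed names (handledByKiam, newMetadataHandler are illustrative, not kiam's actual structure):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// handledByKiam reports whether a metadata path is one kiam answers
// itself; everything else would be forwarded upstream.
func handledByKiam(path string) bool {
	return strings.HasPrefix(path, "/latest/meta-data/iam/")
}

// newMetadataHandler serves IAM paths with kiam's own handler and
// proxies all unhandled requests to the real metadata service.
func newMetadataHandler(upstream *url.URL, iam http.Handler) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if handledByKiam(r.URL.Path) {
			iam.ServeHTTP(w, r)
			return
		}
		proxy.ServeHTTP(w, r)
	})
}

func main() {
	upstream, _ := url.Parse("http://169.254.169.254")
	handler := newMetadataHandler(upstream, http.NotFoundHandler())
	_ = handler // wire into http.ListenAndServe in a real agent
	fmt.Println(handledByKiam("/latest/meta-data/iam/security-credentials/")) // true
	fmt.Println(handledByKiam("/latest/meta-data/hostname"))                  // false
}
```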

Unable to assign role due to "pod not found" error

This issue resembles #46, and might be essentially the same one.

We switched to kiam from kube2iam a couple of weeks ago. We're running most of our kubernetes nodes on spot instances, so they are replaced pretty frequently. When a node is drained before termination, its pods are scheduled on other nodes, and sometimes the application they're running starts throwing the following error:

unable to sign request without credentials set

On the agent side I see the following errors:

{
    "addr": "100.96.143.201:37862",
    "level": "error",
    "method": "GET",
    "msg": "error processing request: rpc error: code = Unknown desc = pod not found",
    "path": "/latest/meta-data/iam/security-credentials/",
    "status": 500,
    "time": "2018-03-15T18:59:02Z"
}

And those are errors on the server side:

{
    "level": "error",
    "msg": "error finding pod: pod not found",
    "pod.ip": "100.96.143.201",
    "time": "2018-03-15T18:59:01Z"
}

It looks like we have a delay between the time a pod is scheduled and the time information about it becomes available to KIAM. If I delay the startup of the app for a couple of seconds it fixes the problem. Deleting the problematic pod and letting kubernetes reschedule it does the trick as well.

After looking into your code it seems that increasing the prefetch-buffer-size value might help, because the issue mostly happens when many pods are scheduled at the same time. But maybe I'm missing something.

Any advice would be greatly appreciated.

P.S.: We're using kiam:v2.5 and kubernetes v1.8.6.

Unable to fetch credentials using kube-bridge host interface

We are running kube-router for the k8s networking. We moved away from Calico.

We changed the host-interface to kube-bridge. (Ref: jtblin/kube2iam#120).

Logs of kiam-agent :

{"addr":"10.2.10.2:45980","level":"error","method":"GET","msg":"error processing request: rpc error: code = Unavailable desc = there is no address available","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2018-04-19T13:59:14Z"}
{"addr":"10.2.10.2:45980","headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2018-04-19T13:59:14Z"}

kiam-master was retrieving the credentials successfully. The kiam-master and kiam-server are running in different subnets.

Please let me know if you need any other information.

failed to load system roots and no roots provided - TLS error

{"generation.metadata":0,"level":"error","msg":"error warming credentials: RequestError: send request failed\ncaused by: Post https://sts.amazonaws.com/: x509: failed to load system roots and no roots provided","pod.iam.role":"arn:aws:iam::###########:role/chrisiamtest1","pod.name":"aws-cli3","pod.namespace":"default","pod.status.ip":"100.112.74.130","pod.status.phase":"Running","resource.version":"4849725","time":"2018-02-14T17:59:54Z"}

{"generation.metadata":0,"level":"error","msg":"error warming credentials: RequestError: send request failed\ncaused by: Post https://sts.amazonaws.com/: x509: failed to load system roots and no roots provided","pod.iam.role":"chrisiamtest1","pod.name":"aws-cli","pod.namespace":"default","pod.status.ip":"100.112.74.129","pod.status.phase":"Running","resource.version":"4849455","time":"2018-02-14T17:59:54Z"}

{"level":"error","msg":"error requesting credentials: RequestError: send request failed\ncaused by: Post https://sts.amazonaws.com/: x509: failed to load system roots and no roots provided","pod.iam.role":"arn:aws:iam::############:role/chrisiamtest1","time":"2018-02-14T18:00:54Z"}

I've tried with just using the role name and the full ARN in the pod deployment.
Can someone help me understand what this error means?
Is there documentation on how to specify the base-arn or is autodetect the best solution?

Kiam doesn't work with the aws-go-sdk on `m5/c5` instances

First of all, thanks a lot for the great work on kiam. Providing the ability to grant IAM roles to pods instead of nodes is a big security improvement when deploying Kubernetes clusters atop AWS :)

I came across an issue with the Golang AWS SDK today though: somehow, my Golang pods were trying to assume the role given to their underlying EC2 instance instead of using the one I was providing in the annotation.

I think I found the cause of the issue: the AWS SDK in Go calls the /iam/security-credentials/ endpoint without the trailing slash and, therefore, kiam doesn't intercept the request and passes it to the "real" instance metadata API endpoint (https://github.com/aws/aws-sdk-go/blob/db3e1e27b1ace4fc57be9c5cf7cea0566bd12034/aws/credentials/ec2rolecreds/ec2_role_provider.go#L128).

I made a PR to make the trailing slash optional in gorilla/mux route configuration, it seems to fix the issue: #42 .
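The matching fix amounts to treating the trailing slash as optional on the credentials path. A sketch of that logic, assuming a hypothetical credentialsPath helper (the actual PR changed the gorilla/mux route configuration instead):

```go
package main

import (
	"fmt"
	"strings"
)

// credentialsPath reports whether a request path targets the
// security-credentials endpoint, with the trailing slash optional so
// clients like aws-sdk-go that omit it are still intercepted.
func credentialsPath(path string) bool {
	const prefix = "/latest/meta-data/iam/security-credentials"
	return path == prefix || strings.HasPrefix(path, prefix+"/")
}

func main() {
	fmt.Println(credentialsPath("/latest/meta-data/iam/security-credentials"))          // true
	fmt.Println(credentialsPath("/latest/meta-data/iam/security-credentials/"))         // true
	fmt.Println(credentialsPath("/latest/meta-data/iam/security-credentials/app-role")) // true
	fmt.Println(credentialsPath("/latest/meta-data/hostname"))                          // false
}
```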

Issues with IAM role access after a rolling update of the k8s cluster

I am using kiam v2.6 with a kops-managed Kubernetes 1.9.5 cluster in AWS. Whenever a rolling update is performed (kops rolling-update), some pods have issues accessing their respective IAM roles.

From the agent error log messages it seems that the mapping between the IAM roles and pods is lost. The affected pod is trying to access one IAM role, but Kiam believes it is annotated with another IAM role.

{
    "addr":"100.124.0.4:36284",
    "level":"error",
    "method":"GET",
    "msg":"error processing request: assume role forbidden: requested 'iam-role-one' but annotated with 'iam-role-two', forbidden",
    "path":"/latest/meta-data/iam/security-credentials/iam-role-one",
    "status":403,
    "time":"2018-03-22T19:45:29Z"
}

The workaround is to restart all kiam servers after which the correct mapping is restored. I was able to replicate the issue twice in a row.

The server is started with the following settings:

        - --json-log
        - --level=info
        - --bind=0.0.0.0:443
        - --cert=/etc/kiam/tls/server.pem
        - --key=/etc/kiam/tls/server-key.pem
        - --ca=/etc/kiam/tls/ca.pem
        - --role-base-arn-autodetect
        - --sync=1m
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s

while the agent with:

        - --level=info
        - --iptables
        - --host-interface=weave+
        - --json-log
        - --port=8181
        - --cert=/etc/kiam/tls/agent.pem
        - --key=/etc/kiam/tls/agent-key.pem
        - --ca=/etc/kiam/tls/ca.pem
        - --server-address=kiam-server:443
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s

Please let me know if you need any other information.

Regards,

Dimitar

doc: switching from kube2iam to kiam

Would it be possible to switch to kiam without actually bringing the cluster down? I'm confused about the part where the DaemonSets are created from the example manifests. Since the annotation style used is backward compatible with kube2iam, how would you suggest going about this?

Thanks

Make it possible to run the agent and server on the same host without using hostNetworking on the server

Today we tried to get the agent and the server running on the same host, but saw that the server's own requests for credentials were getting redirected because they come through a cali+ interface. With hostNetworking enabled on the server the PREROUTING rule doesn't match, so the requests go through, but it would be nice not to have to use hostNetworking to get this to work.

DaemonSet kiam-server has zero pods

When deploying kiam using the yaml files in /deploy, both DaemonSets are created, but only the kiam-agent DaemonSet actually has any pods. kiam-server has no pods at all, there are no failures and no logs.

Kubernetes version is 1.7.4.

Do you have any pointers on how to deploy this properly?

Allow configurable cache TTL

According to the AWS documentation, AssumeRole has a minimum session duration of 15 minutes. Kiam's server currently expires credentials from its cache after 10 minutes; if a pod that needs the credentials is still running, they'll be refreshed ahead of the client needing them.

#41 introduced the ability to generate credentials with customised durations (longer than the 15 minute default) but the AWS API is still called every 10 minutes to request credentials.

For users with large numbers of roles/lots of AWS calls it'd be nice to change Kiam so that it expires credentials from the cache relative to their session duration: if the session expires in an hour the cache should evict the credentials at 55 minutes.

As an extension, we could support an additional annotation that would allow users to specify the session duration (#38 also mentions adding session names).

Add request ID to agent handlers

Can detect the envoy headers if set, otherwise add separate request id and log with all messages- make it easier to locate the log data around retries/failures for individual request.
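A sketch of the idea with Go's standard library, assuming hypothetical ensureRequestID/withRequestID helpers and the X-Request-Id header name (net/http canonicalizes header keys, so envoy's lowercase x-request-id is still found):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

const requestIDHeader = "X-Request-Id"

// ensureRequestID reuses an incoming ID (e.g. one set by envoy) or
// generates a fresh random one.
func ensureRequestID(existing string) string {
	if existing != "" {
		return existing
	}
	buf := make([]byte, 8)
	rand.Read(buf)
	return hex.EncodeToString(buf)
}

// withRequestID stamps every request and response with an ID so all
// log lines for one request (including retries) can be correlated.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := ensureRequestID(r.Header.Get(requestIDHeader))
		r.Header.Set(requestIDHeader, id)
		w.Header().Set(requestIDHeader, id)
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(ensureRequestID("abc123")) // abc123
	fmt.Println(len(ensureRequestID("")))  // 16
}
```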

Misleading error message when role doesn't exist

If the kiam server attempts to assume a role that doesn't exist, the error message is currently reported as:

AccessDenied: Not authorized to perform sts:AssumeRole

It'd be nice to make it clearer that it failed because the role doesn't exist, rather than a trust policy issue etc.

Support all "API versions" for the IAM parts

Currently kiam only reacts to the 'latest' API version, but there are more. It seems reasonable to be able to whitelist API versions that work, and to report errors for all kiam-handled paths on unsupported API versions (to prevent leaking the node's own credentials).
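A whitelist check could sit in front of the IAM handlers. A sketch, assuming a hypothetical allowIAMRequest helper; the version set shown is purely illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// supportedVersions is an illustrative whitelist; a real one would
// come from configuration.
var supportedVersions = map[string]bool{"latest": true, "2016-09-02": true}

// allowIAMRequest extracts the version segment from
// "/<version>/meta-data/..." and checks it against the whitelist, so
// unknown versions are rejected instead of leaking node credentials.
func allowIAMRequest(path string) bool {
	parts := strings.SplitN(strings.TrimPrefix(path, "/"), "/", 2)
	return len(parts) == 2 && supportedVersions[parts[0]]
}

func main() {
	fmt.Println(allowIAMRequest("/latest/meta-data/iam/security-credentials/")) // true
	fmt.Println(allowIAMRequest("/2014-02-25/meta-data/iam/info"))              // false
}
```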

Health check fails

The health check of the kiam-server is failing with the following

/etc/kiam/tls # /health --key server-key.pem --ca ca.pem --cert server.pem --server-address=localhost:443
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no address available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
WARN[0000] error checking health: rpc error: code = Unavailable desc = there is no connection available
FATA[0001] error retrieving health: rpc error: code = Unavailable desc = there is no connection available

Is there something else required to make the server work?

Namespace restrictions

We run clusters for multiple different teams. We'd like to use an annotation on the Namespace with a regular expression to control which roles can be assumed by pods in that namespace.

For example:

kind: Namespace
metadata:
  name: ecommerce
  annotations:
    iam.amazonaws.com/allowed-roles: '["^ecommerce.*"]'

It means that cluster operators don't need to modify the namespace each time the user wants to add a new role but also gives users control over where their roles can be used.
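On the server side this boils down to compiling the namespace's expression and matching the requested role against it. A sketch, assuming a hypothetical namespacePermitsRole helper that fails closed for unannotated namespaces (kiam's eventual implementation may differ):

```go
package main

import (
	"fmt"
	"regexp"
)

// namespacePermitsRole checks a requested role against the regular
// expression from the namespace annotation. An empty or invalid
// expression permits nothing, so unannotated namespaces fail closed.
func namespacePermitsRole(allowedRegexp, role string) bool {
	if allowedRegexp == "" {
		return false
	}
	re, err := regexp.Compile(allowedRegexp)
	if err != nil {
		return false
	}
	return re.MatchString(role)
}

func main() {
	fmt.Println(namespacePermitsRole("^ecommerce.*", "ecommerce-payments")) // true
	fmt.Println(namespacePermitsRole("^ecommerce.*", "admin-role"))         // false
	fmt.Println(namespacePermitsRole("", "anything"))                       // false
}
```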

Handle absolute ARNs in the role annotation

Currently kiam assumes that the IAM annotation would be something that can be literally appended to the given "base ARN". At the same time it also assumes that the value of the role annotation should be exposed on the http interface.

Using a full ARN as the role breaks things: one can set the base ARN to '' without issue, but the http interface then returns weird values (and from what I can see it breaks things down the road when the values contain "illegal characters" such as ':' and '/').

One would want absolute ARNs in some (esoteric) cross-account situations; in our case we also keep these role definitions in a separate file that should have them fully qualified for auditing purposes.
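Handling this amounts to detecting a fully-qualified ARN before prepending the base ARN. A sketch, assuming a hypothetical qualifyRole helper (the "aws" partition is hardcoded for brevity):

```go
package main

import (
	"fmt"
	"strings"
)

// qualifyRole leaves a fully-qualified ARN untouched and prefixes a
// bare role name with the base ARN, so cross-account roles can be
// annotated directly.
func qualifyRole(baseARN, role string) string {
	if strings.HasPrefix(role, "arn:aws:iam::") {
		return role
	}
	return baseARN + role
}

func main() {
	base := "arn:aws:iam::123456789012:role/"
	fmt.Println(qualifyRole(base, "my-role"))
	fmt.Println(qualifyRole(base, "arn:aws:iam::210987654321:role/other-account-role"))
}
```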

Replace metrics with native Prometheus

Given Prometheus as the standard means for collecting custom metrics within Kubernetes clusters, and the issues around exporting the go-metrics ones, it'd be prudent to overhaul and replace them with native prom equivalents.

I don't think we'd aim for naming parity so probably sensible to just replace according to the naming scheme the Prometheus authors recommend and fold into a major version release.

Unknown health check type None in docker logs

Was going through the docker journald logs on one of the nodes where kiam-server is scheduled, saw something like

Mar 29 07:15:09 ip-10-1-71-111 env[841]: time="2018-03-29T07:15:09.037099505Z" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container 24ff319a210f5cf9a1908105647d7c1873bd42a91a4a8c77184dbf8c59031d8d"
Mar 29 07:18:29 ip-10-1-71-111 env[841]: time="2018-03-29T07:18:29.922256047Z" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container 4c1ebaf2ca9c0c9d2f2a4cdb89795c02a2acf7a61297d333f3fe573385e847f8"

The kiam server which went into CrashLoopBackOff had logs

{"level":"info","msg":"starting server","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"will serve on 0.0.0.0:443","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"started cache controller","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"started namespace cache controller","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"starting credential manager process 0","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"starting credential manager process 1","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"starting credential manager process 2","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"starting credential manager process 3","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"starting credential manager process 4","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"starting credential manager process 5","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"starting credential manager process 6","time":"2018-03-29T07:18:29Z"}
{"level":"info","msg":"starting credential manager process 7","time":"2018-03-29T07:18:29Z"}

The server manifest's liveness and readiness probes:

          livenessProbe:
            exec:
              command:
              - /health
              - --cert=/etc/kiam/tls/kiam-server.crt
              - --key=/etc/kiam/tls/kiam-server.key
              - --ca=/etc/kiam/tls/kiam-ca.crt
              - --server-address=localhost:443
              - --server-address-refresh=2s
              - --timeout=5s
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command:
              - /health
              - --cert=/etc/kiam/tls/kiam-server.crt
              - --key=/etc/kiam/tls/kiam-server.key
              - --ca=/etc/kiam/tls/kiam-ca.crt
              - --server-address=localhost:443
              - --server-address-refresh=2s
              - --timeout=5s
            initialDelaySeconds: 3
            periodSeconds: 10
            timeoutSeconds: 10

Recommended way to run agent alongside server on some nodes

What is the current recommended way to run the agent alongside the servers on any hosts the servers run on? The current deploy splits servers/agents onto the controllers/workers respectively. Is it just a case of removing the node selector from the agent deploy? I've tried that, but I'm experiencing some issues with pods getting the node's credentials. Not sure if that's something I've configured wrong, so I wanted to ask the recommendation before reporting.

The use case is a daemon set we have which runs across all nodes and uses an IAM role. We have a few of those, one is fluentd which reports to kinesis.

Slightly related to #39 although I don't mind using the host networking.

Unsure on how to assume role (or debug this not working)

So I've set KIAM up and can see the server process logging my attempts at fetching credentials, but the credentials never make it to the end container.

I have the following jenkinsci/kubernetes-plugin pipeline (it's not really important, just for context):

podTemplate(label: 'mypod', cloud: 'kubernetes', containers: [
    containerTemplate(
        name: 'awscli',
        image: 'governmentpaas/awscli',
        ttyEnabled: true,
        command: 'cat',
    ),
],
annotations: [
    podAnnotation(key: "iam.amazonaws.com/role", value: "kiam-test")
]) {
  node('mypod') {
    stage("test") {
        container('awscli') {
          sh """
            aws --debug --region eu-west-2 ec2 describe-instances
          """
        }
      }
    }
}

Which in turn ultimately returns the following error message:

NoCredentialsError: Unable to locate credentials
Unable to locate credentials. You can configure credentials by running "aws configure".

My server process looks like this:

{"level":"info","msg":"starting server","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"will serve on 0.0.0.0:443","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"started cache controller","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"started namespace cache controller","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"starting credential manager process 0","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"starting credential manager process 1","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"starting credential manager process 2","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"starting credential manager process 3","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"starting credential manager process 4","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"starting credential manager process 5","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"starting credential manager process 6","time":"2017-12-21T12:56:04Z"}
{"level":"info","msg":"starting credential manager process 7","time":"2017-12-21T12:56:04Z"}
{"credentials.access.key":"BLAHBLAHBLAHBLAH","credentials.expiration":"2017-12-21T13:13:56Z","credentials.role":"kiam-test","level":"info","msg":"requested new credentials","time":"2017-12-21T12:58:56Z"}
{"credentials.access.key":"BLAHBLAHBLAHBLAH","credentials.expiration":"2017-12-21T13:13:56Z","credentials.role":"kiam-test","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"kiam-test","pod.name":"jenkins-slave-xf919-ftc1q","pod.namespace":"infra-ops","pod.status.ip":"","pod.status.phase":"Pending","resource.version":"15184546","time":"2017-12-21T12:58:56Z"}
{"credentials.access.key":"BLAHBLAHBLAHBLAH","credentials.expiration":"2017-12-21T13:13:56Z","credentials.role":"kiam-test","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"kiam-test","pod.name":"jenkins-slave-xf919-ftc1q","pod.namespace":"infra-ops","pod.status.ip":"","pod.status.phase":"Pending","resource.version":"15184547","time":"2017-12-21T12:58:56Z"}
{"credentials.access.key":"BLAHBLAHBLAHBLAH","credentials.expiration":"2017-12-21T13:13:56Z","credentials.role":"kiam-test","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"kiam-test","pod.name":"jenkins-slave-xf919-ftc1q","pod.namespace":"infra-ops","pod.status.ip":"","pod.status.phase":"Pending","resource.version":"15184548","time":"2017-12-21T12:58:56Z"}
{"credentials.access.key":"BLAHBLAHBLAHBLAH","credentials.expiration":"2017-12-21T13:13:56Z","credentials.role":"kiam-test","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"kiam-test","pod.name":"jenkins-slave-xf919-ftc1q","pod.namespace":"infra-ops","pod.status.ip":"100.109.134.93","pod.status.phase":"Running","resource.version":"15184585","time":"2017-12-21T12:59:00Z"}
{"credentials.access.key":"BLAHBLAHBLAHBLAH","credentials.expiration":"2017-12-21T13:13:56Z","credentials.role":"kiam-test","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"kiam-test","pod.name":"jenkins-slave-xf919-ftc1q","pod.namespace":"infra-ops","pod.status.ip":"100.109.134.93","pod.status.phase":"Running","resource.version":"15184585","time":"2017-12-21T12:59:04Z"}
{"credentials.access.key":"BLAHBLAHBLAHBLAH","credentials.expiration":"2017-12-21T13:13:56Z","credentials.role":"kiam-test","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"kiam-test","pod.name":"jenkins-slave-xf919-ftc1q","pod.namespace":"infra-ops","pod.status.ip":"100.109.134.93","pod.status.phase":"Running","resource.version":"15184604","time":"2017-12-21T12:59:05Z"}
{"credentials.access.key":"BLAHBLAHBLAHBLAH","credentials.expiration":"2017-12-21T13:13:56Z","credentials.role":"kiam-test","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"kiam-test","pod.name":"jenkins-slave-xf919-ftc1q","pod.namespace":"infra-ops","pod.status.ip":"","pod.status.phase":"Pending","resource.version":"15184715","time":"2017-12-21T12:59:37Z"}
{"credentials.access.key":"BLAHBLAHBLAHBLAH","credentials.expiration":"2017-12-21T13:13:56Z","credentials.role":"kiam-test","generation.metadata":0,"level":"info","msg":"fetched credentials","pod.iam.role":"kiam-test","pod.name":"jenkins-slave-xf919-ftc1q","pod.namespace":"infra-ops","pod.status.ip":"","pod.status.phase":"Pending","resource.version":"15184715","time":"2017-12-21T13:00:04Z"}

Most of my agents aren't saying much, but that's fine. This one, however, is returning 500s:

{"level":"info","msg":"configuring iptables","time":"2017-12-20T22:59:46Z"}
{"level":"info","msg":"listening :8181","time":"2017-12-20T22:59:46Z"}
{"level":"warning","msg":"error finding role for pod: rpc error: code = Unavailable desc = there is no connection available","pod.ip":"100.109.134.92","time":"2017-12-21T13:03:18Z"}
{"level":"warning","msg":"error finding role for pod: rpc error: code = Unavailable desc = there is no connection available","pod.ip":"100.109.134.92","time":"2017-12-21T13:03:18Z"}
{"level":"warning","msg":"error finding role for pod: rpc error: code = Unavailable desc = there is no connection available","pod.ip":"100.109.134.92","time":"2017-12-21T13:03:18Z"}
{"addr":"100.109.134.92:34232","level":"error","method":"GET","msg":"error processing request: rpc error: code = Unavailable desc = there is no connection available","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2017-12-21T13:03:19Z"}
{"addr":"100.109.134.92:34232","headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2017-12-21T13:03:19Z"}

I assume the 500 is the problem. Any hints on what it's trying to do?

Security Context on the Agent

I am noticing

securityContext:
  privileged: true

Could the container run without privileged if it did not set up the DNAT rule with iptables? Or is there another permission set that could be used with fewer root-like privileges?
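For comparison, a hedged sketch of what a capability-scoped security context might look like, on the untested assumption that NET_ADMIN alone is sufficient for the iptables DNAT rule:

```yaml
securityContext:
  capabilities:
    add:
      - NET_ADMIN
```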

Deployment with Istio

Instead of using the gRPC library for load-balancing it would be interesting to see how well/easy Kiam works within an Istio mesh.

Testing how easy it is to support Istio's distributed tracing would be cool- I guess most of this work would probably exist within client libraries rather than kiam itself.

--role-base-arn-autodetect does not work when instance profile has an IAM path other than '/'

Current code in question:

// instance profile arn will be of the form:
// arn:aws:iam::account-id:instance-profile/role-name
// so we use the instance-profile prefix as the prefix for our roles
parts := strings.Split(strings.Replace(info.InstanceProfileArn, "instance-profile", "role", 1), "/")
if len(parts) != 2 {
	return "", fmt.Errorf("unexpected instance arn format: %s", info.InstanceProfileArn)
}

It is quite possible for the ARN to have more than one / in it due to having an IAM path other than /. For example arn:aws:iam::account-id:instance-profile/mypath/role-name.

This should be accounted for in the split, or perhaps it may be more prudent to just get the current account ID and interpolate that into arn:aws:iam::%s:role.
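To illustrate the split-based fix, here is a minimal sketch using strings.SplitN so an IAM path survives inside the role portion instead of tripping the length check (hypothetical helper, not kiam's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// baseARN derives the role-ARN prefix from an instance-profile ARN while
// tolerating IAM paths (e.g. .../instance-profile/mypath/role-name).
// Hypothetical helper, not kiam's actual implementation.
func baseARN(instanceProfileArn string) (string, error) {
	arn := strings.Replace(instanceProfileArn, "instance-profile", "role", 1)
	// SplitN keeps any IAM path inside the second element instead of
	// failing the len(parts) != 2 check.
	parts := strings.SplitN(arn, "/", 2)
	if len(parts) != 2 {
		return "", fmt.Errorf("unexpected instance arn format: %s", instanceProfileArn)
	}
	return parts[0] + "/", nil
}

func main() {
	b, _ := baseARN("arn:aws:iam::123456789012:instance-profile/mypath/role-name")
	fmt.Println(b) // arn:aws:iam::123456789012:role/
}
```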

Customizing suffix of the assumed role ARN for auditing purpose?

Hi! Thanks for starting & maintaining this great project 👍

I recently realized that running get-caller-identity with the kiam-provided AWS credentials produce:

$ aws sts get-caller-identity --region ap-northeast-1
{
    "Account": "$myaccountid",
    "UserId": "AROAICVHQ4GZUSQIQRRHY:kiam-kiam",
    "Arn": "arn:aws:sts::$myaccountid:assumed-role/my-k8s-service-role/kiam-kiam"
}

my-k8s-service-role is from the pod annotation and kiam-kiam seems to be coming from kiam.

I couldn't find whether it is hard-coded into kiam or not.
Anyway, it would be great if I could configure that part to e.g. kiam-$mycluster-$ns-$pod_name, so that it reads kiam-mycluster-kube-system-mytestpod, which would add more traceability via CloudTrail logs. More concretely, it would be nice to be able to trace from CloudTrail which pod, in which cluster and namespace, called which AWS API.
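As a sketch of what building such a session name might look like, assuming the kiam-$mycluster-$ns-$pod_name scheme above and AWS's RoleSessionName constraints (at most 64 characters from the set [\w+=,.@-]); the helper name is hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

// invalid matches characters AWS does not allow in a RoleSessionName.
var invalid = regexp.MustCompile(`[^\w+=,.@-]`)

// sessionName builds a CloudTrail-friendly RoleSessionName from cluster,
// namespace and pod, sanitizing disallowed characters and truncating to
// AWS's 64-character limit. Hypothetical sketch, not kiam's implementation.
func sessionName(cluster, namespace, pod string) string {
	name := invalid.ReplaceAllString(
		fmt.Sprintf("kiam-%s-%s-%s", cluster, namespace, pod), "-")
	if len(name) > 64 {
		name = name[:64]
	}
	return name
}

func main() {
	fmt.Println(sessionName("mycluster", "kube-system", "mytestpod"))
	// kiam-mycluster-kube-system-mytestpod
}
```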

unable to assign pod the iam role

I checked the kiam-agent logs on the node where the pod (which was to be assigned the IAM role) was scheduled; they look like:

{"addr":"10.2.40.2:34938","headers":{},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":200,"time":"2018-03-15T07:02:32Z"}
{"level":"warning","msg":"error finding role for pod: rpc error: code = Unknown desc = pod not found","pod.ip":"10.2.40.2","time":"2018-03-15T07:02:32Z"}
{"addr":"10.2.40.2:34940","level":"error","method":"GET","msg":"error processing request: error checking assume role permissions: rpc error: code = Unknown desc = pod not found","path":"/latest/meta-data/iam/security-credentials/my-app-iam-role","status":500,"time":"2018-03-15T07:02:32Z"}
{"addr":"10.2.40.2:34940","headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/my-app-iam-role","status":500,"time":"2018-03-15T07:02:32Z"}
{"addr":"10.2.40.2:34942","headers":{},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":200,"time":"2018-03-15T07:02:33Z"}
{"level":"warning","msg":"error finding role for pod: rpc error: code = Unknown desc = pod not found","pod.ip":"10.2.40.2","time":"2018-03-15T07:02:33Z"}
{"addr":"10.2.40.2:34944","level":"error","method":"GET","msg":"error processing request: error checking assume role permissions: rpc error: code = Unknown desc = pod not found","path":"/latest/meta-data/iam/security-credentials/my-app-iam-role","status":500,"time":"2018-03-15T07:02:33Z"}
{"addr":"10.2.40.2:34944","headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/my-app-iam-role","status":500,"time":"2018-03-15T07:02:33Z"}
{"addr":"10.2.40.2:34946","headers":{},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":200,"time":"2018-03-15T07:02:33Z"}
{"level":"warning","msg":"error finding role for pod: rpc error: code = Unknown desc = pod not found","pod.ip":"10.2.40.2","time":"2018-03-15T07:02:33Z"}
{"addr":"10.2.40.2:34948","level":"error","method":"GET","msg":"error processing request: error checking assume role permissions: rpc error: code = Unknown desc = pod not found","path":"/latest/meta-data/iam/security-credentials/my-app-iam-role","status":500,"time":"2018-03-15T07:02:33Z"}
{"addr":"10.2.40.2:34948","headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/my-app-iam-role","status":500,"time":"2018-03-15T07:02:33Z"}

I'm running uswitch/kiam:v2.4 for both the agent and the server.

The namespace where the pod is scheduled has the annotation

  annotations:
    iam.amazonaws.com/permitted: .*

as stated in the docs

along with

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

in the trust relationship on the IAM role attached to the node where the pods are being scheduled.

Not sure if it's related, but we recently upgraded k8s from 1.8.4 to 1.8.9; I guess that shouldn't be the problem.

Divergent annotations for kiam/kube2iam projects, an end user plea

Hi all and thanks for your efforts on kiam, I really like the idea of pre-caching credentials for faster container starts.

As an end user it would be fantastic if the kiam and kube2iam implementations could agree to support the same annotations. This would make it easier for kube2iam users to switch to kiam, or to try it and switch back, or to deploy applications to two clusters, one using each implementation.

Right now the Pod annotation is the same for both - fantastic!

kind: Pod
metadata:
  name: foo
  namespace: iam-example
  annotations:
    iam.amazonaws.com/role: reportingdb-reader

But the namespace annotation has diverged. kiam using a single regex to match roles:

kind: Namespace
metadata:
  name: iam-example
  annotations:
    iam.amazonaws.com/permitted: ".*"

And kube2iam uses a list of roles or role wildcards:

kind: Namespace
metadata:
  name: iam-example
  annotations:
    iam.amazonaws.com/allowed-roles:
    - "role-arn"
    - "my-custom-path/*"
    - "*"

Now I am sure both projects had their reasons and preferences. But again, as an end user of both these great projects, I look at the divergent namespace annotations and say "Really!??" 😄

IMHO, it doesn't seem like either approach is significantly better in any practical way, or in a way that makes this divergence beneficial to users. I feel the use cases for compatibility are strong, and I don't perceive what the use cases are for this divergence?

Pretty please, could the projects not make friends, get drunk together if necessary, and pick either or both namespace annotation options, with both projects supporting the same one(s)? 🍻

Or, second best, if you can't agree, could kiam also support iam.amazonaws.com/allowed-roles by compiling the allowed-roles list into a regex to use internally?
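A sketch of that compatibility shim, turning a kube2iam-style wildcard list into a single anchored regex for internal matching (hypothetical function, not part of either project):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileAllowedRoles turns a kube2iam-style allowed-roles list (with glob
// wildcards) into one anchored regexp that could be evaluated internally.
// Hypothetical sketch of the compatibility shim suggested above.
func compileAllowedRoles(roles []string) (*regexp.Regexp, error) {
	alts := make([]string, 0, len(roles))
	for _, r := range roles {
		// Escape regex metacharacters, then restore '*' as '.*'.
		escaped := regexp.QuoteMeta(r)
		alts = append(alts, strings.ReplaceAll(escaped, `\*`, `.*`))
	}
	return regexp.Compile(`^(` + strings.Join(alts, `|`) + `)$`)
}

func main() {
	re, _ := compileAllowedRoles([]string{"role-arn", "my-custom-path/*"})
	fmt.Println(re.MatchString("my-custom-path/reader")) // true
	fmt.Println(re.MatchString("unrelated-role"))        // false
}
```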

Reduce 200 ping requests in logs

Do we need the info-level logs with msg "processed request" for path /ping and status 200? I've been trying to investigate a problem, but the logs are swamped with these. Kubernetes is already monitoring the response, although there is some value in logging why a ping fails on non-200s, because by that point it might be impossible to get into the container to investigate.

My proposal is that we either filter out the 200s for the ping endpoint or switch them to debug level.
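A minimal sketch of the proposed filter, assuming a predicate is consulted before emitting the access log (the names here are hypothetical, not kiam's actual logging code):

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldLog implements the filtering proposed above: suppress the access
// log for successful health checks while keeping everything else,
// including failing pings.
func shouldLog(path string, status int) bool {
	return path != "/ping" || status != http.StatusOK
}

func main() {
	fmt.Println(shouldLog("/ping", 200)) // false: healthy pings are dropped
	fmt.Println(shouldLog("/ping", 500)) // true: failing pings still logged
}
```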
