Corrected March 13th Burndown report:
Status as of noon, PDT, March 13th.
We still have 19 issues open. Some of these don't show up in GitHub searches because they are not labelled correctly; this is being resolved.
We are waiting for a bunch of changes to go into 1.9 to fix downgrade tests, which should clear up some of the pending issues list. This does point to needing a change in how we do downgrade tests in the future.
There are several major regressions without resolution right now, sufficient to delay the end of code freeze, and possibly the final release, per this morning's burndown meeting.
Red
Issues which are blockers, or whose status is unknown but looks serious, without a good PR.
Performance issues in analysis/progress:
Failing tests with currently unknown causes:
- [failing test] should restart all nodes and ensure all nodes and pods recover
- [job failed] 1.9-master upgrade|downgrade jobs
- [test failed] gci-gce-alpha-features
Regression in progress, but fix untested. Issue was accidentally dropped by SIG and just picked up again:
Yellow
Issues which are blockers, with a good PR. Also undecided issues, with or without PRs, which may end up not being considered 1.10 bugs.
Test Fails in Progress
- "Cluster level logging implemented by Stackdriver should ingest events" fails for GKE Regional Clusters
- "CreateContainerConfigError: failed to prepare subPath for volumeMount" error with configMap volume
- Subpath tests don't work in multizone GCE
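As context for the two subPath items above: the failing shape, as reported, is a configMap volume mounted with a subPath. Below is a minimal sketch of a pod of that shape using the Go client types; all names here ("subpath-demo", "my-config", the paths) are invented for illustration and are not taken from the issues.

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// demoPod builds a pod that mounts a single key of a configMap via
// subPath, the combination reported to fail with
// CreateContainerConfigError in the item above.
func demoPod() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "subpath-demo"},
		Spec: v1.PodSpec{
			Volumes: []v1.Volume{{
				Name: "cfg",
				VolumeSource: v1.VolumeSource{
					ConfigMap: &v1.ConfigMapVolumeSource{
						LocalObjectReference: v1.LocalObjectReference{Name: "my-config"},
					},
				},
			}},
			Containers: []v1.Container{{
				Name:  "app",
				Image: "busybox",
				VolumeMounts: []v1.VolumeMount{{
					Name:      "cfg",
					MountPath: "/etc/app/config.yaml",
					SubPath:   "config.yaml", // mount one file out of the volume
				}},
			}},
		},
	}
}
```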
1.9 downgrade tests issues
These issues are waiting for code on 1.9 in order to fix the downgrade tests.
- [test failed] [1.10 upgrade] Dynamic Provisioning DynamicProvisioner
- [test failed] [1.10 upgrade] Cadvisor should be healthy on every node
- [test failed] [1.10 upgrade] Proxy version v1
- pull-kubernetes-kubemark-e2e-gce is failing
Green
Non-blocker issues.
- zsh completion throws error v1.10.0-beta.2
- Controller-manager sees higher mem-usage when load test runs before density
- HostPath mounts failing with "Path is not a shared or slave mount"
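On the HostPath item above: the "shared or slave mount" requirement comes from mount propagation, which went beta in 1.10. A minimal sketch, assuming standard client-go types, of a mount that requests propagation; names and paths are invented:

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// propagationPod asks for HostToContainer propagation on a hostPath
// mount. Kubelet validates the backing host mount: if /mnt/data is not
// part of a shared or slave mount on the node, pod setup fails with the
// error quoted in the item above.
func propagationPod() *v1.Pod {
	hostToContainer := v1.MountPropagationHostToContainer
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "propagation-demo"},
		Spec: v1.PodSpec{
			Volumes: []v1.Volume{{
				Name: "host",
				VolumeSource: v1.VolumeSource{
					HostPath: &v1.HostPathVolumeSource{Path: "/mnt/data"},
				},
			}},
			Containers: []v1.Container{{
				Name:  "app",
				Image: "busybox",
				VolumeMounts: []v1.VolumeMount{{
					Name:             "host",
					MountPath:        "/data",
					MountPropagation: &hostToContainer,
				}},
			}},
		},
	}
}
```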
Tracking Issues
1.10 is out now, closing.
Status as of 2/21, 1 day into Code Slush.
Detail behind this report can be found in the tracking spreadsheet.
One day after Code Slush, here is the issue status.
Bump
Recommend bumping these issues from the milestone based on inactivity and lack of responsiveness by SIGs/issue owners: 11 issues.
- CronJob controller should use shared informers
- move kube-proxy into a daemonset
- Document what docker image (Dockerfile) features we support
- e2e tests for cloud-controller-manager
- Review Cluster Autoscaler code for Azure
- Graduate the kubeletconfig API to beta
- RFC: Validate Docker against the Docker API versions
- Remove docker dependency on Kubelet startup
- Support out-of-tree authentication providers
- Communicate etcd2 support deprecation timeline
- Kubelet often fails on AWS spot instances
- Advanced Auditing 1.10 umbrella bug
- Don't duplicate status in audit events
- audit.Event.RequestObject underspecified for patch requests
- Remove the PersistentVolumeLabel Admission Controller
Wait
Status of issues/PRs has been queried, currently waiting for response from owners: 9 issues
- Request to add namespace name and namespace UUID to metadata of on-disk log file
- Support a Vault based KMS provider for envelope encryption of resources in a cluster
- ConfigMaps and Secrets mounted with subPath do not update when changed
- kubernetes GPU device plugin for Ubuntu and Debian images
- Delete in-tree support for Nvidia GPUs
- Job backoff limit workings when parallelism > 1
- RFC: Some thoughts about node e2e tests
- Document upgrade and downgrade steps for etcd 3.2 upgrade
Keep
1.10 issues which appear to be in progress and headed towards resolution by Code Freeze: 19 issues
- CRI: Support CRI log stats
- [job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting
- Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one api server
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
- Add support for accessing Ceph RBD with the rbd-nbd client
- [audit] Restore audit logging in the scalability tests
- After 1.8, scheduler could reject unknown extended resource names
- Kubelet: Resolve paths from dynamic config payloads against unpacked config file location
- add PV resize support for azure disk
- Kubernetes Container Runtime Interface (CRI) doesn't support WindowsContainerConfig and WindowsContainerResources
- Graduate the Lease Endpoint Reconciler to beta in v1.10
- Add a subresource for Node.ConfigSource?
- CRI: Implement container log rotation
- Invalid capacity 0 on windows image filesystem
- [job failure] ci-kubernetes-e2e-kubeadm-gce
- Node's providerID is wrong for azure vmss when using instance metadata
- [e2e test failure] each node by dropping all outbound packets for a while and ensure they function afterwards
- [test failure] verify.openapi-spec
- Cut and Vendor cAdvisor v0.29.0
Tracking
These are tracking issues. In a few cases, they link to multiple issues which are NOT currently marked with the 1.10 milestone. 6 issues.
- v1.10 known issues / FAQ accumulator
- CRI: support non-builtin container runtimes
- Move Priority and Preemption to Beta
- Bringing server-side printing to beta summary
- Standalone Azure cloud provider
- Move KubeControllerManagerConfiguration to pkg/controller/apis/
Graduate the kubeletconfig API to beta was completed in 1.10.
Support out-of-tree authentication providers: the design is merged; the implementing PR is under review and planned to merge this week.
All SIG-Azure issues have been curated.
Status as of 2/23. All issue owners have been reminded of the need to add status/approved-for-milestone. 11 issues were removed from the milestone or resolved. 3 new issues were added. Total of 40 issues.
Bump
Recommend bumping these issues from the milestone based on inactivity and lack of responsiveness by SIGs/issue owners, or on request by the issue owner that it be bumped: 9 issues.
- Document what docker image (Dockerfile) features we support
- e2e tests for cloud-controller-manager
- Review Cluster Autoscaler code for Azure
- Remove the PersistentVolumeLabel Admission Controller
- RFC: Validate Docker against the Docker API versions
- kubernetes GPU device plugin for Ubuntu and Debian images
- Remove docker dependency on Kubelet startup
- add PV resize support for azure disk
- Kubelet often fails on AWS spot instances
Wait
Issue Owner/SIG has been queried, waiting for response. 10 issues
- Add support for accessing Ceph RBD with the rbd-nbd client
- ConfigMaps and Secrets mounted with subPath do not update when changed
- After 1.8, scheduler could reject unknown extended resource names
- Delete in-tree support for Nvidia GPUs
- Graduate the Lease Endpoint Reconciler to beta in v1.10
- Request to add namespace name and namespace UUID to metadata of on-disk log file
- RFC: Some thoughts about node e2e tests
- Don't duplicate status in audit events
- audit.Event.RequestObject underspecified for patch requests
- Device Plugin failure handling in kubelet is racy
Keep
Issue is approved by SIG for 1.10 or looks likely to be approved/completed, or looks like a blocking bug against 1.10: 16 issues.
- [audit] Restore audit logging in the scalability tests
- Kubelet: Resolve paths from dynamic config payloads against unpacked config file location
- Support out-of-tree authentication providers
- Kubernetes Container Runtime Interface (CRI) doesn't support WindowsContainerConfig and WindowsContainerResources
- Job backoff limit workings when parallelism > 1
- Add a subresource for Node.ConfigSource?
- Document upgrade and downgrade steps for etcd 3.2 upgrade
- Invalid capacity 0 on windows image filesystem
- [job failure] ci-kubernetes-e2e-kubeadm-gce
- v1.10 known issues / FAQ accumulator
- [job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting
- Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one api server
- [test failure] verify.openapi-spec
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
- Windows containers creation failed because of rslave mounts
- Move Taints and Tolerations to GA
Tracking
Tracking issues which have features/bugs in 1.10:
- Advanced Auditing 1.10 umbrella bug
- CRI: support non-builtin container runtimes
- Move Priority and Preemption to Beta
- Move KubeControllerManagerConfiguration to pkg/controller/apis/
- Bringing server-side printing to beta summary
Burndown report as of scheduled Code Freeze 2/26
The complete issue tracker is in a spreadsheet.
Red
These issues represent what look like blockers for 1.10, and do not have a PR. One has had no attention at all.
- [job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting
- Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one api server
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
- [failing test] daemonset-upgrade in 1.9-master upgrade jobs
Yellow
These issues have one or more PRs. However, either the PRs aren't in good shape (such as failing tests or pending review) or they don't completely resolve the issue.
- After 1.8, scheduler could reject unknown extended resource names (see the sketch after this list)
- Remove docker dependency on Kubelet startup
- Support out-of-tree authentication providers
- [job failure] ci-kubernetes-e2e-kubeadm-gce
- Device Plugin failure handling in kubelet is racy
- OSS gke tests fails check APIReachability
- kubernetes/kubernetes#60381
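On the extended-resource item flagged above: extended resources are requested under vendor-namespaced names in a pod's resource limits. A hypothetical request of that kind (the name example.com/widget is invented) illustrates what the scheduler change would start rejecting when no node advertises the name:

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// extendedResourceContainer requests one unit of an extended resource.
// The scheduler change discussed above would reject pods like this when
// the resource name is unknown to (not advertised by) any node.
func extendedResourceContainer() v1.Container {
	return v1.Container{
		Name:  "app",
		Image: "busybox",
		Resources: v1.ResourceRequirements{
			Limits: v1.ResourceList{
				v1.ResourceName("example.com/widget"): resource.MustParse("1"),
			},
		},
	}
}
```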
Green
These issues have a PR in good shape and are expected to merge and close within a couple of days.
- [audit] Restore audit logging in the scalability tests
- Kubernetes Container Runtime Interface (CRI) doesn't support WindowsContainerConfig and WindowsContainerResources
- Document upgrade and downgrade steps for etcd 3.2 upgrade
- [test failure] verify.openapi-spec
- DaemonSet should ignore the unschedulable field of a node
- Move Taints and Tolerations to GA
Tracking
These are tracking issues for features being dealt with in multiple steps/releases/PRs.
- Move KubeControllerManagerConfiguration to pkg/controller/apis/
- Bringing server-side printing to beta summary
- Advanced Auditing 1.10 umbrella bug
- v1.10 known issues / FAQ accumulator
Bump
These issues are expected to get bumped out of 1.10 once automation kicks in tomorrow, as they are not approved or are low-priority.
- Document what docker image (Dockerfile) features we support
- e2e tests for cloud-controller-manager
- Remove the PersistentVolumeLabel Admission Controller
- Kubelet often fails on AWS spot instances
- Move KubeControllerManagerConfiguration to pkg/controller/apis/
- Bringing server-side printing to beta summary
- Request to add namespace name and namespace UUID to metadata of on-disk log file
- RFC: Some thoughts about node e2e tests
- Image Manager should return a copy of image list to avoid data race.
- Add support for accessing Ceph RBD with the rbd-nbd client
Thank you so much! This takes the organizational cake.
As of today, we have 27 issues open against 1.10, although some of them (4) would have been removed by automation if we hadn't run out of GitHub tokens. I don't have a good comparison against prior releases because I've discovered math errors in the devstats charts. Working on that. Closest comparison is 20 open issues 4 days after Code Freeze for 1.9.
Tracking spreadsheet is here and is up to date as of this afternoon. Note the three "NA" issues; these are recent bugs and/or test failures which look like 1.10 issues, but have not been confirmed by their respective SIGs.
Red
Issues with no PR, or no complete PR, which cannot be easily kicked out of 1.10.
- [job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting
- Kubernetes is vulnerable to stale reads, violating critical pod safety guarantees
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
- Apiserver CPU/Mem usage bumped to 1.5-2x in big clusters
Yellow
Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.
- RFC: Some thoughts about node e2e tests
- [job failure] ci-kubernetes-e2e-kubeadm-gce
- Improve k8s support for multizone PDs
- DaemonSet should ignore the unschedulable field of a node
- kubernetes/kubernetes#60381
- [failing test] daemonset-upgrade in 1.9-master upgrade jobs
- [test flakes] master-scalability suites
- [failing test] Nginx should conform to Ingress spec
- Failing to list pods with selector spec.nodeName != "" (see the sketch after this list)
- Custom resources with finalizers can "deadlock" customresourcecleanup.apiextensions.k8s.io finalizer
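A sketch of the failing query shape from the field-selector item above, assuming ordinary client-go usage of this era (this is not code from the issue):

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listScheduledPods lists only pods already bound to a node by excluding
// an empty spec.nodeName. The empty right-hand side of "!=" means "not
// the empty string"; this is the selector reported as failing above.
func listScheduledPods(cs kubernetes.Interface) error {
	_, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{
		FieldSelector: "spec.nodeName!=",
	})
	return err
}
```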
Green
Issues with an approved PR which is just waiting for labels, release notes, or automation.
- Document upgrade and downgrade steps for etcd 3.2 upgrade
- [audit] Restore audit logging in the scalability tests
- Support out-of-tree authentication providers
- Kubelet often fails on AWS spot instances
- [test failure] verify.openapi-spec
- kubectl completion failed to list file names
- Server side print returns generic information for deployments
Special Issues
Primarily tracking issues.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug
- Bringing server-side printing to beta summary
Kick
Issues which are waiting for automation to kick them out of the milestone.
- e2e tests for cloud-controller-manager
- Request to add namespace name and namespace UUID to metadata of on-disk log file
Burndown report as of 3/1 around 1pm Pacific
The complete issue tracker is in a spreadsheet.
We currently have around 21 "real" issues, excluding a few which will be dropped by automation as soon as the patched munger catches up.
Red
These issues represent what look like blockers for 1.10, and do not have a PR.
- [job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting
- Kubernetes is vulnerable to stale reads, violating critical pod safety guarantees
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
- DaemonSet should ignore the unschedulable field of a node
- Apiserver CPU/Mem usage bumped to 1.5-2x in big clusters
- Failing to list pods with selector spec.nodeName != ""
Yellow
These issues have one or more PRs. However, either the PRs aren't yet in good shape (failing tests or pending review) or they don't completely resolve the issue.
- [job failure] ci-kubernetes-e2e-kubeadm-gce
- Improve k8s support for multizone PDs
- kubernetes/kubernetes#60381
- [failing test] daemonset-upgrade in 1.9-master upgrade jobs
- kubectl completion failed to list file names
- [failing test] Nginx should conform to Ingress spec
- [test flakes] master-scalability suites
- [job flake] kubelet-master
- Mount propagation moved to beta, comment not updated
Green
These issues have a PR in good shape and are expected to merge and close within a couple of days.
Tracking
These are tracking issues for features being dealt with in multiple steps/releases/PRs.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug
- Bringing server-side printing to beta summary
As of today, we have 21 issues open against the 1.10 milestone, which is the same as yesterday. However, several of those issues have moved from yellow to green status because of PRs being approved/fixed, so we can expect a drop in the number of issues over the weekend.
On the down side, several issues are of special concern, as they represent severe problems which may throw off the release schedule. These are all in the Red section and detailed there.
Tracking spreadsheet is here and is up to date as of this afternoon.
Red
Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
The Stale Reads issue is a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that doesn't produce a major performance regression. Depending on how fixing it goes, we may have to punt on it for 1.10.0 and wait for a fix in 1.10.1.
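For readers new to the mechanics: the hazard comes from listing with resourceVersion=0, which client-go reflectors used for their initial sync. A minimal sketch, assuming standard client-go usage of this era (not code from the issue):

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPods contrasts the two list modes at issue. With ResourceVersion
// "0" the apiserver may answer from its watch cache, which can lag
// behind etcd (a stale read); with no ResourceVersion it performs a
// quorum read from etcd, which is consistent but more expensive at scale.
func listPods(cs kubernetes.Interface) error {
	// Fast but possibly stale: may be served from the watch cache.
	if _, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{ResourceVersion: "0"}); err != nil {
		return err
	}
	// Consistent quorum read: omit ResourceVersion entirely.
	_, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})
	return err
}
```

The bind, per the reports here, is that moving reflectors to quorum reads closes the stale-read hole, but that is exactly the change suspected of costing performance at scale.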
These two are really the same issue, and show what may be a very large performance regression in 1.10 even without a Stale Reads fix.
Waiting on the test framework to find out how bad the problem is; it's related to some serious test flakes, though, so it's undetermined whether it's a substantial issue or a problem with the test.
Yellow
Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.
- [job failure] ci-kubernetes-e2e-kubeadm-gce
- [job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
- kubectl completion failed to list file names
Green
Issues with an approved PR which is just waiting for labels, release notes, or automation.
- Support out-of-tree authentication providers
- Kubelet often fails on AWS spot instances
- DaemonSet should ignore the unschedulable field of a node
- kubernetes/kubernetes#60381
- [failing test] daemonset-upgrade in 1.9-master upgrade jobs
- Failing to list pods with selector spec.nodeName != ""
- [failing test] Nginx should conform to Ingress spec
- [job flake] kubelet-master
- Mount propagation moved to beta, comment not updated
Kick
Issues just waiting for the grace period to elapse before being kicked out of 1.10.
- Request to add namespace name and namespace UUID to metadata of on-disk log file
- Improve k8s support for multizone PDs
Special Issues
Primarily tracking issues.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug
- Bringing server-side printing to beta summary
As of around 10am PST today, we have 25 issues open against the 1.10 milestone, which is an increase of 4 from Friday. Most of the new issues are actually breakouts of a larger test fail issue, in order to have one issue per SIG for resolution (see test fails below).
Tracking spreadsheet is here and is up to date as of this morning.
Red
Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
The Stale Reads issue is a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that
doesn't produce a major performance regression. Depending on how fixing it goes, we may have to punt on it for 1.10.0 and wait for
a fix in 1.10.1.
This shows what may be a very large performance regression in 1.10 even without a Stale Reads fix. Test flakes are fixed, so hopefully we can get confirmation (or not).
Yellow
Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.
- Advanced Auditing 1.10 umbrella bug
- Bringing server-side printing to beta summary
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
- kubeadm v1.10 release tracking issue
New Test Fails
We've added a bunch of test fail tracking over the weekend. These represent related test fails across several test suites, with individual assignees. All are considered Yellow right now, as they either have fixes in progress, or are too new to be considered stuck. Most of these are being tracked from issue #60003.
- [job failure] ci-kubernetes-e2e-kubeadm-gce
- HPA refuses to autoscale if DesiredReplicas > MaxReplicas (but I want it to scale up to MaxReplicas)
- [test failed] Services should work after restarting apiserver
- [test failed] regular resource usage tracking resource tracking for 0 pods per node
- [failing test] should restart all nodes and ensure all nodes and pods recover
- [test failed] [1.10 upgrade] Servers with support for Table transformation
- [test failed] [1.10 upgrade] Dynamic Provisioning DynamicProvisioner
- [test failed] [1.10 upgrade] Kubernetes Dashboard should check that the kubernetes-dashboard instance is alive
- [test failed] [1.10 upgrade] Proxy version v1
- Flexvolume e2e tests failing
- Advanced Audit tests flaking
Green
Issues with an approved PR which is just waiting for labels, release notes, or automation.
- Kubelet often fails on AWS spot instances
- [failing test] daemonset-upgrade in 1.9-master upgrade jobs
- [test flakes] master-scalability suites
- Mount propagation moved to beta, comment not updated
- Failure in Conformance test - Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]
Special Issues
Primarily tracking issues.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug
- Bringing server-side printing to beta summary
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
Kick
Issues just waiting for the grace period to elapse before being kicked out of 1.10.
- Request to add namespace name and namespace UUID to metadata of on-disk log file
- Improve k8s support for multizone PDs
As of around 11am PST today, we have 26 issues open against the 1.10 milestone, which is an increase of 1 from yesterday. Most issues are various test failures, as non-test-fail issues are getting resolved.
Tracking spreadsheet is here and is up to date as of this morning.
Red
Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
The Stale Reads issue is a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that doesn't produce a major performance regression. It does not look to be headed towards resolution on a reasonable timeline for 1.10, so we may need to talk about it for 1.10.1.
This shows what may be a very large performance regression in 1.10 even without a Stale Reads fix. SIG thinks we have a major regression here, even though test flakes are making it hard to verify.
Yellow
Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.
- Improve k8s support for multizone PDs (this is an exception)
Test Fails In Progress
The vast majority of issues open now are test fails. All of the below are in progress in some way, but don't yet have a clear resolution. Many of these are the usual upgrade test failures. In at least one case, we need to unfreeze the 1.9 tree to fix the test. Not clear at this point whether we have a general upgrade issue the way we did in 1.9.
- [job failure] ci-kubernetes-e2e-kubeadm-gce
- [job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting
- [test flakes] master-scalability suites
- Advanced Audit tests flaking
- Flexvolume e2e tests failing
- [test failed] Services should work after restarting apiserver
- [test failed] regular resource usage tracking resource tracking for 0 pods per node
- [failing test] should restart all nodes and ensure all nodes and pods recover
- [job failed] 1.9-master upgrade|downgrade jobs
- [test failed] [1.10 upgrade] Dynamic Provisioning DynamicProvisioner
- [test failed] [1.10 upgrade] Kubernetes Dashboard should check that the kubernetes-dashboard instance is alive
- [test failed] [1.10 upgrade] Cadvisor should be healthy on every node
- [test failed] [1.10 upgrade] Proxy version v1
Green
Issues with an approved PR which is just waiting for labels, release notes, or automation.
- Kubelet often fails on AWS spot instances
- [failing test] daemonset-upgrade in 1.9-master upgrade jobs
- Mount propagation moved to beta, comment not updated
- Failure in Conformance test - Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate [Conformance]
- [test failed] [1.10 upgrade] Servers with support for Table transformation
Special Issues
Primarily tracking issues.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug (just waiting on fixing the tests)
- Bringing server-side printing to beta summary (should be done?)
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
As of around 4pm PST today, we have 22 issues open against the 1.10 milestone, which is a decrease of 4 from yesterday. Most issues are various test failures, as non-test-fail issues are getting resolved.
Tracking spreadsheet is here and is up to date as of this afternoon.
Critical concerns right now are:
- Potential major performance regression
- Issue with stale reads affecting scalability
- Test fails not getting enough attention from SIGs
- Test fails which require patches against 1.9 to resolve.
Red
Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
The Stale Reads issue is a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that doesn't produce a major performance regression. It does not look to be headed towards resolution on a reasonable timeline for 1.10, so we may need to talk about it for 1.10.1.
This shows what may be a very large performance regression in 1.10 even without a Stale Reads fix. SIG thinks we have a major regression here, even though test flakes are making it hard to verify.
It may also be related to issue #60762, below.
- [failing test] should restart all nodes and ensure all nodes and pods recover
- [test failed] [1.10 upgrade] Cadvisor should be healthy on every node
These two failing tests are receiving zero attention from their respective SIGs, 3 days after notice. SIGs bothered on Slack.
Yellow
Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.
- Improve k8s support for multizone PDs (this is an exception)
Test Fails In Progress
The vast majority of issues open now are test fails. All of the below are in progress in some way, but don't yet have a clear resolution. Many of these are the usual upgrade test failures.
- [test flakes] master-scalability suites
- Flexvolume e2e tests failing
- [test failed] Services should work after restarting apiserver
- [test failed] regular resource usage tracking resource tracking for 0 pods per node
- [job failed] 1.9-master upgrade|downgrade jobs
- [test failed] [1.10 upgrade] Proxy version v1
- FlexVolume probe race condition potentially crashes kubelet
- "Cluster level logging implemented by Stackdriver should ingest events" fails for GKE Regional Clusters
These two test fails seem to require modifying tests/code for 1.9 in order to fix. We need a hotfix to allow the owners to do this, and then we need a better procedure for handling upgrade tests in the future so that it doesn't lead to needing to patch tests on an older, frozen version.
- [test failed] [1.10 upgrade] Dynamic Provisioning DynamicProvisioner
- [test failed] [1.10 upgrade] Kubernetes Dashboard should check that the kubernetes-dashboard instance is alive
This issue is unconfirmed and not yet assigned to 1.10, but could be causing some of the test failures above:
Green
Issues with an approved PR which is just waiting for labels, release notes, or automation.
Special Issues
Primarily tracking issues.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug (just waiting on fixing the tests)
- Bringing server-side printing to beta summary (should be done?)
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
As of around 4pm PST today, we have 18 issues open against the 1.10 milestone, which is a decrease of 4 from Wednesday, and a great trajectory to be on. Most issues are various test failures, as non-test-fail issues are getting resolved, and all test fails are now getting attention.
Tracking spreadsheet is here and is up to date as of this afternoon.
Critical concerns right now are:
- Potential major performance regression
- Opening the 1.9 tree in order to patch the upgrade tests (see Yellow)
Red
Issues with no PR, or no complete PR, which cannot be easily taken out of 1.10 or represent major regressions.
This shows what may be a very large performance regression in 1.10 even without a Stale Reads fix. SIG thinks we have a major regression here, even though test flakes are making it hard to verify.
The Stale Reads issue is still a potentially major issue, affecting all supported versions of Kubernetes, without a clear solution that doesn't produce a major scalability regression. However, it looks highly unlikely to be resolved in the next 2 weeks, so we are recommending taking it out of the 1.10 milestone.
Yellow
Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.
Test Fails In Progress
The vast majority of issues open now are test fails. All of the below are in progress in some way, but don't yet have a clear resolution. Many of these are the usual upgrade test failures.
- [failing test] should restart all nodes and ensure all nodes and pods recover
- [test failed] regular resource usage tracking resource tracking for 0 pods per node
- [job failed] 1.9-master upgrade|downgrade jobs
- k8s.io/kubernetes/examples test failed
These four test fails seem to require modifying tests/code for 1.9 in order to fix. We need a hotfix to allow the owners to do this, and then we need a better procedure for handling upgrade tests in the future so that it doesn't lead to needing to patch tests on an older, frozen version. Either that, or we have to decide to ignore the upgrade tests for 1.10.
- [test failed] [1.10 upgrade] Dynamic Provisioning DynamicProvisioner
- [test failed] [1.10 upgrade] Kubernetes Dashboard should check that the kubernetes-dashboard instance is alive
- [test failed] [1.10 upgrade] Cadvisor should be healthy on every node
- [test failed] [1.10 upgrade] Proxy version v1
Green
Issues with an approved PR which is just waiting for labels, release notes, or automation.
- Flexvolume e2e tests failing
- [test failed] Services should work after restarting apiserver
- FlexVolume probe race condition potentially crashes kubelet
- "Cluster level logging implemented by Stackdriver should ingest events" fails for GKE Regional Clusters
Special Issues
Primarily tracking issues.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug (just waiting on fixing the tests)
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
As of around 1pm PDT today, we have 15 accepted issues open against the 1.10 milestone, and four possibles, which is a decrease of 3 from Friday. Further, four issues are likely to close in the next 2-5 hours as fixed tests pass.
However, we have a couple of major blocker issues which look possible to delay the release, see Red below.
Tracking spreadsheet is here and is up to date as of 1pm PDT.
Red
The big potential release-derailer is the major performance regressions possibly due to unidentified performance changes in etcd:
At this point, it is unclear if the etcd issues account for all of the problems, or if changing etcd version/settings will fix the issues.
This is an apparently unrelated increase on memory used by the API server. @shyamjvs has been hard at work bisecting for this, and may have found a culprit:
IMHO, these two performance regressions are significant enough to warrant a release delay.
We also have two test fails which have been receiving no attention. While neither looks that bad, I'm flagging them because we don't actually know what's causing them:
- [failing test] should restart all nodes and ensure all nodes and pods recover
- [job failed] 1.9-master upgrade|downgrade jobs
The Stale Reads issue has been removed from 1.10.
Yellow
Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.
- None. Yay! But ...
Test Fails In Progress
All of the below are in progress in some way, but don't yet have a clear resolution.
- [test failed] regular resource usage tracking resource tracking for 0 pods per node
- pull-kubernetes-kubemark-e2e-gce is failing
Possibles
These issues may be 1.10 issues; they were recently reported and seem related to other issues with 1.10. However, none of them have been examined by the SIGs yet. None look like release-blockers.
- k8s.io/kubernetes/examples test failed
- zsh completion throws error v1.10.0-beta.2
- KUBE_GIT_VERSION contains extraneous comma when building from git archive source.tgz
- Controller-manager sees higher mem-usage when load test runs before density
Green
Issues with an approved PR which is just waiting for labels, release notes, or automation. A bunch of these are tests which are being fixed now that code was merged into 1.9, just waiting on them to pass.
- [test failed] [1.10 upgrade] Dynamic Provisioning DynamicProvisioner
- [test failed] [1.10 upgrade] Cadvisor should be healthy on every node
- [test failed] [1.10 upgrade] Proxy version v1
- FlexVolume probe race condition potentially crashes kubelet
- "Cluster level logging implemented by Stackdriver should ingest events" fails for GKE Regional Clusters
Special Issues
Primarily tracking issues.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug (waiting on docs merge)
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
Burndown report as of 10am PDT March 13
DELETED because issues with Google Sheets caused it to be inaccurate. New burndown shortly.
Status as of noon, PDT, March 14th.
We have three issues which were re-opened yesterday, because they were closed in advance of verifying the tests.
Several of the upgrade/downgrade tests have been fixed, but we are waiting on all tests to pass before we actually clear them, since several test fails have been closed and reopened. A big thanks to @liggitt for pursuing those.
Overall status is "Crimson". We have multiple unclosed issues, any of which are sufficient to block release, and two of which (performance/scalability and ) have no specific timeline for resolution. We also have one trailing feature of unknown status. Further delaying the release seems more likely than not.
Red
Issues which are blockers, or whose status is unknown but looks serious, without a good PR.
Performance issues in analysis/progress. Some of the performance and scalability issues have been resolved, and others have been broken out into more specific issues. There are some issues (see Green below) which are not expected to be resolved for 1.10, but are regarded as non-blockers. Many thanks to @shyamjvs for diving into these regressions!
- [test flakes] master-scalability suites
- Fluentd-scaler causing fluentd pod deletions and messes with ds-controller
Failing tests with currently unknown causes:
- [failing test] should restart all nodes and ensure all nodes and pods recover
- [test failed] gci-gce-alpha-features
Regression in progress, but fix not passing tests:
Orphaned feature, awaiting response from SIG:
Yellow
Issues which are blockers, with a good PR. Also undecided issues, with or without PRs, which may end up not being considered 1.10 bugs.
Test Fails in Progress
These are currently all upgrade tests:
- [test failed] [1.10 upgrade] Servers with support for Table transformation
- [test failed] [1.10 upgrade][gci-gke] Kubernetes Dashboard should check that the kubernetes-dashboard instance is alive
- [test failed] [1.10 upgrade] Cadvisor should be healthy on every node
- [test failed] [1.10 upgrade] Proxy version v1
Green
Non-blocker issues (expected to remain broken for 1.10, need to add release note):
- Flaky timeouts while waiting for RC pods to be running in density test
- Controller-manager sees higher mem-usage when load test runs before density
- HostPath mounts failing with "Path is not a shared or slave mount"
Resolved, pending having all tests passing:
- Apiserver CPU/Mem usage bumped to 1.5-2x in big clusters
- [test failed] [1.10 upgrade] Dynamic Provisioning DynamicProvisioner
- pull-kubernetes-kubemark-e2e-gce is failing
- "Cluster level logging implemented by Stackdriver should ingest events" fails for GKE Regional Clusters
- "CreateContainerConfigError: failed to prepare subPath for volumeMount" error with configMap volume
- Subpath tests don't work in multizone GCE
Resolved, waiting for automation:
- zsh completion throws error v1.10.0-beta.2
Tracking Issues
As of around noon PDT today, we have 12 accepted issues open against the 1.10 milestone, which is a decrease of 3 from yesterday. Further, four issues are likely to close in the next 2-5 hours as fixed tests pass.
However, we have a couple of major blocker issues which look possible to delay the release, see Red below.
Tracking spreadsheet is here and is up to date as of 1pm PDT.
Red
The big potential release-derailer is the major performance regressions:
- [test flakes] master-scalability suites
- Fluentd-scaler causing fluentd pod deletions and messes with ds-controller
- Flaky timeouts while waiting for RC pods to be running in density test (non-blocker)
These two performance regressions are considered significant enough to warrant a release delay. Work on them, including git bisect and scalability testing, has been ongoing; it is slow going due to the relatively small number of folks who understand kube scalability and the scalability tests. Are you a performance geek who wants to get involved with Kubernetes? We could use you.
This test fail may be related to the fluentd performance issues, but root cause unknown:
Pod deletion has a problem with race conditions; work is in progress, but the initial patch attempt needs work:
Yellow
Issues with a PR that is not approved, or issues with no PR which look possible to ignore/kick in 1.10.
The problem with being unable to delete PVCs on downgrade is in progress, in the form of manual downgrade docs and a patch for the tests, with an actual fix due in 1.9.5:
This GCE deprecated flag issue has a PR in progress and near approval. It is currently breaking a lot of unrelated tests:
The rest of the Daemonset Scheduling work looks likely to be postponed until 1.11, but it's unclear at this point what would be required to back out committed work:
Green
Issues that are non-blockers or expected regressions and are expected to remain issues after 1.10.0 release:
- HostPath mounts failing with "Path is not a shared or slave mount"
- Apiserver CPU/Mem usage bumped to 1.5-2x in big clusters
- Controller-manager sees higher mem-usage when load test runs before density
Issues with an approved PR which is just waiting for labels, release notes, or automation:
Test fails which have been fixed but we're waiting for a couple days of green before we stop watching them:
- Advanced Audit tests flaking
- [test failed] [1.10 upgrade] Servers with support for Table transformation
- [test failed] [1.10 upgrade] Dynamic Provisioning DynamicProvisioner
- [test failed] [1.10 upgrade][gci-gke] Kubernetes Dashboard should check that the kubernetes-dashboard instance is alive
- [test failed] [1.10 upgrade] Cadvisor should be healthy on every node
- [test failed] [1.10 upgrade] Proxy version v1
- pull-kubernetes-kubemark-e2e-gce is failing
- "Cluster level logging implemented by Stackdriver should ingest events" fails for GKE Regional Clusters
- Subpath tests don't work in multizone GCE
Special Issues
Primarily tracking issues.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug (waiting on docs merge)
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
As of around noon PDT today, we have 8 accepted issues open against the 1.10 milestone, which is a decrease of 4 from yesterday.
At this point, we have three outstanding areas of work, which relate to multiple issues: the performance regressions, PVC protection downgrade, and Daemonset scheduling. Everything else known is resolved.
Tracking spreadsheet is here and is up to date as of 11am PDT.
Red
Performance regressions are in progress, but still not completely nailed down. Bisect has revealed a candidate issue which is possibly due to an already-reverted PR, and as such the release team wants to get an RC built so that they can really test the tweaks already made:
- [test flakes] master-scalability suites
- Fluentd-scaler causing fluentd pod deletions and messes with ds-controller
- Flaky timeouts while waiting for RC pods to be running in density test (non-blocker)
SIG-Storage is working to fix the failing test for downgrade of protected PVCs. @liggitt is working on a test patch to implement the manual instructions so that we can complete the downgrade tests. Risk: some of the other downgrade tests start failing now that they can finish running.
- [job failed] 1.9-master upgrade|downgrade jobs
Yellow
Daemonset scheduling feature is cleared to go into 1.10. All code updates have been merged, although one refactoring PR is deferred to 1.11. The remaining open PR is docs, plus release notes are needed. Risk: we may break new tests with the merge this morning.
- Schedule DaemonSet Pods by default scheduler
- [failing test] should restart all nodes and ensure all nodes and pods recover (also relates to fluentd issues)
- [test failed] gci-gce-alpha-features
We also have one more miscellaneous test fail in progress:
Special Issues
Primarily tracking issues.
- v1.10 known issues / FAQ accumulator
- Advanced Auditing 1.10 umbrella bug (waiting on docs merge)
- [job failure] ci-kubernetes-e2e-gci-gke|gce-serial
Status as of noon, PDT, March 19th.
Overall status is "saffron" (yellow with some organge). While the majority of bugs are either closed or have short-timeline plans for closure, we still have outstanding performance issue(s) whose cause is unknown.
Red
Issues which are blockers, or whose status is unknown but looks serious, without a good PR.
Performance issues in analysis/progress. With fluentd patches, performance issues have been addressed within acceptable tolerances (there is increased resource usage in this version of Kubernetes, period). Except this one, whose cause is still unknown:
- [test flakes] master-scalability suites
- Fluentd-scaler causing fluentd pod deletions and messes with ds-controller (this was patched, but the patch did not improve test results)
Yellow
Issues which are blockers, with a good PR. Also undecided issues, with or without PRs, which may end up not being considered 1.10 bugs.
Features
- Schedule DaemonSet Pods by default scheduler (merged except for Docs/Release Notes)
PVC protection:
This is being dealt with as a documentation bug with a documented workaround. There is a patch in progress against 1.9 that will make PVC downgrade work, to come out with 1.9.6. In the meantime, users who have a lot of PVCs should be encouraged to wait to upgrade until 1.9.6 is out.
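For reference, the documented workaround amounts to clearing the protection finalizer on PVCs stuck in Terminating after a downgrade, since a pre-1.9.6 cluster doesn't run the controller that removes it. A hedged sketch with client-go; the helper and its arguments are invented for illustration, and the authoritative steps are the documented workaround, not this code:

```go
package main

import (
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// clearPVCFinalizers strategically merges away all finalizers on one PVC,
// mirroring the manual kubectl-patch style workaround for claims stuck in
// Terminating on a downgraded cluster.
func clearPVCFinalizers(cs kubernetes.Interface, namespace, name string) error {
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := cs.CoreV1().PersistentVolumeClaims(namespace).Patch(name, types.StrategicMergePatchType, patch)
	return err
}
```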
Bugs
- Current image in glbc.manifest points to alpha version (go-ahead given at Release Burndown)
Test Fails in Progress
- [failing test] should restart all nodes and ensure all nodes and pods recover
- [test failed] gci-gce-alpha-features
Both of these test fails are related to Daemonset scheduling, and should turn green soon now that PRs are merged. Hoping!
Green
Non-blocker issues (expected to remain broken for 1.10, generally need to add release note):
- Flaky timeouts while waiting for RC pods to be running in density test
- Controller-manager sees higher mem-usage when load test runs before density
- HostPath mounts failing with "Path is not a shared or slave mount"
- kubeadm: etcd certs missing in self-hosted deploy (will be fixed in point release)
Tracking Issues
Status as of 11am, PDT, March 20th.
Overall status is "tangerine" (trending red). While the majority of bugs are either closed, we still have outstanding performance issue(s) whose cause is unknown and may delay the release. We also have several other unrelated issues which need fixing.
Tracking sheet is here as always.
Red
Release Blockers without a resolution timeline of less than 24 hours.
Performance issues in analysis/progress. There are two, which may be related, causing unacceptable performance on GCE. The cause of these may be in some way related to fluentd, but that doesn't make them a non-blocker:
- [test flakes] master-scalability suites
- Fluentd-scaler causing fluentd pod deletions and messes with ds-controller (this was patched, but the patch did not improve test results)
GKE tests are no longer running due to some issue with GKE, but we need this resolved before we release:
Yellow
Blocker issues which are expected to resolve within 24 hours. Also undecided issues, with or without PRs, which may end up not being considered 1.10 bugs.
Daemonset Scheduling feature needs to be reverted, or more accurately neutralized by disabling the alpha gate, before release:
- DaemonSet scheduling conflates hostname and nodename
- [test failed] gci-gce-alpha-features
- Schedule DaemonSet Pods by default scheduler
PVC protection workaround for 1.10.0, with fix pending for 1.9. This shouldn't be a blocker anymore:
Green
Non-blocker issues (expected to remain broken for 1.10, generally need to add release note):
- Flaky timeouts while waiting for RC pods to be running in density test
- Controller-manager sees higher mem-usage when load test runs before density
- HostPath mounts failing with "Path is not a shared or slave mount"
- kubeadm: etcd certs missing in self-hosted deploy (will be fixed in point release)
- Mounting socket files from subPaths fail (will be fixed in point release)
Tracking Issues
Status as of 11am, PDT, March 21st. Happy Nowruz! "zardi ye man az to, sorkhi ye to az man" (roughly: "my sickly yellow to you, your healthy red to me") seems particularly appropriate here.
Overall status is "straw" (light yellow). At this point, everthing is resolved or resolving in the next few hours except for #60589, which suffers from having to make a potentially painful tradeoff.
Red
Release Blockers without a resolution timeline of less than 24 hours.
This issue has been traced to a commit which was also a bugfix. At this point, we need opinions from multiple SIG leads about what to do on a reversion:
Yellow
Blocker issues which are expected to resolve within 24 hours. Also undecided issues, with or without PRs, which may end up not being considered 1.10 bugs.
Issue with subpaths which would prevent someone from upgrading from specific versions of Kubernetes; it has a patch just waiting to be cherry-picked:
Green
The Fluentd scaler issue has been fixed sufficiently to no longer be a blocker for 1.10. There are still effects of it which will need fixing in future point releases:
- Fluentd-scaler causing fluentd pod deletions and messes with ds-controller
GKE tests are now running.
Daemonset scheduling and PVC protection issues have been resolved. Important release note for PVC protection regarding downgrades.
Non-blocker issues (expected to remain broken for 1.10, generally need to add release note):
- Flaky timeouts while waiting for RC pods to be running in density test
- Controller-manager sees higher mem-usage when load test runs before density
- HostPath mounts failing with "Path is not a shared or slave mount"
- kubeadm: etcd certs missing in self-hosted deploy (will be fixed in point release)
- Mounting socket files from subPaths fail