kubewharf / katalyst-core Goto Github PK

Katalyst aims to provide a universal solution to help improve resource utilization and optimize the overall costs in the cloud. This is the core components in Katalyst system, including multiple agents and centralized components

License: Apache License 2.0

Makefile 0.10% Shell 0.07% Dockerfile 0.01% Go 99.81%

katalyst-core's People

Contributors

Stargazers

Watchers

Forkers

leoliuyan pendoragon caohe lubinszarm chuanyi-zjc xuyunxiao123 hexiaofeng dut3062796s rouzip brightk7 jadeflute0127 gavinljj rayoluo run-lin waynepeking348 xuefengchang luomingmeng csfldf cheney-lin sun-yuliang xiangdaos zzzzhhb justadogistaken akuan1994 smart2003 zhy76 xxisxuxu tghfly wangzzzhe dragon-flyings gdzy1987 zhangsonglee chenxi-seu sysadminxxx shuimo03 googs1025 attlee-wang lan-ce-lot web-logs2 y-ykcir wanglei4687 yanxiaoqi932 luyaozhong crain-cn fjding woodscumming kangclzjc jinhua0 sbittla hapream xjh1996 nightmeng leisunstar wangt95117 yamicro dashbaord202401 xuanyingcool yadzhang brave321 airren ddjjia h-w-chen dalesin kapybar4 cccaoke gaohuatao-1 itonyli yangsoon flpanbin augustuslang zhanghaoyu1986 gary-lgy ozline yytrace mingmingshiliyu xixi2 shimingfei xytsinghua xu282934741

katalyst-core's Issues

[Umbrella] Support the management of resources across additional dimensions

This is an umbrella tracking the items that support the management of resources across a wider range of dimensions.

Why is this needed?

Currently, Katalyst supports the management of CPU and memory resources, including the cgroups configuration during container creation and real-time adjustments during subsequent runtime.

To support more colocation scenarios, Katalyst needs to manage resources across a broader range of dimensions, including network, L3 cache, memory bandwidth, and more.

What would you like to be added?

Proposals
- add external manager proposal
APIs
- kubewharf/katalyst-api#1
- define the CRD for QoS related to RDT
External Manager
- #8
- #6
- implement a network external manager to enable the execution of packet tagging in cgroups v2 environments
QRM Plugins
- #14
- network QRM plugin supports network bandwidth management
- extend the CPU QRM plugin to support diverse RDT-related management policies

Support for OOM priority as a QoS enhancement

What would you like to be added?

Users can specify the OOM priority as a QoS enhancement.
Implement OOM priority with oom_score_adj.

Why is this needed?

Currently, Kubernetes will configure different oom_score_adj values for different QoS classes. However, the order of OOM also depends on other dimensional factors such as the memory usage of the container.

In the colocation scenario, it's important to strictly ensure that web services are terminated later than batch jobs due to OOM when the cluster's memory resources become scarce.

Support for recommending resource specifications for workloads

Why is this needed?

Kubernetes is widely adopted for its ability to manage containerized workloads efficiently. However, determining the appropriate resource specifications (CPU and memory) for workloads remains a challenge. Often, users over-provision resources to ensure performance stability, leading to wasted resources and increased costs. On the other hand, under-provisioning can result in performance degradation and service disruptions.
By introducing a resource recommendation feature, we can address these challenges and provide the following benefits:

Resource Efficiency: Users will be able to allocate resources more accurately, reducing waste and optimizing cost management.
Performance Optimization: Tailored recommendations will ensure that workloads have the resources they need to run optimally, minimizing both over-provisioning and performance bottlenecks.
Ease of Use: The automated recommendation process will simplify resource management for both experienced and novice Kubernetes users.

What would you like to be added?

This issue proposes the addition of a new feature that enhances Kubernetes resource utilization by providing the ability to recommend resource specifications for workloads. This feature would analyze historical usage patterns and real-time performance metrics of workloads to intelligently suggest optimal resource requests for CPU and memory.

add documentations to explain the extension mechanism for qos definition

What would you like to be added?

a documentation with examples to explain qos extension mechanism

Why is this needed?

the qos extension mechanism is a little complicated and hard to understand, since qos, enhancement key and values can all be expanded

[Umbrella] Decouple QoS Resource Manager (QRM) from kubelet

Why is this needed?

Currently, Katalyst injects resource management policies through a framework inside kubelet named QoS Resource Manager (QRM). Among various feasible solutions, the QRM solution has the most complete functions and the most reasonable design.

However, some users may find it inconvenient to use in conjunction with the KubeWharf K8s distro, so we plan to provide a solution decoupled from kubelet, which will serve as a supplement to the QRM solution and allow users to choose as needed according to their own situation.

What would you like to be added?

Add an out-of-band resource manager (ORM) module in Katalyst Agent, which includes:

Through the asynchronous update path, ORM injects resource management strategies to a container after it starts, and dynamically adjust the resource allocation of a container when it is running. #406
Through the NRI path, ORM injects resource management strategies synchronously when a container is created. #488, #525
Implement an out-of-band Topology Manager, as we can no longer reuse the NUMA alignment capability provided by the native Topology Manager of kubelet. #435
Implement an out-of-band PodResources server, because the CPU and memory information returned by the native PodResources API of kubelet is not correct. #453
Adapt the reporter plugin that reports topology information so that it can use the out-of-band PodResources API. #453

The detailed design can be found in this doc.

[install error] katalyst-agent CrashLoopBackOff

What happened?

root@VM-0-15-ubuntu:/home/ubuntu# kubectl get pods -nkatalyst-system
NAME                                   READY   STATUS             RESTARTS         AGE
katalyst-agent-4qx2t                   0/1     CrashLoopBackOff   10 (31s ago)     26m
katalyst-agent-jdl97                   0/1     CrashLoopBackOff   10 (22s ago)     26m
katalyst-agent-pwm7l                   0/1     Error              10 (5m11s ago)   26m
katalyst-controller-845ccf946b-ftxgx   1/1     Running            0                26m
katalyst-controller-845ccf946b-lm9bm   1/1     Running            0                26m
katalyst-metric-765c44bbb5-48ws6       1/1     Running            0                26m
katalyst-scheduler-5746f9bd4c-swgc4    1/1     Running            0                26m
katalyst-scheduler-5746f9bd4c-x2vct    1/1     Running            0                26m
katalyst-webhook-68fcf99cd8-26c8g      1/1     Running            0                26m
katalyst-webhook-68fcf99cd8-7fs78      1/1     Running            0                26m

root@VM-0-15-ubuntu:/home/ubuntu# kubectl logs katalyst-agent-4qx2t -nkatalyst-system
W0502 08:03:20.626350       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024/05/02 08:03:20 <nil>
I0502 08:03:20.626831       1 otel_prom_metrics_mux.go:94] [katalyst-core/pkg/metrics/metrics-pool.(*openTelemetryPrometheusMetricsEmitterPool).GetMetricsEmitter] add path /metrics to metric emitter
W0502 08:03:20.636464       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
I0502 08:03:20.636778       1 network_linux.go:80] [katalyst-core/pkg/util/machine.GetExtraNetworkInfo] namespace list: []
W0502 08:03:20.637199       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: eth0 with devPath: /sys/devices/virtual/net/eth0 which isn't pci device
W0502 08:03:20.637248       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: kube-ipvs0 with devPath: /sys/devices/virtual/net/kube-ipvs0 which isn't pci device
W0502 08:03:20.637281       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: lo with devPath: /sys/devices/virtual/net/lo which isn't pci device
W0502 08:03:20.637311       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth064d18ee with devPath: /sys/devices/virtual/net/veth064d18ee which isn't pci device
W0502 08:03:20.637339       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth06d57915 with devPath: /sys/devices/virtual/net/veth06d57915 which isn't pci device
W0502 08:03:20.637365       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth5290716c with devPath: /sys/devices/virtual/net/veth5290716c which isn't pci device
W0502 08:03:20.637396       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth6f37d282 with devPath: /sys/devices/virtual/net/veth6f37d282 which isn't pci device
W0502 08:03:20.637428       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth87922afb with devPath: /sys/devices/virtual/net/veth87922afb which isn't pci device
W0502 08:03:20.637457       1 network_linux.go:178] [katalyst-core/pkg/util/machine.getNSNetworkHardwareTopology] skip nic: veth8dccdf2e with devPath: /sys/devices/virtual/net/veth8dccdf2e which isn't pci device
I0502 08:03:20.638040       1 file.go:239] [GetUniqueLock] get lock successfully
I0502 08:03:20.638069       1 agent.go:85] initializing "katalyst-agent-reporter"
W0502 08:03:20.638121       1 manager.go:400] failed to retrieve checkpoint for "reporter_manager_checkpoint": checkpoint is not found
I0502 08:03:20.638136       1 manager.go:258] registered plugin name system-reporter-plugin
I0502 08:03:20.638153       1 manager.go:239] plugin system-reporter-plugin run success
I0502 08:03:20.638171       1 manager.go:258] registered plugin name kubelet-reporter-plugin
I0502 08:03:20.638210       1 util_unix.go:104] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/var/lib/kubelet/pod-resources/kubelet.sock" URL="unix:///var/lib/kubelet/pod-resources/kubelet.sock"
F0502 08:03:20.638341       1 kubeletplugin.go:110] run topology status adapter failed

What did you expect to happen?

All pods start normally

How can we reproduce it (as minimally and precisely as possible)?

None

Software version

Environment:

Kubernetes version (use kubectl version): 1.28
OS version: Ubuntu 22.04
Kernal version:
Cgroup driver: cgroupfs/systemd

Optimize resource recommendation controller

Why is this needed?

Katalyst has recently added a resource recommendation controller which can work with VPA to help user allocate proper resource request/limit for pods. However, there're still gaps for delivering resource recommendation in a release:

Now we build the Controller using the Kubernetes Controller-Runtime Project. while Katalyst has its own controller framework(e.g. vpa controller). We expect to use this framework to refactor the controller section
It lacks deployment artifacts
It lacks relevant documentations

What abilities do you need?

Familiar with golang programming, have the ability to independently analyze and solve problems, and proficient in using search engines.
Familiar with or willingness to learn the k8s controller-manager mechanism, it would be even better if you are familiar with or have used The Kubernetes controller-runtime Project.
Familiar with or willingness to learn the usage and code of the Katalyst controller framework(e.g. vpa controller).
Have some experience in developing and using helm chart
Have ability to write documents
Can reasonably arrange time to participate in the design and development of open source projects, and maintain a passion for learning.

What would you like to be added?

Code

Refactor resource recommendation controller code so that it aligns with other katalyst controllers
Add helm chart for resource recommendation

Documents

Add a quick start to gokatalyst.io

Cpu estimation is NaN

What happened?

NAN is generated，resulting in a series of subsequent errors and cause kcnr data cannot be reported properly

What did you expect to happen?

Prevent NAN

How can we reproduce it (as minimally and precisely as possible)?

I applied a dedicated pod

Software version

$ <software> version
# paste output here

CustomNodeResource status 中的 CPU allocatable 数值没有更新

What happened?

我按照文档 https://gokatalyst.io/docs/getting-started/colocation-quick-start/ 安装部署了 katalyst, 然后创建了 shared-normal-pod 应用，应用创建前后观察 kcnr 中 resource.katalyst.kubewharf.io/reclaimed_millicpu 的数值并没有变化。

root@ubuntu:~/katalyst/examples# kubectl get nodes -owide
NAME           STATUS   ROLES           AGE   VERSION               INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
10.6.202.151   Ready    control-plane   19h   v1.24.6-kubewharf.8   10.6.202.151   <none>        Ubuntu 20.04.5 LTS   5.4.0-125-generic   containerd://1.4.12
node1          Ready    <none>          19h   v1.24.6-kubewharf.8   10.6.202.152   <none>        Ubuntu 20.04.5 LTS   5.4.0-125-generic   containerd://1.4.12
node2          Ready    <none>          19h   v1.24.6-kubewharf.8   10.6.202.153   <none>        Ubuntu 20.04.5 LTS   5.4.0-125-generic   containerd://1.4.12

root@ubuntu:~/katalyst/examples# helm list -A
NAME               	NAMESPACE       	REVISION	UPDATED                                	STATUS  	CHART                        	APP VERSION
katalyst-colocation	katalyst-system 	1       	2024-05-24 09:28:44.44903291 +0000 UTC 	deployed	katalyst-colocation-orm-0.5.0	v0.5.0
malachite          	malachite-system	1       	2024-05-24 09:16:19.208333849 +0000 UTC	deployed	malachite-0.1.0              	0.1.0

node2 节点的资源使用情况，确实是占用了2 core cpu.

shared-normal-pod 调度到了 node2 节点，节点的配置是 4核8G，该节点的 kcnr 中的 status.resources. allocatable 中的 cpu 和 memory 都没有变化。所有节点的信息都一样。

root@ubuntu:~/katalyst/examples# kubectl get kcnr node2 -oyaml
apiVersion: node.katalyst.kubewharf.io/v1alpha1
kind: CustomNodeResource
metadata:
  annotations:
    katalyst.kubewharf.io/cpu_overcommit_ratio: "1.00"
    katalyst.kubewharf.io/guaranteed_cpus: "0"
    katalyst.kubewharf.io/memory_overcommit_ratio: "1.00"
    katalyst.kubewharf.io/overcommit_cpu_manager: none
    katalyst.kubewharf.io/overcommit_memory_manager: None
  creationTimestamp: "2024-05-24T02:01:18Z"
  generation: 2
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: node2
    kubernetes.io/os: linux
  name: node2
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Node
    name: node2
    uid: 6a7c0a4b-451a-4c96-a580-c0e792772077
  resourceVersion: "131467"
  uid: 69d3adc7-7709-4e36-aa30-554ad7d6e1be
spec:
  nodeResourceProperties:
  - propertyName: numa
    propertyQuantity: "1"
  - propertyName: nbw
    propertyQuantity: 10k
  - propertyName: cpu
    propertyQuantity: "4"
  - propertyName: memory
    propertyQuantity: 8148204Ki
  - propertyName: cis
    propertyValues:
    - avx2
  - propertyName: topology
    propertyValues:
    - '{"Iface":"ens160","Speed":10000,"NumaNode":0,"Enable":true,"Addr":{"IPV4":["10.6.202.153"],"IPV6":null},"NSName":"","NSAbsolutePath":""}'
status:
  resources:
    allocatable:
      resource.katalyst.kubewharf.io/reclaimed_memory: 5Gi
      resource.katalyst.kubewharf.io/reclaimed_millicpu: 4k
    capacity:
      resource.katalyst.kubewharf.io/reclaimed_memory: 5Gi
      resource.katalyst.kubewharf.io/reclaimed_millicpu: 4k
  topologyPolicy: None
  topologyZone:
  - children:
    - attributes:
      - name: katalyst.kubewharf.io/netns_name
        value: ""
      - name: katalyst.kubewharf.io/resource_identifier
        value: ens160
      name: ens160
      resources:
        allocatable:
          resource.katalyst.kubewharf.io/net_bandwidth: 9k
        capacity:
          resource.katalyst.kubewharf.io/net_bandwidth: 9k
      type: NIC
    - name: "0"
      resources:
        allocatable:
          cpu: "4"
          memory: "8343760896"
        capacity:
          cpu: "4"
          memory: "8343760896"
      type: Numa
    name: "0"
    resources: {}
    type: Socket

What did you expect to happen?

node2 节点的kcnr status 数值更新。

How can we reproduce it (as minimally and precisely as possible)?

按照文档操作：https://gokatalyst.io/docs/getting-started/colocation-quick-start/

Software version

$ <software> version
# paste output here

When using reclaim core pod, the pod cannot be scheduled

What happened?

reclaimed-normal-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    "katalyst.kubewharf.io/qos_level": reclaimed_cores
  name: reclaimed-normal-pod
  namespace: default
spec:
  containers:
    - name: stress
      image: joedval/stress:latest
      command:
        - stress
        - -c
        - "1"
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          "resource.katalyst.kubewharf.io/reclaimed_millicpu": "2k"
          "resource.katalyst.kubewharf.io/reclaimed_memory": 1Gi
        limits:
          "resource.katalyst.kubewharf.io/reclaimed_millicpu": "2k"
          "resource.katalyst.kubewharf.io/reclaimed_memory": 1Gi
  schedulerName: katalyst-scheduler

What did you expect to happen?

Reclaim core pod completes scheduling.

How can we reproduce it (as minimally and precisely as possible)?

Use the default policy scheduler-policy-spread.yaml of the example folder.

Software version

$ <software> version
# paste output here

Support NUMA-granularity reporting for reclaimed resources

What would you like to be added?

Enhance the resource reporting mechanism to support reporting of reclaimed resources at the granularity of NUMA nodes.

Why is this needed?

Currently, the reporting of reclaimed resources is performed at a node granularity level. However, in environments with NUMA architectures, this approach might lead to suboptimal scheduling result and potential pod evictions due to NUMA-level interference.

如何创建dedicated_cores 和 system_cores QoS类型的Pod?

What would you like to be added?

文档示例中只展示了如何创建shared_cores和reclaimed_cores 类型的 Pod，但是没有提到如何创建 dedicated_cores 和 system_cores 类型的 pod，方便提供一个示例吗？谢谢

Why is this needed?

文档示例中只展示了如何创建shared_cores和reclaimed_cores 类型的 Pod，但是没有提到如何创建 dedicated_cores 和 system_cores 类型的 pod

Refactor some methods of node webhook

What happened?

When I was using katalyst recently, I found that there were some duplicate codes in the webhook code. I wanted to do a small refactoring and simplify some codes.
like this
https://github.com/kubewharf/katalyst-core/blob/main/pkg/webhook/mutating/node/allocatable_mutator.go#L66
https://github.com/kubewharf/katalyst-core/blob/main/pkg/webhook/mutating/node/allocatable_mutator.go#L121

[Proposal] Introduce a network QRM plugin to support network resource management policies for different QoS classes

What would you like to be added?

Introduce a network QRM plugin with the following capabilities:

In cgroups v1 environments, configure the net_cls cgroup through the Cgroup Manager to enable outbound traffic tagging for containers.
In cgroups v2 environments, use eBPF through the External Manager to support traffic tagging.

Promote node resource over-commitment to GA

Why is this needed?

In v0.4, we released the MVP version of node resource over-commitment and implemented some basic features.

In v0.5, we plan to make some enhancements to this function to bring it to GA status.

What would you like to be added?

Dynamic over-commitment ratio adjustment: In order to make the amount of over-committed resources more accurate, we will combine long-term and short-term prediction algorithms to calculate the amount of resources that can be over-committed. #472
Interference detection and mitigation: In order to avoid resource competition caused by over-commitment, we will introduce multi-dimensional interference detection strategies, including CPU load/usage, memory usage, the reclaiming rate of kswapd, etc. Furthermore, we will introduce multi-tiered mitigation measures, including scheduling prevention, eviction, etc. #518
Compatibility with core binding: Prevent the bound cores from being over-committed to avoid scheduling too many CPU-bound Pods and causing the Pods to fail to start. #472

Monitoring accuracy and latency of reported information in KCNR

What would you like to be added?

This issue proposes the implementation of a monitoring system to track the accuracy of the reported information in KCNR and to measure the latency in reporting this information. The following features should be added:

Accuracy Monitoring: Implement a mechanism to compare the reported data in KCNR with the actual resource allocation on each NUMA node.
Latency Monitoring: Measure the time taken from the moment a Pod is scheduled to the moment it's information is successfully reported to KCNR.
Visualization: Provide a dashboard that displays accuracy and latency metrics.

Why is this needed?

KCNR is a CRD that stores topology status and resource allocation information of a node. Katalyst gathers and reports these information to KCNR. However, there are concerns about the accuracy of the reported information as well as potential delays in reporting.

resourceAllocatable 和 resourceCapacity 中的 cpu, memory的数值始终是相等的

What happened?

在创建应用前，查看节点的 kcnr，resourceAllocatable 和 resourceCapacity 中的 cpu和memory 相等，部署应用后，resourceAllocatable 和 resourceCapacity中的数值都减少了，但是Allocatable和 Capacity 还是相等。个人理解 Capacity 应该是保持不变的，减少的应该是 Allocatable？
创建应用前, 节点的 kcnr 信息：

Status:
  Resources:
    Allocatable:
      resource.katalyst.kubewharf.io/reclaimed_memory:    60696174Ki
      resource.katalyst.kubewharf.io/reclaimed_millicpu:  48k
    Capacity:
      resource.katalyst.kubewharf.io/reclaimed_memory:    60696174Ki
      resource.katalyst.kubewharf.io/reclaimed_millicpu:  48k

创建 share应用后，节点的 kcnr 信息：

Status:
  Resources:
    Allocatable:
      resource.katalyst.kubewharf.io/reclaimed_memory:    52308024Ki
      resource.katalyst.kubewharf.io/reclaimed_millicpu:  42k
    Capacity:
      resource.katalyst.kubewharf.io/reclaimed_memory:    52308024Ki
      resource.katalyst.kubewharf.io/reclaimed_millicpu:  42k

shared-normal-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    "katalyst.kubewharf.io/qos_level": shared_cores
  name: shared-normal-pod
  namespace: default
spec:
  containers:
    - name: stress
      image: joedval/stress:latest
      command:
        - stress
        - -c
        - "1"
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 1Gi
  schedulerName: katalyst-scheduler
  nodeName: node1

What did you expect to happen?

resourceAllocatable 资源减少，resourceCapacity资源保持不变

How can we reproduce it (as minimally and precisely as possible)?

创建一个 shared_cores pod,观察节点 kcnr 的数值变化情况

Software version

$ <software> version
# paste output here

节点的动态超分比例在增加CPU消耗后，不降反升

What happened?

我按照动态超分的文档体验了下动态超分功能，但是在创建 testpod1 增加 cpu的消耗后，cpu的超分比 cpu_overcommit_ratio 不降反升。

没有pod运行时，查看 g-master2 的kcnr:

[root@g-master1 katalyst]# kubectl describe kcnr g-master2
Name:         g-master2
Namespace:
Labels:       <none>
Annotations:  katalyst.kubewharf.io/cpu_overcommit_ratio: 1.74
              katalyst.kubewharf.io/guaranteed_cpus: 0
              katalyst.kubewharf.io/memory_overcommit_ratio: 1.15
              katalyst.kubewharf.io/overcommit_cpu_manager: none
              katalyst.kubewharf.io/overcommit_memory_manager: None
API Version:  node.katalyst.kubewharf.io/v1alpha1
Kind:         CustomNodeResource
Metadata:
  Creation Timestamp:  2024-05-27T14:02:23Z
  Generation:          1
  Resource Version:    135351666
  UID:                 78bc346b-d009-4ea8-bac1-51e2e6612d07
Spec:
  Node Resource Properties:
    Property Name:      numa
    Property Quantity:  2
    Property Name:      nbw
    Property Quantity:  10k
    Property Name:      cpu
    Property Quantity:  16
    Property Name:      memory
    Property Quantity:  32778468Ki
    Property Name:      cis
    Property Values:
      avx2
    Property Name:  topology
    Property Values:
      {"Iface":"ens192","Speed":10000,"NumaNode":0,"Enable":true,"Addr":{"IPV4":["10.6.202.112"],"IPV6":null},"NSName":"","NSAbsolutePath":""}
Events:  <none>

创建 testpod1 后，再次查看 g-master2 的kcnr:

[root@g-master1 katalyst]# kubectl describe kcnr g-master2
Name:         g-master2
Namespace:
Labels:       <none>
Annotations:  katalyst.kubewharf.io/cpu_overcommit_ratio: 1.99
              katalyst.kubewharf.io/guaranteed_cpus: 0
              katalyst.kubewharf.io/memory_overcommit_ratio: 1.41
              katalyst.kubewharf.io/overcommit_cpu_manager: none
              katalyst.kubewharf.io/overcommit_memory_manager: None
API Version:  node.katalyst.kubewharf.io/v1alpha1
Kind:         CustomNodeResource
Metadata:
  Creation Timestamp:  2024-05-27T14:02:23Z
  Generation:          1
  Resource Version:    135554723
  UID:                 78bc346b-d009-4ea8-bac1-51e2e6612d07
Spec:
  Node Resource Properties:
    Property Name:      numa
    Property Quantity:  2
    Property Name:      nbw
    Property Quantity:  10k
    Property Name:      cpu
    Property Quantity:  16
    Property Name:      memory
    Property Quantity:  32778468Ki
    Property Name:      cis
    Property Values:
      avx2
    Property Name:  topology
    Property Values:
      {"Iface":"ens192","Speed":10000,"NumaNode":0,"Enable":true,"Addr":{"IPV4":["10.6.202.112"],"IPV6":null},"NSName":"","NSAbsolutePath":""}
Events:  <none>


[root@g-master1 katalyst]# kubectl get pod -n katalyst-system
NAME                                            READY   STATUS    RESTARTS       AGE
katalyst-controller-747545d674-54d2j            1/1     Running   9 (14h ago)    6d19h
katalyst-webhook-69bdb7d5d6-jnrh5               1/1     Running   0              6d19h
overcommit-katalyst-agent-l2rdx                 1/1     Running   0              6d19h
overcommit-katalyst-agent-sb2bd                 1/1     Running   0              6d19h
overcommit-katalyst-agent-vb5wc                 1/1     Running   0              6d19h
overcommit-katalyst-scheduler-58f64f644-442lb   1/1     Running   16 (14h ago)   6d19h
testpod1                                        1/1     Running   0              12s

katalyst 版本：

panbin@panbindeMacBook-Pro ~ % helm list -n katalyst-system
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /Users/panbin/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /Users/panbin/.kube/config
NAME      	NAMESPACE      	REVISION	UPDATED                             	STATUS  	CHART                    	APP VERSION
overcommit	katalyst-system	1       	2024-05-27 22:01:28.110633 +0800 CST	deployed	katalyst-overcommit-0.5.0	v0.5.0

What did you expect to happen?

创建 testpod1 后，对应节点的 cpu 超分比 katalyst.kubewharf.io/cpu_overcommit_ratio 降低。

How can we reproduce it (as minimally and precisely as possible)?

按照这个文档操作即可：https://gokatalyst.io/docs/user-guide/resource-overcommitment/dynamic-overcommitment/

Software version

$ <software> version
# paste output here

refine headroom manager for sysadvisor

What would you like to be added?

remove broker from headroom manager

Why is this needed?

oversold logic has been move to headroom policy of resource advisor, so the the broker framework in headroom manager need to be trashed

文档可以丰富一些嘛

目前好像没有看到katalyst支持哪些隔离特性，相关隔离特性如何使用，收益如何？
是否可以提供一个类似这样的站点，来让用户更好的上手katalyst系统。https://koordinator.sh/zh-Hans/docs/user-manuals/load-aware-scheduling

what network qos really do ? only set cgroup net class id?

I have see 0.3 release an network qos plugin ,, is there any doc about it ?

we just see set cgroup net class id,, but where tc policy ready do ? we should manager TC policy manual ?

[Umbrella] Support topology-aware scheduling

This is an umbrella tracking the items that support topology-aware scheduling.

Why is this needed?

Currently, Katalyst supports numa-binding and numa-exclusive enhancements for the dedicated_cores QoS class in the colocation scenario.

In situations outside of colocation, it is also necessary to be aware of the NUMA and device topology when scheduling and allocating resources, so as to improve the performance of containers. Furthermore, in this particular scenario, Katalyst's resource allocation strategy needs to be compatible with Kubernetes' native allocation strategy.

What would you like to be added?

APIs
QRM framework
QRM Plugin
- #144
Reporter
- #45
- #150
MetaServer and Lifecycle Controller
- #151
- #159
Scheduler
- #298
Charts
- kubewharf/charts#3
- kubewharf/charts#4

[Doc] install kubewharf enhanced kubernetes: wget https://github.com/containerd/containerd/releases/download/v1.4.12/containerd-1.6.9-linux-amd64.tar.gz not found

What happened?

When I follow the documentation to install kubewharf enhanced kubernetes, I found this:

Looks like wget https://github.com/containerd/containerd/releases/download/v1.4.12/containerd-1.6.9-linux-amd64.tar.gz already can't find, maybe need to update the relevant documentation.

What did you expect to happen?

update the relevant documentation

How can we reproduce it (as minimally and precisely as possible)?

wget https://github.com/containerd/containerd/releases/download/v1.4.12/containerd-1.6.9-linux-amd64.tar.gz

Software version

$ <software> version
# paste output here

code comments link address 😂

Description：

code comments:

https://github.com/kubewharf/katalyst-core/blob/650ae570146df55529959179487243f5d2803a45/cmd/base/healthz.go#L60C53-L60C53

linked with a PersistentVolumeSource struct.

Question:

In my opinion, It may link with readiness information. Is it right?😁

安装agent日志报错

What happened?

helm install katalyst -n katalyst-system --create-namespace kubewharf/katalyst
kubectl logs katalyst-agent-jwjmw -n katalyst-system --tail=100
sync kubelet pod failed: failed to get pod list, error: Get "http://localhost:10255/pods": dial tcp [::1]:10255: connect: connection refused

What did you expect to happen?

安装文档能否详细些

How can we reproduce it (as minimally and precisely as possible)?

希望有安装成功的说明，相关指标

Software version

latest
helm install katalyst -n katalyst-system --create-namespace kubewharf/katalyst

[Proposal] Introduce an external manager framework

What would you like to be added?

Introduce an external manager framework to:

Execute configurations beyond the scope of the OCI specification.
Dynamically manage the mapping between a container and its corresponding cgroup ID, thereby facilitating the implementation of future QoS management solutions based on eBPF.

[Proposal] QRM cpu/memory plugins support memory enhancement numa_exclusive

What would you like to be added?

katalyst-api issue proposes to add memory enhancement numa_exclusive
QRM cpu/memory plugin decide whether a pod takes up NUMAs exclusively according to pod memory enhancement statement

memory advisor sendAdvices channel sometimes full

What happened?

Maybe we need Increase capacity of sendChan

katalyst-core/pkg/agent/sysadvisor/plugin/qosaware/resource/memory/advisor.go

Line 89 in 6085bab

sendChan: make(chan types.InternalMemoryCalculationResult, 1),

What did you expect to happen?

no error

How can we reproduce it (as minimally and precisely as possible)?

get agent log

Software version

$ <software> version
# paste output here

refine resource manager to support newly cnr definition to report enhanced topology information

What would you like to be added?

add conversion framework to reporter manager to support transformation from old ReportField to newly one
cnr reporter support merge struct type fields by strategic merge
cnr reporter support merge cnr's TopologyZone field by Type and Name as unique key

Type TopologyType `json:"type"`

Name string `json:"name,omitempty"`

Why is this needed?

a newly cnr definition has been proposal in katalyst-api, and the resource manager should be also refined to support the enhanced topology information report.

Support for node resource over-commitment for online services

Why is this needed?

Due to the tidal nature of online services, users often determine the amount of resources to apply for based on the amount of resources consumed during peak periods. In addition, users tend to over-apply resources to ensure business stability. These behaviors will lead to waste of resources.

What would you like to be added?

This issue propose the addition of a new feature that enables node resource over-commitment for online services. The feature consists of three key capabilities:

Resource Protocol Compatibility: Hijack kubelet's request to report the amount of resources based on a webhook, and amplify the amount of allocatable resources. This allows the scheduler to schedule more Pods to a node without users' awareness.
Interference Detection and Mitigation: In order to avoid resource competition caused by over-commitment, we will introduce multi-dimensional interference detection strategies, including CPU load, memory usage, the reclaiming rate of kswapd, etc. Furthermore, we will introduce multi-tiered mitigation measures, including scheduling prevention, eviction, etc.
Node Resource Prediction Algorithms: In order to make the amount of over-committed resources more stable, we combine long-term and short-term prediction algorithms to calculate the amount of resources that can be over-committed.

"$KUBEADM_TOKEN" was not of the form "\\A([a-z0-9]{6})\\.([a-z0-9]{16})\\z"

What happened?

When I set up enhanced kubernetes master:

mkdir -p /etc/kubernetes
export KUBEADM_TOKEN=`kubeadm token generate`
export APISERVER_ADDR=192.168.211.131

kubeadm init --config=/etc/kubernetes/kubeadm-client.yaml --upload-certs -v=5

it shows an error:

initconfiguration.go:306] error unmarshaling configuration schema.GroupVersionKind{Group:"kubeadm.k8s.io", Version:"v1beta3", Kind:"InitConfiguration"}: the bootstrap token "$KUBEADM_TOKEN" was not of the form "\\A([a-z0-9]{6})\\.([a-z0-9]{16})\\z"

What did you expect to happen?

run kubeadm init

How can we reproduce it (as minimally and precisely as possible)?

install enhanced k8s according to https://github.com/kubewharf/katalyst-core/blob/main/docs/install-enhanced-k8s.md

Software version

$ <software> version
# paste output here

Support inter-pod affinity and anti-affinity at NUMA level

What would you like to be added?

This issue is opened to track the development of inter-pod affinity and anti-affinity at NUMA level in Kubernetes.

Why is this needed?

Currently, Kubernetes supports inter-pod affinity and anti-affinity at the node level. However, there is a growing need for extending this support to the NUMA level.

For example, high-memory bandwidth consuming pods, like workers, can impact the performance of other pods on the same NUMA node, such as parameter servers. Allocating these pods to different NUMA nodes can mitigate such interferences.

refine resource advisors in sysadvisor

What would you like to be added?

refine resource advisors, especially cpu advisor in sysadvisor

Why is this needed?

several bugs exist in cpu advisor, which may result in faulty provision and headroom result in certain cases, including but not limited to:
a. add reserved for allocated repeatedly in multi share region case
b. provision policy in use is not updated
c. pool size regulation may have faulty result in extreme cases
d. should consider request sum as estimation when reclaim is disabled
e. per numa reserved for allocate/reclaim calculation could be optimized
update logic in cpu advisor is complicated, more abstraction and simplification is required, including but not limited to:
a. optimize cpu advisor flow path
b. optimize per numa reserve resource calculation
c. abstract provision and headroom assembler for extensibility

Can katalyst add the scheduling function for LLC and memory bandwidth resources?

What would you like to be added?

Not only CPU cores, LLC and memory bandwidth resources are bottleneck resources sometimes, can you add scheduling functions for LLC and MB?

Why is this needed?

LLC and memory bandwidth resources are bottleneck resources sometimes.

I followed the relevant documentation to test Resource overcommitment , there seems to be a problem

refer to: https://gokatalyst.io/docs/user-guide/resource-overcommitment/
我使用这个文档去测试超分比的示例，看着好像没生效？

root@VM-0-12-ubuntu:/home/ubuntu# kubectl get pod -o wide |grep busybox-deployment-overcommit |awk '{print $3}' |sort |uniq -c
     19 Pending
      1 Running
root@VM-0-12-ubuntu:/home/ubuntu# kubectl get pod -o wide |grep busybox-deployment-overcommit |awk '{print $3}' |sort |uniq -c
     19 Pending
      1 Running
root@VM-0-12-ubuntu:/home/ubuntu# kubectl get nodeovermit
error: the server doesn't have a resource type "nodeovermit"
root@VM-0-12-ubuntu:/home/ubuntu# kubectl get NodeOvercommitConfig
NAME              OVERCOMMITRATIO                SELECTOR
overcommit-demo   {"cpu":"2.5","memory":"2.5"}   overcommit-demo

[Proposal] Implement an RDT external manager to execute RDT-related configurations

What would you like to be added?

Develop an RDT external manager to carry out RDT-related configurations:

Check whether RDT is supported by the CPU and the kernel.
Perform some RDT-related initialization, such as:
- Mounting the resctrl file system;
- Creating a CLOS directory for each QoS class.
Synchronize the tasks of each CLOS.
Apply the CAT configurations for each CLOS.
Apply the MBA configurations for each CLOS.

Enhancing SysAdvisor for generic algorithm framework support

What would you like to be added?

The core idea is to enable SysAdvisor to seamlessly support a generic algorithm framework, allowing for easy integration of various algorithmic implementations through a plugin-based architecture.

Why is this needed?

SysAdvisor is a module that performs algorithmic inferences. As the demand for algorithmic solutions grows, a plugin-based system ensures that SysAdvisor can easily scale to accommodate a broader range of algorithms without introducing complexity to the core codebase.

Is there any plan "Yodel" opensource/included?

In this essay there is a system “Yodel", will it be opensource or included in this project?

https://www.toutiao.com/article/7178732909950747191/?app=news_article&timestamp=1672022923&use_new_style=1&req_id=20221226104842F95548F164929DA9D270&group_id=7178732909950747191&wxshare_count=1&tt_from=weixin&utm_source=weixin&utm_medium=toutiao_android&utm_campaign=client_share&share_token=16e72473-88be-4f05-a0ef-9a937b568400&source=m_redirect&wid=1678417976245

Enhancing fault tolerance for Katalyst Agent

What would you like to be added?

We propose enhancing the fault tolerance capabilities of the Katalyst Agent to ensure more reliable operation in the presence of failures. This includes the following two main aspects:

Enhanced Health Check Criteria: We aim to incorporate a broader range of factors into the health check endpoint of the Katalyst Agent. Currently, the health check primarily focuses on basic connectivity and liveness. We suggest extending this to consider additional dimensions such as the status of the QRM Plugin. This would provide a more comprehensive assessment of the Agent's operational state and help prevent potential issues before they escalate.
Diversified Failure Handling: Currently, when the Katalyst Agent encounters a failure, it employs a limited set of recovery measures such as preventing further scheduling and eviction. We believe it would greatly benefit the system's reliability if we introduce a wider range of actions that can be taken in response to Agent failures. These measures could include dynamic adjustments of resource allocations, etc. By diversifying the recovery strategies, we can increase the likelihood of successful recovery from various failure scenarios.

Why is this needed?

Currently, the health checks and failure handling measures for Katalyst Agent are limited, which cannot meet the stability requirements.

Decoupling some fine-grained resource management features from QoS

What would you like to be added?

Decouple some fine-grained resource management features, such as CPU Burst and IO Limit, from QoS concepts.

Why is this needed?

Currently, fine-grained resource management features are tightly integrated with QoS concepts, making it challenging to utilize them in non-colocation scenarios. However, some features, such as CPU Burst and IO Limit, are also useful in non-colocation scenarios.

Enhanced K8s worker node network error and weird automatic restart

What happened?

I followed this documentation to install the dev k8s environment.

My master node is running fine quickly, but the worker nodes are not working due to cni plugin not initialized

Extracted from kubectl describe node debian-node-2

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 10 Jun 2024 15:46:00 +0800   Mon, 10 Jun 2024 15:46:00 +0800   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Mon, 10 Jun 2024 15:45:56 +0800   Mon, 10 Jun 2024 15:45:50 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 10 Jun 2024 15:45:56 +0800   Mon, 10 Jun 2024 15:45:50 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 10 Jun 2024 15:45:56 +0800   Mon, 10 Jun 2024 15:45:50 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                False   Mon, 10 Jun 2024 15:45:56 +0800   Mon, 10 Jun 2024 15:45:50 +0800   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Logs extraced from kubectl logs -n kube-system canal-ftzlb（for debian-node-2）

2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=Get "https://172.23.192.1:443/api/v1/namespaces?limit=500": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=Get "https://172.23.192.1:443/api/v1/nodes?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints" error=Get "https://172.23.192.1:443/api/v1/pods?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=Get "https://172.23.192.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies" error=Get "https://172.23.192.1:443/apis/networking.k8s.io/v1/networkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.325 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/hostendpoints?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.560 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/bgpconfigurations"
2024-06-10 08:58:51.587 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/globalnetworksets" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/globalnetworksets?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.731 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/networkpolicies" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/networkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.900 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2024-06-10 08:58:51.926 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/ippools"
2024-06-10 08:58:51.947 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/globalnetworkpolicies" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/globalnetworkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.160 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/felixconfigurations"
2024-06-10 08:58:52.201 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/networksets" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/networksets?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.323 [INFO][63] status-reporter/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/profiles"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesservice"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices"
2024-06-10 08:58:52.326 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints"
2024-06-10 08:58:52.454 [INFO][63] status-reporter/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/caliconodestatuses?limit=500&resourceVersion=23696&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=Get "https://172.23.192.1:443/api/v1/namespaces?limit=500": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=Get "https://172.23.192.1:443/api/v1/nodes?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesservice" error=Get "https://172.23.192.1:443/api/v1/services?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints" error=Get "https://172.23.192.1:443/api/v1/pods?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=Get "https://172.23.192.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies" error=Get "https://172.23.192.1:443/apis/networking.k8s.io/v1/networkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/bgpconfigurations" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/bgpconfigurations?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused

I checked the logs and it seems that the connection to 172.23.192.1:443 was refused.
But when I check iptables, it shows:

root@debian-node-2:~/deploy# sudo iptables-save | grep 172.23.192.1
-A KUBE-SERVICES -d 172.23.192.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SERVICES -d 172.23.192.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SVC-ERIFXISQEP7F7OF4 ! -s 172.28.208.0/20 -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SVC-JD5MR3NA4I4DYORP ! -s 172.28.208.0/20 -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-MARK-MASQ
-A KUBE-SVC-NPX46M4PTMTKRN6Y ! -s 172.28.208.0/20 -d 172.23.192.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SVC-TCOU7JCQXEZGVUNU ! -s 172.28.208.0/20 -d 172.23.192.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ

and I try use curl and telnet to coonnect 172.23.192.1:443 in debian-node-2，it works

This seems to be working fine, I have been debugging this for days, but still have no luck.

I also tried to reinstall several times, but the problem reappeared stably.

Finally, I solved it by systemctl restart containerd.service on debian-node-2(work node)

Although I solved the problem, I still have a huge doubt, why can I just restart containerd?

At the same time, my master node (i.e., debian-node-1) would randomly reboot, which was very confusing to me.

It usually manifested itself in a way similar to client_loop: send disconnect: Broken pipe, which I initially thought was a problem with the ssh connection, but when I left the server alone overnight and connected again the next day, it would automatically reboot again.

I had no memory issues, and even the memory usage was not high. Command journalctl -xb -p err did not indicate any problems.

I have previously installed Vanilla Kubernetes on debian-node-1, and also installed a Kind-based k8s container. But there was no restart problem. I have already done a system reset before installing kubewarf-enhanced-k8s.

debian-node-2 (worker node) have same problem.

This is debian-node-1 and debian-node-2 summary from neofetch. The two machines are connected to the same router, which has a bypass route 192.168.2.201 set up to handle the proxy.

root@debian-node-1:~# neofetch
       _,met$$$$$gg.          root@debian-node-1
    ,g$$$$$$$$$$$$$$$P.       ------------------
  ,g$$P"     """Y$$.".        OS: Debian GNU/Linux 12 (bookworm) x86_64
 ,$$P'              `$$$.     Host: UM480XT
',$$P       ,ggs.     `$$b:   Kernel: 6.1.0-21-amd64
`d$$'     ,$P"'   .    $$$    Uptime: 1 hour
 $$P      d$'     ,    $$P    Packages: 573 (dpkg)
 $$:      $$.   -    ,d$$'    Shell: bash 5.2.15
 $$;      Y$b._   _,d$P'      CPU: AMD Ryzen 7 4800H with Radeon Graphics (16) @ 2.900GHz
 Y$$.    `.`"Y$$$$P"'         GPU: AMD ATI 04:00.0 Renoir
 `$$b      "-.__              Memory: 1494MiB / 31529MiB
  `Y$$
   `Y$$.
     `$$b.
       `Y$$b.
          `"Y$b._
              `"""

root@debian-node-2:~/deploy# neofetch
       _,met$$$$$gg.          root@debian-node-2
    ,g$$$$$$$$$$$$$$$P.       ------------------
  ,g$$P"     """Y$$.".        OS: Debian GNU/Linux 12 (bookworm) x86_64
 ,$$P'              `$$$.     Host: UM480XT
',$$P       ,ggs.     `$$b:   Kernel: 6.1.0-21-amd64
`d$$'     ,$P"'   .    $$$    Uptime: 2 hours, 49 mins
 $$P      d$'     ,    $$P    Packages: 517 (dpkg)
 $$:      $$.   -    ,d$$'    Shell: bash 5.2.15
 $$;      Y$b._   _,d$P'      CPU: AMD Ryzen 7 4800H with Radeon Graphics (16) @ 2.900GHz
 Y$$.    `.`"Y$$$$P"'         GPU: AMD ATI 04:00.0 Renoir
 `$$b      "-.__              Memory: 703MiB / 15425MiB
  `Y$$
   `Y$$.
     `$$b.
       `Y$$b.
          `"Y$b._
              `"""

What did you expect to happen?

Worker nodes can run normally without systemctl restart containerd.service

Master node will not auto restart

How can we reproduce it (as minimally and precisely as possible)?

followed this documentation

Software version

debian-node-1:

root@debian-node-1:~# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:56:31Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:51:02Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}

debian-node-2:

root@debian-node-2:~/deploy# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:56:31Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:51:02Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}

Filter the Pods reported in KCNR

What would you like to be added?

The results returned by the pod resources server are not filtered, so anything returned by the pod resources server will be reported to KCNR. This may also need to be considered in conjunction with the scheduler to determine if we need to filter out only non-NUMA binding Pods.

Why is this needed?

The Pods currently reported by KCNR may need filtering

Support for tidal colocation with HPA and node pool management

What would you like to be added?

This issue propose the addition of a new feature that enables time-shared node reuse on Kubernetes. This feature aims to enhance resource utilization and efficiency by allowing multiple types of workloads, such as online services and batch jobs, to share a node in a time-sliced manner. The feature consists of two key capabilities: HPA and node pool management.

HPA Enhancement: Extend the existing HPA functionality to support scaling workloads based on a schedule, i.e. CronHPA. This enhancement will enable workloads to reduce their resource footprint when demand is low, freeing up resources for other workloads.
Node Pool Management: Introduce a node pool management mechanism that dynamically reallocates nodes between different types of workloads. When a workload is scaled down, the vacant nodes will be identified and assigned to another workload that is experiencing higher demand. This will facilitate the efficient utilization of nodes and prevent resource wastage.

Why is this needed?

The need for tidal colocation arises from the desire to optimize resource utilization and enhance cost efficiency within Kubernetes clusters. Currently, workloads often run on dedicated nodes, leading to suboptimal resource usage and potential underutilization during off-peak hours. By implementing time-shared node reuse with HPA and node pool management, several benefits can be realized:

Resource Efficiency: Many workloads experience varying levels of demand throughout the day. By allowing workloads to scale down during low-traffic periods and releasing their nodes to other workloads, we can ensure that resources are used more effectively.
Workload Isolation: With the proposed mechanism, different types of workloads run on different nodes, avoiding interference and resource contention between workloads.
Dynamic Scaling: The time-shared node reuse feature will enable dynamic scaling, allowing clusters to adapt more efficiently to changing workloads without manual intervention.

Katalyst-colocation-orm can be installed on enhanced-k8s cluster but katalyst-colocation cannot be installed

What happened?

I followed Colocate your application using Katalyst to install Katalyst.

It mentioned that if you use Kubewharf enhanced kubernetes, install katalyst-colocation

And if you use vanilla kubernetes, install katalyst-colocation-orm

My node follows Install Kubewharf enhanced-k8s to install enhanced k8s, but only katalyst-colocation-orm can be installed instead of katalyst-colocation

If I install katalyst-colocation, it will report the following error in katalyst-colocation-agent

I0610 13:10:27.641756       1 state_checkpoint.go:121] "[cpu_plugin] State checkpoint: restored state from checkpoint"
I0610 13:10:27.641777       1 util.go:68] [katalyst-core/pkg/agent/qrm-plugins/cpu/util.GetCoresReservedForSystem] get reservedQuantityInt: 0 from ReservedCPUCores configuration
I0610 13:10:27.641787       1 util.go:77] [katalyst-core/pkg/agent/qrm-plugins/cpu/util.GetCoresReservedForSystem] take reservedCPUs:  by reservedCPUsNum: 0
I0610 13:10:27.641832       1 policy.go:950] [katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).cleanPools] there is no pool to delete
I0610 13:10:27.641842       1 policy.go:964] [katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).initReservePool] initReservePool reserve:
I0610 13:10:27.641859       1 state_mem.go:109] "[cpu_plugin] updated cpu plugin pod entries" podUID="reserve" containerName="" allocationInfo="{\"pod_uid\":\"reserve\",\"owner_pool_name\":\"reserve\",\"allocation_result\":\"\",\"original_allocation_result\":\"\",\"topology_aware_assignments\":{},\"original_topology_aware_assignments\":{},\"init_timestamp\":\"\",\"labels\":null,\"annotations\":null,\"qosLevel\":\"\"}"
I0610 13:10:27.644274       1 policy.go:1039] [katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).initReclaimPool] exist initial reclaim: 0-9
I0610 13:10:27.644300       1 agent.go:102] needToRun "qrm_cpu_plugin"
I0610 13:10:27.644308       1 agent.go:91] initializing "qrm_io_plugin"
I0610 13:10:27.644320       1 agent.go:102] needToRun "qrm_io_plugin"
I0610 13:10:27.644325       1 agent.go:91] initializing "qrm_network_plugin"
W0610 13:10:27.644335       1 util.go:122] [katalyst-core/pkg/agent/qrm-plugins/network/staticpolicy.filterNICsByAvailability] nic: eno1 doesn't have IP address
I0610 13:10:27.644344       1 util.go:302] [katalyst-core/pkg/agent/qrm-plugins/network/staticpolicy.getReservedBandwidth] reservedBanwidth: 0, nicCount: 1, policy: first,
I0610 13:10:27.644361       1 state_net.go:47] "[network_plugin: katalyst-core/pkg/agent/qrm-plugins/network/state.NewNetworkPluginState] initializing new network plugin in-memory state store"
I0610 13:10:27.644372       1 util.go:37] [GenerateMachineState: katalyst-core/pkg/agent/qrm-plugins/network/state.GenerateMachineState] NIC wlp2s0's speed: -1, capacity: [0/0], reservation: 0
I0610 13:10:27.644511       1 util.go:37] [GenerateMachineState: katalyst-core/pkg/agent/qrm-plugins/network/state.GenerateMachineState] NIC wlp2s0's speed: -1, capacity: [0/0], reservation: 0
I0610 13:10:27.644531       1 state_net.go:121] "[network_plugin: katalyst-core/pkg/agent/qrm-plugins/network/state.(*networkPluginState).SetMachineState] updated network plugin machine state" NICMap="{\"wlp2s0\":{\"egress_state\":{\"Capacity\":0,\"SysReservation\":0,\"Reservation\":0,\"Allocatable\":0,\"Allocated\":0,\"Free\":0},\"ingress_state\":{\"Capacity\":0,\"SysReservation\":0,\"Reservation\":0,\"Allocatable\":0,\"Allocated\":0,\"Free\":0},\"pod_entries\":{}}}"
I0610 13:10:27.644543       1 state_net.go:145] "[network_plugin: katalyst-core/pkg/agent/qrm-plugins/network/state.(*networkPluginState).SetPodEntries] updated network plugin pod resource entries" podEntries="{}"
I0610 13:10:27.644555       1 state_checkpoint.go:136] "[network_plugin: katalyst-core/pkg/agent/qrm-plugins/network/state.(*stateCheckpoint).restoreState] state checkpoint: restored state from checkpoint"
I0610 13:10:27.644572       1 policy.go:177] [katalyst-core/pkg/agent/qrm-plugins/network/staticpolicy.(*StaticPolicy).ApplyConfig] apply configs, qosLevelToNetClassMap: map[dedicated_cores:0 reclaimed_cores:0 shared_cores:0 system_cores:0], podLevelNetClassAnnoKey: katalyst.kubewharf.io/net_class_id, podLevelNetAttributesAnnoKeys: []
I0610 13:10:27.644581       1 agent.go:102] needToRun "qrm_network_plugin"
I0610 13:10:27.644588       1 agent.go:91] initializing "periodical-handler-manager"
I0610 13:10:27.644593       1 agent.go:102] needToRun "periodical-handler-manager"
I0610 13:10:27.644600       1 agent.go:91] initializing "katalyst-agent-orm"
I0610 13:10:27.644631       1 manager.go:86] "Creating topology manager with policy per scope" topologyPolicyName=""
E0610 13:10:27.644640       1 manager.go:129] unknown policy: ""
E0610 13:10:27.644647       1 agent.go:94] Error initializing "katalyst-agent-orm"
I0610 13:10:27.644662       1 file.go:257] [GetUniqueLock] release lock successfully
I0610 13:10:28.396105       1 file.go:90] fsNotify watcher notify "/var/lib/kubelet/resource-plugins/kubelet_qrm_checkpoint": CREATE
I0610 13:10:28.396155       1 topology_adapter.go:281] qrm state file changed, notify to update topology status
I0610 13:10:28.396166       1 kubeletplugin.go:177] send topology change notification to plugin kubelet-reporter-plugin
run command error: failed to init ORM: unknown policy: ""

Only katalyst-agent not working

root@debian-node-1:~# kubectl get pods -n katalyst-system
NAME                                                       READY   STATUS             RESTARTS      AGE
katalyst-colocation-katalyst-agent-f5glx                   0/1     CrashLoopBackOff   4 (36s ago)   2m32s
katalyst-colocation-katalyst-agent-jzgft                   0/1     CrashLoopBackOff   4 (52s ago)   2m32s
katalyst-colocation-katalyst-controller-59b5c89cd6-jcn9m   1/1     Running            0             2m32s
katalyst-colocation-katalyst-controller-59b5c89cd6-vpjvq   1/1     Running            0             2m32s
katalyst-colocation-katalyst-metric-85c47ff4bf-nl9sf       1/1     Running            0             2m32s
katalyst-colocation-katalyst-scheduler-77cdd9d66f-8mszz    1/1     Running            0             2m32s
katalyst-colocation-katalyst-scheduler-77cdd9d66f-c27qc    1/1     Running            0             2m32s
katalyst-colocation-katalyst-webhook-5f6ccc7cb-ngz2x       1/1     Running            0             2m32s
katalyst-colocation-katalyst-webhook-5f6ccc7cb-vrnzs       1/1     Running            0             2m32s

But install katalyst-colocation-orm in Kubewharf enhanced kubernetes work fine（pod status of agent is Running）

What did you expect to happen?

install katalyst-colocation in KubeWharf-enhanced-kubernetes work fine

How can we reproduce it (as minimally and precisely as possible)?

Install katalyst-colocation using helm after installing KubeWharf-enhanced-kubernetes

helm install katalyst-colocation -n katalyst-system --create-namespace kubewharf/katalyst-colocation

Software version

No response

refine implementations for qrm

What would you like to be added?

refine implementations for qrm to simplify the logic

Why is this needed?

currently, qrm plugin is a little complicated and lack of abstraction for similar functionalities

Enhancing security and rate limiting for Katalyst endpoints

What would you like to be added?

This issue proposes adding authentication and rate limiting capabilities to various endpoints within Katalyst. This enhancement would encompass two main categories of interfaces:

HTTP Endpoints: Specifically, we aim to secure and implement rate limiting for the data provisioning interface from the Katalyst Agent to KCMAS. This involves integrating authentication mechanisms to ensure that only authorized entities can access this interface. Additionally, incorporating rate limiting would prevent abuse and ensure fair usage of resources, maintaining optimal performance even during high traffic scenarios.
gRPC Endpoints: Extend these security measures to each manager's plugin registration endpoint.

Why is this needed?

Currently, the endpoints provided by Katalyst has no authentication and rate limiting mechanism, which brings some risks to the stability of the cluster.The inclusion of authentication and rate limiting mechanisms for Katalyst endpoints addresses security and performance concerns.

A dedicated_cores pod do not have an exclusive CPU

What happened?

I created a dedicated_cores pod, but it not have an exclusive CPU.

dedicated_cores_pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    "katalyst.kubewharf.io/qos_level": dedicated_cores
    "katalyst.kubewharf.io/memory_enhancement": '{
      "numa_binding": "true",
      "numa_exclusive": "true"
    }'
  name: numa-dedicated-normal-pod
  namespace: default
spec:
  containers:
    - name: stress
      image: joedval/stress:latest
      command:
        - stress
        - -c
        - "1"
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 1Gi
  schedulerName: katalyst-scheduler

check the cpuset for the pod:

root@ubuntu:~# ./get_cpuset.sh numa-dedicated-normal-pod
Wed 05 Jun 2024 03:03:08 AM UTC
0-47

What did you expect to happen?

The dedicated_cores pod should be allocated an exclusive CPU core.

How can we reproduce it (as minimally and precisely as possible)?

Create a dedicated cores pod, like dedicated_cores_pod.yaml, as mentioned above.

Software version

root@ubuntu:~/katalyst/examples# helm list -A
NAME               	NAMESPACE       	REVISION	UPDATED                                	STATUS  	CHART                        	APP VERSION
katalyst-colocation	katalyst-system 	1       	2024-05-24 09:28:44.44903291 +0000 UTC 	deployed	katalyst-colocation-orm-0.5.0	v0.5.0
malachite          	malachite-system	1       	2024-05-24 09:16:19.208333849 +0000 UTC	deployed	malachite-0.1.0              	0.1.0

Add helm chart for resource recommendation

katalyst-agent: get reclaimed resource.katalyst.kubewharf.io/reclaimed_memory capacity failed: resource memory last report value not found

What happened?

What did you expect to happen?

GetReportContentResponse from headroom-reporter-plugin Endpoint success.

How can we reproduce it (as minimally and precisely as possible)?

Use the image of the latest main branch

Software version

$ <software> version
# paste output here

Test flake: pkg/custom-metric/collector/prometheus Test_scrape

What happened?

A unit test case for custom metrics sometimes fails.

https://github.com/kubewharf/katalyst-core/actions/runs/4687761540/jobs/8307450662#step:4:2457

=== RUN   Test_scrape
I0413 09:42:44.219863   18202 scrape.go:102] start scrape manger with url: http://127.0.0.1:34219
    scrape_test.go:76: 
        	Error Trace:	/home/runner/work/katalyst-core/katalyst-core/pkg/custom-metric/collector/prometheus/scrape_test.go:76
        	Error:      	Not equal: 
        	            	expected: 4
        	            	actual  : 0
        	Test:       	Test_scrape
    scrape_test.go:77: 
        	Error Trace:	/home/runner/work/katalyst-core/katalyst-core/pkg/custom-metric/collector/prometheus/scrape_test.go:77
        	Error:      	elements differ
        	            	
        	            	extra elements in list A:
        	            	([]interface {}) (len=4) {
        	            	 (*data.MetricSeries)({
        	            	  Name: (string) (len=21) "none_namespace_metric",
        	            	  Labels: (map[string]string) (len=1) {
        	            	   (string) (len=4) "test": (string) (len=23) "none_namespace_metric_l"
        	            	  },
        	            	  Series: ([]*data.MetricData) (len=1) {
        	            	   (*data.MetricData)({
        	            	    Data: (int64) 0,
        	            	    Timestamp: (int64) 3
        	            	   })
        	            	  }
        	            	 }),
        	            	 (*data.MetricSeries)({
        	            	  Name: (string) (len=18) "none_object_metric",
        	            	  Labels: (map[string]string) (len=2) {
        	            	   (string) (len=7) "label_1": (string) (len=18) "none_object_metric",
        	            	   (string) (len=9) "namespace": (string) (len=2) "n1"
        	            	  },
        	            	  Series: ([]*data.MetricData) (len=1) {
        	            	   (*data.MetricData)({
        	            	    Data: (int64) 16,
        	            	    Timestamp: (int64) 4
        	            	   })
        	            	  }
        	            	 }),
        	            	 (*data.MetricSeries)({
        	            	  Name: (string) (len=11) "full_metric",
        	            	  Labels: (map[string]string) (len=4) {
        	            	   (string) (len=10) "label_test": (string) (len=4) "full",
        	            	   (string) (len=9) "namespace": (string) (len=2) "n1",
        	            	   (string) (len=6) "object": (string) (len=3) "pod",
        	            	   (string) (len=11) "object_name": (string) (len=5) "pod_1"
        	            	  },
        	            	  Series: ([]*data.MetricData) (len=1) {
        	            	   (*data.MetricData)({
        	            	    Data: (int64) 176,
        	            	    Timestamp: (int64) 55
        	            	   })
        	            	  }
        	            	 }),
        	            	 (*data.MetricSeries)({
        	            	  Name: (string) (len=22) "with_labeled_timestamp",
        	            	  Labels: (map[string]string) (len=4) {
        	            	   (string) (len=10) "label_test": (string) (len=7) "labeled",
        	            	   (string) (len=9) "namespace": (string) (len=2) "n1",
        	            	   (string) (len=6) "object": (string) (len=3) "pod",
        	            	   (string) (len=11) "object_name": (string) (len=5) "pod_2"
        	            	  },
        	            	  Series: ([]*data.MetricData) (len=1) {
        	            	   (*data.MetricData)({
        	            	    Data: (int64) 179,
        	            	    Timestamp: (int64) 123
        	            	   })
        	            	  }
        	            	 })
        	            	}
        	            	
        	            	
        	            	listA:
        	            	([]*data.MetricSeries) (len=4) {
        	            	 (*data.MetricSeries)({
        	            	  Name: (string) (len=21) "none_namespace_metric",
        	            	  Labels: (map[string]string) (len=1) {
        	            	   (string) (len=4) "test": (string) (len=23) "none_namespace_metric_l"
        	            	  },
        	            	  Series: ([]*data.MetricData) (len=1) {
        	            	   (*data.MetricData)({
        	            	    Data: (int64) 0,
        	            	    Timestamp: (int64) 3
        	            	   })
        	            	  }
        	            	 }),
        	            	 (*data.MetricSeries)({
        	            	  Name: (string) (len=18) "none_object_metric",
        	            	  Labels: (map[string]string) (len=2) {
        	            	   (string) (len=7) "label_1": (string) (len=18) "none_object_metric",
        	            	   (string) (len=9) "namespace": (string) (len=2) "n1"
        	            	  },
        	            	  Series: ([]*data.MetricData) (len=1) {
        	            	   (*data.MetricData)({
        	            	    Data: (int64) 16,
        	            	    Timestamp: (int64) 4
        	            	   })
        	            	  }
        	            	 }),
        	            	 (*data.MetricSeries)({
        	            	  Name: (string) (len=11) "full_metric",
        	            	  Labels: (map[string]string) (len=4) {
        	            	   (string) (len=10) "label_test": (string) (len=4) "full",
        	            	   (string) (len=9) "namespace": (string) (len=2) "n1",
        	            	   (string) (len=6) "object": (string) (len=3) "pod",
        	            	   (string) (len=11) "object_name": (string) (len=5) "pod_1"
        	            	  },
        	            	  Series: ([]*data.MetricData) (len=1) {
        	            	   (*data.MetricData)({
        	            	    Data: (int64) 176,
        	            	    Timestamp: (int64) 55
        	            	   })
        	            	  }
        	            	 }),
        	            	 (*data.MetricSeries)({
        	            	  Name: (string) (len=22) "with_labeled_timestamp",
        	            	  Labels: (map[string]string) (len=4) {
        	            	   (string) (len=10) "label_test": (string) (len=7) "labeled",
        	            	   (string) (len=9) "namespace": (string) (len=2) "n1",
        	            	   (string) (len=6) "object": (string) (len=3) "pod",
        	            	   (string) (len=11) "object_name": (string) (len=5) "pod_2"
        	            	  },
        	            	  Series: ([]*data.MetricData) (len=1) {
        	            	   (*data.MetricData)({
        	            	    Data: (int64) 179,
        	            	    Timestamp: (int64) 123
        	            	   })
        	            	  }
        	            	 })
        	            	}
        	            	
        	            	
        	            	listB:
        	            	([]*data.MetricSeries) <nil>
        	Test:       	Test_scrape
I0413 09:42:49.220486   18202 scrape.go:108] stop scrape manger with url: http://127.0.0.1:34219
--- FAIL: Test_scrape (5.00s)
FAIL
	github.com/kubewharf/katalyst-core/pkg/custom-metric/collector/prometheus	coverage: 33.1% of statements
FAIL	github.com/kubewharf/katalyst-core/pkg/custom-metric/collector/prometheus	5.071s

What did you expect to happen?

How can we reproduce it (as minimally and precisely as possible)?

Software version

413cc12