openkruise / kruise-game

Game Servers Management on Kubernetes

Home Page: https://openkruise.io/kruisegame/introduction

License: Apache License 2.0

Dockerfile 0.13% Makefile 0.89% Go 97.98% Shell 1.00%
game-server go kubernetes openkruise multiplayer kruise-game openkruisegame okg

kruise-game's People

Contributors

chrisliu1995, clarklee92, fillzpp, lizhipeng629, ringtail, shawnall, smartwang, somefive, songkang7, wangying-ly, whislly, yuanyiyi


kruise-game's Issues

Enhance | ServiceQuality supports multiple results returned by a single probe

Background

The probe state of the service quality feature can currently only be marked true/false, so exposing multiple states often requires multiple detection scripts. The problem with running multiple service quality scripts is that their probe execution cycles differ slightly; even if conflicting logic is avoided at the script level, there is still some probability that several probes return true at the same time, producing conflicting status updates.

Proposal

It is proposed that a single probe can return multiple results, and that the user can configure a different action for each result.

API

// Unchanged
type ServiceQuality struct {
	corev1.Probe  `json:",inline"`
	Name          string `json:"name"`
	ContainerName string `json:"containerName,omitempty"`
	// Permanent controls whether the GameServerSpec remains unchanged after a ServiceQualityAction is executed.
	// When Permanent is true, the ServiceQualityAction is executed only once, regardless of subsequent probe results.
	// When Permanent is false, the ServiceQualityAction can be executed again even after it has already been executed.
	Permanent            bool                   `json:"permanent"`
	ServiceQualityAction []ServiceQualityAction `json:"serviceQualityAction,omitempty"`
}

type ServiceQualityAction struct {
	State bool `json:"state"`
	// Result indicates the probe message returned by the script.
	// When Result is defined, the action is executed only when the script actually returns that Result.
	Result         string `json:"result,omitempty"`

	GameServerSpec `json:",inline"`
}

type ServiceQualityCondition struct {
	Name   string `json:"name"`
	Status string `json:"status,omitempty"`

	// Result indicates the probe message returned by the script.
	Result                   string      `json:"result,omitempty"`

	LastProbeTime            metav1.Time `json:"lastProbeTime,omitempty"`
	LastTransitionTime       metav1.Time `json:"lastTransitionTime,omitempty"`
	LastActionTransitionTime metav1.Time `json:"lastActionTransitionTime,omitempty"`
}

Add a Result field to ServiceQualityAction. When it is specified, the GameServer Spec is changed only when the script returns the corresponding value. In this way, a single detection script can drive multiple states.
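For illustration, a minimal sketch of how a GameServerSet might declare one probe whose different returned messages map to different opsStates, assuming the proposed result field (the script name and the returned messages are hypothetical):

serviceQualities:
  - name: player-state
    containerName: minecraft
    permanent: false
    exec:
      # hypothetical script that prints one of several result messages
      command: ["bash", "./probe.sh"]
    serviceQualityAction:
      # probe returned "idle": mark the game server as WaitToBeDeleted
      - state: true
        result: idle
        opsState: WaitToBeDeleted
      # probe returned "allocated": keep the game server
      - state: true
        result: allocated
        opsState: None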

New image update strategy

It is recommended to add an update strategy that allows the image to be updated only when pods are in certain states.
When the container status is controlled through scripts, the following problems arise:
1. If a pod's state is Allocated, its image is still updated, which affects players who are currently playing.
2. To work around problem 1, gss.yaml has to be rebuilt; if a scaling mechanism is used, scaled.yaml has to be rebuilt as well, and the previous yaml deleted, which is very cumbersome. There are, of course, other ways.

controller restart when GameServerSet redeploy to cluster

Background

We encountered an issue while configuring Hostport network mode using OKG's GameServerSet.

First, the GameServerSet was deployed in the cluster; its pod started and reached Running status.
Then I deleted the GameServerSet with "kubectl delete"; at that point the pod's status was Terminating.
Before the pod finished exiting, I re-applied the GameServerSet. However, the newly created pod failed to obtain the Hostport.

Later, I deleted this GameServerSet again, waited for the pod to completely exit, and then re-applied this GameServerSet. The pod was able to obtain the Hostport information correctly.

Upon investigation, I found that the kruise-game controller panicked and restarted during this sequence of operations, which I suspect is the root cause.

Deployment file

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: trunk
spec:
  replicas: 1
  updateStrategy:
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible
  network:
    networkType: Kubernetes-HostPort
    networkConf:
    - name: ContainerPorts 
      value: "container1:5000/TCP"
  gameServerTemplate:
    spec:
      containers:
        - image: container1-image
          imagePullPolicy: IfNotPresent
          name: container1
          env:
          - name: KRUISE_CONTAINER_PRIORITY
            value: "2"
          volumeMounts:
            - name: network
              mountPath: /opt/network
        - image: container2-image
          imagePullPolicy: IfNotPresent
          name: container2
          env:
          - name: KRUISE_CONTAINER_PRIORITY
            value: "1"
      volumes:
      - name: network
        downwardAPI:
          items:
          - path: "annotations"
            fieldRef:
              fieldPath: metadata.annotations['game.kruise.io/network-status']

    volumeClaimTemplates:
      - metadata:
          name: db-storage
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: "cfs"
          resources:
            requests:
              storage: 10Gi

Controller logs

1.686739246445438e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/validate-v1alpha1-gss", "UID": "e5c1cf00-f3d3-46ae-b86a-9b5b8ad537a6", "kind": "game.kruise.io/v1alpha1, Kind=GameServerSet", "resource": {"group":"game.kruise.io","version":"v1alpha1","resource":"gameserversets"}}
1.686739246445752e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/validate-v1alpha1-gss", "code": 200, "reason": "pass validating", "UID": "e5c1cf00-f3d3-46ae-b86a-9b5b8ad537a6", "allowed": true}
1.6867392465119815e+09	DEBUG	events	Normal	{"object": {"kind":"GameServerSet","namespace":"default","name":"a4","uid":"587e3fb3-cc3a-4b12-aa3b-6254ff0d7875","apiVersion":"game.kruise.io/v1alpha1","resourceVersion":"33261501457"}, "reason": "CreateWorkload", "message": "created Advanced StatefulSet"}
1.6867392465414512e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "a8cf9c48-0ab7-4366-978a-885777aee37f", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392465423882e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "a8cf9c48-0ab7-4366-978a-885777aee37f", "allowed": true}
1.6867392466240625e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "3d27baa4-34d7-4681-b3c3-42ac7bee3ea2", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392466250088e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "3d27baa4-34d7-4681-b3c3-42ac7bee3ea2", "allowed": true}
1.686739246684605e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "82bb9c68-6182-44ad-9874-d5845f78af91", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.686739246685527e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "82bb9c68-6182-44ad-9874-d5845f78af91", "allowed": true}
1.6867392467381902e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "413f0e56-ed28-46b2-a6c6-b7b183adbec2", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.686739246739107e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "413f0e56-ed28-46b2-a6c6-b7b183adbec2", "allowed": true}
1.686739246776154e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "2a41ab6a-97d3-4c68-847b-450fcebd1752", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392467770834e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "2a41ab6a-97d3-4c68-847b-450fcebd1752", "allowed": true}
1.6867392468166428e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "1da9233f-f440-4845-84b4-b807ea56279e", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392468175259e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "1da9233f-f440-4845-84b4-b807ea56279e", "allowed": true}
1.6867392470155354e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "a23451b2-dd25-4628-995f-56ca2470e466", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392470164003e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "a23451b2-dd25-4628-995f-56ca2470e466", "allowed": true}
1.6867392473775764e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "964ff6a1-5588-4aca-a1a5-1959576b29eb", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392473784983e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "964ff6a1-5588-4aca-a1a5-1959576b29eb", "allowed": true}
1.6867392481780772e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "54e1e9c6-5769-48e3-86ec-b06390cb19f8", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392481789427e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "54e1e9c6-5769-48e3-86ec-b06390cb19f8", "allowed": true}
1.6867392495441675e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "96ac2a82-7933-4894-8c27-0764d4069dd1", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392495450299e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "96ac2a82-7933-4894-8c27-0764d4069dd1", "allowed": true}
1.6867392521424189e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "0d92f93c-0e2e-4a55-ad51-e8425ce19de4", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392521432865e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "0d92f93c-0e2e-4a55-ad51-e8425ce19de4", "allowed": true}
1.68673925730318e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "16658c6b-f5e2-4976-b9c7-eb08e8d4bdb1", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392573040788e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "16658c6b-f5e2-4976-b9c7-eb08e8d4bdb1", "allowed": true}
1.6867392675877454e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "f7c72efe-8657-4122-97c4-7ebc8c530a3c", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
1.6867392675886924e+09	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-v1-pod", "code": 200, "reason": "", "UID": "f7c72efe-8657-4122-97c4-7ebc8c530a3c", "allowed": true}
1.6867392744261086e+09	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-v1-pod", "UID": "dedc8191-1383-4fd4-a151-ada8b4a73caf", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
panic: runtime error: index out of range [-1]

goroutine 1790 [running]:
github.com/openkruise/kruise-game/cloudprovider/kubernetes.(*HostPortPlugin).deAllocate(0xc0003cf8b0, {0xc0008025d0, 0x1, 0x1a1172e?}, {0xc00099acd0, 0xc})
	/workspace/cloudprovider/kubernetes/hostPort.go:251 +0x169
github.com/openkruise/kruise-game/cloudprovider/kubernetes.(*HostPortPlugin).OnPodDeleted(0xc0003cf8b0, {0x7000000000000?, 0xc00077a060?}, 0xc000700800, {0x0?, 0x0?})
	/workspace/cloudprovider/kubernetes/hostPort.go:175 +0x14a
github.com/openkruise/kruise-game/pkg/webhook.(*PodMutatingHandler).Handle.func1()
	/workspace/pkg/webhook/mutating_pod.go:81 +0xdf
created by github.com/openkruise/kruise-game/pkg/webhook.(*PodMutatingHandler).Handle
	/workspace/pkg/webhook/mutating_pod.go:72 +0x37d

GameServer lifecycle optimization

Background

Currently, GameServers are recycled by the GameServerSet controller: when the GameServerSet finds that a managed pod no longer exists in the cluster, the corresponding GameServer is deleted. One problem with this recycling method is that the GameServer lifecycle is not deterministic. For example, when a pod is deleted and rebuilt, if the rebuild is fast the GameServer is not recycled and its attributes are retained; but if the rebuild is slow, the GameServer is also deleted and a new GameServer with the same name is created after the pod comes up, so the state previously recorded in the GameServer is lost.

Proposal

I think we need a more deterministic approach to lifecycle management, and the choice should be left to the user. The user decides whether the owner of a GameServer is the GameServerSet or the Pod. If the owner is the GameServerSet, the GameServer is not deleted when the pod is deleted; it is deleted only when the GameServerSet is deleted. If the owner is the Pod, the GameServer is deleted when the pod is deleted.
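As a rough illustration only (the proposal does not specify a field name; gameServerOwner below is hypothetical), the choice could be surfaced on the GameServerSet spec:

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: minecraft
spec:
  replicas: 3
  # hypothetical field expressing the proposal: GameServerSet | Pod
  gameServerOwner: GameServerSet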

service quality probe 100% failed

Kruise version: 1.5.0
Kruise game version: 0.6.1

(screenshot omitted)

Game server set sample from https://openkruise.io/zh/kruisegame/user-manuals/service-qualities :

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: minecraft
  namespace: default
spec:
  replicas: 3
  gameServerTemplate:
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/gs-demo/gameserver:idle
          name: minecraft
  updateStrategy:
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible
      maxUnavailable: 100%
  serviceQualities: # a service quality named idle is configured
    - name: idle
      containerName: minecraft
      permanent: false
      # similar to a native probe; this example runs a script to detect whether the game server is idle (no players)
      exec:
        command: ["bash", "./idle.sh"]
      serviceQualityAction:
        # no players: mark the game server opsState as WaitToBeDeleted
        - state: true
          opsState: WaitToBeDeleted
        # players present: mark the game server opsState as None
        - state: false
          opsState: None

kubectl -n kruise-system logs -f kruise-daemon-pk252

(screenshot of kruise-daemon log output omitted)

This problem also occurs in Kruise version 1.5.1

Hostport not allocated for rebuilt Pods with potential for intermittent error

Background

Firstly, please note the following series of events:

A Pod was created in the cluster through an OKG GameServerSet (GSS), with a hostport allocated, and was working normally.
After updating the image and adding a ReadinessProbe to the GSS configuration, the new Pod failed to start (a gRPC ReadinessProbe was used, which the cluster does not yet support).
I then removed the problematic ReadinessProbe configuration, which triggered another rebuild, and the new Pod was created successfully.
Curiously, the newly created Pod did not get a hostport allocated.

Inspecting the controller logs, it appears the original Pod never triggered a deallocation when it was deleted, so the controller still considers the Pod's hostport to be allocated. The related allocation process deserves further investigation to prevent similar errors in the future.

Query

How can such instances be avoided? Is there any mechanism to verify the deallocation of the hostport when the Pod is deleted? Any advice regarding resolving this issue would be greatly appreciated.

Suggestion: add more documentation links to the home page

I just started working with this project. My impression is that the project itself is good, but the documentation is sparse.

There are actually some documents, but they are hidden in this directory. Like other projects, please put links directly on the home page so users can find them easily.

The articles introducing the project's design goals are rather brief. I found an article in the Alibaba Cloud community that explains it in detail, although its formatting is messy; it could be cleaned up and added to the project documentation.

Unexpected scale-down result with ReserveIds

Initial state

The gss is configured as follows:

...
spec:
  replicas: 5
  reserveGameServerIds:
  - 0
  - 2
  scaleStrategy:
    scaleDownStrategyType: ReserveIds
...

The gs list is as follows:

NAME          STATE   OPSSTATE   DP    UP    AGE
nginx-okg-1   Ready   None       0     0     23h
nginx-okg-3   Ready   None       0     0     6h57m
nginx-okg-4   Ready   None       0     0     7m26s
nginx-okg-5   Ready   None       0     0     7m26s
nginx-okg-6   Ready   None       0     0     6m19s

Changes made to the gss:

kubectl edit gss nginx-okg
spec:
  replicas: 3
  reserveGameServerIds:
  - 0
  - 4
  scaleStrategy:
    scaleDownStrategyType: ReserveIds
...

Result after the change was applied:

...
spec:
  replicas: 3
  reserveGameServerIds:
  - 0
  - 4
  - 6
  scaleStrategy:
    scaleDownStrategyType: ReserveIds
...
NAME          STATE   OPSSTATE   DP    UP    AGE
nginx-okg-1   Ready   None       0     0     24h
nginx-okg-2   Ready   None       0     0     2m9s
nginx-okg-3   Ready   None       0     0     7h3m

Expected result

1. The primary expectation is that gs [1, 3, 5] are retained and reserveGameServerIds is updated to [0, 4, 6].
2. I also do not understand why a new gs-2 was created. Even assuming that creating gs-2 is expected, why was reserveGameServerIds not updated to [0, 4, 5, 6]?

Add serviceName support to GameServerSet for individual Pod DNS resolution

Hello,

I've been working with the OpenKruise Game project, specifically with the GameServerSet custom resource. I noticed that when using GameServerSet, the generated Pods do not have individual DNS records, which seems to be due to the missing subdomain field in the Pods' spec part.

In contrast, when using a StatefulSet with a specified serviceName, the Pods have a subdomain field, which allows for individual DNS resolution for each Pod.

To improve the functionality of GameServerSet, I would like to suggest adding support for a serviceName-like field or directly using the subdomain field. This would enable individual DNS resolution for each Pod managed by a GameServerSet, making it more convenient for use cases that require addressing individual Pods.

Please let me know if this is something that can be considered for implementation in the OpenKruise Game project or if there are any workarounds available to achieve this functionality.
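For comparison, a minimal sketch of the standard pattern the proposal refers to: a StatefulSet with serviceName plus a headless Service gives each Pod a stable DNS record of the form pod-name.service-name.namespace.svc. All names below are illustrative.

apiVersion: v1
kind: Service
metadata:
  name: game-headless
spec:
  clusterIP: None          # headless Service: creates per-Pod DNS records
  selector:
    app: game
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: game
spec:
  serviceName: game-headless   # sets hostname/subdomain on each Pod
  replicas: 2
  selector:
    matchLabels:
      app: game
  template:
    metadata:
      labels:
        app: game
    spec:
      containers:
        - name: game
          image: nginx:1.21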

reserveGameServerIds does not take effect when creating a gss

Creating a gss with the following yaml does not produce gs numbered [1-4] as expected; the resulting gs and pod numbers are still [0-3].

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: minecraft
spec:
  replicas: 4
  reserveGameServerIds: [0]
  gameServerTemplate:
    spec:
      containers:
        - name: minecraft
          image: registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2

Checking the events shows that the pod numbered 4 was deleted:

LAST SEEN   TYPE      REASON                                          OBJECT                    MESSAGE
3m7s        Normal    Scheduled                                       pod/minecraft-0           Successfully assigned default/minecraft-0 to ssl-k8s-126-3
3m6s        Normal    Pulling                                         pod/minecraft-0           Pulling image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2"
3m5s        Normal    Pulled                                          pod/minecraft-0           Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2" in 353.010049ms (685.430123ms including waiting)
3m5s        Warning   ContainersNotReady                              gameserver/minecraft-0    containers with unready status: [minecraft]
3m5s        Normal    Created                                         pod/minecraft-0           Created container minecraft
3m5s        Normal    Started                                         pod/minecraft-0           Started container minecraft
3m4s        Normal    GsStateChanged                                  gameserver/minecraft-0    State turn from Creating to Ready
3m7s        Normal    Scheduled                                       pod/minecraft-1           Successfully assigned default/minecraft-1 to ssl-k8s-126-3
3m7s        Warning   ContainersNotReady                              gameserver/minecraft-1    containers with unready status: [minecraft]
3m6s        Normal    Pulling                                         pod/minecraft-1           Pulling image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2"
3m6s        Normal    Pulled                                          pod/minecraft-1           Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2" in 338.014ms (338.119396ms including waiting)
3m6s        Normal    Created                                         pod/minecraft-1           Created container minecraft
3m6s        Normal    Started                                         pod/minecraft-1           Started container minecraft
3m5s        Normal    GsStateChanged                                  gameserver/minecraft-1    State turn from Creating to NotReady
3m3s        Normal    GsStateChanged                                  gameserver/minecraft-1    State turn from NotReady to Ready
3m7s        Normal    Scheduled                                       pod/minecraft-2           Successfully assigned default/minecraft-2 to ssl-k8s-126-3
3m7s        Warning   ContainersNotReady                              gameserver/minecraft-2    containers with unready status: [minecraft]
3m7s        Normal    Pulling                                         pod/minecraft-2           Pulling image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2"
3m6s        Normal    Pulled                                          pod/minecraft-2           Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2" in 299.383572ms (299.388314ms including waiting)
3m6s        Normal    Created                                         pod/minecraft-2           Created container minecraft
3m6s        Normal    Started                                         pod/minecraft-2           Started container minecraft
3m5s        Normal    GsStateChanged                                  gameserver/minecraft-2    State turn from Creating to Ready
3m7s        Normal    Scheduled                                       pod/minecraft-3           Successfully assigned default/minecraft-3 to ssl-k8s-126-3
3m7s        Warning   ContainersNotReady                              gameserver/minecraft-3    containers with unready status: [minecraft]
3m6s        Normal    Pulling                                         pod/minecraft-3           Pulling image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2"
3m6s        Normal    Pulled                                          pod/minecraft-3           Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2" in 349.925159ms (616.176777ms including waiting)
3m6s        Normal    Created                                         pod/minecraft-3           Created container minecraft
3m6s        Normal    Started                                         pod/minecraft-3           Started container minecraft
3m4s        Normal    GsStateChanged                                  gameserver/minecraft-3    State turn from Creating to Ready
3m7s        Normal    Scheduled                                       pod/minecraft-4           Successfully assigned default/minecraft-4 to ssl-k8s-126-3
3m6s        Warning   ContainersNotReady                              gameserver/minecraft-4    containers with unready status: [minecraft]
3m6s        Normal    Pulling                                         pod/minecraft-4           Pulling image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2"
3m5s        Normal    Pulled                                          pod/minecraft-4           Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2" in 363.68806ms (772.307662ms including waiting)
3m5s        Normal    Created                                         pod/minecraft-4           Created container minecraft
3m5s        Normal    Started                                         pod/minecraft-4           Started container minecraft
3m4s        Normal    Killing                                         pod/minecraft-4           Stopping container minecraft
3m3s        Normal    GsStateChanged                                  gameserver/minecraft-4    State turn from Creating to Deleting
2m32s       Warning   ContainersNotReady; ContainerTerminated:Error   gameserver/minecraft-4    containers with unready status: [minecraft];  ExitCode: 137
3m7s        Normal    CreateWorkload                                  gameserverset/minecraft   created Advanced StatefulSet
3m7s        Normal    Scale                                           gameserverset/minecraft   scale from 0 to 4
3m7s        Normal    SuccessfulCreate                                statefulset/minecraft     create Pod minecraft-1 in StatefulSet minecraft successful
3m7s        Normal    SuccessfulCreate                                statefulset/minecraft     create Pod minecraft-2 in StatefulSet minecraft successful
3m7s        Normal    SuccessfulCreate                                statefulset/minecraft     create Pod minecraft-3 in StatefulSet minecraft successful
3m7s        Normal    SuccessfulCreate                                statefulset/minecraft     create Pod minecraft-4 in StatefulSet minecraft successful
3m7s        Normal    SuccessfulCreate                                statefulset/minecraft     create Pod minecraft-0 in StatefulSet minecraft successful
3m7s        Normal    SuccessfulDelete                                statefulset/minecraft     delete Pod minecraft-4 in StatefulSet minecraft successful

Occasional "NotReady" Network Status on Pod Upon Rebuilding a GameServerSet

We are experiencing an issue where, after updating our GameServerSet (GSS) and thereby rebuilding all managed Pods, one of the six running GameServers occasionally fails to retrieve its network information and ends up with a "NotReady" network status. The specific details and steps that lead to this issue are below:

Environment:

Network Plugin: HostPort
Number of GameServer replicas in the GSS: 6

Steps to Reproduce:

  1. Update the GSS by changing the container image and environment variables. This action triggers a rebuild of all Pods managed by the GSS.
  2. After the old Pods are deleted and new ones are recreated, one of the six Pods encounters an error in obtaining network information.

Expected Behavior:

After the update and subsequent Pod recreation, all Pods should successfully retrieve their network information and display a "Ready" network status.

Log information

I am observing logs from kruise-game-manager that warrant attention. Here are the specific log entries:

2024-01-26T14:59:46+08:00 I0126 06:59:46.237778       1 hostPort.go:73] Receiving pod dev/gs-dev-a4-3 ADD Operation
2024-01-26T14:59:46+08:00 I0126 06:59:46.237840       1 hostPort.go:80] There is a pod with same ns/name(dev/gs-dev-a4-3) exists in cluster, do not allocate

[Proposal] Design of refactor for Pod-Mutating-Webhook-Handler

Present Situation

Currently, the pod mutating webhook handler contains two parts:

  • Image and resource specifications from the gs spec containers: when the pod is created, the corresponding fields are patched onto the pod based on what gs.spec.containers declares.
  • Network plugins, which patch the related network annotation fields.

When the network plugin mechanism was originally designed, an asynchronous model was proposed in which network setup and pod creation happen asynchronously: the network plugin is allowed to fail in OnCreate/OnUpdate, and the controller triggers another Update event to ensure the network is eventually created and available. Therefore, when the webhook handler encounters an error returned by a network plugin, it will: 1) not modify the pod fields (it returns the original pod directly); 2) allow the pod to be created or updated normally, treating the network plugin's action as a no-op and ignoring the error.

However, there are two problems with the current mechanism:

  1. If the user uses the OKG network plug-in and declares the gs Container field, and the network plug-in encounters an error in Pod OnCreate, the pod will be created normally according to the value in the template, causing the gs Container declaration to become invalid.
  2. The hostport network model is different from the others: it requires the network port to be allocated and patched onto the pod when the pod is created. If an error occurs in this process, the network cannot be re-created in a subsequent update, and the pod ends up without a usable network.

Solution

It is recommended to follow Kubernetes' error handling semantics and retrigger the event when an error is encountered. So:

When the network plugin reports an error, the webhook handler rejects the pod creation/update, marks it as failed, and the pod creation/update action is retriggered until it executes successfully without errors. The network ready check is still performed asynchronously. The purpose of this proposal is to ensure that an operation executes successfully, regardless of the operation's subsequent results.

Note that for current network plugins such as slb, the network only takes effect after the pod is created; under the new mechanism the pod-create action would therefore fail repeatedly and the pod could never be created, so these plugins need additional changes.

Feat | Add AutoUpdateStrategy

Background

Currently, GameServerSet supports batch updates in a user-defined manner by setting UpdatePriority and Partition. However, under this strategy users have to operate on gss and gs objects frequently, and they often want rolling updates to be completed in a more automated way.
There are currently two scenario requirements:

  1. Existing gs under the gss are not updated, while newly created gs use the new image. This way, users can achieve version hot updates through OKG's automatic scaling capabilities without additional manual intervention.
  2. Only gs in user-specified states are updated. The gss decides whether a gs is updated based on its current status: if it matches a user-specified state, it is updated; otherwise it is not.

API

type UpdateStrategy struct {
	// Type indicates the type of the StatefulSetUpdateStrategy.
	// Default is RollingUpdate.
	// +optional
	Type apps.StatefulSetUpdateStrategyType `json:"type,omitempty"`
	// RollingUpdate is used to communicate parameters when Type is RollingUpdateStatefulSetStrategyType.
	// +optional
	RollingUpdate *RollingUpdateStatefulSetStrategy `json:"rollingUpdate,omitempty"`
	// AutoUpdateStrategy means that the update process will be performed automatically without user intervention.
	// +optional
	AutoUpdateStrategy *AutoUpdateStrategy `json:"autoUpdateStrategy,omitempty"`
}

type AutoUpdateStrategy struct {
	//+kubebuilder:validation:Required
	Type AutoUpdateStrategyType `json:"type"`
	// Only GameServers in SpecificStates will be updated.
	// +optional
	SpecificStates []OpsState `json:"specificStates,omitempty"`
}

type AutoUpdateStrategyType string

const (
	// OnlyNewAutoUpdateStrategyType indicates that existing GameServers will never be updated; new GameServers will be created from the new template.
	OnlyNewAutoUpdateStrategyType AutoUpdateStrategyType = "OnlyNew"
	// SpecificStateAutoUpdateStrategyType indicates only GameServers with Specific OpsStates will be updated.
	SpecificStateAutoUpdateStrategyType AutoUpdateStrategyType = "SpecificState"
)
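For illustration, a sketch of how the proposed strategy might be declared on a GameServerSet (field names follow the proposed API above; this is not an existing field):

spec:
  updateStrategy:
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible
    # proposed field: only GameServers whose opsState matches specificStates are updated
    autoUpdateStrategy:
      type: SpecificState
      specificStates: ["WaitToBeDeleted"]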

Suggestion: support a configurable log format with JSON structured logging

We use Kibana to search logs, and it only supports structured logs in JSON format. It is suggested to add a log-format configuration option that supports JSON output, for example:

{"time":"2024-05-31T04:23:41.168044065Z","level":"INFO","source":{"function":"github.com/CloudNativeGame/kruise-game-open-match-director/pkg/logger.InfoContext","file":"/go/src/director/pkg/logger/logger.go","line":56},"msg":"begin FetchMatches","traceid":"3bc82e362d67a3cf6c55a9104d24456e","sampled":true}

Feat: Add GameServerConfig to patch different GameServers different labels/annotations

Background

Managing the configuration of game servers is a common issue after game containerization. The configuration of game servers can be presented through labels or annotations in Kubernetes (k8s), and then passed down to the containers using the Downward API for business awareness. However, in scenarios like PvE games or MMORPGs, each game server has its own unique configuration. This means that each game server requires distinct labels or annotations. Generally, the keys of these labels and annotations are the same across different game servers, only the values differ. We need a way to manage the different labels and annotations of different game servers in a batch, automatic, and persistent manner. Therefore, I propose a new custom resource definition (CRD) object called GameServerConfig.

Design

API


type GameServerConfigSpec struct {
	GameServerSetName string            `json:"gameServerSetName"`
	LabelConfigs      []StringMapConfig `json:"labelConfigs,omitempty"`
	AnnotationConfigs []StringMapConfig `json:"annotationConfigs,omitempty"`
}

type StringMapConfig struct {
	Type       StringMapConfigType `json:"type"`
	KeyName    string              `json:"keyName"`
	IdValues   []IdValue           `json:"idValues,omitempty"`
	RenderRule string              `json:"renderRule,omitempty"`
}

type StringMapConfigType string

const (
	SpecifyID StringMapConfigType = "SpecifyID"
	RenderID  StringMapConfigType = "RenderID"
)

type IdValue struct {
	IdList []int  `json:"idList,omitempty"`
	Value  string `json:"value,omitempty"`
}

type GameServerConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   GameServerConfigSpec   `json:"spec,omitempty"`
	Status GameServerConfigStatus `json:"status,omitempty"`
}

type GameServerConfigList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []GameServerConfig `json:"items"`
}

type GameServerConfigState string

const (
	Pending GameServerConfigState = "Pending"
	Succeed GameServerConfigState = "Succeed"
)

type GameServerConfigStatus struct {
	State              GameServerConfigState `json:"state,omitempty"`
	LastTransitionTime metav1.Time           `json:"lastTransitionTime,omitempty"`
}

Example

There are 8 GameServers managed by GameServerSet minecraft. If a GameServerConfig as follows is applied:

	gsc := GameServerConfig{
		Spec: GameServerConfigSpec{
			GameServerSetName: "minecraft",
			LabelConfigs: []StringMapConfig{
				{
					Type:    SpecifyID,
					KeyName: "zone-id",
					IdValues: []IdValue{
						{
							IdList: []int{1, 3, 4},
							Value:  "8001",
						},
						{
							IdList: []int{0, 2, 5},
							Value:  "8002",
						},
						{
							IdList: []int{6},
							Value:  "8003",
						},
						{
							IdList: []int{7},
							Value:  "8004",
						},
					},
				},
			},
			AnnotationConfigs: []StringMapConfig{
				{
					Type:       RenderID,
					KeyName:    "group-name",
					RenderRule: "group-<id>",
				},
			},
		},
	}

The GameServers' labels & annotations will be:

GameServer Name   Label           Annotation
minecraft-0       zone-id: 8002   group-name: group-0
minecraft-1       zone-id: 8001   group-name: group-1
minecraft-2       zone-id: 8002   group-name: group-2
minecraft-3       zone-id: 8001   group-name: group-3
minecraft-4       zone-id: 8001   group-name: group-4
minecraft-5       zone-id: 8002   group-name: group-5
minecraft-6       zone-id: 8003   group-name: group-6
minecraft-7       zone-id: 8004   group-name: group-7
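The same example expressed as a manifest of the proposed GameServerConfig CRD (a sketch; the apiVersion and object name are assumptions, and field casing follows the JSON tags in the API above):

apiVersion: game.kruise.io/v1alpha1
kind: GameServerConfig
metadata:
  name: minecraft-config
spec:
  gameServerSetName: minecraft
  labelConfigs:
    - type: SpecifyID
      keyName: zone-id
      idValues:
        - idList: [1, 3, 4]
          value: "8001"
        - idList: [0, 2, 5]
          value: "8002"
        - idList: [6]
          value: "8003"
        - idList: [7]
          value: "8004"
  annotationConfigs:
    - type: RenderID
      keyName: group-name
      renderRule: "group-<id>"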

[proposal] Autoscaler Improvement

Background

The OKG autoscaler is built on KEDA's external scaler mechanism, which provides two interfaces: GetMetricSpec exposes the Target, and GetMetrics exposes the Value.

The current OKG autoscaler uses the GameServerSet's replica count as the Target in GetMetricSpec, and the replica count minus the number of WaitToBeDeleted GameServers as the Value in GetMetrics. Since GetMetricSpec and GetMetrics are called asynchronously, the replica count observed by the two calls can differ at some moments, and the desired replica count calculated by the HPA then does not meet expectations.

Improvement

The proposed improvement is to fix the target value returned by GetMetricSpec, let GetMetrics alone determine whether to scale down, and change the scaler's metric type from Value to AverageValue. After the change, the ratio of value to target can only be less than or equal to 1, and scale-down is performed only when the ratio is less than 1, which fixes the current problem of occasional unexpected scale-ups.
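As a hypothetical worked example: suppose the GameServerSet has 10 replicas, 3 of which are WaitToBeDeleted, and the target is fixed at an AverageValue of 1. GetMetrics then reports value = 10 - 3 = 7, the per-replica ratio is 7 / 10 < 1, and the HPA scales the workload down to ceil(7 / 1) = 7 replicas; when no GameServers are WaitToBeDeleted the ratio is exactly 1 and the replica count stays unchanged.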

Feat | add SLB network plugin

Background

Cloud load balancer, as a very mature network cloud product, has been well known by developers and has been widely used. However, in game scenarios, due to the stateful nature of game servers, user traffic cannot be balanced to different game servers, which runs counter to the concept of Service in Kubernetes.

The Service matches the corresponding Pod, and balances the traffic carried by the LB to different pods. As shown in the figure below, the port corresponding to the Service is 80, and the targetPort is 80. Only one port is opened on the LB.

(figure omitted)

In the game server scenario, a single LB should open different ports and forward the traffic to the corresponding Pod. As shown in the figure below, the traffic is forwarded from port 555 of LB to port 80 of pod0, from port 556 of LB to port 80 of pod1, and from port 557 of LB to port 80 of pod2. This way of using LB is what the game server needs.

(figure omitted)

How to use

Using the OKG [cloud provider & network plugin mechanism](#15), the plugin is used as follows:

Specify network configuration when deploying GameServerSet:

cat <<EOF | kubectl apply -f -
apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: gs-slb
  namespace: default
spec:
  replicas: 1
  updateStrategy:
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible
  network:
    networkType: AlibabaCloud-SLB
    networkConf:
    - name: SlbIds
      #Fill in Alibaba Cloud LoadBalancer Id here
      value: "lb-xxxxxxxxxxxxxxxxx"
    - name: PortProtocols
      #Fill in the exposed ports and their corresponding protocols here. 
      #If there are multiple ports, the format is as follows: {port1}/{protocol1},{port2}/{protocol2}...
      #If the protocol is not filled in, the default is TCP
      value: "80"
    - name: Fixed
      #Fill in here whether a fixed IP is required [optional] ; Default is false
      value: "false"
  gameServerTemplate:
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/gs-demo/gameserver:network
          name: gameserver
EOF

Check network status in GameServer:

networkStatus:
    createTime: "2022-11-24T01:27:30Z"
    currentNetworkState: Ready
    desiredNetworkState: Ready
    externalAddresses:
    - ip: 47.97.167.217
      ports:
      - name: "80"
        port: "611"
        protocol: TCP
    internalAddresses:
    - ip: 172.16.0.17
      ports:
      - name: "80"
        port: "80"
        protocol: TCP
    lastTransitionTime: "2022-11-24T01:27:30Z"
    networkType: Ali-SLB

Detailed

Design Overview

ACK (Alibaba Cloud Container Service for Kubernetes) supports SLB reuse in k8s: different Services can use different ports of the same SLB. Based on this, the Ali-SLB network plugin records the port assignments of each SLB. For game servers whose network type is Ali-SLB, the plugin automatically allocates a port and creates a Service object; once the public IP in the Service's ingress field is created successfully, the GameServer's network enters the Ready state and the process is complete.

(figure omitted)

Fixed-IP

When Fixed is set to true in the GameServerSet's network configuration, the fixed-IP function takes effect: even if the Pod is deleted and rebuilt, the traffic path from SLB port to Pod port for that game server does not change.

When the SVC is created, its ownerReference is set according to Fixed. When Fixed is true, the owner of the SVC is the GameServerSet, and the SVC is deleted only when the GameServerSet is deleted; when Fixed is false, the owner of the SVC is the Pod, and the SVC is deleted along with the Pod.
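For reference, a sketch of the ownerReferences metadata the plugin would set on the generated Service in the Fixed=true case (object names and uid are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: gs-slb-0
  ownerReferences:
    - apiVersion: game.kruise.io/v1alpha1
      kind: GameServerSet        # with Fixed=false this would reference the Pod instead
      name: gs-slb
      uid: 00000000-0000-0000-0000-000000000000
      controller: true
      blockOwnerDeletion: true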

Network Isolation

The Ali-SLB network plugin provides network isolation at the SVC level: even when the Pod is Ready, external access to the game server can be removed.

When the networkDisabled field of GameServer.Spec is set to true, the Ali-SLB network plugin isolates the game server by changing the corresponding SVC type from LoadBalancer to ClusterIP, cutting off external traffic. This is useful for scenarios such as testing a game server after an update before reopening it to players, or cutting off traffic when a game server behaves abnormally.
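For example, a minimal sketch of triggering the isolation described above, assuming the networkDisabled field on GameServer.Spec:

apiVersion: game.kruise.io/v1alpha1
kind: GameServer
metadata:
  name: gs-slb-0
spec:
  # setting this to true asks the network plugin to switch the SVC from LoadBalancer to ClusterIP
  networkDisabled: true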

Feat | OpenKruiseGame Dashboard

OpenKruiseGame has been adopted by lots of outstanding companies and we plan to set up a new project named kruise-game-dashboard to help more developers.

kruise-game-dashboard is a work in progress targeting two popular open source Kubernetes dashboards, following the directions below.

Developers are welcome to share information and suggestions about the key features you are interested in.

Wanted | What are the obstacles that block your game server migration to Kubernetes?

What are the obstacles that block your game server migration to Kubernetes?

First of all, sincere thanks for watching Kruise-Game. We will do our best to keep making Kruise-Game better and to keep the community and ecosystem growing.

The purpose of this issue

We’d like to listen to the community to make Kruise-Game better.
We want to attract more people to contribute to Kruise-Game.
We're willing to learn more Kruise-Game use scenarios for better planning.

What we expect from you

Please submit a comment in this issue to include the following information:

Your company, school or organization.
Your city and country.
Your contact info: blog, email, WeChat or Twitter (at least one).
What are the obstacles that block your game server migration to Kubernetes.

You can refer to the following sample answer for the format:

Organization/Company: Alibaba
Location: Hangzhou, China
Contact: [email protected]
Obstacles/Scenario:  game server hot upgrade and static ip/port.

Thanks again for your participation!
Kruise-Game Community



Feat | Using KubeVela to manage multiple GameServerSet

Background

In the game domain, servers are usually grouped into partitions. Each partition of servers is set up to serve a specific range of players. Within a partition there can be multiple types of servers that provide different services and communicate with the other servers in the same partition, such as Battle servers or Scene servers.

(figure omitted)

Each type of server can contain multiple replicas and is modeled as a GameServerSet. The GameServerSet creates an OpenKruise Advanced StatefulSet to pull up pods, and each pod gets an additional GameServer object attached to it to manage the pod's operational state.

(figure omitted)

Operating and managing a large number of GameServerSets across different partitions can be laborious. To alleviate the burden of repetitive operations on GameServerSets, we could introduce KubeVela to model the higher-level application on top of the GameServerSets.

Architecture

In KubeVela, applications are used to model resources and manage their specs and lifecycles. There are also delivery pipelines that describe operational actions as code so they can be reused.

Specifically, we could model the GameServerSets in each partition as a single KubeVela application and let it manage the desired state and delivery process of the GameServerSets, such as updates. On top of that, for the whole game, we could add another application to manage the partition applications.

(figure omitted)

With this architecture, it is easy to modify the desired state of the GameServerSets per partition or per type (role).

(figure omitted)

Implementation

First, to model the partition application, we need a KubeVela ComponentDefinition that abstracts GameServerSet. The CUE template below defines how the GameServerSet is formed, which parameters are exposed, and how the health state is evaluated.

"game-server-set": {
	alias: ""
	annotations: {}
	description: "The GameServerSet."
	type:        "component"
    attributes: {
        workload: type: "autodetects.core.oam.dev"
        status: {
            customStatus: #"""
                status: {
                    replicas: *0 | int
                } & {
                    if context.output.status != _|_ {
                        if context.output.status.readyReplicas != _|_ {
                            replicas: context.output.status.readyReplicas
                        }
                    }
                }
                message: "\(context.name): \(status.replicas)/\(context.output.spec.replicas)"
                """#
            healthPolicy: #"""
                status: {
                    replicas: *0 | int
                    generation: *-1 | int
                } & {
                    if context.output.status != _|_ {
                        if context.output.status.readyReplicas != _|_ {
                            replicas: context.output.status.readyReplicas
                        }
                        if context.output.status.observedGeneration != _|_ {
                            generation: context.output.status.observedGeneration
                        }
                    }
                }
                isHealth: (context.output.spec.replicas == status.replicas) && (context.output.metadata.generation == status.generation)
                """#
        }
    }
}

template: {
	parameter: {
        // +usage=The image of the Game Server
        image: string
        // +usage=The number of replicas
        replicas: *1 | int
	}
    output: {
        apiVersion: "game.kruise.io/v1alpha1"
        kind: "GameServerSet"
        spec: {
            updateStrategy: rollingUpdate: podUpdatePolicy: "InPlaceIfPossible"
            gameServerTemplate: spec: containers: [{
                image: parameter.image
                name: "\(context.name)"
            }]
        }
        metadata: name: "\(context.name)"
        spec: replicas: parameter.replicas
    }
}

On top of that, we can build the abstraction for partition applications, shown below as game-server-sets. game-server-sets is a template for generating a partition application, and its parameter exposes the configuration of the different types of underlying GameServerSets.

"game-server-sets": {
    alias: ""
    annotations: {}
    description: "The Game Server Sets of one region."
    type:        "component"
    attributes: {
        workload: type: "autodetects.core.oam.dev"
        status: {
            customStatus: #"""
                status: {
                    phase: *"initializing" | string
                } & {
                    if context.output.status != _|_ {
                        if context.output.status.status != _|_ {
                            phase: context.output.status.status
                        }
                    }
                }
                message: "\(context.name): \(status.phase)"
                """#
            healthPolicy: #"""
                status: {
                    phase: *"initializing" | string
                    generation: *-1 | int
                } & {
                    if context.output.status != _|_ {
                        if context.output.status.status != _|_ {
                            phase: context.output.status.status
                        }
                        if context.output.status.observedGeneration != _|_ {
                            generation: context.output.status.observedGeneration
                        }
                    }
                }
                isHealth: (status.phase == "running") && (status.generation >= context.output.metadata.generation)
                """#
        }
    }
}

template: {

    #GameServerSet: {
        // +usage=The image of the Game Server
        image: string
        // +usage=The number of replicas
        replicas: *1 | int
        // +usage=The dependencies of the Game Server
        dependsOn: *[] | [...string]
    }

    parameter: [string]: #GameServerSet

    output: {
        apiVersion: "core.oam.dev/v1beta1"
        kind: "Application"
        metadata: name: context.name
        spec: {
            components: [for role, gss in parameter {
                name: "\(context.name)-\(role)"
                type: "game-server-set"
                properties: {
                    image: gss.image
                    replicas: gss.replicas
                }

                _dependsOn: [for d in gss.dependsOn {"\(context.name)-\(d)"}]
                if len(_dependsOn) > 0 {
                    dependsOn: _dependsOn
                }
            }]
            workflow: steps: [{
                type: "deploy"
                name: "deploy"
                properties: policies: []
            }]
        }
    }
}

Finally, we have the application that manages all the partition applications and serves as the user interface, shown below:

apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: mmo
  namespace: game
spec:
  components:
    - type: game-server-sets
      name: partition-1
      properties:
        battle:
          image: nginx:1.17
          replicas: 2
        scenes:
          image: nginx:1.20
          replicas: 2
        ai:
          image: nginx:1.21
          replicas: 1
    - type: game-server-sets
      name: partition-2
      dependsOn: ["partition-1"]
      properties:
        battle:
          image: nginx:1.17
          replicas: 3
        scenes:
          image: nginx:1.20
          replicas: 3
        ai:
          image: nginx:1.21
          replicas: 2
    - type: game-server-sets
      name: partition-3
      properties:
        battle:
          image: nginx:1.17
          replicas: 1
        scenes:
          image: nginx:1.20
          replicas: 1
        ai:
          image: nginx:1.21
          replicas: 1
  policies:
    - type: override
      name: global-config
      properties:
        components:
          - properties:
              scenes:
                dependsOn: ["battle"]
    - type: apply-once
      name: apply-once
      properties:
        enable: true
  workflow:
    steps:
      - type: deploy
        name: deploy
        properties:
          policies: ["global-config"]

The dependsOn lines specify the delivery order between partitions and between types of GameServerSets. In the example above, we require partition-2 to be updated after partition-1, and the Scenes GameServerSets to be updated after the Battle GameServerSets.

  • To start or stop a partition, we can simply add or remove a game-server-sets component in the top-level MMO application.
  • To update the image of a specific type of GameServerSet, there are two ways. One is to update the image field directly in each game-server-sets component's configuration, which gives users fine-grained control over the image of each GameServerSet in each partition. The other is to set the image field in the global-config policy, where users only need to configure it once and it takes effect across all partitions (see the sketch after this list).
  • To scale the replicas of GameServerSets, the action is similar: we just set the replicas field in the component properties.
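As a sketch, following the structure of the global-config policy in the example above (the image value is illustrative), overriding the image of the battle GameServerSets across all partitions could look like:

  policies:
    - type: override
      name: global-config
      properties:
        components:
          - properties:
              battle:
                image: nginx:1.18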

Orchestrate

Another action is to set the state of the GameServer for a pod, to mark the pod's deletion priority, opsState, and other configuration. This can be achieved with a KubeVela WorkflowRun. The reason for using WorkflowRun instead of Application to manage GameServers is that GameServer objects are not directly managed through Applications or GameServerSets; they are attached to pods after creation, so managing them through a side channel is simpler.

To define the operation behaviour, we use the following CUE template. The operate-gs step defines the detailed update operation for the GameServer object. It first reads the GameServer from Kubernetes and re-assembles it with the updated fields.

import (
	"vela/op"
)

"operate-gs": {
	type: "workflow-step"
	description: "Operate GameServer."
}
template: {
    #Operation: {
        deletionPriority?: int
        opsState?: "None" | "WaitToBeDeleted"
        updatePriority?: int
    }

    handle: op.#Steps & {
        for gsName, o in parameter {
            "\(gsName)": op.#Steps & {
                read: op.#Read & {
                    value: {
                        apiVersion: "game.kruise.io/v1alpha1"
                        kind:       "GameServer"
                        metadata: {
                            name: gsName
                            namespace: context.namespace
                        }
                    }
                } @step(1)
                apply: op.#Apply & {
                    value: {
                        for k, v in read.value if k != "spec" {
                            "\(k)": v
                        }
                        if read.value.spec != _|_ {
                            spec: {
                                for k, v in read.value.spec {
                                    if k != "deletionPriority" && k != "opsState" && k != "updatePriority" {
                                        "\(k)": v
                                    }
                                }
                                
                                if o.deletionPriority != _|_ {
                                    deletionPriority: o.deletionPriority
                                }
                                if o.deletionPriority == _|_ && read.value.spec.deletionPriority != _|_ {
                                    deletionPriority: read.value.spec.deletionPriority 
                                }

                                if o.opsState != _|_ {
                                    opsState: o.opsState
                                }
                                if o.opsState == _|_ && read.value.spec.opsState != _|_ {
                                    opsState: read.value.spec.opsState 
                                }

                                if o.updatePriority != _|_ {
                                    updatePriority: o.updatePriority
                                }
                                if o.updatePriority == _|_ && read.value.spec.updatePriority != _|_ {
                                    updatePriority: read.value.spec.updatePriority 
                                }
                            }
                        }
                    }
                } @step(2)
            }
        }
    }

	parameter: [string]: #Operation
}

The atomic action is used as follows

apiVersion: core.oam.dev/v1alpha1
kind: WorkflowRun
metadata:
  name: edit-gs
  namespace: game
spec:
  workflowSpec:
    steps:
      - type: operate-gs
        name: operate-gs
        properties:
          partition-1-scenes-0:
            opsState: WaitToBeDeleted 
          partition-2-battle-1:
            opsState: WaitToBeDeleted
          partition-2-scenes-0:
            deletionPriority: 20

This WorkflowRun is a one-time execution. It sets the opsState to WaitToBeDeleted for the first replica of the Scene GameServerSet in partition 1. Similar behaviors are applied to partition 2.

Extra Resource Relationship

KubeVela provides a resource topology view that displays the internal architecture of an application. To help visualize the relationships between GameServerSets, StatefulSets, Pods, and GameServers, we can apply the following configuration to the KubeVela system. We will then be able to visualize the full architecture of the KubeVela application.

apiVersion: v1
kind: ConfigMap
metadata:
  name: game-server-set-relation
  namespace: vela-system
  labels:
    "rules.oam.dev/resource-format": "yaml"
    "rules.oam.dev/resources": "true"
data:
  rules: |-
    - parentResourceType:
        group: game.kruise.io
        kind: GameServerSet
      childrenResourceType:
        - apiVersion: apps.kruise.io/v1beta1
          kind: StatefulSet
        - apiVersion: game.kruise.io/v1alpha1
          kind: GameServer
    - parentResourceType:
        group: apps.kruise.io
        kind: StatefulSet
      childrenResourceType:
        - apiVersion: v1
          kind: Pod

The number of nodeports for the service is insufficient

When using a load balancer (LB) to expose services, each generated Service needs to allocate a NodePort, and the default NodePort range (30000–32767) only provides a little over 2,000 ports. When there are too many container service ports, this can lead to a shortage of NodePorts.

By setting the allocateLoadBalancerNodePorts field of the Service to false, you can prevent the generated Service from allocating NodePorts. However, this is only applicable in scenarios where LB traffic is passed directly to the pods.
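
For illustration, a minimal Service sketch with NodePort allocation disabled (the name and selector are placeholders, not taken from this project):

apiVersion: v1
kind: Service
metadata:
  name: gs-lb-svc                        # placeholder name
spec:
  type: LoadBalancer
  allocateLoadBalancerNodePorts: false   # do not allocate NodePorts for this LB Service
  selector:
    app: game-server                     # placeholder selector
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP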

https://kubernetes.io/docs/concepts/services-networking/service/

Feat | Differentiated updates to GameServers

Background

As is known to all, there are certain differences between each server in PvE games, and these differences become more apparent over time. This usually manifests in two scenarios:

  1. The number of players typically fluctuates over time, so the resource allocation of game servers needs to be adjusted accordingly to adapt to these changes and avoid degraded service quality or wasted resources. In this case, the ideal outcome is that resource allocations differ between servers.

  2. Some games have the concept of gameplay modes, and these gameplay strategies often vary. Additionally, there are concepts like test servers and experimental servers, where different versions of the same service may exist on different servers. In this case, the ideal outcome is that image versions differ between servers.

Proposal

Based on the above situations, I propose to enhance the targeted management capability of OKG to support different resource configurations and image versions for different GameServers under the same GameServerSet.

To achieve this, I suggest adding a new field called "Containers" to the GameServerSpec.

// GameServerSpec defines the desired state of GameServer
type GameServerSpec struct {
	OpsState         OpsState            `json:"opsState,omitempty"`
	UpdatePriority   *intstr.IntOrString `json:"updatePriority,omitempty"`
	DeletionPriority *intstr.IntOrString `json:"deletionPriority,omitempty"`
	NetworkDisabled  bool                `json:"networkDisabled,omitempty"`

	// Containers can be used to make the corresponding GameServer container fields
	// different from the fields defined by GameServerTemplate in GameServerSetSpec.
	Containers []GameServerContainer `json:"containers,omitempty"`
}

type GameServerContainer struct {
	// Name indicates the name of the container to update.
	Name string `json:"name"`
	// Image indicates the image of the container to update.
	Image string `json:"image,omitempty"`
	// Resources indicates the resources of the container to update.
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
}
  1. When the image or resources configuration of a GameServer's containers is different from the pod spec, the corresponding fields of the pod will be updated. In case of conflicts, the content declared in the GameServer takes precedence.

  2. Newly created GameServers will follow the default settings specified in the GameServerTemplate within the GameServerSet.

Please refer to the diagram below for an illustration of the proposed effects:

image
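
As a rough sketch of how the proposed field could be used (this is not the current API; the GameServer name, container name, and image tag below are illustrative):

apiVersion: game.kruise.io/v1alpha1
kind: GameServer
metadata:
  name: minecraft-0                      # illustrative GameServer name
spec:
  opsState: None
  containers:                            # proposed field: per-GameServer overrides
    - name: minecraft                    # illustrative container name
      image: minecraft-demo:experimental # this GameServer runs a different image version
      resources:
        limits:
          cpu: "2"
          memory: 4Gi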

Resources update

As we all know, pod.containers[*].resources can't be updated by default.

However, Kubernetes 1.27 adds a new feature gate named InPlacePodVerticalScaling, which allows a pod to be vertically scaled in place (the pod is not recreated, and containers may or may not be restarted). This means GameServers can resize their resources without affecting the players on them.

In order to avoid update failures, we should add a GameServer validating webhook that only allows updating containers that declare a resizePolicy.
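
For reference, with the InPlacePodVerticalScaling feature gate enabled, a container that may be resized in place declares a resizePolicy. A minimal GameServerSet sketch (names and image are illustrative):

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: gss-resize                       # illustrative name
spec:
  replicas: 1
  gameServerTemplate:
    spec:
      containers:
        - name: game-server              # illustrative container name
          image: game-server:latest      # illustrative image
          resizePolicy:
            - resourceName: cpu
              restartPolicy: NotRequired       # CPU can be resized without a container restart
            - resourceName: memory
              restartPolicy: RestartContainer  # memory changes restart the container
          resources:
            requests:
              cpu: "1"
              memory: 1Gi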

[Feat] User-defined GameServer’s recycle strategy

API

type GameServerTemplate struct {
	corev1.PodTemplateSpec `json:",inline"`
	VolumeClaimTemplates   []corev1.PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty"`

	// new
	Owner                  GameServerOwner                `json:"owner"`
}

type GameServerOwner string

const (
	OwnerPod           GameServerOwner = "Pod"
	OwnerGameServerSet GameServerOwner = "GameServerSet"
)

Introduction

  • The owner is the Pod - created when the pod is created and deleted when the pod is deleted, consistent with the pod life cycle.

  • The owner is GameServerSet - created before the pod is created and deleted after the pod is actually deleted. Specific examples:

    • When a GameServer is created, the GameServer is still generated even if the pod is not generated because webhook validation fails.
    • Pod deletions that happen without changing replicas (abnormal eviction, manual pod deletion, recreation during update, and so on) do not cause the GameServer to be deleted.

Default Owner is Pod.
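
A sketch of how the proposed field could be set in a GameServerSet (this is the proposed API, not the current one; names and image are illustrative):

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: gss-owner                    # illustrative name
spec:
  replicas: 3
  gameServerTemplate:
    spec:
      containers:
        - name: game-server          # illustrative container name
          image: game-server:latest  # illustrative image
    owner: GameServerSet             # proposed field: GameServer survives pod deletion/rebuild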

Improper k8s permission configuration

Summary

  The ack-kruise-game component in ACK grants excessive authority when defining the Service Account named "kruise-game-controller-manager". Besides, this Service Account is mounted in a pod named "kruise-game-controller-manager-675bb6974d-4m6d7", which makes it possible for attackers to escalate privileges to administrator level.
 

Detailed Analysis

  • The clusterrole named "kruise-game-manager-role" defines the "create" verb of "pods, statefulsets, mutatingwebhookconfiguration". And this clusterrole is bound to the Service Account named "kruise-game-controller-manager".

Attacking Strategy

  If a malicious user controls a worker node that runs the Pod mentioned above, or steals the Service Account token mentioned above, he/she can escalate permissions to administrator level and control the whole cluster.
For example,

  • With the "create" verb of "pods, statefulsets", attacker can elevate privileges by creating a pod to mount and steal any Service Account he/she want.
  • With the "update" verb of "mutatingconfiguration, validatingconfiguration", attacker can elevate privileges by updating MutatingWebhookConfigurations to listen and modify any resource and event in the cluster.

Mitigation Discussion

  • Developers could use a RoleBinding instead of a ClusterRoleBinding to restrict the permissions to a namespace (see the sketch below).
  • Developers could remove the "create" verb for "pods, statefulsets, mutatingwebhookconfiguration" from the ClusterRole mentioned above.
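
As a rough illustration of the first suggestion, a namespace-scoped RoleBinding sketch (the names are illustrative; the referenced Role would carry the reduced rule set):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kruise-game-manager-rolebinding    # illustrative name
  namespace: kruise-game-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kruise-game-manager-role           # a namespaced Role instead of the ClusterRole
subjects:
  - kind: ServiceAccount
    name: kruise-game-controller-manager
    namespace: kruise-game-system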

A few questions

  • Is it a real issue in ack-kruise-game?
  • If it's a real issue, can ack-kruise-game mitigate the risks following my suggestions discussed in the "mitigation discussion"?
  • If it's a real issue, does ack-kruise-game plan to fix this issue?

Feat | Add AlibabaCloud-NLB network plugin

Plugin name

AlibabaCloud-NLB

Cloud Provider

AlibabaCloud

Plugin description

  • AlibabaCloud-NLB enables game servers to be accessed from the Internet by using Layer 4 Network Load Balancer (NLB) of Alibaba Cloud. AlibabaCloud-NLB uses different ports of the same NLB instance to forward Internet traffic to different game servers. The NLB instance only forwards traffic, but does not implement load balancing.

  • This network plugin supports network isolation.

Network parameters

NlbIds

  • Meaning: the NLB instance IDs. You can fill in multiple IDs.
  • Value: in the format of nlbId-0,nlbId-1,... An example value is "nlb-ji8l844c0qzii1x6mc,nlb-26jbknebrjlejt5abu"
  • Configuration change supported or not: yes. You can append new nlbIds at the end. However, it is recommended not to change nlbIds that are already in use.

PortProtocols

  • Meaning: the ports in the pod to be exposed and the protocols. You can specify multiple ports and protocols.
  • Value: in the format of port1/protocol1,port2/protocol2,... The protocol names must be in uppercase letters.
  • Configuration change supported or not: yes.

Fixed

  • Meaning: whether the mapping relationship is fixed. If the mapping relationship is fixed, the mapping relationship remains unchanged even if the pod is deleted and recreated.
  • Value: false or true.
  • Configuration change supported or not: yes.

AllowNotReadyContainers

  • Meaning: the names of containers that are allowed to be not ready during in-place updates; traffic to the pod will not be cut off while these containers are updating.
  • Value: {containerName_0},{containerName_1},... Example:sidecar
  • Configuration change supported or not: It cannot be changed during the in-place updating process.

Plugin configuration

[alibabacloud]
enable = true
[alibabacloud.nlb]
# Specify the range of available ports of the NLB instance. Ports in this range can be used to forward Internet traffic to pods. In this example, the range includes 500 ports.
max_port = 1500
min_port = 1000
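
A GameServerSet sketch that uses this plugin might look as follows (the NLB instance id, ports, and image are placeholders):

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: gs-nlb                             # illustrative name
spec:
  replicas: 2
  network:
    networkType: AlibabaCloud-NLB
    networkConf:
      - name: NlbIds
        value: "nlb-ji8l844c0qzii1x6mc"    # placeholder NLB instance id
      - name: PortProtocols
        value: "80/TCP"
      - name: Fixed
        value: "true"
  gameServerTemplate:
    spec:
      containers:
        - name: game-server                # illustrative container name
          image: registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2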

The schematic diagram

image

Feat | add network type AlibabaCloud-EIP

Background

There are some scenarios:

  • A single game server needs to expose multiple ports in the form of a port range, and the access mapping relationships are difficult to maintain.
  • Multiple game servers sharing the same EIP are vulnerable to DDoS attacks, and the blast radius is relatively large.

Each game server has an independent EIP, which is the best solution to solve the above problems.

Design

Plugin name

AlibabaCloud-EIP

Cloud Provider

AlibabaCloud

Plugin description

  • Allocate a separate EIP for each GameServer
  • The exposed public access port is consistent with the port listened on in the container
  • It is necessary to install the latest version of the ack-extend-network-controller component in the ACK cluster. For details, please refer to the component description page.

Network parameters

ReleaseStrategy

  • Meaning: Specifies the EIP release policy.
  • Value:
    • Follow: follows the lifecycle of the pod that is associated with the EIP. This is the default value.
    • Never: does not release the EIP. You need to manually release the EIP when you no longer need the EIP.
    • You can also specify the timeout period of the EIP. For example, if you set the time period to 5m30s, the EIP is released 5.5 minutes after the pod is deleted. Time expressions written in Go are supported.
  • Configuration change supported or not: no.

PoolId

  • Meaning: Specifies the EIP address pool. It can be nil; if nil, no address pool is used.
  • Configuration change supported or not: no.

ResourceGroupId

  • Meaning: Specifies the resource group to which the EIP belongs. It can be nil; if nil, the default resource group is used.
  • Configuration change supported or not: no.

Bandwidth

  • Meaning: Specifies the maximum bandwidth of the EIP. Unit: Mbit/s. It could be nil. Default is 5.
  • Configuration change supported or not: no.

BandwidthPackageId

  • Meaning: Specifies the EIP bandwidth plan that you want to use. It can be nil; if nil, the EIP is not associated with a bandwidth plan.
  • Configuration change supported or not: no.

ChargeType

  • Meaning: Specifies the metering method of the EIP.
  • Value:
    • PayByTraffic: Fees are charged based on data transfer.
    • PayByBandwidth: Fees are charged based on bandwidth usage. This is the default value.
  • Configuration change supported or not: no.

Plugin configuration

None
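
A GameServerSet sketch that uses this plugin might look as follows (the parameter values and image are placeholders):

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: gs-eip                             # illustrative name
spec:
  replicas: 2
  network:
    networkType: AlibabaCloud-EIP
    networkConf:
      - name: ReleaseStrategy
        value: "Follow"
      - name: Bandwidth
        value: "5"
      - name: ChargeType
        value: "PayByTraffic"
  gameServerTemplate:
    spec:
      containers:
        - name: game-server                # illustrative container name
          image: registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2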


OKG failing to correctly add network annotation to Pod when using readinessProbe

Background:

In my GameServer, I have configured two game processes. Process 1 relies on OKG network annotation and will only start after reading this annotation. Additionally, I have set up a readinessProbe to monitor whether this process's GRPC listen is ready. Process 2 depends on process 1, and will only start after process 1 is ready. This setup utilizes OKG's startup sequence control.

Problem:

In the given background, there is an occasional issue where Pods fail to retrieve network annotation, causing them to remain in a pending state indefinitely.
During my actual usage, when the GameServerSet replicas are set to 4, I encountered a situation where one Pod remains in the pending state while the others start up normally.

This is my GameServerSet yaml:

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: gss
  labels:
    gs-group: test
spec:
  replicas: 4
  updateStrategy:
    rollingUpdate:
      podUpdatePolicy: ReCreate
  network:
    networkType: Kubernetes-HostPort
    networkConf:
    - name: ContainerPorts 
      value: "process1:5000/TCP"
  gameServerTemplate:
    metadata:
      labels:
        gs-group: test
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: "kubernetes.io/hostname"
              labelSelector:
                matchLabels:
                  gs-group: test
      imagePullSecrets:
        - name: qcloudregistrykey
      containers:
        - image: IMAGE_1
          imagePullPolicy: IfNotPresent
          name: process1
          env:
          - name: KRUISE_CONTAINER_PRIORITY
            value: "2"
          readinessProbe:
            tcpSocket:
              port: 6000
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          volumeMounts:
            - name: network
              mountPath: /opt/network
        - image: IMAGE_2
          imagePullPolicy: IfNotPresent
          name: process2
          env:
          - name: KRUISE_CONTAINER_PRIORITY
            value: "1"
      volumes:
        - name: network
          downwardAPI:
            items:
              - path: "annotations"
                fieldRef:
                  fieldPath: metadata.annotations['game.kruise.io/network-status']

Feat | add cloud provider & network plugin

Background

Multi-cloud and hybrid cloud expand the boundary of the cloud and have become one of the trends of cloud-native development. In order to reduce the user's access cost to cloud infrastructure (such as networking), we designed the Cloud Provider module, which aims to integrate the parts coupled to cloud infrastructure into OpenKruiseGame (hereinafter referred to as OKG) in a pluggable way. At the beginning of OKG's design, we believed that the game server network is an important function that cannot be ignored for game operation and maintenance, so we first supported the network plugin of the Cloud Provider. Users can specify the network type and configuration by defining the network field of the GameServerSet workload, and get the corresponding network information in the GameServer object.

Architecture

Cloud Provider is integrated in the webhook of kruise-game-manager in an in-tree manner, and its architecture diagram is as follows:

image

  • provider manager: The manager of cloud provider, which can manage multiple different cloud provider objects
  • cloud provider: OKG currently plans to support native Kubernetes and Alibaba Cloud. Each cloud provider can manage several different network plugins.
  • network plugin: OKG currently plans to support Kubernetes' HostPort plugin and Alibaba Cloud's NatGw plugin. Developers can customize cloud providers and network plugins that meet their own needs through interface specifications.

Use

Create network

  1. Specify the network type and the corresponding network configuration in the GameServerSet, and apply it when the GameServerSet is created
...

spec:
  network:
    networkType: HostPortNetwork #This example uses the HostPort network
    networkConf: 
    #The network configuration is imported in the form of K-V, and each plugin specifies the incoming format
    #The format of HostPort is as follows, {containerName}:{port1}/{protocol1},{port2}/{protocol2},...
    - name: ContainerPorts 
      value: game-server:25565/TCP 

...
  2. After the creation is complete, you can view the corresponding network status in GameServer
...

status:
  networkStatus: 
    createTime: "2022-11-10T14:26:43Z"
    #The current state, as determined by the network plugin
    currentNetworkState: Ready 
    #The desired state, Ready/NotReady, is related to the GameServer.Spec.NetworkDisabled field. 
    #When NetworkDisabled is true, the expected state is NotReady; 
    #When NetworkDisabled is false, it is Ready, and the default is Ready when created.
    desiredNetworkState: Ready 
    externalAddresses:
    - ip: 38.111.149.177
      ports:
      - name: game-server-25565
        port: 8870
        protocol: TCP
    internalAddresses:
    - ip: 192.168.0.88
      ports:
      - name: game-server-25565
        port: 25565
        protocol: TCP
    lastTransitionTime: "2022-11-10T14:26:43Z"
    networkType: HostPortNetwork

...

Network Isolation

OKG allows users to set the network to be temporarily unavailable, in order to cut off network traffic

  1. Set the NetworkDisabled field of GameServer
...

spec:
  networkDisabled: true

...
  2. Check the NetworkStatus of the GameServer
...

status:
  networkStatus: 
    createTime: "2022-11-10T14:26:43Z"
    currentNetworkState: NotReady #The network plugin senses the disabled operation and returns the current state after completing the network isolation
    desiredNetworkState: NotReady 
    externalAddresses:
    - ip: 38.111.149.177
      ports:
      - name: game-server-25565
        port: 8870
        protocol: TCP
    internalAddresses:
    - ip: 192.168.0.88
      ports:
      - name: game-server-25565
        port: 25565
        protocol: TCP
    lastTransitionTime: "2022-11-10T14:29:01Z"
    networkType: HostPortNetwork

...

Principle

Create network

image

As shown in the figure above, when the user defines the network field when creating the GameServerSet:

(1) The gameserverset-controller initiates a pod creation request by creating an Advanced StatefulSet; the webhook intercepts the request and calls the OnPodCreate() function to create the network. After the creation request passes, the pod is created successfully

(2) The gameserver-controller perceives that the pod has the corresponding network status annotation, and writes it back to the GameServer status

(3) When the GameServer is Ready, if the gameserver-controller finds that desiredNetworkState is consistent with currentNetworkState, the reconcile terminates; when the two fields are inconsistent, the reconcile is re-triggered, an update pod request is initiated, and the webhook intercepts the request and calls the OnPodUpdate() function. The network plugin returns the current, up-to-date network status. The reconcile interval is 5s, and the total waiting time is one minute. If the status is still inconsistent after more than one minute, the request is abandoned.

Let's take a look at the changes in the fields of each object during the process:

image

(1) First, the user specifies the type and config of the GameServerSet.

(2) Specify the corresponding encoded annotation in the spec.template of the newly created Advanced StatefulSet.

(3) Through the pod mutating webhook, the pod adds the encoded status annotation. The network status fields that need to be returned by the plugin include externalAddress, internalAddress, and currentNetworkState.

(4) GameServer generates status.networkStatus from the status annotation of the Pod, in which externalAddress, internalAddress, and currentNetworkState are inherited from the pod's annotations, and the others are generated or modified during the controller reconcile process.

(5) When currentNetworkState is inconsistent with desiredNetworkState, the Pod's network-trigger-time annotation is set to the current time, which triggers the update request (the trigger interval defaults to 5 seconds), until the timeout (1 minute by default) is reached or the states become consistent.

Network Isolation

image

As shown in the figure above, the networkDisabled field of the GameServer is specified when the user wants to isolate or un-isolate the network

(1) The gameserver-controller synchronizes the networkDisabled field to the corresponding annotation on the pod, which triggers the update request intercepted by the webhook, and the OnPodUpdate() function is called. The network plugin performs network isolation/un-isolation according to the pod's networkDisabled annotation and changes the currentNetworkState. The updated Pod carries the latest status annotation, which includes whether the current network status is Ready.

(2) The gameserver-controller perceives the status change of the pod and synchronizes the networkStatus to the GameServer

(3) Similar to the creation process, whether to continue reconciling is decided by comparing the network status for consistency

Changes in the fields of each object during the process:

image

The meaning of the fields is similar to that of creating network, so I won’t repeat it in words.

Delete network

The process of deleting the network is very simple. As shown in the figure below, after the pod deletion request is intercepted by the webhook, the network plugin calls the OnPodDelete() function to delete the network resources

image

Development Guide

OKG supports developers to customize cloud providers and network plugins according to their own needs. First of all, let's take a look at the call relationship diagram of each module in the webhook, so as to understand the meaning of each interface of the network plugin:

image

(1) When the webhook registers the mutating pod Handler, it initializes the provider manager and the corresponding cloud providers. At this point, the network plugins will register itself in the map of the corresponding cloud provider, waiting to be accessed.

(2) When the pod mutating request is triggered, the handler first extracts the network plugin name corresponding to the pod and finds the corresponding network plugin object by that name. It then calls the corresponding function according to the action of the request:

  • The create request corresponds to calling OnPodCreate()
  • The update request corresponds to calling OnPodUpdate()
  • The delete request corresponds to calling OnPodDelete()

network plugin development

The network plugin needs to implement the following interface

type Plugin interface {
	Name() string
	Alias() string
	Init(client client.Client) error
	OnPodAdded(client client.Client, pod *corev1.Pod) (*corev1.Pod, error)
	OnPodUpdated(client client.Client, pod *corev1.Pod) (*corev1.Pod, error)
	OnPodDeleted(client client.Client, pod *corev1.Pod) error
}

The significance of OnPodCreate(), OnPodUpdate(), OnPodDelete() will not be described in detail.

  • Name() returns the name of the network plugin, which is also the name that the user specifies as the network type of the GameServerSet
  • Alias() returns the network plugin alias. In order to support scenarios where multiple clouds or hybrid clouds use the same workload, different cloud providers have different implementations of a certain plugin. In this case, the network type needs to be unified, so for a common network mode you need to specify a common name 【TODO】
  • Init() implements the initialization of the network plugin. Some network plugins need to initialize their own cache according to the status of cluster resource objects, and Init() helps realize this. It is worth noting that the developer needs to judge inside the function whether initialization has already been performed, to avoid repeated initialization. For network plugins that do not need to be initialized, return nil directly
  • OnPodCreate(), OnPodUpdate(), and OnPodDelete() are called according to pod mutating requests, so update/delete may be called multiple times, and developers need to pay attention to the data consistency of the cache.

cloud provider development

If OKG does not currently support the cloud provider you expect, you can integrate it by implementing the following methods

type CloudProvider interface {
	Name() string
	ListPlugins() (map[string]Plugin, error)
}
  • Name() returns the name of the cloud provider
  • ListPlugins() lists all corresponding plugins
  • Call func RegisterCloudProvider() in webhook's func NewProviderManager() to register the corresponding cloud provider

Hope to provide a unified view to manage the differentiated resource specifications of GameServers

After long-term operation, the resource specifications required by each group of servers differ greatly, so different resource specifications need to be set for each gs, for example using the method described in:
https://openkruise.io/zh/kruisegame/best-practices/pve-game#%E5%AE%9A%E5%90%91%E6%9B%B4%E6%96%B0%E6%B8%B8%E6%88%8F%E6%9C%8D%E9%95%9C%E5%83%8F%E4%B8%8E%E8%B5%84%E6%BA%90%E8%A7%84%E6%A0%BC
However, managing this by editing yaml is not intuitive when the number of gs is large, so a unified view for managing the differentiated resource specifications of gs is desired.
Possible implementations:

  1. For example, add editing and display capabilities to the main kubesphere dashboard. Ideally, batch editing would be supported, or grouping, where each group is configured with different resource specifications and a gs can be assigned to a group.
  2. Adopt a GitOps workflow and codify the differentiated resource definitions.

Whether serviceQuality can provide customized restart policies, such as Always, OnFailure, Never, to solve the problem of data inconsistency caused by partial program crashes

Can serviceQualities provide customized restart policies, such as Always, OnFailure, Never? That would allow the whole pod to be restarted when some programs crash, solving the problem of data inconsistency caused by partial program crashes.

Feat | Added a protection mechanism when scaling down

Background

The game server is stateful, and we need a protection mechanism to prevent players who are in a game from being affected when the GameServerSet reduces the number of replicas.

API Define

Introduce a scale-down strategy in GameServerSet, and add a new type of scale-down strategy called protected. When the scale-down type is protected, the user can select protected objects to prevent them from being deleted.

There are two types of protection policies: one is the threshold type, where GameServers whose priority is below the threshold value are protected; the other is the specified type, where GameServers with specified ids or matching labels are protected.

type ScaleStrategy struct {
	ScaleDown     ScaleDownStrategy `json:"scaleDown,omitempty"`
...
}

type ScaleDownStrategy struct {
	Type              ScaleDownStrategyType `json:"type,omitempty"`
	ProtectedStrategy ProtectedStrategy     `json:"protectedPolicy,omitempty"`
}

type ScaleDownStrategyType string

const (
	ProtectedScaleDownStrategyType ScaleDownStrategyType = "Protected"
)

type ProtectedStrategy struct {
	Type      ProtectedStrategyType      `json:"type,omitempty"`
	ThresholdStrategy ThresholdProtectedStrategy `json:"thresholdStrategy,omitempty"`
	SpecifiedStrategy SpecifiedProtectedStrategy `json:"specifiedStrategy,omitempty"`
}

type ProtectedStrategyType string

const (
	SpecifiedProtectedStrategyType ProtectedStrategyType = "Specified"
	ThresholdProtectedStrategyType ProtectedStrategyType = "Threshold"
)

type ThresholdProtectedStrategy struct {
	OpsState         OpsState            `json:"opsState,omitempty"`
	DeletionPriority *intstr.IntOrString `json:"deletionPriority,omitempty"`
}

type SpecifiedProtectedStrategy struct {
	GameServerIds  []int `json:"gameServerIds,omitempty"`
	LabelSelector metav1.LabelSelector `json:"labelSelector,omitempty"`
}

example 1:

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
spec:
  scaleStrategy:
      scaleDown:
          type: protected
          protectedPolicy:
              type: threshold
              thresholdStrategy:
                  opsState: None
                  deletionPriority: 30

In this example, GameServers whose opsState is None or Maintaining, and GameServers whose deletionPriority is less than or equal to 30, will be protected.

example 2:

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
spec:
  scaleStrategy:
      scaleDown:
          type: protected
          protectedPolicy:
              type: specified
              specifiedStrategy:
                  gameServerIds:
                    - 2
                    - 3
                  labelSelector:
                      matchLabels:
                          GameServerLabel: "xxx"

In this example, GameServers with serial numbers 2 and 3, and GameServers with GameServerLabel: "xxx" key-value pairs will be protected.

Also note that the replicas of the GameServerSet may differ from the actual number of GameServers when the scale-down strategy type is protected.

Multiple ops state probe may conflict with each other

e.g

serviceQualities: # two serviceQualities, healthy and idle, are configured
    - name: healthy
      containerName: minecraft
      permanent: false
      exec:
        command: ["bash", "./healthy.sh"]
      serviceQualityAction:
        - state: false
          opsState: Maintaining
        - state: true
          opsState: None
    - name: idle
      containerName: minecraft
      initialDelaySeconds: 10
      permanent: false
      exec:
        command: [ "bash", "./idle.sh" ]
      serviceQualityAction:
        - state: true
          opsState: WaitToBeDeleted
        - state: false
          opsState: None

Currently, idle and healthy may conflict.

For example, the healthy probe may set opsState to Maintaining, and in the next round the idle probe may set it back to None even though the server is still unhealthy.

A better way could be to make the opsState changes atomic, which means the user only needs one probe (shell, HTTP, etc.) that wraps all of the logic in a single API.

Enhancement | Cloud Provider plugin adds a new function to determine whether to allow pod creation when errors occur

Background

As #15 mentioned, OKG already supports a cloud provider & plugin mechanism based on a Kubernetes webhook. In addition, in order to increase network availability, OKG also supports asynchronous network readiness: the plugin is run repeatedly within a limited time to establish and confirm the network until it is ready.

However, asynchronous network readiness requires that the webhook still allow the pod-creation operation when an error occurs, which conflicts with synchronous plugins such as the Kubernetes-HostPort plugin, because synchronous plugins require the pod and the network to be ready at the same time.

Proposal

Add a new function IsSynchronous to the Plugin interface to determine whether pod creation is allowed when errors occur.

The new Plugin interface would be:

type Plugin interface {
	Name() string
	// Alias define the plugin with similar func cross multi cloud provider
	Alias() string
	Init(client client.Client, options CloudProviderOptions, ctx context.Context) error
	// Pod Event handler
	OnPodAdded(client client.Client, pod *corev1.Pod, ctx context.Context) (*corev1.Pod, errors.PluginError)
	OnPodUpdated(client client.Client, pod *corev1.Pod, ctx context.Context) (*corev1.Pod, errors.PluginError)
	OnPodDeleted(client client.Client, pod *corev1.Pod, ctx context.Context) errors.PluginError

	// IsSynchronous determines whether allow to create pod when errors occurred.
	// If set to false, the webhook allows creating pods despite errors. If set to true, the webhook denies creating pods when errors occur.
	IsSynchronous() bool
}

KubeSphere v3.4.1: deploying the minecraft test image from the documentation via the web UI

Following the tutorial, I installed kruise and kruise-game
image

However, when I followed this document to deploy the game server service: https://openkruise.io/zh/kruisegame/installation,
I used the following yaml:

apiVersion: v1
kind: GameServerSet
metadata:
  name: minecraft
  namespace: kruise-game-system
  labels:
    app: minecraft
spec:
  replicas: 3
  updateStrategy:
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible
  gameServerTemplate:
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/acs/minecraft-demo:1.12.2
          name: minecraft

It reported that the creation succeeded, but nothing was actually created:
image

Could you please tell me what is wrong with my configuration?

Attachments:

  1. kruise-webhook-service yaml:
kind: Service
apiVersion: v1
metadata:
  name: kruise-webhook-service
  namespace: kruise-system
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubesphere.io/instance: kruise-cn7tvf
  annotations:
    meta.helm.sh/release-name: kruise-cn7tvf
    meta.helm.sh/release-namespace: okg-learn
spec:
  ports:
    - protocol: TCP
      port: 443
      targetPort: 9876
  selector:
    control-plane: controller-manager
  clusterIP: 10.233.28.39
  clusterIPs:
    - 10.233.28.39
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  internalTrafficPolicy: Cluster
  2. kruise-game-controller-manager-metrics-service yaml:
kind: Service
apiVersion: v1
metadata:
  name: kruise-game-controller-manager-metrics-service
  namespace: kruise-game-system
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubesphere.io/instance: kruise-game-b68lck
    control-plane: kruise-game-controller-manager
  annotations:
    meta.helm.sh/release-name: kruise-game-b68lck
    meta.helm.sh/release-namespace: okg-learn
spec:
  ports:
    - name: https
      protocol: TCP
      port: 8443
      targetPort: https
  selector:
    control-plane: kruise-game-controller-manager
  clusterIP: 10.233.48.173
  clusterIPs:
    - 10.233.48.173
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  internalTrafficPolicy: Cluster
  3. kruise-game-external-scaler yaml:
kind: Service
apiVersion: v1
metadata:
  name: kruise-game-external-scaler
  namespace: kruise-game-system
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubesphere.io/instance: kruise-game-b68lck
  annotations:
    meta.helm.sh/release-name: kruise-game-b68lck
    meta.helm.sh/release-namespace: okg-learn
spec:
  ports:
    - protocol: TCP
      port: 6000
      targetPort: 6000
  selector:
    control-plane: kruise-game-controller-manager
  clusterIP: 10.233.60.166
  clusterIPs:
    - 10.233.60.166
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  internalTrafficPolicy: Cluster
  4. kruise-game-webhook-service yaml:
kind: Service
apiVersion: v1
metadata:
  name: kruise-game-webhook-service
  namespace: kruise-game-system
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubesphere.io/instance: kruise-game-b68lck
  annotations:
    meta.helm.sh/release-name: kruise-game-b68lck
    meta.helm.sh/release-namespace: okg-learn
spec:
  ports:
    - protocol: TCP
      port: 443
      targetPort: 9876
  selector:
    control-plane: kruise-game-controller-manager
  clusterIP: 10.233.22.252
  clusterIPs:
    - 10.233.22.252
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  internalTrafficPolicy: Cluster

Network annotation may not be properly configured during Pod deletion and rebuild

This issue outlines that the annotation tied to the network may not be updated correctly during the deletion and rebuilding phases of a Pod.

Here's the sequence of events:

  1. Three Pods were established within the cluster using OKG's GSS and are currently performing as per normal. The network annotations for all these Pods have been correctly configured.
  2. Subsequently, I modified the GSS-associated image, which triggered an automatic rebuild of the three Pods (this 'ReCreate' action was aligned with the Pod update strategy).
  3. After the rebuild was completed, there were occasional instances when the network annotation for one or more Pods was not set correctly.

The current resource situation can be summarized as follows:

  1. A hostport allocation is observable from the Pod's yaml file.
  2. However, pertaining to the ‘GameServer’ resource (an OKG resource that should correspond on a one-to-one basis with the Pod), no rebuild was initiated after the new Pod was rebuilt. The 'age' parameter retains its pre-rebuild value, casting doubt on the correctness of this process.

QPS and Burst settings

What happened:
The service currently uses the default QPS and Burst configuration (QPS=20, Burst=30) for requests to the ApiServer. In some business scenarios, the default QPS and Burst may not be sufficient and need to be adjusted.
What you expected to happen:
The service should support user-defined QPS and Burst, for example by setting the following under the command of the deployed service:
args:
- --api-server-qps=5
- --api-server-qps-burst=10
