
Comments (6)

patriksabol commented on September 26, 2024

I have observed that the issue is that a pod is prematurely added to the service endpoints while it is still in its initialization phase, specifically once the init container starts. This premature addition leads to client errors because the pod is not yet ready to handle requests.

Logs:

2024-05-29 11:29:00 vertex-triton-server-6d64b9586d-pjfl9   0/1     Pending           0          2m17s   <none>     gke-vertex-serving-cluster-gpupool-92732217-22df   <none>           <none>
2024-05-29 11:29:01 vertex-triton-server-6d64b9586d-pjfl9   0/1     Init:0/1          0          2m18s   <none>     gke-vertex-serving-cluster-gpupool-92732217-22df   <none>           <none>
2024-05-29 11:29:03 vertex-triton-server-6d64b9586d-pjfl9   0/1     Init:0/1          0          2m20s   10.4.4.4   gke-vertex-serving-cluster-gpupool-92732217-22df   <none>           <none>
2024-05-29 11:29:10 vertex-triton-server-6d64b9586d-pjfl9   0/1     PodInitializing   0          2m27s   10.4.4.4   gke-vertex-serving-cluster-gpupool-92732217-22df   <none>           <none>

During scaling up, when a new pod is in the Init:0/1 state, it is already being assigned an IP (10.4.4.4). This results in client errors as mentioned above:

tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed

However, I am able to get the READY status correctly from the pods, so the issue is probably not related to the readiness probe.
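One client-side mitigation (a sketch, not part of the original report) is to gate inference on Triton's own readiness APIs (`is_server_ready()` / `is_model_ready()` in `tritonclient.grpc`) rather than relying only on the Kubernetes endpoint list. A generic polling helper like the one below could wrap those calls; the wiring shown in the docstring is hypothetical:

```python
import time

def wait_until(predicate, timeout_s=60.0, interval_s=1.0):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Intended (hypothetical) usage with tritonclient:
        wait_until(lambda: client.is_server_ready()
                           and client.is_model_ready("cartographer_model"))
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if predicate():
                return True
        except Exception:
            # Treat transient errors (e.g. "Socket closed" while the pod is
            # still initializing) as "not ready yet" and keep polling.
            pass
        time.sleep(interval_s)
    return False
```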

from server.

whoisj commented on September 26, 2024

@patriksabol very interesting problem. Pods should not be selected by a service until they're running and have passed their startup and readiness probes.

In your second post it appears that none of the pods are past the init container stage, yet you're seeing their readiness probe succeeding? Is that correct?

Given the specific error of tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed, I'd like to see the definition of your service as well. There could be something in that which is leading to the problem.

In the meantime, I'll review the Triton code to see whether we've somehow introduced any timing issues w.r.t. readiness/liveness probes.


patriksabol commented on September 26, 2024

@whoisj My second post shows a single pod at four points in time; I wanted to show that an IP address was assigned while the pod was still in the init state.

Meanwhile, I have removed the initContainer, and now the IP address is assigned to a running pod:

2024-05-29 15:13:45 vertex-triton-server-74c9fcf77f-q9gmf   0/1     ContainerCreating   0          4m45s   <none>      gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9   <none>           <none>
2024-05-29 15:15:07 vertex-triton-server-74c9fcf77f-q9gmf   0/1     Running             0          6m7s    10.96.5.4   gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9   <none>           <none>
2024-05-29 15:15:46 vertex-triton-server-74c9fcf77f-q9gmf   0/1     Running             0          6m46s   10.96.5.4   gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9   <none>           <none>
2024-05-29 15:15:46 vertex-triton-server-74c9fcf77f-q9gmf   1/1     Running             0          6m46s   10.96.5.4   gke-vertex-serving-cluster-gpupool-a0253cf2-v5j9   <none>           <none>

I am still seeing the READY status reported correctly for the pods (i.e., READY becomes 1/1), using this command:

kubectl get pods -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")]}'

However, the "Socket closed" error still occurs during scale-up. Once all pods are in the READY state, there are no "Socket closed" errors.
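Since the errors are transient and disappear once all pods are READY, one workaround (a sketch, not from the thread) is to retry on failure with exponential backoff; `do_infer` would wrap the `client.infer(...)` call from the snippet below:

```python
import time

def infer_with_retry(do_infer, attempts=3, base_delay_s=0.5):
    """Call `do_infer()`, retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return do_infer()
        except Exception:  # narrow to InferenceServerException in real code
            if attempt == attempts - 1:
                raise
            # Back off before retrying a transient failure such as
            # StatusCode.UNAVAILABLE / "Socket closed".
            time.sleep(base_delay_s * (2 ** attempt))
```

Usage would look like `infer_with_retry(lambda: client.infer('cartographer_model', [input0, input1], outputs=outputs, model_version="1"))`. Note this masks the symptom rather than fixing the premature endpoint registration.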

This is my client code:

import numpy as np
import tritonclient.grpc as grpcclient
from time import strftime

with grpcclient.InferenceServerClient("IP_ADDRESS:8001") as client:  # Note the change to grpcclient and typically a different port
    input0 = grpcclient.InferInput("tile", [1, 3, image.shape[0], image.shape[1]], "UINT8")
    input0.set_data_from_numpy(np.expand_dims(np.moveaxis(image, -1, 0), axis=0))

    # Prepare the model_name input as an array of bytes
    model_name_bytes = np.array([model_name.encode('utf-8')])
    model_name_bytes = np.expand_dims(model_name_bytes, axis=0)
    try:
        input1 = grpcclient.InferInput("model_name", [1, 1], "BYTES")
        input1.set_data_from_numpy(model_name_bytes)

        outputs = [
            grpcclient.InferRequestedOutput("geojson_output")
        ]

        response = client.infer('cartographer_model', [input0, input1], outputs=outputs, model_version="1")

        geojson_result = response.as_numpy("geojson_output").tobytes().decode('utf-8')
        print(f'{strftime("%Y-%m-%d %H:%M:%S")} [INFO] Received GeoJSON response')
    except Exception as e:
        print(f'{strftime("%Y-%m-%d %H:%M:%S")} [ERROR] {e}')
        print(f'{strftime("%Y-%m-%d %H:%M:%S")} [ERROR] Failed to receive GeoJSON response')

This is service definition:

apiVersion: v1
kind: Service
metadata:
  name: vertex-triton-server-service
  labels:
    app: vertex-triton-server
spec:
  type: LoadBalancer
  ports:
    - port: 8000
      targetPort: 8000
      name: http
    - port: 8001
      targetPort: 8001
      name: grpc
    - port: 8002
      targetPort: 8002
      name: metrics
  selector:
    app: vertex-triton-server


whoisj commented on September 26, 2024

In your service definition, I believe targetPort should be the name of the port in the target container.

apiVersion: v1
kind: Service
metadata:
  name: vertex-triton-server-service
  labels:
    app: vertex-triton-server
spec:
  type: LoadBalancer
  ports:
    - port: 8000
      targetPort: http-triton
      name: http
    - port: 8001
      targetPort: grpc-triton
      name: grpc
    - port: 8002
      targetPort: metrics-triton
      name: metrics
  selector:
    app: vertex-triton-server

By specifying the numeric port number, you could somehow be bypassing the service's selector. I am not 100% sure, but I think it's worth trying port names instead to see whether it resolves the issue. Let me know.
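For named targetPorts to resolve, the pod spec must declare containerPorts with matching names. A sketch of the corresponding Deployment pod-template excerpt, using the names from the service above (the container name is hypothetical):

```yaml
# Deployment pod template excerpt (sketch): containerPort names must match
# the service's named targetPorts.
spec:
  containers:
    - name: triton  # hypothetical; image and other fields as in your deployment
      ports:
        - containerPort: 8000
          name: http-triton
        - containerPort: 8001
          name: grpc-triton
        - containerPort: 8002
          name: metrics-triton
```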


patriksabol commented on September 26, 2024

Unfortunately, that does not work. It seems that referencing the ports by name is just for clarity of configuration.


whoisj commented on September 26, 2024

As I mentioned, given the error, it appears to be a problem with the service and not with Triton Server.

Perhaps you could check the Triton Server logs to see whether any inference requests are even reaching the pods in question.

