I am deploying a very large application. I get a timeout on provision, any tips?

Timeout on provision about service-fabric-cli HOT 8 OPEN

jfalameda commented on July 21, 2024

Timeout on provision

from service-fabric-cli.

Comments (8)

Christina-Kang commented on July 21, 2024

Hi @jfalameda, sorry to hear that you're running into this issue. To help us debug, can you share the output from running this command with the --debug option? Could you also share the cluster traces and the approximate time at which you ran the provision command? The traces would be in /etc/servicefabric/FabricDataRoot. Thanks!
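
To narrow down which trace files are worth sharing, something like the following might help. It is only a sketch: it assumes Python 3, that the traces are ordinary files under the directory mentioned above, and that file modification time is a reasonable proxy for "written around the time of the provision command". The function name and the 15-minute window are illustrative choices, not part of sfctl or Service Fabric.

```python
import os
from datetime import datetime, timedelta


def files_modified_near(root, when, window=timedelta(minutes=15)):
    """Return sorted paths under `root` whose mtime is within `window` of `when`.

    `when` is a naive datetime interpreted in local time (an assumption).
    """
    earliest = (when - window).timestamp()
    latest = (when + window).timestamp()
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if earliest <= mtime <= latest:
                matches.append(path)
    return sorted(matches)
```

For example, `files_modified_near("/etc/servicefabric/FabricDataRoot", datetime(2020, 1, 1, 12, 0))` would list files touched between 11:45 and 12:15 on that (made-up) date.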

jfalameda commented on July 21, 2024

Hi @Christina-Kang,

It seems that provisioning takes a few minutes, while the timeout for the provision call is 100 seconds.

You can download the tracing from here:

https://drive.google.com/file/d/17cAPVzd6v3v7PULhgxyN2_h4SSGRIMmL/view?usp=sharing

Kind Regards,
/ José.

Christina-Kang commented on July 21, 2024

Hi @jfalameda,

Thanks for your patience! It looks like the actual timeout duration is not the issue here. Rather, a permissions problem is holding up the operation until it eventually times out.

The issue is that certain folders are not ACLed correctly. From your traces, it looks like the source is this:
/home/ClusterDeployer/ClusterData/Data/N0030/Fabric/work/Applications/__FabricSystem_App4294967295/work/Store
and the destination is this:
/home/ClusterDeployer/ClusterData/Data/N0030/Fabric/work/ImageBuilderProxy/

There are also paths under different "nodes", represented by "N0010" in place of the "N0030" in the previous path:
/home/ClusterDeployer/ClusterData/Data/N0010/Fabric/work/Applications/__FabricSystem_App4294967295/work/Store
The same applies to the destination path.

Can you check all of these paths (N0010, N0020, N0030) to make sure that the user has the correct permissions and, if not, chmod everything under those paths to give the user rwx permissions?
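
The check-and-fix step described above could be scripted roughly as follows. This is a sketch under assumptions: Python 3 on a POSIX system, "user" read as the owner permission bits (roughly what `chmod -R u+rwx` would do), and symlinks left alone. The function name `grant_user_rwx` is made up for illustration. It defaults to a dry run that only reports what it would change.

```python
import os
import stat


def grant_user_rwx(root, dry_run=True):
    """Report (and optionally fix) entries under `root` lacking owner rwx.

    Returns the list of paths whose owner bits were not all of r, w and x.
    With dry_run=False, those paths are chmod-ed to add the missing bits.
    """
    changed = []

    def fix(path):
        if os.path.islink(path):
            return  # do not follow or modify symlinks
        mode = stat.S_IMODE(os.lstat(path).st_mode)
        if (mode & stat.S_IRWXU) != stat.S_IRWXU:
            changed.append(path)
            if not dry_run:
                os.chmod(path, mode | stat.S_IRWXU)

    fix(root)
    # Top-down walk: each directory is fixed before os.walk descends into it,
    # so a directory that was unreadable becomes listable once dry_run=False.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            fix(os.path.join(dirpath, name))
    return changed
```

Running it first with the default `dry_run=True` on each node's `Fabric/work` directory shows what would change. Note that a dry run cannot look inside directories the current user is unable to read, so its report may be incomplete.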

You should be able to retry upload after this is done. You should not need to, but if the above doesn't work, try restarting fabric and seeing if that plus the permission changes allows provision to continue. Please let me know if this mitigates the issue for you.

This issue happens rarely, so if you were to have a cluster in Azure, or on a different dev machine, it most likely won't recur.

Hope this helps!
Christina

jfalameda commented on July 21, 2024

Hi @Christina-Kang ,

I can see the application provisioning now. I still get the timeout error as the application takes a lot of time to provision.

It is weird that I need to fix permissions, as I am using the Docker-based Service Fabric.

Thanks.

Kind Regards,
/ José.

Christina-Kang commented on July 21, 2024

Hi @jfalameda,

Is it still timing out with the same error / stack as posted above, where the read timeout is 100 rather than the timeout you passed in?

Thanks!

jfalameda commented on July 21, 2024

Hi @Christina-Kang ,

Yes, the application times out the same way. I can see on SF that the application is still provisioning. The application disappears when the provisioning is done without any error, therefore I cannot publish it.

PS: After checking the checksums it shows "unprovisioning." then it disappears from the applications list.

Kind Regards,
/ José.

jfalameda commented on July 21, 2024

Hi @Christina-Kang

Unfortunately, 7.1.0 didn't resolve this issue. The timeout is not in uploading the application but in provisioning it. The application takes a few minutes to provision, and sfctl times out before provisioning has completed. By the time it is 100% provisioned, the application unprovisions itself automatically.

This is the error I get on the client. I can see that the application is still provisioning in Service Fabric Explorer.

Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
Traceback (most recent call last):
  File "/home/jose/.local/lib/python2.7/site-packages/knack/cli.py", line 206, in invoke
    cmd_result = self.invocation.execute(args)
  File "/home/jose/.local/lib/python2.7/site-packages/sfctl/entry.py", line 81, in execute
    return super(SFInvoker, self).execute(args)
  File "/home/jose/.local/lib/python2.7/site-packages/knack/invocation.py", line 188, in execute
    cmd_result = parsed_args.func(params)
  File "/home/jose/.local/lib/python2.7/site-packages/knack/commands.py", line 105, in __call__
    return self.handler(*args, **kwargs)
  File "/home/jose/.local/lib/python2.7/site-packages/knack/commands.py", line 212, in _command_handler
    result = op(client, **command_args) if client else op(**command_args)
  File "/home/jose/.local/lib/python2.7/site-packages/sfctl/custom_app_type.py", line 117, in provision_application_type
    request, header_parameters, body_content_sorted)
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/service_client.py", line 336, in send
    pipeline_response = self.config.pipeline.run(request, **kwargs)
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/__init__.py", line 197, in run
    return first_node.send(pipeline_request, **kwargs)  # type: ignore
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/__init__.py", line 150, in send
    response = self.next.send(request, **kwargs)
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/requests.py", line 72, in send
    return self.next.send(request, **kwargs)

Kind Regards,
/ Jose.

Christina-Kang commented on July 21, 2024

@jfalameda While the timeout may occur on the client side (it is a bug on our part that the client can time out before the cluster does), the cluster should continue provisioning. There should be no unprovision happening from the cluster side, even on error. If you are running the upload / provision / create process as part of a script, the script may be causing the unprovisioning. If this is not the case, are you available for a call with us so we can debug this faster? You can contact me at [email protected]

Thanks!
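
Since the cluster is expected to keep provisioning after the client-side read timeout, one workaround in a script is to treat the timeout as "still in progress" and poll the type's status before moving on to create. The sketch below assumes Python 3 and that the cluster's REST endpoint exposes `GET /ApplicationTypes/{name}?api-version=6.0&ApplicationTypeVersion={version}`, returning an `Items` list whose entries carry `Version` and `Status` fields. Treat that endpoint shape, the status strings, and both function names as assumptions to check against your Service Fabric version.

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_type_status(name, version, endpoint="http://localhost:19080"):
    """Return the Status of an application type version, or None if not listed.

    Assumed REST shape; verify against your cluster's API version.
    """
    url = "{}/ApplicationTypes/{}?api-version=6.0&ApplicationTypeVersion={}".format(
        endpoint,
        urllib.parse.quote(name, safe=""),
        urllib.parse.quote(version, safe=""),
    )
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = json.load(resp)
    for item in body.get("Items", []):
        if item.get("Version") == version:
            return item.get("Status")
    return None


def wait_while_provisioning(fetch_status, timeout_s=900.0, interval_s=10.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until it stops returning "Provisioning".

    Returns the first other status seen (for example "Available", "Failed",
    or None if the type is no longer listed). Raises TimeoutError if the
    type is still provisioning once `timeout_s` seconds have elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status != "Provisioning":
            return status
        if clock() >= deadline:
            raise TimeoutError(
                "still provisioning after {} seconds".format(timeout_s))
        sleep(interval_s)
```

A script could catch the read timeout from the provision step and then call `wait_while_provisioning(lambda: fetch_type_status("MyAppType", "1.0.0"))`, where the type name and version are placeholders. It would continue to the create step only if the result is `"Available"`.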
