Coder Social home page Coder Social logo

Comments (22)

markquezada avatar markquezada commented on May 19, 2024 2

@maltoe I like this warmup solution. Seems like the best compromise. 👍

I've added it and will report back after a few days too.

from chromic_pdf.

markquezada avatar markquezada commented on May 19, 2024 1

Hi @maltoe, thanks for following up. It does still seem to be a problem for us. I haven't had time to circle back on this but I see that you released an update that could help me debug so I'll see if I can get you the debugging info you requested.

from chromic_pdf.

austinthecoder avatar austinthecoder commented on May 19, 2024 1

@markquezada We're having the same issue. So far your warmup solution seems to be working after a few builds. I'll let you know if it still looks good after a few days.

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

@markquezada Are you running a lot of PDF-printing tests in parallel? Also, I haven't used github actions much myself, does it change anything on the test container when you ssh into it?

Also, since you mentioned discard_stderr, I'm assuming you don't see any Chrome output in your logs?

Sorry, nothing comes to mind at first thought :/

from chromic_pdf.

markquezada avatar markquezada commented on May 19, 2024

Thanks for the reply! Sorry, just saw it.

I'm literally just running the one test shown above... no other PDF tests in parallel or anything like that. I just wanted to confirm it worked in CI and that it picked up the chromium package, etc.

I don't believe anything changes when you SSH in. I'm using the tmate action in the step right before I'm running the tests and then logging in. The only thing I could think of was that there was some sort of boot lag for chrome that I unintentionally skirted by logging in and running the tests manually. But that's a total shot in the dark.

I double checked the logs and I did notice these which I hadn't noticed before:

[0902/080732.519703:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
[0902/080732.877365:WARNING:sandbox_linux.cc(376)] InitializeSandbox() called with multiple threads in process gpu-process.
[0902/080738.291913:ERROR:devtools_pipe_handler.cc(149)] Connection terminated while reading from pipe
[0902/080738.302286:ERROR:devtools_pipe_handler.cc(234)] Could not write into pipe
[0902/080738.302408:ERROR:devtools_pipe_handler.cc(234)] Could not write into pipe

I didn't notice them before because this output wasn't near the test itself, but rather intermingled in the test output. Does that help at all? The connection terminating while reading from pipe sounds... significant 😅

The sandbox one is odd since I was running with no_sandbox: true.

Maybe a tangent, but I was taking a look at the chrome docs and I thought these flags looked interesting. Not sure if they're relevant:

--disable-software-rasterizer --disable-dev-shm-usage --disable-setuid-sandbox

For completeness and in case it's relevant, I did want to note I have this in my test_helper.exs:

ExUnit.after_suite(fn _ -> Supervisor.stop(MyApp.Supervisor) end)

I saw it somewhere in the docs related to not properly shutting down.

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

Hey @markquezada

  • I wouldn't worry about the InitializeSandbox, it probably spits that out before it pays attention to the --no-sandbox flag. In the past if we or someone else had issues with the sandbox, it would generally either just crash with a permission error, or not. So that should be fine.
  • The Could not write into pipe bits definitely stand out, don't they? :) I would assume though that these come from after your test suite has ran, and has closed the pipe from Erlang's side. So more a symptom than the cause of the problem.

But I'm also just guessing, to be honest. It sure looks like some kind of race condition at boot time.

Further debugging ideas / questions:

  • Does it help if you sleep 10s in your test before printing the PDF?
  • How do you start your supervision tree in the test?
  • I'd be curious to see which protocol is being executed when you hit the exception -> would it be possible for you to compile a version that inspects the protocol variable here?

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

Hi @markquezada

Can we close this? Any news?

from chromic_pdf.

markquezada avatar markquezada commented on May 19, 2024

Quick update: I added a sleep call in the test and so far none of our builds have failed. I'm going to let that run for another day or so and then I'll remove it to see if I can trigger the same error I was getting previously.

from chromic_pdf.

markquezada avatar markquezada commented on May 19, 2024

@maltoe FYI, I've been slowly reducing the sleep time when printing in test and now it's back at 0 and I'm getting this occasionally:

1) test it renders a pdf blob (MyApp.TranscriptPdfRendererTest)
Error:      test/my_app/transcript_pdf_renderer_test.exs:12
     ** (exit) exited in: NimblePool.checkout(#PID<0.663.0>)
         ** (EXIT) time out
     code: assert {:ok, _} = TranscriptPDFRenderer.render(transcript)
     stacktrace:
       (nimble_pool 0.2.6) lib/nimble_pool.ex:402: NimblePool.exit!/3
Warning:        (chromic_pdf 1.3.1) lib/chromic_pdf/api.ex:50: anonymous fn/3 in ChromicPDF.API.chrome_export/4
       (chromic_pdf 1.3.1) lib/chromic_pdf/api/telemetry.ex:10: anonymous fn/2 in ChromicPDF.Telemetry.with_telemetry/3
Warning:        (telemetry 1.1.0) /home/runner/work/my_app/my_app/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
       (my_app 0.1.0) lib/my_app/transcript_pdf_renderer.ex:21: MyApp.TranscriptPDFRenderer.render/1
Warning:        test/my_app/transcript_pdf_renderer_test.exs:20: (test)

I'm happy to simply keep the call to sleep in and call it a day but I thought you might want to see this error.

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

Yeah, that one we could probably address. You're still using the session_pool: [timeout: 30_000] option, is that right? That isn't applied to the NimblePool.checkout!/3 call, see here. Mostly because I didn't bother and so far nobody complained, but also because it's a bit more tedious as I can't keep it in the pool state.

A bit of context: The :timeout option controls the time that the actual PDF printing process is allowed to take, namely the time it takes Chrome to load your assets, render the page, create a PDF from it, and return it. It is applied here on Channel.run_protocol/3 which internally calls GenServer.call/2. So that's why we're telling users who have long PDFs to set this option. However, it does not prevent you from maxing out the concurrency of the worker pool, i.e. by trying to print more PDFs when all your workers are already busy. It's this tiny statement in the docs:

Please note that these are non-queueing worker pools. If you intend to max them out, you will need a job queue as well.

So for you to fix this: Use a queue like Oban to ensure limiting your concurrent print_to_pdf calls.


@andreasknoepfle Regarding "that one we could probably address" -> I think we should rescue this error and explain this better. I'm unsure about adding a checkout_timeout option somewhere to control NimblePool, as an option like this would only mitigate the problem for users, but in a fragile way? Not sure if we want to support that.

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

@markquezada addendum on my previous comment:

@andreasknoepfle pointed out that you were only running a single test on CI, which I completely forgot in the meantime :) So I guess it's not really plausible that you max out the concurrency of the pool?

In that case I'd offer the following theory:

  • When ChromicPDF is started as part of your supervision tree, occasionally it takes the worker init (= the SpawnSession protocol, see the SessionPool module) a long time
  • You have set init_timeout option to something greater than 5 seconds
  • Your test runs meanwhile, tries to print something, and runs into the checkout timeout

I could simulate this by reducing the checkout timeout to 1ms and running this on the console:

ChromicPDF.start_link(); ChromicPDF.print_to_pdf({:html, "test'})

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

FWIW I tried setting up Github actions for this repository, and look at the first error that jumps in my face:

image

#179

[edit] or look at this job: https://github.com/bitcrowd/chromic_pdf/actions/runs/3600997571/jobs/6066285321 all sorts of EXITs and timeouts

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

This one may be relevant microsoft/playwright#4345

from chromic_pdf.

markquezada avatar markquezada commented on May 19, 2024

Ah, interesting. Yes, I'm only running the one test so the nimble pool error seemed suspect.

Your theory sounds pretty spot on to me.

FYI, here's my current config in test:

config :my_app, ChromicPDF,
  on_demand: false,
  offline: true,
  discard_stderr: false,
  no_sandbox: true,
  chrome_args: "--disable-dev-shm-usage",
  session_pool: [init_timeout: 30_000, timeout: 30_000]

I even tried making Chromic the first thing in my supervisor tree so that it boots up first but I still get this error occasionally.

This one may be relevant microsoft/playwright#4345

Isn't that running headful chrome though? That ticket notes that headless chromimum is optimized for cold boots.

One thing I just realized... I'm not specifying what chromium binary to use and I'm using the ubuntu-latest github actions runner. Is it possible Chromic is picking a headful version of chromium to run against and that is what's causing the long boot time?

The software list for the ubuntu runner lists:

  • Google Chrome 107.0.5304.121
  • ChromeDriver 107.0.5304.62
  • Chromium 107.0.5288.0

... so headful chrome looks to be installed. Based on the way the binary is located in chrome_impl.ex, it doesn't seem like it would choose headful chrome first, but I haven't verified which binary is getting run. Just a shot in the dark.

Thanks for being receptive to this issue!

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

This one may be relevant microsoft/playwright#4345

Isn't that running headful chrome though? That ticket notes that headless chromimum is optimized for cold boots.

Huh? AFAIK there is no dedicated Chrome build for "headless only" operation. chrome --headless spawns chrome without its UI with any chrome binary.

I thought the ticket was related as it explicitly mentioned random delays on Github actions, but I have since experimented with that by measuring the time it takes ChromeImpl to start the port and it seemed pretty consistent.

To me it feels like there is something fishy on Github Actions. Chrome may be doing some work on start up that may delay it becoming responsive to devtools commands. Perhaps it's loading something from disk, or writing its profile, or the like; and for some reason on Github Actions that occasionally takes a long time. But that's really just a gutfeel, I'd like to investigate further, especially to find out where exactly it stalls the SpawnSession protocol. Will do that when I find the time, see the #179 branch.

Thanks for being receptive to this issue!

You're welcome. Sorry that it has not been the greatest experience for you yet. ChromicPDF usually runs pretty stable on production, but we and other people had a lot of issues in CI already, as quickly launching and stopping Chrome instances seems to be something that is difficult to get to work reliably.

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

@markquezada Still on my way of getting the test suite to run on Github actions (see #179), but it's looking good, i.e. I don't have random delays anymore. Maybe give 1.4.0 a try and see if one of the command line options I copied over from Puppeteer makes a difference. Cargo-cult programming at its finest.

[EDIT] 😠 Nevermind, just got an init timeout. Forget what I said. Still in the complete dark about what's wrong here.

from chromic_pdf.

markquezada avatar markquezada commented on May 19, 2024

A bit funny... we have dependabot set up to auto-create PRs when dependencies update. I was excited when I saw it come in with a ChromIcPDF update. But then CI failed on the PR 😭

Probably not that helpful, but for reference, here's the relevant messages in the logs:

[1208/122123.396777:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
[1208/122123.934858:WARNING:sandbox_linux.cc(380)] InitializeSandbox() called with multiple threads in process gpu-process.
  1) test it renders a pdf blob (Novo.TranscriptPdfRendererTest)
Error:      test/novo/transcript_pdf_renderer_test.exs:7
     ** (ChromicPDF.Browser.ExecutionError) Caught EXIT signal from NimblePool.checkout!/4
     
           ** (EXIT) time out
     
     This means that your operation was unable to acquire a worker from the pool
     within 5000ms, as all workers are currently occupied.
     
     Two scenarios where this may happen:
     
     1) You suffer from this error at boot time. For instance, you're running tests
        on CI and occasionally Chrome takes a long time to spawn, exceeding the
        session pool's "init timeout". Hence, when you're trying to print a PDF
        in a test, the pool does not have any initialized worker yet.
     
        Unfortunately, there is currently very little you can do about this.
     
        See issue #160 for a discussion.
     
     2) You're experiencing this error randomly under load. This would indicate that
        the number of concurrent print jobs exceeds the total number of workers in
        the pool, so that all workers are occupied.
     
        To fix this, you need to increase your resources, e.g. by increasing the number
        of workers with the `session_pool: [size: ...]` option.
     
        However, please be aware that while ChromicPDF (rather, the underlying
        NimblePool worker pool) does perform simple queueing of worker
        checkouts, it is not suitable as a proper job queue. If you expect to
        print multiple PDFs concurrently and especially if you expect peaks in
        your load, a job queue like Oban will provide a better experience.
     
     Please also consult the worker pool section in the documentation.
     
     code: assert {:ok, _} = TranscriptPDFRenderer.render(transcript)
     stacktrace:
       (chromic_pdf 1.4.0) lib/chromic_pdf/pdf/browser/session_pool.ex:70: ChromicPDF.Browser.SessionPool.run_protocol/3
Warning:        (chromic_pdf 1.4.0) lib/chromic_pdf/api.ex:50: anonymous fn/3 in ChromicPDF.API.chrome_export/4
       (chromic_pdf 1.4.0) lib/chromic_pdf/api/telemetry.ex:10: anonymous fn/2 in ChromicPDF.Telemetry.with_telemetry/3
Warning:        (telemetry 1.1.0) /home/runner/work/novo/novo/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
       (novo 0.1.0) lib/novo/transcript_pdf_renderer.ex:21: Novo.TranscriptPDFRenderer.render/1
Warning:        test/novo/transcript_pdf_renderer_test.exs:15: (test)

[1208/122127.815462:ERROR:devtools_pipe_handler.cc(148)] Connection terminated while reading from pipe
[1208/122127.850429:ERROR:devtools_pipe_handler.cc(233)] Could not write into pipe
[1208/122127.850453:ERROR:devtools_pipe_handler.cc(233)] Could not write into pipe

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

@markquezada

I've looked into Github actions further today and established a few points:

  • The delay happens after Port.open, Chrome launches but takes a while to respond to our devtools messages.
  • The delay only happens on first launch of the executable. Subsequent runs are in the 100ms region.
  • The delay only happens sometimes. On some runs Chrome took just a second to respond. On others, it took between 10 and 15 seconds, which lead to the timeout.

-> It looks a lot like Chrome is initializing something on first run, perhaps some cache or so, and perhaps including writing to disk. And only sometimes Github Actions has trouble providing the required resources promptly. Perhaps some part of the filesystem is provisioned lazily. All just guesses though.

Solutions

  • we could set the default timeouts (@checkout_timeout and @init_timeout) to something beyond ~20s. However, I'd prefer not to, as this also means that in normal operation we would wait for 20s for every checkout. In normal operation however, I consider a checkout error to indicate that the user should up their game, i.e. add a proper queue or more resources (see the error message). Thus IMO the default timeout should be low (perhaps even lower than 5s, maybe 1s) to make people aware early. Personal opinion though, yours may vary.
  • The solution I've pursued today is a manual "warmup" step to be added to your CI pipeline, to make sure Chrome has initialized all its things.

Could you do me a favor (hopefully the last) and try out #193 in your pipeline and add the warmup/0 call to your test_helper.exs? 🙏

from chromic_pdf.

austinthecoder avatar austinthecoder commented on May 19, 2024

@maltoe Warmup has been working for a week now. I'd say we're good!

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

Merged. Heads up, I renamed it to warm_up/0. Also added a mix task that can be run directly from a CI script, in case you don't need any options.

Curiously, it has started happening on our own CI now 🎉 Maybe it's something that has changed in recent Chromes (I've updated the Chrome version in the CI maybe a month ago)...

Thanks for your help everyone! @markquezada, I'm going to interpret your silence as "it hasn't happened again" and going to close this issue for now. Feel free to reopen if necessary.

from chromic_pdf.

maltoe avatar maltoe commented on May 19, 2024

Released as v1.5.0

from chromic_pdf.

markquezada avatar markquezada commented on May 19, 2024

@maltoe sorry, just saw this as I've been out on vacation. Yep, it all still seems to be working well with the warmup option for us!

Thanks for your help and persistence in tracking this down and providing a fix!

from chromic_pdf.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.