Coder Social home page Coder Social logo

Comments (3)

whoahbot avatar whoahbot commented on June 6, 2024 1

Hi @rafael-ariascalles,

You shouldn't need to modify the entrypoint, as most of these configuration options already accept environment variables.

As an example, you can set the number of workers per process using the environment variable BYTEWAX_WORKERS_PER_PROCESS.

from bytewax.

Psykopear avatar Psykopear commented on June 6, 2024

Hi rafael.

The connector does not let kafka handle the offset with the group.id mechanism by default because we keep track of it in each partition internally. Bytewax does that to ensure that in the event of a restart, the dataflow can resume from the latest known checkpoint and ensure that all the messages will be processed at least once.
If you don't care about recovery, you can set group.id and enable auto.commit. But that means that when your dataflow restarts, you loose all the messages that were received but not processed, loosing the at-least-once semantics the operator normally offers.

The nice thing is that you don't need to use group.id to achieve the parallelization level you want, the connector already does that and creates a bytewax input partition for each partition in each topic, so if you have a single topic with 10 partitions, you can run the dataflow with 10 workers (or 10 processes), and each one will handle a single kafka partition.

The warning is related to the fact the we pass the same config used in the consumer to an AdminClient we use when bootstrapping the dataflow to retrieve the list of partitions for each topic (see here). That client is seen as a "producer", so you get the warning, but you can safely ignore that.

from bytewax.

rafael-ariascalles avatar rafael-ariascalles commented on June 6, 2024

Thanks for the response. @Psykopear

If I understand correctly, I will have to modify the entrypoint.sh used by the Image bytewax/bytewax:0.18.1-python3.11

#!/bin/sh

cd $BYTEWAX_WORKDIR
. /venv/bin/activate
python -m bytewax.run -w$NUM_WORKERS $BYTEWAX_PYTHON_FILE_PATH

echo 'Process ended.'

if [ "$BYTEWAX_KEEP_CONTAINER_ALIVE" = true ]
then
    echo 'Keeping container alive...';
    while :; do sleep 1; done
fi

NUM_WORKERS is the known number of partitions of a given topic.

so In a given Dockerfile I will just need to

FROM bytewax/bytewax:0.18.1-python3.11
ENV NUM_WORKERS=10
COPY entrypoint.sh /bytewax/entrypoint.sh

from bytewax.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.