Performance issues about tzdata (closed)

lau commented on July 22, 2024
Performance issues

Comments (22)

michalmuskala commented on July 22, 2024

I looked at the code a little bit, and I'm pretty sure it should be possible to optimise how ets is used. Right now a lot of data is copied on each access, using ets basically as a blob store. I'm convinced that by optimising the access patterns for ets and storing the data a bit smarter, it should be entirely possible to achieve performance comparable to the compiled version.

lau commented on July 22, 2024

Are the dates in the database dates or datetimes? What timezone are they in? All kinds of different timezones or mostly UTC? Do you have fields in the database for both the date and time and timezone? What version of Timex are you using?

If you do use the tzdata 0.1.x versions, I recommend specifying ~> 0.1.201605 instead of 0.1.8 and updating to the newest 0.1.x versions as they are released.
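
For reference, this is roughly what that version requirement looks like in a project's mix.exs (the rest of the deps list is omitted):

defp deps do
  [
    {:tzdata, "~> 0.1.201605"}
  ]
end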

arjan commented on July 22, 2024

They are datetimes and are stored without timezone. The timezone is stored in another options column (it's a SaaS application; each customer decides which timezone they want to use).

We're using the 3.x series of Timex.

So in theory I could retrieve the timezone once and then construct %DateTime structs by hand. Although there are no functions in Elixir's DateTime for that currently, it seems; from_naive/2 only supports "Etc/UTC" as an argument.

nulian commented on July 22, 2024

I was wondering if there could be a setting for precompiling certain timezones; we only really need high performance for the CET and UTC timezones. If lookups for other timezones were slower, it probably wouldn't be a problem.

lau commented on July 22, 2024

I have considered doing optional dynamic compiling. The data would be compiled in a way similar to the versions pre 0.5.x, but it could happen automatically with new versions too.

arjan commented on July 22, 2024

Maybe my compiler_cache project could be of help with on-the-fly compilation, based on a certain threshold (e.g. a timezone is looked up more than N times).

josevalim commented on July 22, 2024

@lau I wonder if we can roll back to compilation but shard the timezones. So, for example, all of "America" goes into one module, "Asia" into another, "Europe" into another, then "Pacific", and then the remaining ones. That should break it apart enough to avoid requiring 2GB (or more) during compilation. It is also how we reduced the amount of memory used by the Unicode modules.
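
A rough sketch of that kind of sharding, assuming a hypothetical zones_by_region/0 helper that groups the parsed zone data by region; the module names and data shape are illustrative, not tzdata's actual layout:

for {region, zones} <- zones_by_region() do
  defs =
    for {name, periods} <- zones do
      # One clause per zone, with the period data compiled into the module.
      quote do
        def periods(unquote(name)), do: unquote(Macro.escape(periods))
      end
    end

  # Produces e.g. Tzdata.Shard.America, Tzdata.Shard.Asia, ...
  Module.create(Module.concat(Tzdata.Shard, region), {:__block__, [], defs}, __ENV__)
end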

lau commented on July 22, 2024

@josevalim Yeah, splitting them into that handful of shards might be enough for the RAM issue. If not, another sharding algorithm splitting it into even more modules could be used.

The dynamic runtime updates of new data were introduced at the same time as the switch from compilation to ETS. That means a couple more things are different than just the RAM requirements for compilation.

If a system is running for a long time without restarts, many modules with different versions will be created over time. When a new version is downloaded, e.g. Tzdata.Data2017b.Africa could be created. Then something like Tzdata.current_version_module/0 would need to be changed at runtime to return Tzdata.2017b instead of Tzdata.2017a. Then Tzdata.2017a and its submodules can be purged.

Should Erlang's purge function (http://erlang.org/doc/man/code.html#purge-1) be used so that all of the old data doesn't build up and take up memory? Are there any gotchas related to that?

BTW I remember that with the compiled versions (versions 0.1.x), when querying the compiled tzdata initially in the console, it could take what felt like a second to get the first response, whereas the ETS version (0.5.x) is relatively quick and consistent from the beginning. I didn't do extensive testing of that and maybe it isn't an issue.

arjan commented on July 22, 2024

@lau in a project I'm working on we're still using the compiled version and indeed also experiencing this initial delay. Did not look into it yet but it might be due to the initial loading of the compiled module; if so, :code.ensure_loaded/1 could help (calling it on the module when the tzdata OTP app starts)
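
A minimal sketch of that workaround, assuming a precompiled module named Tzdata.Periods (the module names here are illustrative, not tzdata's actual ones):

defmodule Tzdata.App do
  use Application

  def start(_type, _args) do
    # Load the compiled data module eagerly so the first lookup does not
    # pay the module-loading cost.
    {:module, _} = :code.ensure_loaded(Tzdata.Periods)
    Supervisor.start_link([], strategy: :one_for_one, name: Tzdata.Supervisor)
  end
end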

josevalim commented on July 22, 2024

The dynamic runtime updates of new data were introduced at the same time as the switch from compilation to ETS. That means a couple more things are different than just the RAM requirements for compilation.

Right. If you want to recompile dynamically in production, you also need to be careful to not make the production servers run out of memory when doing so.

Regarding the versioning of modules, that should not be a problem. You can keep on using the same module names, and when you compile the new versions, the existing versions will be marked as old. Once you do that again, the old versions are purged and the previously new ones become old. Calling code:purge/1 before you start the process will make sure the old versions are removed upfront, but you should not run into an issue in any case.
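
A minimal sketch of that sequence; the module name and the fetch_new_source/0 helper are hypothetical:

# Drop any code version already marked as old, so loading the new one
# never has to forcibly purge anything.
:code.purge(Tzdata.PeriodData)

# Compiling marks the currently loaded version as old and makes the
# freshly compiled one current.
Code.compile_string(fetch_new_source())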

Another possibility is to have two versions of tzdata: one that is precompiled and another that uses ets. The ets one can provide live updates. The precompiled one may still check for updates and emit periodic log messages or system alarms whenever there is a new version. Many teams are working on their software actively, so doing a new deployment when tzdata changes may not be a concern. YMMV.

michalmuskala commented on July 22, 2024

The basic optimisation would be to store entries keyed by e.g. {:rules, zone_name} instead of a map of all zone names under :rules. A more advanced usage would be to build match specs for particular lookups, so even less data is copied out of the table.

Additionally, setting read_concurrency: true for the table should also be beneficial.
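
A minimal sketch of both suggestions; the table and variable names are assumptions:

# Create the table tuned for many concurrent readers and rare writes.
table = :ets.new(:tzdata_current, [:set, :protected, read_concurrency: true])

# Placeholder standing in for the parsed rule data.
some_rules = []

# One small entry per zone name instead of one big blob under :rules.
:ets.insert(table, {{:rules, "Europe/Copenhagen"}, some_rules})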

josevalim commented on July 22, 2024

Just to provide a more concrete example, code like this:

def zone(zone_name) do
  {:ok, zones()[zone_name]}
end

should be written like:

def zone(zone_name) do
  {:ok, simple_lookup({:zone, zone_name})}
end

and then if you need to find all zones, you can do a match lookup with {:zone, :_}.
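
A minimal sketch of both lookups, assuming the entries are stored as {{:zone, name}, data} in a table called :tzdata_current (both names are illustrative):

# A keyed lookup copies a single entry out of the table.
[{{:zone, _}, zone}] = :ets.lookup(:tzdata_current, {:zone, "Europe/Copenhagen"})

# Fetching every zone is still possible via a match lookup.
all_zones = :ets.match_object(:tzdata_current, {{:zone, :_}, :_})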

A similar issue happens with the periods. We load all periods from ETS on every lookup, which generates a lot of garbage and copying. For optimizing period lookups, assuming it is always an integer range lookup plus a key lookup, we could use match specs and only load the zones we find from the database.

josevalim commented on July 22, 2024

@lau do you have a script you use to build the table from the latest data? I can take a look at optimizing ETS if we can rebuild it easily.

I would also like to mention that I did some experiments in the old pre_ets branch, and by splitting the modules apart and using simpler data structures I was able to keep the maximum memory size during compilation to 260MB.

lau commented on July 22, 2024

@michalmuskala @josevalim Thanks for your input. The rules and zones entries are only used for "far future" calculations, so they won't affect performance for datetimes between year 0 and roughly 40 years into the future from when the ETS table is built. For the far future calculations the rules are used instead of the precalculated periods. Those far future lookups could use optimization too, but they have not been as high a priority as datetimes in the past, the present, and some decades into the future.

For the pre-calculated periods there is currently an entry for each time zone with a list of periods. When looking for matching periods there can be 0, 1 or 2 matches.
I remember looking at the more advanced querying options for ETS, also with a structure, described in a paragraph below, that has multiple entries per time zone. As far as I remember it seemed complicated to query, and it did not help that comparisons have to work for both integers and the atoms :min and :max. The list worked and performance seemed acceptable, at least to begin with, knowing that performance improvements could always be made later.

The other day I ran some benchmarks (also using your benchmark repo as a start @arjan - thanks), which showed a vast difference in ETS call times depending on the size of the data in the ETS entry. It seemed very linear, and for very small amounts of data it does look quite quick. Now I want to try setting up the ETS table structure for periods a bit differently from today: instead of one period entry per time zone there would be one entry per period, i.e. multiple entries with the same key (the timezone id).

Then ETS functions could be used to get not all of the periods for a timezone, but a smaller subset of potentially matching periods, if not exactly the matching ones for a certain datetime. If necessary, other functions can further filter the returned periods.

josevalim commented on July 22, 2024

@lau the data needs to be copied, which explains why it is behaving linearly and it also explains why it is bad to load all the data from ETS when using only part of it.

You should be able to measure improvements simply by doing {:zone, zone_name} lookups and using :ets.match/2 when you need to fetch all zones.

You can probably even further optimize lookups by using matchspecs but the above should be the lowest hanging fruit.

michalmuskala commented on July 22, 2024

Zones are not only used for "far future" calculations. For example, Timex will call zone_exists?/1 during each Timex.parse!/2 call.

josevalim commented on July 22, 2024

I also want to clarify that I am simply using the zone as an example. Almost all entries can likely be keyed somehow instead of loading the whole data. :) So using the timezone id as the key for periods sounds like a great idea.

Also worth mentioning that you can set the table to be a duplicate_bag and you can store the data like this:

{{:period, time_zone_id}, beginning, ending, data}

You will have all entries with this format. Now, in order to find a period between beginning and ending, you can use a matchspec that checks if the given argument is between beginning and ending. This means the matchspec will run directly in ets and you will only load the data you need out of the table. You can still load all periods if you want to by simply looking up {:period, time_zone_id}.
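
A minimal sketch of that layout and lookup, with made-up table and variable names and assuming the beginning/ending values are plain integers (lau notes above that the real data also uses the atoms :min and :max, which this ignores):

table = :ets.new(:tzdata_periods, [:duplicate_bag, read_concurrency: true])

:ets.insert(table, {{:period, "Europe/Copenhagen"}, 63_650_000_000, 63_660_000_000, :period_data})

at = 63_655_000_000

# The guards run inside ETS, so only periods containing `at` are copied out.
matches =
  :ets.select(table, [
    {{{:period, "Europe/Copenhagen"}, :"$1", :"$2", :"$3"},
     [{:"=<", :"$1", at}, {:<, at, :"$2"}],
     [:"$3"]}
  ])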

If you use ETS like an actual table, you will probably gain a lot of benefits.

lau commented on July 22, 2024

Thanks for the feedback again.

Also worth mentioning that you can set the table to be a duplicate_bag and you can store the data like this:

Yes, this is what I meant by having one entry per period and multiple entries with the same time zone id as key. They are already basically like that within the list where the first element is the zone id. There are just multiple sets of beginning & ending values: for UTC, wall and "standard" (offset without DST).

Zones are not only used for "far future" calculations. For example, Timex will call zone_exists?/1 during each Timex.parse!/2 call.

That is probably not necessary to do for each of those calls.

You are right, it can be used for other things; I was being imprecise. For most time zone calculations it is possible to use just the periods. However, zone_exists? should still be fast.

josevalim commented on July 22, 2024

❤️ 💚 💙 💛 💜

codeadict commented on July 22, 2024

Was the performance improved? Does this still need some work?

arjan commented on July 22, 2024

I think eventually we downgraded to an older tzdata version.

lau commented on July 22, 2024

@codeadict I have a branch which stores the data differently in ETS: #64. Lookups are about twice as fast with that, as far as I remember.

For now, if you need it to be faster, you can use the older versions, e.g. ~> 0.1.201805, which use macros instead of ETS. That is faster than ETS by orders of magnitude. However, the older versions do not allow automatic upgrades and can crash during compilation on machines with less than 1-2GB of RAM.
