chronicle-app / chronicle-etl


📜 A CLI toolkit for extracting and working with your digital history

Home Page: https://chronicle.app/

License: MIT License

Ruby 99.91% Shell 0.09%
chronicle chronicle-etl personal-data etl cli archiving personal-archive quantified-self ruby data-liberation

chronicle-etl's Issues

Users should be able to persist credentials for third-party services

Connectors that interact with third-party services often require access tokens, user ids, or other secrets. Chronicle-ETL needs a lightweight secret-management system so that these don't have to be specified manually every time a job runs.

At minimum, we need namespaced key/value pairs. We can use one yml file per namespace, stored in ~/.config/chronicle/etl/secrets (or $XDG_CONFIG_HOME). By convention, we can use one namespace per provider.

Proposed UX

chronicle-etl secrets:set namespace key
chronicle-etl secrets:unset namespace key
chronicle-etl secrets:list namespace

# set with interactive prompt
chronicle-etl secrets:set pinboard access_token

# set from stdin
echo -n 'FOO' | chronicle-etl secrets:set pinboard access_token

# set from cli option + environment variable
chronicle-etl secrets:set pinboard access_token --body "$PINBOARD_ACCESS_TOKEN"

When running a job, secrets will be merged into the connector's option hash. They can come from a few places; this is the load order:

  • secrets from the namespace matching the plugin provider's name
  • secrets from a custom namespace (specified in job options with something like secrets: pinboard-personal)
  • secrets passed via CLI flags for common secrets (--access_token FOO)
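
A minimal sketch of that merge, assuming the yml-per-namespace layout above; read_secrets and connector_options are hypothetical helpers:

require 'yaml'

# Hypothetical helper: read one namespace's secrets from the config dir
def read_secrets(namespace)
  path = File.join(Dir.home, '.config', 'chronicle', 'etl', 'secrets', "#{namespace}.yml")
  File.exist?(path) ? YAML.safe_load(File.read(path)) : {}
end

# Build a connector's option hash in load order; later sources win
def connector_options(provider:, custom_namespace: nil, cli_secrets: {})
  read_secrets(provider)
    .merge(custom_namespace ? read_secrets(custom_namespace) : {})
    .merge(cli_secrets) # e.g. from --access_token FOO
end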

Decisions

  • base64 encode secrets?
  • encrypt secrets?
  • save timestamps for each key/value pair?
  • save a chronicle-etl version in the yml file to make future migrations easier?

When piping stdout, output can be interlaced with status messages on stderr

When running something like `$ chronicle-etl -e shell | grep "foo"`, we often get this sort of race condition:

result 
result
Completed job in 2.207684 secs
  Status:	Success
  Completed:	113
result
result

Options to fix this:

  • detect when stdout is piped (and actually being used!) and don't print a final status message (see the sketch after this list)
  • print status message before running loader.finish (but this might print "success" even if the loader then goes on to fail)
  • add a sleep N before the status message (hacky)
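
A minimal sketch of the first option; print_completion_message stands in for whatever the runner currently does at the end of a job:

# Skip the final summary when stdout is piped so the stderr status
# can't interleave with the record stream
print_completion_message if $stdout.tty?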

Runner should be able to disable running `results_count` on an Extractor

results_count is used to build a nice progress bar, but sometimes calculating the number of records is a computationally expensive task (for example, with multi-gigabyte .mbox files). When we run a job in the background, we probably don't want the speed penalty that comes from calculating this number.

We should be able to pass Runner a configuration option to skip calling this method. If the job is running in a TTY, the progress display will just show the current count without a bar or total.
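
A rough sketch of how that could look, assuming a hypothetical skip_results_count job option and a build_progress_bar stand-in:

# When the option is set, never call the expensive method; an indeterminate
# progress display just shows the running count
total = job_options[:skip_results_count] ? nil : extractor.results_count
progress = build_progress_bar(total: total) # nil total => counter only, no bar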

Should be installable via homebrew

  1. Create a custom tap (a new repo under github.com/chronicle-app)
  2. Add a chronicle-etl.rb formula that downloads the gem

The tool would become installable with brew install chronicle-app/etl/chronicle-etl
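
A rough formula sketch for the tap (tap name, gem URL, and checksum are placeholders):

# chronicle-etl.rb in a chronicle-app/homebrew-etl tap
class ChronicleEtl < Formula
  desc "CLI toolkit for extracting and working with your digital history"
  homepage "https://chronicle.app/"
  url "https://rubygems.org/downloads/chronicle-etl-0.4.2.gem"
  sha256 "REPLACE_WITH_REAL_CHECKSUM"

  depends_on "ruby"

  def install
    # Install the gem into the formula's libexec and expose a wrapped binstub
    ENV["GEM_HOME"] = libexec
    system "gem", "install", cached_download
    bin.install libexec/"bin/chronicle-etl"
    bin.env_script_all_files(libexec/"bin", GEM_HOME: ENV["GEM_HOME"])
  end
end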

Jobs should optionally be allowed to continue when errors are encountered

Some example cases where we might want to use this option:

  • data from the extractor is incomplete and the transformer can't produce a transformed record
  • we don't want to abandon the whole job if a random http request fails for a loader

In the runner, we'd just have to catch errors and use ensure to update the progress bar and log the error to stderr.
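
A sketch of that loop, assuming a continue_on_error job option; the extract/transform/load method names are illustrative:

extractor.extract do |extraction|
  begin
    record = transformer.transform(extraction)
    loader.load(record)
  rescue StandardError => e
    raise unless job_options[:continue_on_error]
    warn "Skipping record: #{e.message}" # log the error to stderr
  ensure
    progress_bar.increment # keep the progress bar moving either way
  end
end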

Plugins should be able to report their connectors without needing to be required

As identified in #24, activating all installed plugins at the same time will lead to dependency problems.

We often need to know which connectors are available (e.g. $ chronicle-etl connectors:list), so we need a way for plugins to report their available connectors without the gem being activated/required.

A few options:

  • use the gem's metadata fields (see the sketch after this list)
  • require a special plugin.rb file that doesn't load any other dependencies
  • ???
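
A sketch of the gem-metadata option; 'chronicle_connectors' is an invented convention (RubyGems metadata values must be strings):

# chronicle-foo.gemspec — advertise connectors without loading any plugin code
Gem::Specification.new do |spec|
  spec.name = 'chronicle-foo'
  spec.version = '0.1.0'
  spec.metadata['chronicle_connectors'] = 'extractor:foo,transformer:foo'
end

# chronicle-etl side: read metadata straight from installed specs, no require needed
Gem::Specification.each do |spec|
  next unless spec.name.start_with?('chronicle-') && spec.metadata['chronicle_connectors']
  puts "#{spec.name}: #{spec.metadata['chronicle_connectors']}"
end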

Add a multithreaded job runner

Currently, jobs are run one record at a time and the runner can end up spending a lot of time waiting for a slow transformer or an IO-bound loader (i.e. POSTing to a slow API endpoint). Chronicle-ETL should offer a multithreaded mode as well (or by default?).

The most impactful change for the least effort would be keeping extractors single-threaded but handing records off to a worker pool that transforms and then loads them. This would require only minimal changes to:

  • the job runner UI
  • the job logging system
  • the loaders, which need to be thread-safe (use thread-safe arrays for TableLoader, etc.)

Adding concurrency to extractors would be trickier, since each extractor would have to specify how to chunk up the work (and it wouldn't even be possible for certain extractors, like reading from stdin or an API that uses cursors for pagination).
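
A rough shape of the single-threaded-extractor / worker-pool idea, using a bounded queue from the standard library (connector method names are illustrative, and the loader would still need its own synchronization):

queue = SizedQueue.new(100) # backpressure so the extractor can't race ahead
workers = 4.times.map do
  Thread.new do
    while (extraction = queue.pop)
      record = transformer.transform(extraction)
      loader.load(record) # must be thread-safe
    end
  end
end

extractor.extract { |extraction| queue.push(extraction) }
workers.size.times { queue.push(nil) } # one sentinel per worker to shut down cleanly
workers.each(&:join)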

Job output should be able to be filtered, sorted, and field-filtered

In line with CLI conventions, we should provide these options:

  • filter: only output records that match a set of conditions
  • sort: sort records by a given column (ascending, descending)
  • fields: only output values from a given set of fields

There are a few ways this could be implemented:

  • global --filter, --sort, --fields flags for chronicle-etl
  • passing as options for the Loader through --loader-opts
  • treating them as multiple transformers that run before getting to the Loader (requires completion of #6)

We'd also need Loaders to determine whether output can be incremental or has to happen once the job is complete (for instance, when a sort option is used).
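
Whichever flag style wins, the underlying record manipulation is simple. A plain-Ruby sketch over an array of record hashes (field names are made up), which also shows why sort forces non-incremental output:

records = records.select { |r| r[:provider] == 'pinboard' } # --filter provider=pinboard
records = records.sort_by { |r| r[:timestamp] }.reverse # --sort timestamp (descending)
records = records.map { |r| r.slice(:id, :timestamp, :title) } # --fields id,timestamp,title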

SQLite loader

To figure out:

  • best way to avoid N+1 INSERTs
  • how to handle schema
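
For the N+1 question, one prepared statement inside a single transaction (sqlite3 gem) is probably enough; the single-table schema here is just a placeholder:

require 'json'
require 'sqlite3'

db = SQLite3::Database.new('chronicle.db')
db.execute('CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, timestamp TEXT, data TEXT)')

# One transaction + one prepared statement instead of N separate INSERTs
stmt = db.prepare('INSERT OR REPLACE INTO records (id, timestamp, data) VALUES (?, ?, ?)')
db.transaction do
  records.each { |r| stmt.execute(r[:id], r[:timestamp], r.to_json) }
end
stmt.close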

Shell transformer

Users should be able to specify a transformer that passes extracted data to a command's stdin and receive the command's stdout stream as the transformed data.
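
A minimal version using Open3 from the standard library; the command string would come from the transformer's options:

require 'open3'

# Pass the extracted data to the command's stdin; its stdout becomes the transformed data
def shell_transform(command, data)
  stdout, status = Open3.capture2(command, stdin_data: data.to_s)
  raise "#{command} exited with #{status.exitstatus}" unless status.success?
  stdout
end

shell_transform("jq '.title'", '{"title": "hello"}') # => "\"hello\"\n"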

Add OAuth2 authenticator

Key thing to figure out: can we do this in a way that doesn't require spinning up a web server to catch the response (webrick?)

Jobs should be able to have multiple transformers

We might want to do something like extract data from an API, transform it into an activity stream, and transform it again by adjusting all the dates.

Changes:

  • change the --transformer flag to accept an array of transformer names
  • handle --transformer-opts when there are multiple transformers (maybe a special format for the hash keys?)
  • make the yml parser handle both a single transformer and an array of transformers
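
Once the flag accepts an array, applying the transformers in order is just a fold (transform/load are illustrative method names):

# transformers are instantiated in the order given on the CLI / in the yml file
record = transformers.reduce(extraction) do |data, transformer|
  transformer.transform(data)
end
loader.load(record)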

Dependencies of two different plugins can conflict

Right now, plugins aren't dependencies of chronicle-etl; we just check which chronicle-* gems are available at runtime and try to load them all. The problem is that plugins can have conflicting dependencies.

A typical scenario: for two platforms foo.com and bar.com, the ruby client libraries ruby-foo and ruby-bar will require different versions of faraday. If a user runs gem install chronicle-foo and gem install chronicle-bar and then tries to use them both within chronicle-etl, we'll get a Gem::ConflictError.

Possible solutions:

  • Make this project a monolith. We centralize plugins into the chronicle-etl repo, let bundler handle dependency resolution, and accept that our bundle size will be huge and potentially hard to install if any plugins require native extensions.
  • Only allow one plugin to be used at a time. Because we won't load all available plugins, the Connector Registry won't know which connectors are available (the register_connector macro won't be called), so we'll need another technique.
  • Add commonly-used plugins as dependencies in chronicle-etl.gemspec so we can guarantee that at least those run well together.

In any case, it'll be important to provide enough infrastructure in chronicle-etl so that plugins will need only a few additional dependencies.

Plugins should be able to register authorizers

A lot of plugins will require an OAuth2 flow to get authorization. If we add omniauth to Chronicle-ETL, plugins can include omniauth-* gems. Then we just need a convention for plugins to register their omniauth strategy and hook into the CLI authenticator flow (#48).

Jobs should track highest encountered ids/timestamps to use them for continuations

Often, we want to run a job to pull only the newest activities from a source. If we save the highest id or latest timestamp that we processed from a job, we can pass this to the extractor the next time the job is run and use it as a cursor.

To implement this, we'd need:

  • transformers should be able to report back the id and timestamp of the record they're processing
  • the Runner should capture the max() of both of these fields
  • we should persist these values to the filesystem
  • we need a configuration option to tell an Extractor to use this cursor. If this option is on, the Runner should look up the cursor and pass it to the Extractor when it is instantiated
  • Extractors should be able to take this cursor and use it to pull data that is newer
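
A sketch of the capture-and-persist half; record.id, record.timestamp, data_dir, and job_name are placeholders for whatever the Runner actually has on hand:

require 'yaml'

cursor = { 'max_id' => nil, 'max_timestamp' => nil }

records.each do |record|
  cursor['max_id'] = [cursor['max_id'], record.id].compact.max
  cursor['max_timestamp'] = [cursor['max_timestamp'], record.timestamp].compact.max
end

# Persisted next to other job state; on the next run of this named job, the Runner
# would read this back and pass it to the Extractor as its cursor
File.write(File.join(data_dir, "#{job_name}-cursor.yml"), cursor.to_yaml)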

SQLite extractor

Config:

  • SQL query
  • secondary SQL for metadata
  • db connection string

Yields:

  • row as data
  • additional metadata
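
A rough extractor shape over those config keys (sqlite3 gem; option names are placeholders and the metadata query is optional):

require 'sqlite3'

def extract(options)
  db = SQLite3::Database.new(options[:db])
  db.results_as_hash = true

  metadata = options[:metadata_query] ? db.execute(options[:metadata_query]).first : {}

  db.execute(options[:query]) do |row|
    yield row, metadata # row as data, plus additional metadata
  end
ensure
  db&.close
end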

Plugins should be upgradable (semi)automatically

If a user upgrades chronicle-etl and then uses a plugin, the plugin's gemspec might resolve to an older version of chronicle-etl, and a Gem::ConflictError will be raised when attempting to load it.

This should be pretty recoverable: we check if a plugin is installed but incompatible and then call Gem.install again (todo: figure out if we have to pass an explicit version).

It would also be useful to have a chronicle-etl plugins:update_all command, and maybe a chronicle-etl plugins:cleanup for removing old versions.

Users should be able to load arbitrary connector classes

Ways this could work:

  • using --extractor ./custom_extractor.rb as a flag
  • a known plugin directory in the config directory

We take that file, attempt to load it (and rescue from LoadError), and then set it as the connector in the JobDefinition.
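
A sketch of the loading step, assuming connectors subclass Chronicle::ETL::Extractor; how the class gets discovered and attached to the JobDefinition is illustrative:

begin
  require File.expand_path('./custom_extractor.rb')
rescue LoadError => e
  abort "Could not load connector file: #{e.message}"
end

# Basic validation before wiring it into the JobDefinition
klass = CustomExtractor # however the loaded class gets discovered
abort "#{klass} is not an Extractor" unless klass < Chronicle::ETL::Extractor
job_definition.extractor_klass = klass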

This system would also need:

  • basic validation system to ensure the right things are required and included

Limitations:

  • doesn't handle dependency management

Users should be able to specify options for jobs

Right now, we can pass in options for the Extractor, Transformer, and Loader components of a job.

We should also be able to specify high-level options for the job as a whole. Some examples would be continue_on_error or save_log booleans.

Installing on fresh installation of chronicle-app produces error

After installing using the commands in the README, I attempted to run:

chronicle-etl --extractor imessage --since "2022-02-07" --transformer imessage

I was then prompted with:

Plugin specified by job not installed.
Do you want to install chronicle-imessage and start the job? (Y/n)

This is a little odd, since I just installed it. However, whether I choose "Y" or "n", I run into a stack trace:

/Users/userboy/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/chronicle-etl-0.4.2/lib/chronicle/etl/job_definition.rb:42:in `validate!': Job definition is invalid (Chronicle::ETL::JobDefinitionError)

Plugin connectors should be able to specify different strategies

A plugin might have different ways of extracting the same type of records. For example, youtube history can come from a Google Takeout export, the official API, or a scraper.

Right now, a connector is specified as plugin:identifier, but we could add an optional middle segment to specify a strategy. Syntax:
chronicle-etl -e youtube:scraper:likes. If the strategy is left out, we find the first connector that matches the identifier.

Alternatively, different strategies could just be housed in their own plugins. Invoking them this way:
chronicle-etl -e youtube-scraper:likes

Support for incremental extractions (only load records new since last run)

If using Chronicle-ETL to do incremental backups of personal data or syncing to other services, it's annoying to extract a full set of records each time a job is run. An incremental extraction system could let users extract records created/modified since the last time the job was run.

The guts of it (some already half-implemented):

  • a system for persisting results from a job (a row saved in a sqlite db stored in $XDG_DATA_HOME)
  • a way to specify that a job should continue from the last run
  • setting the since option on a job automatically based on results of the last run

This system won't work with ad-hoc jobs specified with only CLI flags. It would require either a stable name for a job or a config file (perhaps these are one and the same?).
