chronicle-app / chronicle-etl

A CLI toolkit for extracting and working with your digital history

Home Page: https://chronicle.app/
License: MIT License
We can use `$stdin.stat.pipe?` to detect whether input is being piped in.
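A minimal sketch of how that check could drive input handling; the helper name is hypothetical, not part of chronicle-etl:

```ruby
# Hypothetical helper: read a secret's value from stdin when input is
# piped (e.g. `echo -n 'FOO' | chronicle-etl secrets:set ...`), otherwise
# fall back to an interactive prompt.
def read_secret_value(io = $stdin)
  if io.stat.pipe?
    io.read.chomp
  else
    print "Enter value: "
    io.gets&.chomp
  end
end
```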
Connectors that interact with third-party services often require access tokens, user ids, or other secrets. Chronicle-ETL needs a lightweight secret management system so that jobs don't have to have these specified manually each time.
At minimum, we need namespaced key/value pairs. We can use one yml file per namespace, stored in `~/.config/chronicle/etl/secrets` (or `$XDG_CONFIG_HOME`). By convention, we can use one namespace per provider.
```
chronicle-etl secrets:set namespace key
chronicle-etl secrets:unset namespace key
chronicle-etl secrets:list namespace

# set with interactive prompt
chronicle-etl secrets:set pinboard access_token

# set from stdin
echo -n 'FOO' | chronicle-etl secrets:set pinboard access_token

# set from cli option + environment variable
chronicle-etl secrets:set pinboard access_token --body "$PINBOARD_ACCESS_TOKEN"
```
When running a job, secrets will be merged into the connector's option hash. Secrets can come from a few places. This is the load order:

- `secrets: pinboard-personal` (in the job definition)
- `--access_token FOO` (as a CLI flag)
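The storage scheme described above (one YAML file per namespace under the config directory) could be sketched like this; the class and method names are illustrative, not chronicle-etl's actual API:

```ruby
require 'yaml'
require 'fileutils'

# Sketch of the proposed secrets store: one YAML file per namespace,
# kept under ~/.config/chronicle/etl/secrets by default.
class SecretsStore
  def initialize(root = File.join(Dir.home, '.config', 'chronicle', 'etl', 'secrets'))
    @root = root
  end

  def set(namespace, key, value)
    secrets = all(namespace)
    secrets[key] = value
    FileUtils.mkdir_p(@root)
    path = path_for(namespace)
    File.write(path, secrets.to_yaml)
    File.chmod(0o600, path) # keep tokens unreadable by other users
  end

  def unset(namespace, key)
    secrets = all(namespace)
    secrets.delete(key)
    File.write(path_for(namespace), secrets.to_yaml)
  end

  def all(namespace)
    path = path_for(namespace)
    File.exist?(path) ? (YAML.safe_load(File.read(path)) || {}) : {}
  end

  private

  def path_for(namespace)
    File.join(@root, "#{namespace}.yml")
  end
end
```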
When running something like `$ chronicle-etl -e shell | grep "foo"`, we often get this sort of race condition:
```
result
result
Completed job in 2.207684 secs
Status: Success
Completed: 113
result
result
```
Options to fix this:

- Wait for `loader.finish` (but this might print "success" even if the loader then goes on to fail)
- `sleep N` before the status message (hacky)

`Gem.install` failing takes as long as two minutes on my machine.
`$ chronicle-etl jobs:run NONEXISTENT` starts running a default job right now. We should explicitly raise an exception and exit(1).
Currently, we run jobs:start with the stdin extractor, which results in the CLI waiting for input (without even a prompt). This is confusing. Perhaps the best thing here would be to show help.
`results_count` is used to build a nice progress bar, but sometimes calculating the number of records is a computationally expensive task (for example, multi-gigabyte `.mbox` files). When we run a job in the background, we probably don't want the speed penalty that comes from calculating this number.

We should be able to pass `Runner` a configuration option to skip calling this method. If the job is running in a tty, the progress indicator will just show the current count without a completion percentage.
The tool would become installable with `brew install chronicle-app/etl/chronicle-etl`
Currently it outputs one record per line (https://jsonlines.org/). We can add a `line_separated` setting (default: true); when false, it outputs a single JSON array of records.
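A sketch of how the proposed setting could change serialization; the function name is illustrative:

```ruby
require 'json'

# Sketch of the proposed `line_separated` setting on the JSON loader:
# when true (the default), emit one record per line (JSON Lines);
# when false, emit a single JSON array of records.
def serialize_records(records, line_separated: true)
  if line_separated
    records.map { |r| JSON.generate(r) }.join("\n")
  else
    JSON.generate(records)
  end
end
```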
Add `--silent` as a `class_option` and handle it in `#setup_log_level`. Some example cases where we might want to use this option:
In the runner, we'd just have to catch errors and use `ensure` to update the progress bar and log the error to stderr.
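The per-record loop could be sketched like this; `transformer`, `loader`, and `progress_bar` stand in for the runner's real collaborators, and the method name is illustrative:

```ruby
# Sketch of the runner's per-record step: rescue transform/load errors,
# log them to stderr, and use `ensure` so the progress bar always advances.
def process(extraction, transformer:, loader:, progress_bar:)
  record = transformer.transform(extraction)
  loader.load(record)
rescue StandardError => e
  warn "Error processing record: #{e.message}"
ensure
  progress_bar.increment
end
```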
As identified in #24, activating all installed plugins at the same time will lead to dependency problems.
We often need to know which connectors are available (e.g. `$ chronicle-etl connectors:list`), so we need a way for plugins to report their available connectors without the gem being activated/required.
A few options:
Currently, `$ chronicle-etl --fields nonexist-field` will result in:

```
:header must be a non-empty array (TTY::Table::InvalidArgument)
```
Currently, jobs are run one record at a time and the runner can end up spending a lot of time waiting for a slow transformer or an IO-bound loader (e.g. POSTing to a slow API endpoint). Chronicle-ETL should offer a multithreaded mode as well (or by default?).
The most impactful change with least effort would be keeping extractors single-threaded but handing off records to a worker pool that can transform and then load records. This would require only minimal changes, chiefly making loaders thread-safe (`TableLoader`, etc).

Adding concurrency to extractors would be trickier, since each extractor would have to specify how to chunk up the work (and it wouldn't even be possible for certain extractors, like reading from stdin or an API that uses cursors for pagination).
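The single-threaded-extractor / worker-pool model could be sketched with a `Queue` and a poison pill per worker; the function name and signature are illustrative:

```ruby
# Sketch of the proposed concurrency model: the extractor stays
# single-threaded and pushes records onto a queue, while a pool of workers
# transforms and loads them. Loaders would need to be thread-safe.
def run_concurrently(records, worker_count: 4, &transform_and_load)
  queue = Queue.new
  workers = worker_count.times.map do
    Thread.new do
      while (record = queue.pop)
        transform_and_load.call(record)
      end
    end
  end

  records.each { |record| queue.push(record) } # extraction stays serial
  worker_count.times { queue.push(nil) }       # one poison pill per worker
  workers.each(&:join)
end
```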
In line with CLI conventions, we should provide these options:

- `filter`: only output records that match a set of conditions
- `sort`: sort records by a given column (ascending, descending)
- `fields`: only output values from a given set of fields

There are a few ways this could be implemented:

- `--filter`, `--sort`, `--fields` flags for chronicle-etl
- `--loader-opts`

We'd also make Loaders determine whether output can be incremental or has to happen when the job is complete (for instance, if a `sort` option is used).
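A sketch of how the three options could be applied to a batch of records (represented here as hashes; the function name is illustrative). Note that `sort` forces buffering the whole set, which is why sorted output can't be loaded incrementally:

```ruby
# Sketch of applying the proposed --filter / --sort / --fields options.
def present(records, filter: nil, sort: nil, fields: nil)
  records = records.select { |r| filter.all? { |k, v| r[k] == v } } if filter
  records = records.sort_by { |r| r[sort] } if sort
  records = records.map { |r| r.slice(*fields) } if fields
  records
end
```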
In `PluginRegistry#activate`, we can add `gem 'chronicle-NAME'` and watch for a `Gem::MissingSpecError` (or maybe via an `#installed?` method).

Currently, we're only doing `require 'chronicle/NAME'`, which generates a `LoadError` if the gem is missing OR if the gem fails while loading. This seems to be the problem in #33.
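One way to implement an `#installed?` check without requiring the gem is `Gem::Specification.find_by_name`, which raises `Gem::MissingSpecError` when the gem isn't installed. That separates "not installed" from "installed but fails to load":

```ruby
# Sketch: true if the plugin gem is installed, without requiring it.
# A failure inside the gem's own code would still surface later as a
# LoadError at require time, but it can no longer be confused with
# the gem simply being absent.
def plugin_installed?(name)
  Gem::Specification.find_by_name("chronicle-#{name}")
  true
rescue Gem::MissingSpecError
  false
end
```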
To figure out:
(What about existing files with insecure permissions? Warning? Overwrite permissions on write?)
Users should be able to specify a transformer that passes extracted data to a command's stdin and receive the command's stdout stream as the transformed data.
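A sketch of such a shell-command transformer using `Open3`; the function name and error message are illustrative:

```ruby
require 'open3'

# Sketch: pipe the extracted data to a command's stdin and treat its
# stdout as the transformed data.
def shell_transform(command, data)
  stdout, status = Open3.capture2(command, stdin_data: data)
  raise "Command failed: #{command}" unless status.success?
  stdout
end
```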
```
undefined method `truncate' for nil:NilClass
```
Key thing to figure out: can we do this in a way that doesn't require spinning up a web server to catch the response (webrick?)
We might want to do something like extract data from an API, transform it into an activity stream, and transform it again by adjusting all the dates.
Changes:

- `until: 2022-05-03` in a yml file is parsed as a Date object
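This is standard Ruby YAML behavior: Psych turns an unquoted ISO-8601 date into a `Date`, so the job option arrives as a date object rather than a string (with `YAML.safe_load`, `Date` must be explicitly permitted):

```ruby
require 'yaml'
require 'date'

# An unquoted ISO-8601 value in YAML becomes a Date, not a String.
config = YAML.safe_load("until: 2022-05-03", permitted_classes: [Date])
config['until'] # a Date object, not a String
```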
For QA and security purposes. Can be basis for PluginRegistry.exists?()
Right now, plugins aren't dependencies of chronicle-etl and we just check for which `chronicle-*` gems are available at runtime and try to load them all. The problem is that plugins can have conflicting dependencies.

A typical scenario: for two platforms foo.com and bar.com, the ruby client libraries ruby-foo and ruby-bar will require different versions of faraday. If a user runs `gem install chronicle-foo` and `gem install chronicle-bar` and then tries to use them both within chronicle-etl, we'll get a `Gem::ConflictError`.
Possible solutions:

- Defer requiring plugins until they're needed (but then the `register_connector` macro won't be called, so we'll need another technique)
- Make plugins explicit dependencies in `chronicle-etl.gemspec`, so we can guarantee that at least those ones run well together.

In any case, it'll be important to provide enough infrastructure in chronicle-etl so that plugins will need only a few additional dependencies.
Support a `--no-color` flag and the `NO_COLOR` environment variable.

A lot of plugins will require an oauth2 flow to get authorization. If we add omniauth to Chronicle-ETL, plugins can include omniauth-* gems. Then we just need a convention for plugins registering their omniauth strategy to hook into the CLI authenticator flow (#48)
Run the extract and transform steps of a job but not do any loading
Often, we want to run a job to pull only the newest activities from a source. If we save the highest id or latest timestamp that we processed from a job, we can pass this to the extractor the next time the job is run and use it as a cursor.
To implement this, we'd need:
- the extractor reporting the `id` and `timestamp` of the record it's processing
- saving the `max()` of both of these fields

Raise an exception if they're not provided. Typically oauth authorizers need at minimum a `client_id` and `client_secret`.
Config:
Yields:
data
If a user upgrades chronicle-etl and then uses a plugin, the plugin's gemspec might resolve an older version of chronicle-etl, and then a ConflictError will be raised when attempting to load it.

This should be pretty recoverable: we check if a plugin is installed but incompatible and then call `Gem.install` again (todo: figure out if we have to pass an explicit version).

It would also be useful to have a `$ chronicle-etl plugins:update_all` command, and maybe a `$ chronicle-etl plugins:cleanup` for removing old versions.
Ways this could work:

- `--extractor ./custom_extractor.rb` as a flag

We take that file, attempt to load it (rescuing from `LoadError`), and then set it as the connector in the `JobDefinition`.
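The loading step could be sketched like this. The base class is stubbed so the example is self-contained, and the function name and error wording are illustrative:

```ruby
# Stand-in base class so the sketch runs on its own; in the real tool
# this would be Chronicle::ETL::Extractor.
module Chronicle
  module ETL
    class Extractor; end
  end
end

# Sketch: load a one-off connector from a file, rescuing load failures,
# and return the newly defined extractor class.
def load_custom_extractor(path)
  known = ObjectSpace.each_object(Class).select { |c| c < Chronicle::ETL::Extractor }
  require File.expand_path(path)
  defined_now = ObjectSpace.each_object(Class).select { |c| c < Chronicle::ETL::Extractor }
  (defined_now - known).first
rescue LoadError, SyntaxError => e
  raise "Could not load extractor from #{path}: #{e.message}"
end
```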
This system would also need
Limitations:
This should just exit(1):

```
chronicle-etl -e alaksjdflksajd > output.txt
```
Right now, we can pass in options for the Extractor, Transformer, and Loader components of a job. We should also be able to specify high-level options for the job as a whole. Some examples would be `continue_on_error` or `save_log` booleans.
For investigating structure of extractor results
Could also just add a flag to show only headers for json/table/csv loaders.
After installing using the commands in the README, I attempted to run:

```
chronicle-etl --extractor imessage --since "2022-02-07" --transformer imessage
```

I was then prompted with:

```
Plugin specified by job not installed.
Do you want to install chronicle-imessage and start the job? (Y/n)
```

This is a little odd, since I just installed it. However, whether I answer "Y" or "n", I run into a stack trace:

```
/Users/userboy/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/chronicle-etl-0.4.2/lib/chronicle/etl/job_definition.rb:42:in `validate!': Job definition is invalid (Chronicle::ETL::JobDefinitionError)
```
A plugin might have different ways of extracting the same type of records. For example, youtube history can come from a Google Takeout export, the official API, or a scraper.
Right now, a connector is specified as `plugin:identifier`, but we could add an optional middle item to specify a strategy. Syntax: `chronicle-etl -e youtube:scraper:likes`. If the strategy is left out, we can find the first connector that matches the identifier.

Alternatively, different strategies could just be housed in their own plugins and invoked this way: `chronicle-etl -e youtube-scraper:likes`
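Parsing the proposed syntax could be sketched like this (the function name and returned hash shape are illustrative):

```ruby
# Sketch of parsing the proposed connector syntax: `plugin:identifier`
# stays valid, while an optional middle segment selects a strategy,
# e.g. `youtube:scraper:likes`.
def parse_connector(arg)
  parts = arg.split(':')
  case parts.length
  when 3 then { plugin: parts[0], strategy: parts[1], identifier: parts[2] }
  when 2 then { plugin: parts[0], strategy: nil, identifier: parts[1] }
  else { plugin: parts[0], strategy: nil, identifier: nil }
  end
end
```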
- a `--job job.yml` flag (CLI flags like `--extractor Foo` should override the job file)
- `Runner.new`
Right now, we only list extractor/transformer/loader but we should also display the options for each.
If using Chronicle-ETL to do incremental backups of personal data or syncing to other services, it's annoying to extract a full set of records each time a job is run. An incremental extraction system could let users extract records created/modified since the last time the job was run.
The guts of it (some already half-implemented):
- saving job state (in `$XDG_DATA_HOME`)
- setting the `since` option on a job automatically based on results of the last run

This system won't work with ad-hoc jobs specified with only CLI flags. It would require either a stable name for a job or a config file (perhaps this is one and the same?)
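Persisting per-job state between runs could be sketched like this; the class name and file layout are illustrative, not an existing chronicle-etl API:

```ruby
require 'yaml'
require 'fileutils'

# Sketch: save a job's high-water mark so the next run can set its
# `since` option automatically. Uses $XDG_DATA_HOME, falling back to
# ~/.local/share.
class JobState
  def initialize(job_name, root: ENV['XDG_DATA_HOME'] || File.join(Dir.home, '.local', 'share'))
    @path = File.join(root, 'chronicle', 'etl', 'job_state', "#{job_name}.yml")
  end

  def save(highest_timestamp:)
    FileUtils.mkdir_p(File.dirname(@path))
    File.write(@path, { 'since' => highest_timestamp.to_s }.to_yaml)
  end

  def since
    return nil unless File.exist?(@path)
    YAML.safe_load(File.read(@path))['since']
  end
end
```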