
kiba's Introduction

Kiba ETL


Writing reliable, concise, well-tested & maintainable data-processing code is tricky.

Kiba lets you define and run such high-quality ETL (Extract-Transform-Load) jobs using Ruby.

Getting Started

Head over to the Wiki for up-to-date documentation.

If you need help, please ask your question with the tag kiba-etl on StackOverflow so that others can benefit from your contribution! I monitor this specific tag and will reply to you.

Kiba Pro customers get priority private email support for any unforeseen issues, as well as for simple matters such as installation troubles. Our consulting services are also prioritized for Kiba Pro subscribers. If you need any coaching on ETL & data pipeline implementation, please reach out via email so we can discuss how to help you out.

You can also check out the author blog and StackOverflow answers.

Supported Ruby versions

Kiba currently supports Ruby 2.5+, JRuby 9.2+ and TruffleRuby. See test matrix.

ETL consulting & commercial version

Consulting services: if your organization needs guidance on Kiba / ETL implementations, we provide consulting services. Contact at https://www.logeek.fr.

Kiba Pro: for vendor-backed ETL extensions, check out Kiba Pro.

License

Copyright (c) LoGeek SARL. Kiba is an Open Source project licensed under the terms of the LGPLv3 license. Please see http://www.gnu.org/licenses/lgpl-3.0.html for license text.

Contributing & Legal

(agreement below borrowed from Sidekiq Legal)

By submitting a Pull Request, you disavow any rights or claims to any changes submitted to the Kiba project and assign the copyright of those changes to LoGeek SARL.

If you cannot or do not want to reassign those rights (your employment contract for your employer may not allow this), you should not submit a PR. Open an issue and someone else can do the work.

This is a legal way of saying "If you submit a PR to us, that code becomes ours". 99.9% of the time that's what you intend anyways; we hope it doesn't scare you away from contributing.

kiba's People

Contributors

janko, olimart, ollie, thbar, vfonic


kiba's Issues

Define the Kiba Pro offering

As first mentioned on my blog, I plan to launch a Kiba Pro offering. Please comment here or email me to discuss the details and express your interest and support; it will be greatly appreciated.

Sustainable OSS

Especially after maintaining activewarehouse-etl for quite a bit of time, I want to find a sustainable way to keep a high-quality ETL solution for Ruby alive for years to come, without killing my solopreneur business.

To paraphrase Mike Perham's statement about Sidekiq Pro (see the Sidekiq Pro FAQ): "Kiba Pro is (well, will be) an extension which adds a few nice features to Kiba. Kiba is open source and free for all to use but unfortunately it takes a lot of my time to maintain and support. Kiba Pro is a way for you to purchase really useful functionality while also ensuring Kiba will be supported for years to come".

A recurring yearly subscription similar to Sidekiq Pro would secure some of my time to maintain and improve the solution and avoid the fate of activewarehouse-etl.

Potential features

Features could include:

  • multi-threading (and later, multi-machines) - commonly requested
  • built-in sources/transforms/destinations for common tasks
    • lookups
    • upserts / bulk load modules
    • connectors optimized for parallelism (HTTP pagination extraction)
    • connectors for the cloud (RedShift, ...)
  • built-in helpers for common operations (debugging, limiting, caching...)
  • premium support

Interested? Please chime in!

I'm posting this publicly for transparency, and also to find the "right" feature set: one for which larger companies would be willing to support my work on Kiba while getting extra value in return.

Do not hesitate to email me if your company could be interested in supporting richer features.

Allow block-form for source

I find myself wrapping sources like this:

require 'oj'

class JSONSource
  def initialize(file)
    @file = file
  end

  def each
    File.open(@file) do |file|
      Oj.load(file).each do |row|
        yield row
      end
    end
  end
end

where I'd prefer to be able to write this (at least for one-off scripts):

source do
  File.open(source_file) do |file|
    Oj.load(file).each do |row|
      yield row
    end
  end
end

Here the semantic would be that the block must yield each row. Another possible semantic would be this:

source { Oj.load(IO.read(source_file)) }

where the block is expected to return something that would respond to .each.

Both semantics might be nice to have, perhaps with a different keyword or some kind of parameter.

I need to think about this in more depth; for now, I'm just dropping a note.
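
For what it's worth, the second semantic can be approximated today with a small wrapper class. Here is a minimal sketch (EnumerableSource is not part of Kiba; it simply adapts a callable returning an enumerable):

class EnumerableSource
  def initialize(callable)
    @callable = callable
  end

  def each(&block)
    # call the lambda, then delegate enumeration to whatever it returns
    @callable.call.each(&block)
  end
end

# Usage inside Kiba.parse (source_file assumed to be defined):
#   source EnumerableSource, -> { Oj.load(IO.read(source_file)) }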

Ship a Kiba cookbook

While doing ETL consulting work (Kiba-based or not) over the past 6 months, I've gathered a list of common topics of interest.

I'm in the process of writing a cookbook (free to read online, paid to get a PDF version or show support).

This issue will track progress with regard to the cookbook.

Current outline

Not ordered by section or priority at this point; more of a brain-dump:

  • how-to-add-assertions-to-your-etl.md
  • how-to-add-continuous-integration-testing.md
  • how-to-add-external-monitoring-to-your-etl.md
  • how-to-add-logging.md
  • how-to-add-metrics-to-your-etl.md
  • how-to-avoid-concurrent-etl-processes.md
  • how-to-build-an-incremental-etl.md
  • how-to-create-tables-on-the-fly.md
  • how-to-get-errors-on-bad-exit.md
  • how-to-iterate-quickly-on-etl-code.md
  • how-to-join-2-sql-tables.md
  • how-to-make-transforms-more-reusable.md
  • how-to-pass-parameters-to-the-etl.md
  • how-to-profile-your-data.md (see #22)
  • how-to-save-your-eyes.md
  • how-to-send-an-email-report.md
  • how-to-sequence-multiple-kiba-scripts.md (see #26)
  • how-to-setup-end-to-end-etl-testing-with-mocks-and-env.md
  • how-to-structure-dsl-extensions.md
  • how-to-structure-your-etl-common-code.md (see #11)
  • how-to-unit-test-your-components.md
  • how-to-work-on-a-subset-of-your-data.md

Remaining steps

  • Find a writing platform (now using GitBook)
  • Sort out legal & finance bits
  • Find a way to sell it with limited overhead for me (as an EU, VAT-registered seller)
  • Publish first 3 chapters
  • Structure the rest of the book

Improve legal statement

I was just checking out Kiba - very interesting! - and the legal statement at the end caught my eye. It says "assign the copyright". However, in Germany (and maybe in other countries as well), copyright is a personal, non-waivable right belonging to the author. One can grant all rights to modify and distribute the contributed code, though.
Since your intention is to not prevent anyone from contributing code, I think it might be a good idea to put this into a form that is actually applicable to all possible contributors.
IANAL but I assume there should be some legal wordings out there (maybe in the BSD or MIT licenses) that have the intended effect.

The "kiba" binary command will be deprecated

(this is a roadmap notice for Kiba users, as much as a reminder for myself)

In 4 years of production use with Kiba, I've had a lot of time to investigate the various patterns of usage on real cases.

Up to v1.0.0, Kiba's way to run jobs was to use bundle exec kiba my_script.etl, which involves reading and evaluating the .etl file, as seen here:

kiba/lib/kiba/cli.rb

Lines 10 to 13 in 378aca8

filename = args[0]
script_content = IO.read(filename)
job_definition = Kiba.parse(script_content, filename)
Kiba.run(job_definition)

In the meantime, I've introduced official support for a programmatic API, initially added to allow in-process testing.

The programmatic API allows everything the "command" mode supports, plus much more, and actually encourages better coding practices.

For instance:

  • API mode allows passing live variables (rather than just ENV configuration from the command line or JSON configs from files)
  • Doing so permits wrapping resource open/close around a job run
  • API mode makes it easier to test an ETL process (via minitest/RSpec) directly in-process (which allows stubbing, webmock etc), rather than via a command call
  • API mode enforces use of clean modules with explicit loading, rather than polluting the top-level namespace with global methods (https://github.com/thbar/kiba/wiki/How-do-you-define-ETL-jobs-with-Kiba%3F)
  • API mode allows running jobs from Sidekiq or other background job systems, or from an HTTP call (if the job is fast), without waiting for a command-line binary to run - this supports more dynamic interactions (e.g. a job is created in reaction to an external event received via HTTP or a websocket)
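
For reference, a minimal sketch of the programmatic style in question (Kiba.parse and Kiba.run are the actual API; the component classes are illustrative placeholders):

require 'kiba'

job = Kiba.parse do
  source MyCsvSource, filename: 'input.csv'            # illustrative component
  transform { |row| row.merge(processed_at: Time.now) }
  destination MyCsvDestination, filename: 'output.csv' # illustrative component
end

Kiba.run(job)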

For all these reasons, I plan to deprecate the kiba command in an upcoming version (probably v3).

Ideally I'd like to provide a new gem that will bring back that functionality as opt-in, to ensure v3 can be retrofitted easily, but that gem will likely not be supported anymore when we reach v4.

Verify that BugSnag at_exit works with Kiba

A user has reported that the at_exit hook that you can use e.g. with BugSnag may not properly report errors when using the command line.

I haven't yet tried to reproduce this, but I consider this to be an important point to check.

Support sub-row declaration processing

At times a field can contain a blob of data that I have to process and convert into an array of cleaned values. It would be very nice to be able to use the exact same syntax on the field itself, something like:

transform do |r|
  r[:field] = [1, 3, 4, 8]
  r
end

sub_process(:field) do
  # this will get called for 1, 3, 4 and 8
  transform do |r|
    r * 10
  end
  transform do |row|
    row > 100 ? row : nil
  end
end

Allow transforms to yield more than one row

Currently and unlike activewarehouse-etl, it is not possible to yield multiple rows from a transform.

I'd like to implement such a feature because it would be useful, but I need to fully think about the consequences first. For instance, it could work by yielding an Array of a specific type (so that the row itself could be an Array too, without risk of collision).

sample installation

Hello,
When I try to follow the tutorial, I land on an error. Can you look into it and let me know where I have gone wrong?

Add support for "close" on transforms

Sources and destinations both support some form of "close":

  • sources because they are only called once via #each (so one can close as part of this)
  • destinations via #close

In contrast, transforms cannot currently close.

This will be interesting in particular with the Kiba v2 StreamingRunner, to implement some forms of buffering in the transform itself (e.g. keeping N records before doing a grouped query, or aggregating N records), in which case close is a way to ensure we flush the buffer.

See https://stackoverflow.com/questions/49422860/transforming-a-table-into-a-hash-of-sets-using-kiba-etl for an example of use.
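
A minimal sketch of what this could enable, assuming (as proposed) that close can yield remaining rows downstream via the StreamingRunner:

class BufferingTransform
  def initialize(buffer_size:)
    @buffer_size = buffer_size
    @buffer = []
  end

  def process(row)
    @buffer << row
    if @buffer.size >= @buffer_size
      @buffer.each { |r| yield r }
      @buffer.clear
    end
    nil # rows are re-emitted via yield above
  end

  def close
    # flush whatever remains in the buffer at end of stream
    @buffer.each { |r| yield r }
  end
end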

Question: All-Or-Nothing pipelines

Hi there...

I'm really excited to start using this gem and have a few questions...

I have a set of CSVs and/or XLSX docs that I'll be turning into AR objects and then writing to a DB. I was wondering if it's possible to wrap the call to destination MyActiveRecordDestination in a transaction block, as I only want to write to the destination if all records are successfully transformed and loaded.

In your opinion, would you do it this way via a transaction, or is there a better way to have an all-or-nothing pipeline?
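
For illustration, here is a sketch of the transaction approach I have in mind (ETL::ImportJob is a placeholder for my actual job):

ActiveRecord::Base.transaction do
  Kiba.run(ETL::ImportJob.setup)
  # any exception raised while transforming or loading rolls everything back
end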

I was thinking a second option would be to make sure each row is .valid? during the transform phase of the pipeline, and aborting if there are errors there, but in my case, that won't catch DB-level integrity failures.

I'm happy to contribute to the wiki with the outcomes of this issue.

Thank you so much for the great work on Kiba - I wish I knew about this project 6 months ago, but I'm glad I do now.

Multiple Transformations

Hello,
I am trying to run multiple transformations and load to multiple destinations in parallel. For example:
step I: transform sample_movies.csv into a better_movies.csv file with the average rating
step II: transform better_movies.csv into a rated_movies.csv containing only movies rated higher than 8.
When I try this logic, both destination .csv files get the same output.
Can you guide me on how to rectify this problem?
Please refer to the link below for the config files:
http://pastebin.com/NbgRDmTW

Transforming a complete DB

Hi,

in my ETL script I try to do something like this:

%w[table1 table2 table3].each do |t|
  source Source, 'db', t
  case t
  when 'table1'
    transform { ... }
  end
  destination Destination, 'db2', t
end

It looks like that is not how it's supposed to work. Do I have to define separate ETL jobs for each table?
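
For illustration, here is the separate-jobs alternative I'm considering (one Kiba.parse per table, reusing the Source/Destination classes above):

%w[table1 table2 table3].each do |t|
  job = Kiba.parse do
    source Source, 'db', t
    # per-table transforms can be declared conditionally here, based on t
    destination Destination, 'db2', t
  end
  Kiba.run(job)
end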

Cheers,
Christopher

Deprecate non-live Sequel connection passing

Today it is possible in Kiba Pro to pass a simple database URL string to SQL components (SQL Source, SQL single-record Upsert destination, SQL bulk insert/upsert destination), like this:

database: ENV.fetch('DATABASE_URL')

For the same reasons the kiba command line is getting deprecated, I'll instead encourage passing live connections here, created via a surrounding block:

Sequel.connect(xxx) do |db|
  Kiba.run(ETL::MyJob.setup(db: db))
end

This will allow automatic connection closing, which is a better pattern, and also brings more control over how the connection can be tweaked (e.g. extensions).

Keeping this note as a roadmap item.

Allow to decorate sources to normalize rows

Since #4 is harder than I first expected, I could still provide reusable normalization right after the source, quite easily. Potential syntax:

source Kiba::Normalizer, CsvSource, 'input.csv' do |row|
  tags = row.delete(:tags).split(',')
  tags.each { |tag| yield(row.deep_copy.merge(tag: tag)) }
end

This is less powerful than #4, but will still cover most of the cases.

Make StreamingRunner the default

The StreamingRunner introduced in #44 is much more powerful than the default runner.

I intend to 1/ make it the default and 2/ evaluate the potential removal of the current runner (but I'll have to carry out more CPU & memory benchmarks first).

Targeted for v3.0.0.

Support pre_process

Like post_process, it would be handy to be able to run a piece of processing right before the source starts being read.
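
A minimal sketch of the intended symmetry (post_process already exists; pre_process is what this issue requests):

Kiba.parse do
  pre_process { puts "Job starting at #{Time.now}" }
  # source / transform / destination declarations here
  post_process { puts "Job done at #{Time.now}" }
end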

Figure out how to package reusable assets

Types of assets:

  • commonly used sources, transforms and destinations (e.g.: CSVSource)
  • syntactic sugar like show_me!, limit(10) or field_transform(xxx)

These could be either part of kiba itself, or a kiba-common gem, or maybe separate repositories, sometimes provided by 3rd parties.

Consider command line arguments

Just creating an issue for now: I want to take the time to think this through.

The current way to pass arguments to tweak the settings of a given script is via ENV variables.

Another possibility would be to add args passing in the kiba CLI itself, like @bcherrington did in 97758bb and ca5240a.

I must also document this in all cases, since this is a fairly standard need.

1st attempt at a nice solution for progress bars

So I've started using kiba for some good ETL work, and I wrote a solution that inserts progress bars pretty easily without much fuss. I'm not sure if this is the best way to implement this, but it works.

And now, an example ETL. This is where all the ugliness is. It all revolves around the fact that one wants to know the total number of records (so the progress bar can display progress against a total), but putting this code in initialize isn't enough: it doesn't actually get called in the correct order. So I do this hack where I get the total out of the file independently, and then keep passing it around.

etl/customers-to-contacts.etl

#!/usr/bin/env ruby

require 'pry'
require 'ruby-progressbar' # provides ProgressBar.create below

require_relative 'SourceCustomers'
require_relative 'DestinationContacts'
require_relative 'Transformations'

source_file = 'C:\\NetClipDev\\CLIPv\\MAINDBF\\customer.dbf'

@progressbar = ProgressBar.create(title: "#{self.class.name}",
                                  total: SourceCustomers.count(source_file),
                                  autofinish: false)

pre_process do
  @progressbar.log '[PRE]'
end

source SourceCustomers, source_file, progresses_with: @progressbar, total: SourceCustomers.count(source_file)

transform SkipBadCustomers, '', progresses_with: @progressbar, total: SourceCustomers.count(source_file)

destination DestinationContacts, ''

post_process do
  @progressbar.log '[POST]'
end

SourceCustomers.rb

require 'dbf'
require 'pry'
require_relative './dbfutil'
require_relative './Progressive'

class SourceCustomers
  prepend Progressive
  @total = 0

  def initialize(input_file, options)
    @table ||= DBF::Table.new(input_file)
    @ignored = ['NOTES'] # fields to skip when yielding attributes
  end

  def self.count(handle)
    @table ||= DBF::Table.new(handle)
    @total = @table.record_count
  end

  def each
    @table.each do |row|
      progress!
      yield(row.attributes_ignoring(@ignored)) unless row.nil?
    end
    @table.close
    stop!
  end
end

Transformations.rb

class SkipBadCustomers
  prepend Progressive

  def initialize(handle, options)
  end

  def process(row)
    # drop the row when the customer number is -1
    row[:num] == -1 ? nil : row
  end
end

And last, the module. Note that you don't have to change anything to get started:

Progressive.rb

module Progressive
  def initialize(handle, options = {})
    @handle = handle
    @options = options
    @pb = @options[:progresses_with]
    @total = @options[:total]
    @format = '%t: |%B| %c/%C'
    @title = "#{self.class.name}"
    @pb.reset
    super
  end

  def each
    @pb.format   = @format
    @pb.title    = @title
    @pb.progress = 0
    @pb.total    = self.class.count(@handle)
    super
  end

  def process(row)
    if !@pb.nil? && @pb.progress == @pb.total
      stop!
      @pb.reset
      @pb.progress = 0
      @pb.format = @format
      @pb.title = @title
      @pb.total = @total
    end
    progress!
    super
  end

  def progress!
    @pb.increment unless @pb.nil? || @pb.progress == @pb.total
  end

  def stop!
    puts ''
  end
end

So, in this case I am applying the prepend Progressive to both a source and a transform class. This creates weird behavior though, namely: You have to wait twice as long overall, because you have to watch the progress bar go through both classes. If you remove this, it takes the expected amount of time, and you never see the SkipBadCustomers bar.

So in this case, you get something like:

$ bundle exec kiba etl/customers-to-contacts.etl
[PRE]
SourceCustomers: |================================================ | 14838/14842
SkipBadCustomers: |=============================================== | 14827/14842
[POST]

Again, this is rather inefficient. I think there's some funny stuff going on when I do this with the transform. I really wanted it to skip the few bad customers that are there (there should be ~ 10 of these records), and go really fast.

Note there is a bug that's hidden here. If you try to include the elapsed time or the ETA in the format string, it crashes ruby-progressbar. I don't know why, it seemed to be caused by ruby-progressbar's dependence on activerecord or activesupport for time.

Anyway, I'm sure there is room for improvement. Let me know what you think.

Execution lists

Hi,

First off, thank you so much for your hard work on Kiba. It's a wonderful tool that's making my life a lot easier. I am wondering if there is a way to write a bunch of ETL modules and then define the order in which they are run?

Example: I have 3 different tables that are being transformed into 5 tables. I have a specific order I need to run them in production, but while developing them I can do each table 1 at a time. I would like it if I could define a run list that executes them in a specific order. Thoughts? Thanks!
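
For illustration, something like this (a sketch; the module names are made up, and each module is assumed to expose a setup method returning a parsed job):

RUN_LIST = [ETL::Customers, ETL::Orders, ETL::Invoices]

RUN_LIST.each do |etl_module|
  Kiba.run(etl_module.setup)
end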

Improperly detects nil as block

Given

Kiba.run(Kiba.parse do
  source MySource
  destination MyDestination
  transform nil
end)

Kiba will throw the error "Class and block form cannot be used together at the moment", but nil is not a block; it should throw a distinct error saying that it found neither a block nor a transformer.

Support symbol as class for source/transform/destination

Hi,

Thanks for Kiba, which makes it much easier for us to manage data processing. When writing Kiba jobs, we noticed that we define the same classes for source/transform/destination many times, and the class names are not short to write or remember. So we tried adding this feature to make life better:

# config/initializers/kiba.rb
Kiba.register_sources :http => Kiba::Tanmer::Sources::HttpClient
Kiba.register_transforms :parse_doc => Kiba::Tanmer::Transforms::ParseDoc,
                         :xml_select => Kiba::Tanmer::Transforms::XMLSelector,
                         :link => Kiba::Tanmer::Transforms::LinkToHash
Kiba.register_destinations :model => Kiba::Tanmer::Destinations::ModelStore

# app/etls/my_job.etl
Kiba.parse do
  source :http, 'https://google.com'
  transform :parse_doc
  transform :xml_select, selector: 'a'
  transform :link
  destination :model, Link, key: :href
end

See changes: tanmer@325b23e

Would it be possible to merge this feature into Kiba?

Introduce Contributor Agreement

To encourage contributions all while securing the codebase from a legal standpoint (both for users & for myself), I'm in the process (with a lawyer) of designing a "contributor agreement" for kiba, kiba-common and kiba-pro.

This will be ready shortly & I will update this issue once it's done.

Agreeing to the "contributor agreement" will be required to allow the merge of a PR.

This supersedes #65.

Documentation rewrite

(if you read this, please do not work on this; it is already in the works. This issue is here to communicate with Kiba users.)

The patterns of use & recommended implementation guidelines have evolved quite a bit since Kiba v1 was released 4 years ago.

I'm rewriting the documentation from scratch to ensure we have a better newcomer experience and we can encourage the patterns that I've seen work in production.

Automatically `cd` into folder containing the `.etl` script at runtime?

When working with macro-ETL jobs that call system! on underlying jobs (e.g. for job chaining), with those jobs often contained in subfolders, a failing underlying job gives me a stacktrace which isn't super helpful:

from extract.etl:38:in `block (3 levels) in parse'

(it doesn't specify which file this is).

This happens because I cd into a folder and then run kiba from there, as part of the macro-job.

It could be interesting if, by default, kiba automatically cd'd to the folder containing the .etl script (e.g. using the Dir.chdir block form to get automatic rollback), in order to get a full path here (a bit like rake does today).
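
A minimal sketch of the Dir.chdir block form mentioned above (script_path is illustrative):

Dir.chdir(File.dirname(script_path)) do
  # run the job here; the previous working directory is restored on exit,
  # even if an exception is raised
end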

Not sure if it should become the default or not, or just a simple type of flag or extension.

Allow to pass parameters to run when called programmatically

When calling Kiba.run programmatically (which apparently a growing number of users are doing), it would be very useful to be able to pass parameters to configure the run.

Points to be investigated

  • Must think about the best way to implement this.
  • Some internal classes (Control) should be made public and documented properly, maybe renamed (JobDefinition?).
  • This will encourage calling run multiple times on the same Control, so I must consider the potential issues with this scenario (e.g. if you store some @state in the control, you'd have to re-initialize it)
  • When used from Sidekiq or multi-threaded setups in general, should we give advice to eager-load classes?
  • Will this leak memory or not?

Potential work-arounds for now

See 97758bb and ca5240a which may provide a way to do that now.
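
One pattern that already works is closing over parameters at parse time via a module function (a sketch; the component classes are illustrative):

module ETL
  module_function

  def setup(source_file:)
    Kiba.parse do
      # source_file is captured by the block and usable at parse time
      source MyCsvSource, filename: source_file
      destination MyCsvDestination, filename: 'output.csv'
    end
  end
end

Kiba.run(ETL.setup(source_file: 'input.csv'))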

Implementation points to consider

At this point I believe an input is mostly useful at parse time, for instance:

  • to loop over a passed variable and declare one source per item.
  • to use the input to feed a Source configuration.

We should also make sure there is ultimately a way to pass this from both API calls and the command line.

Error handling, logging, and alerting?

You should add other blocks like ensure and on_error, and flags such as stop_on_error, log, etc.

Sometimes it's OK to skip one bad row and let the rest of the file process; other times, not so much. Also, most ETL tends to run unattended, so there should be hooks for email alerts or for sending data to a monitor.
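
In the meantime, the work-around is wrapping the run by hand (a sketch; MyMailer is a stand-in for whatever alerting you use):

begin
  Kiba.run(job)
rescue => e
  MyMailer.alert("ETL failed: #{e.message}")
  raise
end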

Thanks.

Add shortcut for field transform

It's common to just transform one field like this:

transform do |r|
  r[:bought_for] = r[:bought_for].scan(/\[([^\]]+)\]/).flatten
  r
end

It could be written shorter like this:

# n = name, v = value, r = row like in aw-etl
field(:bought_for) { |n,v,r| v.scan(/\[([^\]]+)\]/).flatten }

I'm not sure yet whether I should pick field or field_transform.
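
A sketch of how this shortcut could be prototyped today as a DSL extension (the module and helper names are tentative):

module FieldDSL
  def field(name, &block)
    transform do |row|
      row[name] = block.call(name, row[name], row)
      row
    end
  end
end

# Usage:
# Kiba.parse do
#   extend FieldDSL
#   field(:bought_for) { |n, v, r| v.scan(/\[([^\]]+)\]/).flatten }
# end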

Ruby 2.7+ compatibility (keyword arguments)

Due to the recent changes announced here:

https://www.ruby-lang.org/en/news/2019/12/12/separation-of-positional-and-keyword-arguments-in-ruby-3-0/

(and tracked here initially https://bugs.ruby-lang.org/issues/14183)

it is possible to get warnings (and later with Ruby 2.8+, most likely errors), depending on how the components are implemented (and this is the case with kiba-common, for instance).

I'm opening this issue to track this down.

For now I would recommend sticking with Ruby 2.6 if possible.

Passing arguments on the command line

I think this would be useful but I don't currently see a way to do it. I'm wrapping Kiba scripts in Rake tasks for convenience and to be able to do other actions. So the paths to input/output file locations are in Rakefiles and duplicated in ETL files. Passing them in from the command line would be handy.

If you use Kiba, your testimonial would be appreciated! ❤️

Now that Kiba ETL v3 is out and documentation has been rewritten, I'm planning to redesign the Kiba ETL website. A /r/ruby reddit post the other day reminded me that a few up-to-date testimonials would be helpful on that website.

If you are using Kiba as a business, and want to help out, you can share here:

  • A few lines explaining the value Kiba or Kiba Pro provides to your company
  • Your company name and logo (with authorisation for use on the website)

(or alternatively, reach me by email at [email protected] if you prefer)

Many thanks in advance, as increased adoption will help me support Kiba!

Multithreaded

Currently a kiba job takes hours. It would be nice to reduce that with threading.

Profiling of data

What is the best way to generate a data profile using Kiba? Before I start building an ETL job, I like to determine various facts about the dataset: uniqueness, completeness, data types, null counts, formats, min, max, central tendency, possible keys, etc. If there is not an existing method in Kiba, are there any suggestions on creating a profile in a Ruby way? There are many commercial tools that can profile a dataset, but I want to move the profiling step into my Ruby codebase. Any thoughts? How do you all handle this now?
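
For instance, I was imagining a pass-through transform accumulating stats (a rough sketch, not an existing Kiba feature):

require 'set'

class ProfileTransform
  def initialize(stats)
    @stats = stats # a shared Hash, inspected after the run
  end

  def process(row)
    row.each do |field, value|
      s = @stats[field] ||= { count: 0, nils: 0, distinct: Set.new }
      s[:count] += 1
      s[:nils] += 1 if value.nil?
      s[:distinct] << value if s[:distinct].size < 10_000 # cap memory use
    end
    row # pass the row through untouched
  end
end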

Unable to yield inside transform block

The following code fails with etl.rb:X:in block (3 levels) in setup: no block given (yield) (LocalJumpError)

# etl.rb
module ETL
  module_function

  def setup(config = {})
    Kiba.parse do
      # ... input

      transform do |row|
        2.times do |i|
          yield({ foo: i })
        end
      end

      # ... output
    end
  end
end

This works, because a class transform's process method receives its own block to yield rows to (whereas the yield above refers to the block of the enclosing setup method, which was called without one, hence the LocalJumpError):

# etl.rb
class ExplodingTransform
  def process(row)
    2.times do |i|
      yield({ foo: i })
    end
  end
end

module ETL
  module_function

  def setup(config = {})
    Kiba.parse do
      # ... input

      transform ExplodingTransform

      # ... output
    end
  end
end
