meltanolabs / tap-csv

A Singer Tap for extracting data from CSV files built using the Meltano SDK.

License: Apache License 2.0

Python 100.00%

tap-csv's People

songpa cwegener mixulo youcruit klinskyc seajhawk bartman0 visch tyshkan bigfish807029946 sbalnojan siimtolk durableprogramming

tap-csv's Issues

Hard to find errors

(This is a partial dup of another issue, but I'll let that one focus on file encoding)

I'm getting an error when trying to meltano run - can you find it?

How about now?

If the green info text accurately matched the level=WARNING or level=CRITICAL it would be infinitely easier to figure out what's going on.

unable to get the version when invoking in Meltano

Hello,

I have installed the tap-csv using:
meltano add extractor tap-csv

It appears to successfully install. However when I use the command to invoke the tap and then pull the version, it's listed as an unrecognized argument:
meltano invoke tap-csv --version
tap-csv: error: unrecognized arguments: --version

I can still invoke and it gives me options, but none appear (at least from what I see here) to be able to provide this:

usage: tap-csv [-h] [-c CONFIG] [-s STATE] [-p PROPERTIES] [--catalog CATALOG] [-d]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Config file
  -s STATE, --state STATE
                        State file
  -p PROPERTIES, --properties PROPERTIES
                        Property selections
  --catalog CATALOG     Catalog file
  -d, --discover        Do schema discovery

Is there a change here to be aware of? Thanks!

INFO:tap-csv:Skipping deselected stream 'payers'.

Hi!
I first used the files: configuration in Meltano to load one file, 'patients.csv', and all was good.
Then I used csv_files_definition: to load two files, 'payers.csv' and 'patients.csv',
but then it gave me this message:

$ meltano run tap-csv target-postgres
2022-12-29T19:07:04.462545Z [warning  ] Failed to create symlink to 'meltano.exe': administrator privilege required
2022-12-29T19:07:04.593348Z [info     ] Environment 'dev' is active
2022-12-29T19:07:07.404553Z [info     ] INFO:tap-csv:Skipping deselected stream 'payers'. cmd_type=elb consumer=False name=tap-csv producer=True stdio=stderr string_id=tap-csv
2022-12-29T19:07:07.490515Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True

I checked this file in the Meltano directory
run > .meltano > run > tap-csv > state.json:

{
  "bookmarks": {
    "patients": {}
  }
}

I'm not sure how to make it select everything defined in the csv_definition.json file:

[
    {
      "entity": "patients",
      "path": "./extract/synthea/patients.csv",
      "keys": ["Id"],
      "encoding": "UTF-8"
    },
    {
      "entity": "payers",
      "path": "./extract/synthea/payers.csv",
      "keys": ["Id"],
      "encoding": "UTF-8"
    }

  ]

Any ideas?

Default to `utf-8` encoding

Slack thread https://meltano.slack.com/archives/C01TCRBBJD7/p1665147787434089

OG issue #44

Remove State Capability

Currently state is a listed capability, but we don't bookmark files. Try to implement state tracking for files in an OS-agnostic way.

Or just remove the state capability from the README and MeltanoHub and note that it doesn't manage state.
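
One possible shape for OS-agnostic file bookmarking is to compare each file's modification time against a stored value. The sketch below is only an illustration of that idea, not how tap-csv works: the function names and the bookmark layout (a plain dict keyed by path) are assumptions.

```python
import os
from typing import Dict, List


def files_to_sync(paths: List[str], bookmarks: Dict[str, float]) -> List[str]:
    """Return the files modified after their stored bookmark (hypothetical helper)."""
    pending = []
    for path in paths:
        # os.path.getmtime is available on Windows, macOS and Linux alike.
        mtime = os.path.getmtime(path)
        # Files without a bookmark are treated as never synced.
        if mtime > bookmarks.get(path, 0.0):
            pending.append(path)
    return pending


def update_bookmark(bookmarks: Dict[str, float], path: str) -> None:
    """Record the file's current modification time as its bookmark."""
    bookmarks[path] = os.path.getmtime(path)
```

A file is picked up again only when its modification time moves forward, so rewriting a file in place triggers a re-sync while an untouched file is skipped.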

fix: set proper types for metadata columns

Following up #125

It would be much more convenient to set the proper types in advance, since we know all of them ahead of time.

Implement State Capability

Related to discussions in #11, we should implement state here.

Improve config definitions and defaults

We extract configs like encoding and others throughout the code (e.g. encoding default definitions) and cast to the defaults. It would be better to do the default coalescing at the topmost level for consistency, to allow --about to return those defaults, and for better readability.

Also we define the files array in meltano.yml without the rich type definitions. We should update that to include the full output of --about --format=json.
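
As a rough illustration of coalescing defaults in one place, the sketch below merges a per-file config over a single defaults table. The `DEFAULTS` keys and the `resolve_file_config` name are hypothetical and chosen only for the example; they are not taken from the tap's code.

```python
from typing import Any, Dict

# Hypothetical defaults table; the real tap may use different keys and values.
DEFAULTS: Dict[str, Any] = {"encoding": "utf-8", "delimiter": ","}


def resolve_file_config(file_config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the file config with defaults filled in for missing or None values."""
    explicit = {key: value for key, value in file_config.items() if value is not None}
    # Later keys win, so explicit settings override the defaults.
    return {**DEFAULTS, **explicit}
```

With this layout the rest of the code can read `config["encoding"]` directly instead of repeating a fallback at each call site.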

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81

While testing Meltano I ran into this issue:

[info ] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5719: character maps to <undefined>

Why is this surfacing as [info] instead of error?
I wondered if this might happen because in the Pandas based extractor/loader I wrote before trying Meltano I had to play with encodings because of non-UTF8 chars.

I think a reasonable fix would be to add a config setting of encoding: for tap-csv and, if it is set, use the specified encoding in the call to open() in the get_rows() function in client.py, e.g.:
with open(file_path, "r", encoding=encoding) as f:
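
A slightly fuller sketch of that suggestion follows. It is an assumption about how such a function could look, not the actual get_rows() in client.py; the signature and the `encoding` parameter are illustrative.

```python
import csv
from typing import Iterator, List, Optional


def get_rows(file_path: str, encoding: Optional[str] = None) -> Iterator[List[str]]:
    """Yield CSV rows, decoding the file with the configured encoding."""
    # encoding=None falls back to the platform default, which on many Windows
    # setups is a 'charmap' codec such as cp1252 and can fail on some bytes.
    with open(file_path, "r", encoding=encoding, newline="") as f:
        yield from csv.reader(f)
```

Passing `encoding="utf-8"` or `encoding="latin-1"` here would correspond to the proposed encoding: setting.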

Need a way to handle files with transient columns

I've already shared this in the #windows slack channel, but I want to make sure that my use case isn't lost.

The files I'm loading have about 20% of the columns that come and go.

If I load one file with one set of columns, then another file that has a new column into the same table, the new column is ignored:
[info ] time=2022-07-19 16:35:06 name=tap-csv level=WARNING message=Property 'Column Name (2021))' was present in the 'table_meltano' stream but not found in catalog schema. Ignoring..

In a loader I wrote before trying Meltano, I handled the transient columns (20% of the columns) by using Pandas and combining them into a single json column in the dataframe before loading. Then I use dbt to run some SQL that extracts that data back out into its own, properly designed table for reporting.

I think I could figure out how to create my own tap-csv to implement my method of handling the transient columns, but I've probably spent much longer working on ELT than I should and need to move on to the phase of my project where I'm setting up Superset, Hex, and perhaps Lightdash so others can start exploring the data.
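
The approach described above (folding the transient columns into a single JSON column before loading) could look roughly like this without Pandas. The function name, the `stable_columns` argument and the `_extra` key are all hypothetical names used only for illustration.

```python
import json
from typing import Dict, Iterable


def fold_transient_columns(
    row: Dict[str, str],
    stable_columns: Iterable[str],
    extra_key: str = "_extra",
) -> Dict[str, str]:
    """Keep the stable columns and pack every other column into one JSON string."""
    stable = set(stable_columns)
    kept = {key: value for key, value in row.items() if key in stable}
    extras = {key: value for key, value in row.items() if key not in stable}
    # sort_keys gives a deterministic string regardless of column order.
    kept[extra_key] = json.dumps(extras, sort_keys=True)
    return kept
```

Every record then has the same set of properties, and the JSON column can be unpacked downstream (for example with dbt) into its own table.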

Fix bug / add feature for recursive folder scan

Documentation states “path: Local path to the file to be ingested. Note that this may be a directory, in which case all files in that directory and any of its subdirectories will be recursively processed”, but in the code I see the os.listdir function used to get the file list from the path (so only files in the top-level folder are processed).

As @pnadolny13 mentioned, the code was ported from the legacy version https://gitlab.com/meltano/tap-csv/-/blob/master/tap_csv/__init__.py#L35 and it seems like the recursion function is missing https://gitlab.com/meltano/tap-csv/-/blob/master/tap_csv/__init__.py#L46.
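
For comparison, a recursive listing can be written with os.walk, which descends into subdirectories where os.listdir does not. This is a minimal sketch of the documented behaviour, not the legacy function; `list_files_recursive` is a hypothetical name.

```python
import os
from typing import List


def list_files_recursive(path: str) -> List[str]:
    """Return all files under `path`, descending into subdirectories."""
    if os.path.isfile(path):
        # A single file path is returned as-is.
        return [path]
    found = []
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    # Sort for a stable processing order across platforms.
    return sorted(found)
```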

Discussion in Slack

Allow using tap-csv with Python 3.10

Issue Description

Allow using tap-csv in environments that use the standard Python.org interpreter version 3.10. Python.org released version 3.10 more than 4 months ago, and some environments have Python.org 3.10 as the only available interpreter. More and more environments will move to Python 3.10 as the only available interpreter in the very near future (e.g. Ubuntu will move to Python 3.10 as the default in less than 60 days with the release of Ubuntu 22.04 LTS).

Current behaviour

1. Launch an OSE that only has Python 3.10 as the default interpreter (at the time of writing this, that applies to Arch Linux)
2. Install meltano and create a new project with meltano init
3. Add the tap-csv extractor into the new project with meltano add extractor tap-csv
4. Installation of tap-csv fails with error:

Added extractor 'tap-csv' to your Meltano project
Variant:        meltanolabs (default)
Repository:     https://github.com/MeltanoLabs/tap-csv
Documentation:  https://hub.meltano.com/extractors/csv.html

Installing extractor 'tap-csv'...
Extractor 'tap-csv' could not be installed: failed to install plugin 'tap-csv'.
  Running command git clone --filter=blob:none --quiet https://github.com/MeltanoLabs/tap-csv.git /tmp/pip-req-build-47nr2w7s
ERROR: Package 'tap-csv' requires a different Python: 3.10.2 not in '<3.10,>=3.6.2'

Failed to install plugin(s)

Expected behaviour

Same as steps 1-3 above.
4. Installation of tap-csv succeeds

feat: add source file metadata columns

From a slack thread https://meltano.slack.com/archives/C01TCRBBJD7/p1679410813099339

It's sometimes helpful to have additional metadata about the files that the records were extracted from. There's precedent in s3-csv and sftp already.

I'd vote to leave this off by default to keep current behavior and allow it to be turned on using a config boolean. The other implementations do it by default, but I don't know the implications for existing users and whether it would cause problems if a new property started being extracted.
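
A minimal sketch of the opt-in behaviour proposed above: metadata is attached only when a boolean is set, so existing users see no new properties. The function name, the `enabled` flag and the column names are assumptions made for this example and may not match what the tap ends up using.

```python
import os
from datetime import datetime, timezone
from typing import Any, Dict


def add_file_metadata(
    record: Dict[str, Any],
    file_path: str,
    line_number: int,
    enabled: bool = False,
) -> Dict[str, Any]:
    """Attach source-file metadata to a record when the feature is enabled."""
    if not enabled:
        # Off by default: the record is returned unchanged.
        return record
    mtime = os.path.getmtime(file_path)
    return {
        **record,
        # Hypothetical column names, used here for illustration only.
        "_sdc_source_file": file_path,
        "_sdc_source_lineno": line_number,
        "_sdc_source_file_mtime": datetime.fromtimestamp(
            mtime, tz=timezone.utc
        ).isoformat(),
    }
```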

Support batch messaging

CSV separators & dev activity

hi @pnadolny13,

You seem to be the most active maintainer on this project so I took the liberty to ping you.
I am wondering if there is currently any dev activity on this project? Most of the commits appear to be dependabot generated.

If there is dev activity, I was wondering if there is anything going on with regards to adding support for other CSV separators, specifically ;.

Thanks in advance for your help.
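
For reference, Python's standard csv module already accepts a custom separator through its `delimiter` argument, so support for ; could plausibly be exposed as a config option. The helper below is a hypothetical sketch of that idea, not an existing tap-csv setting.

```python
import csv
import io
from typing import List


def read_delimited(text: str, delimiter: str = ",") -> List[List[str]]:
    """Parse CSV text using the given single-character separator."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```

Calling `read_delimited("a;b\n1;2\n", delimiter=";")` splits each line on the semicolon instead of the comma.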