
target-redshift's People

Contributors

alexandermann, awm33, limess, magrigoras-bd, pcorbel


target-redshift's Issues

Column Sizing

Hello all. I am working with target-redshift, but I was hoping to size columns more accurately.

I noticed target-redshift uses default_column_length, but this seems limited to setting all varchar columns to the same size.
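
For reference, the only sizing knob I can see today is this single target config key, applied to every varchar column (value illustrative):

{
  "default_column_length": 1000
}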

Do I need to add something into the tap's catalog.json?

It would be a great addition to implement some functionality for accurate column sizing.

Table comments breaking target

Hello everyone, I've been using target-postgres for almost a year and I'm now beginning to use target-redshift. Thank you for all the awesome work.

I'm having an issue when the target code tries to read the table comment metadata. I'm not sure why this is happening, but I was able to make it work by changing two lines of code; maybe by showing which lines I changed and my JSON schema, you can point me to what I'm getting wrong.

I changed the following in postgres.py, around line 710, from

 comment = cur.fetchone()[0]

if comment:
    try:
        comment_meta = json.loads(comment)
    except:
        self.LOGGER.exception('Could not load table comment metadata')
        raise
else:
    comment_meta = None

to

comment = cur.fetchone()
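# fetchone() returns None when the comment query matches no row,
# so guard before indexing into the result tuple.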
        
if comment and len(comment) > 0:
    comment = comment[0]
    try:
        comment_meta = json.loads(comment)
    except:
        self.LOGGER.exception('Could not load table comment metadata')
        raise
else:
    comment_meta = None


and my JSON schema is:

{ "streams": [{ "stream_alias": "batidas", "stream": "batidas", "tap_stream_id": "batidas", "schema": { "selected": true, "type": ["null", "object"], "properties": { "NSR": { "type": ["null", "string"] }, "NREP": { "type": ["null", "string"] }, "PIS": { "type": ["null", "string"] }, "Data": { "type": ["null", "string"] }, "Hora": { "type": ["null", "string"] }, "Latitude": { "type": ["null", "string"] }, "Longitude": { "type": ["null", "string"] } } }, "metadata": [{ "metadata": { "selected": true, "table-key-properties": ["NSR", "NREP", "PIS"] }, "breadcrumb": [] }] }] }

the error message is the following:

ERROR Exception writing records
Traceback (most recent call last):
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 302, in write_batch
    {'version': target_table_version})
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/sql_base.py", line 824, in write_batch_helper
    metadata)
  File "/home/vitor/projects/target-redshift/target_redshift/redshift.py", line 84, in upsert_table_helper
    log_schema_changes=log_schema_changes)
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/sql_base.py", line 400, in upsert_table_helper
    self.add_table(connection, table_path, table_name, _metadata)
  File "/home/vitor/projects/target-redshift/target_redshift/redshift.py", line 107, in add_table
    json_schema.make_nullable({'type': json_schema.BOOLEAN}))
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 727, in add_column_mapping
    metadata = self._get_table_metadata(cur, table_name)
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 713, in _get_table_metadata
    comment = cur.fetchone()[0]
TypeError: 'NoneType' object is not subscriptable
CRITICAL ('Exception writing records', TypeError("'NoneType' object is not subscriptable",))
Traceback (most recent call last):
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 302, in write_batch
    {'version': target_table_version})
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/sql_base.py", line 824, in write_batch_helper
    metadata)
  File "/home/vitor/projects/target-redshift/target_redshift/redshift.py", line 84, in upsert_table_helper
    log_schema_changes=log_schema_changes)
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/sql_base.py", line 400, in upsert_table_helper
    self.add_table(connection, table_path, table_name, _metadata)
  File "/home/vitor/projects/target-redshift/target_redshift/redshift.py", line 107, in add_table
    json_schema.make_nullable({'type': json_schema.BOOLEAN}))
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 727, in add_column_mapping
    metadata = self._get_table_metadata(cur, table_name)
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 713, in _get_table_metadata
    comment = cur.fetchone()[0]
TypeError: 'NoneType' object is not subscriptable

Thank you in advance!

Add support for Postgres array types

When using tap-postgres on a table with the string array type varchar[], the Singer schema message produced uses an sdc_recursive_string_array definition. target-redshift fails in this scenario with the error:

CRITICAL ('`schema` is an invalid JSON Schema instance: {...} is recursive')
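
The recursion the error refers to comes from the tap's self-referencing definition, which looks roughly like this (simplified; the exact shape tap-postgres emits may differ):

"definitions": {
  "sdc_recursive_string_array": {
    "type": ["null", "string", "array"],
    "items": {"$ref": "#/definitions/sdc_recursive_string_array"}
  }
}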

In the future it would be great to add support for Postgres array types.

Memory consumption increase

Expected Behavior

When running target-redshift for a big, long stream extract,
the target is able to load all the data with stable resource consumption.

Current Behavior

When running target-redshift==0.2.1 on one big, long stream extract on the Python Docker image python:3.7.5-buster,
the target is ultimately killed by the OS because of an OOM error.
The tap then hits a broken pipe (SIGPIPE), causing a BrokenPipeError when it calls singer.write_message():

    sys.stdout.flush()
BrokenPipeError: [Errno 32] Broken pipe

Possible Solution

Either we have something like an odd data shape that is causing us to hang onto old pointers/refs, or there's an honest-to-god memory leak.
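
One way to tell the two apart, as a hedged diagnostic sketch (not part of target-redshift itself), is to compare heap snapshots taken inside the target process between batches:

import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

def report_growth(top_n=10):
    """Print the allocation sites that have grown the most since the baseline."""
    current = tracemalloc.take_snapshot()
    for stat in current.compare_to(baseline, 'lineno')[:top_n]:
        print(stat)

If the same allocation sites keep growing batch after batch, that points to a genuine leak rather than a one-off data-shape problem.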

Steps to Reproduce

  1. Build the following Docker image with docker build -t memory_test .
FROM python:3.7.5-buster

RUN apt-get update --yes \
  && apt-get upgrade --yes \
  && apt-get install --yes \
    vim \
    htop

RUN pip install target-redshift==0.2.1
RUN echo '#!/bin/bash\n'\
'# Print the schema\n'\
'echo "{\"type\": \"SCHEMA\", \"stream\": \"engagements\", \"schema\": {\"properties\": {\"engagement_id\": {\"type\": [\"null\", \"integer\"]}, \"engagement\": {\"properties\": {\"id\": {\"type\": [\"null\", \"integer\"]}, \"portal_id\": {\"type\": [\"null\", \"integer\"]}, \"active\": {\"type\": [\"null\", \"boolean\"]}, \"created_at\": {\"type\": [\"null\", \"integer\"]}, \"last_updated\": {\"type\": [\"null\", \"integer\"]}, \"created_by\": {\"type\": [\"null\", \"integer\"]}, \"modified_by\": {\"type\": [\"null\", \"integer\"]}, \"owner_id\": {\"type\": [\"null\", \"integer\"]}, \"type\": {\"type\": [\"null\", \"string\"]}, \"timestamp\": {\"type\": [\"null\", \"integer\"]}, \"activity_type\": {\"type\": [\"null\", \"string\"]}}, \"type\": \"object\"}, \"associations\": {\"properties\": {\"contact_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"company_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"deal_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"owner_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"workflow_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"ticket_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"content_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"quote_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}}, \"type\": [\"null\", \"object\"]}, \"attachments\": {\"items\": {\"properties\": {\"id\": {\"type\": [\"null\", \"integer\"]}}, \"type\": \"object\"}, \"type\": [\"null\", \"array\"]}, \"metadata\": {\"properties\": {\"body\": {\"type\": [\"null\", \"string\"]}, \"from\": {\"properties\": {\"email\": {\"type\": [\"null\", \"string\"]}, \"first_name\": {\"type\": [\"null\", \"string\"]}, \"last_name\": {\"type\": [\"null\", \"string\"]}}, \"type\": [\"null\", \"object\"]}, \"to\": {\"items\": {\"properties\": {\"email\": {\"type\": [\"null\", \"string\"]}}, \"type\": \"object\"}, \"type\": [\"null\", \"array\"]}, \"cc\": {\"items\": {\"properties\": {\"email\": {\"type\": [\"null\", \"string\"]}}, \"type\": \"object\"}, \"type\": [\"null\", \"array\"]}, \"bcc\": {\"items\": {\"properties\": {\"email\": {\"type\": [\"null\", \"string\"]}}, \"type\": \"object\"}, \"type\": [\"null\", \"array\"]}, \"subject\": {\"type\": [\"null\", \"string\"]}, \"html\": {\"type\": [\"null\", \"string\"]}, \"text\": {\"type\": [\"null\", \"string\"]}, \"status\": {\"type\": [\"null\", \"string\"]}, \"for_object_type\": {\"type\": [\"null\", \"string\"]}, \"start_time\": {\"type\": [\"null\", \"integer\"]}, \"end_time\": {\"type\": [\"null\", \"integer\"]}, \"title\": {\"type\": [\"null\", \"string\"]}, \"to_number\": {\"type\": [\"null\", \"string\"]}, \"from_number\": {\"type\": [\"null\", \"string\"]}, \"external_id\": {\"type\": [\"null\", \"string\"]}, \"duration_milliseconds\": {\"type\": [\"null\", \"integer\"]}, \"external_account_id\": {\"type\": [\"null\", \"string\"]}, \"recording_url\": {\"format\": \"uri\", \"type\": [\"null\", \"string\"]}, \"disposition\": {\"type\": [\"null\", \"string\"]}}, \"type\": [\"null\", \"object\"]}}, \"type\": \"object\"}, \"key_properties\": [\"engagement_id\"]}"\n'\
'# Print an infinity of record\n'\
'let i=0 \n'\
'while true; do\n'\
'let i=i+1\n'\
'echo "{\"type\": \"RECORD\", \"stream\": \"engagements\", \"record\": {\"engagement\": {\"id\": ${i}, \"portal_id\": 123456, \"active\": true, \"created_at\": 1502097241136, \"last_updated\": 1566630888209, \"created_by\": 123456, \"modified_by\": 123456, \"owner_id\": 123456, \"type\": \"NOTE\", \"timestamp\": 1502097241136}, \"associations\": {\"contact_ids\": [], \"company_ids\": [], \"deal_ids\": [], \"owner_ids\": [123456], \"workflow_ids\": [], \"ticket_ids\": [], \"content_ids\": [], \"quote_ids\": []}, \"attachments\": [], \"metadata\": {\"body\": \"Hello im a body\"}, \"engagement_id\": ${i}, \"time_extracted\": \"2019-11-05T09:46:28.095053Z\"}}"\n'\
'done' >> /tmp/record_generator.sh

RUN chmod +x /tmp/record_generator.sh

RUN echo '{\n'\
'  "default_column_length": 65535,\n'\
'  "logging_level": "INFO",\n'\
'  "invalid_records_detect": false,\n'\
'  "invalid_records_threshold": 0,\n'\
'  "persist_empty_tables": true,\n'\
'  "max_batch_rows": 100000,\n'\
'  "redshift_host": "your_cluster",\n'\
'  "redshift_port": 5439,\n'\
'  "redshift_database": "your_db",\n'\
'  "redshift_username": "your_user",\n'\
'  "redshift_password": "your_password",\n'\
'  "redshift_schema": "backfill_hubspot",\n'\
'  "target_s3": {\n'\
'    "aws_access_key_id": "your_key",\n'\
'    "aws_secret_access_key": "your_access_key",\n'\
'    "bucket": "your_bucket",\n'\
'    "key_prefix": "your_prefix"\n'\
'  }\n'\
'}\n' >> /tmp/target_config.json
  2. Log into the container in interactive mode with docker run -it memory_test /bin/bash
  3. Update the /tmp/target_config.json file with your own credentials
  4. Run the pipeline in the background with nohup bash -c "/tmp/record_generator.sh | target-redshift --config /tmp/target_config.json" &
  5. Monitor the resource consumption with htop

Context (Environment)

I was trying to backfill a big stream (engagements from tap-hubspot), but the job always fails.

invalid_records_detect in config not being respected

According to the README, setting invalid_records_detect to false in the config should prevent the target from crashing when it encounters an invalid record. I've added it to my config file, but each time Redshift encounters a malformed row I get the following error and my pipeline crashes:

ERROR Exception writing records
Traceback (most recent call last):
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 186, in write_batch
    {'version': target_table_version})
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/sql_base.py", line 831, in write_batch_helper
    metadata)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 515, in write_table_batch
    csv_rows)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 136, in persist_csv_rows
    cur.execute(copy_sql)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.InternalError: Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.

CRITICAL ('Exception writing records', InternalError("Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.\n"))
Traceback (most recent call last):
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 186, in write_batch
    {'version': target_table_version})
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/sql_base.py", line 831, in write_batch_helper
    metadata)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 515, in write_table_batch
    csv_rows)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 136, in persist_csv_rows
    cur.execute(copy_sql)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.InternalError: Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/bin/target-redshift", line 11, in <module>
    sys.exit(cli())
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 54, in cli
    main(args.config)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 48, in main
    target_tools.main(redshift_target)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 25, in main
    stream_to_target(input_stream, target, config=config)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 69, in stream_to_target
    raise e
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 60, in stream_to_target
    _flush_streams(streams, target)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 88, in _flush_streams
    _flush_stream(target, stream_buffer)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 81, in _flush_stream
    target.write_batch(stream_buffer)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 58, in write_batch
    return PostgresTarget.write_batch(self, nullable_stream_buffer)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 195, in write_batch
    raise PostgresError(message, ex)
target_postgres.postgres.PostgresError: ('Exception writing records', InternalError("Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.\n"))

From stl_load_errors: Missing newline: Unexpected character 0xffffffbd found at location 34
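
For reference, the relevant part of my config looks roughly like this (mirroring the keys shown in the README; values illustrative):

{
  "invalid_records_detect": false,
  "invalid_records_threshold": 0
}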

Add support for patternProperties

Issue:
If a tap has some fields that are not explicitly declared, but are covered by a

"patternProperties": {
  ".+": {}
}

block, they won't be loaded into Redshift, whereas with another target (like target-csv) the fields are available.
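
For illustration, a minimal Singer SCHEMA message of this shape (stream and field names are hypothetical):

{
  "type": "SCHEMA",
  "stream": "issues",
  "key_properties": ["id"],
  "schema": {
    "type": "object",
    "properties": {
      "id": {"type": ["integer"]}
    },
    "patternProperties": {
      ".+": {}
    }
  }
}

A record like {"id": 1, "labels": "critical"} would then carry "labels" only through the pattern; target-csv keeps it, while target-redshift drops it.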

How to reproduce:

  • data.jsonl.txt
    (a file with a schema and an example record)

  • issues.csv.txt
    A CSV generated by the following command:
    cat data.jsonl | target-csv

Version:
Python 3.7.3
target-csv==0.3.0
target-redshift==0.0.7

Documentation:
The link to the target-csv flatten function

Integrate with Redshift Spectrum

Integrating with Redshift Spectrum will make this target very efficient.

Use Case
If you have a large amount of data that is not accessed frequently, it is more efficient to store it in S3 instead of keeping it loaded in Redshift. When the data needs to be queried, it can be queried with Redshift Spectrum, which provides really fast querying by leveraging MPP. In order for this to work, the data needs to be stored in S3 in a structured columnar format - I suggest Parquet. Moreover, these files can be gzipped, leading to more efficient storage without slowing down query times.

Enhancements
Redshift Spectrum is also able to query nested fields for files that are stored in S3 as Parquet (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-nested-data-create-table.html). Looking at the code, it seems that's not something currently provided, so this would be an additional enhancement.

Integration Steps:

  1. Instead of saving the files as CSVs in S3, store them as gzipped Parquet files
  2. Create an external table in Redshift that references the Parquet files in S3. To define a table we will need a schema; this can be specified by the user in the target config (along with a table name override).

Changes I foresee

  1. We will need to update target_tools to be able to append each batch of the stream to a Parquet file. This is doable using pandas and s3fs (see the sketch after this list), but we can look at other options as well.
  2. After the stream has been processed, we can take the parsed schema and create an external table. Creating the table should just be a matter of executing SQL on the cluster. This is already being done from what I can see, but we will need to update it to create an external table in the Redshift cluster. (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html)
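
A minimal sketch of step 1, assuming pandas + s3fs as suggested above (bucket, prefix, stream name, and the function itself are placeholders, not part of target-redshift today):

import pandas as pd

def write_batch_as_parquet(records, stream_name, bucket, prefix):
    """Write one batch of Singer records to S3 as a gzip-compressed Parquet file."""
    df = pd.DataFrame.from_records(records)
    key = f's3://{bucket}/{prefix}/{stream_name}/batch.parquet'
    # requires pyarrow (or fastparquet); pandas hands the s3:// path off to s3fs
    df.to_parquet(key, compression='gzip', index=False)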

I think this can make target-redshift very scalable. In fact, we may be able to separate out the intermediate code and create a target-s3-parquet that simply stores the stream as an S3 Parquet file (similar to the s3-csv tap).

Talked with @AlexanderMann about this in Slack.

Support for Redshift types like NUMERIC(20, 0)

Thanks for the work on this project! We're just trying out Singer for moving data from MySQL to Redshift. In MySQL we have a column type of bigint(18) unsigned. Some values in this column don't fit in Redshift's bigint column type, and we get errors like Overflow (Long valid range -9223372036854775808 to 9223372036854775807)

Typically we declare a Redshift column as NUMERIC(20, 0) to hold these values. Is there a way to tell target-redshift to use that type for a particular Redshift column?

Minor quality of life improvement

Could this column name be a bit more terse? Being the first column and very long, it makes exploratory SELECT *s out of the table annoying. Given the reliance on target-postgres and the performance characteristics of Redshift, I assume removing it or moving it to the end aren't simple changes to make.

CREATE_TABLE_INITIAL_COLUMN = '_sdc_target_redshift_create_table_placeholder'

ImportError: cannot import name 'Mapping' from 'collections'

I have a Shopify store and I am trying to read data from it using singer.io and load it into a Redshift cluster. I was able to run this successfully with PostgreSQL, but Redshift is throwing an error that I am unable to understand or fix. Any help or suggestions on how to get around this?

ImportError: cannot import name 'Mapping' from 'collections' (/usr/lib/python3.10/collections/__init__.py)

Below is my redshift_config.json:

{
   "redshift_host":"redshift-cluster-1.cepdr3dnmmcm.us-east-1.redshift.amazonaws.com",
   "redshift_port":5439,
   "redshift_database":"dev",
   "redshift_username":"awsuser",
   "redshift_password":"Password",
   "redshift_schema":"singer",
   "default_column_length":1000,
   "target_s3":{
      "aws_access_key_id":"*Aws key id*",
      "aws_secret_access_key":"*aws access key*",
      "bucket":"singer-redshift",
      "key_prefix":"__tmp"
   }
}

and when trying to run it I am getting this error:

root@e0e876755adf:/# ~/.virtualenvs/tap-shopify/bin/tap-shopify -c config.json  --catalog catalog.json | target-redshift --config redshift_config.json
INFO GET https://testnoman.myshopify.com/admin/api/2024-01/shop.json
Traceback (most recent call last):
  File "/usr/local/bin/target-redshift", line 5, in <module>
    from target_redshift import cli
  File "/usr/local/lib/python3.10/dist-packages/target_redshift/__init__.py", line 8, in <module>
    from target_redshift.s3 import S3
  File "/usr/local/lib/python3.10/dist-packages/target_redshift/s3.py", line 3, in <module>
    import boto3
  File "/usr/local/lib/python3.10/dist-packages/boto3/__init__.py", line 16, in <module>
    from boto3.session import Session
  File "/usr/local/lib/python3.10/dist-packages/boto3/session.py", line 17, in <module>
    import botocore.session
  File "/usr/local/lib/python3.10/dist-packages/botocore/session.py", line 29, in <module>
    import botocore.configloader
  File "/usr/local/lib/python3.10/dist-packages/botocore/configloader.py", line 19, in <module>
    from botocore.compat import six
  File "/usr/local/lib/python3.10/dist-packages/botocore/compat.py", line 25, in <module>
    from botocore.exceptions import MD5UnavailableError
  File "/usr/local/lib/python3.10/dist-packages/botocore/exceptions.py", line 15, in <module>
    from botocore.vendored import requests
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/__init__.py", line 58, in <module>
    from . import utils
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/utils.py", line 26, in <module>
    from .compat import parse_http_list as _parse_list_header
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/compat.py", line 7, in <module>
    from .packages import chardet
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/__init__.py", line 3, in <module>
    from . import urllib3
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/urllib3/__init__.py", line 10, in <module>
    from .connectionpool import (
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/urllib3/connectionpool.py", line 38, in <module>
    from .response import HTTPResponse
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/urllib3/response.py", line 9, in <module>
    from ._collections import HTTPHeaderDict
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/urllib3/_collections.py", line 1, in <module>
    from collections import Mapping, MutableMapping
ImportError: cannot import name 'Mapping' from 'collections' (/usr/lib/python3.10/collections/__init__.py)
INFO --> 200 OK 1517b
CRITICAL BrokenPipeError
Traceback (most recent call last):
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/tap_shopify/__init__.py", line 213, in main
    sync()
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/tap_shopify/__init__.py", line 149, in sync
    singer.write_schema(stream["tap_stream_id"],
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/singer/messages.py", line 264, in write_schema
    write_message(
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/singer/messages.py", line 227, in write_message
    sys.stdout.flush()
BrokenPipeError: [Errno 32] Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/.virtualenvs/tap-shopify/bin/tap-shopify", line 33, in <module>
    sys.exit(load_entry_point('tap-shopify==1.10.0', 'console_scripts', 'tap-shopify')())
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/singer/utils.py", line 235, in wrapped
    return fnc(*args, **kwargs)
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/tap_shopify/__init__.py", line 228, in main
    raise ShopifyError(exc) from exc
tap_shopify.exceptions.ShopifyError: BrokenPipeError

If I switch the target from Redshift to Postgres it works fine.

I am running Python 3.10.
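
For what it's worth, the bottom of the traceback is botocore's vendored copy of requests/urllib3 importing Mapping from collections; Python 3.10 removed those aliases, so the import fails before target-redshift's own code runs. A minimal illustration of the change (my guess, not verified here, is that upgrading boto3/botocore or running on Python 3.9 avoids it):

# Python 3.10 removed the old aliases from the collections namespace;
# the abstract base classes live in collections.abc instead.
from collections.abc import Mapping, MutableMapping  # works on Python 3.3+
# from collections import Mapping                     # ImportError on Python 3.10+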

CRITICAL cursor already closed / connection already closed

Hello all! I have been receiving an error while using target-redshift where the connection is established at the start of a run, but once the data has been gathered and is being prepared for upload, the run breaks because the connection has already been closed.

CRITICAL cursor already closed
psycopg2.OperationalError: SSL SYSCALL error: EOF detected
During handling of the above exception, another exception occurred:
psycopg2.InterfaceError: cursor already closed
During handling of the above exception, another exception occurred:
psycopg2.InterfaceError: connection already closed

If anyone could shine some light on how to resolve this issue that would be great!

connection closing before write

Currently I'm using tap-facebook from singer.io to read from the Facebook Marketing API and writing to Redshift using the target-redshift package. The connection to Redshift is established and the data is read, but before I get to the write step I'm getting this error:

Traceback (most recent call last):
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 129, in write_batch
    cur.execute('BEGIN;')
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.OperationalError: SSL SYSCALL error: EOF detected


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 48, in main
    target_tools.main(redshift_target)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 25, in main
    stream_to_target(input_stream, target, config=config)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 69, in stream_to_target
    raise e
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 63, in stream_to_target
    _flush_streams(streams, target, force=True)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 88, in _flush_streams
    _flush_stream(target, stream_buffer)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 81, in _flush_stream
    target.write_batch(stream_buffer)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 58, in write_batch
    return PostgresTarget.write_batch(self, nullable_stream_buffer)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 192, in write_batch
    cur.execute('ROLLBACK;')
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.InterfaceError: cursor already closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "venv--target-redshift/bin/target-redshift", line 10, in <module>
    sys.exit(cli())
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 54, in cli
    main(args.config)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 48, in main
    target_tools.main(redshift_target)
psycopg2.InterfaceError: connection already closed

Automatic varchar widening

For a better user experience and efficient handling of text data, target-redshift should implement automatic varchar widening: start each varchar column at a default max length (say 255) and automatically widen it as longer field values are observed in incoming batches. A sketch of the length calculation follows the steps below.

Steps:

  1. Start with a default max length for each new varchar column.
  2. On each new batch, for each table:
    a) Get the current length of each varchar column in the target table.
    b) At some point before updating the schema, find the max of the length of all the record values for every varchar column.
    c) If it’s past the limit (64k), warn or throw an Exception.
    d) If it's under the limit (64k), update the column max length during schema update.
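
A minimal sketch of the length calculation described above; the constants and function name are illustrative, not part of target-redshift today:

DEFAULT_LENGTH = 255
MAX_LENGTH = 65535  # Redshift VARCHAR upper bound

def widened_length(current_length, batch_values):
    """Return the VARCHAR length a column needs after seeing this batch (step 2)."""
    longest = max((len(v) for v in batch_values if v is not None), default=0)
    if longest > MAX_LENGTH:
        # step 2c: the value cannot fit even in VARCHAR(65535)
        raise ValueError('value of length {} exceeds VARCHAR({})'.format(longest, MAX_LENGTH))
    # step 2d: only ever widen, never shrink
    return max(current_length or DEFAULT_LENGTH, longest)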

Add "timestamp without time zone" as supported data type

Amazon Redshift has a data type "timestamp without time zone", which isn't supported by the target yet (only "timestamp with time zone" is supported via target-postgres):

Data Type      Aliases                        Description
TIMESTAMP      TIMESTAMP WITHOUT TIME ZONE    Date and time (without time zone)
TIMESTAMPTZ    TIMESTAMP WITH TIME ZONE       Date and time (with time zone)

Source: https://docs.aws.amazon.com/redshift/latest/dg/c_Supported_data_types.html

So when target-redshift scans the destination schema, the following error occurs:

File "/app/target/.venv/lib/python3.7/site-packages/target_postgres/postgres.py", line 782, in sql_type_to_json_schema
    raise PostgresError('Unsupported type `{}` in existing target table'.format(sql_type))
target_postgres.exceptions.PostgresError: Unsupported type `timestamp without time zone` in existing target table
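
A minimal sketch of the missing mapping; the function name and shape below are illustrative, and the real hook in target_postgres (sql_type_to_json_schema, seen in the traceback) may have a different signature:

def redshift_sql_type_to_json_schema(sql_type):
    """Map a Redshift timestamp column type to a JSON Schema snippet."""
    timestamp_types = (
        'timestamp without time zone',  # TIMESTAMP
        'timestamp with time zone',     # TIMESTAMPTZ
    )
    if sql_type in timestamp_types:
        return {'type': ['string'], 'format': 'date-time'}
    raise ValueError('Unsupported type `{}` in existing target table'.format(sql_type))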
