
target-redshift's People

Contributors

alexandermann, awm33, limess, magrigoras-bd, pcorbel


target-redshift's Issues

Column Sizing

Hello all. I am working with target-redshift, but I was hoping to size columns more accurately.

I noticed target-redshift uses default_column_length, but this seems limited to setting all varchar columns to the same size.
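
For reference, the only sizing knob I can see today is this single target config key, applied to every varchar column (value illustrative):

{
  "default_column_length": 1000
}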

Do I need to add something into the tap's catalog.json?

It would be a great addition to implement some functionality for accurate column sizing.

Table comments breaking target

Hello everyone, I've been using target-postgres for almost a year and I'm now beginning to use target-redshift. Thank you for all the awesome work.

I'm having an issue when the target code tries to read the table comment metadata. I'm not sure why this is happening, but I was able to make it work by changing two lines of code; maybe by showing which lines I changed and my JSON schema, you can point me to what I'm getting wrong.

I changed the following in postgres.py, around line 710, from

 comment = cur.fetchone()[0]

if comment:
    try:
        comment_meta = json.loads(comment)
    except:
        self.LOGGER.exception('Could not load table comment metadata')
        raise
else:
    comment_meta = None

to

comment = cur.fetchone()
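# fetchone() returns None when the comment query matches no row,
# so guard before indexing into the result tuple.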
        
if comment and len(comment) > 0:
    comment = comment[0]
    try:
        comment_meta = json.loads(comment)
    except:
        self.LOGGER.exception('Could not load table comment metadata')
        raise
else:
    comment_meta = None


and my JSON schema is:

{ "streams": [{ "stream_alias": "batidas", "stream": "batidas", "tap_stream_id": "batidas", "schema": { "selected": true, "type": ["null", "object"], "properties": { "NSR": { "type": ["null", "string"] }, "NREP": { "type": ["null", "string"] }, "PIS": { "type": ["null", "string"] }, "Data": { "type": ["null", "string"] }, "Hora": { "type": ["null", "string"] }, "Latitude": { "type": ["null", "string"] }, "Longitude": { "type": ["null", "string"] } } }, "metadata": [{ "metadata": { "selected": true, "table-key-properties": ["NSR", "NREP", "PIS"] }, "breadcrumb": [] }] }] }

the error message is the following:

ERROR Exception writing records
Traceback (most recent call last):
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 302, in write_batch
    {'version': target_table_version})
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/sql_base.py", line 824, in write_batch_helper
    metadata)
  File "/home/vitor/projects/target-redshift/target_redshift/redshift.py", line 84, in upsert_table_helper
    log_schema_changes=log_schema_changes)
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/sql_base.py", line 400, in upsert_table_helper
    self.add_table(connection, table_path, table_name, _metadata)
  File "/home/vitor/projects/target-redshift/target_redshift/redshift.py", line 107, in add_table
    json_schema.make_nullable({'type': json_schema.BOOLEAN}))
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 727, in add_column_mapping
    metadata = self._get_table_metadata(cur, table_name)
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 713, in _get_table_metadata
    comment = cur.fetchone()[0]
TypeError: 'NoneType' object is not subscriptable
CRITICAL ('Exception writing records', TypeError("'NoneType' object is not subscriptable",))
Traceback (most recent call last):
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 302, in write_batch
    {'version': target_table_version})
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/sql_base.py", line 824, in write_batch_helper
    metadata)
  File "/home/vitor/projects/target-redshift/target_redshift/redshift.py", line 84, in upsert_table_helper
    log_schema_changes=log_schema_changes)
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/sql_base.py", line 400, in upsert_table_helper
    self.add_table(connection, table_path, table_name, _metadata)
  File "/home/vitor/projects/target-redshift/target_redshift/redshift.py", line 107, in add_table
    json_schema.make_nullable({'type': json_schema.BOOLEAN}))
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 727, in add_column_mapping
    metadata = self._get_table_metadata(cur, table_name)
  File "/home/vitor/projects/target-redshift/venv/lib/python3.6/site-packages/singer_target_postgres-0.2.4-py3.6.egg/target_postgres/postgres.py", line 713, in _get_table_metadata
    comment = cur.fetchone()[0]
TypeError: 'NoneType' object is not subscriptable

Thank you in advance!

Add support for Postgres array types

When using tap-postgres on a table with the string array type varchar[], the Singer schema message produced uses an sdc_recursive_string_array definition. target-redshift fails in this scenario with the error:

CRITICAL ('`schema` is an invalid JSON Schema instance: {...} is recursive')
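
The recursion the error refers to comes from the tap's self-referencing definition, which looks roughly like this (simplified; the exact shape tap-postgres emits may differ):

"definitions": {
  "sdc_recursive_string_array": {
    "type": ["null", "string", "array"],
    "items": {"$ref": "#/definitions/sdc_recursive_string_array"}
  }
}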

In the future it would be great to add support for Postgres array types.

Memory consumption increase

Expected Behavior

When running target-redshift for a big, long stream extract,
the target is able to load all the data with stable resource consumption.

Current Behavior

When running target-redshift==0.2.1 on one big, long stream extract on the Python Docker image python:3.7.5-buster,
the target is ultimately killed by the OS because of an OOM error.
The tap then hits a broken pipe (SIGPIPE), causing a BrokenPipeError when it calls singer.write_message():

    sys.stdout.flush()
BrokenPipeError: [Errno 32] Broken pipe

Possible Solution

Either we have something like an odd data shape that is causing us to hang onto old pointers/refs, or there's an honest-to-god memory leak.
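
One way to tell the two apart, as a hedged diagnostic sketch (not part of target-redshift itself), is to compare heap snapshots taken inside the target process between batches:

import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

def report_growth(top_n=10):
    """Print the allocation sites that have grown the most since the baseline."""
    current = tracemalloc.take_snapshot()
    for stat in current.compare_to(baseline, 'lineno')[:top_n]:
        print(stat)

If the same allocation sites keep growing batch after batch, that points to a genuine leak rather than a one-off data-shape problem.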

Steps to Reproduce

  1. Build the following Docker image with docker build -t memory_test .
FROM python:3.7.5-buster

RUN apt-get update --yes \
  && apt-get upgrade --yes \
  && apt-get install --yes \
    vim \
    htop

RUN pip install target-redshift==0.2.1
RUN echo '#!/bin/bash\n'\
'# Print the schema\n'\
'echo "{\"type\": \"SCHEMA\", \"stream\": \"engagements\", \"schema\": {\"properties\": {\"engagement_id\": {\"type\": [\"null\", \"integer\"]}, \"engagement\": {\"properties\": {\"id\": {\"type\": [\"null\", \"integer\"]}, \"portal_id\": {\"type\": [\"null\", \"integer\"]}, \"active\": {\"type\": [\"null\", \"boolean\"]}, \"created_at\": {\"type\": [\"null\", \"integer\"]}, \"last_updated\": {\"type\": [\"null\", \"integer\"]}, \"created_by\": {\"type\": [\"null\", \"integer\"]}, \"modified_by\": {\"type\": [\"null\", \"integer\"]}, \"owner_id\": {\"type\": [\"null\", \"integer\"]}, \"type\": {\"type\": [\"null\", \"string\"]}, \"timestamp\": {\"type\": [\"null\", \"integer\"]}, \"activity_type\": {\"type\": [\"null\", \"string\"]}}, \"type\": \"object\"}, \"associations\": {\"properties\": {\"contact_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"company_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"deal_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"owner_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"workflow_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"ticket_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"content_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}, \"quote_ids\": {\"items\": {\"type\": [\"null\", \"integer\"]}, \"type\": [\"null\", \"array\"]}}, \"type\": [\"null\", \"object\"]}, \"attachments\": {\"items\": {\"properties\": {\"id\": {\"type\": [\"null\", \"integer\"]}}, \"type\": \"object\"}, \"type\": [\"null\", \"array\"]}, \"metadata\": {\"properties\": {\"body\": {\"type\": [\"null\", \"string\"]}, \"from\": {\"properties\": {\"email\": {\"type\": [\"null\", \"string\"]}, \"first_name\": {\"type\": [\"null\", \"string\"]}, \"last_name\": {\"type\": [\"null\", \"string\"]}}, \"type\": [\"null\", \"object\"]}, \"to\": {\"items\": {\"properties\": {\"email\": {\"type\": [\"null\", \"string\"]}}, \"type\": \"object\"}, \"type\": [\"null\", \"array\"]}, \"cc\": {\"items\": {\"properties\": {\"email\": {\"type\": [\"null\", \"string\"]}}, \"type\": \"object\"}, \"type\": [\"null\", \"array\"]}, \"bcc\": {\"items\": {\"properties\": {\"email\": {\"type\": [\"null\", \"string\"]}}, \"type\": \"object\"}, \"type\": [\"null\", \"array\"]}, \"subject\": {\"type\": [\"null\", \"string\"]}, \"html\": {\"type\": [\"null\", \"string\"]}, \"text\": {\"type\": [\"null\", \"string\"]}, \"status\": {\"type\": [\"null\", \"string\"]}, \"for_object_type\": {\"type\": [\"null\", \"string\"]}, \"start_time\": {\"type\": [\"null\", \"integer\"]}, \"end_time\": {\"type\": [\"null\", \"integer\"]}, \"title\": {\"type\": [\"null\", \"string\"]}, \"to_number\": {\"type\": [\"null\", \"string\"]}, \"from_number\": {\"type\": [\"null\", \"string\"]}, \"external_id\": {\"type\": [\"null\", \"string\"]}, \"duration_milliseconds\": {\"type\": [\"null\", \"integer\"]}, \"external_account_id\": {\"type\": [\"null\", \"string\"]}, \"recording_url\": {\"format\": \"uri\", \"type\": [\"null\", \"string\"]}, \"disposition\": {\"type\": [\"null\", \"string\"]}}, \"type\": [\"null\", \"object\"]}}, \"type\": \"object\"}, \"key_properties\": [\"engagement_id\"]}"\n'\
'# Print an infinity of record\n'\
'let i=0 \n'\
'while true; do\n'\
'let i=i+1\n'\
'echo "{\"type\": \"RECORD\", \"stream\": \"engagements\", \"record\": {\"engagement\": {\"id\": ${i}, \"portal_id\": 123456, \"active\": true, \"created_at\": 1502097241136, \"last_updated\": 1566630888209, \"created_by\": 123456, \"modified_by\": 123456, \"owner_id\": 123456, \"type\": \"NOTE\", \"timestamp\": 1502097241136}, \"associations\": {\"contact_ids\": [], \"company_ids\": [], \"deal_ids\": [], \"owner_ids\": [123456], \"workflow_ids\": [], \"ticket_ids\": [], \"content_ids\": [], \"quote_ids\": []}, \"attachments\": [], \"metadata\": {\"body\": \"Hello im a body\"}, \"engagement_id\": ${i}, \"time_extracted\": \"2019-11-05T09:46:28.095053Z\"}}"\n'\
'done' >> /tmp/record_generator.sh

RUN chmod +x /tmp/record_generator.sh

RUN echo '{\n'\
'  "default_column_length": 65535,\n'\
'  "logging_level": "INFO",\n'\
'  "invalid_records_detect": false,\n'\
'  "invalid_records_threshold": 0,\n'\
'  "persist_empty_tables": true,\n'\
'  "max_batch_rows": 100000,\n'\
'  "redshift_host": "your_cluster",\n'\
'  "redshift_port": 5439,\n'\
'  "redshift_database": "your_db",\n'\
'  "redshift_username": "your_user",\n'\
'  "redshift_password": "your_password",\n'\
'  "redshift_schema": "backfill_hubspot",\n'\
'  "target_s3": {\n'\
'    "aws_access_key_id": "your_key",\n'\
'    "aws_secret_access_key": "your_access_key",\n'\
'    "bucket": "your_bucket",\n'\
'    "key_prefix": "your_prefix"\n'\
'  }\n'\
'}\n' >> /tmp/target_config.json
  2. Log into the container in interactive mode with docker run -it memory_test /bin/bash
  3. Update the /tmp/target_config.json file with your own credentials
  4. Run the pipeline in the background with nohup bash -c "/tmp/record_generator.sh | target-redshift --config /tmp/target_config.json" &
  5. Monitor the resource consumption with htop

Context (Environment)

I was trying to backfill a big stream (engagements from tap-hubspot), but the job always fails.

invalid_records_detect in config not being respected

According to the README, setting invalid_records_detect to false in the config should prevent the target from crashing when it encounters an invalid record. I've added it to my config file, but each time Redshift encounters a malformed row I get the following error and my pipeline crashes:

ERROR Exception writing records
Traceback (most recent call last):
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 186, in write_batch
    {'version': target_table_version})
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/sql_base.py", line 831, in write_batch_helper
    metadata)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 515, in write_table_batch
    csv_rows)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 136, in persist_csv_rows
    cur.execute(copy_sql)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.InternalError: Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.

CRITICAL ('Exception writing records', InternalError("Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.\n"))
Traceback (most recent call last):
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 186, in write_batch
    {'version': target_table_version})
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/sql_base.py", line 831, in write_batch_helper
    metadata)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 515, in write_table_batch
    csv_rows)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 136, in persist_csv_rows
    cur.execute(copy_sql)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.InternalError: Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/bin/target-redshift", line 11, in <module>
    sys.exit(cli())
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 54, in cli
    main(args.config)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 48, in main
    target_tools.main(redshift_target)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 25, in main
    stream_to_target(input_stream, target, config=config)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 69, in stream_to_target
    raise e
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 60, in stream_to_target
    _flush_streams(streams, target)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 88, in _flush_streams
    _flush_stream(target, stream_buffer)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 81, in _flush_stream
    target.write_batch(stream_buffer)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 58, in write_batch
    return PostgresTarget.write_batch(self, nullable_stream_buffer)
  File "/Users/dorcielovinsky/.virtualenvs/target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 195, in write_batch
    raise PostgresError(message, ex)
target_postgres.postgres.PostgresError: ('Exception writing records', InternalError("Load into table 'tmp_5c96ed60_77f8_4b70_8246_6d8dc5e83f0d' failed.  Check 'stl_load_errors' system table for details.\n"))

From stl_load_errors: Missing newline: Unexpected character 0xffffffbd found at location 34
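
For reference, the relevant part of my config looks roughly like this (mirroring the keys shown in the README; values illustrative):

{
  "invalid_records_detect": false,
  "invalid_records_threshold": 0
}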

Add support for patternProperties

Issue:
If a tap has some fields that are not explicitly declared, but are covered by a

"patternProperties": {
  ".+": {}
}

block, they won't be loaded into Redshift, whereas with another target (like target-csv) the fields are available.
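
For illustration, a minimal Singer SCHEMA message of this shape (stream and field names are hypothetical):

{
  "type": "SCHEMA",
  "stream": "issues",
  "key_properties": ["id"],
  "schema": {
    "type": "object",
    "properties": {
      "id": {"type": ["integer"]}
    },
    "patternProperties": {
      ".+": {}
    }
  }
}

A record like {"id": 1, "labels": "critical"} would then carry "labels" only through the pattern; target-csv keeps it, while target-redshift drops it.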

How to reproduce:

  • data.jsonl.txt
    (a file with a schema and an example record)

  • issues.csv.txt
    A CSV generated by the following command:
    cat data.jsonl | target-csv

Version:
Python 3.7.3
target-csv==0.3.0
target-redshift==0.0.7

Documentation:
The link to the target-csv flatten function

Integrate with Redshift Spectrum

Integrating with Redshift Spectrum will make this target very efficient.

Use Case
If you have a large amount of data that is not accessed frequently, it is more efficient to store it in S3 instead of keeping it loaded in Redshift. When the data needs to be queried, it can be queried with Redshift Spectrum, which provides really fast querying by leveraging MPP. In order for this to work, the data needs to be stored in S3 in a structured columnar format - I suggest Parquet. Moreover, these files can be gzipped, leading to more efficient storage without slowing down query times.

Enhancements
Redshift Spectrum is also able to query nested fields for files that are stored in S3 as Parquet (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-nested-data-create-table.html). Looking at the code, it seems that's not something currently provided, so this would be an additional enhancement.

Integration Steps:

  1. Instead of saving the files as CSVs in S3, store them as gzipped Parquet files
  2. Create an external table in Redshift that references the Parquet files in S3. To define a table we will need a schema; this can be specified by the user in the target config (along with a table name override).

Changes I foresee

  1. We will need to update target_tools to be able to append each batch of the stream to a Parquet file. This is doable using pandas and s3fs (see the sketch after this list), but we can look at other options as well.
  2. After the stream has been processed, we can take the parsed schema and create an external table. Creating the table should just be a matter of executing SQL on the cluster. This is already being done from what I can see, but we will need to update it to create an external table in the Redshift cluster. (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html)
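
A minimal sketch of step 1, assuming pandas + s3fs as suggested above (bucket, prefix, stream name, and the function itself are placeholders, not part of target-redshift today):

import pandas as pd

def write_batch_as_parquet(records, stream_name, bucket, prefix):
    """Write one batch of Singer records to S3 as a gzip-compressed Parquet file."""
    df = pd.DataFrame.from_records(records)
    key = f's3://{bucket}/{prefix}/{stream_name}/batch.parquet'
    # requires pyarrow (or fastparquet); pandas hands the s3:// path off to s3fs
    df.to_parquet(key, compression='gzip', index=False)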

I think this can make target-redshift very scalable. In fact, we may be able to separate out the intermediate code and create a target-s3-parquet that simply stores the stream as an S3 Parquet file (similar to the s3-csv tap).

Talked with @AlexanderMann about this in Slack.

Support for Redshift types like NUMERIC(20, 0)

Thanks for the work on this project! We're just trying out Singer for moving data from MySQL to Redshift. In MySQL we have a column type of bigint(18) unsigned. Some values in this column don't fit in Redshift's bigint column type, and we get errors like Overflow (Long valid range -9223372036854775808 to 9223372036854775807)

Typically we declare a Redshift column as NUMERIC(20, 0) to hold these values. Is there a way to tell target-redshift to use that type for a particular Redshift column?

Minor quality of life improvement

Could this column name be a bit more terse? Being the first column and very long, it makes exploratory SELECT *s out of the table annoying. Given the reliance on target-postgres and the performance characteristics of Redshift, I assume removing it or moving it to the end aren't simple changes to make.

CREATE_TABLE_INITIAL_COLUMN = '_sdc_target_redshift_create_table_placeholder'

ImportError: cannot import name 'Mapping' from 'collections'

I have a Shopify store and I am trying to read data from it using singer.io and load it into a Redshift cluster. I was able to run this successfully with PostgreSQL, but Redshift is throwing an error that I am unable to understand or fix. Any help or suggestions on how to get around this?

ImportError: cannot import name 'Mapping' from 'collections' (/usr/lib/python3.10/collections/__init__.py)

Below is my redshift_config.json:

{
   "redshift_host":"redshift-cluster-1.cepdr3dnmmcm.us-east-1.redshift.amazonaws.com",
   "redshift_port":5439,
   "redshift_database":"dev",
   "redshift_username":"awsuser",
   "redshift_password":"Password",
   "redshift_schema":"singer",
   "default_column_length":1000,
   "target_s3":{
      "aws_access_key_id":"*Aws key id*",
      "aws_secret_access_key":"*aws access key*",
      "bucket":"singer-redshift",
      "key_prefix":"__tmp"
   }
}

and when trying to run it I am getting this error:

root@e0e876755adf:/# ~/.virtualenvs/tap-shopify/bin/tap-shopify -c config.json  --catalog catalog.json | target-redshift --config redshift_config.json
INFO GET https://testnoman.myshopify.com/admin/api/2024-01/shop.json
Traceback (most recent call last):
  File "/usr/local/bin/target-redshift", line 5, in <module>
    from target_redshift import cli
  File "/usr/local/lib/python3.10/dist-packages/target_redshift/__init__.py", line 8, in <module>
    from target_redshift.s3 import S3
  File "/usr/local/lib/python3.10/dist-packages/target_redshift/s3.py", line 3, in <module>
    import boto3
  File "/usr/local/lib/python3.10/dist-packages/boto3/__init__.py", line 16, in <module>
    from boto3.session import Session
  File "/usr/local/lib/python3.10/dist-packages/boto3/session.py", line 17, in <module>
    import botocore.session
  File "/usr/local/lib/python3.10/dist-packages/botocore/session.py", line 29, in <module>
    import botocore.configloader
  File "/usr/local/lib/python3.10/dist-packages/botocore/configloader.py", line 19, in <module>
    from botocore.compat import six
  File "/usr/local/lib/python3.10/dist-packages/botocore/compat.py", line 25, in <module>
    from botocore.exceptions import MD5UnavailableError
  File "/usr/local/lib/python3.10/dist-packages/botocore/exceptions.py", line 15, in <module>
    from botocore.vendored import requests
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/__init__.py", line 58, in <module>
    from . import utils
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/utils.py", line 26, in <module>
    from .compat import parse_http_list as _parse_list_header
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/compat.py", line 7, in <module>
    from .packages import chardet
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/__init__.py", line 3, in <module>
    from . import urllib3
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/urllib3/__init__.py", line 10, in <module>
    from .connectionpool import (
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/urllib3/connectionpool.py", line 38, in <module>
    from .response import HTTPResponse
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/urllib3/response.py", line 9, in <module>
    from ._collections import HTTPHeaderDict
  File "/usr/local/lib/python3.10/dist-packages/botocore/vendored/requests/packages/urllib3/_collections.py", line 1, in <module>
    from collections import Mapping, MutableMapping
ImportError: cannot import name 'Mapping' from 'collections' (/usr/lib/python3.10/collections/__init__.py)
INFO --> 200 OK 1517b
CRITICAL BrokenPipeError
Traceback (most recent call last):
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/tap_shopify/__init__.py", line 213, in main
    sync()
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/tap_shopify/__init__.py", line 149, in sync
    singer.write_schema(stream["tap_stream_id"],
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/singer/messages.py", line 264, in write_schema
    write_message(
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/singer/messages.py", line 227, in write_message
    sys.stdout.flush()
BrokenPipeError: [Errno 32] Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/.virtualenvs/tap-shopify/bin/tap-shopify", line 33, in <module>
    sys.exit(load_entry_point('tap-shopify==1.10.0', 'console_scripts', 'tap-shopify')())
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/singer/utils.py", line 235, in wrapped
    return fnc(*args, **kwargs)
  File "/root/.virtualenvs/tap-shopify/lib/python3.10/site-packages/tap_shopify/__init__.py", line 228, in main
    raise ShopifyError(exc) from exc
tap_shopify.exceptions.ShopifyError: BrokenPipeError

If I switch the target from Redshift to Postgres it works fine.

I am running Python 3.10.
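
For what it's worth, the bottom of the traceback is botocore's vendored copy of requests/urllib3 importing Mapping from collections; Python 3.10 removed those aliases, so the import fails before target-redshift's own code runs. A minimal illustration of the change (my guess, not verified here, is that upgrading boto3/botocore or running on Python 3.9 avoids it):

# Python 3.10 removed the old aliases from the collections namespace;
# the abstract base classes live in collections.abc instead.
from collections.abc import Mapping, MutableMapping  # works on Python 3.3+
# from collections import Mapping                     # ImportError on Python 3.10+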

CRITICAL cursor already closed / connection already closed

Hello all! I have been receiving an error while using target-redshift where the connection is established at the start of a run, but once the data has been gathered and is being prepared for upload, the run breaks because the connection has already been closed.

CRITICAL cursor already closed
psycopg2.OperationalError: SSL SYSCALL error: EOF detected
During handling of the above exception, another exception occurred:
psycopg2.InterfaceError: cursor already closed
During handling of the above exception, another exception occurred:
psycopg2.InterfaceError: connection already closed

If anyone could shine some light on how to resolve this issue that would be great!

connection closing before write

Currently I'm using tap-facebook from singer.io to read from the Facebook Marketing API and writing to Redshift using the target-redshift package. The connection to Redshift is established and the data is read, but before I get to the write step I'm getting this error:

Traceback (most recent call last):
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 129, in write_batch
    cur.execute('BEGIN;')
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.OperationalError: SSL SYSCALL error: EOF detected


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 48, in main
    target_tools.main(redshift_target)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 25, in main
    stream_to_target(input_stream, target, config=config)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 69, in stream_to_target
    raise e
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 63, in stream_to_target
    _flush_streams(streams, target, force=True)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 88, in _flush_streams
    _flush_stream(target, stream_buffer)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/target_tools.py", line 81, in _flush_stream
    target.write_batch(stream_buffer)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_redshift/redshift.py", line 58, in write_batch
    return PostgresTarget.write_batch(self, nullable_stream_buffer)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 192, in write_batch
    cur.execute('ROLLBACK;')
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_postgres/postgres.py", line 53, in execute
    return super(_MillisLoggingCursor, self).execute(query, vars)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/psycopg2/extras.py", line 440, in execute
    return super(LoggingCursor, self).execute(query, vars)
psycopg2.InterfaceError: cursor already closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "venv--target-redshift/bin/target-redshift", line 10, in <module>
    sys.exit(cli())
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 54, in cli
    main(args.config)
  File "/singer.io/venv--target-redshift/lib/python3.7/site-packages/target_redshift/__init__.py", line 48, in main
    target_tools.main(redshift_target)
psycopg2.InterfaceError: connection already closed

Automatic varchar widening

For a better user experience and efficient handling of text data, target-redshift should implement automatic varchar widening: start each varchar column at a default max length (say 255) and automatically widen it as longer field values are observed in incoming batches. A sketch of the length calculation follows the steps below.

Steps:

  1. Start with a default max length for each new varchar column.
  2. On each new batch, for each table:
    a) Get the current length of each varchar column in the target table.
    b) At some point before updating the schema, find the max of the length of all the record values for every varchar column.
    c) If it’s past the limit (64k), warn or throw an Exception.
    d) If it's under the limit (64k), update the column max length during schema update.
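
A minimal sketch of the length calculation described above; the constants and function name are illustrative, not part of target-redshift today:

DEFAULT_LENGTH = 255
MAX_LENGTH = 65535  # Redshift VARCHAR upper bound

def widened_length(current_length, batch_values):
    """Return the VARCHAR length a column needs after seeing this batch (step 2)."""
    longest = max((len(v) for v in batch_values if v is not None), default=0)
    if longest > MAX_LENGTH:
        # step 2c: the value cannot fit even in VARCHAR(65535)
        raise ValueError('value of length {} exceeds VARCHAR({})'.format(longest, MAX_LENGTH))
    # step 2d: only ever widen, never shrink
    return max(current_length or DEFAULT_LENGTH, longest)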

Add "timestamp without time zone" as supported data type

Amazon Redshift has a data type "timestamp without time zone", which isn't supported by the target yet (only "timestamp with time zone" is supported via target-postgres):

Data Type      Aliases                        Description
TIMESTAMP      TIMESTAMP WITHOUT TIME ZONE    Date and time (without time zone)
TIMESTAMPTZ    TIMESTAMP WITH TIME ZONE       Date and time (with time zone)

Source: https://docs.aws.amazon.com/redshift/latest/dg/c_Supported_data_types.html

So when target-redshift scans the destination schema, the following error occurs:

File "/app/target/.venv/lib/python3.7/site-packages/target_postgres/postgres.py", line 782, in sql_type_to_json_schema
    raise PostgresError('Unsupported type `{}` in existing target table'.format(sql_type))
target_postgres.exceptions.PostgresError: Unsupported type `timestamp without time zone` in existing target table
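
A minimal sketch of the missing mapping; the function name and shape below are illustrative, and the real hook in target_postgres (sql_type_to_json_schema, seen in the traceback) may have a different signature:

def redshift_sql_type_to_json_schema(sql_type):
    """Map a Redshift timestamp column type to a JSON Schema snippet."""
    timestamp_types = (
        'timestamp without time zone',  # TIMESTAMP
        'timestamp with time zone',     # TIMESTAMPTZ
    )
    if sql_type in timestamp_types:
        return {'type': ['string'], 'format': 'date-time'}
    raise ValueError('Unsupported type `{}` in existing target table'.format(sql_type))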
