
awslabs / aws-athena-query-federation

The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own data sources and code.

License: Apache License 2.0

Java 98.52% Shell 0.42% JavaScript 0.11% StringTemplate 0.04% Python 0.54% Dockerfile 0.01% TypeScript 0.35%

aws-athena-query-federation's Introduction

Amazon Athena Query Federation

Build Status

The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own code. This enables you to integrate with new data sources, proprietary data formats, or build in new user defined functions. Initially these customizations will be limited to the parts of a query that occur during a TableScan operation but will eventually be expanded to include other parts of the query lifecycle using the same easy to understand interface.

Athena Federated Queries are now available where Athena is supported. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.

tl;dr Get Started:

  1. Ensure you have the proper permissions/policies to deploy/use Athena Federated Queries
  2. Navigate to the Serverless Application Repository and search for "athena-federation". Be sure to check the box to show entries that require custom IAM roles.
  3. Look for entries published by the "Amazon Athena Federation" author.
  4. Deploy the application
  5. To use Federated Queries, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.
  6. Run a query "show databases in `lambda:<func_name>`" where <func_name> is the name of the Lambda function you deployed in the previous steps.

For more information please consult:

  1. Intro Video
  2. SDK ReadMe
  3. Quick Start Guide
  4. Available Connectors
  5. Federation Features
  6. How To Build A Connector or UDF
  7. Gathering diagnostic info for support
  8. Frequently Asked Questions
  9. Common Problems
  10. Installation Pre-requisites
  11. Known Limitations & Open Issues
  12. Predicate Pushdown How-To
  13. Our Github Wiki.
  14. Java Doc

Architecture Image

We've written integrations with more than 20 databases, storage formats, and live APIs in order to refine this interface and balance flexibility with ease of use. We hope that making this SDK and initial set of connectors Open Source will allow us to continue to improve the experience and performance of Athena Query Federation.

Serverless Big Data Using AWS Lambda

Architecture Image

Queries That Span Data Stores

Imagine a hypothetical e-commerce company whose architecture uses:

  1. Payment processing in a secure VPC, with transaction records stored in HBase on EMR.
  2. Redis to store active orders so that the processing engine can access them quickly.
  3. DocumentDB (a MongoDB-compatible store) for customer account data such as email addresses and shipping addresses.
  4. An e-commerce site using auto-scaling on Fargate, with the product catalog in Amazon Aurora.
  5. CloudWatch Logs to house the Order Processor's log events.
  6. A write-once-read-many data warehouse on Redshift.
  7. Shipment tracking data in DynamoDB.
  8. A fleet of drivers performing last-mile delivery while using IoT-enabled tablets.
  9. Advertising conversion data from a third-party source.

Architecture Image

Customer service agents begin receiving calls about orders 'stuck' in a weird state. Some show as pending even though they have been delivered; others show as delivered but haven't actually shipped. It would be great if we could quickly run a query across this diverse architecture to understand which orders might be affected and what they have in common.

Using Amazon Athena Query Federation and many of the connectors found in this repository, our hypothetical e-commerce company would be able to run a query that:

  1. Grabs all active orders from Redis. (see athena-redis)
  2. Joins against any orders with 'WARN' or 'ERROR' events in Cloudwatch logs by using regex matching and extraction. (see athena-cloudwatch)
  3. Joins against our EC2 inventory to get the hostname(s) and status of the Order Processor(s) that logged the 'WARN' or 'ERROR'. (see athena-cmdb)
  4. Joins against DocumentDB to obtain customer contact details for the affected orders. (see athena-docdb)
  5. Joins against DynamoDB to get shipping status and tracking details. (see athena-dynamodb)
  6. Joins against HBase to get payment status for the affected orders. (see athena-hbase)
WITH logs 
     AS (SELECT log_stream, 
                message                                          AS 
                order_processor_log, 
                Regexp_extract(message, '.*orderId=(\d+) .*', 1) AS orderId, 
                Regexp_extract(message, '(.*):.*', 1)            AS log_level 
         FROM 
     "lambda:cloudwatch"."/var/ecommerce-engine/order-processor".all_log_streams 
         WHERE  Regexp_extract(message, '(.*):.*', 1) != 'WARN'), 
     active_orders 
     AS (SELECT * 
         FROM   redis.redis_db.redis_customer_orders), 
     order_processors 
     AS (SELECT instanceid, 
                publicipaddress, 
                state.NAME 
         FROM   awscmdb.ec2.ec2_instances), 
     customer 
     AS (SELECT id, 
                email 
         FROM   docdb.customers.customer_info), 
     addresses 
     AS (SELECT id, 
                is_residential, 
                address.street AS street 
         FROM   docdb.customers.customer_addresses),
     shipments 
     AS ( SELECT order_id, 
                 shipment_id, 
                 from_unixtime(cast(shipped_date as double)) as shipment_time,
                 carrier
        FROM lambda_ddb.default.order_shipments),
     payments
     AS ( SELECT "summary:order_id", 
                 "summary:status", 
                 "summary:cc_id", 
                 "details:network" 
        FROM "hbase".hbase_payments.transactions)
         
SELECT _key_            AS redis_order_id, 
       customer_id, 
       customer.email   AS cust_email, 
       "summary:cc_id"  AS credit_card,
       "details:network" AS CC_type,
       "summary:status" AS payment_status,
       status           AS redis_status, 
       addresses.street AS street_address, 
       shipments.shipment_time as shipment_time,
       shipments.carrier as shipment_carrier,
       publicipaddress  AS ec2_order_processor, 
       NAME             AS ec2_state, 
       log_level, 
       order_processor_log 
FROM   active_orders 
       LEFT JOIN logs 
              ON logs.orderid = active_orders._key_ 
       LEFT JOIN order_processors 
              ON logs.log_stream = order_processors.instanceid 
       LEFT JOIN customer 
              ON customer.id = customer_id 
       LEFT JOIN addresses 
              ON addresses.id = address_id 
       LEFT JOIN shipments
              ON shipments.order_id = active_orders._key_
       LEFT JOIN payments
              ON payments."summary:order_id" = active_orders._key_
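The regexp_extract calls in the logs CTE above do the heavy lifting of parsing the order id and log level out of raw log lines. A quick way to sanity-check those patterns is to run them through Java's own regex engine. This is a sketch on a made-up log line, not part of the SDK; the extract helper only approximates the query engine's regexp_extract semantics:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegexDemo {
    // Approximates regexp_extract(message, pattern, group): return the given
    // capture group of the first match, or null if the pattern never matches.
    static String extract(String message, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(message);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String message = "ERROR: failed orderId=12345 retry scheduled";
        System.out.println(extract(message, ".*orderId=(\\d+) .*", 1)); // orderId   -> 12345
        System.out.println(extract(message, "(.*):.*", 1));             // log level -> ERROR
    }
}
```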

License

This project is licensed under the Apache-2.0 License.

aws-athena-query-federation's People

Contributors

abdulr3hman, abhijeetvg, abhishekpradeepmishra, aimethed, akuzin1, ankita-yadav12, atennak1, avirtuos, burhan94, chngpe, dependabot[bot], ejeffrli, github-actions[bot], hackett123, henrymai, janmran, jithendar12, jobrandw, kparwal, macohen, mhavey, namratas-trianz, nauy2697, rs-alexis, shurvitz, soojinj, trianz-akshay, trianz-raviboyina, venkatasivareddytr, ykrystal


aws-athena-query-federation's Issues

Query dynamodb with list/map datatypes

Getting below error when querying dynamodb table with list and map data types. Works fine with regular data types.

[ErrorCode: INTERNAL_ERROR_QUERY_ENGINE] Amazon Athena an experienced an internal error while executing this query. Please contact AWS support for further assistance. You will not be charged for this query. We apologize for the inconvenience.

NullPointerException with nested null fields with connector DynamoDB

Describe the bug
The connector works just fine until I try to query a table that has a nested NULL field inside of a JSON object (in DynamoDB it's of type 'Map').
An example is below.

To Reproduce
Steps to reproduce the behavior:

  1. Create a DynamoDB table.
  2. Create the following item:
{
  "id": "not_working",
  "some_json": {
    "nested_null_value": null
  },
  "top_layer_null": null
}
  3. Deploy the Athena DynamoDB connector.
  4. Query the table from the Athena console.
  5. Error occurs.

Expected behavior
I would expect to get a response in the Athena console, but instead I get the error.

Screenshots / Exceptions / Errors
Your query has the following error(s):
GENERIC_USER_ERROR: Encountered an exception[java.lang.NullPointerException] from your LambdaFunction[dynamo_catalog] executed in context[retrieving meta-data] with message[java.lang.NullPointerException]

Connector Details (please complete the following information):

  • Version: 2020.04.01
  • Name: AthenaDynamoDBConnector

[FEATURE] Create a columnar variant of BlockWriter/BlockSpiller

Currently, BlockWriter and BlockSpiller encourage a row-wise approach to writing results. These interfaces are often viewed as simpler than their columnar equivalents would be. Even though many of the systems that we've integrated with using this SDK do not themselves support columnar access patterns, there is value in offering a variant of these mechanisms that provides the skeleton for columnar writing of results.

The current SDK versions take the approach that experts can drop into 'native' Apache Arrow mode and simply not use these abstractions. This approach of making common things easy while still enabling access to a 'power user' mode is one we'd like to stick with, but we'd also like to make it easier for customers who want a more columnar experience to get one.

Some of the key goals of this new facility would be to alleviate the performance penalty associated with all the field vector lookups and type-conversion object overhead that the current row-wise convenience facades introduce. Depending on the source system being integrated with, these changes can improve cells/second throughput by 20% - 30% in our testing. The improvement is more dramatic when there is limited parallelism / pipelining available to hide this inefficiency.
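To make the row-wise vs. columnar cost difference concrete, here is a hypothetical illustration in plain Java (not the SDK's actual BlockWriter/BlockSpiller API): the row-wise path pays a destination lookup and an unboxing conversion per cell, while the columnar path pays one lookup per column and copies primitives in bulk.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnarWriteSketch {
    // Row-wise: resolve the destination vector and unbox a Number once per cell.
    static Map<String, long[]> writeRowWise(String[] fields, Object[][] rows) {
        Map<String, long[]> vectors = new HashMap<>();
        for (String f : fields) vectors.put(f, new long[rows.length]);
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < fields.length; c++) {
                // per-cell map lookup + unboxing overhead
                vectors.get(fields[c])[r] = ((Number) rows[r][c]).longValue();
            }
        }
        return vectors;
    }

    // Columnar: resolve the destination once per column, then bulk-copy primitives.
    static Map<String, long[]> writeColumnar(String[] fields, long[][] columns) {
        Map<String, long[]> vectors = new HashMap<>();
        for (int c = 0; c < fields.length; c++) {
            vectors.put(fields[c], columns[c].clone());
        }
        return vectors;
    }

    public static void main(String[] args) {
        String[] fields = {"order_id", "qty"};
        System.out.println(writeRowWise(fields, new Object[][]{{1L, 10L}, {2L, 20L}}).keySet());
        System.out.println(writeColumnar(fields, new long[][]{{1L, 2L}, {10L, 20L}}).keySet());
    }
}
```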

[FEATURE] Add PartitionProjection support to SDK and GlueMetadataHandler.

There are many cases where PartitionProjection can be a useful optimization for tables whose partitioning scheme is formulaic. This task is to port PartitionProjection from Athena's core engine to the athena-federation-sdk module so that any connector can make use of it. Additionally, we should enhance the GlueMetadataHandler to offer PartitionProjection based on Glue table properties as a first-class option.

Additional tasks to complete as part of this issue should include:

  1. Updating the javadoc on GlueMetadataHandler, which currently says that the class does not offer any support for partition information.
  2. Updating the GitHub wiki to include documentation on how/when/why someone may want to use PartitionProjection.
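The idea behind PartitionProjection is that formulaic partitions can be computed rather than looked up. A minimal sketch for a date-partitioned table (illustrative only; not the Athena engine's implementation):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class PartitionProjectionSketch {
    // Enumerate partition values from a date-range formula instead of listing
    // them from a metastore; a connector could then plan splits directly.
    static List<String> projectDatePartitions(LocalDate start, LocalDate end) {
        List<String> partitions = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            partitions.add("dt=" + d);
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(projectDatePartitions(
                LocalDate.of(2020, 1, 1), LocalDate.of(2020, 1, 3)));
        // [dt=2020-01-01, dt=2020-01-02, dt=2020-01-03]
    }
}
```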

ThrottlingInvoker too conservative with cloudwatch-metrics

When testing the cloudwatch-metrics connector I noticed it was too easy to write a query that would result in thrashing against the CloudWatch GetMetricData(...) API. This was surprising because ThrottlingInvoker was indeed kicking in and reducing the TPS to 1 for each running Lambda. However, it seems that in aggregate Athena's minimum concurrency while congestion control was active was still high enough to overwhelm the typical cloudwatch-metrics account limits. As such, this task is to:

  1. Reproduce the issue and confirm the above behavior.
  2. Use a more conservative configuration (bigger back-off, less frequent calls) for the ThrottlingInvoker used by cloudwatch-metrics. It currently uses the default ThrottlingInvoker config. Don't change the default; just use the builder in the cloudwatch-metrics connector to build a custom one for CW metrics.
  3. Update the README for the connector to recommend using a lower default concurrency in the Lambda console for this connector so as to override Athena's minimum concurrency.
  4. Update our docs to have a better example query which is less likely to catch dozens of individual metrics and thus produces lower concurrency.
  5. Consider changing how we form splits so that we make fewer calls to GetMetricData by pulling up to 30 metrics in each call, as recommended by the CW team.
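A "bigger back-off, less frequent calls" policy amounts to a capped exponential delay between attempts. A generic sketch with illustrative numbers (not the ThrottlingInvoker's actual builder or defaults):

```java
public class BackoffSketch {
    // Capped exponential back-off: the delay doubles per attempt up to a ceiling.
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // clamp shift to avoid overflow
        return Math.min(delay, capMillis);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + " -> wait "
                    + delayMillis(attempt, 200, 10_000) + " ms");
        }
    }
}
```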

[FEATURE] Refactor ValueSet and its implementations to improve filtering performance

The current ConstraintEvaluator makes use of constructs like ValueSet which are heavily inspired by analogs in Presto. When these constructs were adapted for use with Apache Arrow we identified several places where unnecessary object churn takes place. This task is to refactor ValueSet's interfaces that are used by ConstraintEvaluator to reduce object churn.

[FEATURE] Enhance publish tool to support aws profiles

Some customers may need to assume a role or use a specific profile when publishing to SAR. We should enhance the publish tool to allow that by accepting and passing along the --profile [PROFILE NAME] flag to the aws cli and sam tool.

[BUG] HBClient Failure (replica 0 offline) with connector hbase

Describe the bug
If I run an HBase query and then wait a few minutes before running another HBase query, my query fails. If I immediately retry the query, it succeeds.

To Reproduce
Steps to reproduce the behavior:

  1. Run an HBase query.
  2. Wait >1 minute.
  3. Run a second HBase query - it will fail with an exception about replica 0 being offline.
  4. Rerun the query and it will succeed.

Expected behavior
I expect us to properly detect expired clients, recreate them, and not fail the query.

Connector Details (please complete the following information):

  • Version: all versions
  • Name: hbase

Additional context
I'm pretty sure this relates to how the HBase connector caches clients to reduce query delay. When attempting to reuse a cached client we run a quick health check, but that check either isn't catching the faulty client or it throws and prevents us from re-creating the client. The naive solution is to stop caching; a better solution would be a more thorough health check.
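The "better health check" direction could look something like this hypothetical cache (plain Java, not the connector's actual code): validate the cached client before reuse, treat a throwing check as unhealthy, and rebuild the client instead of failing the query.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class ClientCache<T> {
    private final ConcurrentHashMap<String, T> cache = new ConcurrentHashMap<>();
    private final Supplier<T> factory;
    private final Predicate<T> healthCheck;

    ClientCache(Supplier<T> factory, Predicate<T> healthCheck) {
        this.factory = factory;
        this.healthCheck = healthCheck;
    }

    T get(String key) {
        T client = cache.get(key);
        if (client != null) {
            boolean healthy;
            try {
                healthy = healthCheck.test(client);
            } catch (RuntimeException e) {
                healthy = false; // a throwing check means "stale", not "fail the query"
            }
            if (healthy) {
                return client;
            }
            cache.remove(key, client); // evict the expired client and fall through
        }
        T fresh = factory.get();
        cache.put(key, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        java.util.concurrent.atomic.AtomicInteger counter = new java.util.concurrent.atomic.AtomicInteger();
        ClientCache<Integer> cache = new ClientCache<>(counter::incrementAndGet, c -> true);
        System.out.println(cache.get("hbase") + " then " + cache.get("hbase")); // same client reused
    }
}
```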

[QUESTION] General question

I have a question that I don't think is related to a bug or feature request.
I am getting the following error when trying to use the postgres jdbc connection

Encountered an exception[java.lang.IllegalArgumentException] from your LambdaFunction executed in context[retrieving meta-data] with message[No enum constant com.amazonaws.connectors.athena.jdbc.connection.JdbcConnectionFactory.DatabaseEngine.POSTGRESQL]

[FEATURE] JDBC connector only works with lowercase table names.

Implement something similar to DynamoDBTableResolver to resolve case-sensitive table names and also column names. We may be able to handle column names by using schema metadata so that we don't really need a table-name cache like we do with other connectors.

FederatedIdentity is not always populated

Certain call patterns result in the principal ARN portion of the FederatedIdentity not being set when Athena calls a connector, or being set to "unknown". This is a known issue and one that we hope to resolve in the weeks after the preview opens to the public.

In general we are still gathering feedback on the best way to federate identity to connectors. Please feel free to share your thoughts or suggestions on this issue. The actual fix for ensuring this field is populated will be made within Athena, not within this repo.

All queries failing: DynamoDB connector

Describe the bug
All queries failing with the DynamoDB connector:

Your query has the following error(s):

[ErrorCode: INTERNAL_ERROR_QUERY_ENGINE] Amazon Athena an experienced an internal error while executing this query. Please contact AWS support for further assistance. You will not be charged for this query. We apologize for the inconvenience.

This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: aed23b45-7960-4d3f-8669-13ad1f84d738.

This seems to have started within the last 3 hours.

Maybe this is a temporary issue with the service, but I don't see anything in any health dashboard, so reporting it here to see if anyone can cast light on the exact issue.

Lambda functions return success on malformed payload

When the Lambda functions are called with a malformed payload, the function still returns a status code of 200, indicating success. This makes it hard to triage issues at the interface. The log indicates that there was an error, but it's not propagated correctly to the caller.

Example: Calling the lambda function with an empty JSON Object

Result:

HTTP Code: 200
Status: 200
Error: Unhandled
LogResult: START RequestId: 47b86135-4419-4975-8020-420c52cdc56e Version: $LATEST
2019-11-18 07:07:43 <47b86135-4419-4975-8020-420c52cdc56e> ERROR CompositeHandler:Request data: {}
2019-11-18 07:07:43 <47b86135-4419-4975-8020-420c52cdc56e> WARN  CompositeHandler:handleRequest: Completed with an exception.
com.fasterxml.jackson.databind.JsonMappingException: Unexpected token (END_OBJECT), expected FIELD_NAME: missing property '@type' that is to contain type id  (for class com.amazonaws.athena.connector.lambda.request.FederationRequest)
 at [Source: {}; line: 1, column: 2]
	at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:256)
	at com.fasterxml.jackson.databind.DeserializationContext.wrongTokenException(DeserializationContext.java:1061)
	at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedUsingDefaultImpl(AsPropertyTypeDeserializer.java:147)
	at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:99)
	at com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:142)
	at com.fasterxml.jackson.databind.deser.impl.TypeWrappedDeserializer.deserialize(TypeWrappedDeserializer.java:63)
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3807)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2797)
	at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:85)
	at lambdainternal.EventHandlerLoader$2.call(EventHandlerLoader.java:888)
	at lambdainternal.AWSLambda.startRuntime(AWSLambda.java:293)
	at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:64)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:114)
com.fasterxml.jackson.databind.JsonMappingException: Unexpected token (END_OBJECT), expected FIELD_NAME: missing property '@type' that is to contain type id  (for class com.amazonaws.athena.connector.lambda.request.FederationRequest)
 at [Source: {}; line: 1, column: 2]: java.lang.RuntimeException
java.lang.RuntimeException: com.fasterxml.jackson.databind.JsonMappingException: Unexpected token (END_OBJECT), expected FIELD_NAME: missing property '@type' that is to contain type id  (for class com.amazonaws.athena.connector.lambda.request.FederationRequest)
 at [Source: {}; line: 1, column: 2]
	at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:91)
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Unexpected token (END_OBJECT), expected FIELD_NAME: missing property '@type' that is to contain type id  (for class com.amazonaws.athena.connector.lambda.request.FederationRequest)
 at [Source: {}; line: 1, column: 2]
	at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:256)
	at com.fasterxml.jackson.databind.DeserializationContext.wrongTokenException(DeserializationContext.java:1061)
	at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedUsingDefaultImpl(AsPropertyTypeDeserializer.java:147)
	at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:99)
	at com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:142)
	at com.fasterxml.jackson.databind.deser.impl.TypeWrappedDeserializer.deserialize(TypeWrappedDeserializer.java:63)
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3807)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2797)
	at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:85)

END RequestId: 47b86135-4419-4975-8020-420c52cdc56e
REPORT RequestId: 47b86135-4419-4975-8020-420c52cdc56e	Duration: 5.87 ms	Billed Duration: 100 ms	Memory Size: 3008 MB	Max Memory Used: 157 MB

java.lang.RuntimeException with nested numbers - DynamoDB connector

Logs from the lambda function:
lambda_logs.txt

Describe the bug
Having a nested number in DynamoDB causes the Glue crawler to create an invalid schema, resulting in DynamoDB connector failure.

To Reproduce
Steps to reproduce the behavior:

  1. Create a DynamoDB table with the following item:
{
  "id": "secret_id",
  "some_number": 123,
  "some_object": {
    "nested_number": 456
  }
}
  2. Create a Glue Crawler and let it crawl the table and create a schema.
  3. The schema ends up looking like this:
"FieldSchema": [
				{
					"name": "some_number",
					"type": "bigint",
					"comment": ""
				},
				{
					"name": "some_object",
					"type": "struct<nested_number:bigint>",
					"comment": ""
				},
				{
					"name": "id",
					"type": "string",
					"comment": ""
				}
			]
  4. Use a DynamoDB connector from Athena and receive the following error:

java.lang.RuntimeException

Notice that both "some_number" and "nested_number" got recognized as BIGINT type by the Glue Crawler, which seems to be incorrect.

Screenshots / Exceptions / Errors
Your query has the following error(s):

GENERIC_USER_ERROR: Encountered an exception[java.lang.RuntimeException] from your LambdaFunction[function_name] executed in context[S3SpillLocation{bucket='bucket-name', key='athena-spill/---**************/----', directory=true}] with message[Error while processing field some_object]

Connector Details (please complete the following information):

  • Version: 2020.04.01
  • Name: AthenaDynamoDBConnector

Additional context
Querying the mentioned DynamoDB table without using a Glue Schema works without throwing any errors, the response looks like this:

some_number: 123.000000000
some_object: {nested_number=456.000000000}
id: secret_id

Notice that both integers seem to be shown as DOUBLE type.

Also, because of issue #112 I'm forced to use the Glue Crawler schema to query the DynamoDB tables, which is why I am basically stuck. Any workaround ideas are welcome, thanks.

[FEATURE] Unnecessary maven dependencies

At times I've noticed unexpected dependencies getting downloaded from Maven Central when I am building connectors. I'm not 100% sure if this is just an artifact of how Maven does caching for multi-module projects or if there are actually extra, unnecessary jars making their way into connector builds. More specifically, when I was building athena-example I saw BigQuery jars getting downloaded by Maven.

This task is to:

  1. see if you can reproduce what I mentioned above
  2. see if this happens for any other connectors
  3. If confirmed, make the appropriate fixes to our pom layout so that the modules remain independent as intended.

I'm tagging this as performance because larger jars mean longer cold-start times in Lambda, though if they get large enough they may exceed Lambda's limit.

[BUG] Cloudwatch Logs Connector Request Size Exceeded

In the ANT222r workshop a customer ran into an issue querying the all_log_streams table. Mert / Anthony have the query id that generated the error. On the surface we suspect that the query created too many partitions, or that one of the log lines was so big that it exceeded the Lambda response size but somehow did not get spilled. This error needs to be investigated, as it probably represents an edge case in the SDK or connector that our scale tests did not catch.

[QUESTION] Athena TypeORM Tooling

I have a question that I don't think is related to a bug or feature request.

I would like to use TypeORM with AWS Athena as the backend for it.

I was wondering if there were any internal tools that the AWS Athena team has for working with TypeORM that they would be willing to open source.

Add SecretsManager as SSL Cert source for JDBC SSL Support

This task is to test that all RDS JDBC sources we support work with and without SSL. Anecdotal testing suggests that the required Amazon CAs for RDS-provided certs are already present in Lambda. For non-RDS MySQL, Postgres, etc., we should add a mechanism for providing your own certs, likely via SecretsManager, which we already support for connection strings and credentials.

[FEATURE] Workaround for spilling to S3

We're investigating the ways to integrate the Raptor connector (https://github.com/prestodb/presto/tree/master/presto-raptor) to Athena. We store the shards as ORC files in S3 and use Mysql as a metadata backend that stores the BRIN indexes separately.

We would like to write a connector for Athena that applies predicate & aggregate pushdown to the query and returns the ORC files (shards) from S3, but the 6 MB limit makes this connector useless.

I see that you're considering an alternative to the spilling feature, so I wanted to create an issue in order to keep track of the status.

JDBC AWS Secrets Replacement Needs Escaping

Passwords retrieved from AWS Secrets Manager that contain escape sequences for Java Regex replacements cause the following exception:

java.lang.IllegalArgumentException: Illegal group reference
	at java.util.regex.Matcher.appendReplacement(Matcher.java:857)
	at java.util.regex.Matcher.replaceAll(Matcher.java:955)
	at com.amazonaws.connectors.athena.jdbc.connection.GenericJdbcConnectionFactory.getConnection(GenericJdbcConnectionFactory.java:92)
	at com.amazonaws.connectors.athena.jdbc.manager.JdbcMetadataHandler.doGetTable(JdbcMetadataHandler.java:205)
	at com.amazonaws.connectors.athena.jdbc.MultiplexingJdbcMetadataHandler.doGetTable(MultiplexingJdbcMetadataHandler.java:113)
	at com.amazonaws.athena.connector.lambda.handlers.MetadataHandler.doHandleRequest(MetadataHandler.java:243)
	at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:132)
	at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:100)
	at lambdainternal.EventHandlerLoader$2.call(EventHandlerLoader.java:906)
	at lambdainternal.AWSLambda.startRuntime(AWSLambda.java:341)
	at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:63)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:114)

Replacements need to escape the replacement string (e.g. with Matcher.quoteReplacement) to avoid this.
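A minimal reproduction and fix, assuming the template substitution uses String.replaceAll (the ${password} token and connection string below are hypothetical, chosen only to illustrate the failure mode):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SecretReplacementDemo {
    // Substitute a secret into a template, escaping '$' and '\' in the
    // replacement so Matcher does not read them as group references.
    static String substitute(String template, String token, String secret) {
        return template.replaceAll(Pattern.quote(token), Matcher.quoteReplacement(secret));
    }

    public static void main(String[] args) {
        String template = "jdbc:postgresql://example-host/db?user=app&password=${password}";
        String secret = "pa$$word$1"; // '$' sequences trip up a raw replaceAll

        try {
            // Unescaped replacement: '$' starts a group reference and throws.
            template.replaceAll(Pattern.quote("${password}"), secret);
        } catch (IllegalArgumentException e) {
            System.out.println("unescaped: " + e.getMessage());
        }

        System.out.println("escaped: " + substitute(template, "${password}", secret));
    }
}
```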

DynamoDB Connector does not handle Glue BigInt column correctly

I crawled a table containing a year column which the Glue crawler decided was a bigint. I see the following stacktrace in CloudWatch:

Error while processing field year: java.lang.RuntimeException
java.lang.RuntimeException: Error while processing field year
	at com.amazonaws.athena.connectors.dynamodb.DynamoDBRecordHandler.lambda$readWithConstraint$1(DynamoDBRecordHandler.java:185)
	at com.amazonaws.athena.connector.lambda.data.S3BlockSpiller.writeRows(S3BlockSpiller.java:174)
	at com.amazonaws.athena.connectors.dynamodb.DynamoDBRecordHandler.readWithConstraint(DynamoDBRecordHandler.java:146)
	at com.amazonaws.athena.connector.lambda.handlers.RecordHandler.doReadRecords(RecordHandler.java:191)
	at com.amazonaws.athena.connector.lambda.handlers.RecordHandler.doHandleRequest(RecordHandler.java:157)
	at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:135)
	at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:100)
Caused by: java.lang.RuntimeException: Unable to set value for field year using value 1933
	at com.amazonaws.athena.connector.lambda.data.BlockUtils.setValue(BlockUtils.java:330)
	at com.amazonaws.athena.connector.lambda.data.Block.offerValue(Block.java:175)
	at com.amazonaws.athena.connectors.dynamodb.DynamoDBRecordHandler.lambda$readWithConstraint$1(DynamoDBRecordHandler.java:176)
	... 6 more
Caused by: java.lang.ClassCastException: java.math.BigDecimal cannot be cast to java.lang.Long
	at com.amazonaws.athena.connector.lambda.data.BlockUtils.setValue(BlockUtils.java:288)
	... 8 more

This is because the ItemUtils.toSimpleValue util that is used to convert from AttributeValue to a Java object returns a BigDecimal for any DDB Number attribute.
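A minimal illustration of the failing cast and a tolerant coercion (plain Java; the actual fix inside BlockUtils may differ):

```java
import java.math.BigDecimal;

public class NumberCoercionDemo {
    // DynamoDB's ItemUtils.toSimpleValue hands back BigDecimal for every Number
    // attribute, so coerce through the Number interface instead of casting.
    static long toLong(Object value) {
        return ((Number) value).longValue();
    }

    public static void main(String[] args) {
        Object year = new BigDecimal("1933");
        try {
            Long bad = (Long) year; // mirrors the failing cast in the stacktrace
            System.out.println(bad);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: " + e.getMessage());
        }
        System.out.println(toLong(year));
    }
}
```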

Handle Glue table name coercion

When the Glue Crawler crawls a table such as a DynamoDB table, it will replace hyphens in the table name with underscores. Also, it is likely that Glue will not allow hyphens during manual table creation. Our current implementations of GlueMetadataHandler in all the current connectors assume the table name will be the same between Glue and the data source (allowing for case insensitivity). I can think of three ways to address this:

  1. Crawled tables maintain the original name of the table in the table's StorageDescriptor->Location property. We can leverage this if present and in a known form (like an AWS ARN).
  2. In the Table "Resolver" that some connectors have, we can explicitly fall back to replacing underscores with hyphens if a table lookup fails. This can get messy and expensive if there's a mix of hyphens and underscores in the actual table name.
  3. We can add an environment variable that can be used to provide a manual mapping of Glue table name to target table name.

I'm leaning toward a combination of 1 & 3.
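A sketch of how options 2 and 3 could combine (hypothetical helper with illustrative names, not connector code): consult an explicit mapping first, then fall back to swapping underscores for hyphens when the Glue name doesn't match a source table.

```java
import java.util.Map;
import java.util.Set;

public class TableNameResolver {
    static String resolve(String glueName, Map<String, String> explicitMapping, Set<String> sourceTables) {
        String mapped = explicitMapping.get(glueName);  // option 3: manual mapping wins
        if (mapped != null) return mapped;
        if (sourceTables.contains(glueName)) return glueName;
        String hyphenated = glueName.replace('_', '-'); // option 2: hyphen fallback
        if (sourceTables.contains(hyphenated)) return hyphenated;
        return glueName; // let the caller raise table-not-found
    }

    public static void main(String[] args) {
        Set<String> tables = Set.of("order-events", "customers");
        System.out.println(resolve("order_events", Map.of(), tables));
        System.out.println(resolve("legacy", Map.of("legacy", "Legacy-V1"), tables));
    }
}
```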

DynamoDB still chokes on some Glue tables

Hey there,

I've been testing a custom-built connector based on the recently merged PRs and in general, it's been running pretty smoothly. The main issue I have seen until today was random Athena query failures, or really slow results, which I attribute to too many queries at the same time hitting the limit of 20 requests. I have requested a limit raise and will report back if there's any news on that.

When testing the PRs I was only mapping one table in Glue, which had some issues that were fixed in follow-up commits (thank you again!). Today, as I am nearing the release of this feature, I decided to map two more tables in Glue in an attempt to optimize speed as much as possible. Unfortunately, I hit a few more problematic fields in one of the two tables.

Once the tables were mapped using the crawler, the Athena UI would no longer list any tables in the database at all...

I looked at logs and found the following:

2020-01-08 02:42:27 <1596562f-279c-4a8a-afb3-36b19686050d> WARN  DynamoDBMetadataHandler:doGetTable: Unable to retrieve table application_retracted_prod from AWSGlue in database/schema default. Falling back to schema inference. If inferred schema is incorrect, create a matching table in Glue to define schema (see README)
Requested resource not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: 4IVIF6FLCB5U7D17I1DGQ5J8FRVV4KQNSO5AEMVJF66Q9ASUAAJG): com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException
com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: 4IVIF6FLCB5U7D17I1DGQ5J8FRVV4KQNSO5AEMVJF66Q9ASUAAJG)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1701)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1356)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1102)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:759)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:733)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:715)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:675)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:657)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:521)
	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4230)
	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4197)
	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeScan(AmazonDynamoDBClient.java:3007)
	at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.scan(AmazonDynamoDBClient.java:2974)
	at com.amazonaws.athena.connectors.dynamodb.util.DDBTableUtils.lambda$peekTableForSchema$3(DDBTableUtils.java:137)
	at com.amazonaws.athena.connector.lambda.ThrottlingInvoker.invoke(ThrottlingInvoker.java:187)
	at com.amazonaws.athena.connector.lambda.ThrottlingInvoker.invoke(ThrottlingInvoker.java:167)
	at com.amazonaws.athena.connectors.dynamodb.util.DDBTableUtils.peekTableForSchema(DDBTableUtils.java:137)
	at com.amazonaws.athena.connectors.dynamodb.resolver.DynamoDBTableResolver.getTableSchema(DynamoDBTableResolver.java:111)
	at com.amazonaws.athena.connectors.dynamodb.DynamoDBMetadataHandler.doGetTable(DynamoDBMetadataHandler.java:240)
	at com.amazonaws.athena.connector.lambda.handlers.MetadataHandler.doHandleRequest(MetadataHandler.java:243)
	at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:132)
	at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:100)

I then proceeded to remove some fields which I found potentially problematic; those fields seem to be array and date. Changing the date field to a string also worked. While I now have a workaround, all fields are re-added and updated on the next crawler run, which will potentially break queries again, so maybe these field types can be tweaked to work without issues. It may be that in some cases these fields have no value, or even a different type than array or date (similar to the bug we already had).

Document the raw request/response payloads

Integration with languages other than Java will benefit from detailed documentation of the serialized form required by each request/response. At present we make heavy use of Jackson's ObjectMapper, but in the future we will want to move to something that gives us more control over how we evolve the protocol while minimizing disruption to older clients (Lambda functions).

Athena ElasticSearch connector

Hello,

Are there any plans/requests for an ElasticSearch connector for Athena (not just the AWS ElasticSearch service)?

Thanks,
Lior

[FEATURE] Multiple calls to getTableLayouts()

We've seen cases where Athena calls getTableLayouts(...) multiple times for a given query. This is not expected and seems to suggest that Athena's optimizer is looking for different options (layouts) to find the best one. This task is to:

  1. try to reproduce it (a simple query against any table with at least 1 partition should suffice) and see why this happened, or whether it was an artifact of the environment I saw it in (which was imperfect).
  2. stop issuing redundant calls if we have no indication that additional layouts are possible.

[FEATURE] Improve JDBC Perf

Is your feature request related to a problem? If yes, please describe.
There is a known issue with the JDBC connector that limits it to a 64mbps scan rate for tables/engines that are not partitioned. Using partitioned tables unlocks significantly greater parallelism, but unpartitioned tables appear to run into a performance limit related to the JDBC driver (or its configuration). We are aware and investigating.

Describe the solution you'd like
Attempt to improve the JDBC single-split performance to be closer to the Lambda max for the SDK. You may want to do a test on raw EC2 to see if the JDBC driver itself is the bottleneck.


[QUESTION] Cannot compile on Mac with default Java

Compilation fails with what looks like an issue with Java > 8 and the 'javax.activation' package, which is no longer available by default in newer JDKs.

[ERROR] /Users/magrund/Development/aws-athena-query-federation/athena-federation-sdk/src/test/java/com/amazonaws/athena/connector/lambda/data/BlockTest.java:[58,24] package javax.activation does not exist
[ERROR] /Users/magrund/Development/aws-athena-query-federation/athena-federation-sdk/src/test/java/com/amazonaws/athena/connector/lambda/data/BlockTest.java:[346,31] cannot find symbol
  symbol:   class UnsupportedDataTypeException
  location: class com.amazonaws.athena.connector.lambda.data.BlockTest
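One possible workaround (a sketch, assuming a Maven build): since javax.activation was removed from the default classpath in newer JDKs, it can be restored as an explicit dependency. The coordinates below are the standalone javax.activation-api artifact; adjust version and scope to match the project.

```xml
<!-- Sketch of a workaround: restore javax.activation on Java 9+ builds. -->
<dependency>
    <groupId>javax.activation</groupId>
    <artifactId>javax.activation-api</artifactId>
    <version>1.2.0</version>
</dependency>
```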

Problem related to Query in Athena

Hi All,

I am running queries on Athena, configured through a Crawler, Tables, and Databases, with the data read from S3. The first time I run the query (Select * from hitesh1907) I get the response in one line. But when I run this query repeatedly I keep getting blank rows, with the result of the query appearing much lower down.

I am not sure what I am doing wrong, or why I am getting empty rows before the actual result.

Screenshot attached for reference: [image: ProblemwithAthena]

Glue connectivity timeout causing slow queries

When customers use a Glue-enabled connector like DDB, DocDB, or HBase, they may not actually have Glue connectivity set up in their VPC. By default these connectors attempt to contact Glue even when it isn't reachable, which results in a >90 second wait while the call times out. Then, because the Glue usage is hidden, we don't even tell the customer; it just appears as a warning in the log and a slow query. We need to fix that and handle lack of Glue connectivity more gracefully, possibly by detecting it via a low timeout and remembering the result for the life of the Lambda. I suspect we should also use a much lower timeout in general.

Below is an example of the error we found when troubleshooting slow query reports from some docdb beta-testers.

com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to glue.us-east-2.amazonaws.com:443 [glue.us-east-2.amazonaws.com/3.134.172.40, glue.us-east-2.amazonaws.com/3.130.5.83, glue.us-east-2.amazonaws.com/3.133.176.154] failed: connect timed out
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1164)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1110)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:759)
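The "remember it for the life of the Lambda" idea could be sketched like this (illustrative names, not the SDK's API):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical sketch: remember a Glue connectivity failure for the life of
// the Lambda container so subsequent invocations skip the slow, doomed call
// instead of waiting out the timeout again.
final class GlueAvailabilityTracker {
    private static final AtomicBoolean GLUE_UNREACHABLE = new AtomicBoolean(false);

    private GlueAvailabilityTracker() {}

    static <T> T callGlueOrFallback(Supplier<T> glueCall, Supplier<T> fallback) {
        if (GLUE_UNREACHABLE.get()) {
            return fallback.get();   // we already know Glue is unreachable
        }
        try {
            return glueCall.get();
        }
        catch (RuntimeException e) { // e.g. SdkClientException: connect timed out
            GLUE_UNREACHABLE.set(true);
            return fallback.get();
        }
    }
}
```

Pairing this with a short connect timeout on the Glue client (the AWS SDK for Java's ClientConfiguration.withConnectionTimeout) would turn the >90 second stall into a fast failure on the first invocation.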

[FEATURE] Enhance Redis Connector Exception Messages

Is your feature request related to a problem? If yes, please describe.
I'm frustrated when the Redis connector fails with NPEs caused by missing table properties.

Describe the solution you'd like
The connector should throw an exception whose message describes the missing table properties, instead of assuming they are present and failing with an NPE that is tough to diagnose.
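A minimal sketch of the validation (illustrative names; "redis-endpoint" is just an example property, not necessarily the connector's actual key):

```java
import java.util.Map;

// Hypothetical sketch: fail fast with a descriptive message instead of an
// NPE when a required table property is absent.
final class TableProperties {
    private TableProperties() {}

    static String requireProperty(Map<String, String> properties, String name, String tableName) {
        String value = properties.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException(String.format(
                    "Table '%s' is missing required property '%s'. Add it to the table definition to use the Redis connector.",
                    tableName, name));
        }
        return value;
    }
}
```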

StackOverflowError in the DynamoDB connector

I'm experimenting with the DynamoDB connector. It seems to work fantastically well. I do get this error for one specific table for some reason. Is this a known issue?

GENERIC_USER_ERROR: Encountered an exception[java.lang.StackOverflowError] from your LambdaFunction[arn:aws:lambda:us-east-1:xxx:function:xxx] executed in context[retrieving meta-data] with message[java.lang.StackOverflowError]

Query Id: fe473a53-ea8d-4933-bbc5-0423fa15e0ae

Change SAR Template to Lowercase Catalog and Lambda Names

Because Athena automatically lowercases identifiers referenced in queries, only lowercase catalog and function names are usable in external Lambdas. Today, our Serverless Application Repository template allows users to create Lambda functions and catalog names with arbitrary case, even though non-lowercase names will be unusable in Athena. This issue is to update the SAR template to automatically lowercase these two fields so that they are guaranteed to work with Athena. The "Lower" operation here offers a way to do this.

Add query completion detection

When a query fails, uses a limit, or gets canceled, a Lambda invocation may continue to run until it completes its work even though the request has been abandoned, since Lambda does not offer a mechanism we can use to 'cancel' a running invocation. It would be useful to have a way to indicate to potentially long-running operations like GetTableLayouts, GetSplits, and ReadRecords/ReadWithConstraint that the request is done.

This is a particularly useful optimization for limit operations until we can truly support pushing down limits. Arguably, this is a more generic mechanism for stopping a scan early regardless of the reason since not all limits can be pushed down (especially when multiple tables or functions are involved).

Let's add a facility by which the SDK uses the Athena Query ID in the request to provide an isQueryRunning() signal that can be used to stop a request early.

  • Keep in mind that this feature will require a new IAM permission/policy for connectors wishing to take advantage of it, so we should update the serverless application repository yaml files for each connector that adopts it in order to avoid runtime errors. My inclination is also that this method should default to 'true' in the event it has trouble actually obtaining the status of the query; it is better to lose the optimization than to fail queries, assuming we log appropriately.
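The core decision could be sketched like this (hypothetical class; the state strings are the ones the Athena GetQueryExecution API reports):

```java
// Hypothetical sketch: decide whether a query is still running from the
// state string reported by Athena's GetQueryExecution API, defaulting to
// 'true' when no state could be obtained so a status failure only costs us
// the optimization, never the query.
final class QueryStatusChecker {
    private QueryStatusChecker() {}

    static boolean isQueryRunning(String state) {
        if (state == null) {
            return true;   // couldn't get status; assume running and keep working
        }
        switch (state) {
            case "SUCCEEDED":
            case "FAILED":
            case "CANCELLED":
                return false;  // terminal states: stop doing work for this query
            default:           // QUEUED, RUNNING, or anything unrecognized
                return true;
        }
    }
}
```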

[FEATURE] Publish API metrics

It would be helpful if the SDK allowed customers to enable/disable publishing of CloudWatch metrics for connector performance, including but not limited to:

  1. latency by method
  2. total bytes
  3. total rows
  4. total splits
  5. total partitions
  6. success vs failure

This should be on by default but could be disabled via an environment variable if the customer wanted to reduce CloudWatch costs.
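The latency-by-method metric with an env-var kill switch could be sketched like this (illustrative names throughout, including the "publish_metrics" variable):

```java
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Hypothetical sketch: time each handler method and hand the latency to a
// publisher (e.g. one backed by CloudWatch PutMetricData), gated by an
// env-var so metric publishing can be disabled.
final class MetricsTimer {
    private static final boolean ENABLED =
            !"false".equalsIgnoreCase(System.getenv("publish_metrics"));

    private MetricsTimer() {}

    static <T> T timed(String method, Supplier<T> call, BiConsumer<String, Long> publisher) {
        if (!ENABLED) {
            return call.get();  // metrics disabled; skip the timing overhead
        }
        long start = System.nanoTime();
        try {
            return call.get();
        }
        finally {
            // Report latency in milliseconds for this method.
            publisher.accept(method, (System.nanoTime() - start) / 1_000_000L);
        }
    }
}
```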

[FEATURE] Release a Python SDK

Is your feature request related to a problem? If yes, please describe.
No.

Describe the solution you'd like
I'd like to be able to write my connectors in Python, either because I'm more familiar with it, because I have a large amount of existing Python code I need to reuse, or because we don't have Java build infrastructure and this poses a large barrier to entry when writing a custom connector.

Generate javadoc

Now that we have high javadoc coverage and are close to our launch date, let's start publishing javadoc. This task is to:

  1. auto-generate javadoc on commit
  2. auto-publish javadoc
  3. determine the best place to host the javadoc. (consider github pages)

DynamoDB connector chokes on some tables

Hey there,

I'm checking out this FANTASTIC improvement to Athena with DynamoDB but hit a snag fairly early in my testing. Some tables seem to return data fine, while 1-2 of my tables seem to choke no matter what I do:

Your query has the following error(s):

GENERIC_INTERNAL_ERROR: Row type must have at least one parameter
This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: e7ab04c2-bcb1-4eaa-9cfc-a3f176866fa1.

I've more than triple checked that my query is valid and correct.

I understand the Query Id should help you debug? I suspect this is a problem with one or more of the records in the table, though the table works normally in production for the rest of the system (Amplify with GraphQL, which is pretty sensible in itself due to schemas etc.). The reason I suspect this is that I have 3 environments of this same table, and 2 of them are breaking while the third one is not (but that one does not have any data).

I'd be happy to assist in any further debugging. Unfortunately, this is also my main table, so until I figure out what's going wrong here I will not be able to progress, and I have already wasted an insane amount of time trying to make Athena work with Firehose / Glue / ETL :/

Add Connector Sanity Test

Today, authors of custom Lambda connectors based on the Athena Federation SDK are provided only single-component unit tests for correctness verification. There is a class of logical bugs that these targeted unit tests would not catch, so the project would benefit from a broader logical test that verifies the end-to-end workflow across multiple components and invocations, similar to what Athena will eventually send to these custom Lambdas.

This task is to implement a tool that will simulate the call patterns Athena will make when running queries against a custom Lambda function. This will allow users to test their Lambdas without using Athena and iterate more quickly.
