
Virtual Schema for document files on AWS S3

License: MIT License


s3-document-files-virtual-schema's Introduction

Virtual Schema for Document Files in S3


This Virtual Schema allows you to access document files stored in S3 as if they were a regular Exasol table.

This Virtual Schema is built for and tested with the official AWS S3. Third-party S3-API-compatible products are expected to work as well. It is highly recommended to thoroughly test third-party products used in combination with Exasol, especially regarding sufficient S3 API compatibility.

Each release of this Virtual Schema (VSS3) is verified against MinIO by automated integration tests. MinIO is a high-performance, S3-compatible object store built for large-scale AI/ML, data lake, and database workloads. It runs on-prem and on any cloud (public or private), from the data center to the edge, and is software-defined and open source under the GNU AGPL v3. See the Hands-On Guide for a quick tour with some sample JSON files.

This Virtual Schema is prepared for the Java UDF startup time improver.

For supported document file formats see Virtual Schema Common Document Files.

Additional Information:

s3-document-files-virtual-schema's People

Contributors

ckunki, jakobbraun, kaklakariada, pj-spoelders, redcatbear


Forkers

rohankumardubey

s3-document-files-virtual-schema's Issues

Can we query files in private S3 "custom endpoints" without dns style bucket level access

Currently, when we query an internal S3-compatible store such as MinIO, we get bucket-string exceptions. The syntax does not allow custom endpoints with path-style access, or bucket strings that do not contain "s3". Thus we cannot use this Virtual Schema with custom internal endpoints. Is there any way to enable path-style access (where the bucket is not part of the DNS name but is the first path segment after the domain name), or to relax the requirement on the URL format?

Exception / Stacktrace from query: SELECT * FROM FILES_VS_TEST."metrics" LIMIT 10

SQL-Error [22002]: VM error: F-UDF-CL-LIB-1126: F-UDF-CL-SL-JAVA-1006: F-UDF-CL-SL-JAVA-1026:
com.exasol.ExaUDFException: F-UDF-CL-SL-JAVA-1068: Exception during singleCall adapterCall
java.lang.IllegalArgumentException: E-S3VS-1: The given S3 Bucket string 'https://<private_endpoint>/test/metrics' has an invalid format. Expected format: http(s)://BUCKET.s3.REGION.amazonaws.com/KEY or http(s)://BUCKET.s3.REGION.CUSTOM_ENDPOINT/KEY. Note that the address from the CONNECTION and the source are concatenated. Change the address in your CONNECTION and the source in your mapping definition.
com.exasol.adapter.document.files.S3Uri.fromString(S3Uri.java:58)
com.exasol.adapter.document.files.S3FileLoader.(S3FileLoader.java:37)
com.exasol.adapter.document.files.S3FileLoaderFactory.getLoader(S3FileLoaderFactory.java:15)
com.exasol.adapter.document.files.FilesDocumentFetcherFactory.buildSegmentDescriptions(FilesDocumentFetcherFactory.java:59)
com.exasol.adapter.document.files.FilesDocumentFetcherFactory.buildDocumentFetcherForQuery(FilesDocumentFetcherFactory.java:39)
com.exasol.adapter.document.files.FilesQueryPlanner.planQuery(FilesQueryPlanner.java:33)
com.exasol.adapter.document.DocumentAdapter.runQuery(DocumentAdapter.java:122)
com.exasol.adapter.document.DocumentAdapter.pushdown(DocumentAdapter.java:107)
com.exasol.adapter.AdapterCallExecutor.executePushDownRequest(AdapterCallExecutor.java:109)
com.exasol.adapter.request.PushDownRequest.executeWith(PushDownRequest.java:55)
com.exasol.adapter.AdapterCallExecutor.executeAdapterCall(AdapterCallExecutor.java:26)
com.exasol.adapter.RequestDispatcher.processAdapterCall(RequestDispatcher.java:48)
com.exasol.adapter.RequestDispatcher.adapterCall(RequestDispatcher.java:33)
com.exasol.ExaWrapper.runSingleCall(ExaWrapper.java:100)
(Session: 1718395300191535104)
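To illustrate why the query fails, here is a minimal, self-contained sketch of the format check described by error E-S3VS-1. The regex and class name are illustrative approximations of the behavior, not the adapter's actual S3Uri.fromString implementation, and minio.internal.example.com is a hypothetical stand-in for a private endpoint:

```java
import java.util.regex.Pattern;

public class S3UriCheck {
    // Illustrative pattern approximating the formats named in the error message:
    // http(s)://BUCKET.s3.REGION.amazonaws.com/KEY or
    // http(s)://BUCKET.s3.REGION.CUSTOM_ENDPOINT/KEY.
    // Note the mandatory ".s3." between bucket and region.
    private static final Pattern VIRTUAL_HOSTED_STYLE = Pattern
            .compile("https?://([^./]+)\\.s3\\.([^./]+)\\.([^/]+)/(.*)");

    public static boolean isValid(final String uri) {
        return VIRTUAL_HOSTED_STYLE.matcher(uri).matches();
    }

    public static void main(final String[] args) {
        // Virtual-hosted-style address: accepted.
        System.out.println(isValid("https://my-bucket.s3.eu-central-1.amazonaws.com/data/*.json")); // true
        // Path-style address against a private endpoint (hypothetical host):
        // rejected, which is what triggers E-S3VS-1 in the stack trace above.
        System.out.println(isValid("https://minio.internal.example.com/test/metrics")); // false
    }
}
```

A path-style URL can never match because the bucket is a path segment, not a DNS label followed by ".s3.", so relaxing the check (or adding a path-style mode) would be required to support such endpoints.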

Remove upload driver part from user guide

Situation

The section "Upload the driver to BucketFS" can be confusing: for other Virtual Schemas we require a third-party driver to be uploaded, but this is not required for VS S3.

We should rename the section to "Access Files in BucketFS".

Acceptance Criteria

  • User guide is updated

Remove exclude for E-PK-CORE-53 dependencies.md file has outdated content

After setting the required Maven version to 3.8.7 or higher with PK ticket #444 (project-keeper release 2.9.6), users can remove the suppressed warning from the file .project-keeper.yml:

  - regex: "(?s)E-PK-CORE-53: The dependencies.md file has outdated content.*"

Affected repositories:

Dependency upgrade

See log messages from build job Dependency Check:

Excluded vulnerabilities:

Remove lombok

We should remove lombok from this project to reduce dependencies and complexity.

Vulnerabilities

Replace test config parsing library

Currently s3-document-files-virtual-schema parses test_config.yml with snakeyaml, which has known security vulnerabilities. As the file format is quite simple, we should replace snakeyaml with Java's properties format.

Example:

awsProfile: default_mfa
owner: [email protected]
s3CacheBucket: user-testdata-cache

Proposal: change the name of the expected file to test_config.properties. If this file does not exist but test_config.yml does, tests should fail with a helpful error message (e.g. "please migrate from YAML to properties").
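The proposal above can be sketched with java.util.Properties from the standard library. This is a minimal illustration, assuming the file names from the proposal; the class and method names are hypothetical, not the project's actual API:

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class TestConfig {
    // Reads test_config.properties from the given directory; fails with a
    // migration hint if only the legacy test_config.yml is present.
    public static Properties read(final Path directory) throws IOException {
        final Path propertiesFile = directory.resolve("test_config.properties");
        if (!Files.exists(propertiesFile) && Files.exists(directory.resolve("test_config.yml"))) {
            throw new IllegalStateException(
                    "Found test_config.yml but no test_config.properties. Please migrate from YAML to properties.");
        }
        final Properties properties = new Properties();
        try (Reader reader = Files.newBufferedReader(propertiesFile)) {
            properties.load(reader);
        }
        return properties;
    }
}
```

Conveniently, Properties.load accepts both `=` and `:` as key-value separators, so flat entries like `awsProfile: default_mfa` in the example above parse unchanged.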

Add cache for s3 uploads

The uploads to S3 are quite slow, especially for local runs where the upload goes over a DSL connection.

Solution:
Add a cache so that files are only uploaded when they have changed.
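One way to sketch such a change-detection cache is to remember a content hash per object key and skip the upload when the hash is unchanged. This is a minimal, stdlib-only illustration; the class name and method are hypothetical, and the actual implementation would also need to persist the hashes between test runs:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

public class UploadCache {
    // Maps S3 object key -> SHA-256 of the last uploaded content.
    private final Map<String, String> uploadedHashes = new HashMap<>();

    // Returns true if the file's content differs from what was last uploaded
    // under this key, recording the new hash for the pending upload.
    public boolean needsUpload(final String key, final Path file) throws IOException {
        final String hash = sha256(file);
        if (hash.equals(uploadedHashes.get(key))) {
            return false; // unchanged since last upload: skip
        }
        uploadedHashes.put(key, hash);
        return true;
    }

    private static String sha256(final Path file) throws IOException {
        try {
            final MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(Files.readAllBytes(file)));
        } catch (final NoSuchAlgorithmException exception) {
            throw new IllegalStateException(exception); // SHA-256 is always available
        }
    }
}
```

The upload code would then call needsUpload before each put and skip unchanged files, which helps most on slow links.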

Adapt Extension to Exasol v8

The findInstances command reads the virtual schema table SYS.EXA_ALL_VIRTUAL_SCHEMAS. The schema of this table changes in Exasol v8, so we need to adapt the query to also work with Exasol 8.

Precondition: Exasol Docker DB v8 is available that allows running Java UDFs. Without this we can't run integration tests.

Unreadable error message if region is missing

If no region is specified, the Virtual Schema query fails with an unreadable error:

SQL Error [22002]: VM error: F-UDF-CL-LIB-1127: F-UDF-CL-SL-JAVA-1002: F-UDF-CL-SL-JAVA-1013: 
com.exasol.ExaUDFException: F-UDF-CL-SL-JAVA-1080: Exception during run 
software.amazon.awssdk.core.exception.SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.
software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
software.amazon.awssdk.awscore.interceptor.HelpfulUnknownHostExceptionInterceptor.modifyException(HelpfulUnknownHostExceptionInterceptor.java:59)
software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.modifyException(ExecutionInterceptorChain.java:199)
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.runModifyException(ExceptionReportingUtils.java:54)
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.reportFailureToInterceptors(ExceptionReportingUtils.java:38)
software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:39)
software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193)
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:135)
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:161)
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:114)
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:169)
software.amazon.awssdk.core.internal
