
aepp's People

Contributors

amychen456, benedikt-buchert, cmenguy, filiuta, liqunc, pitchmuc, pmonjo, rezpe, skanjila, taulantracajm


aepp's Issues

Add label evaluation method to Policy module

Support policy evaluation of usage labels for an intended marketing action so that users of the package can handle policy enforcement in their Python code.

Add methods to the Policy class in policy.py to make requests to the /marketingActions/{{namespace}}/{{marketing action name}}/constraints endpoint of the Policy Service API (API reference)

There are 2 policy evaluation GET requests within the Policy Service API that have not yet been implemented in aepp:

  • GET Evaluate a core marketing action based on data usage labels
  • GET Evaluate a custom marketing action based on data usage labels

The two GET requests can be consolidated into one method by taking the namespace ("core" or "custom") as a parameter.

The POST requests for evaluating datasets against a marketing action are covered in issue #50
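A rough sketch of what the consolidated GET method could look like, assuming the Policy class exposes the same connector/endpoint attributes as other aepp modules; the method name is hypothetical and the duleLabels query parameter should be checked against the Policy Service API reference:

def evaluateMarketingActionLabels(self, marketingAction: str = None, labels: list = None, namespace: str = "core") -> dict:
    """Evaluate data usage labels against a core or custom marketing action."""
    if namespace not in ("core", "custom"):
        raise ValueError("namespace must be 'core' or 'custom'")
    path = f"/marketingActions/{namespace}/{marketingAction}/constraints"
    params = {"duleLabels": ",".join(labels)}  # comma-separated labels (assumed parameter name)
    return self.connector.getData(self.endpoint + path, params=params)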

getDataTypes method throws an error 'params' referenced before assignment

local variable 'params' referenced before assignment

Exception has occurred: UnboundLocalError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
local variable 'params' referenced before assignment

This happens because kwargs is empty in my case; params should be defined before that check.

if kwargs.get("properties", None) is not None:
    params = {"properties": kwargs.get("properties", "title,$id")}
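A possible fix (sketch) is to initialize params before the conditional so it is always bound by the time the request is made:

params = {}
if kwargs.get("properties", None) is not None:
    params = {"properties": kwargs.get("properties", "title,$id")}
res = self.connector.getData(self.endpoint + path, params=params)  # downstream call (assumed shape)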

Support enabling dataset for profile with inserts

Currently the only function to enable a dataset for Profile, enableDatasetProfile, appears to be restricted to upserts. The upsert tag should probably be optional so one can enable a dataset for inserts or upserts depending on the use case.
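A hedged sketch of what the signature could look like; the isUpsert tag name, the JSON Patch shape, and the helper that issues the PATCH are all assumptions to verify against the Catalog documentation:

def enableDatasetProfile(self, datasetId: str = None, upsert: bool = False) -> dict:
    values = ["enabled:true"]
    if upsert:
        values.append("isUpsert:true")  # assumed upsert tag; verify against the Catalog docs
    patch = [{"op": "add", "path": "/tags/unifiedProfile", "value": values}]
    return self._patchDataset(datasetId, patch)  # hypothetical helper that issues the PATCH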

Calling Catalog.enableDatasetProfile doesn't seem to do anything

The call to enableDatasetProfile seems to succeed, but it appears to have no effect on the dataset: I've verified on multiple occasions that the "Profile" toggle is not flipped on even after calling it. It looks like some kind of bug; I haven't looked at the root cause.

Error handling in `getSchemas`

Right now, if you call getSchemas and there is an authentication issue, it fails with an intermediate stack trace that obfuscates what is actually happening.

Add a driver for processing and working with spark dataframes

The goal is to work with really large datasets and extract the results of large queries into a Spark dataframe. This would allow us to use PQS and Spark together for large-scale feature transformation and processing. We'll need to do a little design around this before implementation.
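A rough sketch of the direction, assuming the data is read from Query Service over its PostgreSQL-compatible interface; host, credentials, and the query below are placeholders, not confirmed values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aepp-pqs").getOrCreate()
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://<pqs-host>:80/prod:all?sslmode=require")  # placeholder connection string
      .option("query", "SELECT * FROM my_dataset LIMIT 1000")                     # placeholder query
      .option("user", "<ims-org-id>")
      .option("password", "<access-token>")
      .load())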

Support stage environment

Right now everything seems to be hard-coded to use the prod environment. This includes:

  • The endpoints in config.py
  • The JWT authentication with IMS in connector.py

We should support additional environments like stage or int. Ideally it's something that could be added in the config.json file or directly via the aepp.configure call.

We will need to make changes to config.py and connector.py to fetch the environment at runtime and adjust the IMS and Experience Platform URLs accordingly, to ims-na1-stg1.adobelogin.com and experience-stage.adobe.com respectively.
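A minimal sketch of an environment switch; the prod values are the current defaults, the stage values come from above, and the structure is only a suggestion for how config.py could be reorganized:

ENDPOINTS = {
    "prod": {"ims": "ims-na1.adobelogin.com", "platform": "https://platform.adobe.io"},
    "stage": {"ims": "ims-na1-stg1.adobelogin.com", "platform": "https://experience-stage.adobe.com"},
}

def endpoints_for(environment: str = "prod") -> dict:
    """Return the IMS and Experience Platform hosts for the requested environment."""
    return ENDPOINTS[environment]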

Support service tokens

Currently the library is very much tailored to JWT-based authentication.

However it would be really useful to also support service tokens where you just have a client ID, client secret and auth code. This would allow this library to be used directly in services, and some APIs are also only accessible via service tokens.
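For reference, a sketch of the kind of token exchange this would involve; the endpoint and field names are assumptions about the IMS service-token flow, not aepp's API:

import requests

def get_service_token(client_id: str, client_secret: str, auth_code: str) -> str:
    resp = requests.post(
        "https://ims-na1.adobelogin.com/ims/token/v1",  # assumed IMS token endpoint
        data={
            "grant_type": "authorization_code",
            "client_id": client_id,
            "client_secret": client_secret,
            "code": auth_code,
        },
    )
    resp.raise_for_status()
    return resp.json()["access_token"]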

Batch data ingestion of a csv file - `uploadSmallFile`

Reading a .csv file into a Python dictionary for data upload does not work at the moment. This may be due to the flat structure of CSV files.

Request to support uploading a CSV file by just taking the localFilePath as input, with reading and converting the file into a hierarchical structure handled by the wrapper.
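A sketch of what the wrapper could do, assuming uploadSmallFile accepts a list of dicts; the parameter names are taken from the issue titles here and may differ from the actual signature:

import csv

def uploadSmallCsvFile(self, batchId: str = None, datasetId: str = None,
                       filePath: str = None, localFilePath: str = None) -> dict:
    with open(localFilePath, newline="") as f:
        rows = list(csv.DictReader(f))  # each CSV row becomes a flat dict
    # nesting dotted column names into a hierarchical structure would happen here
    return self.uploadSmallFile(batchId=batchId, datasetId=datasetId, filePath=filePath, data=rows)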

Support dataset system labels and tags

Currently in the createDataSets method there is no way to pass labels and tags programmatically. You can most likely do it with the data parameter (untried), but that is cumbersome as it requires passing the entire payload.

We would like to add extra parameters to this function to add system_labels: list[str] and tags: dict[str, list[str]] so it's completely transparent and easy to manipulate.

Possibly we can go further and abstract that fully for some use cases; for example, if I want to create a dataset that is profile-enabled, it would be nice to just call createDataSets(..., profile_enabled=True).

Similarly, the ability to update system labels and tags could be added to the datasets.py module.
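A sketch of the proposed surface; the tag layout follows the snippets elsewhere in these issues, while the schemaRef shape, the labels key, and the connector call are assumptions to double-check:

def createDataSets(self, name: str = None, schemaId: str = None,
                   system_labels: list = None, tags: dict = None,
                   profile_enabled: bool = False, **kwargs) -> dict:
    data = {
        "name": name,
        "schemaRef": {"id": schemaId, "contentType": "application/vnd.adobe.xed+json;version=1"},
        "tags": dict(tags or {}),
    }
    if system_labels:
        data["labels"] = list(system_labels)  # hypothetical key for system labels
    if profile_enabled:
        data["tags"]["unifiedProfile"] = ["enabled:true"]
    return self.connector.postData(self.endpoint + "/dataSets", data=data)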

Create a utils module for misc kind of utilities

It would be useful to have a module for miscellaneous helper functions. For example, functions we already use:

  • A function to print a link to the UI for a given resource
  • A function that returns the export timestamp for a given destination run.
  • etc, many other ideas.

GET Schema API is not working when 'start' parameter is provided

Expected Behaviour

Get the schema list successfully from a sandbox

Actual Behaviour

Schemas are not returned

Reproduce Scenario (including but not limited to)

We are trying to use the aepp wrapper to automate the provisioning of schemas, etc.
However, we noticed that the wrapper always supplies a start parameter: when start is not provided in the request, the default is set to 0, so either way it gets sent. In this case the result is an empty list, because the API then expects other parameters such as orderby. Shouldn't the default behaviour be to not provide the start parameter at all?

Steps to Reproduce

Use:
schemaConnection = schema.Schema()
schemas = schemaConnection.getSchemas()
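A possible fix (sketch): only include start in the query parameters when the caller explicitly provides it, instead of defaulting it to 0 (the resource path below is an assumption):

params = {}
if kwargs.get("start") is not None:
    params["start"] = kwargs.get("start")
res = self.connector.getData(self.endpoint + "/tenant/schemas", params=params)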


Patching datasets seems to have invalid content type

Trying to enable a dataset for profile ingestion with enableDatasetProfile in catalog.py is failing due to an invalid content type:

Out[248]: {'type': '/placeholder/type/uri', 'status': 400, 'title': 'BadRequestError', 'detail': 'Content-type does not match json-patch body format, value of Content-type should be application/json-patch+json.'}
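A possible fix (sketch) is to send the PATCH with the JSON Patch content type the error message asks for; the connector call shape and the example patch body are assumptions:

headers = {"Content-Type": "application/json-patch+json"}
patch = [{"op": "add", "path": "/tags/unifiedProfile", "value": ["enabled:true"]}]  # example JSON Patch body
res = self.connector.patchData(self.endpoint + f"/dataSets/{datasetId}", data=patch, headers=headers)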

streamMessage will not raise exception when data is not None but not a dict

Expected Behaviour

In the streamMessage method of the DataIngestion class, data is expected to be a dict. There is a check as follows:

if data is None and type(data) != dict:
    raise Exception("Require a dictionary to be send for ingestion")

The exception tells the user that it expects data to be a dictionary.

Actual Behaviour

Rereading the code,

if data is None and type(data) != dict:
    raise Exception("Require a dictionary to be send for ingestion")

When data is not None, the condition evaluates to False regardless of data's type, so no exception is raised when data is, say, a string.

Reproduce Scenario (including but not limited to)

One common pitfall is to pass data as a string instead of a dict.
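A possible fix (sketch) is to change the boolean operator so anything that is not a dict, including None, is rejected:

if data is None or not isinstance(data, dict):
    raise Exception("Require a dictionary to be sent for ingestion")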

In queryservice module for createSchedule, the `sql` statement should not be passed if template ID is passed

Currently createSchedule requires the sql param to be passed. However, that seems incorrect because a templateId can be passed instead, in which case sql is not needed.

In fact currently specifying both sql and templateId causes an error in PQS, see below:

{'message': 'requirement failed: Only one of sql and templateId must be defined',
 'statusCode': 400}

We should accept a call to this function as valid if templateId is passed without sql. In the meantime we can still use the full object.
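A sketch of the requested validation, which would require exactly one of sql or templateId:

# Inside createSchedule, before building the payload:
if (sql is None) == (templateId is None):
    raise ValueError("Provide exactly one of 'sql' or 'templateId'")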

Support UPS Exports

It would be very useful to support UPS exports that are available via the API, for profiles, events, and profile+events.

For example to export profile data aggregated with events:

{
  "filter": {
    "segmentQualificationTime": {
      "startTime": "2022-12-04T00:00:00Z",
      "endTime": "2023-01-04T00:00:00Z"
    },
    "emptyProfiles": false
  },
  "additionalFields": {
    "eventList": {
      "filter": {
        "fromIngestTimestamp": "2022-12-04T00:00:00Z",
        "toIngestTimestamp": "2023-01-04T00:00:00Z"
      }
    }
  },
  "destination": {
    "datasetId": "{{upsExportDataset}}",
    "segmentPerBatch": false
  },
  "schema": {
    "name": "_xdm.context.profile"
  },
  "properties": {
      "checkBatchStatusForSuccess": true
  }
}

I can help with that if you are open to it; I've been intimately familiar with this API for the past few months, and there are a few caveats around best practices for what the user should provide.
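As a starting point, a thin-wrapper sketch around the Profile export jobs endpoint (POST /export/jobs); the method name and connector call are hypothetical:

def createProfileExportJob(self, exportDefinition: dict = None) -> dict:
    """Submit a UPS export job, e.g. the payload shown above."""
    return self.connector.postData(self.endpoint + "/export/jobs", data=exportDefinition)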

Support ingesting small parquet files

Currently the small-file API support in ingestion seems incomplete. The uploadSmallFile method expects data of type Union[list, dict], which works fine for JSON, but for the Parquet format you would be passing bytes. I couldn't figure out how to get this working with Parquet data, so I ended up using JSON. Either the prototype of the function needs to change, since the notion of multiline isn't really applicable to Parquet, or we need to change the code to fully support a Parquet binary payload.
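For illustration, this is how the call could look if the method were extended to accept raw bytes for Parquet; the parameter names follow the issue text and are not confirmed:

with open("part-0.parquet", "rb") as f:
    payload = f.read()  # raw Parquet bytes rather than a JSON-serializable object
ingestion.uploadSmallFile(batchId=batch_id, datasetId=dataset_id,
                          filePath="part-0.parquet", data=payload)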

Add dataset evaluation method to policy module

Support policy evaluation of datasets against a marketing action so that users of the package can handle policy enforcement in their Python code.

Add a method to the Policy class in policy.py to make POST requests to the /marketingActions/{{namespace}}/{{marketing action name}}/constraints endpoint of the Policy Service API (API reference)

There are 2 policy evaluation requests in the Policy Service API that have not yet been implemented in aepp:

  • POST Evaluate a core marketing action based on datasets and/or fields
  • POST Evaluate a custom marketing action based on datasets and/or fields

The two POST requests can be consolidated into one method by taking the namespace ("core" or "custom") as a parameter.

A method to implement the GET requests for evaluating a set of DULE labels against a marketing action is covered in Issue #51
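Analogous to the GET consolidation in the other issue, a sketch of the POST variant; the entity-list body shape should be checked against the Policy Service API reference and the method name is hypothetical:

def evaluateMarketingActionDatasets(self, marketingAction: str = None, entities: list = None,
                                    namespace: str = "core") -> dict:
    """Evaluate datasets and/or fields against a core or custom marketing action."""
    path = f"/marketingActions/{namespace}/{marketingAction}/constraints"
    return self.connector.postData(self.endpoint + path, data=entities)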

Accept schema ID for enabling Schema for Real Time

The enableSchemaForRealTime method in the schema module currently only accepts the meta:altId attribute for a schema; it does not support the $id attribute at the moment.

It could be enhanced to accept either meta:altId or $id for enabling a schema for real time.
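A sketch of how the method could accept both forms, assuming the Schema Registry also accepts a URL-encoded $id in the resource path (to be verified):

from urllib.parse import quote_plus

def enableSchemaForRealTime(self, schemaId: str = None) -> dict:
    if schemaId.startswith("https://"):
        schemaId = quote_plus(schemaId)  # $id form must be URL-encoded in the path (assumption)
    path = f"/tenant/schemas/{schemaId}"
    # ... the existing meta:altId-based logic continues unchanged ...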

Exception in flowservice.getRuns if there are no runs

If there are no runs returned by getRuns, it fails because _links won't be set in the response:

File /usr/local/lib/python3.10/site-packages/aepp/flowservice.py:756, in FlowService.getRuns(self, limit, n_results, prop, **kwargs)
    754 res: dict = self.connector.getData(self.endpoint + path, params=params)
    755 items: list = res["items"]
--> 756 nextPage = res["_links"].get("next", {}).get("href", "")
    757 while nextPage != "" and len(items) < float(n_results):
    758     token: str = res["_links"]["next"].get("href", "")

KeyError: '_links'

It should just return an empty array.
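A possible fix (sketch): tolerate a missing _links key so pagination simply stops:

nextPage = res.get("_links", {}).get("next", {}).get("href", "")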

Profile and Identity enabled is not properly passed in createDataSets

Expected Behaviour

When creating a new dataset, Profile and Identity are enabled.

Actual Behaviour

When creating a new dataset, Profile and Identity are not enabled.

Reproduce Scenario (including but not limited to)

When calling createDataSets I'm passing profileEnabled=True, identityEnabled=True, but in the UI I see that they are not enabled.

Sample Code that illustrates the problem

connection.createDataSets(name=name, schemaId=schema_id, profileEnabled=True, identityEnabled=True)

In catalog.py, the following part is causing the issue:

if profileEnabled:
    data['tags']["unifiedProfile"] = ["enabled: true"]
if identityEnabled:
    data['tags']["unifiedIdentity"] = ["enabled: true"]

There shouldn't be any space in "enabled: true"; it should be "enabled:true".
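With the space removed, the relevant part of catalog.py would read:

if profileEnabled:
    data['tags']["unifiedProfile"] = ["enabled:true"]
if identityEnabled:
    data['tags']["unifiedIdentity"] = ["enabled:true"]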

I created the following object:

tags = {
        "unifiedProfile": [
            "enabled:true"
        ],
        "unifiedIdentity": [
            "enabled:true"
        ]
    }

Then I passed it as a parameter to createDataSets and verified in the UI that Profile and Identity are enabled:

connection.createDataSets(name=name, schemaId=schema_id, tags=tags)


Batch data ingestion of a json file - `uploadSmallFile`

Current:
The uploadSmallFile method currently takes a Python dictionary as the input for the data to be ingested.

Suggested:
Add an additional parameter, localFilePath, that takes the path of the JSON file and uses json.load() to read the file into a dictionary. Basically, the wrapper would handle the file processing part.

Error in retrieving datasets

Expected Behaviour

The expected behaviour is to get dataset labels when invoking the datasets.Datasets() method. I have added the config correctly, as the Schema module is working fine.

Actual Behaviour

seeing this error:
AttributeError: module 'aepp.datasets' has no attribute 'Datasets'

On invoking the help method I see the class Datasets. Not sure what I'm doing wrong.

Add support for triggering dataset export in destinations

This functionality is being built internally right now, and once it is available we would like to update the destination module to trigger on-demand dataset exports.

Right now the call to create a destination just takes a raw dictionary destinationObj: dict, but it would be nice to be able to programmatically pass just a dataset_id: str and not have to deal with complicated payloads.

Return named tuples or objects instead of raw JSON responses

Currently most functions just return a raw JSON response, and you have to manually inspect it to retrieve what you need. It would be nice to instead return a specific object from which you can directly extract known fields.

For example, when creating a dataset, to get the dataset ID you need to do something like dataset_response[0].split("/")[-1]; we would like to change this so we can just do dataset_response.dataset_id.

Another example: in the catalog module, to get the table name for a dataset we have to do response[dataset_id]["tags"]["adobe/pqs/table"][0], but it would be much easier to use response.table_name.
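A sketch of what a typed result could look like, with extraction logic mirroring the snippets above; the class and field names are only suggestions:

from dataclasses import dataclass

@dataclass
class CreateDatasetResult:
    dataset_id: str

    @classmethod
    def from_response(cls, response: list) -> "CreateDatasetResult":
        # e.g. ["@/dataSets/63e2..."] -> "63e2..."
        return cls(dataset_id=response[0].split("/")[-1])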

Passing the sandbox programmatically

Currently, to pass the sandbox, the documentation mentions using kwargs like this:

mySchemaConnection1 = schema.Schema({"x-sandbox-name":"mySandbox1"})

However, in the orgs I have tried this on, it gives an error about having an invalid source, even when just using the default prod sandbox. I can only get this to work when not passing the sandbox at all, so that it defaults to the default sandbox.

We would like to add the following:

  • Ability to pass the sandbox as a full-fledged parameter so you could do for example schema.Schema(sandbox_name=foo)
  • Error handling to provide more details why the source is not working

Exception - "The schema must include an 'allOf' attribute..." when creating a schema

Hi, I'm trying to create a schema using aepp and am getting the following exception:
474     raise TypeError("Expecting a dictionary")
475 if "allOf" not in schema.keys():
--> 476     raise Exception(
477         "The schema must include an 'allOf' attribute (a list) referencing the $id of the base class the schema will implement."
478     )

Exception: The schema must include an 'allOf' attribute (a list) referencing the $id of the base class the schema will implement.

I am creating the schema by running getSchema in one sandbox and then importing it into another using createSchema. I made sure the field groups and other dependencies pre-exist in the new sandbox.

Any ideas on how I can get the allOf attribute that is needed to create the schema?
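One thing worth trying (sketch): export the schema in its non-expanded form so the allOf references are preserved; whether full=False achieves this in aepp is an assumption to verify, and the connection names below are placeholders:

definition = sourceSchemaConn.getSchema(schema_id, full=False)  # non-expanded form keeps "allOf" (assumption)
newSchema = targetSchemaConn.createSchema(definition)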

Creating a descriptor is not working due to missing xdm:property

Currently if I try to create a descriptor with createDescriptor in schema.py, it fails because of the missing xdm:property field. See the error below:

Out[262]: {'type': 'http://ns.adobe.com/aep/errors/XDM-4000-400', 'title': 'Validation error', 'status': 400, 'report': {'registryRequestId': 'd754ef24-8a43-4625-9742-3cd86156fce8', 'timestamp': '02-03-2023 08:02:43', 'detailed-message': 'An error occurred validating the schema.', 'sub-errors': [{'path': '$', 'type': 'required', 'arguments': ['xdm:property'], 'message': '$.xdm:property: is missing but it is required'}]}, 'detail': 'An error occurred validating the schema.'}

The only way to get this function working now seems to be passing the raw object with descriptorObj, but we should modify the function so it works with the individual parameters.
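As a workaround in the meantime, the raw object needs to include xdm:property; for an identity descriptor that looks roughly like this, with all values being placeholders:

descriptor = {
    "@type": "xdm:descriptorIdentity",
    "xdm:sourceSchema": "https://ns.adobe.com/{TENANT_ID}/schemas/{SCHEMA_ID}",  # schema $id (placeholder)
    "xdm:sourceVersion": 1,
    "xdm:sourceProperty": "/_{TENANT_ID}/emailAddress",                          # field path (placeholder)
    "xdm:namespace": "Email",
    "xdm:property": "xdm:code",
    "xdm:isPrimary": True,
}
res = mySchemaConnection.createDescriptor(descriptorObj=descriptor)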
