datatamer / tamr-client

Programmatically interact with Tamr

Home Page: https://tamr-client.readthedocs.io

License: Apache License 2.0

Languages: Python 99.99%, JavaScript 0.01%
Topics: api, api-client, python, tamr

tamr-client's People

Contributors

abafzal, andy-l, caspiana1, charlottemoremen, derrickrice, ianbakst, juliamcclellan, keziah-tamr, laferrieren, lordluen, mollysacks, nbateshaus, olivito, pcattori, skalish, tiems90

tamr-client's Issues

Dataset.update_records should stream updates

🙋 feature request

At the moment Dataset.update_records() does a

  body = "\n".join([json.dumps(r) for r in records])

which materializes all the updates as one massive string.

🤔 Expected Behavior

Updates from a streaming source, e.g. a database, don't need to be materialized en-route to Unify.

😯 Current Behavior

Updates from a streaming source, e.g. a database, get materialized as one giant string en-route to Unify.

๐Ÿ’ Possible Solution

requests is able to stream from a generator, so this could be changed to

    def _stringify_updates(updates):
        for update in updates:
            yield (json.dumps(update) + "\n").encode("utf-8")

    self.client.post(
        self.api_path + ":updateRecords",
        headers={"Content-Encoding": "utf-8"},
        data=_stringify_updates(records)
    )

🔦 Context

Materializing records doesn't scale, and is slow.
I'm reading records from a database (this is simplified):

    from contextlib import closing

    import sqlalchemy

    def load_data_from_database():
        engine = sqlalchemy.create_engine(database.get_uri())
        with closing(engine.connect().execution_options(stream_results=True)) as conn:
            query = sqlalchemy.sql.text(sql)
            query = query.bindparams(**sql_params)
            with closing(conn.execute(query)) as cursor:
                my_dataset.update_records(cursor)

Note the stream_results=True: the cursor returns one record (really a few) at a time so I don't use crazy amounts of RAM. This is defeated by the Python client materializing everything.

💻 Examples

See above.

Project.unified_dataset should be a @property

๐Ÿ› bug report

Most sub-resources are properties:

  • Client.projects
  • Client.datasets
  • Dataset.attributes

etc.

An exception to this rule is Project.unified_dataset, which is a method:

  • Project.unified_dataset()

This makes coding awkward, because I need to keep track of whether sub-resources should be accessed as properties or methods.

🤔 Expected Behavior

To get the unified dataset of a project:

>>> project = unify.projects.by_resource_id("1")
>>> ud = project.unified_dataset
>>> print(ud.relative_id)
datasets/3

😯 Current Behavior

>>> project = unify.projects.by_resource_id("1")
>>> ud = project.unified_dataset
>>> print(ud.relative_id)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'relative_id'

๐Ÿ’ Possible Solution

Add @property to Project.unified_dataset
NB: THIS WOULD BE A BREAKING CHANGE
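
A minimal sketch of the change, assuming the property can reuse the existing fetch logic (the body below is illustrative, not the actual source):

@property
def unified_dataset(self):
    """Unified dataset for this project (now a property; callers drop the parentheses)."""
    alias = self.api_path + "/unifiedDataset"
    resource_json = self.client.get(alias).successful().json()
    return Dataset.from_json(self.client, resource_json, alias)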

🔦 Context

I have to remember that unified_dataset is special - it is a resource, but is accessed as a method, not a property.

💻 Code Sample

See above.

๐ŸŒ Your Environment

Software            Version(s)
tamr-unify-client   0.5.0-dev
Tamr Unify server   v2019.10.0
Python              3.6.8
Operating System    MacOS X 10.14.4

Issues raised during release process for 0.3.0

  • Retag releases without leading v
  • Consolidate dev deps into dev-requirements.txt or dev extras (#48)
  • Travis CI should publish commits tagged with release version (#49)
  • flake8-import-order (#50)
  • Rewrite RELEASE.md with updated changes (#51)
  • Simplify README: replace header links with just links (#52)
  • Issue template (#55)
    • Make sure the issue doesn't already exist; if it does, consider a 👍 on the existing issue
    • Feature request
    • Bug report
  • PR template (#57)
    • docs
    • CHANGELOG updates
  • Contributors guide (#58)
    • Instructions/docs for installing latest from source
    • include push -f preference note
    • install dev deps

Python version (and any other requirements) should be specified more visibly

Got the following error after running the quickstart example in the docs:

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from tamr_unify_client import Client
  File "/home/nikki/.local/lib/python2.7/site-packages/tamr_unify_client/__init__.py", line 1, in <module>
    from tamr_unify_client.client import Client
  File "/home/nikki/.local/lib/python2.7/site-packages/tamr_unify_client/client.py", line 69
    return f"{method} {url} : {response.status_code}"
                                                    ^
SyntaxError: invalid syntax

The issue is that I'm running Python 2.7, but 3.6+ is required. Maybe we could add a requirements section to the docs and GitHub to help make this more visible?
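
Beyond documentation, one way to surface the requirement is to declare it in packaging metadata so pip refuses to install under older interpreters. A sketch (the project's actual setup.py may differ):

from setuptools import find_packages, setup

setup(
    name="tamr-unify-client",
    packages=find_packages(),
    python_requires=">=3.6",  # pip fails fast with a clear message instead of a SyntaxError at import time
)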

Access to Attributes of a Dataset

🙋 feature request

Attribute resources, available individually or as a collection under Dataset.
https://docs.tamr.com/reference#attribute-types

🤔 Expected Behavior

Access attributes collection as a sub-resource of any dataset.
Access an attribute as a resource.

😯 Current Behavior

No way to get attributes.

๐Ÿ’ Possible Solution

Implement it.

🔦 Context

I need to know attribute types in order to do intelligent conversion between Unify values and values in the outside world.

In particular, I need to identify attributes that use Unify's geometry representation so I can convert between those and the Python Geo Interface.

💻 Examples

dataset = client.datasets().by_name("my_dataset")
for attr in dataset.attributes():
    do_something_with(attr)

attr = client.datasets().by_name("my_dataset").attributes().by_name("my_attribute")
print(attr.name)
print(attr.description)
print(attr.nullable)

Should produce something like:

"my_attribute"
"Description of my_attribute"
True

Unable to fetch published_clusters

๐Ÿ› bug report

Unable to retrieve published_clusters from a mastering project.

😯 Current Behavior

  1. Publish clusters via the UI
  2. Fetch project from the client, convert to mastering, and get published_clusters
  3. Attempt clusters.status() raises a 404 exception
  4. Attempt clusters.records() gives a list with one element that is an error dict.

>>> project = client.projects.by_external_id('idogs')
>>> project.name
'idogs'
>>> project = project.as_mastering()
>>> clusters = project.
project.api_path              project.client                project.external_id           project.high_impact_pairs(    project.pairs(                project.resource_id
project.as_categorization(    project.data                  project.from_data(            project.name                  project.published_clusters(   project.type
project.as_mastering(         project.description           project.from_json(            project.pair_matching_model(  project.relative_id           project.unified_dataset(
>>> clusters = project.published_clusters()
>>> clusters.status()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/models/dataset/resource.py", line 75, in status
    status_json = self.client.get(self.api_path + "/status").successful().json()
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/client.py", line 19, in successful
    self.raise_for_status()
  File "/home/drice/.local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://localhost:4443/api/versioned/v1/projects/3/publishedClusters/status
>>> data = list(clusters.records())
>>> print(len(data))
1
>>> pp(data[0])
{'causedBy': None,
 'class': 'javax.ws.rs.NotFoundException',
 'message': 'HTTP 404 Not Found',
 'service': 'pubapi',
 'stackTrace': ['org.glassfish.jersey.server.ServerRuntime$2::run::323',
                'org.glassfish.jersey.internal.Errors$1::call::271',
                'org.glassfish.jersey.internal.Errors$1::call::267',
                'org.glassfish.jersey.internal.Errors::process::315',
                'org.glassfish.jersey.internal.Errors::process::297',
                'org.glassfish.jersey.internal.Errors::process::267',
                'org.glassfish.jersey.process.internal.RequestScope::runInScope::317',
                'org.glassfish.jersey.server.ServerRuntime::process::305',
                'org.glassfish.jersey.server.ApplicationHandler::handle::1154',
                'org.glassfish.jersey.servlet.WebComponent::serviceImpl::473',
                'org.glassfish.jersey.servlet.WebComponent::service::427',
                'org.glassfish.jersey.servlet.ServletContainer::service::388',
                'org.glassfish.jersey.servlet.ServletContainer::service::341',
                'org.glassfish.jersey.servlet.ServletContainer::service::228',
                'io.dropwizard.jetty.NonblockingServletHolder::handle::49',
                'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1655',
                'io.dropwizard.servlets.ThreadNameFilter::doFilter::34',
                'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
                'io.dropwizard.jersey.filter.AllowedMethodsFilter::handle::45',
                'io.dropwizard.jersey.filter.AllowedMethodsFilter::doFilter::39',
                'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
                'com.palantir.websecurity.filters.JerseyAwareWebSecurityFilter::doFilter::63',
                'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
                'com.serviceenabled.dropwizardrequesttracker.RequestTrackerServletFilter::doFilter::49',
                'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
                'com.tamr.zookeeper.dw.servicestate.ServiceStateFilter::doFilter::73',
                'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642'],
 'status': 404}
>>>

🤔 Expected Behavior

  • Retrieve clusters successfully
  • In the case that cluster records cannot be retrieved, raise an exception rather than returning an error record as the single result.

🔦 Context

Unable to use the API to fetch Tamr Unify's results.

๐ŸŒ Your Environment

Software            Version(s)
Python              3.6
Tamr Unify server   2019.003.0 build 3a89900beb
tamr-unify-client   4.0-dev
Operating System    Ubuntu

High impact pair entity ID requires a scan of dataset

🙋 feature request

Currently, using the high impact pair records requires that the client find, fetch, and scan the entire source dataset for each record referenced by a high impact pair.

More friendly options would be:

  • (Best) Have the pair information populated directly into the results of high_impact_pairs().records()
  • (Good) Be able to fetch a single record by tamr_id from a dataset.

🔦 Context

high_impact_pairs is far less performant / scale-conscious in the API than it is in the UI. The UI shows high impact pairs efficiently, whereas the API requires a full scan of the source datasets.
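
For illustration, the client-side lookup this forces today looks roughly like the sketch below; the field names ("tamr_id", "entityId1", "entityId2") and the shape of pair records are assumptions for illustration, not confirmed API fields.

def find_record(dataset, record_id):
    # Full scan of the dataset for every lookup: O(dataset size) per pair member.
    for record in dataset.records():
        if record.get("tamr_id") == record_id:  # "tamr_id" is an assumed key name
            return record
    return None

for pair in project.high_impact_pairs().records():
    left = find_record(source_dataset, pair["entityId1"])   # assumed field name
    right = find_record(source_dataset, pair["entityId2"])  # assumed field name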

v0.3.0 client doesn't properly override base_path if specified

I tried setting base_path='api' so I could use the request method for non-versioned APIs, but it doesn't seem to be picked up. See below:

Python 3.7.0 (default, Nov 10 2018, 18:44:49) 
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tamr_unify_client import Client
>>> from tamr_unify_client.auth import UsernamePasswordAuth
>>> catmando_host = 'localhost'
>>> catmando_port = 9030
>>> tamr_user = 'admin'
>>> tamr_pw = 'dt'
>>> auth=UsernamePasswordAuth(tamr_user, tamr_pw)
>>> catmando_client = Client(auth, host=catmando_host, port=catmando_port, base_path='api')
>>> response = catmando_client.request( 'GET', 'service/health', headers={'Accept': 'application/json'})
>>> response
<Response [404]>
>>> response.url
'http://localhost:9030/service/health'

More investigation reveals that it is a problem with urllib.parse.urljoin:

[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urljoin
>>> urljoin('http://localhost:9030'+'/'+'api','service/health')
'http://localhost:9030/service/health'
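
For reference, urljoin drops the last path segment of a base URL that does not end in a slash, which is exactly what happens above. Ensuring the base path ends with '/' (and the relative path does not start with '/') yields the intended URL:

>>> from urllib.parse import urljoin
>>> urljoin('http://localhost:9030' + '/' + 'api', 'service/health')
'http://localhost:9030/service/health'
>>> urljoin('http://localhost:9030' + '/' + 'api' + '/', 'service/health')
'http://localhost:9030/api/service/health'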

Progress callback for synchronous actions

💬 RFC

Many of the Python client actions take a long time. While synchronous behavior is a sensible default, it would be significantly improved by a simple callback that provides some indication of progress.

The progress is visible in the Unify UI, so I imagine the data is available in the job/operation data. It could be extracted and passed to a callback.

🔦 Context

Provide some information about how long something has taken and how far it has gotten, thereby offering some insight into how much longer it might take.

💻 Examples

dataset = project.unified_dataset()

def progress_callback(status, start_time, done, total):
    print(f"{status}: {done}/{total} (started at {start_time})")  # or update a progress bar, etc.

dataset.refresh(progress_callback=progress_callback)

`request()` with an absolute path resolves incorrectly

When request() is called with an absolute path (e.g. /api/service/health), it treats the path as relative to base_path instead of correctly resolving the absolute path.

Example:
client.request(method, "/api/service/health")

Desired result:
Request is made to http://unify-host:9100/api/service/health

Actual result:
Request is made to http://unify-host:9100/api/versioned/v1//api/service/health
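
For comparison, urllib.parse.urljoin resolves an absolute path against the host rather than the base path, so it would produce the desired result here (a minimal illustration):

>>> from urllib.parse import urljoin
>>> urljoin('http://unify-host:9100/api/versioned/v1/', '/api/service/health')
'http://unify-host:9100/api/service/health'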

Create new dataset

💬 RFC

Trying to upload new data. I would prefer to create a new dataset, but the Python client doesn't support that. Instead, I need to use Dataset.update_records on an existing dataset.

I would like to be able to create a new dataset.

🔦 Context

Concerns with updating into an existing dataset:

  • Does not provide isolation of batches of data. With different datasets, I can map a given record to a particular load operation.
  • Different datasets allow for partitioning, a common data management technique. Partitioning would allow for deletion of a single dataset and allow the dataset name to be an indicator of which data has already been uploaded (imagine that I have a system of uploading data daily, with a name based on the date).
  • Records within a single dataset require unique recordId values, which requires additional management.
  • Making a larger dataset causes later operations to be more costly. This further hurts the performance in the context of #66, for example.

It appears that the server's API does support the creation of new datasets, but this isn't provided in the python client. https://docs.tamr.com/reference#create-dataset

๐Ÿ’ Possible Solution

def create_dataset(unify, dataset_config):
    """
    Create a dataset in Unify
    
    :param unify: Unify Client
    :param dataset_config: Dataset Configuration
    :return: the created Dataset
    """
    from tamr_unify_client.models.dataset.resource import Dataset
    data = unify.post(unify.datasets.api_path, json=dataset_config).successful().json()
    return Dataset(unify, data, data["relativeId"])
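
Hypothetical usage, with a dataset_config shaped after the Create Dataset docs linked above (field names are assumptions):

dataset_config = {
    "name": "my_new_dataset",                 # assumed field names, per the linked docs
    "description": "Created via the Python client",
    "keyAttributeNames": ["primary_key"],
}
dataset = create_dataset(unify, dataset_config)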

Conflation between 'api_path', 'relative_id' / 'relativeId', and BaseResource ctor 'alias'

💬 RFC

These terms are used nearly interchangeably:

  • Object.relative_id (which reads data["relativeId"])
  • Object.api_path
  • argument alias on BaseResource.__init__

🔦 Context

"relative ID" is an unfamiliar term.

I understand relative path, and api path. Those describe things that I'm familiar with.

If I had to vote for one, I'd request api_path, since it is tacked onto the end of the versioned API path.

Further confusion: resource.api_path is sometimes valid and populated yet different from resource.relative_id. This has to do with the path the resource was fetched from versus the canonical path the resource lives at; unified datasets are one example.

>>> ds = project.unified_dataset()
>>> ds.api_path
'projects/3/unifiedDataset'
>>> ds.relative_id
'datasets/58'

How is this meant to be used?

💻 Examples

Before:

>>> client.datasets.by_relative_id('datasets/58')
<tamr_unify_client.models.dataset.resource.Dataset object at 0x7fe17cddee80>

After:

>>> client.datasets.by_api_path('datasets/58')
<tamr_unify_client.models.dataset.resource.Dataset object at 0x7fe17cddee80>

Make some docs changes visible on the stable branch

💬 RFC

Figure out a mechanism for pushing changes to the docs that should become immediately visible.

🔦 Context

If we don't want to change our code, but we do want to change the docs that describe that code, how do we do that? We want to change the stable docs, not the latest docs, but we don't want to necessitate a full release to do so.

💻 Examples

#58 adds refinements to the Contributor Guide, but those refinements are only visible on latest / 0.4.0-dev. They should be visible on stable / 0.3.0 too.


We could add a commit to fix the docs both to master and to the most recent stable release branch. BUT readthedocs uses GitHub tags for building multiple versions of the docs, so we would need to re-tag after this commit got included? BUT the GitHub release is tied to the release tag... hmm 🤔

Pass requests.Session to Client

requests.Session allows for configuration of things like cookies, headers, and other connection parameters (including connection pooling). It would be nice to be able to specify the Session that the Client uses as an optional constructor argument.

If one isn't provided, the Client should create a Session object at construction time.

Changes requested:

  • Add a Client constructor arg session (default None); when omitted, the Client creates a default requests.Session() that lives as long as the Client does.
  • Client.request: make requests through the Session
  • Client: add a getter property for Client.session (see the sketch after this list)
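
A minimal sketch of those changes (not the actual implementation; the real constructor has more parameters such as host, port, and base_path):

import requests

class Client:
    def __init__(self, auth, session=None):
        self.auth = auth
        # If no session is supplied, create one that lives as long as the Client.
        self._session = session or requests.Session()

    @property
    def session(self):
        """The underlying requests.Session used for all calls."""
        return self._session

    def request(self, method, url, **kwargs):
        # Routing everything through the session means cookies, headers, connection
        # pooling, and settings like session.verify apply to every request.
        return self._session.request(method, url, auth=self.auth, **kwargs)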

Of significance for my usage: in order to connect over HTTPS with a self-signed certificate, I currently need to monkey-patch requests:

# monkey patching requests.Session.request to disable TLS verification
from functools import partialmethod

import requests

old_request = requests.Session.request
requests.Session.request = partialmethod(old_request, verify=False)

I would like to handle this without monkey patching:

session = requests.Session()
session.verify = False
client = Client(session=session, ...)

Add clusters() method to mastering project

Could you add a method .clusters() to the MasteringProject object? It should in turn support a .refresh() method to recreate the clusters without publishing them. This functionality is currently missing in the python client but is necessary for a continuous mastering workflow.
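
Roughly the desired usage (neither clusters() nor its refresh() exists in the client yet; this is only a sketch of the request):

project = unify.projects.by_resource_id("1").as_mastering()
clusters = project.clusters()  # proposed: a dataset-like handle on the record clusters
clusters.refresh()             # recreate clusters without publishing them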

In the docs, the Continuous Mastering example should also be updated to include this step (before publishing clusters).

Add useful repr to various objects throughout the codebase

🙋 feature request

Desire: add a useful __repr__ to all common objects throughout unify-client-python.

🤔 Expected Behavior

In interactive Python, repr(x) for objects from the library should show fields that are useful to a developer, though the output need not be eval-compatible.

😯 Current Behavior

Default Python Object.__repr__

๐Ÿ’ Possible Solution

Discussed and negotiated in PR #59
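
One possible shape (the exact format was negotiated in PR #59; this is only a sketch):

def __repr__(self):
    # Developer-oriented, not eval()-able: class name plus a couple of identifying fields.
    return (
        f"{self.__class__.__module__}."
        f"{self.__class__.__qualname__}("
        f"relative_id={self.relative_id!r}, name={self.name!r})"
    )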

MasteringProject high_impact_pairs and published_clusters not loaded on fetch

๐Ÿ› bug report

MasteringProject.high_impact_pairs and MasteringProject.published_clusters do not fetch the dataset information from the server.

This is inconsistent with Project.unified_dataset

😯 Current Behavior

>>> project.high_impact_pairs().external_id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/models/dataset/resource.py", line 23, in external_id
    return self.data["externalId"]
TypeError: 'NoneType' object is not subscriptable

>>> project.published_clusters().external_id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/models/dataset/resource.py", line 23, in external_id
    return self.data["externalId"]
TypeError: 'NoneType' object is not subscriptable

>>> project.unified_dataset().external_id
'idogs_unified_dataset'

🤔 Expected Behavior / 💁 Possible Solution

Use the same logic as Project.unified_dataset: fetch the dataset object from the server so that the returned Dataset has all of the appropriate fields and does not raise exceptions on standard property access.
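
A sketch of that fix on MasteringProject, mirroring the unified_dataset logic (the path and constructor signature are assumed from the tracebacks above, not taken from the actual source):

def high_impact_pairs(self):
    alias = self.api_path + "/highImpactPairs"  # assumed sub-resource path
    resource_json = self.client.get(alias).successful().json()  # fetch before wrapping
    return Dataset.from_json(self.client, resource_json, alias)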

๐ŸŒ Your Environment

Software            Version(s)
Python              3.6
tamr-unify-client   0.4.0-dev (master HEAD at the time of writing)
Operating System    Linux (ubuntu:latest docker container)

Change `data` to `_data`

💬 RFC

Many classes within tamr_unify_client use a data member for static information, often loaded from the server's JSON response when the object is fetched.

This member is not documented. Because of its direct relationship to the server response, it is also subject to change from version to version. It should be marked as private via the underscore naming convention (_data).

This also clutters tab completion in the Python 3 interactive REPL.

🔦 Context

This is particularly misleading when working with a dataset object, because it looks like it ought to be the collection of the dataset's records.

>>> dataset = project.unified_dataset()
>>> dataset.
dataset.api_path         dataset.data             dataset.external_id      dataset.from_json(       dataset.records(         dataset.relative_id      dataset.status(          dataset.update_records(
dataset.client           dataset.description      dataset.from_data(       dataset.name             dataset.refresh(         dataset.resource_id      dataset.tags             dataset.version
>>> dataset.

💻 Examples

  • Change base_resource to use _data
  • Change various self.data references to self._data

Dataset should be updatable from Python Geo FeatureCollection

🙋 feature request

Since a Dataset can produce a FeatureCollection via the Python geo interface, it should also be updatable from a FeatureCollection.
See https://gist.github.com/sgillies/2217756

Relates to #98

🤔 Expected Behavior

When given a FeatureCollection, Dataset should upsert records by matching the feature id to the record key.

😯 Current Behavior

No support.

๐Ÿ’ Possible Solution

Provide a from_geo_features(...) method on Dataset.

🔦 Context

When working with geospatial data, most Python GIS tools are able to produce a FeatureCollection. Dataset should embody the One True Way to convert from a FeatureCollection to a Dataset.

💻 Examples

>>> import geopandas
>>> my_dataset = unify.datasets.by_name("my_dataset")
>>> my_geo_data_frame = geopandas.GeoDataFrame.from_features(my_dataset)
>>> # Modify my_geo_data_frame
>>> my_dataset.from_geo_features(my_geo_data_frame)

Proposal for extensibility

💬 RFC

A proposal for cases where an end user wants to use this library but add additional functionality, without the difficulty of monkey-patching or of permanently modifying the library (even within a single Python runtime).

My proposal is to add a (barely-documented) class mapping argument to the Client which is used whenever a library class is invoked (i.e. a method is called on it, including its constructor).

See DerrickRice@a6b644c

💻 Examples

from tamr_unify_client import Client
from tamr_unify_client.auth import UsernamePasswordAuth
from tamr_unify_client.models.project.collection import ProjectCollection  # module path assumed

class MyProjectCollection(ProjectCollection):
    """End-user subclass adding extra functionality."""

def example():
    class_mapping = {ProjectCollection: MyProjectCollection}  # proposed constructor argument

    client = Client(
        UsernamePasswordAuth("username", "password"), class_mapping=class_mapping
    )

    assert isinstance(client.projects, MyProjectCollection)

Dataset Upload With Schema

💬 RFC

Currently it is hard to upload a dataset with a custom schema. Given a dataset name, a schema as JSON, and a CSV, I want to upload the data efficiently. For some datasets, it's also useful to generate GUIDs for the primary key field if it doesn't exist.

🔦 Context

I want to upload a dataset for golden records and bootstrap it. To do this I need a dataset with strings for the origin source name field, but uploading via the CSV endpoint in the UI converts everything to [string].

💻 Examples
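
A possible shape, building on the create_dataset and create_attribute sketches proposed in other issues here (all helper names and config fields are illustrative assumptions):

import csv
import uuid

def upload_csv_with_schema(unify, name, attribute_configs, csv_path, key="primary_key"):
    # Create the dataset, then define each attribute from the provided schema (JSON-style dicts).
    dataset = create_dataset(unify, {"name": name, "keyAttributeNames": [key]})
    for attribute_config in attribute_configs:
        create_attribute(dataset, attribute_config)
    with open(csv_path, newline="") as f:
        records = (
            {**row, key: row.get(key) or str(uuid.uuid4())}  # generate a GUID key if missing
            for row in csv.DictReader(f)
        )
        dataset.update_records(records)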

dataset.update_records does not return a response

๐Ÿ› bug report

calling my_dataset.update_records(stuff) returns None

🤔 Expected Behavior

calling my_dataset.update_records(stuff) should return the result of "self.client.post(self.api_path + ":updateRecords", data=body)"
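
A minimal sketch of the change (only the return is added; the body construction stays as it is today):

def update_records(self, records):
    body = "\n".join(json.dumps(r) for r in records)
    return self.client.post(self.api_path + ":updateRecords", data=body)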

😯 Current Behavior

💁 Possible Solution

🔦 Context

💻 Code Sample

🌍 Your Environment

Software Version(s)
tamr-unify-client
Tamr Unify server
Python
Operating System

Ability to create projects

🙋 feature request

🤔 Expected Behavior

Create a project in Unify

😯 Current Behavior

💁 Possible Solution

def create_project(unify, project_config):
    """
    Create a Project in Unify
    
    :param unify: Unify Client
    :param project_config: Project Configuration
    :return: The created Project
    """
    from tamr_unify_client.models.project.resource import Project
    data = unify.post(unify.projects.api_path, json=project_config).successful().json()
    return Project(unify, data, data["relativeId"])
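
Hypothetical usage, with a project_config shaped after the versioned API docs (field names and the type value are assumptions):

project_config = {
    "name": "my_mastering_project",           # assumed field names, per the versioned API docs
    "description": "Mastering pets",
    "type": "DEDUP",                          # assumed enum value for a mastering project
    "unifiedDatasetName": "my_mastering_project_unified_dataset",
}
project = create_project(unify, project_config)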

🔦 Context

💻 Examples

Fix Travis config

  • should not test against Python 3.8 (not released yet)
  • use Xenial distro for Travis
  • add build badge

Add support for creating a dataset attribute

🙋 feature request

🤔 Expected Behavior

😯 Current Behavior

💁 Possible Solution

def create_attribute(dataset, attribute_config):
    """
    Create an Attribute in Unify
    
    :param dataset: the Unify Dataset to which to add the attribute
    :type dataset: :class:`tamr_unify_client.models.dataset.resource.Dataset`
    :param attribute_config: the configuration of the attribute to create
    :type attribute_config: dict[str, object]
    :return: the created Attribute
    """
    from tamr_unify_client.models.attribute.resource import Attribute
    data = dataset.client.post(dataset.attributes.api_path, json=attribute_config).successful().json()
    alias = dataset.attributes.api_path + "/" + attribute_config["name"]
    return Attribute(dataset.client, data, alias)
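
Hypothetical usage; the attribute_config shape follows the attribute-types docs linked in the "Access to Attributes" issue above (field names are assumptions):

attribute_config = {
    "name": "my_attribute",                   # assumed config shape, per the attribute-types docs
    "description": "Description of my_attribute",
    "type": {"baseType": "ARRAY", "innerType": {"baseType": "STRING"}},
}
attribute = create_attribute(dataset, attribute_config)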

🔦 Context

💻 Examples

Confusion between from_data and from_json (and arguments resource_json and data)

๐Ÿ› bug report

JSON is a string. Data (in the way it is used here) is a dictionary.

In many cases, from_json is actually taking a dictionary as an argument named resource_json.

🤔 Expected Behavior

An argument or method referring to json should be working with strings. If it is only working with structured data (that may or may not have once been a string), it should not be referred to as "json".

😯 Current Behavior

  • BaseCollection expects resource_class.from_json to exist.
  • Most (all?) implementations of BaseResource have a classmethod from_json that calls BaseResource.from_data. Note that there is no conversion of string to dict between the two calls (see the sketch below).
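
A sketch of that pattern as it exists today (signature assumed; no string parsing happens anywhere in it):

class Dataset(BaseResource):
    @classmethod
    def from_json(cls, client, resource_json, api_path=None):
        # "resource_json" is already a dict here; this simply forwards to from_data.
        return super().from_data(client, resource_json, api_path)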

๐Ÿ’ Possible Solution

I'm fairly certain that all from_json methods can be removed, and calls to them can be changed to from_data (which is already a @classmethod, and so cls will be set to the subclass that it is invoked upon, without that subclass explicitly redefining from_data)

🔦 Context

I got confused.

๐ŸŒ Your Environment

Software            Version(s)
tamr-unify-client   4.0-dev
Tamr Unify server   n/a
Python              n/a
Operating System    n/a

Improve documentation on Dataset.update_records recordId

🙋 feature request

I discovered, through experimentation, that Dataset.update_records uses the dataset's keyAttributeNames as the recordId. The corresponding value in the record itself is ignored.

I don't know how this works if there are multiple key attributes.

This was not clear from the documentation. The Python client documentation defers to https://docs.tamr.com/reference#modify-a-datasets-records which says only:

Field      Description
action     The action being requested: CREATE or DELETE.
recordId   The ID of the record.
record     Optional. The information the new record will contain upon creation. Fields in the record must exist in the schema of the dataset they are being added to. If fields are not in the schema, they will be ignored.

This isn't sufficient information to understand how to use the API without experimentation and close scrutiny.

Remove ads on docs

Switch to a paid plan on readthedocs so we don't have ads on our docs.

Dataset should provide a Geo Python interface

🙋 feature request

Dataset should provide the Python geo interface (__geo_interface__) for geospatial data.
See: https://gist.github.com/sgillies/2217756

🤔 Expected Behavior

Produce a GeoJSON FeatureCollection from a dataset:

  import json

  dataset = client.datasets.by_name("my_dataset")
  with open("my_dataset.json", "w") as f:
      json.dump(dataset.__geo_interface__, f)

😯 Current Behavior

No __geo_interface__ on Dataset

๐Ÿ’ Possible Solution

Provide __geo_interface__ on Dataset

🔦 Context

When exporting geospatial data from Unify, many Python geospatial packages are able to consume the geo interface. For interoperability, Unify Dataset should provide __geo_interface__ to easily convert to a FeatureCollection.

💻 Examples

>>> dataset = client.datasets.by_name("my_dataset")
>>> feature_collection = dataset.__geo_interface__
>>> feature_collection["type"]
"FeatureCollection"
>>> feature = feature_collection["features"][0]
>>> feature["type"]
"Feature"
>>> geometry = feature["geometry"]
>>> geometry["type"]
"Polygon"
>>> geometry["coordinates"]
[[[-71.1522084419519, 42.3745215835176], [-71.1521881193882, 42.3744888310051], [-71.1521638278098, 42.3744971184826], [-71.1521246630653, 42.3744340027633], [-71.1521483999176, 42.3744259044764], [-71.1521407901139, 42.3744136394194], [-71.1521190173832, 42.3743785507759], [-71.1520756633469, 42.3743086828514], [-71.1520842951625, 42.3743095255146], [-71.1521193262041, 42.3743129156902], [-71.1521547542348, 42.374316315158], [-71.1521573062878, 42.3743165632504], [-71.1522809543662, 42.3743287505821], [-71.1523849604337, 42.3743390013382], [-71.1523885493228, 42.3743447856967], [-71.1524009012572, 42.3743405718856], [-71.1524174585427, 42.3743422046124], [-71.1524742202209, 42.3744336791865], [-71.1524042055534, 42.3744575646173], [-71.1524038456022, 42.3744793152621], [-71.1523870075068, 42.3744958511921], [-71.1523590300683, 42.3745053960713], [-71.1523312402649, 42.374504881751], [-71.1523030827935, 42.3744923997539], [-71.1523017432803, 42.3744902404282], [-71.1522860204876, 42.3744956045043], [-71.1522915145566, 42.3745044573864], [-71.1522687642496, 42.374512218378], [-71.1522630222685, 42.3745029641195], [-71.1522084419519, 42.3745215835176]]]
