datatamer / tamr-client
Programmatically interact with Tamr
Home Page: https://tamr-client.readthedocs.io
License: Apache License 2.0
At the moment, Dataset.update_records() does a

```python
body = "\n".join([json.dumps(r) for r in records])
```

which materializes all the updates as one massive string. Updates from a streaming source, e.g. a database, don't need to be materialized en route to Unify, but currently they are.
requests is able to stream from a generator, so this could be changed to:

```python
def _stringify_updates(updates):
    for update in updates:
        # serialize each update as one newline-delimited JSON command
        yield (json.dumps(update) + "\n").encode("utf-8")

self.client.post(
    self.api_path + ":updateRecords",
    headers={"Content-Encoding": "utf-8"},
    data=_stringify_updates(records),
)
```
Materializing records doesn't scale, and is slow.
I'm reading records from a database (this is simplified):

```python
import sqlalchemy
from contextlib import closing

# database, sql, sql_params, and my_dataset are defined elsewhere (simplified)
def load_data_from_database():
    engine = sqlalchemy.create_engine(database.get_uri())
    with closing(engine.connect().execution_options(stream_results=True)) as conn:
        query = sqlalchemy.sql.text(sql)
        query = query.bindparams(**sql_params)
        with closing(conn.execute(query)) as cursor:
            my_dataset.update_records(cursor)
```

Note the stream_results=True: it means the cursor returns one record (really a few) at a time, so I don't use crazy amounts of RAM. This is defeated by the Python client materializing everything.
See above.
Most sub-resources are accessed as properties. An exception to this rule is Project.unified_dataset, which is a method. This makes coding awkward, because I need to keep track of whether sub-resources should be accessed as properties or methods.
To get the unified dataset of a project, the expected usage would be:

```
>>> project = unify.projects.by_resource_id("1")
>>> ud = project.unified_dataset
>>> print(ud.relative_id)
datasets/3
```

Actual behavior:

```
>>> project = unify.projects.by_resource_id("1")
>>> ud = project.unified_dataset
>>> print(ud.relative_id)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'relative_id'
```
Add @property to Project.unified_dataset.
NB: THIS WOULD BE A BREAKING CHANGE
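A sketch of the proposed change, following the fetch-on-access pattern of the 0.x client (BaseResource is assumed to be in scope as in the existing code; this is not the merged implementation):

```python
from tamr_unify_client.models.dataset.resource import Dataset

class Project(BaseResource):
    @property
    def unified_dataset(self):
        # Same fetch logic as the existing method, exposed as a property.
        alias = self.api_path + "/unifiedDataset"
        resource_json = self.client.get(alias).successful().json()
        return Dataset.from_json(self.client, resource_json, alias)
```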
I have to remember that unified_dataset is special - it is a resource, but is accessed as a method, not a property.
See above.
Software | Version(s) |
---|---|
tamr-unify-client | 0.5.0-dev |
Tamr Unify server | v2019.10.0 |
Python | 3.6.8 |
Operating System | MacOS X 10.14.4 |
Got the following error after running the quickstart example in the docs:
```
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from tamr_unify_client import Client
  File "/home/nikki/.local/lib/python2.7/site-packages/tamr_unify_client/__init__.py", line 1, in <module>
    from tamr_unify_client.client import Client
  File "/home/nikki/.local/lib/python2.7/site-packages/tamr_unify_client/client.py", line 69
    return f"{method} {url} : {response.status_code}"
                                                     ^
SyntaxError: invalid syntax
```
The issue is that I'm running Python 2.7, but 3.6+ is required. Maybe we could add a requirements section to the docs and the GitHub README to help make this more visible?
Maintainer responsibilities:
Add link to this section from the README's "Maintainers" list
Attribute resources should be available individually or as a collection under Dataset (see https://docs.tamr.com/reference#attribute-types):

- Access the attributes collection as a sub-resource of any dataset.
- Access an individual attribute as a resource.

Currently there is no way to get attributes. Proposed solution: implement it.
I need to know attribute types in order to do intelligent conversion between Unify values and values in the outside world.
In particular, I need to identify attributes that use Unify's geometry representation so I can convert between those and the Python Geo Interface.
```python
dataset = client.datasets().by_name("my_dataset")
for attr in dataset.attributes():
    do_something_with(attr)
```

```python
attr = client.datasets().by_name("my_dataset").attributes().by_name("my_attribute")
print(attr.name)
print(attr.description)
print(attr.nullable)
```
Should produce something like:

```
"my_attribute"
"Description of my_attribute"
True
```
Unable to retrieve published_clusters from a mastering project:

- clusters.status() raises a 404 exception.
- clusters.records() gives a list with one element that is an error dict.

```
>>> project = client.projects.by_external_id('idogs')
>>> project.name
'idogs'
>>> project = project.as_mastering()
>>> clusters = project.
project.api_path project.client project.external_id project.high_impact_pairs( project.pairs( project.resource_id
project.as_categorization( project.data project.from_data( project.name project.published_clusters( project.type
project.as_mastering( project.description project.from_json( project.pair_matching_model( project.relative_id project.unified_dataset(
>>> clusters = project.published_clusters()
>>> clusters.status()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/models/dataset/resource.py", line 75, in status
    status_json = self.client.get(self.api_path + "/status").successful().json()
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/client.py", line 19, in successful
    self.raise_for_status()
  File "/home/drice/.local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://localhost:4443/api/versioned/v1/projects/3/publishedClusters/status
>>> data = list(clusters.records())
>>> print(len(data))
1
>>> pp(data[0])
{'causedBy': None,
'class': 'javax.ws.rs.NotFoundException',
'message': 'HTTP 404 Not Found',
'service': 'pubapi',
'stackTrace': ['org.glassfish.jersey.server.ServerRuntime$2::run::323',
'org.glassfish.jersey.internal.Errors$1::call::271',
'org.glassfish.jersey.internal.Errors$1::call::267',
'org.glassfish.jersey.internal.Errors::process::315',
'org.glassfish.jersey.internal.Errors::process::297',
'org.glassfish.jersey.internal.Errors::process::267',
'org.glassfish.jersey.process.internal.RequestScope::runInScope::317',
'org.glassfish.jersey.server.ServerRuntime::process::305',
'org.glassfish.jersey.server.ApplicationHandler::handle::1154',
'org.glassfish.jersey.servlet.WebComponent::serviceImpl::473',
'org.glassfish.jersey.servlet.WebComponent::service::427',
'org.glassfish.jersey.servlet.ServletContainer::service::388',
'org.glassfish.jersey.servlet.ServletContainer::service::341',
'org.glassfish.jersey.servlet.ServletContainer::service::228',
'io.dropwizard.jetty.NonblockingServletHolder::handle::49',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1655',
'io.dropwizard.servlets.ThreadNameFilter::doFilter::34',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
'io.dropwizard.jersey.filter.AllowedMethodsFilter::handle::45',
'io.dropwizard.jersey.filter.AllowedMethodsFilter::doFilter::39',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
'com.palantir.websecurity.filters.JerseyAwareWebSecurityFilter::doFilter::63',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
'com.serviceenabled.dropwizardrequesttracker.RequestTrackerServletFilter::doFilter::49',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
'com.tamr.zookeeper.dw.servicestate.ServiceStateFilter::doFilter::73',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642'],
'status': 404}
```
Unable to use the API to fetch Tamr Unify's results.
Software | Version(s) |
---|---|
Python | 3.6 |
Tamr Unify server | Tamr Unify 2019.003.0 build 3a89900beb |
tamr-unify-client | 4.0-dev |
Operating System | Ubuntu |
Currently, using the high impact pair records requires that the client find, fetch, and scan the entire dataset for each of the records referenced by a high impact pair.
More friendly options would be:

- returning the full records from high_impact_pairs().records()
- supporting lookup of records by tamr_id from a dataset

high_impact_pairs is far less performant / scale-conscious in the API than it is in the UI. The UI shows high impact pairs efficiently, whereas the API requires a full scan of the source datasets.
I tried setting base_path='api' in order to use the request method for non-versioned APIs, but it seems not to be picked up. See below:
```
Python 3.7.0 (default, Nov 10 2018, 18:44:49)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tamr_unify_client import Client
>>> from tamr_unify_client.auth import UsernamePasswordAuth
>>> catmando_host = 'localhost'
>>> catmando_port = 9030
>>> tamr_user = 'admin'
>>> tamr_pw = 'dt'
>>> auth=UsernamePasswordAuth(tamr_user, tamr_pw)
>>> catmando_client = Client(auth, host=catmando_host, port=catmando_port, base_path='api')
>>> response = catmando_client.request( 'GET', 'service/health', headers={'Accept': 'application/json'})
>>> response
<Response [404]>
>>> response.url
'http://localhost:9030/service/health'
```
More investigation reveals that it is a problem with urllib.parse.urljoin:

```
>>> from urllib.parse import urljoin
>>> urljoin('http://localhost:9030' + '/' + 'api', 'service/health')
'http://localhost:9030/service/health'
```
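The root cause is urljoin's trailing-slash handling: without a trailing slash, the last path segment of the base URL is replaced rather than extended. A minimal demonstration:

```python
from urllib.parse import urljoin

# Without a trailing slash, 'api' is treated as a resource and replaced:
urljoin('http://localhost:9030/api', 'service/health')
# -> 'http://localhost:9030/service/health'

# With a trailing slash, 'api/' is treated as a directory and kept:
urljoin('http://localhost:9030/api/', 'service/health')
# -> 'http://localhost:9030/api/service/health'
```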
Add an example of how to use config files to provide credentials to the client in the user guide.
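A minimal sketch of what such an example could look like, assuming a YAML config file; the file name and keys are hypothetical, not an existing client feature:

```python
import yaml  # assumes PyYAML is installed

from tamr_unify_client import Client
from tamr_unify_client.auth import UsernamePasswordAuth

# config.yaml (hypothetical):
#   username: admin
#   password: dt
#   host: localhost
with open("config.yaml") as f:
    config = yaml.safe_load(f)

auth = UsernamePasswordAuth(config["username"], config["password"])
client = Client(auth, host=config["host"])
```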
Many of the Python client actions take a long while. While the default behavior of a synchronous action makes sense, it would be significantly improved if there were a simple callback that provided some indication of progress.
The progress is visible in the Unify UI, so I imagine the data is available in the job operation data. It could be extracted and provided to a callback.
Provide some information about how long something has taken and how far it has gotten, thereby offering some insight into how much longer it might take.
```python
dataset = project.unified_dataset()

def progress_callback(status, start_time, done, total):
    ...  # print progress information, or whatever

dataset.refresh(progress_callback=progress_callback)
```
When request is called with an absolute path (e.g. /api/service/health), it rewrites the path into a path relative to base_path instead of correctly resolving the absolute path.
Example:

```python
client.request(method, "/api/service/health")
```

Desired result: the request is made to http://unify-host:9100/api/service/health
Actual result: the request is made to http://unify-host:9100/api/versioned/v1//api/service/health
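One possible resolution rule, sketched as plain functions (the base URLs are illustrative, and this is not the shipped fix):

```python
from urllib.parse import urljoin

BASE = "http://unify-host:9100"
VERSIONED = BASE + "/api/versioned/v1/"

def resolve(path):
    # A leading '/' means an absolute path on the host;
    # anything else resolves relative to the versioned base path.
    if path.startswith("/"):
        return BASE + path
    return urljoin(VERSIONED, path)

assert resolve("/api/service/health") == "http://unify-host:9100/api/service/health"
assert resolve("projects/1") == "http://unify-host:9100/api/versioned/v1/projects/1"
```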
Trying to upload new data. I would prefer to create a new dataset, but the client doesn't support that; instead, I need to use Dataset.update_records.
I would like to be able to create a new dataset.
Concerns with updating into an existing dataset:

- updates must manage recordId values, which requires additional bookkeeping.

It appears that the server's API does support the creation of new datasets, but this isn't provided in the Python client: https://docs.tamr.com/reference#create-dataset
```python
def create_dataset(unify, dataset_config):
    """Create a dataset in Unify.

    :param unify: Unify Client
    :param dataset_config: Dataset Configuration
    :return: the created Dataset
    """
    from tamr_unify_client.models.dataset.resource import Dataset

    data = unify.post(unify.datasets.api_path, json=dataset_config).successful().json()
    return Dataset(unify, data, data["relativeId"])
```
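For illustration, a call might look like this (the field names follow the public create-dataset docs; the values are made up):

```python
dataset_config = {
    "name": "my_new_dataset",
    "keyAttributeNames": ["id"],
    "description": "Created via the versioned API",
}
dataset = create_dataset(unify, dataset_config)
print(dataset.relative_id)
```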
These terms are used nearly interchangeably:

- Object.relative_id (which reads data["relativeId"])
- Object.api_path
- the alias argument on BaseResource.__init__
"relative ID" is an unfamiliar term.
I understand relative path, and api path. Those describe things that I'm familiar with.
If I had to vote for one, I'd request api_path
, since it is tacked onto the end of the versioned API path.
Further confusing: sometimes resource.api_path
is valid and populated, and sometimes different from resource.relative_id
. This has to do with the path the resource was fetched from vs the canonical path that the resource lives at. For example, unified data sets.
```
>>> ds = project.unified_dataset()
>>> ds.api_path
'projects/3/unifiedDataset'
>>> ds.relative_id
'datasets/58'
```
How is this meant to be used?
Before:

```
>>> client.datasets.by_relative_id('datasets/58')
<tamr_unify_client.models.dataset.resource.Dataset object at 0x7fe17cddee80>
```

After:

```
>>> client.datasets.by_api_path('datasets/58')
<tamr_unify_client.models.dataset.resource.Dataset object at 0x7fe17cddee80>
```
Figure out a mechanism for pushing changes to the docs that should become immediately visible.
If we don't want to change our code, but we do want to change the docs that describe that code, how do we do that? We want to change the stable docs, not just the latest docs, but we don't want to necessitate a full release to do so.
#58 adds refinements to the Contributor Guide, but those refinements are only visible on latest / 0.4.0-dev. They should be visible on stable / 0.3.0 too.
We could add a commit fixing the docs both to master and to the most recent stable release branch. BUT readthedocs uses GitHub tags for building multiple versions of the docs, so we would need to re-tag after this commit got included? BUT the GitHub release is tied to the release tag... hmm 🤔
requests.Session allows for configuration of things like cookies, headers, and other connection parameters (including connection pooling). It would be nice to be able to specify the Session that the Client uses as an optional constructor argument. If one isn't provided, the Client should create a Session object at construction time.
Changes requested:

- Client constructor: accept an optional session argument, defaulting to None, in which case the Client gets a default session from requests.Session() that lives as long as the Client does.
- Client.request: make requests through the Session.
- Expose the session as Client.session.
Of significance for my usage: in order to connect over HTTPS with a self-signed certificate, I need to monkey-patch requests:

```python
# monkey-patching requests.Session.request to disable TLS verification
from functools import partialmethod
import requests

old_request = requests.Session.request
requests.Session.request = partialmethod(old_request, verify=False)
```
I would like to handle this without monkey-patching:

```python
session = requests.Session()
session.verify = False  # requests.Session() takes no constructor arguments
client = Client(session=session, ...)
```
Could you add a .clusters() method to the MasteringProject object? It should in turn support a .refresh() method to recreate the clusters without publishing them. This functionality is currently missing in the Python client but is necessary for a continuous mastering workflow.
In the docs, the Continuous Mastering example should also be updated to include this step (before publishing clusters). A usage sketch follows.
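A sketch of the requested usage; .clusters() and its refresh() are the proposed API, not shipped behavior:

```python
project = client.projects.by_resource_id("1").as_mastering()

clusters = project.clusters()  # proposed: record clusters sub-resource
op = clusters.refresh()        # proposed: recreate clusters without publishing
assert op.succeeded()
# ...then publish clusters as in the existing Continuous Mastering example.
```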
Is the username/password sent over plain HTTP, or over a secure (authenticated and encrypted) connection such as HTTPS?
In RELEASE.md, figure out where to include "Add release date to changelog".
Desire: add a useful repr() to all common objects throughout unify-client-python. In interactive Python, repr(x) for objects from the library should show elements that are useful to a developer, though they need not be eval-compatible.
Current behavior: the default Python Object.__repr__.
Discussed and negotiated in PR #59.
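A minimal sketch of the idea (not the repr negotiated in PR #59):

```python
class Dataset:
    def __init__(self, relative_id, name):
        self.relative_id = relative_id
        self.name = name

    def __repr__(self):
        # Developer-friendly; not required to be eval-compatible.
        return (
            f"{self.__class__.__qualname__}"
            f"(relative_id={self.relative_id!r}, name={self.name!r})"
        )

print(repr(Dataset("datasets/3", "my_dataset")))
# Dataset(relative_id='datasets/3', name='my_dataset')
```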
... causing tests to fail. Should use the Dataset.records method instead.
MasteringProject.high_impact_pairs and MasteringProject.published_clusters do not fetch the dataset information from the server. This is inconsistent with Project.unified_dataset:
```
>>> project.high_impact_pairs().external_id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/models/dataset/resource.py", line 23, in external_id
    return self.data["externalId"]
TypeError: 'NoneType' object is not subscriptable
>>> project.published_clusters().external_id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/models/dataset/resource.py", line 23, in external_id
    return self.data["externalId"]
TypeError: 'NoneType' object is not subscriptable
>>> project.unified_dataset().external_id
'idogs_unified_dataset'
```
Use the same logic as Project.unified_dataset: fetch the dataset object from the server, so that the returned Dataset has all of the appropriate fields and does not raise exceptions on standard property access.
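A sketch of the fix, mirroring the pattern Project.unified_dataset uses (the alias path here is an assumption, not the shipped implementation):

```python
def high_impact_pairs(self):
    # Fetch the dataset resource from the server instead of constructing
    # a Dataset whose data is None.
    alias = self.api_path + "/highImpactPairs"
    resource_json = self.client.get(alias).successful().json()
    return Dataset.from_json(self.client, resource_json, alias)
```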
Software | Version(s) |
---|---|
Python | 3.6 |
tamr-unify-client | 0.4.0-dev (master HEAD at the time of writing) |
Operating System | Linux (ubuntu:latest docker container) |
Many classes within tamr_unify_client use a data field member for static information, often loaded from the server's JSON response when the object is fetched.
This attribute is not documented. Because of its direct relationship to the server response, it is also subject to change from version to version. It should be marked as private via the underscore naming convention (_data).
This also affects tab completion in the Python 3 interactive REPL. It is particularly misleading when working with a dataset object, because data looks like it ought to be the collection of the dataset's records.
```
>>> dataset = project.unified_dataset()
>>> dataset.
dataset.api_path    dataset.data           dataset.external_id  dataset.from_json(  dataset.records(  dataset.relative_id  dataset.status(  dataset.update_records(
dataset.client      dataset.description    dataset.from_data(   dataset.name        dataset.refresh(  dataset.resource_id  dataset.tags     dataset.version
>>> dataset.
```
Proposed changes:

- Rename data to _data.
- Change self.data references to self._data.
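A sketch of the resulting convention: the raw response stays private, and documented fields are exposed as properties (illustrative, not the actual class):

```python
class Dataset:
    def __init__(self, client, data):
        self.client = client
        self._data = data  # raw server JSON; private and undocumented

    @property
    def name(self):
        return self._data["name"]
```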
Since a Dataset can produce a Python Geo FeatureCollection, it should also be updatable from a Python Geo FeatureCollection.
See https://gist.github.com/sgillies/2217756
Relates to #98
When given a FeatureCollection, Dataset should upsert records by matching the feature id to the record key.
Currently there is no support. Proposed solution: provide a from_geo_features(...) method on Dataset.
When working with geospatial data, most Python GIS tools are able to produce a FeatureCollection. Dataset should embody the One True Way to convert from a FeatureCollection to a Dataset.
```
>>> import geopandas
>>> my_dataset = unify.datasets.by_name("my_dataset")
>>> my_geo_data_frame = geopandas.GeoDataFrame.from_features(my_dataset)
>>> # Modify my_geo_data_frame
>>> my_dataset.from_geo_features(my_geo_data_frame)
```
Currently it's using @pcattori 's personal account.
Instead, the Python client should raise the HTTP error as an exception. E.g. if a 404 happens under the hood, don't show a JSONDecodeError; show the 404.
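A minimal sketch of the desired behavior, using requests directly (the URL is illustrative):

```python
import requests

response = requests.get("http://unify-host:9100/api/versioned/v1/projects/999")
response.raise_for_status()  # surfaces the 404 as requests.exceptions.HTTPError
data = response.json()       # never reached on error, so no JSONDecodeError
```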
A proposal for cases where an end user wants to use this library but add additional functionality, without the difficulty of monkey-patching or of permanently modifying the library (even within a single Python runtime).
My proposal is to add a (barely-documented) class mapping argument to the Client, which is used whenever a library class is invoked (i.e. a method is called on it, including its constructor).
```python
from tamr_unify_client import Client
from tamr_unify_client.auth import UsernamePasswordAuth
from tamr_unify_client.models.project.collection import ProjectCollection

class MyProjectCollection(ProjectCollection):
    pass  # user-defined subclass with extra functionality

def example():
    class_mapping = {ProjectCollection: MyProjectCollection}
    client = Client(
        UsernamePasswordAuth("username", "password"), class_mapping=class_mapping
    )
    assert isinstance(client.projects, MyProjectCollection)
```
It's a nice feature on datasets.
Currently it is hard to upload a dataset with a custom schema. Given a dataset name, a schema as JSON, and a CSV, efficiently upload the data. For some datasets, it's useful to generate GUIDs for the primary key field if it doesn't exist (see the sketch below).
I want to upload a dataset for golden records and bootstrap it. To do this I need a dataset with strings for the origin source name field, but uploading via the CSV endpoint in the UI converts everything to [string].
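A small sketch of the GUID-generation step, assuming a CSV file and a hypothetical "pk" key column:

```python
import csv
import uuid

with open("data.csv", newline="") as f:
    records = [
        # Generate a GUID primary key where a row doesn't already have one.
        dict(row, pk=row.get("pk") or str(uuid.uuid4()))
        for row in csv.DictReader(f)
    ]
```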
Calling my_dataset.update_records(stuff) returns None. It should instead return the result of self.client.post(self.api_path + ":updateRecords", data=body).
Create a project in Unify:

```python
def create_project(unify, project_config):
    """Create a Project in Unify.

    :param unify: Unify Client
    :param project_config: Project Configuration
    :return: the created Project
    """
    from tamr_unify_client.models.project.resource import Project

    data = unify.post(unify.projects.api_path, json=project_config).successful().json()
    return Project(unify, data, data["relativeId"])
```
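For illustration, a call might look like this; the values are made up, and the field names follow the public create-project docs ("DEDUP" being the mastering project type):

```python
project_config = {
    "name": "my_mastering_project",
    "type": "DEDUP",
    "unifiedDatasetName": "my_mastering_project_unified_dataset",
}
project = create_project(unify, project_config)
print(project.relative_id)
```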
Currently, if we call a non-versioned or husk API, we cannot use the Operation status structure to check success or poll the operation status.
It would be great to have the Operation class generalized a bit so it can be used for custom API calls, as sketched below.
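A sketch of the desired usage; Operation.from_response and the endpoint path are assumptions for illustration, not a documented API:

```python
from tamr_unify_client.models.operation import Operation

# Call a custom (non-versioned) endpoint through the client...
response = client.request("POST", "some/custom/endpoint:refresh")

# ...then wrap the response in the generalized Operation machinery.
op = Operation.from_response(client, response)  # assumed constructor
op = op.wait()  # poll until the operation resolves
assert op.succeeded()
```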
Downloading from PyPI means you have to import the package as tamr_unify_client, which differs from the documentation, which states import unify_api_v1.
Can we have docs that specify how to use the Python client to call non-versioned APIs?
```python
def create_attribute(dataset, attribute_config):
    """Create an Attribute in Unify.

    :param dataset: the Unify Dataset to which to add the attribute
    :type dataset: :class:`tamr_unify_client.models.dataset.resource.Dataset`
    :param attribute_config: the configuration of the attribute to create
    :type attribute_config: dict[str, object]
    :return: the created Attribute
    """
    from tamr_unify_client.models.attribute.resource import Attribute

    data = dataset.client.post(dataset.attributes.api_path, json=attribute_config).successful().json()
    alias = dataset.attributes.api_path + "/" + attribute_config["name"]
    return Attribute(dataset.client, data, alias)
```
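For illustration, an attribute_config might look like this (the field names follow the public attribute-types docs; treat them as assumptions):

```python
attribute_config = {
    "name": "my_attribute",
    "description": "Description of my_attribute",
    "type": {"baseType": "ARRAY", "innerType": {"baseType": "STRING"}},
}
attr = create_attribute(dataset, attribute_config)
```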
JSON is a string. Data (in the way it is used here) is a dictionary.
In many cases, from_json is actually taking a dictionary as an argument named resource_json.
An argument or method referring to json should be working with strings. If it is only working with structured data (that may or may not have once been a string), it should not be referred to as "json".
Current state:

- Code elsewhere expects resource_class.from_json to exist.
- Each resource defines a from_json that calls BaseResource.from_data. Note that there is no conversion of string to dict between the two calls.

I'm fairly certain that all from_json methods can be removed, and calls to them can be changed to from_data (which is already a @classmethod, so cls will be set to the subclass it is invoked upon, without that subclass explicitly redefining from_data).
I got confused.
Software | Version(s) |
---|---|
tamr-unify-client | 4.0-dev |
Tamr Unify server | n/a |
Python | n/a |
Operating System | n/a |
I discovered, through experimentation, that Dataset.update_records uses the dataset's keyAttributeNames as the recordId; the corresponding value in the record itself is ignored. I don't know how this works if there are multiple key attributes.
This was not clear from the documentation. The Python client documentation defers to https://docs.tamr.com/reference#modify-a-datasets-records, which says only:
Field | Description |
---|---|
action | The action being requested: CREATE or DELETE. |
recordId | The ID of the record. |
record | Optional. The information the new record will contain upon creation. Fields in the record must exist in the schema of the dataset they are being added to. If fields are not in the schema, they will be ignored. |
This isn't sufficient information to understand how to use the API without experimentation and close scrutiny.
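For concreteness, an update stream per the table above might look like this (record contents are made up; the dataset is assumed to have keyAttributeNames == ["id"]):

```python
updates = [
    {"action": "CREATE", "recordId": "rec-1", "record": {"id": "rec-1", "name": "Alice"}},
    {"action": "DELETE", "recordId": "rec-2"},
]
my_dataset.update_records(updates)
```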
Switch to a paid plan on readthedocs so we don't have ads on our docs.
Return all datasets from one project.
Currently, it's using @pcattori 's personal account.
Dataset should provide a Python Geo interface for geospatial data.
See: https://gist.github.com/sgillies/2217756
Produce a GeoJSON FeatureCollection from a dataset:

```python
import json

dataset = client.datasets.by_name("my_dataset")
with open("my_dataset.json", "w") as f:
    json.dump(dataset.__geo_interface__, f)
```

Currently there is no __geo_interface__ on Dataset. Proposed solution: provide __geo_interface__ on Dataset.
When exporting geospatial data from Unify, many Python geospatial packages are able to consume the Geo interface. For interoperability, a Unify Dataset should provide __geo_interface__ to easily convert to a FeatureCollection.
```
>>> dataset = client.datasets.by_name("my_dataset")
>>> feature_collection = dataset.__geo_interface__
>>> feature_collection["type"]
"FeatureCollection"
>>> feature = feature_collection["features"][0]
>>> feature["type"]
"Feature"
>>> geometry = feature["geometry"]
>>> geometry["type"]
"Polygon"
>>> geometry["coordinates"]
[[[-71.1522084419519, 42.3745215835176], [-71.1521881193882, 42.3744888310051], [-71.1521638278098, 42.3744971184826], [-71.1521246630653, 42.3744340027633], [-71.1521483999176, 42.3744259044764], [-71.1521407901139, 42.3744136394194], [-71.1521190173832, 42.3743785507759], [-71.1520756633469, 42.3743086828514], [-71.1520842951625, 42.3743095255146], [-71.1521193262041, 42.3743129156902], [-71.1521547542348, 42.374316315158], [-71.1521573062878, 42.3743165632504], [-71.1522809543662, 42.3743287505821], [-71.1523849604337, 42.3743390013382], [-71.1523885493228, 42.3743447856967], [-71.1524009012572, 42.3743405718856], [-71.1524174585427, 42.3743422046124], [-71.1524742202209, 42.3744336791865], [-71.1524042055534, 42.3744575646173], [-71.1524038456022, 42.3744793152621], [-71.1523870075068, 42.3744958511921], [-71.1523590300683, 42.3745053960713], [-71.1523312402649, 42.374504881751], [-71.1523030827935, 42.3744923997539], [-71.1523017432803, 42.3744902404282], [-71.1522860204876, 42.3744956045043], [-71.1522915145566, 42.3745044573864], [-71.1522687642496, 42.374512218378], [-71.1522630222685, 42.3745029641195], [-71.1522084419519, 42.3745215835176]]]
```