datatamer / tamr-client
Programmatically interact with Tamr
Home Page: https://tamr-client.readthedocs.io
License: Apache License 2.0
At the moment, Dataset.update_records() does a

```python
body = "\n".join([json.dumps(r) for r in records])
```

which materializes all the updates as one massive string. Updates from a streaming source, e.g. a database, don't need to be materialized en route to Unify, but currently they are.
requests is able to stream from a generator, so this could be changed to:

```python
def _stringify_updates(updates):
    for update in updates:
        # serialize each update as one newline-delimited JSON command
        yield (json.dumps(update) + "\n").encode("utf-8")

self.client.post(
    self.api_path + ":updateRecords",
    headers={"Content-Encoding": "utf-8"},
    data=_stringify_updates(records),
)
```
Materializing records doesn't scale, and is slow.
I'm reading records from a database (this is simplified):

```python
import sqlalchemy
from contextlib import closing

# database, sql, sql_params, and my_dataset are defined elsewhere (simplified)
def load_data_from_database():
    engine = sqlalchemy.create_engine(database.get_uri())
    with closing(engine.connect().execution_options(stream_results=True)) as conn:
        query = sqlalchemy.sql.text(sql)
        query = query.bindparams(**sql_params)
        with closing(conn.execute(query)) as cursor:
            my_dataset.update_records(cursor)
```

Note the stream_results=True: it means the cursor returns one record (really a few) at a time, so I don't use crazy amounts of RAM. This is defeated by the Python client materializing everything.
See above.
Most sub-resources are accessed as properties. An exception to this rule is Project.unified_dataset, which is a method. This makes coding awkward, because I need to keep track of whether sub-resources should be accessed as properties or methods.
To get the unified dataset of a project, the expected usage would be:

```
>>> project = unify.projects.by_resource_id("1")
>>> ud = project.unified_dataset
>>> print(ud.relative_id)
datasets/3
```

Actual behavior:

```
>>> project = unify.projects.by_resource_id("1")
>>> ud = project.unified_dataset
>>> print(ud.relative_id)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'function' object has no attribute 'relative_id'
```
Add @property to Project.unified_dataset.
NB: THIS WOULD BE A BREAKING CHANGE
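A sketch of the proposed change, following the fetch-on-access pattern of the 0.x client (BaseResource is assumed to be in scope as in the existing code; this is not the merged implementation):

```python
from tamr_unify_client.models.dataset.resource import Dataset

class Project(BaseResource):
    @property
    def unified_dataset(self):
        # Same fetch logic as the existing method, exposed as a property.
        alias = self.api_path + "/unifiedDataset"
        resource_json = self.client.get(alias).successful().json()
        return Dataset.from_json(self.client, resource_json, alias)
```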
I have to remember that unified_dataset is special - it is a resource, but is accessed as a method, not a property.
See above.
Software | Version(s) |
---|---|
tamr-unify-client | 0.5.0-dev |
Tamr Unify server | v2019.10.0 |
Python | 3.6.8 |
Operating System | MacOS X 10.14.4 |
Got the following error after running the quickstart example in the docs:
```
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from tamr_unify_client import Client
  File "/home/nikki/.local/lib/python2.7/site-packages/tamr_unify_client/__init__.py", line 1, in <module>
    from tamr_unify_client.client import Client
  File "/home/nikki/.local/lib/python2.7/site-packages/tamr_unify_client/client.py", line 69
    return f"{method} {url} : {response.status_code}"
                                                     ^
SyntaxError: invalid syntax
```
The issue is that I'm running Python 2.7, but 3.6+ is required. Maybe we could add a requirements section to the docs and the GitHub README to help make this more visible?
Maintainer responsibilities:
Add link to this section from the README's "Maintainers" list
Attribute resources should be available individually or as a collection under Dataset (see https://docs.tamr.com/reference#attribute-types):

- Access the attributes collection as a sub-resource of any dataset.
- Access an individual attribute as a resource.

Currently there is no way to get attributes. Proposed solution: implement it.
I need to know attribute types in order to do intelligent conversion between Unify values and values in the outside world.
In particular, I need to identify attributes that use Unify's geometry representation so I can convert between those and the Python Geo Interface.
```python
dataset = client.datasets().by_name("my_dataset")
for attr in dataset.attributes():
    do_something_with(attr)
```

```python
attr = client.datasets().by_name("my_dataset").attributes().by_name("my_attribute")
print(attr.name)
print(attr.description)
print(attr.nullable)
```
Should produce something like:

```
"my_attribute"
"Description of my_attribute"
True
```
Unable to retrieve published_clusters from a mastering project:

- clusters.status() raises a 404 exception.
- clusters.records() gives a list with one element that is an error dict.

```
>>> project = client.projects.by_external_id('idogs')
>>> project.name
'idogs'
>>> project = project.as_mastering()
>>> clusters = project.
project.api_path project.client project.external_id project.high_impact_pairs( project.pairs( project.resource_id
project.as_categorization( project.data project.from_data( project.name project.published_clusters( project.type
project.as_mastering( project.description project.from_json( project.pair_matching_model( project.relative_id project.unified_dataset(
>>> clusters = project.published_clusters()
>>> clusters.status()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/models/dataset/resource.py", line 75, in status
    status_json = self.client.get(self.api_path + "/status").successful().json()
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/client.py", line 19, in successful
    self.raise_for_status()
  File "/home/drice/.local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://localhost:4443/api/versioned/v1/projects/3/publishedClusters/status
>>> data = list(clusters.records())
>>> print(len(data))
1
>>> pp(data[0])
{'causedBy': None,
'class': 'javax.ws.rs.NotFoundException',
'message': 'HTTP 404 Not Found',
'service': 'pubapi',
'stackTrace': ['org.glassfish.jersey.server.ServerRuntime$2::run::323',
'org.glassfish.jersey.internal.Errors$1::call::271',
'org.glassfish.jersey.internal.Errors$1::call::267',
'org.glassfish.jersey.internal.Errors::process::315',
'org.glassfish.jersey.internal.Errors::process::297',
'org.glassfish.jersey.internal.Errors::process::267',
'org.glassfish.jersey.process.internal.RequestScope::runInScope::317',
'org.glassfish.jersey.server.ServerRuntime::process::305',
'org.glassfish.jersey.server.ApplicationHandler::handle::1154',
'org.glassfish.jersey.servlet.WebComponent::serviceImpl::473',
'org.glassfish.jersey.servlet.WebComponent::service::427',
'org.glassfish.jersey.servlet.ServletContainer::service::388',
'org.glassfish.jersey.servlet.ServletContainer::service::341',
'org.glassfish.jersey.servlet.ServletContainer::service::228',
'io.dropwizard.jetty.NonblockingServletHolder::handle::49',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1655',
'io.dropwizard.servlets.ThreadNameFilter::doFilter::34',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
'io.dropwizard.jersey.filter.AllowedMethodsFilter::handle::45',
'io.dropwizard.jersey.filter.AllowedMethodsFilter::doFilter::39',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
'com.palantir.websecurity.filters.JerseyAwareWebSecurityFilter::doFilter::63',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
'com.serviceenabled.dropwizardrequesttracker.RequestTrackerServletFilter::doFilter::49',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642',
'com.tamr.zookeeper.dw.servicestate.ServiceStateFilter::doFilter::73',
'org.eclipse.jetty.servlet.ServletHandler$CachedChain::doFilter::1642'],
'status': 404}
```
Unable to use the API to fetch Tamr Unify's results.
Software | Version(s) |
---|---|
Python | 3.6 |
Tamr Unify server | Tamr Unify 2019.003.0 build 3a89900beb |
tamr-unify-client | 4.0-dev |
Operating System | Ubuntu |
Currently, using the high impact pair records requires that the client find, fetch, and scan the entire dataset for each of the records referenced by a high impact pair.
More friendly options would be:

- returning the full records from high_impact_pairs().records()
- supporting lookup of records by tamr_id from a dataset

high_impact_pairs is far less performant / scale-conscious in the API than it is in the UI. The UI shows high impact pairs efficiently, whereas the API requires a full scan of the source datasets.
I tried setting base_path='api' in order to use the request method for non-versioned APIs, but it seems not to be picked up. See below:
```
Python 3.7.0 (default, Nov 10 2018, 18:44:49)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tamr_unify_client import Client
>>> from tamr_unify_client.auth import UsernamePasswordAuth
>>> catmando_host = 'localhost'
>>> catmando_port = 9030
>>> tamr_user = 'admin'
>>> tamr_pw = 'dt'
>>> auth=UsernamePasswordAuth(tamr_user, tamr_pw)
>>> catmando_client = Client(auth, host=catmando_host, port=catmando_port, base_path='api')
>>> response = catmando_client.request( 'GET', 'service/health', headers={'Accept': 'application/json'})
>>> response
<Response [404]>
>>> response.url
'http://localhost:9030/service/health'
```
More investigation reveals that it is a problem with urllib.parse.urljoin:

```
>>> from urllib.parse import urljoin
>>> urljoin('http://localhost:9030' + '/' + 'api', 'service/health')
'http://localhost:9030/service/health'
```
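The root cause is urljoin's trailing-slash handling: without a trailing slash, the last path segment of the base URL is replaced rather than extended. A minimal demonstration:

```python
from urllib.parse import urljoin

# Without a trailing slash, 'api' is treated as a resource and replaced:
urljoin('http://localhost:9030/api', 'service/health')
# -> 'http://localhost:9030/service/health'

# With a trailing slash, 'api/' is treated as a directory and kept:
urljoin('http://localhost:9030/api/', 'service/health')
# -> 'http://localhost:9030/api/service/health'
```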
Add an example of how to use config files to provide credentials to the client in the user guide.
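A minimal sketch of what such an example could look like, assuming a YAML config file; the file name and keys are hypothetical, not an existing client feature:

```python
import yaml  # assumes PyYAML is installed

from tamr_unify_client import Client
from tamr_unify_client.auth import UsernamePasswordAuth

# config.yaml (hypothetical):
#   username: admin
#   password: dt
#   host: localhost
with open("config.yaml") as f:
    config = yaml.safe_load(f)

auth = UsernamePasswordAuth(config["username"], config["password"])
client = Client(auth, host=config["host"])
```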
Many of the Python client actions take a long while. While the default behavior of a synchronous action makes sense, it would be significantly improved if there were a simple callback that provided some indication of progress.
The progress is visible in the Unify UI, so I imagine the data is available in the job operation data. It could be extracted and provided to a callback.
Provide some information about how long something has taken and how far it has gotten, thereby offering some insight into how much longer it might take.
```python
dataset = project.unified_dataset()

def progress_callback(status, start_time, done, total):
    ...  # print progress information, or whatever

dataset.refresh(progress_callback=progress_callback)
```
When request is called with an absolute path (e.g. /api/service/health), it rewrites the path into a path relative to base_path instead of correctly resolving the absolute path.
Example:

```python
client.request(method, "/api/service/health")
```

Desired result: the request is made to http://unify-host:9100/api/service/health
Actual result: the request is made to http://unify-host:9100/api/versioned/v1//api/service/health
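One possible resolution rule, sketched as plain functions (the base URLs are illustrative, and this is not the shipped fix):

```python
from urllib.parse import urljoin

BASE = "http://unify-host:9100"
VERSIONED = BASE + "/api/versioned/v1/"

def resolve(path):
    # A leading '/' means an absolute path on the host;
    # anything else resolves relative to the versioned base path.
    if path.startswith("/"):
        return BASE + path
    return urljoin(VERSIONED, path)

assert resolve("/api/service/health") == "http://unify-host:9100/api/service/health"
assert resolve("projects/1") == "http://unify-host:9100/api/versioned/v1/projects/1"
```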
Trying to upload new data. I would prefer to create a new dataset, but the client doesn't support that; instead, I need to use Dataset.update_records.
I would like to be able to create a new dataset.
Concerns with updating into an existing dataset:

- updates must manage recordId values, which requires additional bookkeeping.

It appears that the server's API does support the creation of new datasets, but this isn't provided in the Python client: https://docs.tamr.com/reference#create-dataset
```python
def create_dataset(unify, dataset_config):
    """Create a dataset in Unify.

    :param unify: Unify Client
    :param dataset_config: Dataset Configuration
    :return: the created Dataset
    """
    from tamr_unify_client.models.dataset.resource import Dataset

    data = unify.post(unify.datasets.api_path, json=dataset_config).successful().json()
    return Dataset(unify, data, data["relativeId"])
```
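For illustration, a call might look like this (the field names follow the public create-dataset docs; the values are made up):

```python
dataset_config = {
    "name": "my_new_dataset",
    "keyAttributeNames": ["id"],
    "description": "Created via the versioned API",
}
dataset = create_dataset(unify, dataset_config)
print(dataset.relative_id)
```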
These terms are used nearly interchangeably:

- Object.relative_id (which reads data["relativeId"])
- Object.api_path
- the alias argument on BaseResource.__init__
"relative ID" is an unfamiliar term.
I understand relative path, and api path. Those describe things that I'm familiar with.
If I had to vote for one, I'd request api_path
, since it is tacked onto the end of the versioned API path.
Further confusing: sometimes resource.api_path
is valid and populated, and sometimes different from resource.relative_id
. This has to do with the path the resource was fetched from vs the canonical path that the resource lives at. For example, unified data sets.
```
>>> ds = project.unified_dataset()
>>> ds.api_path
'projects/3/unifiedDataset'
>>> ds.relative_id
'datasets/58'
```
How is this meant to be used?
Before:

```
>>> client.datasets.by_relative_id('datasets/58')
<tamr_unify_client.models.dataset.resource.Dataset object at 0x7fe17cddee80>
```

After:

```
>>> client.datasets.by_api_path('datasets/58')
<tamr_unify_client.models.dataset.resource.Dataset object at 0x7fe17cddee80>
```
Figure out a mechanism for pushing changes to the docs that should become immediately visible.
If we don't want to change our code, but we do want to change the docs that describe that code, how do we do that? We want to change the stable docs, not just the latest docs, but we don't want to necessitate a full release to do so.
#58 adds refinements to the Contributor Guide, but those refinements are only visible on latest / 0.4.0-dev. They should be visible on stable / 0.3.0 too.
We could add a commit fixing the docs both to master and to the most recent stable release branch. BUT readthedocs uses GitHub tags for building multiple versions of the docs, so we would need to re-tag after this commit got included? BUT the GitHub release is tied to the release tag... hmm 🤔
requests.Session allows for configuration of things like cookies, headers, and other connection parameters (including connection pooling). It would be nice to be able to specify the Session that the Client uses as an optional constructor argument. If one isn't provided, the Client should create a Session object at construction time.
Changes requested:

- Client constructor: accept an optional session argument, defaulting to None, in which case the Client gets a default session from requests.Session() that lives as long as the Client does.
- Client.request: make requests through the Session.
- Expose the session as Client.session.
Of significance for my usage: in order to connect over HTTPS with a self-signed certificate, I need to monkey-patch requests:

```python
# monkey-patching requests.Session.request to disable TLS verification
from functools import partialmethod
import requests

old_request = requests.Session.request
requests.Session.request = partialmethod(old_request, verify=False)
```
I would like to handle this without monkey-patching:

```python
session = requests.Session()
session.verify = False  # requests.Session() takes no constructor arguments
client = Client(session=session, ...)
```
Could you add a .clusters() method to the MasteringProject object? It should in turn support a .refresh() method to recreate the clusters without publishing them. This functionality is currently missing in the Python client but is necessary for a continuous mastering workflow.
In the docs, the Continuous Mastering example should also be updated to include this step (before publishing clusters). A usage sketch follows.
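A sketch of the requested usage; .clusters() and its refresh() are the proposed API, not shipped behavior:

```python
project = client.projects.by_resource_id("1").as_mastering()

clusters = project.clusters()  # proposed: record clusters sub-resource
op = clusters.refresh()        # proposed: recreate clusters without publishing
assert op.succeeded()
# ...then publish clusters as in the existing Continuous Mastering example.
```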
Is the username/password sent over plain HTTP, or over a secure (authenticated and encrypted) connection such as HTTPS?
In RELEASE.md, figure out where to include "Add release date to changelog".
Desire: add a useful repr() to all common objects throughout unify-client-python. In interactive Python, repr(x) for objects from the library should show elements that are useful to a developer, though they need not be eval-compatible.
Current behavior: the default Python Object.__repr__.
Discussed and negotiated in PR #59.
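A minimal sketch of the idea (not the repr negotiated in PR #59):

```python
class Dataset:
    def __init__(self, relative_id, name):
        self.relative_id = relative_id
        self.name = name

    def __repr__(self):
        # Developer-friendly; not required to be eval-compatible.
        return (
            f"{self.__class__.__qualname__}"
            f"(relative_id={self.relative_id!r}, name={self.name!r})"
        )

print(repr(Dataset("datasets/3", "my_dataset")))
# Dataset(relative_id='datasets/3', name='my_dataset')
```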
... causing tests to fail. Should use the Dataset.records method instead.
MasteringProject.high_impact_pairs and MasteringProject.published_clusters do not fetch the dataset information from the server. This is inconsistent with Project.unified_dataset:
```
>>> project.high_impact_pairs().external_id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/models/dataset/resource.py", line 23, in external_id
    return self.data["externalId"]
TypeError: 'NoneType' object is not subscriptable
>>> project.published_clusters().external_id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drice/c/tamr/unify-client-python/tamr_unify_client/models/dataset/resource.py", line 23, in external_id
    return self.data["externalId"]
TypeError: 'NoneType' object is not subscriptable
>>> project.unified_dataset().external_id
'idogs_unified_dataset'
```
Use the same logic as Project.unified_dataset: fetch the dataset object from the server, so that the returned Dataset has all of the appropriate fields and does not raise exceptions on standard property access.
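A sketch of the fix, mirroring the pattern Project.unified_dataset uses (the alias path here is an assumption, not the shipped implementation):

```python
def high_impact_pairs(self):
    # Fetch the dataset resource from the server instead of constructing
    # a Dataset whose data is None.
    alias = self.api_path + "/highImpactPairs"
    resource_json = self.client.get(alias).successful().json()
    return Dataset.from_json(self.client, resource_json, alias)
```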
Software | Version(s) |
---|---|
Python | 3.6 |
tamr-unify-client | 0.4.0-dev (master HEAD at the time of writing) |
Operating System | Linux (ubuntu:latest docker container) |
Many classes within tamr_unify_client use a data field member for static information, often loaded from the server's JSON response when the object is fetched.
This attribute is not documented. Because of its direct relationship to the server response, it is also subject to change from version to version. It should be marked as private via the underscore naming convention (_data).
This also affects tab completion in the Python 3 interactive REPL. It is particularly misleading when working with a dataset object, because data looks like it ought to be the collection of the dataset's records.
```
>>> dataset = project.unified_dataset()
>>> dataset.
dataset.api_path    dataset.data           dataset.external_id  dataset.from_json(  dataset.records(  dataset.relative_id  dataset.status(  dataset.update_records(
dataset.client      dataset.description    dataset.from_data(   dataset.name        dataset.refresh(  dataset.resource_id  dataset.tags     dataset.version
>>> dataset.
```
Proposed changes:

- Rename data to _data.
- Change self.data references to self._data.
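A sketch of the resulting convention: the raw response stays private, and documented fields are exposed as properties (illustrative, not the actual class):

```python
class Dataset:
    def __init__(self, client, data):
        self.client = client
        self._data = data  # raw server JSON; private and undocumented

    @property
    def name(self):
        return self._data["name"]
```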
Since a Dataset can produce a Python Geo FeatureCollection, it should also be updatable from a Python Geo FeatureCollection.
See https://gist.github.com/sgillies/2217756
Relates to #98
When given a FeatureCollection, Dataset should upsert records by matching the feature id to the record key.
Currently there is no support. Proposed solution: provide a from_geo_features(...) method on Dataset.
When working with geospatial data, most Python GIS tools are able to produce a FeatureCollection. Dataset should embody the One True Way to convert from a FeatureCollection to a Dataset.
```
>>> import geopandas
>>> my_dataset = unify.datasets.by_name("my_dataset")
>>> my_geo_data_frame = geopandas.GeoDataFrame.from_features(my_dataset)
>>> # Modify my_geo_data_frame
>>> my_dataset.from_geo_features(my_geo_data_frame)
```
Currently it's using @pcattori 's personal account.
Instead, the Python client should raise the HTTP error as an exception. E.g. if a 404 happens under the hood, don't show a JSONDecodeError; show the 404.
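A minimal sketch of the desired behavior, using requests directly (the URL is illustrative):

```python
import requests

response = requests.get("http://unify-host:9100/api/versioned/v1/projects/999")
response.raise_for_status()  # surfaces the 404 as requests.exceptions.HTTPError
data = response.json()       # never reached on error, so no JSONDecodeError
```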
A proposal for cases where an end user wants to use this library but add additional functionality, without the difficulty of monkey-patching or of permanently modifying the library (even within a single Python runtime).
My proposal is to add a (barely-documented) class mapping argument to the Client, which is used whenever a library class is invoked (i.e. a method is called on it, including its constructor).
```python
from tamr_unify_client import Client
from tamr_unify_client.auth import UsernamePasswordAuth
from tamr_unify_client.models.project.collection import ProjectCollection

class MyProjectCollection(ProjectCollection):
    pass  # user-defined subclass with extra functionality

def example():
    class_mapping = {ProjectCollection: MyProjectCollection}
    client = Client(
        UsernamePasswordAuth("username", "password"), class_mapping=class_mapping
    )
    assert isinstance(client.projects, MyProjectCollection)
```
It's a nice feature on datasets.
Currently it is hard to upload a dataset with a custom schema. Given a dataset name, a schema as JSON, and a CSV, efficiently upload the data. For some datasets, it's useful to generate GUIDs for the primary key field if it doesn't exist (see the sketch below).
I want to upload a dataset for golden records and bootstrap it. To do this I need a dataset with strings for the origin source name field, but uploading via the CSV endpoint in the UI converts everything to [string].
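A small sketch of the GUID-generation step, assuming a CSV file and a hypothetical "pk" key column:

```python
import csv
import uuid

with open("data.csv", newline="") as f:
    records = [
        # Generate a GUID primary key where a row doesn't already have one.
        dict(row, pk=row.get("pk") or str(uuid.uuid4()))
        for row in csv.DictReader(f)
    ]
```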
Calling my_dataset.update_records(stuff) returns None. It should instead return the result of self.client.post(self.api_path + ":updateRecords", data=body).
Create a project in Unify:

```python
def create_project(unify, project_config):
    """Create a Project in Unify.

    :param unify: Unify Client
    :param project_config: Project Configuration
    :return: the created Project
    """
    from tamr_unify_client.models.project.resource import Project

    data = unify.post(unify.projects.api_path, json=project_config).successful().json()
    return Project(unify, data, data["relativeId"])
```
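For illustration, a call might look like this; the values are made up, and the field names follow the public create-project docs ("DEDUP" being the mastering project type):

```python
project_config = {
    "name": "my_mastering_project",
    "type": "DEDUP",
    "unifiedDatasetName": "my_mastering_project_unified_dataset",
}
project = create_project(unify, project_config)
print(project.relative_id)
```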
Currently, if we call a non-versioned or husk API, we cannot use the Operation status structure to check success or poll the operation status.
It would be great to have the Operation class generalized a bit so it can be used for custom API calls, as sketched below.
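A sketch of the desired usage; Operation.from_response and the endpoint path are assumptions for illustration, not a documented API:

```python
from tamr_unify_client.models.operation import Operation

# Call a custom (non-versioned) endpoint through the client...
response = client.request("POST", "some/custom/endpoint:refresh")

# ...then wrap the response in the generalized Operation machinery.
op = Operation.from_response(client, response)  # assumed constructor
op = op.wait()  # poll until the operation resolves
assert op.succeeded()
```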
Downloading from PyPI means you have to import the package as tamr_unify_client, which differs from the documentation, which states import unify_api_v1.
Can we have docs that specify how to use the Python client to call non-versioned APIs?
```python
def create_attribute(dataset, attribute_config):
    """Create an Attribute in Unify.

    :param dataset: the Unify Dataset to which to add the attribute
    :type dataset: :class:`tamr_unify_client.models.dataset.resource.Dataset`
    :param attribute_config: the configuration of the attribute to create
    :type attribute_config: dict[str, object]
    :return: the created Attribute
    """
    from tamr_unify_client.models.attribute.resource import Attribute

    data = dataset.client.post(dataset.attributes.api_path, json=attribute_config).successful().json()
    alias = dataset.attributes.api_path + "/" + attribute_config["name"]
    return Attribute(dataset.client, data, alias)
```
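For illustration, an attribute_config might look like this (the field names follow the public attribute-types docs; treat them as assumptions):

```python
attribute_config = {
    "name": "my_attribute",
    "description": "Description of my_attribute",
    "type": {"baseType": "ARRAY", "innerType": {"baseType": "STRING"}},
}
attr = create_attribute(dataset, attribute_config)
```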
JSON is a string. Data (in the way it is used here) is a dictionary.
In many cases, from_json is actually taking a dictionary as an argument named resource_json.
An argument or method referring to json should be working with strings. If it is only working with structured data (that may or may not have once been a string), it should not be referred to as "json".
Current state:

- Code elsewhere expects resource_class.from_json to exist.
- Each resource defines a from_json that calls BaseResource.from_data. Note that there is no conversion of string to dict between the two calls.

I'm fairly certain that all from_json methods can be removed, and calls to them can be changed to from_data (which is already a @classmethod, so cls will be set to the subclass it is invoked upon, without that subclass explicitly redefining from_data).
I got confused.
Software | Version(s) |
---|---|
tamr-unify-client | 4.0-dev |
Tamr Unify server | n/a |
Python | n/a |
Operating System | n/a |
I discovered, through experimentation, that Dataset.update_records uses the dataset's keyAttributeNames as the recordId; the corresponding value in the record itself is ignored. I don't know how this works if there are multiple key attributes.
This was not clear from the documentation. The Python client documentation defers to https://docs.tamr.com/reference#modify-a-datasets-records, which says only:
Field | Description |
---|---|
action | The action being requested: CREATE or DELETE. |
recordId | The ID of the record. |
record | Optional. The information the new record will contain upon creation. Fields in the record must exist in the schema of the dataset they are being added to. If fields are not in the schema, they will be ignored. |
This isn't sufficient information to understand how to use the API without experimentation and close scrutiny.
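For concreteness, an update stream per the table above might look like this (record contents are made up; the dataset is assumed to have keyAttributeNames == ["id"]):

```python
updates = [
    {"action": "CREATE", "recordId": "rec-1", "record": {"id": "rec-1", "name": "Alice"}},
    {"action": "DELETE", "recordId": "rec-2"},
]
my_dataset.update_records(updates)
```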
Switch to a paid plan on readthedocs so we don't have ads on our docs.
Return all datasets from one project.
Currently, it's using @pcattori 's personal account.
Dataset should provide a Python Geo interface for geospatial data.
See: https://gist.github.com/sgillies/2217756
Produce a GeoJSON FeatureCollection from a dataset:

```python
import json

dataset = client.datasets.by_name("my_dataset")
with open("my_dataset.json", "w") as f:
    json.dump(dataset.__geo_interface__, f)
```

Currently there is no __geo_interface__ on Dataset. Proposed solution: provide __geo_interface__ on Dataset.
When exporting geospatial data from Unify, many Python geospatial packages are able to consume the Geo interface. For interoperability, a Unify Dataset should provide __geo_interface__ to easily convert to a FeatureCollection.
```
>>> dataset = client.datasets.by_name("my_dataset")
>>> feature_collection = dataset.__geo_interface__
>>> feature_collection["type"]
"FeatureCollection"
>>> feature = feature_collection["features"][0]
>>> feature["type"]
"Feature"
>>> geometry = feature["geometry"]
>>> geometry["type"]
"Polygon"
>>> geometry["coordinates"]
[[[-71.1522084419519, 42.3745215835176], [-71.1521881193882, 42.3744888310051], [-71.1521638278098, 42.3744971184826], [-71.1521246630653, 42.3744340027633], [-71.1521483999176, 42.3744259044764], [-71.1521407901139, 42.3744136394194], [-71.1521190173832, 42.3743785507759], [-71.1520756633469, 42.3743086828514], [-71.1520842951625, 42.3743095255146], [-71.1521193262041, 42.3743129156902], [-71.1521547542348, 42.374316315158], [-71.1521573062878, 42.3743165632504], [-71.1522809543662, 42.3743287505821], [-71.1523849604337, 42.3743390013382], [-71.1523885493228, 42.3743447856967], [-71.1524009012572, 42.3743405718856], [-71.1524174585427, 42.3743422046124], [-71.1524742202209, 42.3744336791865], [-71.1524042055534, 42.3744575646173], [-71.1524038456022, 42.3744793152621], [-71.1523870075068, 42.3744958511921], [-71.1523590300683, 42.3745053960713], [-71.1523312402649, 42.374504881751], [-71.1523030827935, 42.3744923997539], [-71.1523017432803, 42.3744902404282], [-71.1522860204876, 42.3744956045043], [-71.1522915145566, 42.3745044573864], [-71.1522687642496, 42.374512218378], [-71.1522630222685, 42.3745029641195], [-71.1522084419519, 42.3745215835176]]]
```