Comments (9)
I also hit this issue while working on changes for #32814.
I wanted to add that this happens when parsing the response from the server (i.e. you already have to have something set up and talking to some endpoint), and when it happens, no request or response log is shown. That makes it more difficult to debug, since the failure is in response parsing but it's unclear which response triggers it. For example, in my case most of the tested API endpoints work just fine, but some reliably trigger this issue.
from airbyte.
Hey @imrehg, thanks a lot for picking this up and providing a solution. I'm going to implement the MR on my end and test if it fixes my issue too. I'll report back with my results shortly.
from airbyte.
Hi @imrehg, your patch works fine and behaves as expected.
Looks good, and hopefully it can be merged and released in a future version soon.
Really appreciate you taking the lead on this.
from airbyte.
@tturkenitz could you try the proposed changes in the above PR, if you have a chance? Edit: It's slightly different from your workaround, but the spirit is the same (i.e. setting defaults when "type" is not available). It should be getting to the root of the problem; see the comment below.
from airbyte.
I got my debug system working, and it's actually likely a more complex edge case than what the linked MR handled. The `node` value that triggers it for me is:
{
    "anyOf": [
        {"type": ["boolean", "null", "string"]},
        {
            "type": "object",
            "properties": {
                "type": {"type": "string"},
                "interval": {"type": "string"},
                "maxValue": {"type": "number"},
                "minValue": {"type": "number"},
                "currencyCode": {"type": "string"},
            },
        },
    ]
}
Thus what happens is not that "there's no type info", but that "the type info is not correctly extracted from an `anyOf`".
I've updated the MR to fix the issue correctly for my case. I do wonder if these changes would work for you too, @tturkenitz, or whether your issue is triggered by some other cleanup edge case. Is there a chance that you can test-drive this PR?
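To make the failure mode described above concrete, here is a minimal sketch. It is not the CDK's actual code: `naive_clean` is a hypothetical, simplified cleaner that only assumes two things stated in this thread, namely that the cleanup step expects a `type` key on every node and that an `anyOf` which is not a plain "something or null" pair is left in place.

```python
def naive_clean(node):
    """Hypothetical cleaner (not the CDK implementation).

    Collapses the simple ``anyOf: [X, null]`` case into a nullable type and
    then assumes that "type" is present on the node.
    """
    if "anyOf" in node:
        entries = node["anyOf"]
        null_entries = [e for e in entries if e.get("type") == "null"]
        # Only the "X or null" pair is collapsed into a single typed node.
        if len(entries) == 2 and len(null_entries) == 1:
            real = next(e for e in entries if e.get("type") != "null")
            node.pop("anyOf")
            node.update(real)
            node["type"] = [real["type"], "null"]
    # Any other anyOf shape still has no top-level "type", so this raises.
    return node["type"]


simple = {"anyOf": [{"type": "string"}, {"type": "null"}]}
complex_node = {
    "anyOf": [
        {"type": ["boolean", "null", "string"]},
        {"type": "object", "properties": {"type": {"type": "string"}}},
    ]
}

print(naive_clean(simple))
try:
    naive_clean(complex_node)
except KeyError as err:
    print("KeyError:", err)
```

The second node has type information in every `anyOf` entry, yet the lookup still fails, because nothing ever lifts that information to the top level of the node.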
from airbyte.
I've tested the fix on my end, but it doesn't resolve my specific issue. The `type` field is missing from the document entirely, and re-adding it as `STRING` resolves the problem. This occurs when I interact with the Coupa ExpenseReports endpoint, and only when extracting more than one record. I suspect a schema difference between the records could be the issue, but I'm not familiar enough with the Airbyte code. My assumption is that Airbyte is able to reconcile the schemas into one master schema document when such things happen, but maybe not in this case?
I can share the schema, but it's quite large at 15k lines, and I'm not sure if it reflects the missing `type` accurately. It seems to be the schema Airbyte retrieved from the first document.
Adding
if node.get("type", "") == "":
    node["type"] = "string"
does solve the issue for me, but I doubt it's production-ready code.
from airbyte.
@tturkenitz for testing, could you try a debug step?
In `schema_inferrer.py`:
below that line add:
if "type" not in node:
    print(node)
which should print the node data in the logs when the cleaning step fails. This would show your offending schema content.
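If patching the CDK is inconvenient, the same check can be run offline against a captured schema. The helper below is only a sketch: `find_untyped_nodes` is a hypothetical name, not part of the CDK, and it assumes the schema uses only the `properties`, `items` and `anyOf` keywords to nest nodes.

```python
def find_untyped_nodes(node, path="$"):
    """Yield (path, node) for every schema node that has no "type" key."""
    if not isinstance(node, dict):
        return
    if "type" not in node:
        yield path, node
    # Descend through the nesting keywords this sketch knows about.
    for key, child in node.get("properties", {}).items():
        yield from find_untyped_nodes(child, f"{path}.properties.{key}")
    if "items" in node:
        yield from find_untyped_nodes(node["items"], f"{path}.items")
    for index, child in enumerate(node.get("anyOf", [])):
        yield from find_untyped_nodes(child, f"{path}.anyOf[{index}]")


schema = {
    "type": "object",
    "properties": {
        "amount": {
            "anyOf": [
                {"type": "string"},
                {"type": "object", "properties": {"value": {"type": "null"}}},
            ]
        }
    },
}

for location, offending in find_untyped_nodes(schema):
    print(location, offending)
```

Each reported path points at a node that would fail an `if "type" not in node` check, which narrows down where in a large schema to look.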
Let me know if any of this is unclear.
I would be surprised if the schema inferrer had no info to go on (the `type` doesn't come from your source, but from the tool that looks at the source's response).
> I suspect a schema difference between the records could be the issue

That's totally the case for my breakage as well: that's when the inferrer ends up with an `anyOf` entry (a collection of variations of the data encoded in them), and the handling of that `anyOf` is the problematic bit in the code.
from airbyte.
I captured the malformed schema. It seems I was actually wrong: `type` does exist in the schema, but it is set to `null`. This confuses me, because my solution assumed that `type` is completely missing and adds it back to the schema. But maybe `type` is removed by Airbyte as it processes the schema, and my solution re-adds it? Lots of assumptions, sorry!
Here is the node; you can see that there are multiple attributes, like `parent-id`, `salesforce-id` and `avatar-thumb-url`, where the type is null.
{
  "anyOf": [
    {
      "type": "string"
    },
    {
      "type": "object",
      "properties": {
        "parent-id": {
          "type": "null"
        },
        "lookup": {
          "type": "object",
          "properties": {
            "content-groups": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "updated-by": {
                    "type": "object",
                    "properties": {
                      "salesforce-id": {
                        "type": "null"
                      },
                      "avatar-thumb-url": {
                        "type": "null"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  ]
}
This is the log row header:
airbyte-connector-builder-server | 2024-06-06 14:14:47 INFO i.a.w.i.VersionedAirbyteStreamFactory(logMalformedLogMessage):390
from airbyte.
Hey @tturkenitz, thanks for passing on the node information!
I'm skeptical of the `null` causing any issues. I think the schema inferrer would produce `null` if all the examples it has seen had `null` as the value (and APIs can send that back: sending a field name but setting it explicitly to `null`, rather than only sending the field if it has a non-null value).
Instead, your troublesome node seems to have the same characteristics as mine (an `anyOf` that isn't just a `null` and something else as two entries).
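As a rough illustration of the two situations described above, here is a toy inference function. It is an assumption-laden sketch, not how the CDK's inferrer is implemented: `json_type` and `infer_field` are hypothetical helpers that only track the top-level type of each sampled value.

```python
def json_type(value):
    """Map a Python value to a JSON Schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before numbers: bool is an int subclass
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_field(samples):
    """Toy inference: one schema entry per distinct observed type."""
    seen = []
    for value in samples:
        name = json_type(value)
        if name not in seen:
            seen.append(name)
    if len(seen) == 1:
        return {"type": seen[0]}
    return {"anyOf": [{"type": name} for name in seen]}


# Every record sent the field explicitly as null: the type stays "null".
print(infer_field([None, None, None]))

# Records disagree (a string in one, an object in another): an anyOf appears.
print(infer_field(["EUR", {"currencyCode": "EUR"}]))
```

Under these assumptions, a field that is always `null` is harmless, while a field whose shape differs between records produces exactly the kind of multi-entry `anyOf` discussed in this thread.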
I've tested your `node` info, and I've found that:
- the current `master` branch indeed breaks with that `KeyError`
- the patch from #39146 works correctly with your example too, so it seems to be addressing the issue
Sorry if this sounds basic, but could you check whether you were actually using the patched version of the CDK from that merge request?
Direct testing
For reference, since I don't have access to the source you are using, I've used a simple script for direct testing that feeds your node straight into the schema inferrer's `_clean` method. Install the version of the CDK in a Python environment and run the script. With the patch I get `Success`; with the CDK from `master` it's the usual failure.
import airbyte_cdk as cdk
node = {
"anyOf": [
{
"type": "string"
},
{
"type": "object",
"properties": {
"id": {
"type": "number"
},
"created-at": {
"type": "string"
},
"updated-at": {
"type": "string"
},
"active": {
"type": "boolean"
},
"name": {
"type": "string"
},
"description": {
"type": "string"
},
"external-ref-num": {
"type": "string"
},
"external-ref-code": {
"type": "string"
},
"parent-id": {
"type": "null"
},
"lookup-id": {
"type": "number"
},
"depth": {
"type": "number"
},
"is-default": {
"type": "boolean"
},
"approval-group-1": {
"type": "string"
},
"approval-user-1": {
"type": "string"
},
"approval-group-2": {
"type": "string"
},
"approval-user-2": {
"type": "string"
},
"custom-fields": {
"type": "object",
"properties": {
"watcher": {
"type": "string"
},
"watcher-group": {
"type": "string"
},
"requester-known-for-invoice": {
"type": "string"
},
"territory": {
"type": "string"
}
}
},
"lookup": {
"type": "object",
"properties": {
"id": {
"type": "number"
},
"created-at": {
"type": "string"
},
"updated-at": {
"type": "string"
},
"active": {
"type": "boolean"
},
"name": {
"type": "string"
},
"description": {
"type": "string"
},
"fixed-depth": {
"type": "boolean"
},
"level-1-name": {
"type": "string"
},
"level-2-name": {
"type": "string"
},
"level-3-name": {
"type": "string"
},
"level-4-name": {
"type": "string"
},
"level-5-name": {
"type": "string"
},
"level-6-name": {
"type": "string"
},
"level-7-name": {
"type": "string"
},
"level-8-name": {
"type": "string"
},
"level-9-name": {
"type": "string"
},
"level-10-name": {
"type": "string"
},
"content-groups": {
"type": "array",
"items": {
"type": "object",
"properties": {
"id": {
"type": "number"
},
"created-at": {
"type": "string"
},
"updated-at": {
"type": "string"
},
"name": {
"type": "string"
},
"description": {
"type": "string"
},
"updated-by": {
"type": "object",
"properties": {
"id": {
"type": "number"
},
"login": {
"type": "string"
},
"email": {
"type": "string"
},
"employee-number": {
"type": "string"
},
"firstname": {
"type": "string"
},
"lastname": {
"type": "string"
},
"fullname": {
"type": "string"
},
"salesforce-id": {
"type": "null"
},
"avatar-thumb-url": {
"type": "null"
},
"department-ucf": {
"type": "string"
},
"role": {
"type": "string"
},
"uaf": {
"type": "string"
},
"custom-fields": {
"type": "object",
"properties": {
"test-employee-number": {
"type": "string"
},
"default-cost-center": {
"type": "string"
},
"frequent-buyer-training": {
"type": "boolean"
},
"approver-training": {
"type": "boolean"
},
"starter": {
"type": "boolean"
},
"coa-test": {
"type": "string"
}
}
}
}
}
}
}
}
}
},
"account-type": {
"type": "object",
"properties": {
"id": {
"type": "number"
},
"name": {
"type": "string"
}
}
}
}
}
]
}
cdk.utils.SchemaInferrer._clean(None, node)
print("Success")
from airbyte.