blockchain-etl / polygon-etl

ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub

License: MIT License

Languages: Python 82.75%, Shell 0.14%, Dockerfile 0.20%, Mustache 0.26%, Java 16.65%
Topics: polygon, bigquery, gcp, airflow, data-engineering, etl, matic-network, maticnetwork, cryptocurrency

polygon-etl's Introduction

Polygon-etl

Overview

Polygon ETL allows you to set up an ETL pipeline in Google Cloud Platform for ingesting Polygon blockchain data into BigQuery and Pub/Sub. It comes with CLI tools for exporting Polygon data into convenient formats such as CSV files and relational databases.
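
For example, a minimal sketch of driving the CLI from Python (the block range, endpoint and file names are arbitrary placeholders; the flags mirror the export_blocks_and_transactions invocations quoted in the issues further down this page):

    import subprocess

    # Export a small block range to CSV with the polygonetl CLI.
    # Block range, endpoint and output paths are placeholders.
    subprocess.run(
        [
            "polygonetl", "export_blocks_and_transactions",
            "--start-block", "1000000",
            "--end-block", "1000100",
            "--provider-uri", "https://polygon-rpc.com",
            "--blocks-output", "blocks.csv",
            "--transactions-output", "transactions.csv",
        ],
        check=True,
    )

The same command can of course be run directly from a shell.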

Architecture

(Architecture diagram: polygon_etl_architecture.svg)

  1. The nodes are run in a Kubernetes cluster.

  2. Airflow DAGs export and load Polygon data to BigQuery daily. Refer to Polygon ETL Airflow for deployment instructions.

  3. Polygon data is polled periodically from the nodes and pushed to Google Pub/Sub. Refer to Polygon ETL Streaming for deployment instructions.

  4. Polygon data is pulled from Pub/Sub, transformed and streamed to BigQuery. Refer to Polygon ETL Dataflow for deployment instructions.

Setting Up

  1. Follow the instructions in Polygon ETL Airflow to deploy a Cloud Composer cluster for exporting and loading historical Polygon data. It may take several days for the export DAG to catch up. During this time the "load" and "verify_streaming" DAGs will fail.

  2. Follow the instructions in Polygon ETL Streaming to deploy the Streamer component. For the value in last_synced_block.txt, specify the last block number of the previous day. You can query it in BigQuery: SELECT number FROM crypto_polygon.blocks ORDER BY number DESC LIMIT 1 (see the sketch after this list).

  3. Follow the instructions in Polygon ETL Dataflow to deploy the Dataflow component. Monitor the "verify_streaming" DAG in the Airflow console; once the Dataflow job catches up to the latest block, the DAG will succeed.
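
A minimal sketch of the step 2 query using the google-cloud-bigquery client (assuming application default credentials and read access to the public dataset; the fully qualified table path is an assumption based on the queries quoted further down this page):

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes GCP credentials are configured
    sql = """
        SELECT number
        FROM `bigquery-public-data.crypto_polygon.blocks`
        ORDER BY number DESC
        LIMIT 1
    """
    last_block = list(client.query(sql).result())[0]["number"]
    # Write the value the Streamer reads from last_synced_block.txt.
    with open("last_synced_block.txt", "w") as f:
        f.write(str(last_block))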

Code quality

Over time, we intend to format Python files in this repo using isort and black. At the moment, we are only formatting files that are changed or added.

We have not implemented any sort of automation (e.g. pre-commit), but a requirements_dev.txt is provided for contributors to use.

Testing

Various tests are implemented (airflow/tests, cli/tests and ./tests). As part of an effort towards consistency, they all source the same requirements_test.txt.

polygon-etl's People

Contributors

alifier, araa47, ayadigithub, charlielewisme, cheungringo, daishuai323, gulshngill, hansolavhilsen, jasonbxu, jdmaturen, kome12, krivars, markusbkoch, medvedev1088, mexcool, ninjascant, nothingnix, prateekeng, pswies, qshao-gauntlet, rstevens2022, shuoer86, simoncousaert, swel4ik, thelazyliz, timnooren, wegory


polygon-etl's Issues

No data in crypto_polygon in BigQuery since 2022-09-06

Hello!

If I am not mistaken, there is no data in the crypto_polygon dataset in BigQuery since 2022-09-06.

See this query output:

select date(block_timestamp), count(*)
from `bigquery-public-data`.crypto_polygon.transactions
where true
  AND block_timestamp > '2022-06-01'
group by 1 order by 1 desc
limit 1000

Am I missing something?

Thanks,
- JO

EIP-1559

Polygon recently implemented EIP-1559. This has introduced a base fee per block and miner tips. Would it be possible to update the schema?
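
For reference, a small worked example of the EIP-1559 quantities a schema update would need to capture (made-up numbers, not the project's schema): the block-level base fee plus the per-transaction priority fee determine the effective gas price.

    # Illustrative numbers only (all in wei); field names follow the EIP-1559 spec.
    base_fee_per_gas = 30_000_000_000          # from the block header
    max_fee_per_gas = 60_000_000_000           # from the transaction
    max_priority_fee_per_gas = 2_000_000_000   # from the transaction
    gas_used = 21_000                          # from the receipt

    priority_fee_per_gas = min(max_priority_fee_per_gas, max_fee_per_gas - base_fee_per_gas)
    effective_gas_price = base_fee_per_gas + priority_fee_per_gas

    base_fee_paid = base_fee_per_gas * gas_used        # 630,000 gwei
    tip_to_producer = priority_fee_per_gas * gas_used  # 42,000 gwei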

the method eth_newFilter does not exist/is not available

Could you please help me with the following error:

  File "/lib/python3.9/site-packages/web3/manager.py", line 112, in request_blocking
    raise ValueError(response["error"])
ValueError: {'code': -32004, 'message': 'the method eth_newFilter does not exist/is not available'}

I get this error when executing the command polygonetl export_all -p https://polygon-rpc.com -s 1100 -e 1111 -b 10 -o ./.
I have also changed the provider URI to our locally hosted node, but I get the same error.
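
One possible cause (an assumption, not confirmed): some hosted endpoints do not expose filter RPCs such as eth_newFilter, which the export pipeline ends up calling here. A quick probe, assuming web3.py is installed:

    from web3 import Web3

    w3 = Web3(Web3.HTTPProvider("https://polygon-rpc.com"))
    # Send the raw RPC call and inspect the response instead of letting the
    # exporter raise; an 'error' entry with code -32004 means the endpoint
    # does not expose eth_newFilter.
    resp = w3.provider.make_request(
        "eth_newFilter", [{"fromBlock": "0x44c", "toBlock": "0x457"}]  # 1100-1111 in hex
    )
    print(resp)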

[Urgent] Polygon Transaction data duplicated and data quality issue

As you can see in the Polygon transactions table, the transaction with this hash appears at 2022-02-09 00:26:21 UTC in block 24714823.

select
* from `public-data-finance.crypto_polygon.transactions`
where TIMESTAMP_TRUNC(block_timestamp, hour) = TIMESTAMP_TRUNC("2022-02-09 00:00:00", hour)
    AND `hash` = "0xd481ff0f96cdea342e8baa7d3ca65d59589cdec2782b0b7b4f65731bfae0e63e"
    ;

(Screenshot: BigQuery query results)

But when you look at https://polygonscan.com/tx/0xd481ff0f96cdea342e8baa7d3ca65d59589cdec2782b0b7b4f65731bfae0e63e
The transaction happened in Feb-09-2022 12:34:55 AM +UTC

So we dug into the data and found that the transaction is recorded in two different blocks, with two different block timestamps as well.

select
    *
from `public-data-finance.crypto_polygon.transactions`
where TIMESTAMP_TRUNC(block_timestamp, hour) >= TIMESTAMP_TRUNC("2022-02-09 00:00:00", hour)
    AND `hash` = "0xd481ff0f96cdea342e8baa7d3ca65d59589cdec2782b0b7b4f65731bfae0e63e"

(Screenshot: BigQuery query results)

but we couldn't find the transaction in https://polygonscan.com/txs?block=24714823 at all.

Can anyone help dig deeper into this?
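
A minimal sketch for cross-checking which block a node currently considers canonical for this hash (assuming web3.py and any reachable Polygon JSON-RPC endpoint); if BigQuery shows the same hash under a second block number, that row may be a leftover from a reorged block rather than a second transaction:

    from web3 import Web3

    w3 = Web3(Web3.HTTPProvider("https://polygon-rpc.com"))
    tx = w3.eth.get_transaction(
        "0xd481ff0f96cdea342e8baa7d3ca65d59589cdec2782b0b7b4f65731bfae0e63e"
    )
    # Compare these with the block_number / block_timestamp values returned
    # by the BigQuery queries above.
    print(tx["blockNumber"], tx["blockHash"].hex())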

missing inbound transactions for some addresses

Hey everyone, can anyone help me investigate a weird balance / missing transaction on Polygonscan? My team is investigating a few odd MATIC balances in our data, and we came across Hop Protocol, which in the examples here allows bridging MATIC from the Ethereum network to Polygon.
We isolated a few examples of addresses clearly sending MATIC to Polygon, but Polygonscan and OKLink don't show any incoming transaction for the MATIC itself on Polygon.

This address: https://polygonscan.com/address/0xfdf6c6096e70799e53d6172d32c6c3b6f3d1f38f
It has only one transaction, in which it sends about 17.3K MATIC to Anyswap to bridge to Boba. But nothing shows where the MATIC came from in the first place.
I found this transaction on Ethereum, https://etherscan.io/tx/0x81ccb965f059f53a6edd005a3f3645840c24f24440eb26a04e2b0ebee757dd95 - a "send to L2" of the same amount, happening beforehand.

So that's where the MATIC is coming from. However, I need a way to see the inbound MATIC on the Polygon chain. Is there a missing transaction here? How did the MATIC actually end up in that address?

Also look at this address : https://polygonscan.com/address/0x718c0c212db4fc57f8fe14e92229a604df666ee8

It also has higher outbound than inbound MATIC transfers.

export_contracts returns is_erc20=false, even though the contract is ERC-20

Description:
I have deployed an ERC-20 contract on the Polygon testnet (Mumbai).

While testing different commands, I noticed that calling:
polygonetl export_contracts --contract-addresses contract_addresses.txt --provider-uri http://matic-mumbai.chainstacklabs.com --output contracts.csv

returns all contracts, but the is_erc20 and is_erc721 fields are always FALSE.

Environment:
PolygonEtl version: polygonetl, version 0.1.8
Network: Polygon Mumbai testnet
JSON-RPC: http://matic-mumbai.chainstacklabs.com
Chain id: 80001
Transaction hash: 0xb2b0a3a0c07dbf7be8d52f0cb3990bf5ef0088121bf23854ef6f1b7f6c80eac7
Contract Address: 0xCc5dDbAECDB40993d779b68b0CE503219aE5A144
Contract file: https://github.com/OpenZeppelin/openzeppelin-contracts/blob/master/contracts/token/ERC20/ERC20.sol

Steps to reproduce:

1. polygonetl export_blocks_and_transactions --start-block 26347319 --end-block 26347321 --provider-uri http://matic-mumbai.chainstacklabs.com --blocks-output blocks.csv --transactions-output transactions.csv
2. polygonetl extract_csv_column --input transactions.csv --column hash --output transaction_hashes.txt
3. polygonetl export_receipts_and_logs --transaction-hashes transaction_hashes.txt --provider-uri http://matic-mumbai.chainstacklabs.com --receipts-output receipts.csv --logs-output logs.csv
4. polygonetl extract_token_transfers --logs logs.csv --output token_transfers.csv
5. polygonetl extract_csv_column --input receipts.csv --column contract_address --output contract_addresses.txt
6. polygonetl export_contracts --contract-addresses contract_addresses.txt --provider-uri http://matic-mumbai.chainstacklabs.com --output contracts.csv

Check the contracts.csv file: for the contract (addr = 0xcc5ddbaecdb40993d779b68b0ce503219ae5a144), the is_erc20 and is_erc721 fields are always FALSE.

crypto-polygon traces table: traceid order in schema

This is a minor, but annoying issue:

In the schema for the traces table, the traceid is the last column in the public data crypto-ethereum dataset, but is NOT the last column in the crypto-polygon dataset.

That prevents me from writing concise cross-blockchain queries (using, for example, UNION). It'd be awesome if the schemas for the crypto-polygon and crypto-ethereum datasets were identical.
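
Until the schemas are aligned, one workaround is to name the columns explicitly on both sides of the UNION so that ordering no longer matters. A sketch using the google-cloud-bigquery client (the column subset is illustrative; adjust it to the actual schemas):

    from google.cloud import bigquery

    sql = """
        SELECT block_number, transaction_hash, from_address, to_address, trace_id
        FROM `bigquery-public-data.crypto_ethereum.traces`
        WHERE DATE(block_timestamp) = "2022-01-01"
        UNION ALL
        SELECT block_number, transaction_hash, from_address, to_address, trace_id
        FROM `public-data-finance.crypto_polygon.traces`
        WHERE DATE(block_timestamp) = "2022-01-01"
    """
    rows = bigquery.Client().query(sql).result()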

Receipt Data cannot be extracted with "export_all" command.

When I use the "export_all" command I get the following error for the receipts data (I could extract receipt data via the export receipts command), but would appreciate support for this issue:

python3.9/site-packages/polygonetl/jobs/export_receipts_job.py", line 70, in
receipts = [self.receipt_mapper.json_dict_to_receipt(result) for result in results]
File "/python3.9/site-packages/polygonetl/utils.py", line 71, in rpc_response_batch_to_results
yield rpc_response_to_result(response_item)
File "/python3.9/site-packages/polygonetl/utils.py", line 82, in rpc_response_to_result
raise RetriableValueError(error_message)
polygonetl.misc.retriable_value_error.RetriableValueError: result is None in response {'jsonrpc': '2.0', 'id': 0, 'result': None}. Make sure polygon node is synced.
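
A minimal way to narrow this down, assuming the failure corresponds to the node returning null for eth_getTransactionReceipt (which is what the error text reports): replay the call for one of the affected transaction hashes against the same provider.

    from web3 import Web3

    w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # same provider the export used
    resp = w3.provider.make_request(
        "eth_getTransactionReceipt",
        ["0x<one of the transaction hashes from your export>"],  # placeholder
    )
    # {'result': None} reproduces the condition the exporter complains about,
    # matching the exporter's own hint that the node may not be fully synced.
    print(resp)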

Unexpected value for max(block_number) in logs table at some point

We are currently using the Polygon dataset available in BigQuery. As part of this process, every hour we query the logs table for the max block number. This query always returns the expected results, but in one instance we got back an anomalous number: 7277816997830857785.

Now this only happened once, but has caused us a lot of headaches due to a run-away process that was trying to catch up to this number.

It would be very helpful if anyone familiar with the ETL process could provide insight into why this might have happened. Is there some temporary state the data was in when this query was run?

I tried looking into the implementation in this repo but did not make any meaningful progress.

This occurred around 10 pm US Pacific Standard Time on Dec 26th but has not been reproducible since then.
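
Without knowing the root cause, a defensive variant of the hourly check (a sketch, not the pipeline's own query) is to bound the result by the blocks table, so an out-of-range row in logs cannot be mistaken for the chain head. Table and column names follow the public dataset referenced above.

    from google.cloud import bigquery

    sql = """
        SELECT MAX(block_number) AS max_block
        FROM `bigquery-public-data.crypto_polygon.logs`
        WHERE block_number <= (
            SELECT MAX(number) FROM `bigquery-public-data.crypto_polygon.blocks`
        )
    """
    row = list(bigquery.Client().query(sql).result())[0]
    print(row["max_block"])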

crypto-polygon.traces table: value column is STRING?

I'm creating queries that look at both ethereum mainnet and the polygon chain, and was surprised to get a mismatch in the 'traces' table between the crypto-polygon dataset you maintain and the Google public crypto-ethereum dataset on the value column.

It's NUMERIC versus STRING; seems to me STRING is correct.
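
In the meantime, casting value to STRING on both sides lets a cross-chain UNION type-check regardless of which dataset uses NUMERIC. A sketch (dataset paths as quoted elsewhere on this page):

    from google.cloud import bigquery

    sql = """
        SELECT transaction_hash, CAST(value AS STRING) AS value
        FROM `bigquery-public-data.crypto_ethereum.traces`
        WHERE DATE(block_timestamp) = "2022-01-01"
        UNION ALL
        SELECT transaction_hash, CAST(value AS STRING) AS value
        FROM `public-data-finance.crypto_polygon.traces`
        WHERE DATE(block_timestamp) = "2022-01-01"
    """
    rows = bigquery.Client().query(sql).result()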

filter_items command does not detect if contract is token address

Description:
I have deployed an ERC-20 contract on the Polygon testnet (Mumbai).

While testing different commands, I noticed that calling:
polygonetl filter_items -i contracts.csv -p "item['is_erc20'] or item['is_erc721']" | polygonetl extract_field -f address -o token_addresses.txt

returns the contract as a token address even if the is_erc20 and is_erc721 items are FALSE.

Environment:
PolygonEtl version: polygonetl, version 0.1.8
Network: Polygon Mumbai testnet
JSON-RPC: http://matic-mumbai.chainstacklabs.com/
Chain id: 80001
Transaction hash: 0x826a27ebe789e91bb06489dccf5e7382b6109f9eadccff696b32fcd64f1a0a32

Steps to reproduce:

  1. polygonetl export_blocks_and_transactions --start-block 26363958 --end-block 26363958 --provider-uri http://matic-mumbai.chainstacklabs.com --blocks-output blocks.csv --transactions-output transactions.csv
  2. polygonetl extract_csv_column --input transactions.csv --column hash --output transaction_hashes.txt
  3. polygonetl export_receipts_and_logs --transaction-hashes transaction_hashes.txt --provider-uri http://matic-mumbai.chainstacklabs.com --receipts-output receipts.csv --logs-output logs.csv
  4. polygonetl extract_token_transfers --logs logs.csv --output token_transfers.csv
  5. polygonetl extract_csv_column --input receipts.csv --column contract_address --output contract_addresses.txt
  6. polygonetl export_contracts --contract-addresses contract_addresses.txt --provider-uri http://matic-mumbai.chainstacklabs.com --output contracts.csv
  7. polygonetl filter_items -i contracts.csv -p "item['is_erc20'] or item['is_erc721']" | polygonetl extract_field -f address -o token_addresses.txt

Check the contracts.csv file: for the contract (addr = 0x1a1586c38e38a094e42223a31896c40598516d05), the is_erc20 and is_erc721 fields are FALSE. But the filter_items command detects it as a token address and writes the address to token_addresses.txt.

Proposed solution

In FilterItem, predicate(item) returns a string, so this condition will always be truthy.

Either change the method to return a bool, or compare against the string (if predicate(item) == "True":).
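
A self-contained illustration of the pitfall (not the repository's actual code): any non-empty string is truthy in Python, so a predicate that yields the string "False" still passes an if check.

    # CSV values are read as text, so the fields arrive as "True"/"False" strings.
    item = {"is_erc20": "False", "is_erc721": "False"}

    predicate_result = item["is_erc20"] or item["is_erc721"]   # -> "False" (a string)

    print(bool(predicate_result))       # True  -- any non-empty string is truthy,
                                        #          so the contract is kept as a token
    print(predicate_result == "True")   # False -- comparing against the string works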

Dataflow documentation incorrect

I followed the streaming documentation successfully and went through the Dataflow documentation, but found that it is both out of date and does not work for me.

#1 For the BigQuery command, this file isn't in the repo:
src/main/resources/errors-schema.json

#2 This appears in the logs, but no tables are ever created in BigQuery:
"Executing operation polygonBlocksReadFromPubSub/PubsubUnboundedSource+polygonBlocksReadFromPubSub/MapElements/Map+polygonBlocksConvertToTableRows+polygonBlocksWriteToBigQuery/PrepareWrite/ParDo(Anonymous)+polygonBlocksWriteToBigQuery/StreamingInserts/CreateTables/ParDo(CreateTables)+polygonBlocksWriteToBigQuery/StreamingInserts/StreamingWriteTables/ShardTableWrites+polygonBlocksWriteToBigQuery/StreamingInserts/StreamingWriteTables/TagWithUniqueIds+polygonBlocksWriteToBigQuery/StreamingInserts/StreamingWriteTables/Reshuffle/Window.Into()/Window.Assign+polygonBlocksWriteToBigQuery/StreamingInserts/StreamingWriteTables/Reshuffle/GroupByKey/WriteStream"

#3 In the PolygonBlocksWriteToBigQuery stage there are a bunch of warnings. I'm not sure if that's relevant, but no more logs appear after that. I've mostly focused on getting the blocks from my local Polygon node into BigQuery:
2022-07-07T18:55:58.630ZOperation ongoing in step polygonBlocksWriteToBigQuery/StreamingInserts/StreamingWriteTables/StreamingWrite for at least 02h10m00s without outputting or completing in state finish at [email protected]/jdk.internal.misc.Unsafe.park(Native Method) at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) at [email protected]/java.util.concurrent.FutureTask.awaitDone(FutureTask.java:447) at [email protected]/java.util.concurrent.FutureTask.get(FutureTask.java:190) at app//org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:817) at app//org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:882) at app//org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:143) at app//org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:115) at app//org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)

I would really like to get this ETL job running properly, so any advice would be great.
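
One thing worth checking before digging further into the Dataflow job (a diagnostic sketch; the project and subscription names are placeholders): pull a few messages from the blocks subscription to confirm the Streamer is actually publishing.

    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    # Placeholder names; use the project and subscription created for the
    # Streaming component.
    sub_path = subscriber.subscription_path("my-gcp-project", "polygon-blocks-subscription")
    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 5})
    for received in response.received_messages:
        print(received.message.data[:200])  # messages are left unacked on purpose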

Missing base_fee_per_gas

base_fee_per_gas has been missing in public-data-finance.crypto_polygon.blocks for the past month.

Is it possible to speed up exporting receipts and logs

Hi blockchain-etl team!
I got a question - is there any way I can speed up the extraction of receipts and logs?

Setup:
- polygonetl, version 3.4.2 (run in a Kubernetes pod with 3000Mi memory and 1500m CPU)
- Querying our own Polygon archive node as the provider

Steps:
- Starting extraction with the command polygonetl export_all -s 19070000 -e 25069999

Outcome:
- Exporting receipts and logs takes >~90% of the time needed to extract info from a batch of blocks. For example, exporting 1000 blocks' transactions or ERC-20 transfers takes <1 minute, while exporting the receipts and logs for the same batch of blocks takes ~1 hour.

Expected:
- Is there any way I can speed this up?

I can provide more info, logs or whatever needed.
Thanks!
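
A tuning sketch under stated assumptions: receipts/logs export is presumably dominated by JSON-RPC round trips, so pointing the exporter at your own archive node and raising the block batch size with -b (as used in the issues above) is the first thing to try. Whether export_all also accepts a max-workers style option should be confirmed with polygonetl export_all --help before relying on it.

    import subprocess

    # A smaller slice of the range from the issue; endpoint and batch size are
    # illustrative assumptions.
    subprocess.run(
        [
            "polygonetl", "export_all",
            "-s", "19070000",
            "-e", "19079999",
            "-p", "http://localhost:8545",
            "-b", "1000",
            "-o", "./output",
        ],
        check=True,
    )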

cannot run any commands

Hey guys,
I have a problem: I installed the package on Ubuntu (WSL on Windows), but when I try to run any command I get a weird issue:
Traceback (most recent call last):
  File "/home/user/.local/bin/polygonetl", line 5, in
    from polygonetl.cli import cli
  File "/home/user/.local/lib/python3.10/site-packages/polygonetl/cli/__init__.py", line 24, in
    from polygonetl.cli.export_all import export_all
  File "/home/user/.local/lib/python3.10/site-packages/polygonetl/cli/export_all.py", line 30, in
    from web3 import Web3
  File "/home/user/.local/lib/python3.10/site-packages/web3/__init__.py", line 7, in
    from eth_account import Account  # noqa: E402
  File "/home/user/.local/lib/python3.10/site-packages/eth_account/__init__.py", line 1, in
    from eth_account.account import Account  # noqa: F401
  File "/home/user/.local/lib/python3.10/site-packages/eth_account/account.py", line 1, in
    from collections import (
ImportError: cannot import name 'Mapping' from 'collections' (/usr/lib/python3.10/collections/__init__.py)

This is my error log.
I googled the problem and figured out that the error may come from a Python version upgrade, but even downgrading the Python version didn't help.
Could anybody help me fix this problem?
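
For context, an illustration of what changed (an explanation of the symptom, not official project guidance): Python 3.10 removed the ABCs such as Mapping from the collections module, so older eth-account releases that still import them from collections fail at import time.

    # Fails on Python 3.10+ (what the old eth_account release does):
    #   from collections import Mapping
    # Works on all currently supported versions:
    from collections.abc import Mapping

    # Practical workarounds (assumptions, not official guidance): run polygonetl
    # in a Python 3.8/3.9 virtual environment, or upgrade web3/eth-account to
    # releases that already import from collections.abc.
    print(Mapping)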

Critical: missing transaction data in polygon.transactions and polygon.logs table

We have identified several transactions that seem to be entirely missing from the bigquery-public-data.crypto_polygon.transactions and bigquery-public-data.crypto_polygon.logs tables.

Here are a couple of examples:
0xcaed45e923e162a8adb3f42ef55a215b0c0f54b0c48a3bf18a8bf83995c35259
0x22b4a9d39c910e2150bf562c062061eb4f9a7eb5977109b835309f4f4f408c7d
0x4feb97ce0277cd5a837ce9d3018f8371c86b001d7757d2263f5566a283da6828

It seems there are many more missing transactions that were present before.
In particular, we found that all swap transactions and logs for Balancer pool 0x8159462d255c1d24915cb51ec361f700174cd994 have disappeared - they were there 3-4 weeks ago but now they are gone.

This is of course a critical issue because if transactions are missing, the data cannot be trusted and used for any purpose.
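
A quick presence check for the reported hashes (a sketch, not a fix), to confirm whether the rows are really absent from the public table:

    from google.cloud import bigquery

    sql = """
        SELECT `hash`, block_number, block_timestamp
        FROM `bigquery-public-data.crypto_polygon.transactions`
        WHERE `hash` IN (
            "0xcaed45e923e162a8adb3f42ef55a215b0c0f54b0c48a3bf18a8bf83995c35259",
            "0x22b4a9d39c910e2150bf562c062061eb4f9a7eb5977109b835309f4f4f408c7d",
            "0x4feb97ce0277cd5a837ce9d3018f8371c86b001d7757d2263f5566a283da6828"
        )
    """
    for row in bigquery.Client().query(sql).result():
        print(row["hash"], row["block_number"], row["block_timestamp"])
    # An empty result set confirms the rows are missing.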
