
๐Ÿง‘โ€๐Ÿณ Filet

Filet (Filecoin Extract Transform) makes it simple to get CSV data from Filecoin Archival Snapshots using Lily and lily-archiver.

🚀 Usage

The filet image is available on Google Artifact Registry. Alternatively, you can build it locally with make build.

The following command will generate CSVs from a Filecoin Archival Snapshot:

docker run -it \
    -v $PWD:/tmp/data \
    europe-west1-docker.pkg.dev/protocol-labs-data/pl-data/filet:latest -- \
    /lily/export.sh archival_snapshot.car.zst .

โฐ Scheduling Jobs

You can use the send_export_jobs.sh script to schedule jobs on Google Cloud Batch. The script takes a file with a list of snapshots as input.

./scripts/send_export_jobs.sh SNAPSHOT_LIST_FILE [--dry-run]

For more details on the scheduled jobs configuration, you can check the gce_batch_job.json file.

The SNAPSHOT_LIST_FILE file should contain a list of snapshots, one per line. The snapshots should be available in the fil-mainnet-archival-snapshots Google Cloud Storage bucket.

gsutil ls gs://fil-mainnet-archival-snapshots/historical-exports/ | sort --version-sort > all_snapshots.txt

To split the list into batches, filter by snapshot height. Note that a bracket expression like [2226480-2232002] is a character class, not a numeric range, so grep cannot select a height range this way. A numeric comparison with awk works instead (assuming the start epoch is the second _-separated field of the snapshot filename; adjust the field number if the naming differs):

awk -F'_' '$2 >= 2226480 && $2 <= 2232002' all_snapshots.txt

🔧 Managing Jobs

If you need to retry a batch of failed jobs, you can use the following commands:

# Get the list of failed jobs
gcloud alpha batch jobs list --format=json --filter="Status.state:FAILED" > failed_jobs.json

# Get the snapshot name from failed jobs
jq -r '.[].taskGroups[0].taskSpec.runnables[0].container.commands[0]' failed_jobs.json | cut -d '/' -f 2 | sort > failed_jobs.list

# Retry the failed jobs
./scripts/send_export_jobs.sh failed_jobs.list

filet's People

Contributors

davidgasquez, kasteph

filet's Issues

Run on Bacalhau

If we had Archival Snapshots on IPFS/Filecoin, we could use Bacalhau to run ETLs and generate the CSVs from them.

With the current state of Bacalhau, it might be tricky since we're lacking a scheduler and better monitoring around jobs. We can work around both of these, though.

Another potential blocker might be the hardware requirements of the current ETL setup. We probably need more than 8 cores and 16GB of RAM.

Websocket error when running Filet

From time to time, filet jobs will get stuck in Google Cloud Batch. The lily daemon gets killed and sentinel-archiver hangs waiting for it to come back.

This is what the resource usage looks like:

[screenshot: resource usage]

The log produced by lily reports no route found for :: and websocket: close 1000 (normal).

The issue might be related to the job lacking resources.

$ tail -f lily.log 
{"level":"info","ts":"2023-01-10T23:16:04.795Z","logger":"lily/index/processor","caller":"processor/state.go:362","msg":"processor ended","task":"miner_sector_deal","height":"2483772","reporter":"arch0109-2023-01-04","duration":84.470909876}
{"level":"debug","ts":"2023-01-10T23:16:04.797Z","logger":"lily/integrated/tipset","caller":"tipset/tipset.go:182","msg":"task report","height":"2483772","task":"miner_sector_event","reporter":"arch0109-2023-01-04","status":"OK","duration":84.465847436}
{"level":"debug","ts":"2023-01-10T23:16:04.811Z","logger":"lily/integrated/tipset","caller":"tipset/tipset.go:182","msg":"task report","height":"2483772","task":"miner_sector_infos_v7","reporter":"arch0109-2023-01-04","status":"OK","duration":84.465866927}
{"level":"info","ts":"2023-01-10T23:16:04.811Z","logger":"lily/index/manager","caller":"integrated/manager.go:125","msg":"task success","height":"2483772","reporter":"arch0109-2023-01-04","task":"miner_sector_event","status":"OK","duration":84.465847436}
{"level":"info","ts":"2023-01-10T23:16:04.823Z","logger":"lily/index/manager","caller":"integrated/manager.go:125","msg":"task success","height":"2483772","reporter":"arch0109-2023-01-04","task":"miner_sector_infos_v7","status":"OK","duration":84.465866927}
{"level":"debug","ts":"2023-01-10T23:16:04.823Z","logger":"lily/integrated/tipset","caller":"tipset/tipset.go:182","msg":"task report","height":"2483772","task":"miner_sector_deal","reporter":"arch0109-2023-01-04","status":"OK","duration":84.468916001}
{"level":"info","ts":"2023-01-10T23:16:04.823Z","logger":"lily/index/manager","caller":"integrated/manager.go:125","msg":"task success","height":"2483772","reporter":"arch0109-2023-01-04","task":"miner_sector_deal","status":"OK","duration":84.468916001}
{"level":"debug","ts":"2023-01-10T23:16:07.606Z","logger":"basichost","caller":"basic/basic_host.go:312","msg":"failed to fetch local IPv6 address","error":"no route found for ::"}
{"level":"debug","ts":"2023-01-10T23:16:11.564Z","logger":"rpc","caller":"[email protected]/websocket.go:624","msg":"websocket error","error":"websocket: close 1000 (normal)"}
{"level":"debug","ts":"2023-01-10T23:16:12.690Z","logger":"basichost","caller":"basic/basic_host.go:312","msg":"failed to fetch local IPv6 address","error":"no route found for ::"}

Walk job is not going through the entire range for some miner related tasks

We noticed that the CSVs we're getting don't cover the epoch range specified in the walk.sh script; they usually stop much earlier.

This is the job definition of a recent run in Google Cloud Batch, walking from epoch 2239080 to 2239440:

{
	"ID": 1,
	"Name": "walk_1665500630",
	"Type": "walk",
	"Tasks": [
		"block_header",
		"block_parent",
		"drand_block_entrie",
		"miner_sector_deal",
		"miner_sector_infos_v7",
		"miner_sector_infos",
		"miner_sector_post",
		"miner_pre_commit_info",
		"miner_sector_event",
		"miner_current_deadline_info",
		"miner_fee_debt",
		"miner_locked_fund",
		"miner_info",
		"market_deal_proposal",
		"market_deal_state",
		"message",
		"block_message",
		"receipt",
		"message_gas_economy",
		"parsed_message",
		"internal_messages",
		"internal_parsed_messages",
		"vm_messages",
		"multisig_transaction",
		"chain_power",
		"power_actor_claim",
		"chain_reward",
		"actor",
		"actor_state",
		"id_address",
		"derived_gas_outputs",
		"chain_economics",
		"chain_consensus",
		"multisig_approvals",
		"verified_registry_verifier",
		"verified_registry_verified_client"
	],
	"Params": {
		"maxHeight": "2239440",
		"minHeight": "2239080",
		"storage": "CSV",
		"window": "0s"
	},
	"RestartOnFailure": false,
	"RestartOnCompletion": false,
	"RestartDelay": 0
}

After a lily job wait --id 1, this is what the job looks like:

{
	"ID": 1,
	"Name": "walk_1665500630",
	"Type": "walk",
	"Error": "",
	"Tasks": [
		"block_header",
		"block_parent",
		"drand_block_entrie",
		"miner_sector_deal",
		"miner_sector_infos_v7",
		"miner_sector_infos",
		"miner_sector_post",
		"miner_pre_commit_info",
		"miner_sector_event",
		"miner_current_deadline_info",
		"miner_fee_debt",
		"miner_locked_fund",
		"miner_info",
		"market_deal_proposal",
		"market_deal_state",
		"message",
		"block_message",
		"receipt",
		"message_gas_economy",
		"parsed_message",
		"internal_messages",
		"internal_parsed_messages",
		"vm_messages",
		"multisig_transaction",
		"chain_power",
		"power_actor_claim",
		"chain_reward",
		"actor",
		"actor_state",
		"id_address",
		"derived_gas_outputs",
		"chain_economics",
		"chain_consensus",
		"multisig_approvals",
		"verified_registry_verifier",
		"verified_registry_verified_client"
	],
	"Running": false,
	"RestartOnFailure": false,
	"RestartOnCompletion": false,
	"RestartDelay": 0,
	"Params": {
		"maxHeight": "2239440",
		"minHeight": "2239080",
		"storage": "CSV",
		"window": "0s"
	},
	"StartedAt": "2022-10-11T15:03:50.042220395Z",
	"EndedAt": "2022-10-11T18:13:45.569797857Z"
}

The script also lists all the CSV files:

-rw-r--r-- 1 root root  60M Oct 11 18:13 actor_states.csv
-rw-r--r-- 1 root root  16M Oct 11 18:13 actors.csv
-rw-r--r-- 1 root root 316K Oct 11 18:13 block_headers.csv
-rw-r--r-- 1 root root  23M Oct 11 18:13 block_messages.csv
-rw-r--r-- 1 root root 1.2M Oct 11 18:13 block_parents.csv
-rw-r--r-- 1 root root 249K Oct 11 18:13 chain_consensus.csv
-rw-r--r-- 1 root root  84K Oct 11 18:13 chain_economics.csv
-rw-r--r-- 1 root root 109K Oct 11 18:13 chain_powers.csv
-rw-r--r-- 1 root root 119K Oct 11 18:13 chain_rewards.csv
-rw-r--r-- 1 root root 125K Oct 11 18:13 drand_block_entries.csv
-rw-r--r-- 1 root root  275 Oct 11 15:17 id_addresses.csv
-rw-r--r-- 1 root root  48K Oct 11 15:21 internal_messages.csv
-rw-r--r-- 1 root root  51K Oct 11 15:21 internal_parsed_messages.csv
-rw-r--r-- 1 root root 172K Oct 11 15:21 market_deal_proposals.csv
-rw-r--r-- 1 root root 7.4M Oct 11 15:21 market_deal_states.csv
-rw-r--r-- 1 root root  62K Oct 11 18:13 message_gas_economy.csv
-rw-r--r-- 1 root root  15M Oct 11 18:13 messages.csv
-rw-r--r-- 1 root root 3.2M Oct 11 18:13 miner_current_deadline_infos.csv
-rw-r--r-- 1 root root 1.4K Oct 11 17:49 miner_fee_debts.csv
-rw-r--r-- 1 root root 3.7K Oct 11 18:13 miner_infos.csv
-rw-r--r-- 1 root root 2.3M Oct 11 18:13 miner_locked_funds.csv
-rw-r--r-- 1 root root 590K Oct 11 15:21 miner_pre_commit_infos.csv
-rw-r--r-- 1 root root  35K Oct 11 15:21 miner_sector_deals.csv
-rw-r--r-- 1 root root 2.8M Oct 11 15:21 miner_sector_events.csv
-rw-r--r-- 1 root root 1.2M Oct 11 15:21 miner_sector_infos_v7.csv
-rw-r--r-- 1 root root  235 Oct 11 15:07 multisig_transactions.csv
-rw-r--r-- 1 root root 3.6K Oct 11 15:19 power_actor_claims.csv
-rw-r--r-- 1 root root 6.6K Oct 11 15:21 verified_registry_verified_clients.csv
-rw-r--r-- 1 root root 136M Oct 11 18:13 visor_processing_reports.csv

After inspecting miner_sector_events, it seems we're only getting data from epoch 2239417 to 2239440, that is, 23 epochs. This seems to be happening in miner_sector_infos_v7 too.
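A quick way to check coverage for any of the exported CSVs is to print the minimum and maximum heights present in the file. This is a sketch, assuming the height column index is known (the column number below is illustrative; check it against the actual CSV schema):

```shell
# covered_range FILE COL: print the min and max values of numeric CSV
# column COL, skipping the header row.
covered_range() {
  tail -n +2 "$1" | cut -d, -f"$2" | sort -n | sed -n '1p;$p'
}

# Example (assuming height is the first column):
# covered_range miner_sector_events.csv 1
```

Comparing the printed range against the job's minHeight/maxHeight params shows at a glance which tasks stopped early.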

Google Batch restarts jobs

In recent runs, Batch starts pulling the initial Docker image again and then silently fails.

It happens in any region and since the VMs are destroyed, I'm not sure where to look for clues of what is going on.

#24 (comment)

Add CI to Filet

We should push every new Docker image tag to Google Container Registry.

Pass all tasks instead of none

Pass in all possible task names that have ever existed, instead of implicitly running all tasks by passing no task names.
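As a sketch, the explicit task list could be assembled and passed along these lines. The join helper is plain shell; the lily invocation at the end is hypothetical, and its flag names should be verified against the lily version in use:

```shell
# join_tasks: print its arguments as a single comma-separated list.
join_tasks() {
  local IFS=,
  echo "$*"
}

# Abbreviated task list for illustration; the real list would name
# every task that has ever existed.
TASKS=$(join_tasks block_header block_parent miner_sector_deal miner_info)

# Hypothetical invocation; check the flag name for your lily version:
# lily job run --tasks="$TASKS" walk --from 2239080 --to 2239440
```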

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.