
๐Ÿง‘โ€๐Ÿณ Filet

Filet (Filecoin Extract Transform) makes it simple to get CSV data from Filecoin Archival Snapshots using Lily and lily-archiver.

🚀 Usage

The filet image is available on Google Artifact Registry. Alternatively, you can build it locally with make build.

The following command will generate CSVs from a Filecoin Archival Snapshot:

docker run -it \
    -v $PWD:/tmp/data \
    europe-west1-docker.pkg.dev/protocol-labs-data/pl-data/filet:latest -- \
    /lily/export.sh archival_snapshot.car.zst .

โฐ Scheduling Jobs

You can use the send_export_jobs.sh script to schedule jobs on Google Cloud Batch. The script takes a file with a list of snapshots as input.

./scripts/send_export_jobs.sh SNAPSHOT_LIST_FILE [--dry-run]

For more details on the scheduled jobs configuration, you can check the gce_batch_job.json file.

The SNAPSHOT_LIST_FILE file should contain a list of snapshots, one per line. The snapshots should be available in the fil-mainnet-archival-snapshots Google Cloud Storage bucket.

gsutil ls gs://fil-mainnet-archival-snapshots/historical-exports/ | sort --version-sort > all_snapshots.txt

To split the list into batches, filter by snapshot height. Note that a bracket expression like [2226480-2232002] is a character class, not a numeric range, so grep cannot select a height range this way. A numeric comparison with awk works instead (assuming the start epoch is the second _-separated field of the snapshot filename; adjust the field number if the naming differs):

awk -F'_' '$2 >= 2226480 && $2 <= 2232002' all_snapshots.txt

🔧 Managing Jobs

If you need to retry a batch of failed jobs, you can use the following commands:

# Get the list of failed jobs
gcloud alpha batch jobs list --format=json --filter="Status.state:FAILED" > failed_jobs.json

# Get the snapshot name from failed jobs
jq -r '.[].taskGroups[0].taskSpec.runnables[0].container.commands[0]' failed_jobs.json | cut -d '/' -f 2 | sort > failed_jobs.list

# Retry the failed jobs
./scripts/send_export_jobs.sh failed_jobs.list

filet's People

Contributors

davidgasquez, kasteph

filet's Issues

Run on Bacalhau

If we had Archival Snapshots on IPFS/Filecoin, we could use Bacalhau to run ETLs and generate the CSVs from them.

With the current state of Bacalhau, it might be tricky since we're lacking a scheduler and better monitoring around jobs. We can work around both of these, though.

Another potential blocker might be the hardware requirements of the current ETL setup. We probably need more than 8 cores and 16GB of RAM.

Websocket error when running Filet

From time to time, filet jobs will get stuck in Google Cloud Batch. The lily daemon gets killed and sentinel-archiver hangs waiting for it to come back.

This is what the resource usage looks like:

[screenshot: resource usage]

The log produced by lily reports no route found for :: and websocket: close 1000 (normal).

The issue might be related to the job lacking resources.

$ tail -f lily.log 
{"level":"info","ts":"2023-01-10T23:16:04.795Z","logger":"lily/index/processor","caller":"processor/state.go:362","msg":"processor ended","task":"miner_sector_deal","height":"2483772","reporter":"arch0109-2023-01-04","duration":84.470909876}
{"level":"debug","ts":"2023-01-10T23:16:04.797Z","logger":"lily/integrated/tipset","caller":"tipset/tipset.go:182","msg":"task report","height":"2483772","task":"miner_sector_event","reporter":"arch0109-2023-01-04","status":"OK","duration":84.465847436}
{"level":"debug","ts":"2023-01-10T23:16:04.811Z","logger":"lily/integrated/tipset","caller":"tipset/tipset.go:182","msg":"task report","height":"2483772","task":"miner_sector_infos_v7","reporter":"arch0109-2023-01-04","status":"OK","duration":84.465866927}
{"level":"info","ts":"2023-01-10T23:16:04.811Z","logger":"lily/index/manager","caller":"integrated/manager.go:125","msg":"task success","height":"2483772","reporter":"arch0109-2023-01-04","task":"miner_sector_event","status":"OK","duration":84.465847436}
{"level":"info","ts":"2023-01-10T23:16:04.823Z","logger":"lily/index/manager","caller":"integrated/manager.go:125","msg":"task success","height":"2483772","reporter":"arch0109-2023-01-04","task":"miner_sector_infos_v7","status":"OK","duration":84.465866927}
{"level":"debug","ts":"2023-01-10T23:16:04.823Z","logger":"lily/integrated/tipset","caller":"tipset/tipset.go:182","msg":"task report","height":"2483772","task":"miner_sector_deal","reporter":"arch0109-2023-01-04","status":"OK","duration":84.468916001}
{"level":"info","ts":"2023-01-10T23:16:04.823Z","logger":"lily/index/manager","caller":"integrated/manager.go:125","msg":"task success","height":"2483772","reporter":"arch0109-2023-01-04","task":"miner_sector_deal","status":"OK","duration":84.468916001}
{"level":"debug","ts":"2023-01-10T23:16:07.606Z","logger":"basichost","caller":"basic/basic_host.go:312","msg":"failed to fetch local IPv6 address","error":"no route found for ::"}
{"level":"debug","ts":"2023-01-10T23:16:11.564Z","logger":"rpc","caller":"[email protected]/websocket.go:624","msg":"websocket error","error":"websocket: close 1000 (normal)"}
{"level":"debug","ts":"2023-01-10T23:16:12.690Z","logger":"basichost","caller":"basic/basic_host.go:312","msg":"failed to fetch local IPv6 address","error":"no route found for ::"}

Walk job is not going through the entire range for some miner related tasks

We noticed that the CSVs we're getting don't cover the epoch range specified in the walk.sh script; they usually stop much earlier.

This is the job definition of a recent run in Google Cloud Batch, walking from epoch 2239080 to 2239440:

{
	"ID": 1,
	"Name": "walk_1665500630",
	"Type": "walk",
	"Tasks": [
		"block_header",
		"block_parent",
		"drand_block_entrie",
		"miner_sector_deal",
		"miner_sector_infos_v7",
		"miner_sector_infos",
		"miner_sector_post",
		"miner_pre_commit_info",
		"miner_sector_event",
		"miner_current_deadline_info",
		"miner_fee_debt",
		"miner_locked_fund",
		"miner_info",
		"market_deal_proposal",
		"market_deal_state",
		"message",
		"block_message",
		"receipt",
		"message_gas_economy",
		"parsed_message",
		"internal_messages",
		"internal_parsed_messages",
		"vm_messages",
		"multisig_transaction",
		"chain_power",
		"power_actor_claim",
		"chain_reward",
		"actor",
		"actor_state",
		"id_address",
		"derived_gas_outputs",
		"chain_economics",
		"chain_consensus",
		"multisig_approvals",
		"verified_registry_verifier",
		"verified_registry_verified_client"
	],
	"Params": {
		"maxHeight": "2239440",
		"minHeight": "2239080",
		"storage": "CSV",
		"window": "0s"
	},
	"RestartOnFailure": false,
	"RestartOnCompletion": false,
	"RestartDelay": 0
}

After a lily job wait --id 1, this is what the job looks like:

{
	"ID": 1,
	"Name": "walk_1665500630",
	"Type": "walk",
	"Error": "",
	"Tasks": [
		"block_header",
		"block_parent",
		"drand_block_entrie",
		"miner_sector_deal",
		"miner_sector_infos_v7",
		"miner_sector_infos",
		"miner_sector_post",
		"miner_pre_commit_info",
		"miner_sector_event",
		"miner_current_deadline_info",
		"miner_fee_debt",
		"miner_locked_fund",
		"miner_info",
		"market_deal_proposal",
		"market_deal_state",
		"message",
		"block_message",
		"receipt",
		"message_gas_economy",
		"parsed_message",
		"internal_messages",
		"internal_parsed_messages",
		"vm_messages",
		"multisig_transaction",
		"chain_power",
		"power_actor_claim",
		"chain_reward",
		"actor",
		"actor_state",
		"id_address",
		"derived_gas_outputs",
		"chain_economics",
		"chain_consensus",
		"multisig_approvals",
		"verified_registry_verifier",
		"verified_registry_verified_client"
	],
	"Running": false,
	"RestartOnFailure": false,
	"RestartOnCompletion": false,
	"RestartDelay": 0,
	"Params": {
		"maxHeight": "2239440",
		"minHeight": "2239080",
		"storage": "CSV",
		"window": "0s"
	},
	"StartedAt": "2022-10-11T15:03:50.042220395Z",
	"EndedAt": "2022-10-11T18:13:45.569797857Z"
}

The script also lists all the CSV files:

-rw-r--r-- 1 root root  60M Oct 11 18:13 actor_states.csv
-rw-r--r-- 1 root root  16M Oct 11 18:13 actors.csv
-rw-r--r-- 1 root root 316K Oct 11 18:13 block_headers.csv
-rw-r--r-- 1 root root  23M Oct 11 18:13 block_messages.csv
-rw-r--r-- 1 root root 1.2M Oct 11 18:13 block_parents.csv
-rw-r--r-- 1 root root 249K Oct 11 18:13 chain_consensus.csv
-rw-r--r-- 1 root root  84K Oct 11 18:13 chain_economics.csv
-rw-r--r-- 1 root root 109K Oct 11 18:13 chain_powers.csv
-rw-r--r-- 1 root root 119K Oct 11 18:13 chain_rewards.csv
-rw-r--r-- 1 root root 125K Oct 11 18:13 drand_block_entries.csv
-rw-r--r-- 1 root root  275 Oct 11 15:17 id_addresses.csv
-rw-r--r-- 1 root root  48K Oct 11 15:21 internal_messages.csv
-rw-r--r-- 1 root root  51K Oct 11 15:21 internal_parsed_messages.csv
-rw-r--r-- 1 root root 172K Oct 11 15:21 market_deal_proposals.csv
-rw-r--r-- 1 root root 7.4M Oct 11 15:21 market_deal_states.csv
-rw-r--r-- 1 root root  62K Oct 11 18:13 message_gas_economy.csv
-rw-r--r-- 1 root root  15M Oct 11 18:13 messages.csv
-rw-r--r-- 1 root root 3.2M Oct 11 18:13 miner_current_deadline_infos.csv
-rw-r--r-- 1 root root 1.4K Oct 11 17:49 miner_fee_debts.csv
-rw-r--r-- 1 root root 3.7K Oct 11 18:13 miner_infos.csv
-rw-r--r-- 1 root root 2.3M Oct 11 18:13 miner_locked_funds.csv
-rw-r--r-- 1 root root 590K Oct 11 15:21 miner_pre_commit_infos.csv
-rw-r--r-- 1 root root  35K Oct 11 15:21 miner_sector_deals.csv
-rw-r--r-- 1 root root 2.8M Oct 11 15:21 miner_sector_events.csv
-rw-r--r-- 1 root root 1.2M Oct 11 15:21 miner_sector_infos_v7.csv
-rw-r--r-- 1 root root  235 Oct 11 15:07 multisig_transactions.csv
-rw-r--r-- 1 root root 3.6K Oct 11 15:19 power_actor_claims.csv
-rw-r--r-- 1 root root 6.6K Oct 11 15:21 verified_registry_verified_clients.csv
-rw-r--r-- 1 root root 136M Oct 11 18:13 visor_processing_reports.csv

After inspecting miner_sector_events, it seems we're only getting data from epoch 2239417 to 2239440, that is, 23 epochs. This seems to be happening in miner_sector_infos_v7 too.
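A quick way to check coverage for any of the exported CSVs is to print the minimum and maximum heights present in the file. This is a sketch, assuming the height column index is known (the column number below is illustrative; check it against the actual CSV schema):

```shell
# covered_range FILE COL: print the min and max values of numeric CSV
# column COL, skipping the header row.
covered_range() {
  tail -n +2 "$1" | cut -d, -f"$2" | sort -n | sed -n '1p;$p'
}

# Example (assuming height is the first column):
# covered_range miner_sector_events.csv 1
```

Comparing the printed range against the job's minHeight/maxHeight params shows at a glance which tasks stopped early.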

Google Batch restarts jobs

In recent runs, Batch starts pulling the initial Docker image again and then silently fails.

It happens in any region and since the VMs are destroyed, I'm not sure where to look for clues of what is going on.

#24 (comment)

Add CI to Filet

We should push every new Docker image tag to Google Container Registry.

Pass all tasks instead of none

Pass in all possible task names that have ever existed, instead of implicitly running all tasks by passing no task names.
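As a sketch, the explicit task list could be assembled and passed along these lines. The join helper is plain shell; the lily invocation at the end is hypothetical, and its flag names should be verified against the lily version in use:

```shell
# join_tasks: print its arguments as a single comma-separated list.
join_tasks() {
  local IFS=,
  echo "$*"
}

# Abbreviated task list for illustration; the real list would name
# every task that has ever existed.
TASKS=$(join_tasks block_header block_parent miner_sector_deal miner_info)

# Hypothetical invocation; check the flag name for your lily version:
# lily job run --tasks="$TASKS" walk --from 2239080 --to 2239440
```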

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.