

neume-network/data


On-chain music NFT data.

Introduction

The Neume Network's goal is to build an open-source, socially scalable indexer for all activity within the emerging Web3 music industry.

In this repository, we continuously crawl Ethereum mainnet for the latest music NFT releases on platforms such as:

  • Zora
  • Catalog V1 & V2
  • Mint Songs V2
  • Sound.xyz & Sound Protocol
  • Noizd

Every x minutes, a GitHub Action fires that crawls the block range delta since the last crawl, so new music NFTs are continuously and automatically committed to this repository's results folder. For https://musicos.xyz/, we maintain a special file, results/music-os-accumulator-extraction, that contains all neatly-formatted track metadata as parsable JSON. Our goal is to enable anyone to build the Spotify for Web3.
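A minimal sketch of one such crawl iteration (assuming Node 18+ for the global fetch; the lastBlockNumber file, the RPC_HTTP_HOST variable, and the crawl placeholder are illustrative, not the exact code in /scripts/):

```js
// delta-crawl.mjs — minimal sketch of an incremental block-range crawl.
import { readFile, writeFile } from "fs/promises";

// Placeholder for the actual call-block-logs extraction strategy.
async function crawl(from, to) {
  console.log(`crawling blocks ${from}..${to}`);
}

// Ask the node for the current chain head via plain JSON-RPC.
async function currentBlock() {
  const res = await fetch(process.env.RPC_HTTP_HOST, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_blockNumber",
      params: [],
    }),
  });
  const { result } = await res.json();
  return parseInt(result, 16);
}

// Crawl only the delta since the last checkpoint, then persist the new one.
const last = parseInt(await readFile("lastBlockNumber", "utf8"), 16);
const head = await currentBlock();
await crawl(last + 1, head);
await writeFile("lastBlockNumber", "0x" + head.toString(16));
```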

Continuous Data Retrieval

  • You can find a list of indexing jobs on the GitHub Actions page.
  • Consider also checking out the GitHub Actions workflow file.
  • Note: Neume Network doesn't have to run on GitHub Actions. It's best run on a Unix machine that also runs an Erigon node.

Continuous Data Quality Assurance

  • All track data within results/music-os-accumulator-extraction is neatly-formatted according to the JSON schema of @neume-network/schema.
  • We ensure that each track appears exactly once in the list.

Can I use this data?

I (Tim) currently don't know how best to license the data in this repository. For the time being, if you're in doubt about whether you may use this data: consider configuring and running neume network yourself to download a similar data set!


Issues

The latest NFT included in music-os-accumulator was apparently minted at block number 14624411

  • For each track in results/music-os-accumulator, given the latest branch on strategies, I've added a createdAt property and started sorting the songs. I found that the latest track is from block 14624411
  • https://etherscan.io/block/14624411 was mined on Apr-20-2022 09:42:00 PM +UTC
  • For at least Zora/Catalog tracks, the lack of new songs can be explained by their switch to the V2 contract
  • But what about Sound? Were truly no more songs minted on Sound since April 20? Do we have a problem in the crawl?

/cc @il3ven @reimertz

the `lastBlockNumber` script doesn't work

The lastBlockNumber script reads the last line of call-block-logs-transformation and parses the block number from it. There's a gotcha here: the lines in call-block-logs-transformation are not in sequence. This is because of extraction-worker.

  • Exhibit A:
    Action #347: This action ran from block number 15501918 to 15503120.
  • Exhibit B:
    Action #348: This action ran from block number 15502933 to 15503292. It should have run from 15503121.
  • Exhibit C:
    These are the last two lines of call-block-logs-transformation as of 5ae839b. The second-to-last line has a higher block number than the last line.
    [{"metadata":{"platform":{"name":"sound"}},"log":{"address":"0x08056544987f28d3134fe4ce48a8002ec1bfc277","blockHash":"0x7c03949298dfd98361304550e3afde8da244cf5052bcad7744e8f5094e2634f7","blockNumber":"0xec8e55","data":"0x","logIndex":"0x5f","removed":false,"topics":["0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef","0x0000000000000000000000000000000000000000000000000000000000000000","0x000000000000000000000000f95a7c6504f284ada554f89b1abe9f3b0dd0595b","0x000000000000000000000000000000030000000000000000000000000000001d"],"transactionHash":"0x4b00f707738e34e4c0fffdf36ac11c215b40c3cebb8b3740d5464b4970a0b05f","transactionIndex":"0x21"}}]
    [{"metadata":{"platform":{"name":"sound"}},"log":{"address":"0xaa0647818717230b74aea9ba711566132f224847","blockHash":"0x815db9118b98267b376280335c66aa5076fb6a9c1b95f862a0e9e869a61cb509","blockNumber":"0xec8e54","data":"0x","logIndex":"0x15f","removed":false,"topics":["0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef","0x0000000000000000000000000000000000000000000000000000000000000000","0x000000000000000000000000f95a7c6504f284ada554f89b1abe9f3b0dd0595b","0x0000000000000000000000000000000400000000000000000000000000000009"],"transactionHash":"0xb53b64af5bcecec5bf1eecac139ea2b0ef71f6314abe511cc0f07f6d3b0ba7c4","transactionIndex":"0x5d"}}]
    

Possible solution:

  • Write the last block number in a file and read it from there.
  • Modify extraction-worker to maintain sequence.

I don't like the latter solution because it is a lot of work for a problem that the former fixes much more easily.
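In the spirit of the former solution, a defensive sketch: instead of trusting the last line, scan every line of call-block-logs-transformation and take the maximum block number (the line shape matches Exhibit C above; the script name is illustrative):

```js
// max-block-number.mjs — derive the last crawled block without assuming the
// lines of call-block-logs-transformation are in sequence.
import { createReadStream } from "fs";
import { createInterface } from "readline";

const rl = createInterface({
  input: createReadStream("results/call-block-logs-transformation"),
  crlfDelay: Infinity,
});

let max = 0;
for await (const line of rl) {
  if (!line.trim()) continue;
  // Each line is a JSON array of { metadata, log } entries (see Exhibit C).
  for (const { log } of JSON.parse(line)) {
    max = Math.max(max, parseInt(log.blockNumber, 16));
  }
}
console.log("0x" + max.toString(16));
```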

If the 10-minute call-block-logs crawl runs regularly, we may not need to execute it on a powerful machine

  • originally, I had been using node.rugpullindex.com to crawl blocks super fast, and because it has terabytes of storage
  • that's how the block-logs.yml workflow came into existence: for the initial crawl, we had to download 721 GB of data and filter it down
  • But now, if we manage to crawl Eth mainnet every ten minutes, we could easily do this from within a single GitHub Action, and connecting via ssh to node.rugpullindex.com would be overkill
  • Alternatively, we could continue to use a service like Alchemy.
  • /cc @il3ven feel free to take this since you're the author of the workflow and its steps

add green ring indie artist

responses.csv


Edit from @il3ven:
Adding the essential information from the responses.csv file so I don't have to download it every time I need to see the content.

What is your name? Green Ring
How can we contact you? https://twitter.com/greenringmusic
What's the contract address we should add? 0x6eB95f70730DDECe8539fe2aed9D1864Bab85145
Anything else that you'd like us to know? This is my genesis piece 'Introvert' created with Manifold. Thank you for giving us a way to add custom contracts!!

integrate zora drops

  • we have recently added Zora Drops to the /strategies repository
  • we need to filter all contracts first, as with soundxyz-filter-contracts
  • then pass them into the logs-to-subgraph transformer as parameters
  • then run call-zora-drops-tokenuri

/cc @neatonk

After call-block-logs-transformation, GitHub Action Worker never stops

2022-08-27T03:46:42.7830177Z 2022-08-27T03:46:42.724Z neume-network-strategies:lifecycle Shutting down extraction in update callback function
2022-08-27T03:46:42.7834886Z 2022-08-27T03:46:42.724Z neume-network-strategies:lifecycle Ending extractor strategy with name "call-block-logs"
2022-08-27T03:46:42.7835537Z 2022-08-27T03:46:42.725Z neume-network-strategies:lifecycle Starting transformer strategy with name "call-block-logs"
2022-08-27T05:40:46.7719855Z ##[error]The operation was canceled.
  • Note how at 3:46, the transformer strategy is called on a not-too-big file:
ls -ahl core/data/
total 79G
drwxr-xr-x  2 root root 4.0K Aug 27 05:46 .
drwxr-xr-x 10 root root 4.0K Aug 27 01:42 ..
-rw-r--r--  1 root root  79G Aug 27 05:46 call-block-logs-extraction
  • By 5:40, when the worker was canceled, the transformer strategy should have finished and the GH Action should have moved on to the next workflow step

GitHub Actions indicates that a job has been running for 6 hours, but that doesn't seem right

We often don't comply with our own schema in this repository and are unaware of it

  • e.g., today we noticed that every track should have a manifestation with an audio mimetype
  • or, e.g., many tracks don't contain a valid owner field
  • so we should be made aware automatically whenever the outcomes in the results directory are non-compliant with the schema
  • a solution could be a separate GitHub Actions workflow that parses the music-os accumulator file and checks each track's schema compliance, as in the sketch below
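A sketch of what that workflow step could run, using ajv as the validator (the `track` export name is an assumption about @neume-network/schema, and the accumulator file is assumed to be one JSON array of tracks):

```js
// check-schema.mjs — fail the workflow when tracks violate the schema.
import { readFile } from "fs/promises";
import Ajv from "ajv";
// Hypothetical export name; adjust to whatever @neume-network/schema exposes.
import { track } from "@neume-network/schema";

const ajv = new Ajv({ allErrors: true });
const validate = ajv.compile(track);

const tracks = JSON.parse(
  await readFile("results/music-os-accumulator-extraction", "utf8")
);

let failures = 0;
for (const [i, t] of tracks.entries()) {
  if (!validate(t)) {
    failures += 1;
    console.error(`track #${i}:`, validate.errors);
  }
}

// A non-zero exit code makes the GitHub Actions step fail visibly.
if (failures > 0) process.exit(1);
```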

we should only update `lastBlockNumber` file with valid block numbers

It can happen that the generateBlockNumber script fails and returns a non-integer block number. Committing this incorrect block number can make all newer crawls fail. This happened in cd1acbc.

Validate the block number before updating lastBlockNumber: it should be a hexadecimal integer and greater than the block number already present in the file.
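A minimal sketch of that check (throwing so that a workflow step fails loudly; the function name is illustrative):

```js
// validate-block-number.mjs — guard the lastBlockNumber checkpoint.
import { readFile, writeFile } from "fs/promises";

export async function updateLastBlockNumber(next, path = "lastBlockNumber") {
  // Must be a hexadecimal integer like 0xec8e55.
  if (!/^0x[0-9a-fA-F]+$/.test(next)) {
    throw new Error(`invalid block number: ${next}`);
  }
  const prev = (await readFile(path, "utf8")).trim();
  // Must be strictly greater than the checkpoint already on disk.
  if (parseInt(next, 16) <= parseInt(prev, 16)) {
    throw new Error(`block number ${next} is not greater than ${prev}`);
  }
  await writeFile(path, next);
}
```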

some newly released songs on soundxyz aren't showing up in the music os data feed

On Discord, roundpotatocat reported:

"""
hi @TimDaub, noticed something today on musicOS and dan mentioned to let you know!
the sound.xyz contracts not showing up anymore on the musicOS feed.
these are all the ones that dropped and are not showing for me on my musicOS
-grady fall asleep on the plane
-dot make you believe
-healy skin&bones 2019 demo
-latasha show time
all dont show up
i know that Reo is on the new protocol frameworks / season3 so idk if that changes anything but also isnt showing up
"""

  • I looked at the sound.xyz recent page and was able to confirm this.
  • However, e.g. TWERL - drowning is showing up, even though it came out after grady - falling asleep on an airplane
  • To me this indicates that something is wrong with, e.g., how we parse the artist contracts from sound; we should investigate

using ssh-agent, a GitHub Actions step should log into node.rugpullindex.com, trigger a crawl that downloads the latest blocks, and append them to call-block-logs-transformation

open questions

  • how do we know the last crawled block and how do we know the current block? Are we leaving these tasks to GitHub Actions for now?
  • As we're building a node software, shouldn't the orchestration of querying a new block happen on the node level and not on GitHub Actions? How can we trade these things off considering that we also have to make progress quickly?

to-dos

  • Using ssh-agent, log into server
  • on server, using git, download latest neume-network/data repository
  • determine latest block and latest crawled block
  • fire the call-block-logs extraction within the block boundary, and the transformer too
  • commit and re-upload the filtered events to GitHub

it is not easy to find logs for a given commit

Commits like Update from 254ab8f5... are generated by GitHub Actions, but it is not clear which run generated them, so tracking down the logs is difficult. To make things worse, we now have a lot of runs.

Solution: add the run id from GitHub to the commit message, as in the sketch below.
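A sketch of the commit step with the run id embedded (GITHUB_RUN_ID is set by GitHub Actions; the script wrapper itself is illustrative):

```js
// commit-results.mjs — embed the Actions run id in the commit message so a
// commit can be traced back to the run that produced it.
import { execSync } from "child_process";

// GITHUB_RUN_ID is provided by GitHub Actions; "local" is a fallback.
const runId = process.env.GITHUB_RUN_ID ?? "local";
const sha = execSync("git rev-parse --short HEAD").toString().trim();
execSync("git add results");
execSync(`git commit -m "Update from ${sha} (run ${runId})"`);
```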

when the new NFT IDs get appended to the old NFT IDs in logs-to-subgraph and a downstream strategy like soundxyz-call-tokenuri gets called, it sees the entire huge file and recrawls everything, not just the delta

  • see this crawl: https://github.com/neume-network/data/runs/8100361709?check_suite_focus=true
  • for call-block-logs and logs-to-subgraph, we happily append to the file we previously downloaded from neume-network/data/results
  • and while the code in /scripts/ makes sure that we only pass a certain block range to call-block-logs, by the time we get to soundxyz-call-tokenuri, it's presented with the entire event log of logs-to-subgraph-transformation and not just the small delta we downloaded
  • So it'll start downloading all 20k tracks when the delta was probably only 3 or even zero during that time
  • Hence, in .github, we have to run the crawl on new files only, generating the zeroth-step crawl path from the neume-network/data/results/call-block-logs-transformation step
  • After completing the entire crawl, we then append all the new results with cat to the data in neume-network/data/results and commit+push it back; a sketch of this bookkeeping follows below
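A sketch of that bookkeeping (paths and function names are illustrative; the idea is to remember how many lines we downloaded and treat everything past that index as the delta):

```js
// delta.mjs — hand downstream strategies only the newly appended lines,
// not the whole historical log.
import { readFile, appendFile } from "fs/promises";

// prevPath: the file as downloaded from neume-network/data/results before
// the crawl; fullPath: the same file after the crawl appended to it.
export async function splitDelta(prevPath, fullPath) {
  const prevCount = (await readFile(prevPath, "utf8")).trim().split("\n").length;
  const all = (await readFile(fullPath, "utf8")).trim().split("\n");
  return all.slice(prevCount); // only the newly appended events
}

// After the downstream strategies ran on the delta, append it back to the
// results file that gets committed and pushed.
export async function appendResults(deltaLines, resultPath) {
  await appendFile(resultPath, deltaLines.join("\n") + "\n");
}
```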

soundxyz makes crawl fail

@il3ven: "A call to tokenURI returns a URL of type sound.xyz/api/metadata/. Since that URL is now invalid our crawl fails. We could query at the latest block number to get tokenURI of type metadata.sound.xyz..
For reference check this run: https://github.com/neume-network/data/runs/7528824146?check_suite_focus=true"
me: "Instead of querying at the latest block number, how about we query at 15223843 (which just came out). I'd consider "latest" a bad practice"

green ring NFT on catalog has a duplicate

Greenring posted on Discord:

"""
Also, while I'm here, I have a specific question regarding making small corrections to neume. A duplicate of my song 'The Leap' was somehow minted by someone else on the catalog contract after me. Not really sure how it happened, but it went up right after I minted mine. I let catalog know, and it looks like they basically blanked out the NFT. The correct token ID on the Catalog contract for the song is 325, the fake one is 327. I only bring it up because there are currently two copies of the song showing on MusicOS so I'm guessing neume is indexing both. Just wondering if it is a straightforward or more involved fix to get neume to ignore this specific token.
Correct token:
https://opensea.io/assets/ethereum/0x0bc2a24ce568dad89691116d5b34deb6c203f342/325

Extra/Incorrect token:
https://opensea.io/assets/ethereum/0x0bc2a24ce568dad89691116d5b34deb6c203f342/327
"""

@il3ven commented:

"""
@TimDaub Upon a little inspection, I think this is because we fetch NFT on the block they were created. Catalog must have updated the tokenURI sometime after the mint. In neume, the tokenURI for tokenID 327 is still the old one.
"""

@il3ven do we need a "token ignore file" in which we could, e.g., set "0x0bc2a24ce568dad89691116d5b34deb6c203f342/327" to be ignored?
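A sketch of what such an ignore file and filter could look like (the tokenIgnoreList file name, the "&lt;contract&gt;/&lt;tokenId&gt;" key format, and the track shape are suggestions, not existing code):

```js
// ignore-tokens.mjs — drop known-bad tokens from the crawl results.
import { readFile } from "fs/promises";

// tokenIgnoreList: one "<contract>/<tokenId>" entry per line, e.g.
// 0x0bc2a24ce568dad89691116d5b34deb6c203f342/327
const ignore = new Set(
  (await readFile("tokenIgnoreList", "utf8")).trim().split("\n")
);

export function filterIgnored(tracks) {
  // The address/token-id shape per track is an assumption about the schema.
  return tracks.filter(
    (t) => !ignore.has(`${t.erc721.address.toLowerCase()}/${t.erc721.token.id}`)
  );
}
```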

For the Hifilabs frontend: transform the old crawl into a @music-os/schema-compliant data structure

Scope

  • By Friday, Hifilabs wants to ship music OS, a news feed for music NFTs. For an idea of what a comparable music NFT news feed looks like, see https://futuretape.xyz/
  • HifiLabs and music-os have agreed to collaborate, with the interface for collaboration being the nfts.json file in this repository.
  • Essentially, a process from HL downloads nfts.json and ingests it into a Strapi instance. Strapi then provides a GraphQL interface for HL's frontend news feed.

Problem

  • If you look into nfts.json currently, you'll see that only two music NFTs are listed there. Please note, though, that the schema they're listed in is compliant with the latest version of music-os-schema.
  • However, an earlier crawl had much more data, but none of it was compliant with music-os-schema.
  • Since we urgently need the older crawl's data in the new format, we want to write a custom script that transforms the data set.

Deliverables

  • Take a version of the old crawl, e.g. this one, and transform it into a music-os-schema-compliant schema.
  • Important: In the meantime, we have already created transformers for catalog-type metadata and sound.xyz. You can find the code here: catalog and sound. You may be able to just copy this code to make the task easier.

Notes

  • The script that transforms the old crawler data set into the new crawler data set might never be used again, so we don't need tests or pretty code. Rather, we just want a sizable data set.
  • The actual deliverable is the code, along with a nfts.json that includes as many music NFTs as we can extract from the old crawl.
  • The task has to be done by Friday, May 13, 2022, at the latest.

GitHub Actions hides the request endpoint in some cases because it's a secret (though for debugging it would be better to show it)

(node:1920) UnhandledPromiseRejectionWarning: Error: FetchError: request to *** failed, reason: connect ETIMEDOUT 54.146.171.13:443
    at Worker.<anonymous> (file:///home/runner/work/data/data/src/strategies/src/lifecycle.mjs:152:13)
    at Worker.emit (events.js:400:28)
    at MessagePort.<anonymous> (internal/worker.js:236:53)
    at MessagePort.[nodejs.internal.kHybridDispatch] (internal/event_target.js:399:24)
    at MessagePort.exports.emitMessage (internal/per_context/messageport.js:18:26)
(node:1920) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 647)

For platforms like soundxyz that implement custom editions, attempt to filter by unique audio manifestations and see if editions get reduced to unique works

  • soundxyz implements editions such that each tokenId is an NFT but can at the same time be part of an edition
  • Two NFTs can have practically the same metadata (with some minor differences) but be editions of one another
  • So the question is: how can we detect editions efficiently without having to deal with the many implementations? What's the common denominator?
  • One idea was, for each Soundxyz NFT, to reduce it to its audio manifestation and to then check whether that reduces all event logs down to the actual newly created media files; see the sketch below
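A sketch of that reduction (assuming each track exposes a list of manifestations with uri and mimetype fields; treat the shape as illustrative):

```js
// dedupe-editions.mjs — collapse editions to unique works by keying on the
// audio manifestation's URI.
function audioUri(track) {
  const m = track.manifestations?.find((m) => m.mimetype?.startsWith("audio"));
  return m?.uri;
}

export function dedupeByAudio(tracks) {
  const unique = new Map();
  const keyless = [];
  for (const track of tracks) {
    const key = audioUri(track);
    // Keep tracks without an audio manifestation; key the rest by URI so
    // editions sharing the same file collapse into one work.
    if (!key) keyless.push(track);
    else if (!unique.has(key)) unique.set(key, track);
  }
  return [...unique.values(), ...keyless];
}
```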
