

neume-network/data


On-chain music NFT data.

Introduction

The Neume Network's goal is to build an open-source, socially scalable indexer for all activity within the emerging Web3 music industry.

In this repository, we continuously crawl Ethereum mainnet for the latest music NFT releases on platforms such as:

  • Zora
  • Catalog V1 & V2
  • Mint Songs V2
  • Sound.xyz & Sound Protocol
  • Noizd

Every x minutes, a GitHub Action fires that crawls the block range delta since the last crawl, so new music NFTs are continuously and automatically committed to this repository's results folder. For https://musicos.xyz/, we maintain a special file, results/music-os-accumulator-extraction, that contains all neatly-formatted track metadata as parsable JSON. Our goal is to enable anyone to build the Spotify for Web3.
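A minimal sketch of one such crawl iteration (assuming Node 18+ for the global fetch; the lastBlockNumber file, the RPC_HTTP_HOST variable, and the crawl placeholder are illustrative, not the exact code in /scripts/):

```js
// delta-crawl.mjs — minimal sketch of an incremental block-range crawl.
import { readFile, writeFile } from "fs/promises";

// Placeholder for the actual call-block-logs extraction strategy.
async function crawl(from, to) {
  console.log(`crawling blocks ${from}..${to}`);
}

// Ask the node for the current chain head via plain JSON-RPC.
async function currentBlock() {
  const res = await fetch(process.env.RPC_HTTP_HOST, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_blockNumber",
      params: [],
    }),
  });
  const { result } = await res.json();
  return parseInt(result, 16);
}

// Crawl only the delta since the last checkpoint, then persist the new one.
const last = parseInt(await readFile("lastBlockNumber", "utf8"), 16);
const head = await currentBlock();
await crawl(last + 1, head);
await writeFile("lastBlockNumber", "0x" + head.toString(16));
```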

Continuous Data Retrieval

  • You can find a list of indexing jobs on the GitHub Actions page.
  • Consider also checking out the GitHub Actions workflow file.
  • Note: Neume Network doesn't have to run on GitHub Actions. It's best run on a Unix machine that also runs an Erigon node.

Continuous Data Quality Assurance

  • All track data within results/music-os-accumulator-extraction is neatly-formatted according to the JSON schema of @neume-network/schema.
  • We ensure that each track appears exactly once in the list.

Can I use this data?

I (Tim) currently don't know how best to license the data in this repository. For the time being, if you're in doubt about whether you may use this data: consider configuring and running neume network yourself to download a similar data set!


Issues

The latest NFT included in music-os-accumulator was apparently minted at block number 14624411

  • For each track in results/music-os-accumulator, given the latest branch on strategies, I've added a createdAt property and started sorting the songs. I found that the latest track is from block 14624411
  • https://etherscan.io/block/14624411 was mined on Apr-20-2022 09:42:00 PM +UTC
  • For at least Zora/Catalog tracks, the lack of new songs can be explained by their switch to the V2 contract
  • But what about Sound? Were truly no more songs minted on Sound since April 20? Do we have a problem in the crawl?

/cc @il3ven @reimertz

the `lastBlockNumber` script doesn't work

The lastBlockNumber script reads the last line of call-block-logs-transformation and parses the block number from it. There's a gotcha here: the lines in call-block-logs-transformation are not in sequence. This is because of extraction-worker.

  • Exhibit A:
    Action #347: This action ran from block number 15501918 to 15503120.
  • Exhibit B:
    Action #348: This action ran from block number 15502933 to 15503292. It should have run from 15503121.
  • Exhibit C:
    These are the last two lines of call-block-logs-transformation as of 5ae839b. The second-to-last line has a higher block number than the last line.
    [{"metadata":{"platform":{"name":"sound"}},"log":{"address":"0x08056544987f28d3134fe4ce48a8002ec1bfc277","blockHash":"0x7c03949298dfd98361304550e3afde8da244cf5052bcad7744e8f5094e2634f7","blockNumber":"0xec8e55","data":"0x","logIndex":"0x5f","removed":false,"topics":["0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef","0x0000000000000000000000000000000000000000000000000000000000000000","0x000000000000000000000000f95a7c6504f284ada554f89b1abe9f3b0dd0595b","0x000000000000000000000000000000030000000000000000000000000000001d"],"transactionHash":"0x4b00f707738e34e4c0fffdf36ac11c215b40c3cebb8b3740d5464b4970a0b05f","transactionIndex":"0x21"}}]
    [{"metadata":{"platform":{"name":"sound"}},"log":{"address":"0xaa0647818717230b74aea9ba711566132f224847","blockHash":"0x815db9118b98267b376280335c66aa5076fb6a9c1b95f862a0e9e869a61cb509","blockNumber":"0xec8e54","data":"0x","logIndex":"0x15f","removed":false,"topics":["0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef","0x0000000000000000000000000000000000000000000000000000000000000000","0x000000000000000000000000f95a7c6504f284ada554f89b1abe9f3b0dd0595b","0x0000000000000000000000000000000400000000000000000000000000000009"],"transactionHash":"0xb53b64af5bcecec5bf1eecac139ea2b0ef71f6314abe511cc0f07f6d3b0ba7c4","transactionIndex":"0x5d"}}]
    

Possible solution:

  • Write the last block number in a file and read it from there.
  • Modify extraction-worker to maintain sequence.

I don't like the latter solution because it is a lot of work for a problem that the former fixes much more easily.
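In the spirit of the former solution, a defensive sketch: instead of trusting the last line, scan every line of call-block-logs-transformation and take the maximum block number (the line shape matches Exhibit C above; the script name is illustrative):

```js
// max-block-number.mjs — derive the last crawled block without assuming the
// lines of call-block-logs-transformation are in sequence.
import { createReadStream } from "fs";
import { createInterface } from "readline";

const rl = createInterface({
  input: createReadStream("results/call-block-logs-transformation"),
  crlfDelay: Infinity,
});

let max = 0;
for await (const line of rl) {
  if (!line.trim()) continue;
  // Each line is a JSON array of { metadata, log } entries (see Exhibit C).
  for (const { log } of JSON.parse(line)) {
    max = Math.max(max, parseInt(log.blockNumber, 16));
  }
}
console.log("0x" + max.toString(16));
```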

If the 10-minute call-block-logs crawl runs regularly, we may not need to execute it on a powerful machine

  • originally, I had been using node.rugpullindex.com to crawl blocks super fast, and because it has terabytes of storage
  • that's how the block-logs.yml workflow came into existence: for the initial crawl, we had to download 721 GB of data and filter it down
  • But now, if we manage to crawl Eth mainnet every ten minutes, we could easily do this from within a single GitHub Action, and connecting via ssh to node.rugpullindex.com would be overkill
  • Alternatively, we could continue to use a service like Alchemy.
  • /cc @il3ven feel free to take this since you're the author of the workflow and its steps

add green ring indie artist

responses.csv


Edit from @il3ven:
Adding the essential information from the responses.csv file so I don't have to download it every time I need to see the content.

What is your name? Green Ring
How can we contact you? https://twitter.com/greenringmusic
What's the contract address we should add? 0x6eB95f70730DDECe8539fe2aed9D1864Bab85145
Anything else that you'd like us to know? This is my genesis piece 'Introvert' created with Manifold. Thank you for giving us a way to add custom contracts!!

integrate zora drops

  • we have recently added Zora Drops to the /strategies repository
  • we need to filter all contracts first, as with soundxyz-filter-contracts
  • then pass them into the logs-to-subgraph transformer as parameters
  • then run call-zora-drops-tokenuri

/cc @neatonk

After call-block-logs-transformation, GitHub Action Worker never stops

2022-08-27T03:46:42.7830177Z 2022-08-27T03:46:42.724Z neume-network-strategies:lifecycle Shutting down extraction in update callback function
2022-08-27T03:46:42.7834886Z 2022-08-27T03:46:42.724Z neume-network-strategies:lifecycle Ending extractor strategy with name "call-block-logs"
2022-08-27T03:46:42.7835537Z 2022-08-27T03:46:42.725Z neume-network-strategies:lifecycle Starting transformer strategy with name "call-block-logs"
2022-08-27T05:40:46.7719855Z ##[error]The operation was canceled.
  • Note how at 3:46, the transformer strategy is called on a not-too-big file:
ls -ahl core/data/
total 79G
drwxr-xr-x  2 root root 4.0K Aug 27 05:46 .
drwxr-xr-x 10 root root 4.0K Aug 27 01:42 ..
-rw-r--r--  1 root root  79G Aug 27 05:46 call-block-logs-extraction
  • By 5:40, when the worker was canceled, the transformer strategy should have finished and the GH Action should have moved on to the next workflow step

GitHub Actions indicates that a job has been running for 6 hours, but that doesn't seem right

We often don't comply with our own schema in this repository and are unaware of it

  • e.g., today we noticed that every track should have a manifestation with an audio mimetype
  • or, e.g., many tracks don't contain a valid owner field
  • so we should be made aware automatically whenever the outcomes in the results directory are non-compliant with the schema
  • a solution could be a separate GitHub Actions workflow that parses the music-os accumulator file and checks each track's schema compliance, as in the sketch below
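A sketch of what that workflow step could run, using ajv as the validator (the `track` export name is an assumption about @neume-network/schema, and the accumulator file is assumed to be one JSON array of tracks):

```js
// check-schema.mjs — fail the workflow when tracks violate the schema.
import { readFile } from "fs/promises";
import Ajv from "ajv";
// Hypothetical export name; adjust to whatever @neume-network/schema exposes.
import { track } from "@neume-network/schema";

const ajv = new Ajv({ allErrors: true });
const validate = ajv.compile(track);

const tracks = JSON.parse(
  await readFile("results/music-os-accumulator-extraction", "utf8")
);

let failures = 0;
for (const [i, t] of tracks.entries()) {
  if (!validate(t)) {
    failures += 1;
    console.error(`track #${i}:`, validate.errors);
  }
}

// A non-zero exit code makes the GitHub Actions step fail visibly.
if (failures > 0) process.exit(1);
```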

we should only update `lastBlockNumber` file with valid block numbers

It can happen that the generateBlockNumber script fails and returns a non-integer block number. Committing this incorrect block number can make all newer crawls fail. This happened in cd1acbc.

Validate the block number before updating lastBlockNumber: it should be a hexadecimal integer and greater than the block number already present in the file.
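A minimal sketch of that check (throwing so that a workflow step fails loudly; the function name is illustrative):

```js
// validate-block-number.mjs — guard the lastBlockNumber checkpoint.
import { readFile, writeFile } from "fs/promises";

export async function updateLastBlockNumber(next, path = "lastBlockNumber") {
  // Must be a hexadecimal integer like 0xec8e55.
  if (!/^0x[0-9a-fA-F]+$/.test(next)) {
    throw new Error(`invalid block number: ${next}`);
  }
  const prev = (await readFile(path, "utf8")).trim();
  // Must be strictly greater than the checkpoint already on disk.
  if (parseInt(next, 16) <= parseInt(prev, 16)) {
    throw new Error(`block number ${next} is not greater than ${prev}`);
  }
  await writeFile(path, next);
}
```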

some newly released songs on soundxyz aren't showing up in the music os data feed

On Discord, roundpotatocat reported:

"""
hi @TimDaub, noticed something today on musicOS and dan mentioned to let you know!
the sound.xyz contracts not showing up anymore on the musicOS feed.
these are all the ones that dropped and are not showing for me on my musicOS
-grady fall asleep on the plane
-dot make you believe
-healy skin&bones 2019 demo
-latasha show time
all dont show up
i know that Reo is on the new protocol frameworks / season3 so idk if that changes anything but also isnt showing up
"""

  • I looked at the sound.xyz recent page and was able to confirm this.
  • However, e.g. TWERL - drowning is showing up, even though it came out after grady - falling asleep on an airplane
  • To me this indicates that something is wrong with, e.g., how we parse the artist contracts from sound; we should investigate

using ssh-agent, a GitHub Actions step should log into node.rugpullindex.com, trigger a crawl that downloads the latest blocks, and append them to call-block-logs-transformation

open questions

  • how do we know the last crawled block and how do we know the current block? Are we leaving these tasks to GitHub Actions for now?
  • As we're building a node software, shouldn't the orchestration of querying a new block happen on the node level and not on GitHub Actions? How can we trade these things off considering that we also have to make progress quickly?

to-dos

  • Using ssh-agent, log into server
  • on server, using git, download latest neume-network/data repository
  • determine latest block and latest crawled block
  • fire the call-block-logs extraction within the block boundary, and the transformer too
  • commit and re-upload the filtered events to GitHub

it is not easy to find logs for a given commit

Commits like Update from 254ab8f5... are generated by GitHub Actions, but it is not clear which run generated them, so tracking down the logs is difficult. To make things worse, we now have a lot of runs.

Solution: add the run id from GitHub to the commit message, as in the sketch below.
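A sketch of the commit step with the run id embedded (GITHUB_RUN_ID is set by GitHub Actions; the script wrapper itself is illustrative):

```js
// commit-results.mjs — embed the Actions run id in the commit message so a
// commit can be traced back to the run that produced it.
import { execSync } from "child_process";

// GITHUB_RUN_ID is provided by GitHub Actions; "local" is a fallback.
const runId = process.env.GITHUB_RUN_ID ?? "local";
const sha = execSync("git rev-parse --short HEAD").toString().trim();
execSync("git add results");
execSync(`git commit -m "Update from ${sha} (run ${runId})"`);
```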

when the new NFT IDs get appended to the old NFT IDs in logs-to-subgraph and a downstream strategy like soundxyz-call-tokenuri gets called, it sees the entire huge file and recrawls everything, not just the delta

  • see this crawl: https://github.com/neume-network/data/runs/8100361709?check_suite_focus=true
  • for call-block-logs and logs-to-subgraph, we happily append to the file we previously downloaded from neume-network/data/results
  • and while the code in /scripts/ makes sure that we only pass a certain block range to call-block-logs, by the time we get to soundxyz-call-tokenuri, it's presented with the entire event log of logs-to-subgraph-transformation and not just the small delta we downloaded
  • So it'll start downloading all 20k tracks when the delta was probably only 3 or even zero during that time
  • Hence, in .github, we have to run the crawl on new files only, generating the zeroth-step crawl path from the neume-network/data/results/call-block-logs-transformation step
  • After completing the entire crawl, we then append all the new results with cat to the data in neume-network/data/results and commit+push it back; a sketch of this bookkeeping follows below
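A sketch of that bookkeeping (paths and function names are illustrative; the idea is to remember how many lines we downloaded and treat everything past that index as the delta):

```js
// delta.mjs — hand downstream strategies only the newly appended lines,
// not the whole historical log.
import { readFile, appendFile } from "fs/promises";

// prevPath: the file as downloaded from neume-network/data/results before
// the crawl; fullPath: the same file after the crawl appended to it.
export async function splitDelta(prevPath, fullPath) {
  const prevCount = (await readFile(prevPath, "utf8")).trim().split("\n").length;
  const all = (await readFile(fullPath, "utf8")).trim().split("\n");
  return all.slice(prevCount); // only the newly appended events
}

// After the downstream strategies ran on the delta, append it back to the
// results file that gets committed and pushed.
export async function appendResults(deltaLines, resultPath) {
  await appendFile(resultPath, deltaLines.join("\n") + "\n");
}
```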

soundxyz makes crawl fail

@il3ven: "A call to tokenURI returns a URL of type sound.xyz/api/metadata/. Since that URL is now invalid our crawl fails. We could query at the latest block number to get tokenURI of type metadata.sound.xyz..
For reference check this run: https://github.com/neume-network/data/runs/7528824146?check_suite_focus=true"
me: "Instead of querying at the latest block number, how about we query at 15223843 (which just came out). I'd consider "latest" a bad practice"

green ring NFT on catalog has a duplicate

Greenring posted on Discord:

"""
Also, while I'm here, I have a specific question regarding making small corrections to neume. A duplicate of my song 'The Leap' was somehow minted by someone else on the catalog contract after me. Not really sure how it happened, but it went up right after I minted mine. I let catalog know, and it looks like they basically blanked out the NFT. The correct token ID on the Catalog contract for the song is 325, the fake one is 327. I only bring it up because there are currently two copies of the song showing on MusicOS so I'm guessing neume is indexing both. Just wondering if it is a straightforward or more involved fix to get neume to ignore this specific token.
Correct token:
https://opensea.io/assets/ethereum/0x0bc2a24ce568dad89691116d5b34deb6c203f342/325

Extra/Incorrect token:
https://opensea.io/assets/ethereum/0x0bc2a24ce568dad89691116d5b34deb6c203f342/327
"""

@il3ven commented:

"""
@TimDaub Upon a little inspection, I think this is because we fetch NFT on the block they were created. Catalog must have updated the tokenURI sometime after the mint. In neume, the tokenURI for tokenID 327 is still the old one.
"""

@il3ven do we need a "token ignore file" in which we could, e.g., set "0x0bc2a24ce568dad89691116d5b34deb6c203f342/327" to be ignored?
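A sketch of what such an ignore file and filter could look like (the tokenIgnoreList file name, the "&lt;contract&gt;/&lt;tokenId&gt;" key format, and the track shape are suggestions, not existing code):

```js
// ignore-tokens.mjs — drop known-bad tokens from the crawl results.
import { readFile } from "fs/promises";

// tokenIgnoreList: one "<contract>/<tokenId>" entry per line, e.g.
// 0x0bc2a24ce568dad89691116d5b34deb6c203f342/327
const ignore = new Set(
  (await readFile("tokenIgnoreList", "utf8")).trim().split("\n")
);

export function filterIgnored(tracks) {
  // The address/token-id shape per track is an assumption about the schema.
  return tracks.filter(
    (t) => !ignore.has(`${t.erc721.address.toLowerCase()}/${t.erc721.token.id}`)
  );
}
```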

For the Hifilabs frontend: transform the old crawl into a @music-os/schema-compliant data structure

Scope

  • By Friday, Hifilabs wants to ship music OS, a news feed for music NFTs. For an idea of what a comparable music NFT news feed looks like, see https://futuretape.xyz/
  • HifiLabs and music-os have agreed to collaborate, with the interface for collaboration being the nfts.json file in this repository.
  • Essentially, a process from HL downloads nfts.json and ingests it into a Strapi instance. Strapi then provides a GraphQL interface for HL's frontend news feed.

Problem

  • If you look into nfts.json currently, you'll see that only two music NFTs are listed there. Please note, though, that the schema they're listed in is compliant with the latest version of music-os-schema.
  • However, an earlier crawl had much more data, but none of it was compliant with music-os-schema.
  • Since we urgently need the older crawl's data in the new format, we want to write a custom script that transforms the data set.

Deliverables

  • Take a version of the old crawl, e.g. this one, and transform it into a music-os-schema-compliant schema.
  • Important: In the meantime, we have already created transformers for catalog-type metadata and sound.xyz. You can find the code here: catalog and sound. You may be able to just copy this code to make the task easier.

Notes

  • The script that transforms the old crawler data set into the new crawler data set might never be used again, so we don't need tests or pretty code. Rather, we just want a sizable data set.
  • The actual deliverable is the code, along with a nfts.json that includes as many music NFTs as we can extract from the old crawl.
  • The task has to be done by Friday, May 13, 2022, at the latest.

GitHub Actions hides the request endpoint in some cases because it's a secret (though for debugging it would be better to show it)

(node:1920) UnhandledPromiseRejectionWarning: Error: FetchError: request to *** failed, reason: connect ETIMEDOUT 54.146.171.13:443
    at Worker.<anonymous> (file:///home/runner/work/data/data/src/strategies/src/lifecycle.mjs:152:13)
    at Worker.emit (events.js:400:28)
    at MessagePort.<anonymous> (internal/worker.js:236:53)
    at MessagePort.[nodejs.internal.kHybridDispatch] (internal/event_target.js:399:24)
    at MessagePort.exports.emitMessage (internal/per_context/messageport.js:18:26)
(node:1920) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 647)

For platforms like soundxyz that implement custom editions, attempt to filter by unique audio manifestations and see if editions get reduced to unique works

  • soundxyz implements editions such that each tokenId is an NFT but can at the same time be part of an edition
  • Two NFTs can have practically the same metadata (with some minor differences) but be editions of one another
  • So the question is: how can we detect editions efficiently without having to deal with the many implementations? What's the common denominator?
  • One idea was, for each Soundxyz NFT, to reduce it to its audio manifestation and to then check whether that reduces all event logs down to the actual newly created media files; see the sketch below
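A sketch of that reduction (assuming each track exposes a list of manifestations with uri and mimetype fields; treat the shape as illustrative):

```js
// dedupe-editions.mjs — collapse editions to unique works by keying on the
// audio manifestation's URI.
function audioUri(track) {
  const m = track.manifestations?.find((m) => m.mimetype?.startsWith("audio"));
  return m?.uri;
}

export function dedupeByAudio(tracks) {
  const unique = new Map();
  const keyless = [];
  for (const track of tracks) {
    const key = audioUri(track);
    // Keep tracks without an audio manifestation; key the rest by URI so
    // editions sharing the same file collapse into one work.
    if (!key) keyless.push(track);
    else if (!unique.has(key)) unique.set(key, track);
  }
  return [...unique.values(), ...keyless];
}
```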
