Complete file as torrent (openimages/dataset, 14 comments, open)

openimages commented on July 22, 2024
Complete file as torrent

from dataset.

Comments (14)

n01z3 commented on July 22, 2024

@bjornamr
I created a torrent file. All images were resized to 420px on the small side.
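For reference, the resize arithmetic can be sketched as follows (the actual script used isn't shown in the thread, so this is only an illustration):

```python
def resize_short_side(width, height, target=420):
    """Return (new_w, new_h) such that the shorter side equals
    `target`, preserving the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

# With Pillow (one possible library choice), the resize itself would be:
#   from PIL import Image
#   im = Image.open("input.jpg")
#   im.resize(resize_short_side(*im.size)).save("output.jpg")
```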

gokul-uf commented on July 22, 2024

@gkrasin Why not split the dataset into smaller parts, say 200 parts of 100 GB each, and create a torrent for each? That way it's easier to checksum, and torrents do support files of 100 GB+.

Also, users can add all the torrents to their client and download them one by one.
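The split itself needs very little logic; a greedy packer along these lines (a hypothetical sketch working from a file manifest, not any official tooling) would do:

```python
def split_into_parts(files, limit_bytes):
    """Greedily pack (name, size) pairs into parts whose total size
    stays within `limit_bytes`; each part would become one torrent."""
    parts, current, current_size = [], [], 0
    for name, size in files:
        # Close the current part if adding this file would overflow it.
        if current and current_size + size > limit_bytes:
            parts.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        parts.append(current)
    return parts
```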

gkrasin commented on July 22, 2024

@n01z3 Arthur, thank you for your effort and for sharing it with everyone.

spawnflagger commented on July 22, 2024

I had an idea related to this: after locally creating 65k subdirectories for the dataset (0000 - ffff), I noticed that the ImageIDs are fairly evenly distributed (about 150 images under each). It might be reasonable to distribute the set as 65,536 .tar files (uncompressed, since JPEG compression is already better than gzip), binning the images by the first 4 characters of the ImageID.
Then there could be a downloader script (similar to the youtube8m download.py) that fetches each of these tar files (~300 MB each) and verifies its checksum. rsync also isn't bad at handling deltas in 300 MB files if the set of actual images changes in the future, so the update process could be automated as well.
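A sketch of the sharding step, assuming image files are named `<ImageID>.jpg` with hex IDs (the downloader and rsync sides are omitted, and `pack_shards` is a hypothetical helper, not part of any official tooling):

```python
import hashlib
import os
import tarfile

def shard_name(image_id):
    """Shard key: the first 4 hex characters of the ImageID (0000-ffff)."""
    return image_id[:4].lower()

def pack_shards(image_dir, out_dir):
    """Group JPEGs by shard key, write one uncompressed .tar per shard,
    and return {shard: sha256-of-tar} as a checksum manifest."""
    shards = {}
    for fname in sorted(os.listdir(image_dir)):
        if fname.endswith(".jpg"):
            shards.setdefault(shard_name(fname), []).append(fname)
    manifest = {}
    for shard, files in shards.items():
        tar_path = os.path.join(out_dir, shard + ".tar")
        with tarfile.open(tar_path, "w") as tar:  # "w" = no compression
            for fname in files:
                tar.add(os.path.join(image_dir, fname), arcname=fname)
        with open(tar_path, "rb") as f:
            manifest[shard] = hashlib.sha256(f.read()).hexdigest()
    return manifest
```

At roughly 20 TB total, 65,536 shards do come out near the ~300 MB per tar mentioned above.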

gkrasin commented on July 22, 2024

Hi @bjornamr!

I can only comment on the technical side of the question (distributing the dataset via torrent). 20 TB (the total size of all images at the original resolution) is much larger than what is considered practical for this technology, as it would require a special hardware setup. I also suspect that checksumming all the data would take hours at the very least.

It seems that only cloud storage of some kind (S3, GCS, whatever) could serve as a basis for that, but I might be missing some obvious idea.

bjornamr commented on July 22, 2024

@gkrasin
Hi, I would suggest downscaling the pictures to, for example, 640x480, or even smaller, 256x256. This would reduce the dataset to about 1 TB, or around 500 GB, which would be possible to share via torrent. This is, of course, only the case if you have a fairly decent upload speed.

gkrasin commented on July 22, 2024

Internally, we use thumbnails of two sizes:

  • "1600HQ" (roughly 1600x1200): they have 1600px on the largest side
  • "300K" (roughly 640x480): they have ~300K pixels overall

It's our belief that having thumbnails smaller than 640x480 hurts the ability to train an image-level classifier. It hurts even more for other tasks (object detection / localization / segmentation / colorization / etc.).

So, yes, scaling the thumbnails down to 640x480 will get you into 1 TB territory. That fits on a single machine, but is still on the edge of being practical for distribution via torrent. Anyway, let me know if you ever put this together.
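For concreteness, the "~300K pixels overall" target translates into a simple per-image scale factor (my own arithmetic, not the internal pipeline):

```python
import math

def thumb_300k(width, height, target_pixels=300_000):
    """Scale (width, height) so the total pixel count is roughly
    `target_pixels`, preserving the aspect ratio."""
    scale = math.sqrt(target_pixels / (width * height))
    return round(width * scale), round(height * scale)

# e.g. a 1600x1200 original maps to about 632x474 (~300K pixels).
```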

gkrasin commented on July 22, 2024

@n01z3 nice!

By the way, what do these numbers mean?

Train
all: 9011219
downloaded: 8798643
labeled: 8646180
post-download clean: 8591564

Validation
all: 167056
downloaded: 160957
post-download clean: 159847

Does it mean that you have deleted all images without labels? (If that's the case, I would highly recommend putting them back, as the current labels are by no means final, and the images with missing labels are obvious targets for improvement.)

Also, what additional cleaning did you do?

n01z3 commented on July 22, 2024

@gkrasin
All: the number of URLs.
Downloaded: images that I was able to download.
Labeled: all images with labels.
Post-download clean: the number of images left after removing all zero-byte files and white "access error" placeholder images.

I didn't delete unlabeled images, but I didn't include them in the torrent. I'm going to post a URL to the archives; it's about 8 GB.
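The thread doesn't show the exact cleaning filter; a plausible sketch (matching placeholder images by a known hash is my assumption, not necessarily what was done) is:

```python
import hashlib
import os

def clean_download(image_dir, bad_hashes=frozenset()):
    """Remove obviously broken downloads: zero-byte files, plus any
    file whose sha256 matches a known error-placeholder image
    (e.g. the white "photo unavailable" image some hosts return).
    Returns the sorted list of removed filenames."""
    removed = []
    for fname in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, fname)
        if os.path.getsize(path) == 0:
            os.remove(path)
            removed.append(fname)
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in bad_hashes:
            os.remove(path)
            removed.append(fname)
    return removed
```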

gkrasin commented on July 22, 2024

Thank you for the clarification. The "post-download clean" step seems reasonable.

I didn't delete unlabeled images. But I didn't include unlabeled images to torrent.

Sorry for the confusion; by "deleting" I meant "not including".

n01z3 commented on July 22, 2024

all archives here

gkrasin commented on July 22, 2024

@gokul-uf I guess that's a question for the torrent creators, not for me. I have no opinion on that.

n01z3 commented on July 22, 2024

@gokul-uf
I'm not associated with the people who made this dataset, but I downloaded most of the images (for my purposes, 420px on the small side is enough), and I shared the dataset via torrent simply because I can. IMO, for a torrent, one big archive is an acceptable solution.

bjornamr commented on July 22, 2024

@n01z3 Going to try to download it now. Ty :)
