unsplash / datasets

2.3K 64.0 110.0 67 KB

🎁 5,400,000+ Unsplash images made available for research and machine learning

Home Page: https://unsplash.com/data

Jupyter Notebook 100.00%

dataset images unsplash machine-learning research data search-engine keywords photos semantics

datasets's Issues

Feature request: Add description = true feature to your random API requests

Hello,

I'm an English Learning Educator and I have a registered project set up already.

I would like to get random images, but ONLY the ones that have a clear or detailed description field, not null or empty values.

Is there a way to do this with your current API, or could your team implement it as a feature?

something like this:

https://api.unsplash.com/photos/random?description=true&description_min_chars=10&client_id=XXXXXX

params

description = true (required)

description_min_chars = int (required): minimum description length in characters
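Until such a parameter exists, one possible workaround is to request a batch of random photos and filter them on the client. The sketch below is only an illustration: it assumes the random endpoint accepts a `count` parameter and returns a JSON list of photo objects with a `description` field, and the helper names (`has_description`, `filter_described`, `fetch_random_described`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.unsplash.com/photos/random"


def has_description(photo, min_chars=10):
    """Return True if the photo dict has a description of at least min_chars characters."""
    text = (photo.get("description") or "").strip()
    return len(text) >= min_chars


def filter_described(photos, min_chars=10):
    """Keep only the photos whose description passes has_description."""
    return [p for p in photos if has_description(p, min_chars)]


def fetch_random_described(client_id, min_chars=10, count=30):
    """Fetch one batch of random photos and filter it client-side.

    Assumption: the endpoint accepts `count` and returns a JSON list of
    photo objects with a `description` field. The result may be empty.
    """
    query = urllib.parse.urlencode({"count": count, "client_id": client_id})
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=30) as resp:
        return filter_described(json.load(resp), min_chars)
```

Because the filtering happens after the request, a batch can come back with fewer photos than requested, or none at all, so a caller would need to retry while staying within the API rate limits.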

Regards

Hugo Barbosa

Wrong URL on some photos

Describe the bug
URL is bad

To Reproduce
Try downloading it

Expected behavior
Good URL

Additional context
Add any other context about the problem here.

Add width, height and aspect ratio of the photos

The dataset is missing width, height and aspect ratio of the photo.

These are 3 important elements and should be appearing as 3 distinct fields.

I believe this dataset is due for an update soon. Will we receive another link to download the updated version once it is out, and what is the procedure for getting it?

Missing photos from v1.1.0

Describe the bug
photos.tsv contains 24942 items.

Additional context
I have identified the new photo_ids (249 items) and put them in a file, just in case.

Download the images in Unsplash-lite

How can I download the images of Unsplash-lite using "unsplash-research-dataset-lite-latest.zip"?
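Other issues on this page suggest that the archive contains TSV metadata rather than image files, and that photos.tsv000 carries one image URL per photo. As a minimal sketch under that assumption, the function below pulls (photo_id, URL) pairs out of the photos file. The column name `photo_image_url` is an assumption (another issue here calls it photo_img_url), so it is exposed as a parameter, and `photo_urls` is a hypothetical helper name.

```python
import csv
import io


def photo_urls(stream, url_column="photo_image_url"):
    """Yield (photo_id, url) pairs from a tab-separated photos file.

    `url_column` is an assumed column name; adjust it to match the header.
    """
    reader = csv.DictReader(stream, delimiter="\t", quotechar='"')
    for row in reader:
        url = (row.get(url_column) or "").strip()
        if url:  # skip rows without a usable URL
            yield row["photo_id"], url


# Tiny in-memory sample; the second photo_id is made up for illustration.
sample = (
    "photo_id\tphoto_url\tphoto_image_url\n"
    "sEDzxW4NhL4\thttps://unsplash.com/photos/sEDzxW4NhL4\t"
    "https://images.unsplash.com/photo-1586019496196-bdbea65add07\n"
    "missing-url\thttps://unsplash.com/photos/missing-url\t\n"
)
pairs = list(photo_urls(io.StringIO(sample, newline="")))
```

For the real file, the stream would be something like `open("photos.tsv000", newline="", encoding="utf-8")`; passing `newline=""` lets the csv module handle line endings inside quoted fields itself.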

Include Captions in Lite Dataset

I noticed that in the Lite dataset, there is only an AI caption. Is there a reason that the user's submitted caption isn't there?

Image analysis metadata available?

Ticket #21 mentions additional image metadata that should be in the dataset. Are there other image analysis values that Unsplash calculates that could be added? In particular, I'm thinking of color value statistics such as mean/median and min/max pixel values.

distinct download URLs for each version of the dataset.

The link to download the lite dataset in the README is always the same, and always points to the latest version of the dataset.

This comment seems to indicate that new snapshots of the datasets are cut when a new version of the contents of this repo is released. It would probably be useful to have links to the various versions of the datasets available, if only for historical purposes.

I imagine that this could be integrated with #23, so that CHANGELOG.md, or possibly a releases.json, would include the download link for the lite dataset and the appropriate integrity check for each released version, including the full dataset's integrity check.

For the record - I tried hitting several variations of https://unsplash.com/data/lite/<version> to see if a link to the v1.0.0 dataset was available. No luck 😄 .

no access to the full dataset

got reply: "Thanks for inquiring about Unsplash Full Dataset. I would recommend you to download the Lite Dataset before using the Full one. The Lite Dataset is meant to be open and allow anyone to experiment. If you believe your experiment or research would need the whole 2M+ images, we are happy to give you access to it then.
The Full Dataset is meant for artificial intelligence and machine learning research mostly when the Lite Dataset is not sufficient enough."

Clarification of embedded newlines in TSV files

The keywords data file appears to have an embedded newline in one of the records. I just want to clarify if this is expected or not. It looks like the given psql loading instructions do account for newlines in the TSV file, but if folks are processing the file outside of that without using quote-escaping rules they may process the data incorrectly.

To Reproduce

% wc -l *.tsv000
1646598 collections.tsv000
4075505 conversions.tsv000
2689741 keywords.tsv000
25001 photos.tsv000
8436845 total

Load the data according to the documented instructions:

 % psql -h localhost -U jeremy -d unsplash_lite -f load-data-client.sql
    COPY 25000
    COPY 2689739 # <-- Hmm.. this one is NOT 1 less than keywords.tsv000 above
    COPY 1646597
    COPY 4075504

Check the db row count

unsplash_lite=# select count(*) from unsplash_keywords;
  count
---------
 2689739
(1 row)

Expected behavior

I initially expected there to be 1 record for each non-header line of TSV, but this appears to be an incorrect assumption. It looks like the psql command line parsed the TSV according to quoted escape rules, so that is good.

I wrote a program to check the keywords file and it reports

% ruby check-tsv.rb keywords.tsv000
Headers: photo_id -- keyword -- ai_service_1_confidence -- ai_service_2_confidence -- suggested_by_user
[1590611 - PF4s20KB678-"fujisan] parts count 2 != 5
[1590612 - mount fuji"-] parts count 4 != 5
lines in file   : 2689741
data lines      : 2689740
unique row count: 2689740

Then looking at the lines around line 1590610 we see:

 % sed -n '1590610,1590615p' keywords.tsv000
PF4s20KB678     night   22.3271160125732                f
PF4s20KB678     "fujisan
mount fuji"                     t
PF4s20KB678     pier    22.6900939941406                f
PF4s20KB678     viaduct 30.6490669250488                f
PF4s20KB678     architecture    33.084938049316399              f

And the db reports that row and the preceding and following rows correctly loaded.

unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%fujisan%';
  photo_id   |  keyword   | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+------------+-------------------------+-------------------------+-------------------
 PF4s20KB678 | fujisan   +|                         |                         | t
             | mount fuji |                         |                         |
(1 row)
unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%pier%';
  photo_id   | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+---------+-------------------------+-------------------------+-------------------
 PF4s20KB678 | pier    |        22.6900939941406 |                         | f
(1 row)

unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%night%';
  photo_id   | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+---------+-------------------------+-------------------------+-------------------
 PF4s20KB678 | night   |        22.3271160125732 |                         | f

If folks are processing these TSV files simplistically, without using quote-escaping logic, then they may process the files incorrectly. I don't want folks to encounter that. This may also point to an upstream data input issue: if users are entering newlines in the keyword input, how are they being processed in the main app?

We may just want to document that there can be embedded newlines in the TSV files.
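For anyone processing the files outside of psql, the difference between counting lines and counting records can be sketched with Python's standard csv module, which follows quote rules and so keeps a quoted field with an embedded newline in one record. The sample rows are taken from the excerpt above; `count_records` is a hypothetical helper name, and the sketch assumes the file uses double quotes as the quote character.

```python
import csv
import io


def count_records(stream):
    """Count quote-aware records (excluding the header) and note malformed ones."""
    reader = csv.reader(stream, delimiter="\t", quotechar='"')
    header = next(reader)
    records = 0
    malformed = []  # physical line numbers where a record had the wrong field count
    for row in reader:
        records += 1
        if len(row) != len(header):
            malformed.append(reader.line_num)
    return records, malformed


# Four data lines on disk, but only three records: the quoted keyword spans two lines.
sample = (
    "photo_id\tkeyword\tai_service_1_confidence\tai_service_2_confidence\tsuggested_by_user\n"
    "PF4s20KB678\tnight\t22.3271160125732\t\tf\n"
    'PF4s20KB678\t"fujisan\nmount fuji"\t\t\tt\n'
    "PF4s20KB678\tpier\t22.6900939941406\t\tf\n"
)
physical_data_lines = sample.count("\n") - 1  # subtract the header line
records, malformed = count_records(io.StringIO(sample, newline=""))
print(physical_data_lines, records, malformed)
```

When reading the real file, opening it with `newline=""` matters: it stops Python from translating line endings before the csv module sees them.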

Thanks!

Potential Data Cleanup activities

In unsplash_photos.photo_location_country and unsplash_photos.photo_location_city, the values appear to be freeform text that was probably direct user input, with effectively duplicate entries. For example:

unsplash_lite=# select '>' || photo_location_city || '<' as city, '>' || photo_location_country || '<' as country,  count(*) from unsplash_photos where lower(photo_location_city) like '%london%' group by 1,2;
 ?column?  |       ?column?       | count
-----------+----------------------+-------
 >LONDON < | >United Kingdom <    |     1
 >London<  | >Canada<             |     7
 >London<  | >Egyesült Királyság< |     1
 >London<  | >England<            |     1
 >London<  | >U.K.<               |     2
 >London<  | >U.K<                |     1
 >London<  | >United Kingdom <    |     1
 >London<  | >United Kingdom<     |    73
 >London<  |                      |     3
(9 rows)

It looks like there needs to be some data cleaning on these fields, definitely some stripping white space and such. Is it assumed that we should do our own location normalization on this and possibly add in a normalized_photo_location_country and normalized_photo_location_city ?
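As a rough illustration of what such a normalization pass could look like on the consumer side, the sketch below trims and collapses whitespace and maps a few spelling variants to one canonical country name. The alias table is hypothetical and built only from the variants in the query output above; a real cleanup would need a much fuller mapping, and the function names are made up for this example.

```python
import unicodedata

# Hypothetical alias table, keyed by case-folded text; extend as needed.
COUNTRY_ALIASES = {
    "united kingdom": "United Kingdom",
    "u.k.": "United Kingdom",
    "u.k": "United Kingdom",
    "egyesült királyság": "United Kingdom",
}


def clean_text(value):
    """Normalize Unicode, trim, and collapse runs of whitespace; empty becomes None."""
    if value is None:
        return None
    text = " ".join(unicodedata.normalize("NFC", value).split())
    return text or None


def normalize_country(value):
    """Map known variants to a canonical name, leaving unknown values cleaned but unchanged."""
    text = clean_text(value)
    if text is None:
        return None
    return COUNTRY_ALIASES.get(text.casefold(), text)


def normalize_city(value):
    """Clean the city and apply simple title casing (a crude rule, not locale-aware)."""
    text = clean_text(value)
    return text.title() if text else None
```

With this, `normalize_city("LONDON ")` and `normalize_city("London")` both give `"London"`. Note that the city alone is still ambiguous, as the London, Canada rows show, so city and country would need to be kept together.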

Also - over in unsplash_conversion.conversion_country this appears to be ISO 2 letter country codes. Is this guaranteed to be a valid ISO 2 letter country code? And was this data created based upon a maxmind geoip lookup or something similar?

Thanks so much for this dataset. I think it is going to be quite useful for demonstration purposes. I hope these questions help increase the quality of what is already a great dataset.

Add an integrity check

We should provide an integrity check for the Lite and the Full datasets.

We could make the SHA256 hash public.
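On the consumer side, checking a download against a published SHA256 hash could look like the sketch below. It reads the file in chunks so that a large archive does not have to fit in memory. The function names are hypothetical, and the expected hash would come from wherever the project decides to publish it.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Compare the file's digest with a published hex digest, ignoring case and padding."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A typical call would be `verify_download("unsplash-research-dataset-lite-latest.zip", published_hash)`, where `published_hash` is a placeholder for the value taken from the release notes.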

API for Random Images

I love your API and would like to integrate your commercial images into our product through it. Would you consider creating an API endpoint for this?

It could work the same way as your existing API, just for the usage of those datasets.

Creative Commons Images from Flickr is the same, but I like your API more. 👍

Image cannot be displayed contains error

Describe the bug
Photo with id sEDzxW4NhL4 has errors.
While it can be accessed via photo_url https://unsplash.com/photos/sEDzxW4NhL4, it cannot be accessed
via photo_img_url https://images.unsplash.com/photo-1586019496196-bdbea65add07

To Reproduce
https://images.unsplash.com/photo-1586019496196-bdbea65add07

Expected behavior

Additional context

How was the LITE dataset sampled?

First off, thanks a lot for making the data available - it's a tremendous service to the research community!

@TimmyCarbone, I have a question regarding the relationship between LITE and FULL.
From what I understand, the LITE dataset is a subset of the FULL dataset. How were the 25k images in the first release of the LITE dataset selected? And how did you select the images that were added to replace removed images in subsequent releases?
Thanks!

How large is the full dataset

Can you include the size of the full dataset in your README.md? It gives the size of the lite dataset as ~550 MB. I suppose I could extrapolate from there, but it would be nice to know what to expect while I wait to see if I'm approved for access. I'm guessing around 44 GB? Thanks.

Question about the entities behind the "anonymous_user_id"

Hey,
I'm trying to calculate the average number of images a user downloads.

As I know from my own photo stats, a lot of downloads are generated via API requests from external applications. You state in your API doc that external applications don't need to authenticate on a user level.

My question: for an external application like Trello, is a single anonymous user ID generated, or do you have a better approach to distinguish the users "behind" the external application?

Example from the test dataset
Could the user from the first row (942 downloads) really be one person or also a whole logical entity like Trello?

anonymous_user_id	downloads
5a055748-57d2-45c1-a882-5b9bb9313509	942
beb0923e-c17d-4a90-a8db-47b0f45fb0fc	897
85e5db9c-07c7-49bf-9e08-5cbd1603dd74	546
...	...
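A table like the one above can be reproduced by counting rows per anonymous_user_id. The sketch below assumes a tab-separated conversions file in which every row is one download event; if the file also records other event types, it would need to be filtered first. The helper names and the sample IDs are made up for illustration.

```python
import csv
import io
from collections import Counter
from statistics import mean, median


def downloads_per_user(stream, user_column="anonymous_user_id"):
    """Count rows per user ID, treating each row as one download (an assumption)."""
    reader = csv.DictReader(stream, delimiter="\t", quotechar='"')
    return Counter(row[user_column] for row in reader)


def summarize(counts):
    """Basic statistics over the per-user download counts."""
    values = list(counts.values())
    return {
        "users": len(values),
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
    }


# Made-up sample: user-a appears three times, user-b once.
sample = (
    "photo_id\tanonymous_user_id\n"
    "p1\tuser-a\n"
    "p2\tuser-a\n"
    "p3\tuser-b\n"
    "p4\tuser-a\n"
)
counts = downloads_per_user(io.StringIO(sample, newline=""))
stats = summarize(counts)
```

If a handful of IDs stand for whole applications rather than individual people, the mean will be pulled up sharply, so the median is probably the more informative figure for "a typical user".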

Thanks a lot for the answer and great job with the data set. 👍

Are there any similar datasets with records of anonymous user visits?

Is your feature request related to a problem? Please describe.
Unsplash is an awesome dataset with records of anonymous user visits (conversions.tsv). I wonder if other organizations have open-sourced this kind of dataset with anonymous user access. It would be great if there were, so that we could add a Related Projects section linking to them.

Describe the solution you'd like
Find and then link similar datasets with records of anonymous user visits to this project.

Describe alternatives you've considered
Nothing to add.

Additional context
Nothing to add.

Judie Lee Montoya

Apply for unsplash full data.

Hi, can you speed up the application process for this dataset?
I'm really eager to run some experiments for related research.
I found it hard to apply, whether for research or for personal use.
How do I apply for the full dataset?
My email is [email protected]. Many thanks.

Include blurhash in the dataset

Overview

Include BlurHash hash in the photos document
@elcuervo built the system to generate and store the hash for each photo

Dataset of all Unsplash contributors

Is your feature request related to a problem? Please describe.
Since the Unsplash Slack server, the only connection point of the community, isn't as populated as it could be, I think there could be another way of bringing the people of Unsplash together.

Describe the solution you'd like
With a dataset of all Unsplash contributors it would be possible to create a map, giving them a chance to find other motivated photographers nearby. The dataset should contain the name, the location, the number of photos, the URL and maybe the linked website.

Describe alternatives you've considered
Previously, there were local Slack channels on the server for connecting with other people from the same country, but as far as I know they have been shut down.

Additional context
Nothing to add

Will I be banned if I download the pictures?

I applied for the full dataset and was approved. I now have the 47 GB dataset.
Will I be banned if I download the pictures?
I have a 1 Gb/s server and I'd like to download the pictures listed in 'photos.tsv000', whose URLs begin with 'images.unsplash.com'. Could I get banned if I download them too quickly? (On my 1 Gb/s server, that is about 5-6 pictures per second.)
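Whatever the acceptable rate turns out to be, a downloader can cap its own request rate rather than run at full line speed. The sketch below is a simple client-side throttle; the default rate of 2 requests per second is an arbitrary placeholder, not a limit stated by Unsplash, and `Throttle`, `download_all`, and the `.jpg` file extension are assumptions made for illustration.

```python
import time
import urllib.request
from pathlib import Path


class Throttle:
    """Allow at most `rate` calls per second by sleeping between calls."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def wait(self):
        """Block until the next call is allowed; return the time slept."""
        now = self.clock()
        delay = max(self.next_allowed - now, 0.0)
        if delay > 0:
            self.sleep(delay)
            now += delay
        self.next_allowed = now + self.min_interval
        return delay


def download_all(rows, out_dir, rate=2.0):
    """Download (photo_id, url) pairs into out_dir, skipping files that already exist."""
    throttle = Throttle(rate)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for photo_id, url in rows:
        target = out / f"{photo_id}.jpg"  # extension is an assumption
        if target.exists():
            continue
        throttle.wait()
        with urllib.request.urlopen(url, timeout=30) as resp:
            target.write_bytes(resp.read())
```

Skipping files that already exist makes the job resumable, which helps if the run is interrupted or requests start failing partway through.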

The staff Victor Ballesteros refuses to give me the access to the full dataset

I have requested access to the full dataset. However, here is the reply:

"Thanks for inquiring about Unsplash Full Dataset. I would recommend you to download the Lite Dataset before using the Full one. The Lite Dataset is meant to be open and allow anyone to experiment. If you believe your experiment or research would need the whole 2M+ images, we are happy to give you access to it then.
The Full Dataset is meant for artificial intelligence and machine learning research mostly when the Lite Dataset is not sufficient enough."

After I received this message, I have written multiple emails to further request access to the full dataset. However, I got no response from Victor Ballesteros. I do not understand why access to the full dataset is so difficult to obtain. I doubt whether Unsplash is truly willing to let others use their full dataset.

Can't download the lite data set

Describe the bug
Fails to download the lite data set.

To Reproduce
Click the link in the README:

Expected behavior
I would expect that a zip file would be downloaded. Instead I see this message in Chrome:

Is it expected to have fields where all values are null

When reviewing the data in the lite dataset, all of the following fields are null in all records.

unsplash_photos.ai_primary_landmark_name
unsplash_photos.ai_primary_landmark_latitude
unsplash_photos.ai_primary_landmark_longitude
unsplash_photos.ai_primary_landmark_confidence

If all of these are supposed to be null all the time, it may be useful to drop those columns from the dataset completely.

Although if these columns do have data in the full dataset it makes sense to have them exist. If this is the case, it may be useful to update the documentation to note that these fields are null in the lite dataset and have values in the full dataset.

In any case, just checking to make sure that this is the expected behavior.
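The same check can be run directly against a TSV file without loading it into a database. The sketch below lists the columns that are empty in every row; it assumes that a null is stored as an empty field, and `all_empty_columns` is a hypothetical helper name. The sample values are made up.

```python
import csv
import io


def all_empty_columns(stream):
    """Return the sorted names of columns that are empty in every data row."""
    reader = csv.DictReader(stream, delimiter="\t", quotechar='"')
    candidates = set(reader.fieldnames or [])
    for row in reader:
        # Keep only the columns that are still empty in this row.
        candidates = {c for c in candidates if not (row.get(c) or "").strip()}
        if not candidates:
            break  # every column has at least one value; stop early
    return sorted(candidates)


# Made-up sample with two landmark columns that never carry a value.
sample = (
    "photo_id\tai_primary_landmark_name\tai_primary_landmark_latitude\tphoto_featured\n"
    "id-1\t\t\tt\n"
    "id-2\t\t\tt\n"
)
empty = all_empty_columns(io.StringIO(sample, newline=""))
```

Note that a file with a header and no data rows would report every column as empty, so the result is only meaningful for a non-empty file.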

Explanation on the ai_service_2_confidence column in keywords.tsv000 (range seems weird)

Describe the bug
Hello,

I was looking at the data from the lite dataset this morning and I noticed something weird in the column 'ai_service_2_confidence' from the keywords.tsv000 file.

When I computed some statistics on the ai_service columns, the column 'ai_service_2_confidence' seems to have extreme values exceeding 100, which I would expect to be the maximum (taking ai_service_1_confidence as a reference, for example).

To Reproduce

Here is the code to build the stats:

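import pandas as pd

# Load the keywords table (tab-separated, first line is the header).
dfp_keywords_raw = pd.read_csv('keywords.tsv000', sep='\t', header=0)
# Summary statistics (count, mean, min, max, quartiles) for both confidence columns.
dfp_keywords_raw[['ai_service_1_confidence', 'ai_service_2_confidence']].describe()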

Steps to reproduce the behavior:
Run the code above in a Python environment (3.6.13) with pandas 1.1.5 installed.

Expected behavior
I expect the values in the column 'ai_service_2_confidence' in the keywords.tsv000 file to be between 0 and 100 or, if that is not the case, a more precise description of the values of 'ai_service_2_confidence' in the documentation (such as the range).

Additional context
I have a list of the keywords that seem to be affected by these extreme values:
unsplash_extreme_value.zip

I hope this helps with your investigation 🕵️‍♀️ (and I hope it is not just me missing something)

PS: your dataset is great by the way (really hope to have access to the full version soon)👍

cardinality and type of unsplash_photos.photo_featured

I was checking the cardinality of various columns in the dataset, and unsplash_photos.photo_featured is always true.

unsplash_lite=# select photo_featured, count(*) from unsplash_photos group by 1;
 photo_featured | count
----------------+-------
 t              | 25000

Is this the expected value?

Also - the data type for this column in the create-tables.sql is varchar and I think it should be bool. I did a quick reload of the data checking if it would still be valid with that change, and it would. Happy to submit a pull request for that if you like.

Downloading unsplash dataset

I am trying to download the lite version of the Unsplash dataset, but the download link for the lite version doesn't give me any images. Are the images included in the download, or do I need to download them using the API?

Publishing on Hugging Face

Hi, Sylvain from the Hugging Face datasets team here.

It would be awesome to have this dataset published on Hugging Face. I discovered it through this blog post: https://huggingface.co/blog/visheratin/nomic-data-cleaning, which relies on a user dataset: https://huggingface.co/datasets/visheratin/unsplash-caption-questions-init.

Having a presence on the HF Hub would make it much easier for ML practitioners to train new models.

You can control the license, terms, and user access (see https://huggingface.co/docs/hub/datasets-gated#gated-datasets).

Elaboration of data fields and additions

[1] photo_featured does this include the featured topic categories at the top?
[1.1] Could this field include which topics it is featured in, but also if the photo was submitted and rejected from a topic?
[1.2] Is there historic data on which images were not included as 'searchable' before the search system was changed to make everything searchable?

[2] suggested_by_user the description mentions 'a user (human)'. At some point (maybe?) Unsplash was adding tags or keywords to its approved/moderated photos. Does (or could) this field distinguish who added them (uploader or staff)?

[4] keyword Is this referring to the search terms used to find the photo?
[4.1] Assuming it is the search keywords, can we add a field for the position at which the photo was displayed on the website (i.e. whether it was in the first row and first column, or was an image 30 photos down that a searcher scrolled to find and pick)?

Thanks for releasing the dataset, it's a great contribution to the research community!

Publishing on Kaggle

Hello,

Thanks for sharing / publishing this dataset.
Are there any plans for mirroring this dataset on Kaggle? If not, can I publish the Lite Dataset as a public dataset on Kaggle linking it back to Unsplash as the source and using the same licensing terms as here?

It would be great to make this data available on Kaggle where a lot of ML research and models can be built.

Downloading the images

Hello,

I would like to download the images using the provided urls. Do I risk getting blocked for making too many requests/downloads?

Thank you
