mmc4's Introduction

📷 📝 Multimodal C4 (mmc4) 📝 📷

An open, billion-scale corpus of images interleaved with text.


Updates

  • released mmc4 version 1.1 🔥 which fixes #11 and #10

Corpus stats (v1.1)

Dataset                                         # images   # docs   # tokens
Multimodal-C4 (mmc4)                            571M       101.2M   43B
Multimodal-C4 fewer-faces (mmc4-ff)             375M       77.7M    33B
Multimodal-C4 core (mmc4-core)                  29.9M      7.3M     2.4B
Multimodal-C4 core fewer-faces (mmc4-core-ff)   22.4M      5.5M     1.8B

More details about these datasets and our processing steps can be found in our paper. (The paper currently describes v1 of the corpus; we will update it to v1.1 soon.)

Accessing mmc4-ff

Documents

You can directly download the "fewer faces" multimodal c4 documents at urls like this:

https://storage.googleapis.com/ai2-jackh-mmc4-public/data_v1.1/docs_no_face_shard_{$SHARD}_v2.jsonl.zip

where SHARD can vary from 0 to 23098. 14 shards are missing and are not included in the dataset.

You can download the smaller "core fewer faces" documents at URLs like this:

https://storage.googleapis.com/ai2-jackh-mmc4-public/data_core_v1.1/docs_no_face_shard_{$SHARD}_v3.jsonl.zip

where SHARD can vary from 0 to 23098. The total size of all these files together is approximately 9.4GB.
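
As a rough sketch, the shard URLs above can be enumerated in Python while skipping the shard numbers listed under "The missing shards" further down. The helper `shard_urls` and the constant names are illustrative and not part of the repository; only the URL templates and the shard numbers come from this page.

```python
# URL templates copied from the text above; {shard} stands in for {$SHARD}.
FEWER_FACES_URL = (
    "https://storage.googleapis.com/ai2-jackh-mmc4-public/"
    "data_v1.1/docs_no_face_shard_{shard}_v2.jsonl.zip"
)
CORE_FEWER_FACES_URL = (
    "https://storage.googleapis.com/ai2-jackh-mmc4-public/"
    "data_core_v1.1/docs_no_face_shard_{shard}_v3.jsonl.zip"
)

NUM_SHARDS = 23099  # shard ids run from 0 to 23098

# Shard ids documented on this page as missing from the corpus.
MISSING_SHARDS = frozenset({
    3218, 3267, 5064, 5146, 7119, 8991, 9750,
    11899, 15127, 15252, 16996, 17369, 17499, 17818,
})


def shard_urls(template):
    """Return one URL per shard id that is expected to exist (hypothetical helper)."""
    return [
        template.format(shard=shard)
        for shard in range(NUM_SHARDS)
        if shard not in MISSING_SHARDS
    ]


if __name__ == "__main__":
    urls = shard_urls(FEWER_FACES_URL)
    print(len(urls))
    print(urls[0])
```

Individual downloads may still fail (one of the issues below reports additional 404 errors for the core set), so a real downloader should tolerate missing files rather than assume every generated URL resolves.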

You can also download and unzip these files automatically with the provided scripts. Pass the destination folder as an argument, like:

sh download_scripts/fewer_facesv2.sh /path/to/destination/folder

sh download_scripts/fewer_faces_corev3.sh /path/to/destination/folder

Documents in both sets contain text, image URLs, assignments of images to sentences, and image-by-text CLIP ViT-L/14 similarity matrices. Specifically:

  • text_list: a list of sentences comprising the text of the document
  • url: the original url where the document was hosted
  • image_info: a list of images. Each image contains:
    • image_name: a filename that you could download the image to
    • face_detections: None if no faces are detected (which should be the case in "fewer faces")
    • matched_text_index: the index within text_list representing the sentence that this image is matched to
    • matched_sim: the CLIP ViT-L/14 similarity between the image and the sentence at the matched index
  • similarity_matrix: a matrix of shape len(image_info) x len(text_list) where similarity_matrix[i, j] is the CLIP ViT-L/14 similarity between image i and sentence j.
  • could_have_url_duplicate: a small number of URLs (~3%) in the corpus may have duplicate entries because Common Crawl collects multiple snapshots over time. We downsample such that, in expectation, each URL occurs once, but duplicates are technically possible. You can discard all entries with could_have_url_duplicate equal to 1 if you want a more strictly deduplicated set.
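
A minimal sketch of reading documents and applying the stricter deduplication described in the last bullet. It assumes that each line of an unzipped shard is one JSON object (as the .jsonl extension suggests); `iter_documents` and `strictly_deduplicated` are hypothetical helper names, and the two sample lines are synthetic.

```python
import json


def iter_documents(lines):
    """Yield one parsed document per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def strictly_deduplicated(docs):
    """Drop every document flagged with could_have_url_duplicate == 1."""
    return [doc for doc in docs if doc.get("could_have_url_duplicate", 0) != 1]


if __name__ == "__main__":
    # Synthetic stand-in for the lines of an unzipped shard file.
    sample_lines = [
        '{"url": "http://example.com/a", "could_have_url_duplicate": 0}\n',
        '{"url": "http://example.com/b", "could_have_url_duplicate": 1}\n',
    ]
    kept = strictly_deduplicated(iter_documents(sample_lines))
    print([doc["url"] for doc in kept])
```

With a real shard, the same functions can be fed an open file handle instead of the list of sample lines.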

Here's an example:

{'image_info': [{'face_detections': None,
                 'image_name': 'b9040a0dbb22.jpg',
                 'matched_sim': 0.27694183588027954,
                 'matched_text_index': 2,
                 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'},
                {'face_detections': None,
                 'image_name': 'db1c21bc8474.jpg',
                 'matched_sim': 0.3234919607639313,
                 'matched_text_index': 1,
                 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}],
 'similarity_matrix': [[0.24363446235656738,
                        0.31758785247802734,
                        0.27694183588027954],
                       [0.2233106791973114,
                        0.3234919607639313,
                        0.26118797063827515]],
 'text_list': ['When you lock the door using the lock tab on the driver’s '
               'door, all of the other doors and tailgate lock at the same '
               'time.',
               'Press the master door lock switch in as shown to lock or '
               'unlock all doors and the tailgate.',
               'When you lock/unlock the driver’s door and tailgate using the '
               'master lock switch, all the other doors lock/ unlock at the '
               'same time.'],
 'url': 'http://www.hfitinfo.com/hofi-48.html',
 'could_have_url_duplicate': 0 }
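
The relationship between the fields can be checked with a short sketch: for every image i, matched_sim should equal similarity_matrix[i][matched_text_index]. The numbers below are copied from the example above; the sentences are replaced by placeholders, and `check_alignment` is a hypothetical helper rather than code from the repository.

```python
import math

# Values copied from the example document; text_list entries are placeholders.
EXAMPLE_DOC = {
    "image_info": [
        {"image_name": "b9040a0dbb22.jpg",
         "matched_text_index": 2,
         "matched_sim": 0.27694183588027954},
        {"image_name": "db1c21bc8474.jpg",
         "matched_text_index": 1,
         "matched_sim": 0.3234919607639313},
    ],
    "similarity_matrix": [
        [0.24363446235656738, 0.31758785247802734, 0.27694183588027954],
        [0.2233106791973114, 0.3234919607639313, 0.26118797063827515],
    ],
    "text_list": ["sentence 0", "sentence 1", "sentence 2"],
}


def check_alignment(doc, tol=1e-6):
    """Return True if the matrix shape and every matched_sim agree with the doc."""
    sims = doc["similarity_matrix"]
    if len(sims) != len(doc["image_info"]):
        return False
    for row, image in zip(sims, doc["image_info"]):
        if len(row) != len(doc["text_list"]):
            return False
        j = image["matched_text_index"]
        if not math.isclose(row[j], image["matched_sim"], abs_tol=tol):
            return False
    return True


if __name__ == "__main__":
    print(check_alignment(EXAMPLE_DOC))
```

Note that a match is not necessarily the row maximum: in the example, image 0 is matched to sentence 2 although sentence 1 has a higher score in its row.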

The assignments of images to sentences are computed using compute_assignments.py.

Image features

You can directly download CLIP ViT-L/14 features extracted from the images at urls like this:

https://storage.googleapis.com/ai2-jackh-mmc4-public/images/clip_vitl14_shard_{$SHARD}_features.pkl

where SHARD can vary from 0 to 23098. The total size of all the image feature files together is approximately 1.8 TB. Each pkl file is a dictionary that maps an image filename (available in the document JSONs; see image_name above) to the associated CLIP feature. We used a JAX port of CLIP to extract features on TPU, so there may be some numerical differences from CPU or GPU versions of the features. We have found these differences to be relatively small in practice.
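
A minimal sketch of using such a file, assuming it can be read with Python's standard pickle module and is a plain dictionary keyed by image_name, as described above. The helpers `load_features` and `features_for_document` are hypothetical, and the demo writes a tiny stand-in file with short lists in place of real CLIP vectors, whose exact array type is not stated here.

```python
import os
import pickle
import tempfile


def load_features(pkl_path):
    """Load one feature shard: a dict from image filename to CLIP feature."""
    # pickle can execute arbitrary code on load, so only open trusted files.
    with open(pkl_path, "rb") as f:
        return pickle.load(f)


def features_for_document(doc, features):
    """Collect features for one document's images, skipping any that are absent."""
    return {
        image["image_name"]: features[image["image_name"]]
        for image in doc["image_info"]
        if image["image_name"] in features
    }


if __name__ == "__main__":
    # Stand-in shard: short lists are used instead of real CLIP vectors.
    fake_shard = {"b9040a0dbb22.jpg": [0.1, 0.2], "db1c21bc8474.jpg": [0.3, 0.4]}
    doc = {"image_info": [{"image_name": "b9040a0dbb22.jpg"},
                          {"image_name": "not_in_shard.jpg"}]}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "clip_vitl14_shard_0_features.pkl")
        with open(path, "wb") as f:
            pickle.dump(fake_shard, f)
        print(features_for_document(doc, load_features(path)))
```

Skipping absent filenames keeps the lookup robust when a document's images are not all present in the loaded shard.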

Accessing mmc4

If you are interested in accessing mmc4 (and mmc4-core) without the fewer faces restriction, please fill out this form.

Accessing raw images

We are not releasing raw images for now. But if you are interested in potential updates, you can contact us using this google form.

The missing shards ⛏️💎🔍

14 of the 23099 shards (about 0.06%) are missing from the corpus. These were not included in any statistics or experiments, so they are not part of mmc4. The missing shards are:

3218,3267,5064,5146,7119,8991,9750,11899,15127,15252,16996,17369,17499,17818

License

  • The new contributions of mmc4 beyond text-only c4 (e.g., the similarity matrices/image-text alignments) are released under ODC-BY.
  • By using mmc4, you are also bound by the Common Crawl terms of use.

Citation

If you found our work useful, please consider citing:

@article{zhu2023multimodal,
  title={{Multimodal C4}: An Open, Billion-scale Corpus of Images Interleaved With Text},
  author={Wanrong Zhu and Jack Hessel and Anas Awadalla and Samir Yitzhak Gadre and Jesse Dodge and Alex Fang and Youngjae Yu and Ludwig Schmidt and William Yang Wang and Yejin Choi},
  journal={arXiv preprint arXiv:2304.06939},
  year={2023}
}

mmc4's People

Contributors

chlience, jmhessel, luodian, sramshetty, vegb

mmc4's Issues

some shards cannot be accessed with 404 error

Hi, thanks for your great project.

I am downloading the shards with the fewer_faces_core_v3 script, but I found that some shards cannot be accessed. The detailed list is below:

shard_1277.zip shard_3218.zip shard_3267.zip shard_5064.zip shard_5146.zip shard_7119.zip shard_8991.zip shard_9750.zip shard_11899.zip shard_15127.zip shard_15252.zip shard_16996.zip shard_17369.zip shard_17499.zip shard_17818.zip shard_22953.zip

Are there any solutions to this problem?

add download script for fewer_facev2 and fewer_face_corev3

Hi Jack and Wanrong, thank you for providing this valuable dataset!

I'm currently utilizing the data to experiment with the openflamingo model and have written a download script for the currently released splits (fewer_faces_v2 and fewer_faces_core_v3). This should help other users quickly prepare and access this dataset.

I've created PR #2, and I hope it contributes positively to the project.

Image downloading

Hello, your work is great! Regarding image download, many of the existing URLs are invalid. Is there any other way to download the images? Looking forward to your kind reply.

Some links are unavailable. They are:

Any plan to release the data processing code?

Thank you for your great work. I am wondering whether you plan to release the data processing code. With the code released, people can process other datasets based on web pages beyond C4.

Multiple images for an identical matched_text_index

Dear authors,

Thanks for releasing MMC4.

In the paper the following is stated:

"we use [14] to compute a bipartite assignment of images to sentences, under the constraint that each sentence can only be assigned a single image." ,
"For documents with more images than sentences, after assigning an image to each sentence, we assign according to max similarity.".

However, we found examples where multiple images are aligned to a text span.
For instance, consider the following example in ./docs_shard_10063_v3.jsonl.

{
    "url": "http://easydesigns.biz/easydesigns-wins-3rd-consecutive-best-of-houzz-award/",
    "text_list": [
        "cherry hill, nj, january 19, 2015 \u2013 easydesigns of cherry hill, nj has been awarded \u201cbest of houzz\u201d for customer satisfaction by houzz, the leading platform for home remodeling and design.",
        "the interior design and real estate staging firm, in business since 2005, was chosen by the more than 25 million monthly unique users that comprise the houzz community from among more than 500,000 active home building, remodeling and design industry professionals.",
        "\u201ci am so happy to be selected for the 3rd consecutive year.",
        "customer satisfaction is a primary goal of my firm so i am thrilled to be recognized by such a large and prominent community\u201d, said beth secosky, owner of easydesigns."
    ],
    "image_info": [
        {
            "image_name": "202706b24ac4.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 0,
            "matched_sim": 0.33591771125793457,
            "face_detections": null
        },
        {
            "image_name": "a97c58871c38.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 0,
            "matched_sim": 0.31495240330696106,
            "face_detections": null
        },
        {
            "image_name": "ce3c9aa070ce.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 1,
            "matched_sim": 0.2770630717277527,
            "face_detections": null
        },
        {
            "image_name": "c22425c7d977.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 0,
            "matched_sim": 0.3448386490345001,
            "face_detections": null
        }
    ],
    "similarity_matrix": [
        [
            0.33591771125793457,
            0.2377069592475891,
            0.17204634845256805,
            0.22403109073638916
        ],
        [
            0.31495240330696106,
            0.27460938692092896,
            0.12367681413888931,
            0.17759563028812408
        ],
        [
            0.3045308589935303,
            0.2770630717277527,
            0.15680742263793945,
            0.21054978668689728
        ],
        [
            0.3448386490345001,
            0.26175469160079956,
            0.16365793347358704,
            0.237198144197464
        ]
    ]
}

Is that intended?

Thanks for any pointers.
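
For reference, documents like this one can be found with a few lines of Python. This is only a sketch: `images_per_sentence` and `shared_sentences` are hypothetical helper names, and the indices in the demo are the matched_text_index values from the example above.

```python
from collections import Counter


def images_per_sentence(doc):
    """Count how many images are matched to each sentence index."""
    return Counter(image["matched_text_index"] for image in doc["image_info"])


def shared_sentences(doc):
    """Return the sentence indices that have more than one matched image."""
    return sorted(idx for idx, n in images_per_sentence(doc).items() if n > 1)


if __name__ == "__main__":
    # matched_text_index values copied from the example above.
    doc = {"image_info": [{"matched_text_index": i} for i in (0, 0, 1, 0)]}
    print(images_per_sentence(doc))
    print(shared_sentences(doc))
```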

The data performance is inconsistent with the paper

data

{
  "image_info": [
    {
      "image_name": "250a7a3bc1cd.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-54.jpg",
      "matched_text_index": 0,
      "matched_sim": 0.2656169533729553,
      "face_detections": null
    },
    {
      "image_name": "3bde7c3da946.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-47.jpg",
      "matched_text_index": 0,
      "matched_sim": 0.2512032985687256,
      "face_detections": null
    },
    {
      "image_name": "7c0fd1439a00.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-46.jpg",
      "matched_text_index": 3,
      "matched_sim": 0.2422592043876648,
      "face_detections": null
    },
    {
      "image_name": "639278a7b15f.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-70.jpg",
      "matched_text_index": 0,
      "matched_sim": 0.2992704510688782,
      "face_detections": null
    },
    {
      "image_name": "1495695de8fc.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-72.jpg",
      "matched_text_index": 16,
      "matched_sim": 0.24202091991901398,
      "face_detections": null
    },
    {
      "image_name": "16941538d54c.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-52.jpg",
      "matched_text_index": 7,
      "matched_sim": 0.28069621324539185,
      "face_detections": null
    },
    {
      "image_name": "bc48d6f4bcee.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-34.jpg",
      "matched_text_index": 22,
      "matched_sim": 0.22284036874771118,
      "face_detections": null
    },
    {
      "image_name": "3a33cf5918c3.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-62.jpg",
      "matched_text_index": 20,
      "matched_sim": 0.19130770862102509,
      "face_detections": null
    },
    {
      "image_name": "0d35d2a42797.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-28.jpg",
      "matched_text_index": 2,
      "matched_sim": 0.28103262186050415,
      "face_detections": null
    },
    {
      "image_name": "1500104087f1.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-77.jpg",
      "matched_text_index": 14,
      "matched_sim": 0.20090755820274353,
      "face_detections": null
    },
    {
      "image_name": "08ea1babb72e.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-08.jpg",
      "matched_text_index": 23,
      "matched_sim": 0.22727134823799133,
      "face_detections": null
    },
    {
      "image_name": "442f67572306.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-60.jpg",
      "matched_text_index": 6,
      "matched_sim": 0.27460283041000366,
      "face_detections": null
    },
    {
      "image_name": "48983bca5b90.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-68.jpg",
      "matched_text_index": 0,
      "matched_sim": 0.24618177115917206,
      "face_detections": null
    },
    {
      "image_name": "cc3912b7238a.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-51.jpg",
      "matched_text_index": 17,
      "matched_sim": 0.21379199624061584,
      "face_detections": null
    },
    {
      "image_name": "865b6df71dc0.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-50.jpg",
      "matched_text_index": 0,
      "matched_sim": 0.28164172172546387,
      "face_detections": null
    },
    {
      "image_name": "0b467454f984.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-74.jpg",
      "matched_text_index": 13,
      "matched_sim": 0.22586670517921448,
      "face_detections": null
    },
    {
      "image_name": "9272116c8881.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-64.jpg",
      "matched_text_index": 15,
      "matched_sim": 0.2384014129638672,
      "face_detections": null
    },
    {
      "image_name": "3a18c751be76.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-78.jpg",
      "matched_text_index": 18,
      "matched_sim": 0.25756126642227173,
      "face_detections": null
    },
    {
      "image_name": "9fe79aa02fd9.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-15.jpg",
      "matched_text_index": 11,
      "matched_sim": 0.235867440700531,
      "face_detections": null
    },
    {
      "image_name": "a4bf3cd09567.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-12.jpg",
      "matched_text_index": 1,
      "matched_sim": 0.2627013027667999,
      "face_detections": null
    },
    {
      "image_name": "2cc9a869b702.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-mappa.jpg",
      "matched_text_index": 8,
      "matched_sim": 0.24329496920108795,
      "face_detections": null
    },
    {
      "image_name": "098446e754c0.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-00.jpg",
      "matched_text_index": 4,
      "matched_sim": 0.27757883071899414,
      "face_detections": null
    },
    {
      "image_name": "8a741319bbf6.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-02.jpg",
      "matched_text_index": 5,
      "matched_sim": 0.2743998169898987,
      "face_detections": null
    },
    {
      "image_name": "759887fc2498.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-10.jpg",
      "matched_text_index": 12,
      "matched_sim": 0.22129987180233002,
      "face_detections": null
    },
    {
      "image_name": "2a060f10744a.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-32.jpg",
      "matched_text_index": 6,
      "matched_sim": 0.26719343662261963,
      "face_detections": null
    },
    {
      "image_name": "6ea4501d2191.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-38.jpg",
      "matched_text_index": 21,
      "matched_sim": 0.21889738738536835,
      "face_detections": null
    },
    {
      "image_name": "97f69e5ebe95.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-16.jpg",
      "matched_text_index": 9,
      "matched_sim": 0.2752577066421509,
      "face_detections": null
    },
    {
      "image_name": "cce1016b9143.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-26.jpg",
      "matched_text_index": 10,
      "matched_sim": 0.25289982557296753,
      "face_detections": null
    },
    {
      "image_name": "56f86d49ece7.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-18.jpg",
      "matched_text_index": 19,
      "matched_sim": 0.264477014541626,
      "face_detections": null
    },
    {
      "image_name": "6da38d428f9a.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-40.jpg",
      "matched_text_index": 10,
      "matched_sim": 0.2669289708137512,
      "face_detections": null
    },
    {
      "image_name": "ac795f0acd6f.jpg",
      "raw_url": "http://www.florenceholidays.com/image/463-48.jpg",
      "matched_text_index": 23,
      "matched_sim": 0.22040322422981262,
      "face_detections": null
    }
  ],
  "raw_filename": "s3://mllm-raw/mllm-raw-mmc4/text/2023-05-10/docs_shard_9828_v2.jsonl",
  "similarity_matrix": [
    [
      0.2656169533729553,
      0.18537147343158722,
      0.13874414563179016,
      0.16725146770477295,
      0.16030831634998322,
      0.2211000621318817,
      0.24686755239963531,
      0.23777732253074646,
      0.1747286319732666,
      0.21648432314395905,
      0.2049683928489685,
      0.18380270898342133,
      0.14410234987735748,
      0.18983696401119232,
      0.1958218216896057,
      0.19164063036441803,
      0.1814287155866623,
      0.17133653163909912,
      0.16795969009399414,
      0.21247006952762604,
      0.17101499438285828,
      0.174088716506958,
      0.2158135026693344,
      0.16981327533721924
    ],
    [
      0.2512032985687256,
      0.24227669835090637,
      0.21547411382198334,
      0.23384088277816772,
      0.23772752285003662,
      0.23169569671154022,
      0.2269171178340912,
      0.24486318230628967,
      0.21773478388786316,
      0.23209112882614136,
      0.21426191926002502,
      0.1548122763633728,
      0.15852200984954834,
      0.15297038853168488,
      0.1333686113357544,
      0.15327875316143036,
      0.14285880327224731,
      0.19914713501930237,
      0.16902655363082886,
      0.18521054089069366,
      0.12569870054721832,
      0.1995048224925995,
      0.20779436826705933,
      0.22257161140441895
    ],
    [
      0.26278921961784363,
      0.2423395961523056,
      0.22254084050655365,
      0.2422592043876648,
      0.24145008623600006,
      0.23121456801891327,
      0.22555232048034668,
      0.2584843635559082,
      0.20904693007469177,
      0.22809600830078125,
      0.2177431285381317,
      0.1624596118927002,
      0.16100747883319855,
      0.1573682427406311,
      0.14590132236480713,
      0.1517740786075592,
      0.14861249923706055,
      0.19251181185245514,
      0.17229287326335907,
      0.19099831581115723,
      0.12592697143554688,
      0.18233712017536163,
      0.1934153437614441,
      0.22324585914611816
    ],
    [
      0.2992704510688782,
      0.1963413506746292,
      0.1264077126979828,
      0.21164464950561523,
      0.20467646420001984,
      0.24412426352500916,
      0.2551572918891907,
      0.26019302010536194,
      0.19189336895942688,
      0.2272934913635254,
      0.22603432834148407,
      0.18582221865653992,
      0.12029045820236206,
      0.2203194797039032,
      0.1684253215789795,
      0.22262603044509888,
      0.21966834366321564,
      0.19113287329673767,
      0.2180924266576767,
      0.22134266793727875,
      0.17851127684116364,
      0.2030227929353714,
      0.22269167006015778,
      0.18410852551460266
    ],
    [
      0.2781187891960144,
      0.18948838114738464,
      0.14410817623138428,
      0.19522805511951447,
      0.17081187665462494,
      0.2175617218017578,
      0.2780090868473053,
      0.27391675114631653,
      0.2016804814338684,
      0.21859076619148254,
      0.23271512985229492,
      0.18943879008293152,
      0.12236252427101135,
      0.22485750913619995,
      0.1821601539850235,
      0.23592393100261688,
      0.24202091991901398,
      0.1769600212574005,
      0.2064613699913025,
      0.24329333007335663,
      0.17481355369091034,
      0.18273276090621948,
      0.2169996201992035,
      0.18116874992847443
    ],
    [
      0.2922931909561157,
      0.2074548900127411,
      0.16617578268051147,
      0.20417538285255432,
      0.19207656383514404,
      0.26095759868621826,
      0.2728133201599121,
      0.28069621324539185,
      0.1881171017885208,
      0.21880611777305603,
      0.22960172593593597,
      0.18206548690795898,
      0.1555396467447281,
      0.20267625153064728,
      0.18538689613342285,
      0.19657163321971893,
      0.1882498562335968,
      0.19706885516643524,
      0.17470233142375946,
      0.2230358123779297,
      0.16874438524246216,
      0.21609702706336975,
      0.2287738025188446,
      0.1786457598209381
    ],
    [
      0.27176713943481445,
      0.2128254771232605,
      0.20758211612701416,
      0.18684495985507965,
      0.16163602471351624,
      0.19102495908737183,
      0.25558799505233765,
      0.2264847457408905,
      0.21248562633991241,
      0.2310383915901184,
      0.2183862328529358,
      0.16110670566558838,
      0.17942406237125397,
      0.15998724102973938,
      0.15694956481456757,
      0.16826103627681732,
      0.1395907700061798,
      0.12549827992916107,
      0.17220377922058105,
      0.20038007199764252,
      0.16021588444709778,
      0.20366857945919037,
      0.22284036874771118,
      0.18692228198051453
    ],
    [
      0.2751815915107727,
      0.1863328218460083,
      0.11373403668403625,
      0.17706799507141113,
      0.17377373576164246,
      0.22297735512256622,
      0.2362765669822693,
      0.24405865371227264,
      0.16753968596458435,
      0.2192014455795288,
      0.22819554805755615,
      0.2314252257347107,
      0.13049623370170593,
      0.22346243262290955,
      0.19437643885612488,
      0.19594870507717133,
      0.18355610966682434,
      0.18184378743171692,
      0.16997304558753967,
      0.22733725607395172,
      0.19130770862102509,
      0.18877500295639038,
      0.2150861918926239,
      0.184693843126297
    ],
    [
      0.30005452036857605,
      0.2444041520357132,
      0.28103262186050415,
      0.2104332149028778,
      0.1963677704334259,
      0.2368326038122177,
      0.29620829224586487,
      0.28124570846557617,
      0.23367264866828918,
      0.25121891498565674,
      0.2637023329734802,
      0.17868557572364807,
      0.20320042967796326,
      0.17928679287433624,
      0.17917431890964508,
      0.17291668057441711,
      0.1539233922958374,
      0.14969635009765625,
      0.17179861664772034,
      0.23650386929512024,
      0.15658210217952728,
      0.22286033630371094,
      0.23230329155921936,
      0.21675530076026917
    ],
    [
      0.25243330001831055,
      0.18594501912593842,
      0.13128787279129028,
      0.16527792811393738,
      0.15479980409145355,
      0.19544941186904907,
      0.20602142810821533,
      0.20982371270656586,
      0.17409390211105347,
      0.21797922253608704,
      0.17991343140602112,
      0.21105024218559265,
      0.16548922657966614,
      0.21043086051940918,
      0.20090755820274353,
      0.19794145226478577,
      0.18184930086135864,
      0.19286535680294037,
      0.2427641749382019,
      0.21400651335716248,
      0.17760156095027924,
      0.17696458101272583,
      0.19731730222702026,
      0.21179193258285522
    ],
    [
      0.26753920316696167,
      0.20994319021701813,
      0.19886861741542816,
      0.19198423624038696,
      0.2126874327659607,
      0.2491675615310669,
      0.23521187901496887,
      0.21491405367851257,
      0.19633451104164124,
      0.2507404685020447,
      0.21660488843917847,
      0.21398556232452393,
      0.2036958932876587,
      0.1745581477880478,
      0.14521823823451996,
      0.18938466906547546,
      0.1521453857421875,
      0.1894209086894989,
      0.17959250509738922,
      0.19471874833106995,
      0.1728854477405548,
      0.2092742621898651,
      0.21670138835906982,
      0.22727134823799133
    ],
    [
      0.28067055344581604,
      0.22548863291740417,
      0.18247827887535095,
      0.21820944547653198,
      0.21359281241893768,
      0.23284107446670532,
      0.27460283041000366,
      0.2742912471294403,
      0.22623062133789062,
      0.180877685546875,
      0.1955769658088684,
      0.15270674228668213,
      0.14200682938098907,
      0.17641660571098328,
      0.14438723027706146,
      0.1489056795835495,
      0.15617594122886658,
      0.16448575258255005,
      0.16287319362163544,
      0.19025346636772156,
      0.15318799018859863,
      0.18822550773620605,
      0.18225428462028503,
      0.17615069448947906
    ],
    [
      0.24618177115917206,
      0.16895894706249237,
      0.1038752868771553,
      0.17481262981891632,
      0.1720360517501831,
      0.20523856580257416,
      0.21258386969566345,
      0.19601623713970184,
      0.15526573359966278,
      0.19139109551906586,
      0.17778784036636353,
      0.18016183376312256,
      0.1276606321334839,
      0.20190200209617615,
      0.16328783333301544,
      0.21840637922286987,
      0.20794200897216797,
      0.18328502774238586,
      0.20778480172157288,
      0.1684459149837494,
      0.18211877346038818,
      0.1768956184387207,
      0.18382760882377625,
      0.15014015138149261
    ],
    [
      0.2970350980758667,
      0.23741406202316284,
      0.18147963285446167,
      0.20299431681632996,
      0.2078612595796585,
      0.2552454173564911,
      0.2507679760456085,
      0.27146780490875244,
      0.18869897723197937,
      0.23548991978168488,
      0.21190917491912842,
      0.20612069964408875,
      0.1659134477376938,
      0.19959141314029694,
      0.18094795942306519,
      0.20242686569690704,
      0.1843566596508026,
      0.21379199624061584,
      0.18157589435577393,
      0.205306276679039,
      0.17089620232582092,
      0.2159019112586975,
      0.2302195429801941,
      0.19268539547920227
    ],
    [
      0.28164172172546387,
      0.19847586750984192,
      0.12840622663497925,
      0.19365528225898743,
      0.17914311587810516,
      0.2512874901294708,
      0.2648576498031616,
      0.2690545320510864,
      0.17980141937732697,
      0.21610325574874878,
      0.22502651810646057,
      0.19437208771705627,
      0.15698686242103577,
      0.18768680095672607,
      0.1892012059688568,
      0.19485749304294586,
      0.17957207560539246,
      0.18366765975952148,
      0.17459532618522644,
      0.22162693738937378,
      0.16398005187511444,
      0.20830923318862915,
      0.20397639274597168,
      0.16697406768798828
    ],
    [
      0.25912606716156006,
      0.17477290332317352,
      0.1248440146446228,
      0.1773035228252411,
      0.1605766862630844,
      0.20320840179920197,
      0.2315133661031723,
      0.2092740386724472,
      0.16888362169265747,
      0.2160763144493103,
      0.19049742817878723,
      0.19386795163154602,
      0.11962063610553741,
      0.22586670517921448,
      0.16560178995132446,
      0.23559865355491638,
      0.22271788120269775,
      0.1765291690826416,
      0.2128770649433136,
      0.22049763798713684,
      0.1834126114845276,
      0.2062053233385086,
      0.22095519304275513,
      0.1788317859172821
    ],
    [
      0.27939218282699585,
      0.1902030110359192,
      0.14197932183742523,
      0.19378405809402466,
      0.1654062569141388,
      0.20786695182323456,
      0.2570139169692993,
      0.24496804177761078,
      0.18677487969398499,
      0.2267095148563385,
      0.21814917027950287,
      0.20866522192955017,
      0.12455444037914276,
      0.21700063347816467,
      0.17640623450279236,
      0.2384014129638672,
      0.22705960273742676,
      0.18717768788337708,
      0.21774734556674957,
      0.23203477263450623,
      0.16865795850753784,
      0.18535155057907104,
      0.229965478181839,
      0.17057979106903076
    ],
    [
      0.2578897774219513,
      0.18187379837036133,
      0.157437264919281,
      0.16104531288146973,
      0.16988921165466309,
      0.1856439709663391,
      0.17817062139511108,
      0.18068116903305054,
      0.16490276157855988,
      0.2239041030406952,
      0.15650111436843872,
      0.2171623706817627,
      0.1796576976776123,
      0.19457583129405975,
      0.1883215308189392,
      0.21573877334594727,
      0.1794152557849884,
      0.20577472448349,
      0.25756126642227173,
      0.17432603240013123,
      0.20140011608600616,
      0.20026803016662598,
      0.21790653467178345,
      0.1797640323638916
    ],
    [
      0.3012367784976959,
      0.23252299427986145,
      0.17309220135211945,
      0.19829556345939636,
      0.1851770132780075,
      0.23905816674232483,
      0.24880632758140564,
      0.2601985037326813,
      0.22945672273635864,
      0.2452102154493332,
      0.24476373195648193,
      0.235867440700531,
      0.22197756171226501,
      0.17317409813404083,
      0.16594859957695007,
      0.1961573362350464,
      0.16680696606636047,
      0.15003995597362518,
      0.17831633985042572,
      0.23233453929424286,
      0.1492016315460205,
      0.2028636336326599,
      0.22148379683494568,
      0.2380666732788086
    ],
    [
      0.2567637860774994,
      0.2627013027667999,
      0.2415195107460022,
      0.21972522139549255,
      0.229500412940979,
      0.2606694996356964,
      0.2393096387386322,
      0.21511316299438477,
      0.20434452593326569,
      0.25683489441871643,
      0.21414227783679962,
      0.1979813277721405,
      0.21659377217292786,
      0.16809318959712982,
      0.14968480169773102,
      0.18970109522342682,
      0.16160917282104492,
      0.20579871535301208,
      0.17590181529521942,
      0.18161697685718536,
      0.16774091124534607,
      0.21376146376132965,
      0.22183777391910553,
      0.2363283783197403
    ],
    [
      0.20440340042114258,
      0.25654783844947815,
      0.1374446302652359,
      0.19396153092384338,
      0.1962192803621292,
      0.16107912361621857,
      0.17641037702560425,
      0.19867126643657684,
      0.24329496920108795,
      0.18130502104759216,
      0.1520768702030182,
      0.11049921810626984,
      0.1293787956237793,
      0.11953796446323395,
      0.0928579568862915,
      0.10114848613739014,
      0.09467273950576782,
      0.09945888072252274,
      0.10200296342372894,
      0.14478428661823273,
      0.1189497783780098,
      0.17200374603271484,
      0.1936771273612976,
      0.1985057294368744
    ],
    [
      0.2432769536972046,
      0.25157418847084045,
      0.16876569390296936,
      0.26960697770118713,
      0.27757883071899414,
      0.26767584681510925,
      0.228751540184021,
      0.25074413418769836,
      0.2403990924358368,
      0.1973457932472229,
      0.1811830848455429,
      0.12259773910045624,
      0.13429616391658783,
      0.15487930178642273,
      0.12713032960891724,
      0.1505340039730072,
      0.13716983795166016,
      0.1891537308692932,
      0.1412196308374405,
      0.1553836464881897,
      0.13241831958293915,
      0.19398462772369385,
      0.1865234076976776,
      0.2098095715045929
    ],
    [
      0.26000961661338806,
      0.22490854561328888,
      0.1864921748638153,
      0.21654561161994934,
      0.20666083693504333,
      0.2743998169898987,
      0.23641645908355713,
      0.22996115684509277,
      0.20635807514190674,
      0.23322924971580505,
      0.21897339820861816,
      0.20736373960971832,
      0.17582951486110687,
      0.20755189657211304,
      0.17359842360019684,
      0.21167927980422974,
      0.17193402349948883,
      0.2192540466785431,
      0.19589409232139587,
      0.2064056247472763,
      0.18129108846187592,
      0.20824459195137024,
      0.22145605087280273,
      0.22487695515155792
    ],
    [
      0.2623461186885834,
      0.23889517784118652,
      0.2427438497543335,
      0.2183896005153656,
      0.21212175488471985,
      0.2549915611743927,
      0.2736354470252991,
      0.25609055161476135,
      0.20930369198322296,
      0.23947212100028992,
      0.265799880027771,
      0.1920579969882965,
      0.22129987180233002,
      0.17116500437259674,
      0.17505031824111938,
      0.17993392050266266,
      0.16749173402786255,
      0.16981178522109985,
      0.1628037393093109,
      0.22596648335456848,
      0.13856804370880127,
      0.2092561274766922,
      0.2158513367176056,
      0.21962465345859528
    ],
    [
      0.23094117641448975,
      0.22668606042861938,
      0.17656467854976654,
      0.17847001552581787,
      0.16526298224925995,
      0.23309478163719177,
      0.26719343662261963,
      0.23345161974430084,
      0.21316848695278168,
      0.22282817959785461,
      0.24405664205551147,
      0.1533631533384323,
      0.20008155703544617,
      0.15645970404148102,
      0.15092146396636963,
      0.15335902571678162,
      0.13733996450901031,
      0.13660195469856262,
      0.1559619903564453,
      0.207774356007576,
      0.13135121762752533,
      0.1833842545747757,
      0.21023869514465332,
      0.20720401406288147
    ],
    [
      0.2592028081417084,
      0.25664880871772766,
      0.2360069453716278,
      0.21612191200256348,
      0.20885628461837769,
      0.2487727850675583,
      0.26819995045661926,
      0.2429094761610031,
      0.21862539649009705,
      0.24332360923290253,
      0.2555370032787323,
      0.1892305463552475,
      0.23086266219615936,
      0.16233950853347778,
      0.16140367090702057,
      0.1611015349626541,
      0.152574360370636,
      0.16100963950157166,
      0.15578940510749817,
      0.21708610653877258,
      0.13967522978782654,
      0.21889738738536835,
      0.22125279903411865,
      0.22893789410591125
    ],
    [
      0.27269139885902405,
      0.22553175687789917,
      0.19097526371479034,
      0.16638636589050293,
      0.1558951586484909,
      0.2096937596797943,
      0.2685315012931824,
      0.2543770670890808,
      0.21055540442466736,
      0.2752577066421509,
      0.26798948645591736,
      0.22909453511238098,
      0.20012491941452026,
      0.1728508174419403,
      0.16888268291950226,
      0.17895495891571045,
      0.16437724232673645,
      0.1287786066532135,
      0.16454419493675232,
      0.25635433197021484,
      0.156284898519516,
      0.20373453199863434,
      0.2320765256881714,
      0.24099911749362946
    ],
    [
      0.2399776577949524,
      0.19774340093135834,
      0.12040511518716812,
      0.17430981993675232,
      0.14952245354652405,
      0.19929563999176025,
      0.224985733628273,
      0.22176413238048553,
      0.1922917366027832,
      0.20409567654132843,
      0.25289982557296753,
      0.19200778007507324,
      0.16204297542572021,
      0.1441410481929779,
      0.15837222337722778,
      0.14166474342346191,
      0.14306770265102386,
      0.09706231206655502,
      0.13410936295986176,
      0.20554989576339722,
      0.1296057403087616,
      0.17983072996139526,
      0.21856139600276947,
      0.18663389980793
    ],
    [
      0.26796963810920715,
      0.22898869216442108,
      0.1669774353504181,
      0.17010965943336487,
      0.14586003124713898,
      0.20535022020339966,
      0.2784668207168579,
      0.2654045820236206,
      0.2083790898323059,
      0.2497047334909439,
      0.2815644145011902,
      0.20456752181053162,
      0.20314502716064453,
      0.169352725148201,
      0.17502890527248383,
      0.16250362992286682,
      0.15693418681621552,
      0.10033249855041504,
      0.15738829970359802,
      0.264477014541626,
      0.15369150042533875,
      0.20302604138851166,
      0.2226494699716568,
      0.22143156826496124
    ],
    [
      0.22886651754379272,
      0.213748961687088,
      0.1932818442583084,
      0.1846708357334137,
      0.1813632845878601,
      0.23859666287899017,
      0.2645314931869507,
      0.23957371711730957,
      0.17174378037452698,
      0.23277658224105835,
      0.2669289708137512,
      0.18671949207782745,
      0.19439442455768585,
      0.17349472641944885,
      0.16484905779361725,
      0.1774735450744629,
      0.1660252958536148,
      0.15749013423919678,
      0.16001388430595398,
      0.23672233521938324,
      0.1400991678237915,
      0.19983777403831482,
      0.2172851860523224,
      0.21426326036453247
    ],
    [
      0.1923743486404419,
      0.19373348355293274,
      0.17283162474632263,
      0.18948030471801758,
      0.14847639203071594,
      0.16681919991970062,
      0.17672811448574066,
      0.19356709718704224,
      0.16680869460105896,
      0.20056791603565216,
      0.1732291877269745,
      0.13736668229103088,
      0.10220174491405487,
      0.1255796253681183,
      0.1126173585653305,
      0.12471763044595718,
      0.12397050112485886,
      0.17429769039154053,
      0.15778496861457825,
      0.17825056612491608,
      0.10998402535915375,
      0.14916373789310455,
      0.17331171035766602,
      0.22040322422981262
    ]
  ],
  "text_list": [
    "The luxurious villa is located in Valdichiana, in the countryside south-east of Siena.",
    "In the 7th century BC the area of the villa was inhabited by the Etruscans.",
    "A part of the park that surrounds the building is indeed decorated with authentic Etruscan cinerary urns.",
    "In the 13th century in the area was built a castle that, after a long struggle with Siena, passed under the control of the Republic of Orvieto.",
    "After having been conquered by Perugia, in the 15th century the castle passed again under the rule of Siena.",
    "Annexed to the Grand Duchy of Tuscany in the second half of the 16th century, the castle was transformed into the present villa.",
    "The gentle hills in southern Tuscany covered with woodland and cultivated field, the century-old trees and the flowers decorating the luxuriant park that surrounds the building: that is the panorama the windows of the historic villa open onto.",
    "The villa situated in the countryside south of Siena is an excellent departure point for visiting Montalcino, Pienza, Montepulciano, the abbey of Monte Oliveto Maggiore, Siena and other important cities of art in central Italy.",
    "Lake Trasimeno is easily reached by car for pleasant one-day trips.",
    "In the vicinity of the villa are also some thermal baths and a tennis club.",
    "The villa is surrounded by a wide fully-fenced park in which are an ancient amphitheatre, the private swimming pool (12x4 m; depth: 1.40 m), two gazebos, a table for guests to have meals al fresco and the barbecue equipment.",
    "An annex next to the swimming pool is divided into kitchen, dining room, sauna and a half bathroom.",
    "In the Italian garden is also a pond.",
    "The five-storey building accommodates 14 guests in 2 twin bedrooms and 5 double bedrooms, and has 7 bathrooms and 1 half bathroom.",
    "The ground floor composes of kitchen, pantry room, dining room with direct access to the park, wide lounge with fireplace and a half bathroom.",
    "On the first floor are a bedroom with two single beds that can be united to form a double bed and three double bedrooms.",
    "The second floor consists of a bedroom with two single beds that can be united to form a double bed and two double bedrooms.",
    "The third and fourth floors have maintained the original medieval style.",
    "All the bedrooms have an ensuite bathroom.",
    "The villa comes with swimming pool, barbecue, sauna, working fireplace, air conditioning, satellite TV, DVD player, Internet connection, microwave oven, electric oven, dishwasher, washing machine, cot bed and private car parking space.",
    "A cook, a baby sitter and breakfast and maid services are available upon request.",
    "A shuttle service from the parking lot in centre of the town to the villa is available for free upon request.",
    "Shops and services of any kind are 500 metres from the villa.",
    "Some thermal baths are within 15 kilometres of the property."
  ],
  "url": "http://www.florenceholidays.com/tuscany-luxury-villas-luxury-villa-siena-province-sarteano.html"
}

The paper

image

my problem

In the data example above, 5 different images are bound to the same sentence (index 0).

12927533 data items have the same issue.
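A minimal sketch of how such documents could be detected. It assumes each entry in `image_info` stores its assigned sentence under a key named `matched_text_index` (that key is not shown in the excerpt above, so treat the name as an assumption); the toy document is made up.

```python
from collections import Counter

def has_shared_sentence(doc):
    """Return True if two or more images in the doc are assigned to the same sentence."""
    # "matched_text_index" is an assumed key name for the assigned sentence index.
    counts = Counter(img["matched_text_index"] for img in doc.get("image_info", []))
    return any(n > 1 for n in counts.values())

# Hypothetical document: two images point at sentence 0, one at sentence 3.
toy_doc = {
    "image_info": [
        {"image_name": "a.jpg", "matched_text_index": 0},
        {"image_name": "b.jpg", "matched_text_index": 0},
        {"image_name": "c.jpg", "matched_text_index": 3},
    ]
}
print(has_shared_sentence(toy_doc))
```

Summing this check over all JSON lines of all shards would give a count comparable to the figure quoted above.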

my solution

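from scipy.optimize import linear_sum_assignment

# raw_cost_matrix: 2-D NumPy array, rows = images, columns = sentences.
# linear_sum_assignment minimizes the total cost by default; pass
# maximize=True if the matrix holds similarities rather than costs.
row_ind, col_ind = linear_sum_assignment(raw_cost_matrix)
total_cost = raw_cost_matrix[row_ind, col_ind].sum()
result = list(zip(row_ind, col_ind))

output format: (image_index, sentence_index)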

[(2, 3),
 (3, 0),
 (4, 16),
 (5, 7),
 (6, 22),
 (7, 20),
 (8, 2),
 (9, 14),
 (10, 23),
 (11, 6),
 (13, 17),
 (15, 13),
 (16, 15),
 (17, 18),
 (18, 11),
 (19, 1),
 (20, 8),
 (21, 4),
 (22, 5),
 (23, 12),
 (25, 21),
 (26, 9),
 (28, 19),
 (29, 10)]
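To illustrate the difference between a per-image argmax and a one-to-one assignment, here is a small sketch on a made-up 2x3 similarity matrix (not taken from the dataset). It passes `maximize=True` because the values are similarities rather than costs; whether the snippet above used a negated matrix is not stated, so this is only one possible reading.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy similarity matrix (made up): 2 images (rows) x 3 sentences (columns).
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
])

# Per-image argmax: both images pick sentence 0.
greedy = [(i, int(j)) for i, j in enumerate(sim.argmax(axis=1))]

# One-to-one matching that maximizes the total similarity.
row_ind, col_ind = linear_sum_assignment(sim, maximize=True)
pairs = [(int(i), int(j)) for i, j in zip(row_ind, col_ind)]

print(greedy)  # [(0, 0), (1, 0)]
print(pairs)   # [(0, 0), (1, 1)]
```

With more images than sentences, as in the output above, only as many images as there are sentences receive an assignment.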

CLIP ViT-L/14 weights

Hi, I was wondering which weights you used for computing the image features? So far, I tried with the HF ones and the openAI ones but the feature vectors I get for the images are significantly different from the precomputed ones you shared. I can share some minimal code of what I tried if it helps. Thanks!

Dataset available on Huggingface?

Hi guys! Love the dataset. I want to use it for some training I am going to do, and I want to use Hugging Face datasets. I can upload it and make it public as long as you are cool with it. Let me know if you have any issues with it.

Missing or broken images (due to stale URLs)

For raw images, the main readme has a raw image interest list. For copyright/legal reasons, I can't directly distribute images. Can you provide some statistics about what percentage of missing images you're finding? If a very high number are missing, I can do more thinking about potential solutions.

Originally posted by @jmhessel in #10 (comment)

To answer your question:
I compared the number of docs/samples in the originally published jsonl files with the number of intact docs we were able to extract.
For the full dataset (incl. faces), 16.8% of samples are missing, which is quite high.
I counted the unique URLs in the original data and the unique URLs in our dataset, from which we had removed:

  • Samples that are missing one or more images (could not be downloaded)
  • Samples that contain images that cannot be loaded or decoded, or that are missing part of the image
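The comparison described above can be sketched roughly as follows. The function name and the toy URL sets are hypothetical; the real computation would run over the unique document URLs of all shards.

```python
def missing_rate(original_urls, intact_urls):
    """Fraction of unique original document URLs with no intact sample."""
    original = set(original_urls)
    missing = original - set(intact_urls)
    return len(missing) / len(original)

# Hypothetical toy data: one of five documents could not be reconstructed.
original = ["u1", "u2", "u3", "u4", "u5"]
intact = ["u1", "u2", "u3", "u4"]
rate = missing_rate(original, intact)
print(f"{rate:.1%}")  # 20.0%
```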

Duplicates and multiple versions of samples

Dear authors,
while processing the MMC4 dataset, we found some anomalies and we hope you can comment on or explain these.

Our Expectations

  • There is one full large dataset (mmc4) that includes samples with face detections and there are several subsets of that large dataset that have been filtered:
    • One subset that contains only the samples without face detections (mmc4-ff) (public)
    • One subset that contains only the "core" i.e. samples with strict filtering (mmc4-core)
    • One subset that contains only the intersection of all these (mmc4-core-ff) (public)
  • We assume that those are true subsets, e.g. every sample in mmc4-core-ff would also be contained in mmc4-ff etc.
  • We assume that within each of the subsets, every sample is unique
    • That is, each web page on the internet resulted in at most one sample
    • Of course different web pages under the same domain could result in multiple samples

Our Findings

We found that

  • each of the subsets seems to contain many exact duplicate samples up to a rate of 1-2% of all samples
  • some samples occur multiple times in different subsets but slightly changed, for example with more images or with different similarity measures
  • some subsets don't seem to be true subsets but instead contain samples that are not part of the corresponding larger set or the larger set contains a variant of those

Exact Duplicates

At first, we matched samples by the MD5 hash of the JSON string to find exact duplicates.

For example for mmc4-core-ff, we found 5598117 total samples (i.e. json lines) among all shards, but only 5506430 unique samples.
This means that 1.6% within that subset are exact duplicates.
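A minimal sketch of this kind of exact-duplicate count, assuming the JSON lines of all shards are available as an iterable of strings (the helper name and toy lines are hypothetical). Because the raw string is hashed, two lines with identical content but different key order or whitespace would count as distinct.

```python
import hashlib

def count_exact_duplicates(json_lines):
    """Return (total, unique) counts, matching lines by the MD5 of the raw JSON string."""
    seen = set()
    total = 0
    for line in json_lines:
        total += 1
        seen.add(hashlib.md5(line.strip().encode("utf-8")).hexdigest())
    return total, len(seen)

# Hypothetical toy input: the first two lines are identical.
lines = ['{"url": "a"}', '{"url": "a"}', '{"url": "b"}']
total, unique = count_exact_duplicates(lines)
print(total, unique)  # 3 2

# Duplicate rate for the figures quoted above.
print(f"{(5598117 - 5506430) / 5598117:.1%}")  # 1.6%
```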

Other Duplicates

If we match just by the document URL string, the duplicate rate is higher, in the case of mmc4-core-ff we then obtain only 5492699 unique samples, so 1.9% are duplicates.
Interestingly, the duplicates appear not just twice but up to 88 times each.

Here are the top ten duplicate URLs with the number of appearances:

('https://site.clubrunner.ca/page/clubrunner-mobile-app-now-available', 88),
('https://www.amazon.com.au/All-New-Kindle-With-Front-Light-Black/dp/B07FQ4DJ83', 59),
('https://www.plentygram.com/blog/how-to-make-your-instagram-account-famous/', 46),
('http://www.fuelly.com/', 41),
('https://www.bhhsnv.com/', 39),
('https://www.kikocosmetics.com/en-us/', 34),
('http://www.manchesteruniversitypress.co.uk/articles/freedom-and-the-fifth-commandment-qa-with-brian-heffernan/', 31),
('http://www.manchesteruniversitypress.co.uk/articles/mup-advent-calendar-starts-thursday/', 31),
('https://emeraldcoastbyowner.com/', 29),
('https://www.ait.com/web-development/?typhon', 29)
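A list like the one above could be produced with a sketch along these lines, which groups samples by their `url` field. The helper name and toy lines are hypothetical.

```python
import json
from collections import Counter

def url_duplicates(json_lines, top_n=10):
    """Return (total, unique_urls, most repeated URLs) for an iterable of JSON lines."""
    counts = Counter(json.loads(line)["url"] for line in json_lines)
    total = sum(counts.values())
    # Keep only URLs that actually appear more than once.
    top = [(url, n) for url, n in counts.most_common(top_n) if n > 1]
    return total, len(counts), top

# Hypothetical toy input: URL "a" appears three times.
lines = ['{"url": "a"}', '{"url": "b"}', '{"url": "a"}', '{"url": "a"}']
print(url_duplicates(lines))  # (4, 2, [('a', 3)])
```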

We took a closer look at the first sample with 88 duplicates and found that 87 of those are exact duplicates but 1 is slightly different.
For that 1 sample, the image similarities and the similarity matrix are different, although the text and images match those of the other 87 samples.

Faces vs. No Faces

We assumed that the fewer-faces dataset is simply a filtered version of the sets with faces.
We filtered the set with faces ourselves, keeping only the samples that have face_detections: None.
However, this does not result in the same set as the published fewer faces set.
This effect is related to the similar but slightly different samples mentioned above.
One example is this:
Compare mmc4_core_faces/docs_shard_4943_v3.jsonl.zip sample 113 with mmc4_full_faces/docs_shard_4943_v2.jsonl.zip sample 1523.
Both have the same URL and the core set should be a subset of the full set. However, the second sample contains an additional image with face detections, while all other images contain no face detections.
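The filtering step described above might look like the following sketch. It keeps a sample only if every entry in `image_info` has `face_detections` set to `None`; the toy documents are made up, and this is our reading of the filter, not necessarily how the published fewer-faces set was built.

```python
def has_no_face_detections(doc):
    """True if every image in the document has face_detections == None."""
    return all(
        img.get("face_detections") is None
        for img in doc.get("image_info", [])
    )

# Hypothetical toy documents.
docs = [
    {"url": "x", "image_info": [{"face_detections": None}]},
    {"url": "y", "image_info": [{"face_detections": None},
                                {"face_detections": [[0, 0, 10, 10]]}]},
]
kept = [d["url"] for d in docs if has_no_face_detections(d)]
print(kept)  # ['x']
```

Under this filter, a sample with one extra face-bearing image (as in the second example above) is dropped entirely, which is one way the two sets could diverge.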


Questions

  • How were the 4 sets constructed by the authors?
  • Are our assumptions/expectations correct?
  • If there are multiple different versions of a sample (e.g. one with more images) which one is the correct one?
