Comments (14)

rstojnic avatar rstojnic commented on August 17, 2024

We temporarily disabled dataset browsing because someone was DDoS-ing the website with a bot. It looks like a broken bot that is trying all kinds of nonsensical dataset filters, which is why we've disabled the filters for now. Browsing should be back shortly after we fully identify and block the source.

from paperswithcode-client.

zhimin-z avatar zhimin-z commented on August 17, 2024

Dear @rstojnic,

I hope this message finds you well. After reading your comment, I wanted to reach out and clarify that the activity you've observed may be related to my research, although I am not 100% sure. I've been collecting dataset information for research on dataset evolution, which involves gathering data from various sources, including your platform. Here is my code:

from paperswithcode import PapersWithCodeClient

client = PapersWithCodeClient(token="XXXX")  # token redacted

page = 1
scrape = True
dataset_full = {}  # dataset id -> {'name', 'url'}

while scrape:
    try:
        dataset_page = client.dataset_list(page=page)
        for dataset in dataset_page.results:
            dataset_full[dataset.id] = {
                'name': dataset.name,
                'url': dataset.url,
            }
    except Exception:
        # Stop at the first failed request, e.g. once the page number runs past the last page.
        scrape = False
    page += 1

Please note that my intentions are purely academic, and I sincerely apologize for any unintended strain my actions may have placed on your website. I can assure you that I am not engaged in any malicious activity, such as a DDoS attack.

Would there be a more appropriate method for me to collect this dataset information for research purposes without causing any issues to your platform? Your guidance and support in this matter would be greatly appreciated.

Thank you for your understanding, and I look forward to hearing from you.

Best regards,
Jimmy

zhimin-z avatar zhimin-z commented on August 17, 2024

Dear @rstojnic,

I hope this message finds you well. After reading your comment, I wanted to reach out and clarify that the activity you've observed may be related to my research. I've been collecting dataset information for research on dataset evolution, which involves gathering data from various sources, including your platform. Here is my code:

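import pickle

from paperswithcode import PapersWithCodeClient

client = PapersWithCodeClient(token="XXXX")  # token redacted

page = 1
scrape = True
dataset_full = {}  # dataset id -> {'name', 'url'}

while scrape:
    try:
        dataset_page = client.dataset_list(page=page)
        for dataset in dataset_page.results:
            dataset_full[dataset.id] = {
                'name': dataset.name,
                'url': dataset.url,
            }
    except Exception:
        # Stop at the first failed request, e.g. once the page number runs past the last page.
        scrape = False
    page += 1

# path_meta is not defined in this snippet; set it to an output directory first.
with open(f'{path_meta}/dataset_full.pkl', 'wb') as f:
    pickle.dump(dataset_full, f)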

Please note that my intentions are purely academic, and I sincerely apologize for any unintended strain my actions may have placed on your website. I can assure you that I am not engaged in any malicious activity, such as a DDoS attack.

Would there be a more appropriate method for me to collect this dataset information for research purposes without causing any issues to your platform? Your guidance and support in this matter would be greatly appreciated.

Thank you for your understanding, and I look forward to hearing from you.

Best regards, Jimmy

I did send an email explaining this a few days ago, but I have not received a reply yet, so I went ahead and tried collecting the data through this API.

rstojnic avatar rstojnic commented on August 17, 2024

Hi @zhimin-z, there is no need to scrape the website; all the data is available at: https://github.com/paperswithcode/paperswithcode-data

zhimin-z avatar zhimin-z commented on August 17, 2024

Thank you for sharing the GitHub repository containing the Papers with Code data. I truly appreciate your guidance and support in accessing the dataset information for my research on the evolution of PWC datasets.

While exploring the repository, I noticed that it was last updated three years ago. I am particularly interested in the latest dataset information, as my research primarily focuses on the evolution and current state of PWC datasets. Access to the most up-to-date data is crucial for the accuracy and relevance of my work.

Would you be able to confirm if the dataset information in the GitHub repository is the most recent available, or is there another source I should refer to for the latest data? Your assistance in this matter is invaluable to the success of my research.

Thank you once again for your help, and I look forward to your response.

Best regards,
Jimmy

rstojnic avatar rstojnic commented on August 17, 2024

The repo itself is old because it's just a README. The links point back to our S3 bucket that should be updated every day.
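For readers following along: once an export file linked from that README has been downloaded, it can be read locally instead of paging through the API. The following is a minimal sketch, assuming the export is a gzip-compressed JSON file. The file name `datasets.json.gz` is a placeholder, and the helper `load_export` is hypothetical, not part of any Papers with Code tooling.

```python
import gzip
import json
from pathlib import Path


def load_export(path):
    """Read a gzip-compressed JSON export and return the parsed object."""
    with gzip.open(Path(path), "rt", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Placeholder path: point this at a file downloaded via the README links.
    export_path = Path("datasets.json.gz")
    if export_path.exists():
        records = load_export(export_path)
        print(f"Loaded {len(records)} records")
    else:
        print(f"{export_path} not found; download an export first")
```

Reading a local copy puts no load on the website, which is the point of the suggestion above.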

zhimin-z avatar zhimin-z commented on August 17, 2024

Hmm, I found that some datasets are missing from the downloadable JSON files. For example, HELM and HEIM are not in the Datasets file, which is why I initially thought these files might be obsolete. What criteria do you use when generating the Datasets file and the Evaluation tables file?
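A check like the one described here can be scripted. This is a minimal sketch, assuming the parsed export is a list of dictionaries that each carry a `name` key; that layout is an assumption, and `missing_names` is a hypothetical helper. The sample records are illustrative only and are not taken from the real export.

```python
def missing_names(records, wanted):
    """Return the names in `wanted` that no record carries (case-insensitive)."""
    present = {str(r.get("name", "")).strip().lower() for r in records}
    return [name for name in wanted if name.strip().lower() not in present]


# Illustrative records only; a real check would use the parsed export file.
sample = [{"name": "GLUE"}, {"name": "SAMSum"}, {"name": "helm"}]
print(missing_names(sample, ["HELM", "HEIM"]))  # prints ['HEIM']
```

Matching is case-insensitive so that a difference in capitalization is not reported as a missing dataset.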

rstojnic avatar rstojnic commented on August 17, 2024

They should all be there. If they are not, the export might be stuck. @alefnula @andrewkuanop

zhimin-z avatar zhimin-z commented on August 17, 2024

Also, the community evaluation tables, such as https://paperswithcode.com/sota/text-classification-on-glue and https://paperswithcode.com/sota/abstractive-dialogue-summarization-on-samsum, are not in the Evaluation tables file either. Is it possible to download the community evaluation tables from any source? If not, what request interval would you suggest if I scrape them on my own?
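If scraping does turn out to be necessary, pausing between requests keeps the load low. This is a minimal sketch of throttled pagination, not a recommendation from the maintainers: the thread does not state an acceptable interval, so `delay_seconds=2.0` is an arbitrary assumption. `fetch_all_pages` and `fetch_page` are hypothetical names, and `fetch_page` is supplied by the caller.

```python
import time


def fetch_all_pages(fetch_page, delay_seconds=2.0, max_pages=1000, sleep=time.sleep):
    """Collect results page by page, pausing between requests.

    `fetch_page(page)` is a caller-supplied function that returns a list of
    items for that page, or an empty list when there are no more pages.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        items.extend(batch)
        sleep(delay_seconds)  # assumed interval; the thread does not specify one
    return items
```

With the client shown earlier in the thread, `fetch_page` could wrap `client.dataset_list(page=page).results`, plus whatever end-of-list handling that call needs. The `max_pages` cap guards against a loop that never terminates.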

zhimin-z avatar zhimin-z commented on August 17, 2024

Many evaluation tables shown on the website seem to be missing from the Evaluation tables file.
Here is what the Evaluation tables file gives (9238 datasets in total):
image
Here is what I collected (from the 100 displayable pages of PWC datasets, 4800 datasets in total):
image

Overall, records on the order of ten thousand or more are missing from your online archive, and this does not even count the evaluations for datasets beyond the first 100 pages of the PWC website. @rstojnic @alefnula @andrewkuanop
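A comparison like this can be made reproducible by diffing the two collections by identifier. This is a minimal sketch, assuming each evaluation table can be reduced to a unique string ID such as its URL slug. `compare_ids` is a hypothetical helper, and the IDs below are made up for illustration.

```python
def compare_ids(export_ids, scraped_ids):
    """Summarize which IDs appear in only one of the two collections."""
    export_ids, scraped_ids = set(export_ids), set(scraped_ids)
    return {
        "only_in_export": sorted(export_ids - scraped_ids),
        "only_in_scrape": sorted(scraped_ids - export_ids),
        "in_both": len(export_ids & scraped_ids),
    }


# Made-up identifiers, for illustration only.
summary = compare_ids(["table-a", "table-b"], ["table-b", "table-c"])
print(summary)
```

The `only_in_scrape` list holds the concrete records to report as missing from the export, which is easier to act on than a difference in totals.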

andrewkuanop avatar andrewkuanop commented on August 17, 2024

zhimin-z avatar zhimin-z commented on August 17, 2024

https://paperswithcode.com/sota/abstractive-dialogue-summarization-on-samsum

Thanks for your reply, @andrewkuanop

For evaluation tables, https://paperswithcode.com/sota/text-classification-on-glue is now in the Evaluation tables file, but https://paperswithcode.com/sota/abstractive-dialogue-summarization-on-samsum is still missing.

For datasets, both HELM and HEIM are still not in the Datasets file.

I think the issue still persists...

andrewkuanop avatar andrewkuanop commented on August 17, 2024

zhimin-z avatar zhimin-z commented on August 17, 2024

https://paperswithcode.com/sota/abstractive-dialogue-summarization-on-samsum

Thanks, @andrewkuanop

After checking, I found the dataset issue is solved. Both HELM and HEIM are in the Datasets now.

However, I found https://paperswithcode.com/sota/text-classification-on-glue is available in the Evaluation tables, but https://paperswithcode.com/sota/abstractive-dialogue-summarization-on-samsum is still not.

I think the issue still persists for evaluation tables.
