continuumio / anaconda-package-data
Conda package download data
License: Creative Commons Attribution 4.0 International
March data missing and pkg_python info missing. Issue moved from: conda-incubator/condastats#15
Can we please add nvidia channel (https://anaconda.org/nvidia) so we can get download stats for all packages within?
Currently, I don't see any download counts using condastats.
Can we please add the mindspore channel (https://anaconda.org/mindspore)? We are working on open source evaluation, and need mindspore download stats from condastats.
I ran into RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']
while using the binder notebook in this repo.
Firstly: this is a great data source. Thanks for providing it!
I'd love to be able to get the same type of data for a specific anaconda cloud channel that isn't one of the big ones (i.e. not anaconda, conda-forge, or bioconda) so that I can more easily track adoption by OS and Python version for the packages we distribute. Is there an API (or scripts) that I can use for this?
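Once a channel's data is in the dataset, the per-OS and per-Python breakdown itself is a simple aggregation. A minimal sketch with pandas, using column names assumed from this repo's schema (pkg_platform, pkg_python, counts) and entirely hypothetical rows:

```python
import pandas as pd

# Hypothetical rows mimicking the anaconda-package-data schema
# (column names assumed from the repo's README).
rows = pd.DataFrame(
    {
        "pkg_name": ["mypkg", "mypkg", "mypkg", "mypkg"],
        "pkg_platform": ["linux-64", "linux-64", "win-64", "osx-64"],
        "pkg_python": ["3.10", "3.11", "3.10", "3.10"],
        "counts": [120, 80, 40, 10],
    }
)

# Adoption by OS (platform) and by Python version for one package
by_platform = rows.groupby("pkg_platform")["counts"].sum()
by_python = rows.groupby("pkg_python")["counts"].sum()
print(by_platform.to_dict())  # {'linux-64': 200, 'osx-64': 10, 'win-64': 40}
print(by_python.to_dict())    # {'3.10': 170, '3.11': 80}
```

The same groupby works on the real hourly Parquet files once they are loaded with dask or pandas.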
As requested by @sophiamyang, I am passing on an issue I opened for condastats, since that package depends on the data pipeline in this very repo:
Unable to use condastats.cli.overall (internal error on pandas->pyArrow)
dataconda = condastats.cli.overall([conda_module], monthly=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[...]/lib/python3.11/site-packages/condastats/cli.py", line 62, in overall
df = dd.read_parquet(
^^^^^^^^^^^^^^^^
File "[...]/python3.11/site-packages/dask/backends.py", line 138, in wrapper
raise type(e)(
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of string type
Thank you for making this data and the documented methods available - fantastic stuff!
I noticed that when attempting to use the intake methods from the README.md, there are Pandas/PyArrow errors with recent versions of Pandas (>=v2.0.0). This appears to also affect condastats, though maybe through different means. I imagine, but don't know, that this could be a Pandas or Dask DataFrame issue at the core, but I also wondered about data type management within the Parquet files related to this repo (for example, are there incompatible types which users should be made aware of?). While the fix might be an external issue, maybe this report could help with increased or updated documentation here.
Specifically, the errors I most often saw were:
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of string type
There also may have been errors regarding "Pandas categorical types".
I worked around the issue by looking at the last modified date of the README.md (around January 2020) and installing a version of Pandas from around that time (v1.3.5 worked for me).
Hello Anaconda team. We would like to retrieve anaconda download statistics for PyTorch packages.
For this we would need to add the following channels to the anaconda-package-data repo:
pytorch : https://anaconda.org/pytorch/
pytorch-test : https://anaconda.org/pytorch-test/
This way we can query them using the condastats package.
It appears the last data uploaded was for August. Would it be possible to include the last 2 months?
I have installed Anaconda3-2021.05-windows-x86_64.exe, but no package named "Crypto" is found. Does this package exist only in the Linux version?
Hi,
Is there some threshold or rule for inclusion in the stats? The package I'm looking for but can't find is conda-forge/arcticdb.
https://anaconda.org/conda-forge/arcticdb
The package page says 50k downloads but I can't find it in the monthly parquet files.
Thanks,
Installing Anaconda on Linux Mint (a distro based on Ubuntu) runs into problems due to the missing keyword "linuxmint" in vscode.py when detecting the OS type. Presently, only "debian" and "ubuntu" are listed in this file for the branch of Linux distros using deb package managers. As a result, running anaconda-navigator fails without this keyword.
The file supplied has the additions needed for Linux Mint. It runs perfectly. Location:
~/anaconda3/lib/python3.7/site-packages/anaconda_navigator/api/external_apps/vscode.py
vscode.py.zip
I get this error when trying to get download information from anaconda.
SyntaxError: invalid non-printable character U+202F
It was fine in June, but this started in July.
The examples in the binder notebook are failing with this error:
>>> df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2018/12/2018-12-31.parquet',
... storage_options={'anon': True})
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-37350afb994b> in <module>
1 df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2018/12/2018-12-31.parquet',
----> 2 storage_options={'anon': True})
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, **kwargs)
135 if hasattr(path, "name"):
136 path = stringify_path(path)
--> 137 fs, _, paths = get_fs_token_paths(path, mode="rb", storage_options=storage_options)
138
139 paths = sorted(paths, key=natural_sort_key) # numeric rather than glob ordering
/srv/conda/envs/notebook/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol)
313 cls = get_filesystem_class(protocol)
314
--> 315 options = cls._get_kwargs_from_urls(urlpath)
316 path = cls._strip_protocol(urlpath)
317 update_storage_options(options, storage_options)
AttributeError: type object 'S3FileSystem' has no attribute '_get_kwargs_from_urls'
I guess s3fs changed the API in a recent version and should be pinned in environment.yml.
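A minimal sketch of such a pin, assuming the binder environment is defined in environment.yml; the package list and version bound shown here are placeholders, not the confirmed fix:

```yaml
# environment.yml (fragment) - illustrative only; the exact s3fs release
# that matches the notebook's fsspec is not stated in this issue.
dependencies:
  - dask
  - intake
  - s3fs=0.4  # placeholder version; pin to a release tested with the notebook
```

Pinning both s3fs and fsspec to a known-good pair would keep the binder examples reproducible.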
when running this (both from binder and in my own conda environment, python 3.7, both on windows and linux):
cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
df = cat.anaconda_package_data_by_year(year=2019).to_dask()
I get the following error:
ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied
This used to work one month ago or so. Any ideas of what's wrong?
It seems to work fine if I say year=2018.
Thanks!
Hi,
I am using the condastats package, which relies on anaconda-package-data. When running
import condastats.cli
condastats.cli.overall('numpy')
I get the error message
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 110, in _error_wrapper
return await func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/aiobotocore/client.py", line 265, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/condastats", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/condastats/cli.py", line 387, in main
overall(
File "/usr/local/lib/python3.8/site-packages/condastats/cli.py", line 87, in overall
df = df.compute()
File "/usr/local/lib/python3.8/site-packages/dask/base.py", line 315, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/usr/local/lib/python3.8/site-packages/dask/base.py", line 598, in compute
results = schedule(dsk, keys, **kwargs)
File "/usr/local/lib/python3.8/site-packages/dask/threaded.py", line 89, in get
results = get_async(
File "/usr/local/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
raise_exception(exc, tb)
File "/usr/local/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
raise exc
File "/usr/local/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/usr/local/lib/python3.8/site-packages/dask/optimization.py", line 990, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/usr/local/lib/python3.8/site-packages/dask/core.py", line 149, in get
result = _execute_task(task, cache)
File "/usr/local/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 89, in __call__
return read_parquet_part(
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 587, in read_parquet_part
dfs = [
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 588, in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 435, in read_partition
arrow_table = cls._read_table(
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 1518, in _read_table
arrow_table = _read_table_from_path(
File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 239, in _read_table_from_path
return pq.ParquetFile(fil, **pre_buffer).read(
File "/usr/local/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 277, in __init__
self.reader.open(
File "pyarrow/_parquet.pyx", line 1213, in pyarrow._parquet.ParquetReader.open
File "/usr/local/lib/python3.8/site-packages/fsspec/spec.py", line 1578, in read
out = self.cache._fetch(self.loc, self.loc + length)
File "/usr/local/lib/python3.8/site-packages/fsspec/caching.py", line 41, in _fetch
return self.fetcher(start, stop)
File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 2030, in _fetch_range
return _fetch_range(
File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 2173, in _fetch_range
resp = fs.call_s3(
File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 86, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 66, in sync
raise return_result
File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 26, in _runner
result[0] = await coro
File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 332, in _call_s3
return await _error_wrapper(
File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 137, in _error_wrapper
raise err
PermissionError: Access Denied
The owner of condastats asked me to open an issue here (see conda-incubator/condastats#16).
Thank you very much for your kind help,
Cheers,
Tom.
It would be helpful to include both .conda and .tar.bz2 packages, particularly as more of the former and fewer of the latter are produced. It may also help to track these separately, to follow the transition to the newer format.
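As a sketch of the separate tracking suggested above (the filenames here are hypothetical), the two formats can be told apart by extension:

```python
from collections import Counter

# Hypothetical artifact filenames; counting the two conda package formats
# separately makes the transition to .conda visible over time.
filenames = [
    "numpy-1.26.0-py311h64a7726_0.conda",
    "numpy-1.24.4-py38h10c12cc_0.tar.bz2",
    "pandas-2.1.1-py311h320fe9a_1.conda",
    "scipy-1.10.1-py38h10c12cc_0.tar.bz2",
]

def pkg_format(name: str) -> str:
    """Classify a package artifact by its format extension."""
    return ".conda" if name.endswith(".conda") else ".tar.bz2"

by_format = Counter(pkg_format(f) for f in filenames)
print(by_format)  # Counter({'.conda': 2, '.tar.bz2': 2})
```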
Hi!
The cudatoolkit package at https://anaconda.org/conda-forge/cudatoolkit is very old. Is it possible to update it to the latest version (12.2.0)?
Thanks!
Description
Using condastats, the data show an exponential increase in downloads over the last few months. While we're confident in the quality of our package ;-), this seems unrealistic and, in any case, unexpected (×100 between 2023-12 and 2024-05!).
Do you have any idea why these variations are occurring?
condastats overall pyagrum --monthly
[...]
2023-08 2484
2023-09 2433
2023-10 4560
2023-11 3154
2023-12 1114
2024-01 2829
2024-02 2812
2024-03 12573
2024-04 66098
2024-05 110944
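As a quick sanity check on the scale of the jump, using the counts quoted above:

```python
# Monthly counts copied from the condastats output quoted in this issue
counts = {
    "2023-12": 1114,
    "2024-03": 12573,
    "2024-04": 66098,
    "2024-05": 110944,
}

# Overall growth factor between 2023-12 and 2024-05
overall = counts["2024-05"] / counts["2023-12"]
# Month-over-month jump from March to April 2024
growth_apr = counts["2024-04"] / counts["2024-03"]

print(round(overall))        # 100  (roughly the ×100 the reporter mentions)
print(round(growth_apr, 1))  # 5.3
```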
Thank you for any hints, explanation or information on this subject
(Copy of conda-incubator/condastats#22)
We are facing access issues with the March data files at the S3 path below.
s3://anaconda-package-data/conda/hourly/2023/03/
Error: fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden
pyviz.org fetches data from here to display download stats for various viz and dashboard packages and related projects. Among those, plotly has its own conda channel and gets downloaded from there a non-negligible number of times. Could the plotly channel be added to the dataset?
It seems that the database holding a given month's daily download data is populated monthly, not daily.
As of today (2022-06-16), download data for 2022-06-01 through 2022-06-15 is not available, which makes it hard to collect statistics (e.g., the download count for the last 30 days).
It would be great if the database were updated daily.
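For anyone scripting a last-30-days count in the meantime, the daily file keys can be enumerated from the path pattern seen elsewhere in these issues; given the monthly update cadence, recent keys may simply not exist yet:

```python
from datetime import date, timedelta

def hourly_paths(end: date, days: int = 30) -> list[str]:
    """Build the expected S3 keys for the last `days` days of hourly data,
    using the pattern s3://anaconda-package-data/conda/hourly/YYYY/MM/YYYY-MM-DD.parquet.
    Existence of each key depends on how far the pipeline has caught up."""
    return [
        f"s3://anaconda-package-data/conda/hourly/{d:%Y}/{d:%m}/{d:%Y-%m-%d}.parquet"
        for d in (end - timedelta(n) for n in range(days))
    ]

paths = hourly_paths(date(2022, 6, 16))
print(paths[0])  # s3://anaconda-package-data/conda/hourly/2022/06/2022-06-16.parquet
```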
There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.
Error type: Cannot find preset's package (github>anaconda/renovate-config)
It seems as if there isn't a Parquet file for March (yet?): https://s3.amazonaws.com/anaconda-package-data/conda/monthly/2024/2024-03.parquet
Request from @jakirkham on behalf of the RAPIDS team.
Originally filed in conda/infrastructure#660