c-3lab / dim

📦 dim: Manage the open data in your project like a package manager.

License: MIT License

TypeScript 98.95% Shell 0.29% Python 0.53% HTML 0.24%
cli command-line-tool commads opendata dim package-manager data dataops public-data public-dataset

dim's People

Contributors

champierre, jqinglong, k-oizumi-abel, minheibis, mkyutani, osoken, ryo-ma, sheile, syuparn, t-kurasawa, ta-hirose, takahashim, takayasukoura, tanimuranaomichi, to-ki-o

dim's Issues

Add "XLS to CSV" converter to preprocess

It seems that the "XLS to CSV" converter is only half working.

I tried

dim install https://www.city.chofu.tokyo.jp/www/contents/1489047638868/simple/1.xls -n "東京都調布市市立小・中学校一覧" -p xlsx-to-csv

The result was that only 1.xls was downloaded, although it was converted to CSV.
I would like both 1.xls and 1.csv to be placed in the data_files folder.

Python library to simplify the interaction with dim

Thank you for creating this project :)
A data installation manager is absolutely needed by the open source community. I faced some difficulties when developing a COVID-19 dataset and a data analysis tool in Python.

Is it possible to add a Python (+R?) library to simplify interaction with dim?
(I'm not sure whether we can call Deno from Python...)
Users might use the new library as follows.

  1. Install the library, e.g. poetry add dim-python
  2. Write the settings, including commands, in "pyproject.toml". This TOML file is currently the standard file for managing Python libraries.
[tool.dim]
directory = './data_files'

# One [[tool.dim.datasets]] table per dataset
# (TOML inline tables cannot span multiple lines).
[[tool.dim.datasets]]
name = 'example'
url = 'https://example.com'
unzip = true
forced = true
encoding = 'utf-8'
postprocess = ["poetry run python ./tests/test_custom_command.py"]
  3. Update datasets with poetry run dim update, or

  4. Update/load the dataset with Python scripts.

import dim
dim.config(settings='./data_files/dim-lock.json')
data = dim.load(name='example')

I'm just a new user, but very interested in this project.

Necessity of -A option for update when name is specified

When a name is specified and update is performed on a single dataset, the presence or absence of the -A option has no effect on the operation.

The following two commands behave identically.

dim update name
dim update name -A

Proposals:

Disallow specifying a name and the -A option at the same time.
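
The proposal above could be sketched as a small validation step that runs before the update starts. This is only an illustration: the `UpdateOptions` shape, the `flagA` field and the error text are hypothetical and are not taken from dim's source.

```typescript
// Hypothetical option shape; dim's real option object may look different.
interface UpdateOptions {
  flagA?: boolean; // assumed to be true when -A was passed on the command line
}

// Returns an error message when the arguments conflict, or null when they are fine.
function validateUpdateArgs(
  name: string | undefined,
  options: UpdateOptions,
): string | null {
  if (name !== undefined && options.flagA === true) {
    return "A name and the -A option cannot be specified at the same time.";
  }
  return null;
}
```

With such a check, `dim update name -A` would stop with a message instead of silently behaving like `dim update name`.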

Deno update is preventing "Check type" from passing.

The version of Deno was updated to 1.26.0 on 2022.09.28.
Subsequently, the following error occurred in the Check type of CI.

error: TS1477 [ERROR]: An instantiation expression cannot be followed by a property access.
          () => Promise<number>.resolve(4),
                       ~~~~~~~~
    at file:///home/runner/work/dim/dim/tests/libs/actions.search.test.ts:451:24
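
One possible fix, assuming the test only needs a function that returns a promise resolving to a number, is to move the type argument from `Promise` onto the `resolve` call. The name `resolveFour` is made up for this sketch.

```typescript
// Form rejected by the type checker in the CI log above:
//   () => Promise<number>.resolve(4)
// Same intent, with the type argument placed on the method call instead:
const resolveFour = (): Promise<number> => Promise.resolve<number>(4);
```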

[WIP] Change the structure of dim.json

dim.json

  • name
  • url
  • title
  • post_process (pre_process)
  • revision
  • source
  • source_url
  • source_resource_id

dim-lock.json

  • name
  • url
  • title
  • post_process (pre_process)
  • revision
  • source
  • source_url
  • source_resource_id
  • integrity

Post-processing CMD of the install command to accommodate redirection of results

">" is not recognized as a redirection operator.
Deno.run probably treats ">" as a plain string argument.

Proposals:

First draft

  1. Extract "> xxxx" and similar parts with a regular expression to obtain the file name of the redirect destination
  2. Specify "piped" for stdout and stderr in Deno.run so that the program can handle the standard output
  3. If a redirect was specified, save the captured stdout to that file with Deno.writeFileSync

Pros

A single implementation can be used in a variety of environments.

Cons

We have to handle complex file names and redirect forms ourselves (>>, 2>>, 1>&2, etc.).
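
Step 1 of the first draft could look roughly like the sketch below. It only recognizes a trailing "> file" or ">> file" whose file name contains no whitespace; forms such as 2>> or 1>&2 are deliberately not handled, which illustrates the con listed above. The names `ParsedCommand` and `parseRedirect` are hypothetical.

```typescript
interface ParsedCommand {
  command: string; // command text with the redirect part removed
  redirectTo?: string; // destination file when a redirect was found
  append: boolean; // true for ">>", false otherwise
}

function parseRedirect(input: string): ParsedCommand {
  // Trailing "> file" or ">> file"; the file name must not contain whitespace.
  const match = input.match(/^(.*?)\s*(>>?)\s*(\S+)\s*$/);
  if (!match) {
    return { command: input.trim(), append: false };
  }
  const [, command = "", operator = ">", file = ""] = match;
  return { command: command.trim(), redirectTo: file, append: operator === ">>" };
}
```

The caller would then run `command` with piped stdout and write the captured bytes to `redirectTo`, as described in steps 2 and 3.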

Second draft

If -p "cmd wc -c > /tmp/test.txt" is specified, run the command through /bin/sh as follows.

Deno.run({ cmd: ["/bin/sh", "-c", "wc -c data_files/xxx/xxx.zip > /tmp/test.txt"]})

Pros

/bin/sh handles redirects, so there is no need to implement our own processing.
If a feature that sends the downloaded file to standard input is implemented, the string received with the -p option can be used as is.

Deno.run({ cmd: ["/bin/sh", "-c", "wc -c > /tmp/test.txt"], stdin: xxxx })

Cons

The command has to be changed for each environment (/bin/sh for Linux and Mac, cmd for Windows).
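
The per-environment switch mentioned in the cons of the second draft might be isolated in one small helper, sketched here. `wrapInShell` and `OsName` are hypothetical names; in Deno the first argument could come from `Deno.build.os`.

```typescript
type OsName = "linux" | "darwin" | "windows";

// Builds the argument list that hands the whole command line to the platform shell.
function wrapInShell(os: OsName, commandLine: string): string[] {
  return os === "windows"
    ? ["cmd", "/c", commandLine]
    : ["/bin/sh", "-c", commandLine];
}
```

The returned array can be passed as the `cmd` value shown in the examples above, so the rest of the code does not need to know which shell is used.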

Destination of the unzip decompression seems to be wrong

When the install command is run with unzip in the -p option, the unzipped files are generated in the current directory rather than under data_files, unlike xlsx-to-csv.

Proposals:

Change it so that the files are generated in the same directory as the downloaded file, as with xlsx-to-csv.

Error of OpenAI API request is not shown as the error message

} catch (error) {
  console.error(
    Colors.red(`\n${error.message}`),
  );

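Regarding this part:
when an error occurs, a message such as "Request failed with status code 404 Not Found" is displayed.
However, this is ky's error message, and the error message in the OpenAI response (see the example below) does not appear to be displayed.

Example command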

curl -i https://api.openai.com/v1/completions -H \
    "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d '{
         "model": "gpt-3.5-turbo",
         "prompt": "generate the code that prints the following messange in python: this is a test",
         "max_tokens": 7,
         "temperature": 0
    }'

Output

HTTP/2 404 
date: Thu, 15 Jun 2023 12:13:02 GMT
content-type: application/json
content-length: 227
access-control-allow-origin: *
openai-organization: albert-inc-1
openai-processing-ms: 268
openai-version: 2020-10-01
strict-transport-security: max-age=15724800; includeSubDomains
x-ratelimit-limit-requests: 3500
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-requests: 3499
x-ratelimit-remaining-tokens: 89992
x-ratelimit-reset-requests: 17ms
x-ratelimit-reset-tokens: 4ms
x-request-id: 9bdd4ddbcedcce031c323eb331076f87
cf-cache-status: DYNAMIC
server: cloudflare
cf-ray: 7d7ab9887db0afdb-NRT
alt-svc: h3=":443"; ma=86400

{
  "error": {
    "message": "This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?",
    "type": "invalid_request_error",
    "param": "model",
    "code": null
  }
}

I would like the "message" value of this response, "This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?", to be displayed.
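
A minimal sketch of the requested behavior, assuming the parsed response body is available in the catch block (with ky, presumably through the response attached to the caught error). The helper name `extractApiErrorMessage` is hypothetical; the body shape is the one shown in the output above.

```typescript
// Returns error.message from an API error body, or the fallback when the body
// does not have the expected { "error": { "message": "..." } } shape.
function extractApiErrorMessage(body: unknown, fallback: string): string {
  if (typeof body === "object" && body !== null && "error" in body) {
    const err = (body as { error: unknown }).error;
    if (typeof err === "object" && err !== null && "message" in err) {
      const message = (err as { message: unknown }).message;
      if (typeof message === "string" && message.length > 0) {
        return message;
      }
    }
  }
  return fallback;
}
```

The fallback would be the existing `error.message`, so responses without a JSON error body still print the current text.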

Add dim clean

Delete only data_files.
To delete everything, use rm and then init.

Install failed with error TS1192

TS1192 now occurs while installing dim. (Reproducible.)

  • Error details
error: TS1192 [ERROR]: Module '"https://jspm.dev/xlsx.js"' has no default export.
import xlsxlib from 'https://jspm.dev/xlsx'
       ~~~~~~~
    at https://deno.land/x/[email protected]/src/xlsx.ts:1:8
  • When the error occurs: deno install
  • Versions, etc.
# deno --version
deno 1.19.1 (release, x86_64-unknown-linux-gnu)
v8 9.9.115.7
typescript 4.5.2
# git for-each-ref
5383d922b715002dc8706fb6af8e4a53b125b8bd commit refs/heads/main
5383d922b715002dc8706fb6af8e4a53b125b8bd commit refs/remotes/origin/HEAD
5383d922b715002dc8706fb6af8e4a53b125b8bd commit refs/remotes/origin/main
8c54c493a4588103a9f715812e6f7dad467a9853 commit refs/tags/v0.1.3
5f4309e2f10008ad01582ccbc92ec7327a858df8 commit refs/tags/v0.1.4
b3092d7d4a9208437c60cae05de43a747503fbea commit refs/tags/v0.1.5

Operation log

Reproduced with the following steps in a fresh ubuntu container.

$ sudo docker run -it --rm ubuntu /bin/bash

The following steps are run inside the container.

  • Install the prerequisites
# apt update; apt upgrade
# apt install git curl unzip
  • git config & SSH setup
  • Install deno
# curl -fsSL https://deno.land/install.sh | sh
# echo 'export DENO_INSTALL="/root/.deno"' >> ~/.bashrc
# echo 'export PATH="$DENO_INSTALL/bin:$PATH"' >> ~/.bashrc
# source ~/.bashrc
  • Install dim
# git clone [email protected]:ryo-ma/dim.git
# cd dim
# deno install --unstable --allow-read --allow-write --allow-run --allow-net dim.ts
Download https://cdn.skypack.dev/encoding-japanese
Download https://deno.land/[email protected]/fmt/colors.ts
Download https://deno.land/[email protected]/fs/mod.ts
...
Download https://deno.land/x/[email protected]/src/xlsx-types.ts
Download https://jspm.dev/xlsx
...
Check file:///root/dim/dim.ts
error: TS1192 [ERROR]: Module '"https://jspm.dev/xlsx.js"' has no default export.
import xlsxlib from 'https://jspm.dev/xlsx'
       ~~~~~~~
    at https://deno.land/x/[email protected]/src/xlsx.ts:1:8

Can't download file when specify URL without filename

Currently, a downloaded file is created at data_files/{name}/{filename}.
If the URL matches the following pattern, the current logic cannot determine a filename and an error occurs.

$ dim install https://www.example.com -n example1
Failed to install. Is a directory (os error 21), open './data_files/example1/'

Proposals:

Use the Content-Disposition response header to determine the filename,
and fall back to the value of the --name option as the filename if the server does not send the header.
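
The proposal could be sketched as follows. `resolveFilename` is a hypothetical helper, not dim's actual code; it handles the plain filename="..." and filename=... forms only, and the extended filename*= form is left out to keep the sketch short.

```typescript
// Picks a filename from a Content-Disposition header value, falling back to
// the given name (for example the --name value) when none can be found.
function resolveFilename(
  contentDisposition: string | null,
  fallbackName: string,
): string {
  if (contentDisposition) {
    const match = contentDisposition.match(
      /filename\s*=\s*(?:"([^"]+)"|([^;\s]+))/i,
    );
    const candidate = match?.[1] ?? match?.[2];
    if (candidate) {
      // Keep only the last path segment so the header cannot escape the target directory.
      const base = candidate.split(/[\\/]/).pop();
      if (base) {
        return base;
      }
    }
  }
  return fallbackName;
}
```

The header value would typically come from `response.headers.get("content-disposition")` on the fetch response.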

catalogResourceId/Url is removed by `dim update`

The catalogUrl and catalogResourceId fields are replaced with null by dim update when the data was originally fetched with dim search -i.

before

// dim.json
{
  "fileVersion": "1.1",
  "contents": [
    {
      "url": "https://www.geospatial.jp/ckan/dataset/30b5f8dc-8957-4b4b-880f-f348e272f591/resource/f2d3ad73-83db-45e4-a11d-48bdd15fe60b/download/14nagayotownhinan.csv",
      "name": "42_長崎県_長与町避難所_長与町避難所",
      "catalogUrl": "https://www.geospatial.jp/ckan/dataset/42000-013",
      "catalogResourceId": "f2d3ad73-83db-45e4-a11d-48bdd15fe60b",
      "postProcesses": [],
      "headers": {}
    }
  ]
}
// dim-lock.json
{
  "lockFileVersion": "1.1",
  "contents": [
    {
      "name": "42_長崎県_長与町避難所_長与町避難所",
      "url": "https://www.geospatial.jp/ckan/dataset/30b5f8dc-8957-4b4b-880f-f348e272f591/resource/f2d3ad73-83db-45e4-a11d-48bdd15fe60b/download/14nagayotownhinan.csv",
      "path": "./data_files/42_長崎県_長与町避難所_長与町避難所/14nagayotownhinan.csv",
      "catalogUrl": "https://www.geospatial.jp/ckan/dataset/42000-013",
      "catalogResourceId": "f2d3ad73-83db-45e4-a11d-48bdd15fe60b",
      "lastModified": "2018-08-27T14:32:21.000Z",
      "eTag": "ff6b437fe66ac28b776a16a249f62b36",
      "lastDownloaded": "2023-05-27T04:45:49.037Z",
      "integrity": "d3db097cb5c1213821bb79730d5c895160302f6b",
      "postProcesses": [],
      "headers": {}
    }
  ]
}

after

// dim.json
{
  "fileVersion": "1.1",
  "contents": [
    {
      "name": "42_長崎県_長与町避難所_長与町避難所",
      "url": "https://www.geospatial.jp/ckan/dataset/30b5f8dc-8957-4b4b-880f-f348e272f591/resource/f2d3ad73-83db-45e4-a11d-48bdd15fe60b/download/14nagayotownhinan.csv",
      "catalogUrl": null,
      "catalogResourceId": null,
      "postProcesses": [],
      "headers": {}
    }
  ]
}
// dim-lock.json
{
  "lockFileVersion": "1.1",
  "contents": [
    {
      "name": "42_長崎県_長与町避難所_長与町避難所",
      "url": "https://www.geospatial.jp/ckan/dataset/30b5f8dc-8957-4b4b-880f-f348e272f591/resource/f2d3ad73-83db-45e4-a11d-48bdd15fe60b/download/14nagayotownhinan.csv",
      "path": "./data_files/42_長崎県_長与町避難所_長与町避難所/14nagayotownhinan.csv",
      "catalogUrl": null,
      "catalogResourceId": null,
      "lastModified": "2018-08-27T14:32:21.000Z",
      "eTag": "ff6b437fe66ac28b776a16a249f62b36",
      "lastDownloaded": "2023-05-27T04:46:51.049Z",
      "integrity": "d3db097cb5c1213821bb79730d5c895160302f6b",
      "postProcesses": [],
      "headers": {}
    }
  ]
}
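
One way to avoid losing these values, assuming the update step builds a fresh entry without catalog information, would be to carry the previous values over when the new ones are null. The root cause in dim is not confirmed here, and `preserveCatalogFields` is a hypothetical helper.

```typescript
interface CatalogFields {
  catalogUrl: string | null;
  catalogResourceId: string | null;
}

// Copies the catalog fields from the previous entry when the updated entry has null.
function preserveCatalogFields<T extends CatalogFields>(
  previous: CatalogFields,
  updated: T,
): T {
  return {
    ...updated,
    catalogUrl: updated.catalogUrl ?? previous.catalogUrl,
    catalogResourceId: updated.catalogResourceId ?? previous.catalogResourceId,
  };
}
```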

Add dim verify

Check for corruption under data_files using the integrity value in dim-lock.json.

A SHA-512 digest is 128 characters in hexadecimal notation, so it is a little hard to read.

If the hash is used for detecting corruption rather than for security, consider a shorter digest such as SHA-1 (40 hexadecimal characters).

Since this is not a file that many people will look at, using SHA-512 may not be much of a problem either.
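
The core comparison of such a verify command might look like the sketch below. Computing the digest is left to a caller-supplied function so that the example stays independent of the hash algorithm; `LockEntry`, `findCorrupted` and `digestOf` are hypothetical names, and a real implementation would read and hash the files asynchronously.

```typescript
interface LockEntry {
  path: string; // file location recorded in dim-lock.json
  integrity: string; // digest recorded when the file was downloaded
}

// Returns the paths whose current digest differs from the recorded one.
// digestOf is expected to return null when the file is missing.
function findCorrupted(
  entries: LockEntry[],
  digestOf: (path: string) => string | null,
): string[] {
  return entries
    .filter((entry) => digestOf(entry.path) !== entry.integrity)
    .map((entry) => entry.path);
}
```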

Add support for deno 1.30.0

A test fails on the current code base when run on Deno 1.30.x, with the following error message:

InstallAction ...
  with URL ...
    download and check that data_files, dim.json and dim-lock.json are saved. ... ok (20ms)
    exit with error when name is not specified ... ok (5ms)
    exit with error when run with "name" not recorded in dim.json ... ok (7ms)
    overwrite existing files when specified name is duplicated and force is true ... ok (7ms)
    download using request headers and check that they are recorded in dim.json and dim-lock.json when specify headers option ... ok (6ms)
    encode downloaded file to Shift-JIS and record in dim.json, dim-lock.json when specify "encode sjis" as postProcesses ... ok (8ms)
    exit with error when specify "encode utf-8 sjis" as postProcesses, and download ... ok (7ms)
    exit with error when specify "encode" as postProcesses, and download. ... ok (6ms)
    check that the command for darwin to extract the downloaded file is entered and recorded in dim.json and dim-lock.json. ... ok (6ms)
    check that the decompress method is called with two arguments when the os is not darwin. ... ok (6ms)
    exit with error when specify "unzip a" as postProcess and download ... ok (4ms)
    convert downloaded file from xlsx to csv and record in dim.json and dim-lock.json when specify "xlsx-to-csv" as postProcesses ... ok (26ms)
    convert downloaded file from xls to csv and record in dim.json and dim-lock.json when specify "xlsx-to-csv" as postProcesses ... ok (16ms)
    exit with error when specify "xlsx-to-csv a" as postProcesses and download ... ok (14ms)
    download file and execute echo command with downloaded file path as standard output when specify "cmd echo" as postProcesses ... ok (7ms)
    download file and execute echo command with "a" and downloaded file path as standard output when specify "cmd echo a" as postProcesses ... ok (6ms)
    exit with error when specify "cmd" as postProcesses and download ... ok (6ms)
    output log and ignore error when specify error command such as "cmd aaa" as postProcesses ... ok (10ms)
    exit with error when specify "aaa" as postProcess and download ... ok (5ms)
    exit with error when if the URL is incorrectly described. ... FAILED (6ms)
      error: AssertionError: spy not called with expected args:
      
      
          [Diff] Actual / Expected
      
      
          [
            "\x1b[31mFailed to install.\x1b[39m",
      -     "\x1b[31mInvalid URL: 'aaa'\x1b[39m",
      +     "\x1b[31mInvalid URL\x1b[39m",
          ]
      
              throw new AssertionError(
                    ^
          at assertSpyCall (https://deno.land/[email protected]/testing/mock.ts:542:15)
          at Object.<anonymous> (file:///home/osoken/Documents/works/projects/cfj/dim/repo/dim/tests/libs/actions.install.test.ts:798:9)
          at async Function.runTest (https://deno.land/[email protected]/testing/_test_suite.ts:358:7)
          at async Function.runTest (https://deno.land/[email protected]/testing/_test_suite.ts:346:9)
          at async Function.runTest (https://deno.land/[email protected]/testing/_test_suite.ts:346:9)
          at async fn (https://deno.land/[email protected]/testing/_test_suite.ts:316:13)
    exit with error when failed to download ... ok (7ms)
    exit with error when execute with URL and file path ... ok (6ms)
  with URL ... FAILED (204ms)

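Version is not updated

The version is defined as a constant in the code below, but it is still v1.0.3 even though v1.0.4 is the latest release. As a result, even after installing the latest binary from the releases, "New version available: v1.0.4" is displayed.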

export const VERSION = "v1.0.3";

Support to search function

  • Create a search command

  • Create a search function

    • Search data using the CKAN package_search API
    • Specify the number of results to fetch with the -n option (default 10)
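
As a rough illustration of the request behind such a command, the sketch below builds a package_search URL with CKAN's `q` and `rows` parameters. The function name and the base URL used in the example are assumptions; the default of 10 mirrors the -n default above.

```typescript
// Builds the CKAN package_search URL for a keyword and a result count.
function buildPackageSearchUrl(
  ckanBaseUrl: string,
  keyword: string,
  rows = 10,
): string {
  const base = ckanBaseUrl.endsWith("/") ? ckanBaseUrl : ckanBaseUrl + "/";
  const url = new URL("api/3/action/package_search", base);
  url.searchParams.set("q", keyword);
  url.searchParams.set("rows", String(rows));
  return url.toString();
}
```

For example, `buildPackageSearchUrl("https://ckan.example.com", "shelter")` yields a URL that can be passed to `fetch`.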

Search Results

$ dim search xxxxxx
package_title1
- package_url
- package_description
- package_license
   1.resource_name1
    * resource_url1
    * resource_description1
    * created1
    * format
   2.resource_name2
    * resource_url2
    * resource_description2
    * created2
    * format
package_title2
- package_url
- package_description
- package_license
   3.resource_name3
    * resource_url3
    * resource_description3
    * created3
    * format
   4.resource_name4
    * resource_url4
    * resource_description4
    * created4
    * format

Support for interactive installation

  • Add an interactive option to the search command
$ dim search -i xxxx

package_title1
- package_url
- package_description
- package_license
   1.resource_name1
    * resource_url1
    * resource_description1
    * created1
    * format
   2.resource_name2
    * resource_url2
    * resource_description2
    * created2
    * format
package_title2
- package_url
- package_description
- package_license
   3.resource_name3
    * resource_url3
    * resource_description3
    * created3
    * format
   4.resource_name4
    * resource_url4
    * resource_description4
    * created4
    * format
...
Enter the number of the data to install
> 1

Enter the name. Enter blank if not required.
> 

Enter the post-processing you want to add separated by spaces.
Enter blank if not required.
(ex.: > unzip xlsx-to-csv)
> unzip

installing...
unzip
Installed to /xxx/xxx
  1. Write the data information obtained from CKAN to dim.json
  2. Store the data in data_files

Cannot target files after conversion when multiple postProcesses are specified.

dim install http://example.com/example.xlsx -p "xlsx-to-csv" -p "encode SJIS"

Currently, it is not possible to convert an xlsx file to a csv file and then to SJIS by executing the above command.

This is because if multiple postProcesses are specified, the path to the converted file is not passed to the next process.
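
The missing hand-off could be sketched by having each post-process return the path of the file it produced, so that the next one starts from that file. The stand-in processors below only compute paths and do not touch any files; all names are hypothetical, and real post-processes would most likely be asynchronous.

```typescript
// A post-process receives the current file path and returns the path of its output.
type PostProcess = (inputPath: string) => string;

// Runs the steps in order, feeding each step the path returned by the previous one.
function runPostProcesses(downloadedPath: string, steps: PostProcess[]): string {
  return steps.reduce((currentPath, step) => step(currentPath), downloadedPath);
}

// Stand-in for "xlsx-to-csv": the output is a sibling file with a .csv extension.
const xlsxToCsv: PostProcess = (path) => path.replace(/\.xlsx?$/i, ".csv");

// Stand-in for "encode SJIS": rewrites the file in place, so the path is unchanged.
const encodeInPlace: PostProcess = (path) => path;
```

With this shape, "encode SJIS" in the command above would receive the .csv path rather than the original .xlsx path.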

Change the structure of dim.json and dim-lock.json.

dim.json

{
    "fileVersion": "1.1",
    "contents": [{
      "name": "xxxxxxx", // name specified at install time; the URL if none was specified
      "url": "https://xxxx.xxx.xx", // url specified at install time
      "catalogUrl": "https://ckan.xxx.xx", // when obtained via search -i, stores the package's catalog URL; otherwise null
      "catalogResourceId": "123456abcd", // when obtained via search -i, stores the resource id; otherwise null
      "postProcesses": [
        { "type": "unzip", "arguments": { "password": "dummy", ... } },
        "csv_to_json"
      ], // post_process specified at install time; a string or an Object
      "headers": { "Fiware-Service": "service1" }, // headers specified at install time, in key:value form
    }]
}

dim-lock.json

{
  "lockfileVersion": "1.1",
  "contents": [{
    "name": "xxxxxxx", // name specified at install time; the URL if none was specified
    "url": "https://xxxx.xxx.xx", // url specified at install time
    "path": "xxx/xxx/xx.json", // location where the file was saved at install time
    "catalogUrl": "https://ckan.xxx.xx", // when obtained via search -i, stores the package's catalog URL; otherwise null
    "catalogResourceId": "123456abc", // when obtained via search -i, stores the resource id; otherwise null
    "lastModified": "2022-07-06T02:28:06.556Z", // taken from last_modified in the response header of the fetched data; ISO 8601 format; null if unavailable
    "eTag": "xxx-xxxxx", // taken from e-tag in the response header of the fetched data; null if unavailable; used, for example, to check whether the provided data has changed
    "lastDownloaded": "2022-07-06T02:28:06.556Z", // time when the download was performed; renamed from the former last_updated; ISO 8601 format
    "integrity": "sha1-xxxxxxxx", // modeled on npm's integrity; hash (sha1) of the file as downloaded; used, for example, to detect changes to the file after download
    "postProcesses": [
      { "type": "unzip", "arguments": { "password": "dummy", ... } },
      "csv_to_json"
    ], // post_process specified at install time; a string or an Object
    "headers": { "Fiware-Service": "service1" }, // headers specified at install time, in key:value form
  }]
}
