
alibabaresearch / damo-convai

DAMO-ConvAI: the official repository containing the codebase for Alibaba DAMO Conversational AI.

License: MIT License

conversational-ai deep-learning natural-language-processing dialog

damo-convai's Introduction

DAMO ConvAI





🔥 News

  • [2024-02]: 5 papers are accepted by LREC-COLING 2024!
  • [2023-10]: 7 papers are accepted by EMNLP 2023!
  • [2023-09]: BIRD-SQL is accepted by NeurIPS 2023 (Spotlight)!
  • [2023-08]: SigDial 2023 DSTC11 workshop BEST PAPER!
  • [2023-05]: 9 papers are accepted by ACL 2023!
  • [2022-11]: 🏆 Achieved the 1st rank on the DSTC11-SIMMC track!
  • [2022-10]: Ten papers have been accepted by EMNLP 2022!
  • [2022-07]: SPACE 3.0 has been accepted by SIGIR 2022.
  • [2022-05]: Two papers have been accepted by KDD 2022.
  • [2022-02]: S²SQL has been accepted by ACL 2022, and it achieves the first rank on the Spider leaderboard!
  • [2021-11]: SPACE 1.0 has been accepted by AAAI 2022.
  • [2020-11]: R²SQL has been accepted by AAAI 2021, and it achieves the first rank on the SparC and CoSQL leaderboards!

📝 License

DAMO-ConvAI is released under the MIT License.

MIT License

Copyright (c) 2022 Alibaba Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

damo-convai's People

Contributors

airobotzhang, alibaba-oss, chen9154, debby1103, huybery, hwwancient, langhaobeijing, liminghao1630, pldlgb, silverriver, starryskylyx, tnlin, up700, yangjiaxi, yeyunhu


damo-convai's Issues

Panel of BIRD Annotation Issues.

Hi all,

Although BIRD has incurred significant annotation costs, we still cannot guarantee that all the data is accurately labeled. We hope that the community can assist us in building BIRD together! You can continuously report any errors you find under this issue, and we will perform a dataset update at a designated time.

Thanks a lot!

Best,
Binyuan

Permission denied when trying to run `sudo make eval` (graphix-3b)

When I run the evaluation, I simply enter the `sudo make eval` command. When the evaluation finishes, the script fails with:

Traceback (most recent call last):
  File "seq2seq/run_seq2seq_eval.py", line 284, in <module>
    main()
  File "seq2seq/run_seq2seq_eval.py", line 252, in main
    metric_key_prefix="eval",
  File "/app/seq2seq/utils/trainer.py", line 104, in evaluate
    output.metrics.update(self.compute_metrics(eval_preds))
  File "/app/seq2seq/utils/spider.py", line 142, in _compute_metrics
    return self.metric.compute(predictions=predictions, references=references)
  File "/opt/conda/lib/python3.7/site-packages/datasets/metric.py", line 419, in compute
    self.add_batch(**inputs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/metric.py", line 465, in add_batch
    self._init_writer()
  File "/opt/conda/lib/python3.7/site-packages/datasets/metric.py", line 539, in _init_writer
    cache_file_name, filelock = self._create_cache_file()  # get ready
  File "/opt/conda/lib/python3.7/site-packages/datasets/metric.py", line 257, in _create_cache_file
    filelock.acquire(timeout=timeout)
  File "/opt/conda/lib/python3.7/site-packages/datasets/utils/filelock.py", line 273, in acquire
    self._acquire()
  File "/opt/conda/lib/python3.7/site-packages/datasets/utils/filelock.py", line 408, in _acquire
    fd = os.open(self._lock_file, open_mode)
PermissionError: [Errno 13] Permission denied: '/transformers_cache/metrics/spider/both/default_experiment-1-0.arrow.lock'


What is going wrong in the evaluation process of the official GitHub code? Could you please help me solve it?

I tried the commands below:

chmod 777 train_db_id
chmod 777 transformers_cache
chmod 777 wandb

When I just use `make eval`, I still get the permission error.
So I tried `sudo make eval`; still the same error.

Could you please help me to solve the problem?
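The PermissionError points at the root-owned `/transformers_cache` path rather than the files that were chmod-ed. One possible workaround (a sketch, assuming the Makefile honors the standard Hugging Face environment variables) is to redirect the caches to a user-writable directory instead of running under sudo:

```shell
# Sketch: point the Hugging Face caches at a user-writable directory so
# the datasets metric lockfile is not created under a root-owned path.
# TRANSFORMERS_CACHE / HF_DATASETS_CACHE are standard HF env variables;
# whether this `make eval` honors them is an assumption.
CACHE_DIR="$(mktemp -d)"
export TRANSFORMERS_CACHE="$CACHE_DIR/transformers"
export HF_DATASETS_CACHE="$CACHE_DIR/datasets"
mkdir -p "$TRANSFORMERS_CACHE" "$HF_DATASETS_CACHE"
```

With the caches redirected, `make eval` could then be run without sudo.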

Deep Thinking boils down to repetition, not real learning

  1. On the MR dataset with opt-125M, feeding in the two selected demonstrations repeated 5 times already gives 67% accuracy at the first iteration, and 73% at the second.
  2. On the same dataset with opt-125M, swapping the positive/negative labels of the demonstrations still fails to learn the flipped labels.

The train data of BIRD-SQL fails to unzip.

I tried to unzip the training data of BIRD-SQL, but got the following error message:

unzip train.zip
Archive: train.zip
  creating: train/
  inflating: __MACOSX/._train
  inflating: train/.DS_Store
  inflating: __MACOSX/train/._.DS_Store
  inflating: train/train_databases.zip
  error: invalid compressed data to inflate
  bad CRC b09490af (should be 2eea441e)
  inflating: __MACOSX/train/._train_databases.zip
  inflating: train/train.json
  inflating: __MACOSX/train/._train.json
  inflating: train/train_gold.sql
  inflating: __MACOSX/train/._train_gold.sql
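A quick way to confirm whether the archive itself is corrupted (for example, from an incomplete download) is the CRC check in Python's standard `zipfile` module; this is a general sketch, not project code:

```python
import zipfile

def first_bad_member(path):
    """Return the name of the first member failing its CRC check, or None."""
    with zipfile.ZipFile(path) as zf:
        return zf.testzip()  # walks every member and verifies its CRC
```

If `first_bad_member("train.zip")` reports `train/train_databases.zip`, the download is likely truncated or corrupted and should be re-fetched.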

Space-3 Pre-trained Checkpoint

Hi,

Thanks for your great and exciting work! I have a question about the pre-trained checkpoint: Is the SPACE-3 pre-trained checkpoint released in this repo based on the full-shot setting? If I want to reproduce the few-shot setting results, do I need to pre-train the model from scratch with only "10% of the training data of the target evaluation dataset if it exists in the AnPreDial."

Thanks!

The proton repo, file common_utils.py

Why does `used_schema['column'].add(val1[1])` use val1 instead of val2? I think this should be val2, as it is in LGESQL. What is the meaning of this modification?

Is there a LARGE Chinese model for STAR?

  1. The Chinese base model of SPACE-T does not seem to work well; JOIN is not supported.
  2. The English large model of SPACE-T fails when run the way ModelScope describes (I also tried the latest nightly dgl, with no luck):
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/johnsaxon/NL2SQL/SPACE-T/star/star.py:29 in <module>                                       │
│                                                                                                  │
│   26 │   │   │   "last_sql": last_sql,                                                           │
│   27 │   │   │   "database_id": test_case['database_id'],                                        │
│   28 │   │   │   'local_db_path': test_case['local_db_path']}                                    │
│ ❱ 29 │   results = pipeline(case)                                                                │
│   30 │   print(results)                                                                          │
│   31 │   history.append(item)                                                                    │
│   32                                                                                             │
│                                                                                                  │
│ /home/johnsaxon/aidd/localcolabfold/colabfold_batch/colabfold-conda/lib/python3.8/site-packages/ │
│ modelscope/pipelines/base.py:212 in __call__                                                     │
│                                                                                                  │
│   209 │   │   │   return self._process_iterator(input, *args, **kwargs)                          │
│   210 │   │                                                                                      │
│   211 │   │   else:                                                                              │
│ ❱ 212 │   │   │   output = self._process_single(input, *args, **kwargs)                          │
│   213 │   │   return output                                                                      │
│   214 │                                                                                          │
│   215 │   def _sanitize_parameters(self, **pipeline_parameters):                                 │
│                                                                                                  │
│ /home/johnsaxon/aidd/localcolabfold/colabfold_batch/colabfold-conda/lib/python3.8/site-packages/ │
│ modelscope/pipelines/base.py:240 in _process_single                                              │
│                                                                                                  │
│   237 │   │   forward_params = kwargs.get('forward_params', {})                                  │
│   238 │   │   postprocess_params = kwargs.get('postprocess_params', {})                          │
│   239 │   │   self._check_input(input)                                                           │
│ ❱ 240 │   │   out = self.preprocess(input, **preprocess_params)                                  │
│   241 │   │                                                                                      │
│   242 │   │   with device_placement(self.framework, self.device_name):                           │
│   243 │   │   │   if self.framework == Frameworks.torch:                                         │
│                                                                                                  │
│ /home/johnsaxon/aidd/localcolabfold/colabfold_batch/colabfold-conda/lib/python3.8/site-packages/ │
│ modelscope/pipelines/base.py:369 in preprocess                                                   │
│                                                                                                  │
│   366 │   │   assert self.preprocessor is not None, 'preprocess method should be implemented'    │
│   367 │   │   assert not isinstance(self.preprocessor, List),\                                   │
│   368 │   │   │   'default implementation does not support using multiple preprocessors.'        │
│ ❱ 369 │   │   return self.preprocessor(inputs, **preprocess_params)                              │
│   370 │                                                                                          │
│   371 │   def forward(self, inputs: Dict[str, Any],                                              │
│   372 │   │   │   │   **forward_params) -> Dict[str, Any]:                                       │
│                                                                                                  │
│ /home/johnsaxon/aidd/localcolabfold/colabfold_batch/colabfold-conda/lib/python3.8/site-packages/ │
│ modelscope/utils/type_assert.py:48 in wrapper                                                    │
│                                                                                                  │
│   45 │   │   │   │   │   if not isinstance(value, bound_types[name]):                            │
│   46 │   │   │   │   │   │   raise TypeError('Argument {} must be {}'.format(                    │
│   47 │   │   │   │   │   │   │   name, bound_types[name]))                                       │
│ ❱ 48 │   │   │   return func(*args, **kwargs)                                                    │
│   49 │   │                                                                                       │
│   50 │   │   return wrapper                                                                      │
│   51                                                                                             │
│                                                                                                  │
│ /home/johnsaxon/aidd/localcolabfold/colabfold_batch/colabfold-conda/lib/python3.8/site-packages/ │
│ modelscope/preprocessors/nlp/space_T_en/conversational_text_to_sql_preprocessor.py:113 in        │
│ __call__                                                                                         │
│                                                                                                  │
│   110 │   │   output_dataset = process_dataset(self.model_dir, self.processor,                   │
│   111 │   │   │   │   │   │   │   │   │   │    theresult, self.output_tables)                    │
│   112 │   │   output_dataset = \                                                                 │
│ ❱ 113 │   │   │   process_dataset_graph(                                                         │
│   114 │   │   │   │   self.graph_processor,                                                      │
│   115 │   │   │   │   output_dataset,                                                            │
│   116 │   │   │   │   self.output_tables,                                                        │
│                                                                                                  │
│ /home/johnsaxon/aidd/localcolabfold/colabfold_batch/colabfold-conda/lib/python3.8/site-packages/ │
│ text2sql_lgesql/preprocess/process_graphs.py:14 in process_dataset_graph                         │
│                                                                                                  │
│   11 │   │   │   continue                                                                        │
│   12 │   │   if (idx + 1) % 500 == 0:                                                            │
│   13 │   │   │   print('Processing the %d-th example ...' % (idx + 1))                           │
│ ❱ 14 │   │   entry = processor.process_graph_utils(entry, db, method=method)                     │
│   15 │   │   processed_dataset.append(entry)                                                     │
│   16 │   # print('In total, process %d samples, skip %d samples .' % (len(processed_dataset),    │
│   17 │   if output_path is not None:                                                             │
│                                                                                                  │
│ /home/johnsaxon/aidd/localcolabfold/colabfold_batch/colabfold-conda/lib/python3.8/site-packages/ │
│ text2sql_lgesql/preprocess/graph_utils.py:112 in process_graph_utils                             │
│                                                                                                  │
│   109 │   │   if method == 'rgatsql':                                                            │
│   110 │   │   │   ex = self.process_rgatsql(ex, db, relation)                                    │
│   111 │   │   elif method == 'lgesql':                                                           │
│ ❱ 112 │   │   │   ex = self.process_lgesql(ex, db, relation)                                     │
│   113 │   │   return ex                                                                          │
│   114                                                                                            │
│                                                                                                  │
│ /home/johnsaxon/aidd/localcolabfold/colabfold_batch/colabfold-conda/lib/python3.8/site-packages/ │
│ text2sql_lgesql/preprocess/graph_utils.py:93 in process_lgesql                                   │
│                                                                                                  │
│    90 │   │   match_ids = [idx for idx, r in enumerate(graph.global_edges) if 'match' in r[2]]   │
│    91 │   │   src, dst, eids = lg.edges(form='all', order='eid')                                 │
│    92 │   │   eids = [e for u, v, e in zip(src.tolist(), dst.tolist(), eids.tolist()) if not (   │
│ ❱  93 │   │   graph.lg = lg.edge_subgraph(eids, preserve_nodes=True).remove_self_loop().add_se   │
│    94 │   │   ex['graph'] = graph                                                                │
│    95 │   │   return ex                                                                          │
│    96                                                                                            │
│                                                                                                  │
│ /home/johnsaxon/aidd/localcolabfold/colabfold_batch/colabfold-conda/lib/python3.8/site-packages/ │
│ dgl/utils/internal.py:1052 in _fn                                                                │
│                                                                                                  │
│   1049 │                                                                                         │
│   1050 │   @wraps(func)                                                                          │
│   1051 │   def _fn(*args, **kwargs):                                                             │
│ ❱ 1052 │   │   return func(*args, **kwargs)                                                      │
│   1053 │                                                                                         │
│   1054 │   _fn.__doc__ = """Alias of :func:`dgl.{}`.""".format(func.__name__)                    │
│   1055 │   return _fn                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: edge_subgraph() got an unexpected keyword argument 'preserve_nodes'
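For reference, newer DGL releases removed the `preserve_nodes` keyword from `edge_subgraph` (its replacement, with inverted meaning, is `relabel_nodes`). Until the code pins a compatible dgl version, a defensive dispatch like the following sketch can bridge both APIs — the exact rename is an assumption based on DGL's deprecation of `preserve_nodes`:

```python
import inspect

def edge_subgraph_compat(graph, eids):
    """Call graph.edge_subgraph with whichever keep-all-nodes keyword
    the installed DGL version supports (hypothetical compatibility shim)."""
    params = inspect.signature(graph.edge_subgraph).parameters
    if "preserve_nodes" in params:          # older DGL API
        return graph.edge_subgraph(eids, preserve_nodes=True)
    return graph.edge_subgraph(eids, relabel_nodes=False)  # newer DGL API
```

Pinning the dgl version used by the authors remains the safer route, since other APIs may also have changed between releases.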

It's June now. Any code update for the paper Graphix-3B?

Your paper has already been released; why hasn't the source code been released after several months? Is there some unfair competitive improvement technique in the code? If not, please publish the code for review and reproduction. Thank you.

Bug in setup.sh: `python -c "from embeddings import GloveEmbedding; emb = GloveEmbedding('common_crawl_48', d_emb=300)"`

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\anaconda3\lib\site-packages\embeddings\glove.py", line 43, in __init__
    self.db = self.initialize_db(self.path(path.join('glove', '{}:{}.db'.format(name, d_emb))))
  File "D:\anaconda3\lib\site-packages\embeddings\embedding.py", line 22, in path
    root = environ.get('EMBEDDINGS_ROOT', path.join(environ['HOME'], '.embeddings'))
  File "D:\anaconda3\lib\os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME'
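The KeyError comes from the embeddings package assuming a Unix-style HOME variable, which Windows does not set. A possible workaround (a sketch, not an official fix) is to define HOME or EMBEDDINGS_ROOT before the import:

```python
import os

# On Windows, HOME is usually unset; embeddings/embedding.py falls back to
# environ['HOME'] and raises KeyError. Provide it (or EMBEDDINGS_ROOT,
# which the package checks first) before importing GloveEmbedding.
os.environ.setdefault("HOME", os.path.expanduser("~"))
os.environ.setdefault("EMBEDDINGS_ROOT",
                      os.path.join(os.path.expanduser("~"), ".embeddings"))
```

Setting these in the shell (or in setup.sh itself) before running the `python -c` line would have the same effect.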

Dater - missing files link in README

Hey, in the Dater README there is a missing link to the saved files and prompts:

## Download
Download required prompts and saved files and moving files to target folder.

### Step 1
Download the [saved files and prompts]().  <--- HERE

### Step 2
Move saved files to target folder.

PRO: why isn't the SFT loss averaged over tokens?

The loss-computation code in process_manager.py is as follows:

        sum_scores = torch.cat(score_list, dim=1) #[batch, training_stage]
        suffix_mask = torch.cat(suffix_mask_list, dim=1) #[batch, training_stage]
        scores = sum_scores / suffix_mask #[batch, training_stage]
        total_loss = 0
        for time in range(temp_training_stage - 1):
            neg_reward = batch["rewards"][:, time+1:] # [batch, training_stage-time-1]
            pos_reward = batch["rewards"][:, time] # [batch]
            
            eps = 1e-10
            neg_temperatures = pos_reward.view(-1, 1) - neg_reward # [batch, training_stage-time-1]
            pos_temperature = torch.max(neg_temperatures, dim=1).values # [batch]
            loss = torch.log(eps + torch.exp(scores[:, time] * pos_temperature) + torch.sum(torch.exp(scores[:, time+1:] * neg_temperatures), dim=1)) - scores[:, time] * pos_temperature # [batch]
            loss = torch.mean(loss).to(local_outputs.hidden_states[0].dtype)
            
            print_loss[time].append(loss.item())
            total_loss += loss
        
        sft_index = batch["sft_index"].view(batch_size, 1)
        sft_scores = torch.gather(input = sum_scores, dim = 1, index = sft_index).view(batch_size) #[batch]
        sft_loss = torch.mean(-sft_scores).to(local_outputs.hidden_states[0].dtype)
        sft_loss = args.sft_weight * math.pow(temp_training_stage - 1, 2) * sft_loss
        total_loss += sft_loss

In the sft_loss part, the gather is applied to sum_scores rather than scores. Could you explain why? In my tests, even with beta=0.05, sft_loss is an order of magnitude larger than rank_loss.
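The magnitude gap is consistent with scoring on summed rather than token-averaged log-probabilities. A toy sketch (not the PRO code) of why a summed score scales with suffix length while an averaged one does not:

```python
# Toy illustration: with a constant per-token log-probability, a summed
# score grows linearly with sequence length, while the token-averaged
# score is length-invariant. A loss built on summed scores can therefore
# dominate one built on averaged scores.
per_token_logprob = -2.0   # hypothetical constant log-prob per token
short, long_ = 10, 200     # hypothetical suffix lengths

summed = {n: per_token_logprob * n for n in (short, long_)}
averaged = {n: (per_token_logprob * n) / n for n in (short, long_)}

assert summed[200] == 20 * summed[10]          # scales with length
assert averaged[10] == averaged[200] == -2.0   # does not
```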

Incorrect evidence in dev set example, db_id: california_schools

In california_schools/frpm.csv, eligible free rate = Free Meal Count / Enrollment, and the column "Percent (%) Eligible Free (K-12)" is indeed "Free Meal Count" / "Enrollment".

However, the very first example in the dev set contains errors:

Original data:

db_id: california_schools
question:
What is the highest eligible free rate for K-12 students in the schools in Alameda County?
evidence:
Eligible free rate for K-12 = FRPM Count (K-12) / Enrollment (K-12)
Gold SQL:
SELECT FRPM Count (K-12) / Enrollment (K-12)
FROM frpm
WHERE County Name = 'Alameda'
ORDER BY (CAST(FRPM Count (K-12) AS REAL) / Enrollment (K-12)) DESC LIMIT 1

Two errors:
(1) The evidence is wrong; the correct version should be: Eligible free rate for K-12 = Free Meal Count (K-12) / Enrollment (K-12)
(2) The gold SQL is needlessly complicated; the division is redundant. Since "Percent (%) Eligible Free (K-12)" already represents the eligible free rate for K-12, it can simply be written as:

SELECT MAX(Percent (%) Eligible Free (K-12))
FROM frpm
WHERE County Name = 'Alameda'

There are probably quite a few similar problems.

share Graphix-t5+picard predicted SQL file

Hello, may I ask if you can share the predicted SQL that you finally ran on the validation set? Replicating your work is too time-consuming for me as my server does not have a GPU.

Segmentation fault

A segmentation fault occurs when running the preprocessing step. What could be the reason for this?

PRO: a question about rewards

Hi, while reading the PRO code I noticed that the training data already contains a reward value for each <x, y_i> pair. If I want to use my own data, how should I obtain these reward values? Do I still need to train a reward model (RM) to score them? And if so, how does that differ from the RM in RLHF?

[PACE] Processing PhotoChat dataset

Hi,
This is Young-Jun Lee.

I have one question about fine-tuning the proposed pre-trained model on the PhotoChat dataset for the intent prediction task.

Before fine-tuning the model, I execute the write_photochat_intent.py file after downloading the PhotoChat dataset from the official repository. I encounter one problem when I print the result of paths variable in this line. The paths variable is just an empty list.

Can you elaborate on how to process the PhotoChat dataset?

Best regards,

BIRD dataset fixes for database descriptions & leaderboard update request (gpt4 and others)

Currently, I'm only playing with BIRD's dev set. Just found the following problems occurred in the dev set in database_description:

  1. Encoding Issue with .csv files: These files are claimed to be utf-8 encoding with BOM, but many contain non-utf-8 characters. For example, in the formula_1 database's qualifying table, there's an invisible character in the first row's value_description:
... Sprint qualifying is essentially a short-form Grand Prix � a race that ...

This causes the pandas csv reader to fail. The .csv files' BOM start prevents using other encodings like latin1 or iso-8859-1. The current workaround is to delete these invisible characters in multiple files manually.

  2. Schema Mismatch in .csv and .sqlite files: Some .csv files in database_description don't match the .sqlite table schema. For example, in the european_football_2 database, the table contains columns named home_player_<number>, absent in the .csv files. These files only contain home_player_X<number> and home_player_Y<number>, which are also not well-described.

  3. Incorrect .csv File Names: Some .csv files have incorrect names. For example, in the card_games database, ruling.csv and Set_transactions.csv should be rulings.csv and Set_translations.csv, respectively.

Additionally, it seems that the leaderboard is not updating ever since March. Now that GPT-4 is open, could the leaderboard be updated with GPT-4 standings to avoid us running the scripts separately?
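For the encoding issue in point 1, a possible stopgap (a sketch, not an official fix) is to decode the description files as UTF-8 with BOM handling, replace undecodable bytes, and strip the resulting U+FFFD replacement characters before handing the text to pandas:

```python
def read_clean_text(path):
    """Read a description file, tolerating stray non-UTF-8 bytes."""
    # utf-8-sig transparently strips the BOM; errors="replace" turns every
    # invalid byte sequence into U+FFFD, which we then drop.
    with open(path, encoding="utf-8-sig", errors="replace") as f:
        return f.read().replace("\ufffd", "")
```

The cleaned string can then be fed to `pandas.read_csv` via `io.StringIO`, avoiding the manual per-file deletion described above.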

License of Bird-SQL data

Hi authors,

Thanks for the great work.

I am confused about the data license.

In the paper (https://arxiv.org/pdf/2305.03111.pdf), you mentioned: "The databases in this study are open-source with appropriate license and should be distributed under the CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/".

But in the repo, from README.md (https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird), I can see "License Notation: BIRD-SQL is constructed and distributed for academic use instead of commercial use. For non-academic purposes of this data, please contact corresponding authors.".

I am just wondering what the correct license for this Bird-SQL dataset is.
Any chance for commercial use of this dataset?

Thanks!

Download links invalid.

BIRD Issue: It seems that the gpt-3.5-turbo API call is not implemented in bird/llm/src/gpt_request.py. Maybe the authors forgot to update the code.

bird/llm/src/gpt_request.py

error:This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?\t----- bird -----\tcalifornia_schools

The original function:

def connect_gpt(engine, prompt, max_tokens, temperature, stop):
    # print(prompt)
    try:
        result = openai.Completion.create(engine=engine, prompt=prompt, max_tokens=max_tokens, temperature=temperature, stop=stop)
    except Exception as e:
        result = 'error:{}'.format(e)
    return result

While the gpt-3.5-turbo API call is:

response = openai.ChatCompletion.create(
    model=engine,
    messages=[{"role": "user", "content": prompt}],
    temperature=temperature,
    stop=stop
)

Regarding the connection between question tokens and database schema items.

Hi,
Good work! Especially the idea of introducing relations between question tokens and schema items is brilliant, I have to say. But I am not quite sure how you established these connections during the graph-construction process. What specific technique did you use to achieve that? I apologize if it is a silly question, but could you kindly explain it a bit more, please? Thanks.

Best,
Zea

BIRD dataset low download speed

Is it possible to host the dataset on OneDrive or another file host? I am getting a very low download speed, and otherwise it will take me weeks to download.

Thanks

gpu memory usage increment of the model graphix-3b

Hello, does the graphix-3b model your team released have a GPU-memory accumulation bug? I have tried many times and keep running into this problem. Could you fix it? Thank you.

By the way, to quote a sentiment from your paper: not all research centers have A100 graphics cards, especially in the age of AIGC. LOL.

Loading heterograph issue of GraphixT5

Hi, I was trying to read your "graph_pedia_total.bin" but got the following error. Could you kindly advise me on this, please? Thanks.

This is the error:
"Can't get attribute 'DGLHeteroGraph' on <module 'dgl.heterograph' from '/usr/local/lib/python3.10/dist-packages/dgl/heterograph.py'>"

Best,
Zea

spectra processed dataset

I am highly interested in the spectra project and would like to conduct experiments and testing using some preprocessed data. Currently, I could not find any preprocessed data readily available within the project. I would therefore kindly ask for your assistance in providing some preprocessed data so that I can better understand and use the spectra project. Thanks.

Where is the Chinese Goal-oriented Dialog (CGoDial) data? There is no actual data under the DAMO-ConvAI-main\cgodial directory.

Where is the data described in the paper "CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation"?
The paper says: Dataset available at https://github.com/AlibabaResearch/DAMO-ConvAI/cgodial.
Following the README, I cannot find the corresponding data.
Was it perhaps forgotten in the upload?

Flow-based Dialog

cd flow_based_dialog
The datasets are in ./data; there are two baselines:

## Retrieval_based Dialog 
`cd retrieval_based_dialog`  
The datasets are `train.json, dev.json, test.json`  




proton

According to README.md, a problem occurs when preprocessing the dataset (error screenshot omitted).
But "tables.bin" does not exist in the original Spider dataset.
Thank you!

Found two ground truth SQL queries in dev dataset that may be wrong

Hi team,

I may have found two ground-truth SQL queries in the dev dataset that may be wrong. They are from data/bird/dev.json. These two SQL queries are marked as ground truth, but they cannot be executed against the sqlite DB.

The first SQL query is

"SQL": "SELECT T3.district, CAST((T3.A13 - T3.A12) AS REAL) * 100 / T3.A12 FROM loan AS T1 INNER JOIN account AS T2 ON T1.account_id = T2.account_id INNER JOIN district AS T3 ON T2.district_id = T3.district_id WHERE T1.status = 'D'"

However it might be

"SQL": "SELECT T3.A2, CAST((T3.A13 - T3.A12) AS REAL) * 100 / T3.A12 FROM loan AS T1 INNER JOIN account AS T2 ON T1.account_id = T2.account_id INNER JOIN district AS T3 ON T2.district_id = T3.district_id WHERE T1.status = 'D'"

This fix is to replace T3.district with T3.A2. There is no district in T3 (which is district table). Per the table schema I think it is actually T3.A2.

The second SQL query is

"SQL": "SELECT COUNT(driverId) FROM target WHERE raceId = 18"

However it might be

"SQL": "SELECT COUNT(driverId) FROM results WHERE raceId = 18"

This fix is to replace target table which does not exist. I think the table should be results.
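Non-executable gold queries like these can be caught mechanically by attempting execution against the corresponding database. A minimal sketch — the in-memory schema below is a hypothetical stand-in for the real BIRD formula_1 database:

```python
import sqlite3

# Hypothetical stand-in schema; the real check would open the actual
# BIRD .sqlite file for each example's db_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (driverId INTEGER, raceId INTEGER)")

def executes(sql):
    """Return True if the query runs against the connection."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False
```

Here `executes(...)` is False for the `target` variant (no such table) and True for the `results` variant, so a sweep over dev.json would flag both problems reported above.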

SPACE3 data

The dataset link seems to be broken; could you please check it? Thanks!

sunsql problem with starting training

I followed the instructions in your readme markdown file and there was no problem with preprocessing.
But when I tried to start training, the program exited with the following error message:

Traceback (most recent call last):
  File "/home/SunSQL/scripts/text2sql.py", line 113, in <module>
assert now[0].query == now[1].query
IndexError: list index out of range

and this is the code snippet in which the error emerged:

for wl in range(0, len(cur_dataset), 2):
                # print(wl.query)
                now = cur_dataset[wl : wl+2]
                # print(now[0].query)
                # print(now[1].query)
                assert now[0].query == now[1].query

So I checked the cur_dataset variable and found that there were only 3 entries when the error happened, which explains the problem: when wl=2, there is only 1 entry in the now variable.
I tried to circumvent this by breaking out of the loop when wl reaches len(cur_dataset)-1, but then another error occurs when loss.backward() is executed, telling me the data dimensions are incorrect.
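The assertion fails because cur_dataset ends up with an odd number of entries. Until the underlying preprocessing mismatch is resolved, a defensive pairing loop (a sketch, not the project's code) avoids the IndexError by skipping a trailing unpaired entry:

```python
def iter_pairs(dataset):
    """Yield consecutive (even, odd) pairs, dropping a trailing singleton."""
    for i in range(0, len(dataset) - 1, 2):
        yield dataset[i], dataset[i + 1]
```

Note this only masks the symptom; the later dimension error at loss.backward() suggests the real fix is upstream, in whatever step produced an odd number of entries.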

final_generation.json in STAR

Hi there. I have seen your great improvement on the NL2SQL task via STAR. Do you still have the final_generation.json file? Thanks.

dataloader question.

Hi, thanks for sharing the code for the brilliant work, Deep Thinking. But the data-loader loop inside the kv_iter loop, which I believe represents the total optimization steps, really confuses me. In the few-shot learning setting, only k examples are given, and in your case that should be exemplar_str. So why is an extra data loader applied? Does the proposed deep-thinking process require extra data?

Hyperparameters to reproduce reported scores for SPACE-3

Thanks for the great work.
I'm interested in intent prediction with SPACE-3 and want to reproduce the reported scores on BANKING77, HWU64, and CLINC150.

I confirmed that the fine-tuned model distributed at this link reaches the reported accuracy on BANKING77.
However, when I fine-tuned the pre-trained model myself, the accuracy did not match the reported numbers.

Below is scripts/banking/train.sh in my environment.
I changed only PROJECT_ROOT and SAVE_ROOT from the original script.

#!/bin/bash
set -ux

# CUDA environment settings.
export CUDA_VISIBLE_DEVICES=1

# Parameters.
LEARNING_METHOD=super
MODEL=IntentUnifiedTransformer
TRIGGER_DATA=banking
TRIGGER_ROLE=user
PROJECT_ROOT=modelscope/damo/nlp_space_pretrained-dialog-model
VOCAB_PATH=${PROJECT_ROOT}/model/Bert/vocab.txt
DATA_DIR=${PROJECT_ROOT}/data/pre_train
LOAD_MODEL_NAME=SPACE-Intent
INIT_CHECKPOINT=${PROJECT_ROOT}/model/${LOAD_MODEL_NAME}
EXAMPLE=false
WITH_QUERY_BOW=false
WITH_RESP_BOW=false
WITH_CONTRASTIVE=false
WITH_RDROP=true
WITH_POOL=false
WITH_MLM=true
DYNAMIC_SCORE=true
GENERATION=false
POLICY=false
TOKENIZER_TYPE=Bert
DROPOUT_RATIO=0.25
TEMPERATURE=0.07
MLM_RATIO=0.1
KL_RATIO=5.0
LR=1e-4
PROMPT_NUM_FOR_POLICY=5
PROMPT_NUM_FOR_UNDERSTAND=5
BATCH_SIZE_LABEL=64
GRAD_ACCUM_STEPS=2
BATCH_SIZE_NOLABEL=0
NUM_PROCESS=1
NUM_INTENT=77
NUM_EPOCH=60
NUM_GPU=1
SEED=11
SAVE_ROOT=reproduce
SAVE_DIR=${SAVE_ROOT}/outputs/${TRIGGER_DATA}/94-94

# Data preprocess.
python -u preprocess.py \
  --data_dir=${DATA_DIR} \
  --with_mlm=${WITH_MLM} \
  --vocab_path=${VOCAB_PATH} \
  --num_process=${NUM_PROCESS} \
  --trigger_data=${TRIGGER_DATA} \
  --trigger_role=${TRIGGER_ROLE} \
  --dynamic_score=${DYNAMIC_SCORE} \
  --tokenizer_type=${TOKENIZER_TYPE} \
  --prompt_num_for_policy=${PROMPT_NUM_FOR_POLICY} \
  --prompt_num_for_understand=${PROMPT_NUM_FOR_UNDERSTAND}

# Main run.
python -u run_intent.py \
  --do_train=true \
  --do_infer=true \
  --do_test=true \
  --model=${MODEL} \
  --example=${EXAMPLE} \
  --policy=${POLICY} \
  --generation=${GENERATION} \
  --data_dir=${DATA_DIR} \
  --vocab_path=${VOCAB_PATH} \
  --num_process=${NUM_PROCESS} \
  --trigger_data=${TRIGGER_DATA} \
  --trigger_role=${TRIGGER_ROLE} \
  --dynamic_score=${DYNAMIC_SCORE} \
  --tokenizer_type=${TOKENIZER_TYPE} \
  --prompt_num_for_policy=${PROMPT_NUM_FOR_POLICY} \
  --prompt_num_for_understand=${PROMPT_NUM_FOR_UNDERSTAND} \
  --with_query_bow=${WITH_QUERY_BOW} \
  --with_resp_bow=${WITH_RESP_BOW} \
  --batch_size_label=${BATCH_SIZE_LABEL} \
  --gradient_accumulation_steps=${GRAD_ACCUM_STEPS} \
  --batch_size_nolabel=${BATCH_SIZE_NOLABEL} \
  --save_dir=${SAVE_DIR} \
  --init_checkpoint=${INIT_CHECKPOINT} \
  --learning_method=${LEARNING_METHOD} \
  --temperature=${TEMPERATURE} \
  --with_contrastive=${WITH_CONTRASTIVE} \
  --with_rdrop=${WITH_RDROP} \
  --with_pool=${WITH_POOL} \
  --with_mlm=${WITH_MLM} \
  --mlm_ratio=${MLM_RATIO} \
  --kl_ratio=${KL_RATIO} \
  --dropout=${DROPOUT_RATIO} \
  --embed_dropout=${DROPOUT_RATIO} \
  --attn_dropout=${DROPOUT_RATIO} \
  --ff_dropout=${DROPOUT_RATIO} \
  --num_intent=${NUM_INTENT} \
  --num_epoch=${NUM_EPOCH} \
  --gpu=${NUM_GPU} \
  --seed=${SEED} \
  --lr=${LR} \
  --log_steps=20 \
  --valid_steps=0 \
  --num_type_embeddings=2 \
  --save_checkpoint=true \
  --token_loss=true \
  --max_len=256

Do you have any ideas for reproducing reported scores?

Hello, excuse me for the interruption. Could you please explain the function of PROMPT_MAPPING mentioned in constants.py?

GRAPHIX_RELATIONS = ['question-question-dist-1', 'question-column-star', 'question-question-modifier', 'question-question-argument', 
'question-table-exactmatch', 'question-column-partialmatch', 'question-column-valuematch', 'table-question-exactmatch', 'table-column-pk', 
'table-column-has', 'table-table-fkr', 'column-question-partialmatch', 'column-table-pk', 'column-column-sametable', 'column-column-fkr', 
'column-column-star', 'column-table-has', 'column-question-valuematch', 'table-table-fk', 'column-column-fk', 'column-question-star', 
'question-column-exactmatch', 'column-question-exactmatch', 'question-table-partialmatch', 'table-question-partialmatch']

PROMPT_MAPPING = {
    'question-question-dist-1': 'This question word is the closet neighbor of another question word.',
    'question-column-star' : "This question word is connected with the special column item: '*'.",
    'question-question-modifier': 'The modifier of this question word is another word.',
    'question-question-argument': 'The argument of this question word is another word.',
    'question-table-exactmatch': 'This question word is matched exactly with the table item.',
    'question-column-partialmatch': 'This question word is matched partially with the column item.',
    'question-column-valuematch': 'This question word is matched with one of values in this column item.',
    'table-question-exactmatch': 'This table item is matched exactly with the question word.',
    'table-column-pk': 'This table item contains this column item as a primary key.',
    'table-column-has': 'This table item owns this column item as a normal relation.',
    'table-table-fkr': 'The other table item is linked with this table item by a foreign key.',
    'column-question-partialmatch': 'This column item is matched partially with the question word.',
    'column-table-pk': 'This column item is the unique primary key of the table item.',
    'column-column-sametable': 'This column item and the other column item appear in the same table.',
    'column-column-fkr': 'The other column item and this column item is linked as foreign key.',
    'column-column-star': "This column item is special column item: '*' or the other column item is the special column item: '*'.",
    'column-table-has': 'This column item belongs to the table item.',
    'column-question-valuematch': 'This column item contains the value that are matched with the question word.',
    'table-table-fk': 'This table item is linked with the other table item by a foreign key.',
    'column-column-fk': 'This column item and the other column item is linked as foreign key.',
    'column-question-star': "This speical column item: '*' links with the question word.",
    'question-column-exactmatch': 'This question word is matched exactly with the column item.',
    'column-question-exactmatch': 'This column item is matched exactly with the question word.',
    'question-table-partialmatch': 'This question word is matched partially with the table item.',
    'table-question-partialmatch': 'This table item is matched partially with the question word.'
}


I didn't see any calls to it in the code. Is this a technique you use to enhance performance?
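From reading constants.py, PROMPT_MAPPING looks like a lookup from each Graphix relation label in GRAPHIX_RELATIONS to a natural-language description of that edge type, i.e. a way to verbalize schema-linking relations as prompts. A minimal sketch of such a lookup follows; whether the released code actually uses the mapping this way is exactly the open question in this issue, and the verbalize helper below is illustrative, not from the codebase:

```python
# Illustrative sketch: verbalize Graphix relation labels via PROMPT_MAPPING.
# Only two entries are reproduced here from the full dict in constants.py.
PROMPT_MAPPING = {
    'table-column-pk': 'This table item contains this column item as a primary key.',
    'column-column-fk': 'This column item and the other column item is linked as foreign key.',
}

def verbalize(relation: str) -> str:
    # fall back to the raw label when a relation has no prompt text
    return PROMPT_MAPPING.get(relation, relation)

print(verbalize('table-column-pk'))
# This table item contains this column item as a primary key.
```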
