
amzn / pecos

PECOS - Prediction for Enormous and Correlated Spaces

Home Page: https://libpecos.org/

License: Apache License 2.0

C++ 27.19% Python 47.93% Makefile 0.13% TeX 0.21% Jupyter Notebook 24.14% Dockerfile 0.14% Shell 0.26%
machine-learning-algorithms extreme-multi-label-classification extreme-multi-label-ranking transformers approximate-nearest-neighbor-search

pecos's Introduction

PECOS - Predictions for Enormous and Correlated Output Spaces


PECOS is a versatile and modular machine learning (ML) framework for fast learning and inference on problems with large output spaces, such as extreme multi-label ranking (XMR) and large-scale retrieval. PECOS' design is intentionally agnostic to the specific nature of the inputs and outputs as it is envisioned to be a general-purpose framework for multiple distinct applications.

Given an input, PECOS identifies a small set (10-100) of relevant outputs from amongst an extremely large (~100MM) candidate set and ranks these outputs in terms of relevance.

Features

Extreme Multi-label Ranking and Classification

  • XR-Linear (pecos.xmc.xlinear): recursive linear models that learn to route an input from the root of a hierarchical label tree to a few leaf node clusters, and return the top-k relevant labels within those clusters as predictions. See more details in the PECOS paper (Yu et al., 2020).

    • fast real-time inference in C++
    • can handle 100MM output space
  • XR-Transformer (pecos.xmc.xtransformer): Transformer-based XMC framework that fine-tunes pre-trained transformers recursively on multi-resolution objectives. It can be used to generate top-k relevant labels for a given instance or simply as a fine-tuning engine for task-aware embeddings. See technical details in the XR-Transformer paper (Zhang et al., 2021).

    • easy to extend with many pre-trained Transformer models from huggingface transformers
    • achieves state-of-the-art results on public XMC benchmarks
  • ANN Search with HNSW (pecos.ann.hnsw): a PECOS Approximate Nearest Neighbor (ANN) search module that implements the Hierarchical Navigable Small World Graphs (HNSW) algorithm (Malkov et al., TPAMI 2018).

    • Supports both sparse and dense input features
    • SIMD optimization for both dense/sparse distance computation
    • Supports thread-safe graph construction in parallel on multi-core shared memory machines
    • Supports thread-safe Searchers to do inference in parallel, which reduces inference overhead

Requirements and Installation

  • Python (3.7, 3.8, 3.9, 3.10)
  • Pip (>=19.3)

See other dependencies in setup.py. You should install PECOS in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

Supported Platforms

  • Ubuntu 20.04 and 22.04
  • Amazon Linux 2

Installation from Wheel

PECOS can be installed using pip as follows:

python3 -m pip install libpecos

Installation from Source

Prerequisite builder tools

  • For Ubuntu (20.04, 22.04):
sudo apt-get update && sudo apt-get install -y build-essential git python3 python3-distutils python3-venv
  • For Amazon Linux 2:
sudo yum -y install python3 python3-devel python3-distutils python3-venv && sudo yum -y groupinstall 'Development Tools'

One needs to install at least one BLAS library to compile PECOS, e.g. OpenBLAS:

  • For Ubuntu (20.04, 22.04):
sudo apt-get install -y libopenblas-dev
  • For Amazon Linux 2:
sudo amazon-linux-extras install epel -y
sudo yum install openblas-devel -y

Install and develop locally

git clone https://github.com/amzn/pecos
cd pecos
python3 -m pip install --editable ./

Quick Tour

For a glimpse of how PECOS works, here is a quick tour of the PECOS API for the XMR problem.

Toy Example

The eXtreme Multi-label Ranking (XMR) problem is defined by two matrices: an instance-to-feature matrix X and an instance-to-label matrix Y.

Some toy data matrices are available in the tst-data folder.
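
As a rough illustration of what such a pair of matrices can look like, here is a minimal sketch using SciPy sparse matrices. The values and shapes are made up for illustration and are not the contents of the tst-data files; float32 is used because the PECOS training code shown later on this page asserts that dtype.

```python
import numpy as np
import scipy.sparse as smat

# Hypothetical toy data: 4 instances, 3 features, 5 labels.
X = smat.csr_matrix(
    np.array(
        [[0.3, 0.0, 0.1],
         [0.0, 0.5, 0.0],
         [0.2, 0.1, 0.8],
         [0.0, 0.0, 0.4]],
        dtype=np.float32,
    )
)

# Row i of Y marks which labels are relevant to instance i.
Y = smat.csr_matrix(
    np.array(
        [[0, 1, 0, 0, 1],
         [1, 0, 0, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 0, 0, 1]],
        dtype=np.float32,
    )
)

# Both matrices must agree on the number of instances (rows).
assert X.shape[0] == Y.shape[0]
print(X.shape, Y.shape)
```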

PECOS constructs a hierarchical label tree and learns linear models recursively (e.g., XR-Linear):

>>> from pecos.xmc.xlinear.model import XLinearModel
>>> from pecos.xmc import Indexer, LabelEmbeddingFactory

# Build a hierarchical label tree and train an XR-Linear model
>>> label_feat = LabelEmbeddingFactory.create(Y, X)
>>> cluster_chain = Indexer.gen(label_feat)
>>> model = XLinearModel.train(X, Y, C=cluster_chain)
>>> model.save("./save-models")

After learning the model, we do prediction and evaluation:

>>> from pecos.utils import smat_util
>>> Yt_pred = model.predict(Xt)
# print precision and recall at k=10
>>> print(smat_util.Metrics.generate(Yt, Yt_pred))

PECOS also offers an optimized C++ implementation for fast real-time inference:

>>> model = XLinearModel.load("./save-models", is_predict_only=True)
>>> for i in range(Xt.shape[0]):
...     y_tst_pred = model.predict(Xt[i], threads=1)

Citation

If you find PECOS useful, please consider citing the following paper:

Some papers from PECOS team:

License

Copyright (2021) Amazon.com, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

pecos's People

Contributors

arunsathiya, bhl00, chepingt, dependabot[bot], elichienxd, hallogameboy, houyuhan98, jianhao2016, jiong-zhang, justindhillon, jybai, lan-lc, lihaoya723, mo-fu, nishant4995, octoberchang, patrick-h-chen, rofuyu, weiliw-amz, xabilahu, xeisberg, xiusic, xuanqing94, xyh97, yangyili001, yaushian, yuhchenlin


pecos's Issues

Add the Ability to Extract the Cluster Label Tree from the Model

I cannot see a way to get the model to easily output the full label tree structure. It is stored internally in a series of matrices, but it's non-trivial to reverse engineer the label tree from that. I'd like to experiment with using the label tree for various tasks, such as training a separate model (so just using PECOS for clustering) or using the clusters themselves as a set of unsupervised features for input to a model. Can you add a feature to allow extraction of the label tree in some easily digestible format?
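
To make the request concrete, here is a minimal sketch of what "extracting the tree" could mean. It assumes, hypothetically, that the tree is available as a list of sparse child-to-parent matrices, one per depth, each with exactly one nonzero per row. The function name parents_per_layer and this layout are assumptions for illustration, not PECOS' documented API or storage format.

```python
import numpy as np
import scipy.sparse as smat

def parents_per_layer(chain):
    """For each child-to-parent matrix, return an array mapping child index -> parent index.

    Assumes each matrix has shape (n_children, n_parents) with one nonzero per row.
    """
    parents = []
    for C in chain:
        C = smat.csr_matrix(C)
        nnz_per_row = np.diff(C.indptr)
        if not np.all(nnz_per_row == 1):
            raise ValueError("each child must belong to exactly one parent")
        # With one nonzero per row, the CSR column indices are the parent ids.
        parents.append(C.indices.copy())
    return parents

# Hypothetical two-level tree: root -> 2 clusters -> 4 labels.
C0 = smat.csr_matrix(np.ones((2, 1), dtype=np.float32))
C1 = smat.csr_matrix(
    np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=np.float32)
)

tree = parents_per_layer([C0, C1])
for depth, p in enumerate(tree, start=1):
    print(f"depth {depth}: parent of each node = {p.tolist()}")
```

A parent-index array per depth is easy to serialize (for example as JSON) and to reuse as cluster features in another model.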

libpecos_float32 library cannot be found in Colab platform

Description

When I extracted part of the code in base.py from the core folder of PECOS and executed it on the Colab platform, the exception "libpecos_float32 library cannot be found and built." occurred.

How to Reproduce?

Run the related Python code on the Colab platform.

Steps to reproduce

After installing the PECOS package on the Colab platform, the package path is shown as "/usr/local/lib/python3.8/dist-packages".
Then run all the code of class corelib from base.py and make sure it executes successfully.
Finally, run the command below.

(Paste the commands you ran that produced the error.)

1. clib = corelib(os.path.join(os.path.dirname(os.path.abspath(pecos.__file__)), "core"), "libpecos")

What have you tried to solve it?

1. I tried to compile libpecos.cpp from the core folder into a shared object file. I was able to compile libpecos.cpp into libpecos.o, but I failed to turn that into a .so file.

Error message or code output

libpecos_float32 library cannot be found and built

After the PECOS package was installed in Colab, I checked the package folder but could not find the core folder under "/usr/local/lib/python3.8/dist-packages". I also printed the path of the .so file, which is "/usr/local/lib/python3.8/dist-packages/pecos/core/libpecos_float32". As a result, I think the error is caused by a missing .so file or a wrong path. Could you provide the .so file directly on GitHub? I would be happy to upload it to Colab myself and load it with the ctypes.CDLL function. Thank you.

Environment

  • Operating system: Linux
  • Python version: 3.8.10
  • PECOS version: 0.2.0

(Add as much information about your environment as possible, e.g. dependencies versions.)

Examples with text

Description

The current example of X and Y only has numeric values. Could you please provide an example where X and Y are both text? I think the paper/method is targeted at solving such problems.

Test Bug report issue

Description

test bug report, do not respond

How to Reproduce?

click on issue and choose bug report

Steps to reproduce

None

(Paste the commands you ran that produced the error.)

  1. write bug report
  2. submit

What have you tried to solve it?

Error message or code output

(Paste the complete error message, including stack trace, or the undesired output that the above snippet produces.)

None

Environment

  • Operating system: Mac OS
  • Python version: 3.7
  • PECOS version: 0.1

(Add as much information about your environment as possible, e.g. dependencies versions.)

Validation set split

Hi,

I am trying to replicate the benchmark result for Wiki10-31K using xtransformer and xlinear.
I wonder whether there is an internal step that splits the training set into a validation set. Otherwise, should I pre-process the data to generate a held-out validation set before training?

Thanks!

Format of yt label

Hello,

I hope you are well. I have two questions about the format.

Question 1:
What is the optimal format for the label matrix Yt?
Is it preferable to have Yt as:

(A) One-hot encoded, with only one 1 per row.
(B) Multi-hot encoded, with multiple 1s per row (as is the case for Xt).

When the prediction is done, it seems to output only one 1 per row.

Question 2:

Is there any constraint on Xt being a mix of dense and sparse input
instead of sparse input only?
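
For reference, option (B) can be sketched as follows. The helper labels_to_csr is hypothetical (it is not part of PECOS) and only shows how a multi-hot matrix with several 1s per row can be built from per-instance lists of label ids.

```python
import numpy as np
import scipy.sparse as smat

def labels_to_csr(label_lists, nr_labels):
    """Build a multi-hot (nr_inst, nr_labels) float32 CSR matrix from lists of label ids."""
    indptr = [0]
    indices = []
    for labels in label_lists:
        # De-duplicate and sort so each row has clean, ordered column indices.
        indices.extend(sorted(set(labels)))
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.float32)
    return smat.csr_matrix(
        (data, np.array(indices, dtype=np.int32), np.array(indptr, dtype=np.int32)),
        shape=(len(label_lists), nr_labels),
    )

# Three instances with 2, 1 and 3 relevant labels respectively.
Yt = labels_to_csr([[0, 3], [2], [1, 2, 4]], nr_labels=5)
print(Yt.toarray())
```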

ScipyCsrF32 Issue

I am trying to test the linear model on a small sample. The code is as follows:

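import numpy as np
import scipy.sparse
from pecos.xmc.xlinear.model import XLinearModel

X_trn = np.array([[0.3, 0.5, 0.1],[0.2, 0.1, 0.8]])
Y_trn = np.array([[0.0, 0.0, 0.0, 1.0],[1.0, 0.0, 0.0, 0.0]])

# sA and sB were not defined in the original snippet; presumably sparse copies of the arrays above
sA = scipy.sparse.csr_matrix(X_trn)
sB = scipy.sparse.csr_matrix(Y_trn)
scipy.sparse.save_npz('temp_xtrn.npz', sA)
scipy.sparse.save_npz('temp_ytrn.npz', sB)
X = XLinearModel.load_feature_matrix("temp_xtrn.npz")
Y = XLinearModel.load_label_matrix("temp_ytrn.npz", for_training=True)
xlm = XLinearModel.train(X, Y)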

Running the last line produces the following error:
517 """
518 dtype = np.float32
--> 519 assert X.dtype == dtype
520 assert Y.dtype == dtype
521 if isinstance(X, (smat.csr_matrix, ScipyCsrF32)):

AssertionError:

Any idea what exactly is meant by using SciPy here? If the problem comes from this, how should I use SciPy so that it works for this arbitrary example?
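
The assertion in the traceback compares X.dtype with np.float32, while NumPy arrays built from Python float literals default to float64. A likely fix, sketched here under that assumption and not confirmed by the maintainers, is to cast both matrices to float32 before saving or training:

```python
import numpy as np
import scipy.sparse as smat

X_trn = np.array([[0.3, 0.5, 0.1], [0.2, 0.1, 0.8]])
Y_trn = np.array([[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]])
print(X_trn.dtype)  # float64, NumPy's default for float literals

# Cast to float32 while converting to sparse format.
X = smat.csr_matrix(X_trn, dtype=np.float32)
Y = smat.csr_matrix(Y_trn, dtype=np.float32)

# These mirror the dtype checks shown in the traceback above.
assert X.dtype == np.float32
assert Y.dtype == np.float32
```

An existing sparse matrix can be converted the same way with `.astype(np.float32)`.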

Thanks

How to use GIANT-XRT model to pre-train my own data

Description

I want to obtain an embedding for each node using the GIANT-XRT model rather than a BERT model, since the paper shows it to be more efficient, and I also have the edges between nodes. I emailed the author, who helped me a lot, and I really appreciate it. These are the steps so far:

  • The first step is to prepare the graph, edge_index: this is a 2×|E| tensor that holds the edges of the underlying graph. For example, edge_index[:,0] is the first edge, and edge_index[0,0] and edge_index[1,0] are the starting and ending node ids of that edge respectively. Once you are done, modify lines 29-32 accordingly.
  • Next, I prepared the raw text of each node. It is a text file in which line "n" contains the raw text of the node with id "n". I put it at the path given by "args.raw-text-path".
  • I have run the complete training for ogbn-arxiv successfully and got 5 files, "Y.trn.npz, Y.all.npz, X.trn.txt, X.all.tfidf.npz, X.trn.tfidf.npz", but I am still confused because they are not what I am after. The file "X.all.xrt-emb.npy" seems to be what I want, but it is generated by XR-Transformer. Should I run XR-Transformer?

XR-Transformer prediction step

After training the XTransformer model, I have been trying to generate the predictions as follows:

test_texts, test_texts_rpr = ...
# test_texts: list of 34681 texts
# test_text_rpr: csr_matrix of shape (34681, 52172)

model = self._load_model(...) # XTransformer.load("model")
prediction = model.predict(test_texts, test_texts_rpr)

however, I end up getting the following error:

[2022-08-01 04:51:11,280][pecos.xmc.xtransformer.model][INFO] - Full model loaded from resource/model_checkpoint/XR-TFMR_EURLEX57K_0
[2022-08-01 04:51:11,280][pecos.utils.torch_util][INFO] - Setting device to cuda, number of active GPUs: 1
[2022-08-01 04:51:11,947][pecos.xmc.xtransformer.matcher][INFO] - ***** Encoding data len=34681 truncation=128*****
[2022-08-01 04:51:14,547][pecos.xmc.xtransformer.matcher][INFO] - ***** Finished with time cost=2.5997776985168457 *****
[2022-08-01 04:51:14,547][pecos.xmc.xtransformer.matcher][INFO] - Predict on input text tensors(torch.Size([34681, 128]))
Error executing job with overrides: ['tasks=[predict]', 'data.folds=[0]']
Traceback (most recent call last):
  File "main.py", line 29, in perform_tasks
    predict(params)
  File "main.py", line 15, in predict
    predict_helper.perform_predict()
  File "/home/celso/projects/XMTC-Baselines/source/helper/PredictHelper.py", line 68, in perform_predict
    prediction = model.predict(texts, texts_rpr)
  File "/home/celso/projects/venvs/XMTC-Baselines/lib/python3.8/site-packages/pecos/xmc/xtransformer/model.py", line 588, in predict
    pred_csr = self.concat_model.predict(
  File "/home/celso/projects/venvs/XMTC-Baselines/lib/python3.8/site-packages/pecos/xmc/xlinear/model.py", line 503, in predict
    Y_pred = self.model.predict(
  File "/home/celso/projects/venvs/XMTC-Baselines/lib/python3.8/site-packages/pecos/xmc/base.py", line 1491, in predict
    assert X.shape[1] == self.nr_features
AssertionError

Must X and Xt have the same number of features?
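
The failing assertion compares X.shape[1] with the model's nr_features, so the sparse features passed at prediction time need the same number of columns as the ones used for training. One common way to guarantee this is to build the vocabulary on the training texts only and reuse it for the test texts. The toy bag-of-words vectorizer below is a hypothetical stand-in for whatever featurizer is actually used:

```python
import numpy as np
import scipy.sparse as smat

def build_vocab(texts):
    """Assign a column index to every token seen in the training texts."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(texts, vocab):
    """Bag-of-words counts with one column per vocabulary entry."""
    rows, cols = [], []
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            j = vocab.get(tok)
            if j is not None:  # tokens unseen at training time are dropped
                rows.append(i)
                cols.append(j)
    data = np.ones(len(rows), dtype=np.float32)
    idx = (np.array(rows, dtype=np.int32), np.array(cols, dtype=np.int32))
    return smat.csr_matrix((data, idx), shape=(len(texts), len(vocab)))

train_texts = ["red apple", "green pear"]
test_texts = ["red pear", "blue plum"]

vocab = build_vocab(train_texts)          # fitted on training data only
X_trn = vectorize(train_texts, vocab)
X_tst = vectorize(test_texts, vocab)      # same vocabulary, same column count

assert X_tst.shape[1] == X_trn.shape[1]
print(X_trn.shape, X_tst.shape)
```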

XTransformer model can't be trained it always shows cuda out of memory

"CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.96 GiB total capacity; 1.40 GiB already allocated; 6.50 MiB free; 1.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF". This error always appears after some time of execution.

Pecos killed on ranker training step

Description

The training has been killed on this training step:

Data: Amazon-670k
Model: X-Transformer

[2022-12-01 21:38:23,019][pecos.xmc.xtransformer.model][INFO] - Start training ranker...
[2022-12-01 21:38:24,001][pecos.xmc.base][INFO] - Training Layer 0 of 4 Layers in HierarchicalMLModel, neg_mining=tfn..
[2022-12-01 21:39:05,191][pecos.xmc.base][INFO] - Training Layer 1 of 4 Layers in HierarchicalMLModel, neg_mining=tfn..
[2022-12-01 21:40:24,829][pecos.xmc.base][INFO] - Training Layer 2 of 4 Layers in HierarchicalMLModel, neg_mining=tfn..
[2022-12-01 21:43:25,293][pecos.xmc.base][INFO] - Training Layer 3 of 4 Layers in HierarchicalMLModel, neg_mining=tfn+man..

Environment

Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic
Python 3.8.15
libpecos~=0.4.0
1 RTX A4500, 32 vCPU, and 250 GB RAM

What could it be? Is it possible to resume training from that stage?

I would love support for Cost Sensitive Classification

Description

Reading through the code, it looks like the base XR-Linear models support assigning a cost to each data point so that they can be weighted differently, e.g. by misclassification cost. In a search and ranking setting, some labels may be more important than others, e.g. applications are more important than clicks for a job board, or movies watched are more important than movies clicked on for a streaming service. However, the more important class is often much sparser, so being able to use both kinds of positive instances, one as strong positives and one as weak positives, would improve the utility of the model.

Right now, all training instances are treated equally. Either a weight matrix of instance weights, or a labelling scheme where training data points have different labels and each label has a relative weight, would meet this need. The code references an R matrix as input to MLModel.train, but it cannot be passed in from the trainer.

Comments in MLProblem.__init__:

Args:
            X (csr_matrix, np.ndarray or ScipyDrmF32): Instance feature matrix.
            Y (csr_matrix, np.ndarray or ScipyCscF32): Instance-to-label matrix.
            C (csc_matrix, np.ndarray or ScipyCscF32, optional): Label-to-cluster matrix.
                If not given, create an all-one matrix of shape `(Y.shape[1], 1)`.
            M (csc_matrix, np.ndarray or ScipyCscF32, optional): Instance-to-cluster matrix.
                If not given, creates M from Y*C with multi-threading sparse_matmul.
            R (csc_matrix, np.ndarray or ScipyCscF32, optional): Relevance matrix.
                If not given, will use None.
            threads(int, optional): Number of threads for multi-threading. Default to 8.

I would like to be able to pass in R. However in HierarchicalMLModel.train (line 1186 in xmc/base.py) the following code prevents this (and it's not exposed in calling classes):

        if prob.R is not None:
            raise NotImplementedError(
                "Cost-senstive learning for HierarchicalMLModel is not yet supported"
            )

Is this just a case of exposing the variable to the different train classes in the parents? If so, I would happily implement that myself (and open a PR), but I was thinking there was probably a reason for it to be hidden. If it's that simple, please let me know here and I will work on adding it.
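
To illustrate the kind of input being asked for, here is a minimal sketch of a relevance matrix R that shares the sparsity pattern of Y but stores a weight per positive. The event log, the event kinds and the weight values are all hypothetical, and per the check quoted above HierarchicalMLModel.train currently rejects a non-None R.

```python
import numpy as np
import scipy.sparse as smat

# Hypothetical interaction log: (instance id, label id, kind of positive).
events = [(0, 1, "apply"), (0, 3, "click"), (1, 0, "click"), (2, 3, "apply")]
weights = {"apply": 1.0, "click": 0.2}  # illustrative weights, not tuned values

rows = np.array([e[0] for e in events], dtype=np.int32)
cols = np.array([e[1] for e in events], dtype=np.int32)
shape = (3, 4)

# Y marks every positive with 1, regardless of strength.
Y = smat.csr_matrix((np.ones(len(events), dtype=np.float32), (rows, cols)), shape=shape)

# R has the same nonzero positions as Y but stores the relative weight.
r_vals = np.array([weights[e[2]] for e in events], dtype=np.float32)
R = smat.csr_matrix((r_vals, (rows, cols)), shape=shape)

assert np.array_equal(Y.indices, R.indices) and np.array_equal(Y.indptr, R.indptr)
print(R.toarray())
```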

Can't reproduce XR-Transformer Neurips results

Trying to reproduce XR-Transformer results gives me the following error during training and predicting:

02/17/2023 08:39:04 - INFO - __main__ - Setting random seed 0
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 582, in <module>
    do_train(args)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 495, in do_train
    X_trn = smat_util.load_matrix(args.trn_feat_path, dtype=np.float32)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/utils/smat_util.py", line 117, in load_matrix
    mat = np.load(src)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/numpy/lib/npyio.py", line 407, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: './xmc-base/wiki10-31k//tfidf-attnxml/X.trn.npz'
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/xmc/xtransformer/predict.py", line 176, in <module>
    do_predict(args)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/xmc/xtransformer/predict.py", line 145, in do_predict
    xtf = XTransformer.load(args.model_folder)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/xmc/xtransformer/model.py", line 195, in load
    text_encoder = TransformerMatcher.load(os.path.join(load_dir, "text_encoder"))
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 359, in load
    raise ValueError(f"text_encoder does not exist at {encoder_dir}")
ValueError: text_encoder does not exist at models/wiki10-31k/bert/text_encoder/text_encoder
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/xmc/xlinear/evaluate.py", line 72, in <module>
    do_evaluation(args)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/xmc/xlinear/evaluate.py", line 62, in do_evaluation
    Y_true = smat_util.load_matrix(args.truth_path).tocsr()
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/utils/smat_util.py", line 117, in load_matrix
    mat = np.load(src)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/numpy/lib/npyio.py", line 407, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: './xmc-base/wiki10-31k//Y.tst.npz'
Traceback (most recent call last):
  File "ensemble_evaluate.py", line 58, in <module>
    do_evaluation(args)
  File "ensemble_evaluate.py", line 49, in do_evaluation
    Y_true = sorted_csr(load_matrix(args.truth_path).tocsr())
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/pecos/utils/smat_util.py", line 117, in load_matrix
    mat = np.load(src)
  File "/home/celso/projects/venvs/pecos/lib/python3.8/site-packages/numpy/lib/npyio.py", line 407, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: './xmc-base/wiki10-31k//Y.tst.npz'

XR-transformer model hyperparameters setting

Thank you for the great work. I have several questions about the paper
Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification (https://arxiv.org/pdf/2110.00685.pdf).
The first one is about the model hyperparameter settings. In the appendix of the paper, Table 7 lists the model hyperparameters. In particular, $HLT_{prelim}$ and $HLT_{refine}$ define the structures of the preliminary and refined hierarchical label trees. For example, for the dataset Eurlex-4K, $HLT_{prelim}$ is set to 16-256-3956. What does this mean? Does 16 represent the number of clusters at the first layer? Moreover, what is the relationship between the parameter nr_splits and 16-256-3956? Last but not least, why did you choose these settings? https://github.com/amzn/pecos/tree/mainline/examples/xr-transformer-neurips21/params

Each instance in my dataset has 48 labels. How should I set or tune nr_splits and other important parameters such as $\alpha$, $\lambda$, etc.?

Switch back to X-Transformer

Description

I noticed that the X-Transformer entry point is now also used by XR-Transformer. Is there an option to choose which algorithm is used under the hood?

Prepare training data

Hi @jiong-zhang, we're testing PECOS on our dataset and we ran into a problem preparing training data because of our limited labelling capacity.
In which of these cases does PECOS perform better?

  1. 50000 samples with single label
  2. 10000 samples with average 5 labels

Moreover, is it necessary to fully label each sample?

Thanks.

How to Use XR-Transformer in Text2Text App

Description

I want to use XR-Transformer in the text2text app, following the parameters given here. But setting --params-path to this .json file raises the error:

Traceback (most recent call last):
  File "/home/huziyuan/miniconda3/envs/huggingface/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/huziyuan/miniconda3/envs/huggingface/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/huziyuan/miniconda3/envs/huggingface/lib/python3.9/site-packages/pecos/apps/text2text/train.py", line 345, in <module>
    train(args)
  File "/home/huziyuan/miniconda3/envs/huggingface/lib/python3.9/site-packages/pecos/apps/text2text/train.py", line 328, in train
    t2t_model = Text2Text.train(
  File "/home/huziyuan/miniconda3/envs/huggingface/lib/python3.9/site-packages/pecos/apps/text2text/model.py", line 317, in train
    pred_params = pred_params.override_with_kwargs(kwargs)
  File "/home/huziyuan/miniconda3/envs/huggingface/lib/python3.9/site-packages/pecos/apps/text2text/model.py", line 126, in override_with_kwargs
    self.xlinear_params.override_with_kwargs(pred_kwargs)
AttributeError: 'NoneType' object has no attribute 'override_with_kwargs'

References

Online Inference Latency for XR-TRANSFORMER

Hi!

When I use XR-Transformer to predict one input at a time, the online inference latency reaches 400 ms. Why is this?

The system I use is Ubuntu 18.04, and XR-Transformer is evaluated on an Nvidia Tesla V100 GPU.

Thanks!

bug of installing from source

Description

There are some problems when installing PECOS from source according to readme.md.

How to Reproduce?

python3 -m pip install --editable ./
Obtaining file:///home/workspace/lishengchao/pecos
Requirement already satisfied: scipy>=1.4.1 in /opt/conda/lib/python3.8/site-packages (from libpecos==0.3.0) (1.6.1)
Requirement already satisfied: scikit-learn>=0.24.1 in /opt/conda/lib/python3.8/site-packages (from libpecos==0.3.0) (0.24.1)
Requirement already satisfied: torch>=1.8.0 in /opt/conda/lib/python3.8/site-packages (from libpecos==0.3.0) (1.8.0)
Collecting sentencepiece!=0.1.92,>=0.1.86
Using cached https://repo.huaweicloud.com/repository/pypi/packages/68/91/ded0f64f90abfc5413c620fc345a0aef1e7ff5addda8704cc6b3bf589c64/sentencepiece-0.1.96-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Requirement already satisfied: transformers>=4.1.1 in /opt/conda/lib/python3.8/site-packages (from libpecos==0.3.0) (4.8.2)
Collecting numpy>=1.19.5
Using cached https://repo.huaweicloud.com/repository/pypi/packages/38/c0/c45c5eb0e25247d5fbb333fd0b56e570ba21cf0e3dca3abad174fb780e8c/numpy-1.22.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.8/site-packages (from scikit-learn>=0.24.1->libpecos==0.3.0) (2.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.8/site-packages (from scikit-learn>=0.24.1->libpecos==0.3.0) (1.0.1)
Requirement already satisfied: typing_extensions in /opt/conda/lib/python3.8/site-packages (from torch>=1.8.0->libpecos==0.3.0) (3.7.4.3)
Collecting huggingface-hub==0.0.12
Downloading https://repo.huaweicloud.com/repository/pypi/packages/2f/ee/97e253668fda9b17e968b3f97b2f8e53aa0127e8807d24a547687423fe0b/huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers>=4.1.1->libpecos==0.3.0) (2021.4.4)
Requirement already satisfied: sacremoses in /opt/conda/lib/python3.8/site-packages (from transformers>=4.1.1->libpecos==0.3.0) (0.0.45)
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers>=4.1.1->libpecos==0.3.0) (2.24.0)
Requirement already satisfied: packaging in /opt/conda/lib/python3.8/site-packages (from transformers>=4.1.1->libpecos==0.3.0) (21.3)
Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /opt/conda/lib/python3.8/site-packages (from transformers>=4.1.1->libpecos==0.3.0) (0.10.3)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers>=4.1.1->libpecos==0.3.0) (4.62.3)
Requirement already satisfied: filelock in /opt/conda/lib/python3.8/site-packages (from transformers>=4.1.1->libpecos==0.3.0) (3.0.12)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.8/site-packages (from transformers>=4.1.1->libpecos==0.3.0) (5.4.1)
Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers>=4.1.1->libpecos==0.3.0) (1.15.0)
Requirement already satisfied: click in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers>=4.1.1->libpecos==0.3.0) (7.1.2)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers>=4.1.1->libpecos==0.3.0) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers>=4.1.1->libpecos==0.3.0) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers>=4.1.1->libpecos==0.3.0) (1.25.11)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers>=4.1.1->libpecos==0.3.0) (2020.12.5)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging->transformers>=4.1.1->libpecos==0.3.0) (3.0.6)
Installing collected packages: sentencepiece, numpy, libpecos, huggingface-hub
Attempting uninstall: numpy
Found existing installation: numpy 1.19.2
Uninstalling numpy-1.19.2:
Successfully uninstalled numpy-1.19.2
Running setup.py develop for libpecos
ERROR: Command errored out with exit status 1:
command: /opt/conda/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/workspace/lishengchao/pecos/setup.py'"'"'; file='"'"'/home/workspace/lishengchao/pecos/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps
cwd: /home/workspace/lishengchao/pecos/
Complete output (28 lines):
Set version to 0.3.0
running develop
running egg_info
creating libpecos.egg-info
writing libpecos.egg-info/PKG-INFO
writing dependency_links to libpecos.egg-info/dependency_links.txt
writing requirements to libpecos.egg-info/requires.txt
writing top-level names to libpecos.egg-info/top_level.txt
writing manifest file 'libpecos.egg-info/SOURCES.txt'
reading manifest file 'libpecos.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.c' under directory 'pecos/core'
writing manifest file 'libpecos.egg-info/SOURCES.txt'
running build_ext
building 'pecos.core.libpecos_float32' extension
INFO: C compiler: gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC

creating build
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/pecos
creating build/temp.linux-x86_64-3.8/pecos/core
INFO: compile options: '-Ipecos/core -I/usr/include/ -I/usr/local/include -I/opt/conda/include/python3.8 -c'
extra options: '-fopenmp -O3 -std=c++14'
INFO: gcc: pecos/core/libpecos.cpp
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/tmp/ccNJQf5g.s: Assembler messages:
/tmp/ccNJQf5g.s: Fatal error: can't close build/temp.linux-x86_64-3.8/pecos/core/libpecos.o: Input/output error
error: Command "gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Ipecos/core -I/usr/include/ -I/usr/local/include -I/opt/conda/include/python3.8 -c pecos/core/libpecos.cpp -o build/temp.linux-x86_64-3.8/pecos/core/libpecos.o -fopenmp -O3 -std=c++14" failed with exit status 1
----------------------------------------

ERROR: Command errored out with exit status 1: /opt/conda/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/workspace/lishengchao/pecos/setup.py'"'"'; file='"'"'/home/workspace/lishengchao/pecos/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

Environment

  • Ubuntu 18.04
  • Python 3.8
  • PECOS 0.3.0


Error when installing from .zip file

Description

I downloaded the repository from GitHub as a zip file and tried to install it in a local virtual environment. The install fails with the following error:
...
AssertionError: This does not appear to be a Git repository.
.....
Command errored out with exit status 1: python.....

How to Reproduce?

pip install pecos-mainline.zip

some formatting

Hi,
Thanks for this.

I would just like to confirm the format of the inputs.

X: CSR format, x(i,k) = valx. Can valx be a float? Does it need to be binary or a value in [0,1]?

Y: CSR format, y(i,k) = valy. Does it need to be binary (0 or 1)?

Thx
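
For reference, here is a minimal sketch of what the two CSR inputs hold, written in plain Python so the layout is visible. It does not settle whether PECOS requires binary values; `dense_to_csr` and the toy matrices are made up for illustration. The three arrays it returns are the same ones that `scipy.sparse.csr_matrix((data, indices, indptr), shape=...)` accepts.

```python
from typing import List, Tuple


def dense_to_csr(rows: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Return the (data, indices, indptr) arrays of the CSR form of `rows`."""
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for row in rows:
        for col, val in enumerate(row):
            if val != 0:
                data.append(float(val))   # stored value, here x(i,k) or y(i,k)
                indices.append(col)       # column k of that value
        indptr.append(len(data))          # row i spans data[indptr[i]:indptr[i + 1]]
    return data, indices, indptr


# X: 2 instances x 4 features, real-valued (for example TF-IDF weights).
X_data, X_indices, X_indptr = dense_to_csr([[0.0, 0.5, 0.0, 1.2],
                                            [0.3, 0.0, 0.0, 0.0]])

# Y: 2 instances x 3 labels, where 1.0 marks a relevant label.
Y_data, Y_indices, Y_indptr = dense_to_csr([[1, 0, 1],
                                            [0, 1, 0]])
```

Only the non-zero entries are stored, so a float-valued X and a 0/1-valued Y use exactly the same container.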

Test feature request issue

Description

Test feature request, do not respond

References

  • None

Papers100M model requirement

Description

Hi!
I'm doing some research with the GIANT code. I have used the embeddings of papers100M and arxiv, as well as the encoder model of the latter, but the encoder model for papers100M does not seem to be available right now. Could you publish the model (for example, by providing a download link)? That would be really helpful.
Thanks!

A problem when using the pecos model to train xtransformer

Description

When I train XTransformer with the pecos model, a training error occurs in the matcher stage.
The dataset has 108457 instances and the hierarchical label tree is [32, 1102]. Training the first layer of the label tree works. On the second layer, after matcher fine-tuning completes, the run gets stuck when predicting on the training data (see pecos.xmc.xtransformer.matcher).

I think this is caused by my training data set being too large, so I modified this code snippet in pecos.xmc.xtransformer.matcher:

P_trn, inst_embeddings = matcher.predict(
    prob.X_text,
    csr_codes=csr_codes,
    pred_params=pred_params,
    batch_size=train_params.batch_size,
    batch_gen_workers=train_params.batch_gen_workers,
    max_pred_chunk=30000,
)

But then another problem occurred; see the training log below.

05/08/2023 10:31:56 - INFO - pecos.xmc.xtransformer.matcher - Reload the best checkpoint from /tmp/tmp0kdzh7n5
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.31423333333333
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.2335
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
Traceback (most recent call last):
File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 564, in
do_train(args)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 548, in do_train
xtf = XTransformer.train(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/model.py", line 447, in train
res_dict = TransformerMatcher.train(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 1402, in train
P_trn, inst_embeddings = matcher.predict(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 662, in predict
cur_P, cur_embedding = self._predict(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 812, in _predict
cur_act_labels = csr_codes_next[inputs["instance_number"].cpu()]
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 47, in getitem
row, col = self._validate_indices(key)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 159, in _validate_indices
row = self._asindices(row, M)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 191, in _asindices
raise IndexError('index (%d) out of range' % max_indx)
IndexError: index (30255) out of range

I'm not sure if this is a bug. Can you give me some advice? Thanks!

Environment

  • Operating system: Ubuntu 20.04.4 LTS container
  • Python version: Python 3.8.16
  • PECOS version: libpecos 1.0.0
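
One possible reading of the traceback, not confirmed against the PECOS source: csr_codes_next has 30000 rows (one chunk), while the failing index 30255 looks like a dataset-wide instance number coming from the second chunk. The sketch below reproduces that kind of mismatch with plain lists; `rows_for_chunk` and every other name in it are hypothetical.

```python
from typing import List


def rows_for_chunk(chunk_rows: List[str], instance_numbers: List[int], chunk_start: int) -> List[str]:
    """Look up rows of one chunk using dataset-wide instance numbers."""
    local = [i - chunk_start for i in instance_numbers]  # shift to chunk-local positions
    if any(i < 0 or i >= len(chunk_rows) for i in local):
        raise IndexError("instance number falls outside this chunk")
    return [chunk_rows[i] for i in local]


all_rows = [f"codes_{i}" for i in range(10)]   # stand-in for the rows of csr_codes
chunk_size = 4
chunk_start = 4
chunk = all_rows[chunk_start:chunk_start + chunk_size]  # second chunk: rows 4..7
instance_numbers = [4, 5, 6, 7]                # dataset-wide numbering

# Indexing the chunk directly with dataset-wide numbers overruns it.
try:
    [chunk[i] for i in instance_numbers]
    failed = False
except IndexError:
    failed = True

# Shifting by the chunk offset picks the intended rows.
picked = rows_for_chunk(chunk, instance_numbers, chunk_start)
```

If this reading is right, the first chunk would succeed (offset 0) and the second would fail, which matches the order of events in the log.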

XTransformer Bug: cuda failed

Description

Training the XTransformer model fails on GPU. I ran the training command as instructed in https://github.com/amzn/pecos/blob/mainline/pecos/xmc/xtransformer/README.md . However, I hit this error: "RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select". When I use the disable-gpu option, the code runs. So I wonder whether there is a bug in the GPU utility code of XTransformer. Thanks!

How to Reproduce?

Steps to reproduce

python3 -m pecos.xmc.xtransformer.train --trn-text-path ${X_txt_path} \
                                            --trn-feat-path ${X_path}  \
                                            --trn-label-path ${Y_path} \
                                            --model-dir ${model_dir}


What have you tried to solve it?

Error message or code output

 File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/pecos/xmc/xtransformer/model.py", line 375, in train
    return_dict=True,
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/pecos/xmc/xtransformer/matcher.py", line 1333, in train
    matcher.fine_tune_encoder(prob, val_prob=val_prob, val_csr_codes=val_csr_codes)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/pecos/xmc/xtransformer/matcher.py", line 1079, in fine_tune_encoder
    label_embedding=(text_model_W_seq, text_model_b_seq),
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/pecos/xmc/xtransformer/network.py", line 234, in forward
    inputs_embeds=inputs_embeds,
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py", line 989, in forward
    past_key_values_length=past_key_values_length,
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py", line 215, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

Environment

  • Operating system: ubuntu
  • Python version: 3.6
  • Pytorch version: 1.5.1

text example error

I tried to run the following command:
"
python3 -m pecos.apps.text2text.train
--input-text-path ./training-data.txt
--vectorizer-config-path ./config.json
--output-item-path ./output-labels.txt
--model-folder ./pecos-text2text-model
"

and got an error saying: No such file or directory: ' ./config,json'

Can you help?

Is there at least one example showing how to use Pecos from a plain text dataset?

It has been difficult to work out how to use PECOS properly. The usage examples are split across several README.md files and the issues.

Could you provide a toy end-to-end example (using XR-Transformer, for instance)?

Consider the following scenario: We have the training and testing samples in plain text

#train samples:
    text: raw_text_1, labels: [L1, L7, ..., L3]
    text: raw_text_2, labels: [L8, L9]
    ...
    text: raw_text_N, labels: [L1, L7, ..., L4]

#test samples:
    text: test_raw_text_1
    text: test_raw_text_2
    ...
    text: test_raw_text_M

and someone has to:

  1. prepare the data to the accepted format;
  2. train the model;
  3. predict the top k labels.
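
As a partial illustration of step 1 only, here is a sketch that turns (text, labels) pairs into the line format that another issue on this page describes for the text2text app: comma-separated label ids, a tab, then the text, with ids referring to line numbers in the label file. Treating those line numbers as zero-based is an assumption, and `to_text2text_lines` is a hypothetical helper, not part of PECOS. Steps 2 and 3 are not covered.

```python
from typing import Dict, List, Tuple


def to_text2text_lines(samples: List[Tuple[str, List[str]]]) -> Tuple[List[str], List[str]]:
    """Return (training lines, label list) for a list of (text, labels) samples."""
    label_to_id: Dict[str, int] = {}
    lines: List[str] = []
    for text, labels in samples:
        ids = []
        for label in labels:
            # Ids are assigned in order of first appearance, so id == line number in the label file.
            ids.append(label_to_id.setdefault(label, len(label_to_id)))
        clean = " ".join(text.split())  # collapse tabs/newlines so each sample stays on one line
        lines.append(",".join(str(i) for i in ids) + "\t" + clean)
    label_lines = sorted(label_to_id, key=label_to_id.get)
    return lines, label_lines


train = [
    ("raw text 1", ["L1", "L7", "L3"]),
    ("raw text 2", ["L8", "L9"]),
]
train_lines, label_lines = to_text2text_lines(train)
# Write train_lines to a file such as training-data.txt and label_lines to
# output-labels.txt, one entry per line, before calling pecos.apps.text2text.train.
```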

Issue with --label-embed-type pifa_lf_concat::Z=${Z_pifa_file}

Description

I am trying to use the --label-embed-type parameter in training, and it produces this error, raised from np.load():
ValueError: Object arrays cannot be loaded when allow_pickle=False

I tested loading the NPZ file for z_labels (both compressed and uncompressed); it produces this error if allow_pickle=False.
The data loads when I pass allow_pickle=True to np.load().

Could you please add a description of this file format, or can we pass this parameter as an input?

This is the data I get after loading the npz file with allow_pickle=True:

[array(['Trump', 'Bus', 'Trolly '], dtype='<U23')
 array(['Show', 'Disp'], dtype='<U20')
 array(['Recap rew'], dtype='<U24')
 array(['Core, '], dtype='<U32')
 array(['Hoe'], dtype='<U10')
 array(['Plan'], dtype='<U21')]
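
The error itself is standard NumPy behaviour rather than something specific to PECOS: arrays with dtype=object, like the data shown above, are stored via pickle, and np.load refuses them unless allow_pickle=True. The sketch below reproduces that with in-memory buffers. The numeric matrix Z at the end is an assumption about what a label feature file might contain, not a statement of the format PECOS expects.

```python
import io

import numpy as np

# An array holding variable-length string arrays gets dtype=object.
labels = np.empty(2, dtype=object)
labels[0] = np.array(["Trump", "Bus"])
labels[1] = np.array(["Show"])

buf = io.BytesIO()
np.save(buf, labels, allow_pickle=True)   # object arrays are written via pickle
buf.seek(0)
try:
    np.load(buf)                          # allow_pickle defaults to False
    object_load_failed = False
except ValueError:
    object_load_failed = True

# A purely numeric array round-trips without pickle.
Z = np.array([[0.0, 1.5], [2.0, 0.0]], dtype=np.float32)
buf = io.BytesIO()
np.save(buf, Z)
buf.seek(0)
Z_loaded = np.load(buf)
```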

How to Reproduce?

Execute model training with numpy version 1.21.2

python -m pecos.apps.text2text.train \
  --label-embed-type pifa_lf_concat::Z=${Z_pifa_file} \
  -i ${train_file} \
  -m ${model_folder}

Error message or code output

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jupyter/pecos_git/pecos/pecos/apps/text2text/train.py", line 311, in <module>
    train(args)
  File "/home/jupyter/pecos_git/pecos/pecos/apps/text2text/train.py", line 302, in train
    workspace_folder=args.workspace_folder,
  File "/home/jupyter/pecos_git/pecos/pecos/apps/text2text/model.py", line 325, in train
    Z = smat_util.load_matrix(val)
  File "/home/jupyter/pecos_git/pecos/pecos/utils/smat_util.py", line 117, in load_matrix
    mat = np.load(src)
  File "/opt/conda/lib/python3.7/site-packages/numpy/lib/npyio.py", line 441, in load
    pickle_kwargs=pickle_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/numpy/lib/format.py", line 743, in read_array
    raise ValueError("Object arrays cannot be loaded when "
ValueError: Object arrays cannot be loaded when allow_pickle=False

Environment

  • Operating system: Unix Ubuntu (on GCP)
  • Python version: 3.8
  • PECOS version: 0.1.0
  • numpy version: 1.21.2

MacOS support

Hi all, this product is amazing!!

I was wondering if there is any trick to make it available on macOS M1 systems.

Thank you :)

Text2Text Errors out if label-embed-type is not 'pifa'

Description

I am trying to train a text2text model using label embeddings other than just the PIFA embeddings. Ideally I'd like to use tf-idf features computed over the labels, which is one of the options described in the paper. When I tried using

--label-embed-type pifa_lf_concat

I got the following error, because the code does not pass the label feature matrix Z to the label embedding logic:

Traceback (most recent call last):
  File "/opt/conda/envs/pecos/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/pecos/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/pecos/lib/python3.8/site-packages/pecos/apps/text2text/train.py", line 308, in <module>
    train(args)
  File "/opt/conda/envs/pecos/lib/python3.8/site-packages/pecos/apps/text2text/train.py", line 277, in train
    t2t_model = Text2Text.train(
  File "/opt/conda/envs/pecos/lib/python3.8/site-packages/pecos/apps/text2text/model.py", line 315, in train
    label_feat_set[embed_type] = LabelEmbeddingFactory.create(Y, X, method=embed_type)
  File "/opt/conda/envs/pecos/lib/python3.8/site-packages/pecos/xmc/base.py", line 1540, in create
    return mapping[method.lower()](Y, X, **kwargs)

How to Reproduce?

Train a text2text model with --label-embed-type pifa_lf_concat

Environment

  • Operating system: Unix Ubuntu (on GCP)
  • Python version: 3.8
  • PECOS version: 0.1.0

Does 'sigmoid' post-processor output predicted probability of labels?

Hi,

I recently ran XRTransformer and XRLinear models on benchmark datasets and want to get the predicted probability P(Y_{i}=1|X) of labels. I believe both models are OVA classifiers anyway. Does that mean I can simply apply the 'sigmoid' post-processor and interpret the result as a predicted probability?

Thanks!
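
For context, here is a minimal sketch of what a sigmoid transform does to raw scores. It only illustrates the transform; whether the resulting numbers are calibrated probabilities P(Y_{i}=1|X) for these models is the open question and is not answered here. The scores are made up.

```python
import math
from typing import List


def sigmoid(score: float) -> float:
    """Map a real-valued score to (0, 1), written to stay stable for large |score|."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


scores: List[float] = [-2.0, 0.0, 3.0]   # made-up raw scores for three labels
probs = [sigmoid(s) for s in scores]
```

The transform is monotonic, so it changes the scale of the scores but not the ranking of the labels.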

Parameter confusion

Hi @jiong-zhang, could you please explain in more detail the difference between the 'only_topk' parameter in 'matcher_params_chain' and the one in 'ranker_params.hlm_args.model_chain'?

For example, in Eurlex-4K-roberta, why is "only_topk": 25 in the ranker but "only_topk": 5 in the matcher at the bottom level? Should these two parameters be the same? Or should the top-k in the matcher be larger than in the ranker?

Thanks.

Add Support for User Supplied Negatives to Text2Text App

It appears that the regular PECOS library allows you to pass in user-supplied negative examples as additional training data. However, as far as I can tell from reading the code, this is not exposed in the Text2Text app. Could you please add support for this functionality to the app as well, so I can augment a model with additional negative examples?

text2text model evaluation not working

Description

Model evaluation fails instead of outputting precision and recall.

How to Reproduce?

I ran the following command:

python3 -m pecos.apps.text2text.evaluate --pred-path ./test-prediction.txt --truth-path ./test.txt --text-item-path ./output-labels.txt

where
--pred-path is the path of the file produced during model prediction,
--truth-path is the path of the test file, with lines such as: Out1, Out2, Out3 \t cheap door
(Out1, Out2 and Out3 are line numbers in the output file below),
--text-item-path is ./output-labels.txt

What have you tried to solve it?

Error message or code output

Traceback (most recent call last):
  File "/home/khalid/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/khalid/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/pecos/apps/text2text/evaluate.py", line 130, in <module>
    do_evaluation(args)
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/pecos/apps/text2text/evaluate.py", line 119, in do_evaluation
    Y_true = smat.csr_matrix((val_t, (row_id_t, col_id_t)), shape=(num_samples_t, len(item_dict)))
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 55, in __init__
    dtype=dtype))
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/scipy/sparse/coo.py", line 196, in __init__
    self._check()
  File "/home/khalid/PECOS/pecos_venv/lib/python3.7/site-packages/scipy/sparse/coo.py", line 285, in _check
    raise ValueError('column index exceeds matrix dimensions')
ValueError: column index exceeds matrix dimensions
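
The traceback shows the truth matrix being built with len(item_dict) columns, so this error means at least one label id in the truth file is greater than or equal to the number of items. Possible causes, offered as guesses, are one-based ids or a label file that does not match the one used for training. The sketch below scans for such ids; `find_bad_label_ids` is a hypothetical helper, and the line format (comma-separated integer ids, a tab, then text) is an assumption based on the description above.

```python
from typing import List, Tuple


def find_bad_label_ids(truth_lines: List[str], num_items: int) -> List[Tuple[int, int]]:
    """Return (line number, label id) pairs whose id is not in [0, num_items)."""
    bad = []
    for line_no, line in enumerate(truth_lines):
        id_field = line.split("\t", 1)[0]      # everything before the first tab
        for tok in id_field.split(","):
            tok = tok.strip()
            if not tok:
                continue
            label_id = int(tok)
            if not 0 <= label_id < num_items:
                bad.append((line_no, label_id))
    return bad


items = ["door", "window", "table"]              # stand-in for output-labels.txt
truth = ["0,2\tcheap door", "1,3\tbig window"]   # id 3 has no matching item line
bad = find_bad_label_ids(truth, len(items))
```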

Environment

  • Operating system:
  • Python version:
  • PECOS version:


On MACLR algorithm: How can clustering be applied to the training process of step 1?

I found that you apply the clustering results in the loss function, but after the torch.eq transformation it seems that only a diagonal matrix is output:

def loss_function_reg(label_emb, inst_emb, inst_emb_aug, reg_emb, labels, accelerator):
	assert label_emb.shape[0] == inst_emb.shape[0], "{} is not equal to {}".format(label_emb.shape[0], inst_emb.shape[0])
	assert label_emb.shape[1] == inst_emb.shape[1]

	all_label_emb = mpu_utils.AllgatherFromDataParallelRegion.apply(label_emb)
	all_inst_emb = mpu_utils.AllgatherFromDataParallelRegion.apply(inst_emb)
	all_inst_emb_aug = mpu_utils.AllgatherFromDataParallelRegion.apply(inst_emb_aug)
	all_reg_emb = mpu_utils.AllgatherFromDataParallelRegion.apply(reg_emb)

	labels = labels.contiguous().view(-1, 1)
	all_labels = accelerator.gather(labels)
	num_inst = all_label_emb.shape[0]
	num_reg = all_reg_emb.shape[0]

	mask = torch.eq(all_labels, all_labels.transpose(0, 1)).float()

So how can the clustering results be applied to the training process? Is there a code reference?
Hope to get your reply!
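
A small NumPy stand-in for the torch.eq(all_labels, all_labels.transpose(0, 1)) line may help here. The comparison yields a diagonal mask only when every entry of the label vector is distinct; when entries repeat, the matching pairs also get a 1. What `labels` actually contains in step 1 of MACLR is not shown in the excerpt, so treating it as cluster ids below is an assumption.

```python
import numpy as np

# Case 1: every instance has a distinct id, so the mask is the identity matrix.
unique_ids = np.array([0, 1, 2, 3]).reshape(-1, 1)
mask_unique = (unique_ids == unique_ids.T).astype(np.float32)

# Case 2: ids are cluster assignments, so instances in the same cluster match too.
cluster_ids = np.array([0, 1, 0, 1]).reshape(-1, 1)
mask_cluster = (cluster_ids == cluster_ids.T).astype(np.float32)
```

In the second case, instances 0 and 2 share a cluster, so mask_cluster[0, 2] is 1 even though they are different instances.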

Decouple transformer vectorisers from sklearn-based ones

Description

Thank you very much for the nice repository. We are currently using XLinear together with the multi-threaded Tfidf vectoriser:

class Tfidf(Vectorizer):
    """Multithreaded tfidf vectorizer with C++ backend.
    Supports 'word', 'char' and 'char_wb' tokenization.
    """

It works like a charm. However, with the new transformer vectorisers, using Tfidf requires installing all the transformers/torch dependencies in my virtual environment, which is unnecessary. It would be good if the code were decoupled, so that a user can use just Tfidf with a slim environment. In the past, I installed pecos with no dependencies and then installed only the dependencies needed by the specific modules, which is desirable for a production environment.

Proposal:

  • That the pretrained transformers in
    class PretrainedTransformer(Vectorizer):
        """Vectorizer with a variety of Transformer models."""

    be moved to an independent file, say pretrained_vectorizers.py, alongside the transformers and torch imports.

I'm happy to give this change a go, and contribute with a MR if the maintainers upvote it.
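
A complementary pattern to moving the class into its own file is to defer the heavy import until the vectorizer is actually constructed. The sketch below shows the idea with the standard library only. It is not how PECOS is organised; `require` and `LazyTransformerVectorizer` are hypothetical names.

```python
import importlib
from types import ModuleType


def require(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use, with a readable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional package '{module_name}'; install it to use this feature."
        ) from exc


class LazyTransformerVectorizer:
    """Hypothetical vectorizer that imports its heavy dependency only when constructed."""

    def __init__(self, backend: str = "torch"):
        # Nothing heavy is imported at module import time, only here.
        self._backend = require(backend, "LazyTransformerVectorizer")
```

With this layout, importing the module that defines the class costs nothing extra, and users who only need Tfidf never trigger the torch import.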

How to get the raw feature of paper100M?

Description

After downloading ogbn-papers100M.tar.gz, I can only get X.all.xrt-emb.npy and X.all.xrt-emb.sgc.pt. There is no Xall.txt

