nvidia-merlin / nvtabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

License: Apache License 2.0


nvtabular's Issues

[BUG] Error when apply_offline=False

Issue by oyilmaz-nvidia
Friday May 15, 2020 at 20:53 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/143


Describe the bug
Getting the following error when apply_offline=False:

proc.apply(train_ds_iterator, apply_offline=False, record_stats=True, shuffle=True, output_path=output_train_dir, num_out_files=35)

from the criteo notebook.

TypeError                                 Traceback (most recent call last)
<timed eval> in <module>

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/workflow.py in apply(self, dataset, apply_offline, record_stats, shuffle, output_path, num_out_files, hugectr_gen_output, hugectr_output_path, hugectr_num_out_files)
    743                 shuffler=shuffler,
    744                 num_out_files=num_out_files,
--> 745                 huge_ctr=huge_ctr,
    746             )
    747         if shuffle:

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/workflow.py in apply_ops(self, gdf, start_phase, end_phase, record_stats, shuffler, output_path, num_out_files, huge_ctr)
    797             start = time.time()
    798             gdf, stat_ops_ran = self.run_ops_for_phase(
--> 799                 gdf, self.phases[phase_index], record_stats=record_stats
    800             )
    801             self.timings["preproc_apply"] += time.time() - start

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/workflow.py in run_ops_for_phase(self, gdf, tasks, record_stats)
    621             elif op._id in self.feat_ops:
    622                 gdf = self.feat_ops[op._id].apply_op(
--> 623                     gdf, self.columns_ctx, cols_grp, target_cols=target_cols
    624                 )
    625             elif op._id in self.df_ops:

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/ops.py in apply_op(self, gdf, columns_ctx, input_cols, target_cols, stats_context)
    122     ):
    123         target_columns = self.get_columns(columns_ctx, input_cols, target_cols)
--> 124         new_gdf = self.op_logic(gdf, target_columns, stats_context=stats_context)
    125         self.update_columns_ctx(columns_ctx, input_cols, new_gdf.columns, target_columns)
    126         return self.assemble_new_df(gdf, new_gdf, target_columns)

~/miniconda3/envs/recsys-0507/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76 

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/ops.py in op_logic(self, gdf, target_columns, stats_context)
    581         if not cont_names:
    582             return gdf
--> 583         z_gdf = gdf[cont_names].fillna(0)
    584         z_gdf.columns = [f"{col}_{self._id}" for col in z_gdf.columns]
    585         z_gdf[z_gdf < 0] = 0

TypeError: 'GPUDatasetIterator' object is not subscriptable

[OP] Add cosine_similarity operation

Issue by rnyak
Tuesday May 19, 2020 at 22:15 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/150


Is your operator request related to a problem? Please describe.

Cosine similarity was used as a feature engineering technique in the W&D Outbrain model. Cosine similarity is a metric that measures the similarity between two non-zero vectors of an inner product space.

Describe the solution you'd like

Apply cosine similarity as an operator.

A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type: Feature Engineering
  • input column type(s): Continuous numeric (X and Y as two vectors)
  • output column type(s): Continuous numeric within [-1, 1]
  • Expected transformation of the data after application: The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same.

Optional: Describe operation stages in detail*
Apply: compute the cosine similarity of v1 to v2: (v1 · v2) / (||v1|| * ||v2||); see Wikipedia for more information.

Additional context
Sklearn has a cosine_similarity function:

sklearn.metrics.pairwise.cosine_similarity(X, Y=None, dense_output=True)

An alternative is to calculate the cosine distance and then compute 1 - cosine_distance to get the cosine similarity.

Currently cuML does not have a Python or C++ API for cosine distance; there is only the prims API, which is header-only. The CUDA headers can be found at the following link:

https://github.com/rapidsai/cuml/tree/4084790afe605e82710597be575b63d8f57b1bbb/cpp/src_prims/distance

Wrappers would need to be created in libcuml for easy C++ consumption, as well as in Python.
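
Until a cuML API exists, the apply stage can be sketched directly with cudf and CuPy. A minimal sketch, assuming the two vectors X and Y are stored as two equal-length lists of continuous columns and that DataFrame.values returns a CuPy array (true in recent cudf releases); this is not an existing NVTabular op:

import cupy as cp
import cudf

def cosine_similarity(gdf: cudf.DataFrame, x_cols, y_cols, out_col="cos_sim"):
    # pull the two vector blocks out as (n_rows, n_dims) CuPy arrays
    x = gdf[x_cols].values
    y = gdf[y_cols].values
    # (v1 . v2) / (||v1|| * ||v2||), computed row-wise
    dots = (x * y).sum(axis=1)
    norms = cp.linalg.norm(x, axis=1) * cp.linalg.norm(y, axis=1)
    gdf[out_col] = dots / norms
    return gdf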

[Task] Evaluate the RecSys 2020 challenge workflow and create GitHub issues for all missing operators.

Issue by EvenOldridge
Friday May 15, 2020 at 20:41 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/142


What needs doing
Evaluate the RecSys 2020 challenge workflow and create github issues for all missing operators.
Issues should be under the nvTabular project and a part of the RecSys 2020 workflow milestone as this issue is.

Please create a new tab and add operators to both the workflow to operator master doc and the master operator doc. Please validate with the code in the source repo to ensure nothing is missing.

Additional context

Master operator doc:
https://docs.google.com/spreadsheets/d/1irirSo70PvuCovb_0nnJNjWUpSGhbjDOl72kxS_8fRk/edit#gid=1173451941

Workflow to operator master doc:
https://docs.google.com/spreadsheets/d/1EcY9n3uEUs3pPl7auEE4ahNQfjafaMTl6R-Zvk-7XlE/edit#gid=796720992

Source repo:
https://github.com/rapidsai/recsysChallenge2020/

[Task] Evaluate the W&D outbrains workflow and create tickets for all outstanding operators

Issue by EvenOldridge
Friday May 15, 2020 at 20:34 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/141


What needs doing
Evaluate the W&D outbrains workflow and create github issues for all outstanding operators.
Issues should be under the nvTabular project and a part of the outbrains milestone as this issue is.

The initial analysis of operators has been done in the workflow to operator master doc under the outbrains tab. Please validate with the code in the source repo to ensure nothing is missing.

Additional context

Master operator doc:
https://docs.google.com/spreadsheets/d/1irirSo70PvuCovb_0nnJNjWUpSGhbjDOl72kxS_8fRk/edit#gid=1173451941

Workflow to operator master doc:
https://docs.google.com/spreadsheets/d/1EcY9n3uEUs3pPl7auEE4ahNQfjafaMTl6R-Zvk-7XlE/edit#gid=796720992

Source repo:
https://gitlab-master.nvidia.com/dl/JoC/wide_deep_tf/tree/nvidia-release-20.04

[BUG] can't install nvtabular with conda with CUDA 10.1

Issue by benfred
Thursday May 14, 2020 at 19:43 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/137


Installing nvtabular with conda install -c nvidia -c rapidsai-nightly -c numba -c conda-forge nvtabular python=3.7 cudatoolkit=10.1 fails with:

Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                                                             
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versions
Package cudatoolkit conflicts for:
nvtabular -> cudatoolkit[version='>=10.2.89,<10.3.0a0']
cudatoolkit=10.1
nvtabular -> cupy[version='>=7,<8.0.0a0'] -> cudatoolkit[version='10.0|10.0.*|10.2|10.2.*|9.2|9.2.*|10.1|10.1.*']
Package python conflicts for:
nvtabular -> cudf=0.14 -> python[version='>=3.6|>=3.6,<3.7.0a0|>=3.8,<3.9.0a0']
nvtabular -> python[version='>=3.7,<3.8.0a0']
python=3.7

The following specifications were found to be incompatible with your CUDA driver:
  - feature:/linux-64::__cuda==10.1=0
  - feature:|@/linux-64::__cuda==10.1=0
Your installed CUDA driver is: 10.1

[FEA] Dedicated objects for columns_ctx, stats_ctx, phase, config

Issue by alecgunny
Tuesday Mar 24, 2020 at 19:28 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/28


Is your feature request related to a problem? Please describe.
Right now many of the central objects in a Workflow are constructed as lists or dicts with some expected structure. This can make their meaning and use, as well as those of their constituent elements, difficult to understand when trying to debug or contribute. Examples include the columns_ctx, stats_ctx, phases, and the workflow config.

Describe the solution you'd like
Ideally, these would be replaced with dedicated objects that have descriptive attributes and methods, making their functionality and components clearer. Methods on these objects could even simplify Workflow code by replacing the methods that exist solely to update and retrieve information from them.

Describe alternatives you've considered
NamedTuples, or objects inheriting from NamedTuples, could be a simple way to ascribe fixed attributes, and could even maintain iterability to reduce short-term code updates.
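
A quick illustration of the NamedTuple direction, as a sketch only (the field names here are invented and do not reflect the actual structure of these objects):

from typing import List, NamedTuple

class Phase(NamedTuple):
    stat_ops: List[str]
    feat_ops: List[str]
    df_ops: List[str]

phase = Phase(stat_ops=["Moments"], feat_ops=["ZeroFill"], df_ops=["Categorify"])
# still iterable/unpackable, so existing list-style code keeps working short term
stat_ops, feat_ops, df_ops = phase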

[DOC] Update docstrings in groupby.py

Issue by rnyak
Friday May 15, 2020 at 18:35 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/139


Report incorrect documentation

Location of incorrect documentation
This request is for the GroupByMomentsCal class in groupby.py. The definitions of the class parameters need some clarification; it is not straightforward to understand what we need to feed to the GroupByMomentsCal class to create instances.

class GroupByMomentsCal(object):
    """
    This is the class that GroupByMoments uses to
    calculate the basic statistics of the data that
    is grouped by a categorical feature.
    Parameters
    -----------
    col : str
        column name
    col_count : str
        column name to get group counts
    cont_col : list of str
        pre-calculated unique values.
    stats : list of str, default ['count']
        count of groups = ['count']
        sum of cont_col = ['sum']
...

Describe the problems or issues found in the documentation

What do col, col_count, and cont_col refer to? The definitions do not really tell a user what parameters to provide. Does col refer to the name of the categorical column that we want to apply the groupby op on?

Suggested fix for documentation
The docstring for the GroupByMoments class reads well; we can modify the docstrings in the GroupByMomentsCal class accordingly.

class GroupByMoments(StatOperator):
    """
    One of the ways to create new features is to calculate
    the basic statistics of the data that is grouped by a categorical
    feature. This operator groups the data by the given categorical
    feature(s) and calculates the std, variance, and sum of requested continuous
    features along with count of every group. Then, merges these new statistics
    with the data using the unique ids of categorical data.
    Although you can directly call methods of this class to
    transform your categorical features, it's typically used within a
    Workflow class.
    Parameters
    -----------
    cat_names : list of str
        names of the categorical columns
    cont_names : list of str
        names of the continuous columns
    stats : list of str, default ['count']
        count of groups = ['count']
        sum of cont_col = ['sum']
...

Describe the documentation you'd like

GroupByMomentsCal params definition should be clarified.

[OP] add dropna() ops

Issue by rnyak
Tuesday May 26, 2020 at 14:34 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/170


Is your operator request related to a problem? Please describe.
dropna() is used in Outbrain's W&D model.

Describe the solution you'd like
A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type (Feature Engineering or Preprocessing): Preprocessing
  • input column type(s): Continuous and Categorical
  • output column type(s): Continuous and Categorical
  • Expected transformation of the data after application: Missing/Null values are removed.

Additional context
cudf has a dropna method that works as below:

gdf = gdf.dropna(subset=['column1', 'column2'])
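
For reference, a rough sketch of how this might look in the op_logic style used elsewhere in ops.py; the TransformOperator base class follows the pattern quoted in other issues on this page, and whether dropping rows composes cleanly with the rest of the workflow is an open question:

import cudf
from nvtabular.ops import TransformOperator

class Dropna(TransformOperator):
    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        # drop any row that has a null in one of the target columns
        return gdf.dropna(subset=target_columns)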

[OP] Add Min-Max scaling under Normalize() Class in ops.py

Issue by rnyak
Tuesday May 19, 2020 at 16:27 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/145


Is your operator request related to a problem? Please describe.
Min-Max scaling is one of the most common scaling techniques used for data preprocessing.

Describe the solution you'd like
A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type: Preprocessing
  • input column type(s): Continuous numeric
  • output column type(s): Float numeric
  • Expected transformation of the data after application: Data is transformed to be within [0-1] range.

Optional: Describe operation stages in detail*
Statistics per chunk: Ex. compute the min and max of the column
Statistics combine: NA
Apply: (value-min)/(max-min)

Additional context

Example code for Sklearn MinMaxScaler class can be found here:
https://github.com/scikit-learn/scikit-learn/blob/fd237278e/sklearn/preprocessing/_data.py#L200
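
A minimal stand-alone sketch of the stages on cudf columns (function and column handling are illustrative, not NVTabular's actual op API):

import cudf

def min_max_stats(gdf: cudf.DataFrame, cols):
    # statistics per chunk: collect the min and max of each column
    return {col: (gdf[col].min(), gdf[col].max()) for col in cols}

def min_max_apply(gdf: cudf.DataFrame, stats):
    # apply: (value - min) / (max - min), yielding values in [0, 1]
    for col, (lo, hi) in stats.items():
        gdf[col] = (gdf[col] - lo) / (hi - lo)
    return gdf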

[DOC] Fix the non-working links in the docs

Issue by rnyak
Monday May 11, 2020 at 21:44 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/111


Report incorrect documentation

Location of incorrect documentation

1.1 Introduction--> Examples and Tutorials:
Looks like the DLRM Criteo Workflow hyperlink does not point directly to the page.

Do we mean to refer to this page?
https://github.com/NVIDIA/DeepLearningExamples/tree/1cad1801646dee0e96e344b43c796181d6f05564/PyTorch/Recommendation/DLRM

1.2 The Rossmann Store Sales hyperlink yields a page not found error (HTTP 404).

2. Contributing

The Contributing.md hyperlink does not work (see sentence below).
“If you wish to contribute to the library directly please see Contributing.md. “

3. Learn More

The API documentation hyperlink does not work (see sentence below).

"We also have API documentation that outlines in detail the specifics of the calls available within the library."

4. Introduction --> Getting started: It would be better to provide a link for NVIDIA's Triton Inference Server where it first appears (see sentence below).

"Integration with model serving frameworks like NVidia’s Triton Inference Server to make model deployment easy."

Rossmann example: end-to-end accuracy

In continuation of https://github.com/rapidsai/recsys/issues/75

This has bugged me for a long long long time now, and finally I've made some progress:

The known SOTA, i.e. Kaggle top-10 LB, is 0.108 (RMSPE).
The NVTabular + fastai pipeline can now match this SOTA.

  1. fastai data -> fastai preproc -> fastai tabular model:
    epoch train_loss valid_loss exp_rmspe time
    0 0.009749 0.014706 0.118227 00:17
    1 0.009324 0.012835 0.112409 00:17
    2 0.008561 0.012784 0.110864 00:17
    3 0.007877 0.012329 0.109368 00:17
    4 0.007448 0.012266 0.109173 00:17

  2. fastai data -> nvtab preproc -> back to pandas -> fastai tabular model:
    epoch train_loss valid_loss exp_rmspe time
    0 0.066737 0.049522 0.206644 00:18
    1 0.018532 0.016543 0.132428 00:18
    2 0.012569 0.014468 0.120802 00:18
    3 0.010587 0.012440 0.110982 00:18
    4 0.008863 0.011744 0.108335 00:19

  3. fastai data -> nvtab preproc -> fastai tabular model:
    epoch train_loss valid_loss my_exp_rmspe time
    0 0.011587 0.012796 0.119197 00:12
    1 0.011569 0.013420 0.115161 00:12
    2 0.009520 0.011074 0.106952 00:12
    3 0.007833 0.011341 0.105943 00:12
    4 0.006971 0.010858 0.104360 00:12

There's something funny re. data ordering; in case (3), shuffling the data led to SOTA results:

proc.apply(train_ds_iterator, apply_offline=True, record_stats=True, output_path=PREPROCESS_DIR_TRAIN, shuffle=True, num_out_files=3)
proc.apply(valid_ds_iterator, apply_offline=True, record_stats=False, output_path=PREPROCESS_DIR_VALID, shuffle=True, num_out_files=3)

You can see my notebooks at /mnt/dldata/nvtabular-notebook-share.

[OP] Hash Bucketing

Issue by alecgunny
Wednesday May 20, 2020 at 16:40 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/153


Op for hashing categoricals and taking the modulo to map to some specified number of bins. Important for replacing inefficiencies in TensorFlow's hash bucket feature column.

The implementation could also be used to help with hashing out-of-vocabulary values as referenced in #29, as well as with an alternative frequency thresholding technique (i.e. instead of mapping everything below the threshold to one value, hash those values into bins).

Something like

import cudf
from nvtabular.ops import TransformOperator


class Hash(TransformOperator):
    def __init__(self, num_buckets, columns=None, **kwargs):
        if columns is None and not isinstance(num_buckets, int):
            # this is a potential API issue, and has implications for any op
            # that can take different arguments for each feature:
            # if I specify them individually at construction, I have no way
            # of checking the number of args against the number of
            # columns without just feeding `cat_names` again as `columns`,
            # which feels redundant. Maybe this isn't an issue, since you'll rarely
            # want to hash _everything_ and so would need to provide specific
            # `columns` regardless, but worth thinking about
            raise ValueError("Can't specify individual bucket counts without specifying columns")
        super(Hash, self).__init__(columns=columns, **kwargs)
        self.num_buckets = num_buckets

    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if isinstance(self.num_buckets, int):
            num_buckets = [self.num_buckets for _ in target_columns]
        else:
            num_buckets = self.num_buckets

        # this is the part that would need cudf support
        # I also don't know for sure if modulo is supported like that
        gdf[target_columns] = gdf[target_columns].hash() % num_buckets
        return gdf
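
For reference, a hedged stand-alone version of the same bucketing idea using cudf directly (Series.hash_values is available in more recent cudf releases; column name and bucket count are illustrative):

import cudf

gdf = cudf.DataFrame({"ad_id": [101, 202, 303]})
gdf["ad_id_bucket"] = gdf["ad_id"].hash_values() % 1000   # map hashes into 1000 bins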

[FEA] Custom parquet Metadata for HugeCTR

Issue by benfred
Tuesday May 26, 2020 at 22:58 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/173


In order for the HugeCTR parquet reader to be able to read the output of NVTabular, we're going to need to add custom metadata to the output dataset describing which columns are labels, continuous features, and categorical features.

For this to work in NVTabular, we need to ensure that we can write custom metadata fields from Python with cudf. Likewise, we need to be able to read this metadata from HugeCTR in C++.
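
As a sketch of the Python side, custom key/value metadata can already be attached to a parquet file via pyarrow (the metadata key and column layout below are made up for illustration; a cudf-native path would still be needed for GPU-resident data):

import json
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"label": [0, 1], "I1": [0.5, 0.1], "C1": [3, 7]})
meta = dict(table.schema.metadata or {})
# hypothetical metadata key describing the column roles
meta[b"nvtabular.column_types"] = json.dumps(
    {"labels": ["label"], "conts": ["I1"], "cats": ["C1"]}
).encode()
pq.write_table(table.replace_schema_metadata(meta), "part_0.parquet")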

[FEA] Multi-hot categorical support

Issue by alecgunny
Wednesday May 20, 2020 at 17:25 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/155


Multi-hot categorical features are a ubiquitous and important element of all production deep recommendation workflows. The ability to robustly learn dense representations of these features in an end-to-end manner represents one of the main advantages of deep learning approaches. Supporting representations of and transformations on these features is critical to broader adoption, and complex enough to possibly warrant a dedicated milestone. I'm creating this issue to start the discussion of how/where we want to start mapping out these solutions.

The most obvious first step to me is to decide on a representation pattern, as this will determine how we build op support on those representations. Some options include

  1. Dense dataframes padded with zeroes to some max number of elements
  2. Sparse dataframe version of the above
  3. Ragged-array series elements in dataframes

Option 1 would require the least overhead to extend support to, but obviously wastes memory and could be prohibitive for features that have category counts ranging over orders of magnitude (as is common). It also requires users to specify the max number of elements beforehand, which may not be known (unless we give them an op to compute it) and could change over time, potentially wasting memory or throwing out good data.

Options 2 and 3 would probably end up being pretty similar (I would imagine that specifying a max number of elements would end up being necessary for option 3), but 3 feels cleaner as it keeps features localized to individual columns (instead of spread out over many sparse columns) and keeps us from having to mix sparse and dense dataframes. It's also technically more memory efficient, since instead of each row requiring N (row_index, column_index, value) tuples, where N is the number of categories taken on by a given sample, you just need the array of N values and a single offset int.

One thing worth considering, though, is that if repeated categories are common, the ragged representation can become more memory intensive, since the value int in the sparse tuple would represent the k number of times that category occurs, while you would need k ints in the ragged representation for each time the category occurred.

One deciding factor in all of this is how we expect the APIs for multi-hot embeddings to be implemented. One method is to implement essentially a sparse matrix multiplication against the multi-hot encoded vector for each sample (with possibly some normalization to implement things like mean aggregation instead of just sum), which will be more efficient in the case of commonly repeated categories and, obviously, lends itself to the sparse representation. The other is to just perform a regular lookup on all the values and aggregate in segments using the offsets, which will lend itself to the ragged representation.

Long term, offering options for both representation and embedding choices will probably be most valuable to users. In the short term, it's worth picking one and starting to work on pushing for cudf support for it so we can begin to build op support. My personal vote is the ragged array option, since it will already be consistent with the PyTorch EmbeddingBag API, which we can port to TensorFlow, and seems like it would require the least overhead to support (since the sparse option seems like an extension of its functionality). Either way, even if it's not SOL in all cases, having one version accelerated is better than the existing options.
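
To make the ragged option concrete, here is a small illustration of the values-plus-offsets layout it implies, fed straight into torch.nn.EmbeddingBag (the numbers are made up):

import torch

# three rows of multi-hot categories: [3, 7, 7], [1], [4]
values = torch.tensor([3, 7, 7, 1, 4])
offsets = torch.tensor([0, 3, 4])   # start index of each row in `values`

bag = torch.nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")
row_embeddings = bag(values, offsets)   # shape: (3, 4)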

This requires some support in cudf to add list types:

  • Add nested list types to cudf
  • Read parquet files with nested lists
  • Write parquet files with nested lists
  • Python API support for list dtypes
  • Read/Write access to list values in a cudf dataframe (for instance, to hash/categorify the elements of a list)

NVTabular changes include:

  • Support for list types in Categorify op
  • Support list types in Hashing op
  • TensorFlow dataloader support
  • PyTorch dataloader support

[Task] Remove redundant dataset write operations

Issue by benfred
Saturday May 02, 2020 at 00:31 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/86


What needs doing
There are multiple spots where we're writing out a dataset; we should clean up and remove the redundant code.

Additional context

The current code that is actively writing out a dataset is https://github.com/rapidsai/recsys/blob/31e6620c907782de425d4b2f01adb8e48ae639d5/nvtabular/nvtabular/preproc.py#L884-L886

This switches between the shuffled and non-shuffled code paths.

We also have code in ds_writer that still uses the pyarrow writer: https://github.com/rapidsai/recsys/blob/master/nvtabular/nvtabular/ds_writer.py

We also have an 'Export' op that also writes out the dataset: https://github.com/rapidsai/recsys/blob/31e6620c907782de425d4b2f01adb8e48ae639d5/nvtabular/nvtabular/ops.py#L579-L633

We should clean up so that there is one definitive spot for writing a dataset, probably by adding the missing features to the Export operator and deleting the other references.

Rossmann example: shuffle data failed

Shuffling data without setting num_out_files will crash:

We need to set num_out_files manually, or the argument should be given a default value.

proc.apply(valid_ds_iterator, apply_offline=True, record_stats=False, output_path=PREPROCESS_DIR_VALID, shuffle=True)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-88bd63371ae5> in <module>
      1 proc.apply(train_ds_iterator, apply_offline=True, record_stats=True, output_path=PREPROCESS_DIR_TRAIN, shuffle=True, num_out_files=3)
----> 2 proc.apply(valid_ds_iterator, apply_offline=True, record_stats=False, output_path=PREPROCESS_DIR_VALID, shuffle=True)

/nvtabular/nvtabular/workflow.py in apply(self, dataset, apply_offline, record_stats, shuffle, output_path, num_out_files)
    758             self.finalize()
    759         if shuffle:
--> 760             shuffler = Shuffler(output_path, num_out_files=num_out_files)
    761         if apply_offline:
    762             self.update_stats(

/nvtabular/nvtabular/io.py in __init__(self, out_dir, num_out_files, num_threads)
    364     def __init__(self, out_dir, num_out_files=30, num_threads=4):
    365         self.queue = queue.Queue(num_threads)
--> 366         self.write_locks = [threading.Lock() for _ in range(num_out_files)]
    367         self.writer_files = [os.path.join(out_dir, f"{i}.parquet") for i in range(num_out_files)]
    368         self.writers = [ParquetWriter(f, compression=None) for f in self.writer_files]

TypeError: 'NoneType' object cannot be interpreted as an integer
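
One possible fix, sketched rather than an actual patch: guard the Shuffler construction (or give num_out_files a default in apply) so a missing kwarg never reaches range():

# inside Workflow.apply, before constructing the Shuffler (sketch only)
if shuffle:
    if num_out_files is None:
        num_out_files = 30   # match Shuffler's own default
    shuffler = Shuffler(output_path, num_out_files=num_out_files)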

[BUG] GroupByMoments requires cont_names and has to be a list

Issue by bschifferer
Wednesday May 20, 2020 at 14:35 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/152


The docstring of GroupByMoments is not aligned with the functionality of the operator:

  1. If I use "count" as the only stat, there should be no need to define cont_names; in that case, however, there is an error.
  2. The docstring says that cont_names can be a string, but if cont_names is a single column passed as a string, there is an error.

Steps to reproduce

import numpy as np
import pandas as pd
import cudf
import nvtabular as nvt
from nvtabular.ops import GroupByMoments

cat_1 = np.asarray(['a']*12 + ['b']*10 + ['c']*10)
num_1 = np.asarray([1,1,2,2,2,1,1,5,4,4,4,4, 1,2,3,4,5,6,7,8,9,10, 1,2,3,4,5,6,7,8,9,10])
pdf_1 = pd.DataFrame({'cat': cat_1, 'num': num_1})

gdf = cudf.from_pandas(pdf_1)

proc = nvt.Workflow(cat_names=['cat'], cont_names=['num'], label_name=[])
proc.finalize()

# First bug
gm = GroupByMoments(
    cat_names='cat',
    stats=['count']
)
gm.apply_op(gdf, columns_ctx=proc.columns_ctx, input_cols='all', target_cols=['base'])
gm.read_fin()
#TypeError: only integer scalar arrays can be converted to a scalar index

# Second bug
gm = GroupByMoments(
    cat_names='cat',
    cont_names='num',
    stats=['count']
)
gm.apply_op(gdf, columns_ctx=proc.columns_ctx, input_cols='all', target_cols=['base'])
gm.read_fin()
#AttributeError: 'str' object has no attribute 'copy'

# This works
gm = GroupByMoments(
    cat_names='cat',
    cont_names=['num'],
    stats=['count']
)
gm.apply_op(gdf, columns_ctx=proc.columns_ctx, input_cols='all', target_cols=['base'])
gm.read_fin()

[FEA] make merge operation optional after Groupby

Issue by rnyak
Thursday May 21, 2020 at 17:35 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/163


Is your feature request related to a problem? Please describe.

Currently, after applying the GroupBy operation, the grouped columns are merged with the original dataframe (gdf) and we get a new_gdf (see the op_logic function below). Could we have some flexibility here? For example, adding a merge option (param) such as merge=True or inplace=True to the GroupBy operation: if True, the grouped features would be merged; if False, the user would get a separate dataframe containing only the grouped columns (cats and conts).

class GroupBy(DFOperator):
    """
    One of the ways to create new features is to calculate
    the basic statistics of the data that is grouped by a categorical
    feature. This operator groups the data by the given categorical
    feature(s) and calculates the std, variance, and sum of requested continuous
    features along with count of every group. Then, merges these new statistics
    with the data using the unique ids of categorical data.
    Although you can directly call methods of this class to
    transform your categorical features, it's typically used within a
    Workflow class.
    Parameters
    -----------
   ....
    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if self.cat_names is None:
            raise ValueError("cat_names cannot be None.")

        new_gdf = cudf.DataFrame()
        for name in stats_context["moments"]:
            tran_gdf = stats_context["moments"][name].merge(gdf)
            new_gdf[tran_gdf.columns] = tran_gdf

        return new_gdf

Describe the solution you'd like

This is an example of how we can apply the GroupBy operation:
proc.add_cat_feature(GroupBy(cat_names=cat_names[0], cont_names=cols[0:2], stats=['count', 'sum']))

Can we add a merge param here, like below?

proc.add_cat_feature(GroupBy(cat_names=cat_names[0], cont_names=cols[0:2], stats=['count', 'sum']), merge=True)

One aspect is that, if merge=False, we need to return a separate df as the result of the GroupBy operation. It would be better to think of use cases where this will be practically useful (a rough sketch of the op_logic change follows the list below):

  • we are applying GroupBy and all we want is to use the grouped features.
  • we are applying GroupBy and all we want is the stats that we obtain, maybe to use these stats as a normalization factor for some other feature?
  • we want to merge it with the original gdf, but can we drop the original columns (if we want to)?
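
A rough sketch of what the op_logic change could look like with such a flag (self.merge is hypothetical, and the shape of the non-merged return would need more thought):

def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
    if self.cat_names is None:
        raise ValueError("cat_names cannot be None.")

    if not self.merge:
        # return only the grouped statistics, one row per category value
        return cudf.concat(list(stats_context["moments"].values()), axis=1)

    new_gdf = cudf.DataFrame()
    for name in stats_context["moments"]:
        tran_gdf = stats_context["moments"][name].merge(gdf)
        new_gdf[tran_gdf.columns] = tran_gdf
    return new_gdf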

[OP] Add dropDuplicates() operator

Issue by rnyak
Tuesday May 26, 2020 at 18:00 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/172


Is your operator request related to a problem? Please describe.

The dropDuplicates() method is used in the Outbrain W&D model, and it is one of the commonly used methods in data preprocessing.

Describe the solution you'd like
A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type (Feature Engineering or Preprocessing): Preprocessing
  • input column type(s): Continuous and categorical
  • output column type(s): Continuous and categorical
  • Expected transformation of the data after application: Return DataFrame with duplicate rows removed.

Additional context
cudf has a drop_duplicates() method, applied as below:

cdf.drop_duplicates(keep= 'first', inplace=True)

[FEA] Move or remove get_emb_sz method on Categorify op

Issue by alecgunny
Wednesday May 13, 2020 at 16:53 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/121


It's not entirely clear to me why Categorify.get_emb_sz and its associated helper methods are included as methods on the Categorify object, especially when nothing in the code actually uses attributes from the object itself (with the exception of self.embed_sz, which gets set rather than read):

    def get_emb_sz(self, encoders, cat_names):
        work_in = {}
        for key in encoders.keys():
            work_in[key] = encoders[key] + 1
        # sorted key required to ensure same sort occurs for all values
        ret_list = [
            (n, self.def_emb_sz(work_in, n))
            for n in sorted(cat_names, key=lambda entry: entry.split("_")[0])
        ]
        return ret_list

    def emb_sz_rule(self, n_cat: int) -> int:
        return min(16, round(1.6 * n_cat ** 0.56))

    def def_emb_sz(self, classes, n, sz_dict=None):
        """Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`.
        """
        sz_dict = sz_dict if sz_dict else {}
        n_cat = classes[n]
        sz = sz_dict.get(n, int(self.emb_sz_rule(n_cat)))  # rule of thumb
        self.embed_sz[n] = sz
        return n_cat, sz

I'm personally of the opinion that it's not our library's job to provide rules of thumb for building deep learning models, only to provide the data for whatever rules the user wants to apply. But if we're intent on having this function somewhere, it seems like it would be better served as a standalone function of a single encoder, or as a property of the encoder itself:

emb_szs = [get_embed_sz(proc.stats["categories"][column]) for column in proc.columns_ctx["categorical"]["base"]]

# or
emb_szs = [proc.stats["categories"][column].embed_sz[1] for column in proc.columns_ctx["categorical"]["base"]]
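
For concreteness, the standalone-function version might look like the following, reusing the rule of thumb from the code above (the encoder argument is whatever object exposes the category count; calling len() on it is an assumption):

def get_embed_sz(encoder, name=None, sz_dict=None):
    # +1 for the null/out-of-vocabulary category, as in get_emb_sz above
    n_cat = len(encoder) + 1
    sz_dict = sz_dict or {}
    sz = sz_dict.get(name, min(16, round(1.6 * n_cat ** 0.56)))
    return n_cat, sz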

[OP] Join onto external dataframe, file, dictionary

Issue by EvenOldridge
Thursday Mar 26, 2020 at 02:22 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/32


Is your operator request related to a problem? Please describe.
Outbrain dataset.

Join to external data in multiple formats. This mechanism will also be used / shared by https://github.com/rapidsai/recsys/issues/31 to apply the groupby operations.

Describe the solution you'd like

  • Type: Feature Engineering
  • input column type(s): [Categorical]
  • input options:
    • Columns to join on: [Categorical] (just in case naming isn't the same?)
    • Categorical column names to be joined
    • Continuous column names to be joined
  • Expected transformation of the data after application: Data is joined

Optional: Describe operation stages in detail*
Statistics per chunk: N/A
Statistics combine: N/A
Apply: Join

Context:
Note that this is meant to be an iterator-to-gdf join and not an iterator-to-iterator join, which we'll figure out in the future.
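
A minimal sketch of the apply stage, assuming the external data has already been loaded into a cudf DataFrame (file name and column arguments are illustrative):

import cudf

external = cudf.read_parquet("documents_meta.parquet")

def join_external(gdf, on, cat_cols, cont_cols):
    # left join each processed chunk against the pre-loaded external frame
    keep = [on] + cat_cols + cont_cols
    return gdf.merge(external[keep], on=on, how="left")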

[FEA] Faster HugeCTR output generation

Issue by oyilmaz-nvidia
Thursday May 21, 2020 at 16:03 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/161


Generating HugeCTR outputs is very slow in the current version of the code. We are using Python and frameworks like cudf, pandas, and numpy to do this.

C++/CUDA code can do a much better job here. If we pass the GPU memory references from cudf to a C++/CUDA layer, we can quickly create the HugeCTR file format and write the data out to files.

[BUG] NV-Tabular: end-to-end accuracy on Rossmann dataset not achieved

Issue by vinhngx
Monday Apr 27, 2020 at 22:37 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/75


Describe the bug
Hi folks, I spent some time trying to improve the accuracy of the model on the Rossmann data, but it's not anywhere near competitive. The fastai augmented dataset (as available from https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson6-rossmann.ipynb), when used with the fastai preprocessing pipeline, achieves a top-10 Kaggle leaderboard result. When using the NVTabular pipeline, I tried to mimic the same pipeline, but the end accuracy is far from competitive.

I'm using the same engineered data provided by fastai, including weather, Google Trends, etc. So only the preprocessing and encoding differ between the two prep pipelines; accuracy should be similar.
Steps/Code to reproduce bug

I've shared the notebook and data here on dlcluster: /mnt/dldata/nvtabular-notebook-share if you'd like to have a look

Expected behavior
Since the same engineered data provided by fastai (including weather, Google Trends, etc.) is used, and only the preprocessing and encoding differ between the two prep pipelines, accuracy should be similar.

Environment details (please complete the following information):
On dlcluster

nvidia-docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 -p 3001:3001 --ipc=host --name dev_nite --net=host -v /mnt:/mnt -v /mnt/dldata/vinhn/recsys:/recsys gitlab-master.nvidia.com:5005/rapidsdl/docker/rapidsdl_nite:latest /bin/bash

cd /recsys/nv-tabular
pip install -e .
source activate rapids && jupyter-lab --allow-root --ip='0.0.0.0' --NotebookApp.token=''


[FEA] DLLabelEncoder kwarg for raising errors on out-of-vocabulary

Issue by alecgunny
Tuesday Mar 24, 2020 at 19:44 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/29


Is your feature request related to a problem? Please describe.
DLLabelEncoder by default reserves 0 for missing or out-of-vocabulary entries. While this is sensible default behavior, I can imagine scenarios where you know all the categories explicitly beforehand, and any sample with a value outside of these categories is problematic and should raise an error. In this case, the categories would map to [0, num_categories-1].

Describe the solution you'd like
Add a kwarg to DLLabelEncoder that can toggle this behavior. One possibility, used by TensorFlow's tf.feature_column.categorical_column_with_vocabulary_list, is a num_oov_buckets kwarg that defaults to 1 but can be set to 0, indicating that no out-of-vocabulary inputs should be tolerated.

As a possible, but not strictly necessary, addition, higher values could be used to hash OOV inputs into different bins. In that case, it's unclear whether to assign OOV values to the first num_oov_buckets integers or to the range [num_categories, num_categories+num_oov_buckets-1].

Describe alternatives you've considered
I'm open to the argument that, when out-of-vocabulary values are unacceptable, the onus is on the data scientist to make sure of this when feeding data in. But that feels like a silent failure, which isn't desirable. It also forces them to reserve category 0 for a value that will never come, which can be a minor inconvenience.
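
A sketch of the toggle, not DLLabelEncoder's real internals (_lookup is a hypothetical helper that returns -1 for unseen values):

def transform(self, series, num_oov_buckets=1):
    codes = self._lookup(series)              # hypothetical: -1 for unseen values
    if num_oov_buckets == 0:
        if (codes == -1).any():
            raise ValueError("out-of-vocabulary value encountered")
        return codes                          # categories map to [0, num_categories - 1]
    return codes + num_oov_buckets            # reserve [0, num_oov_buckets) for OOV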

[OP] Add sampling technique that allows stratified sampling

Issue by rnyak
Tuesday May 19, 2020 at 20:38 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/149


Is your operator request related to a problem? Please describe.

This method was used in the W&D Outbrain repo to generate the validation set.

Describe alternatives you've considered
The method used is Spark's sampleBy function:

sampleBy(col, fractions, seed=None)
Parameters:
  col – column that defines strata
  fractions – sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  seed – random seed
Returns: a new DataFrame that represents the stratified sample

Describe the solution you'd like
A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type: Preprocessing
  • input column type(s): Dataframe and a specific column used as input- [categorical, continuous].
  • output column type(s): [categorical, continuous]
  • Expected transformation of the data after application: Returns a stratified sample with or without replacement based on the fraction given on each stratum.

Additional context
Sklearn and pandas have existing sampling methods, whereas cudf does not have such an operation yet:

  1. sklearn.model_selection.train_test_split(*arrays, **options)

  2. https://github.com/scikit-learn/scikit-learn/blob/fd237278e/sklearn/model_selection/_split.py#L2029
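
A hedged sketch of a sampleBy-style operation on a cudf DataFrame (DataFrame.sample with frac/random_state is assumed to be available; this is not an existing cudf API):

import cudf

def sample_by(gdf: cudf.DataFrame, col, fractions, seed=None):
    # fractions: {stratum_value: sampling_fraction}; unlisted strata are dropped
    parts = [
        gdf[gdf[col] == value].sample(frac=frac, random_state=seed)
        for value, frac in fractions.items()
    ]
    return cudf.concat(parts)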

[FEA] epsilon kwarg for LogOp and possible rename

Issue by alecgunny
Monday May 11, 2020 at 16:05 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/105


  1. Add an eps kwarg to LogOp such that the math implemented is log(x+eps) instead of only log(x+1). I get the need for numeric stability, but this might be confusing for users who haven't read the source and expect log to just mean log, especially when they try to transform back to the original space (e.g. for metric calculation). Small values like 1e-7 could be useful to users who want "log-like" behavior but still need numeric stability.

  2. LogOp is the only op that has "Op" in its name; it might be worth just calling it Log for consistency.
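
A sketch of point 1, ignoring the op plumbing (CuPy handles the elementwise log; eps=1.0 reproduces the current log1p-style behaviour):

import cupy as cp
import cudf

def log_transform(gdf: cudf.DataFrame, cols, eps=1.0):
    for col in cols:
        # log(x + eps); small eps values give "log-like" output with numeric stability
        gdf[col] = cp.log(gdf[col].astype("float64").values + eps)
    return gdf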

[FEA] Custom Operators

Issue by benfred
Tuesday May 12, 2020 at 17:56 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/113


Is your feature request related to a problem? Please describe.

Creating a custom operator isn't as straightforward as it should be.

  • we should simplify the process to create a custom op
  • and provide an example of how to do this in the examples folder (a rough sketch follows below)
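
A rough sketch of what such an example could show, mirroring the TransformOperator pattern quoted in the issue below (the clip op itself is just a placeholder, and importing CONT from nvtabular.ops is an assumption):

import cudf
from nvtabular.ops import CONT, TransformOperator

class ClipMax(TransformOperator):
    default_in = CONT
    default_out = CONT

    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if not target_columns:
            return gdf
        new_gdf = gdf[target_columns].copy()
        new_gdf[new_gdf > 100] = 100   # cap values at 100
        new_gdf.columns = [f"{col}_{self._id}" for col in new_gdf.columns]
        return new_gdf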

Decoupling ZeroFill from Relu op

Currently the ZeroFill op also implicitly does something akin to a Relu op, i.e. replacing negative values with 0 via z_gdf[z_gdf < 0] = 0. I think these two behaviours should be decoupled and Relu should be made into an explicit op.

class ZeroFill(TransformOperator):
    default_in = CONT
    default_out = CONT

    @annotate("ZeroFill_op", color="darkgreen", domain="nvt_python")
    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        cont_names = target_columns
        if not cont_names:
            return gdf
        z_gdf = gdf[cont_names].fillna(0)
        z_gdf.columns = [f"{col}_{self._id}" for col in z_gdf.columns]
        z_gdf[z_gdf < 0] = 0
        return z_gdf
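
A sketch of the decoupled version: ZeroFill would keep only the fillna(0) step, and the clamping would move into an explicit Relu op in the same style (imports and the CONT constant are assumed to come from nvtabular.ops as above):

import cudf
from nvtabular.ops import CONT, TransformOperator

class Relu(TransformOperator):
    default_in = CONT
    default_out = CONT

    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if not target_columns:
            return gdf
        r_gdf = gdf[target_columns].copy()
        r_gdf.columns = [f"{col}_{self._id}" for col in r_gdf.columns]
        r_gdf[r_gdf < 0] = 0   # the clamping that ZeroFill currently does implicitly
        return r_gdf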

[FEA] Automatically calculate memory utilization parameters like limit_frac in dl_encoder.py

Issue by oyilmaz-nvidia
Wednesday Feb 26, 2020 at 15:57 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/17


Is your feature request related to a problem? Please describe.
It's not exactly a problem but a suggestion to utilize GPU or host memory better. When we use a constant value for memory utilization such as "limit_frac = 0.05", we might underutilize the memory, depending on the data.

Describe the solution you'd like
Maybe we can find a way to calculate these parameters automatically. If this doesn't seem feasible, we can leave it as it is.
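
One hedged direction for the automatic calculation: derive the fraction from the free GPU memory at runtime instead of hard-coding 0.05 (pynvml usage below is illustrative; the heuristic is an assumption, not the library's logic):

import pynvml

def auto_limit_frac(safety=0.8, device_index=0):
    # fraction of total GPU memory that is currently free, scaled by a safety margin
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return safety * info.free / info.total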
