nvidia-merlin / nvtabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

License: Apache License 2.0


nvtabular's Issues

[BUG] Error when apply_offline=False

Issue by oyilmaz-nvidia
Friday May 15, 2020 at 20:53 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/143


Describe the bug
Getting the following error when apply_offline=False:

proc.apply(train_ds_iterator, apply_offline=False, record_stats=True, shuffle=True, output_path=output_train_dir, num_out_files=35)

from the criteo notebook.

TypeError                                 Traceback (most recent call last)
<timed eval> in <module>

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/workflow.py in apply(self, dataset, apply_offline, record_stats, shuffle, output_path, num_out_files, hugectr_gen_output, hugectr_output_path, hugectr_num_out_files)
    743                 shuffler=shuffler,
    744                 num_out_files=num_out_files,
--> 745                 huge_ctr=huge_ctr,
    746             )
    747         if shuffle:

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/workflow.py in apply_ops(self, gdf, start_phase, end_phase, record_stats, shuffler, output_path, num_out_files, huge_ctr)
    797             start = time.time()
    798             gdf, stat_ops_ran = self.run_ops_for_phase(
--> 799                 gdf, self.phases[phase_index], record_stats=record_stats
    800             )
    801             self.timings["preproc_apply"] += time.time() - start

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/workflow.py in run_ops_for_phase(self, gdf, tasks, record_stats)
    621             elif op._id in self.feat_ops:
    622                 gdf = self.feat_ops[op._id].apply_op(
--> 623                     gdf, self.columns_ctx, cols_grp, target_cols=target_cols
    624                 )
    625             elif op._id in self.df_ops:

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/ops.py in apply_op(self, gdf, columns_ctx, input_cols, target_cols, stats_context)
    122     ):
    123         target_columns = self.get_columns(columns_ctx, input_cols, target_cols)
--> 124         new_gdf = self.op_logic(gdf, target_columns, stats_context=stats_context)
    125         self.update_columns_ctx(columns_ctx, input_cols, new_gdf.columns, target_columns)
    126         return self.assemble_new_df(gdf, new_gdf, target_columns)

~/miniconda3/envs/recsys-0507/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76 

~/miniconda3/envs/recsys-0507/lib/python3.7/site-packages/nvtabular-0.1.0-py3.7.egg/nvtabular/ops.py in op_logic(self, gdf, target_columns, stats_context)
    581         if not cont_names:
    582             return gdf
--> 583         z_gdf = gdf[cont_names].fillna(0)
    584         z_gdf.columns = [f"{col}_{self._id}" for col in z_gdf.columns]
    585         z_gdf[z_gdf < 0] = 0

TypeError: 'GPUDatasetIterator' object is not subscriptable

[OP] Add cosine_similarity operation

Issue by rnyak
Tuesday May 19, 2020 at 22:15 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/150


Is your operator request related to a problem? Please describe.

Cosine similarity was used as a feature engineering technique in the W&D Outbrain model. Cosine similarity is a metric that measures the similarity between two non-zero vectors of an inner product space.

Describe the solution you'd like

Apply cosine similarity as an operator.

A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type: Feature Engineering
  • input column type(s): Continuous numeric (X and Y as two vectors)
  • output column type(s): Continuous numeric within [-1, 1]
  • Expected transformation of the data after application: The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same.

Optional: Describe operation stages in detail*
Apply: compute the cosine similarity of v1 to v2: (v1 · v2) / (||v1|| * ||v2||); see Wikipedia for more information.

Additional context
Sklearn has a cosine_similarity function:

sklearn.metrics.pairwise.cosine_similarity(X, Y=None, dense_output=True)

An alternative is to calculate the cosine distance and then compute 1 - cosine_distance to get the cosine similarity.

Currently cuML does not have a Python or C++ API for cosine distance; there is only the prims API, which is header-only. The CUDA headers can be found at the following link:

https://github.com/rapidsai/cuml/tree/4084790afe605e82710597be575b63d8f57b1bbb/cpp/src_prims/distance

Wrappers would need to be created in libcuml for easy C++ consumption, as well as in Python.
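
Until a cuML API exists, the apply stage can be sketched directly with cudf and CuPy. A minimal sketch, assuming the two vectors X and Y are stored as two equal-length lists of continuous columns and that DataFrame.values returns a CuPy array (true in recent cudf releases); this is not an existing NVTabular op:

import cupy as cp
import cudf

def cosine_similarity(gdf: cudf.DataFrame, x_cols, y_cols, out_col="cos_sim"):
    # pull the two vector blocks out as (n_rows, n_dims) CuPy arrays
    x = gdf[x_cols].values
    y = gdf[y_cols].values
    # (v1 . v2) / (||v1|| * ||v2||), computed row-wise
    dots = (x * y).sum(axis=1)
    norms = cp.linalg.norm(x, axis=1) * cp.linalg.norm(y, axis=1)
    gdf[out_col] = dots / norms
    return gdf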

[Task] Evaluate the RecSys 2020 challenge workflow and create GitHub issues for all missing operators.

Issue by EvenOldridge
Friday May 15, 2020 at 20:41 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/142


What needs doing
Evaluate the RecSys 2020 challenge workflow and create github issues for all missing operators.
Issues should be under the nvTabular project and a part of the RecSys 2020 workflow milestone as this issue is.

Please create a new tab and add operators to both the workflow to operator master doc and the master operator doc. Please validate with the code in the source repo to ensure nothing is missing.

Additional context

Master operator doc:
https://docs.google.com/spreadsheets/d/1irirSo70PvuCovb_0nnJNjWUpSGhbjDOl72kxS_8fRk/edit#gid=1173451941

Workflow to operator master doc:
https://docs.google.com/spreadsheets/d/1EcY9n3uEUs3pPl7auEE4ahNQfjafaMTl6R-Zvk-7XlE/edit#gid=796720992

Source repo:
https://github.com/rapidsai/recsysChallenge2020/

[Task] Evaluate the W&D outbrains workflow and create tickets for all outstanding operators

Issue by EvenOldridge
Friday May 15, 2020 at 20:34 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/141


What needs doing
Evaluate the W&D outbrains workflow and create github issues for all outstanding operators.
Issues should be under the nvTabular project and a part of the outbrains milestone as this issue is.

The initial analysis of operators has been done in the workflow to operator master doc under the outbrains tab. Please validate with the code in the source repo to ensure nothing is missing.

Additional context

Master operator doc:
https://docs.google.com/spreadsheets/d/1irirSo70PvuCovb_0nnJNjWUpSGhbjDOl72kxS_8fRk/edit#gid=1173451941

Workflow to operator master doc:
https://docs.google.com/spreadsheets/d/1EcY9n3uEUs3pPl7auEE4ahNQfjafaMTl6R-Zvk-7XlE/edit#gid=796720992

Source repo:
https://gitlab-master.nvidia.com/dl/JoC/wide_deep_tf/tree/nvidia-release-20.04

[BUG] can't install nvtabular with conda with CUDA 10.1

Issue by benfred
Thursday May 14, 2020 at 19:43 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/137


Installing nvtabular with conda install -c nvidia -c rapidsai-nightly -c numba -c conda-forge nvtabular python=3.7 cudatoolkit=10.1 fails with:

Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                                                             
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Output in format: Requested package -> Available versions
Package cudatoolkit conflicts for:
nvtabular -> cudatoolkit[version='>=10.2.89,<10.3.0a0']
cudatoolkit=10.1
nvtabular -> cupy[version='>=7,<8.0.0a0'] -> cudatoolkit[version='10.0|10.0.*|10.2|10.2.*|9.2|9.2.*|10.1|10.1.*']
Package python conflicts for:
nvtabular -> cudf=0.14 -> python[version='>=3.6|>=3.6,<3.7.0a0|>=3.8,<3.9.0a0']
nvtabular -> python[version='>=3.7,<3.8.0a0']
python=3.7

The following specifications were found to be incompatible with your CUDA driver:
  - feature:/linux-64::__cuda==10.1=0
  - feature:|@/linux-64::__cuda==10.1=0
Your installed CUDA driver is: 10.1

[FEA] Dedicated objects for columns_ctx, stats_ctx, phase, config

Issue by alecgunny
Tuesday Mar 24, 2020 at 19:28 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/28


Is your feature request related to a problem? Please describe.
Right now many of the central objects in a Workflow are constructed as lists or dicts with some expected structure. This can make their meaning and use, as well as those of their constituent elements, difficult to understand when trying to debug or contribute. Examples include the columns_ctx, stats_ctx, phases, and the workflow config.

Describe the solution you'd like
Ideally, these would be replaced with dedicated objects that have descriptive attributes and methods, making their functionality and components clearer. Methods on these objects could even simplify Workflow code by replacing the methods that exist solely to update and retrieve information from them.

Describe alternatives you've considered
NamedTuples, or objects inheriting from NamedTuples, could be a simple way to ascribe fixed attributes, and could even maintain iterability to reduce short-term code updates.
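
A quick illustration of the NamedTuple direction, as a sketch only (the field names here are invented and do not reflect the actual structure of these objects):

from typing import List, NamedTuple

class Phase(NamedTuple):
    stat_ops: List[str]
    feat_ops: List[str]
    df_ops: List[str]

phase = Phase(stat_ops=["Moments"], feat_ops=["ZeroFill"], df_ops=["Categorify"])
# still iterable/unpackable, so existing list-style code keeps working short term
stat_ops, feat_ops, df_ops = phase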

[DOC] Update docstrings in groupby.py

Issue by rnyak
Friday May 15, 2020 at 18:35 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/139


Report incorrect documentation

Location of incorrect documentation
This request is for the GroupByMomentsCal class in groupby.py. The definitions of the class parameters need some clarification; it is not straightforward to understand what we need to feed to the GroupByMomentsCal class to create instances.

class GroupByMomentsCal(object):
    """
    This is the class that GroupByMoments uses to
    calculate the basic statistics of the data that
    is grouped by a categorical feature.
    Parameters
    -----------
    col : str
        column name
    col_count : str
        column name to get group counts
    cont_col : list of str
        pre-calculated unique values.
    stats : list of str, default ['count']
        count of groups = ['count']
        sum of cont_col = ['sum']
...

Describe the problems or issues found in the documentation

What do col, col_count, and cont_col refer to? The definitions do not really tell a user what parameters to provide. Does col refer to the name of the categorical column that we want to apply the groupby op on?

Suggested fix for documentation
The docstring for the GroupByMoments class reads well; we can modify the docstrings in the GroupByMomentsCal class accordingly.

class GroupByMoments(StatOperator):
    """
    One of the ways to create new features is to calculate
    the basic statistics of the data that is grouped by a categorical
    feature. This operator groups the data by the given categorical
    feature(s) and calculates the std, variance, and sum of requested continuous
    features along with count of every group. Then, merges these new statistics
    with the data using the unique ids of categorical data.
    Although you can directly call methods of this class to
    transform your categorical features, it's typically used within a
    Workflow class.
    Parameters
    -----------
    cat_names : list of str
        names of the categorical columns
    cont_names : list of str
        names of the continuous columns
    stats : list of str, default ['count']
        count of groups = ['count']
        sum of cont_col = ['sum']
...

Describe the documentation you'd like

GroupByMomentsCal params definition should be clarified.

[OP] add dropna() ops

Issue by rnyak
Tuesday May 26, 2020 at 14:34 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/170


Is your operator request related to a problem? Please describe.
dropna() is used in Outbrain's W&D model.

Describe the solution you'd like
A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type (Feature Engineering or Preprocessing): Preprocessing
  • input column type(s): Continuous and Categorical
  • output column type(s): Continuous and Categorical
  • Expected transformation of the data after application: Missing/Null values are removed.

Additional context
cudf has a dropna method that works as below:

gdf = gdf.dropna(subset=['column1', 'column2'])
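
For reference, a rough sketch of how this might look in the op_logic style used elsewhere in ops.py; the TransformOperator base class follows the pattern quoted in other issues on this page, and whether dropping rows composes cleanly with the rest of the workflow is an open question:

import cudf
from nvtabular.ops import TransformOperator

class Dropna(TransformOperator):
    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        # drop any row that has a null in one of the target columns
        return gdf.dropna(subset=target_columns)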

[OP] Add Min-Max scaling under Normalize() Class in ops.py

Issue by rnyak
Tuesday May 19, 2020 at 16:27 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/145


Is your operator request related to a problem? Please describe.
Min-Max scaling is one of the most common scaling techniques used for data preprocessing.

Describe the solution you'd like
A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type: Preprocessing
  • input column type(s): Continuous numeric
  • output column type(s): Float numeric
  • Expected transformation of the data after application: Data is transformed to be within [0-1] range.

Optional: Describe operation stages in detail*
Statistics per chunk: Ex. compute the min and max of the column
Statistics combine: NA
Apply: (value-min)/(max-min)

Additional context

Example code for Sklearn MinMaxScaler class can be found here:
https://github.com/scikit-learn/scikit-learn/blob/fd237278e/sklearn/preprocessing/_data.py#L200
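
A minimal stand-alone sketch of the stages on cudf columns (function and column handling are illustrative, not NVTabular's actual op API):

import cudf

def min_max_stats(gdf: cudf.DataFrame, cols):
    # statistics per chunk: collect the min and max of each column
    return {col: (gdf[col].min(), gdf[col].max()) for col in cols}

def min_max_apply(gdf: cudf.DataFrame, stats):
    # apply: (value - min) / (max - min), yielding values in [0, 1]
    for col, (lo, hi) in stats.items():
        gdf[col] = (gdf[col] - lo) / (hi - lo)
    return gdf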

[DOC] Fix the non-working links in the docs

Issue by rnyak
Monday May 11, 2020 at 21:44 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/111


Report incorrect documentation

Location of incorrect documentation

1.1 Introduction--> Examples and Tutorials:
Looks like the DLRM Criteo Workflow hyperlink does not point directly to the page.

Do we mean to refer to this page?
https://github.com/NVIDIA/DeepLearningExamples/tree/1cad1801646dee0e96e344b43c796181d6f05564/PyTorch/Recommendation/DLRM

1.2 The Rossmann Store Sales hyperlink yields a page not found error (HTTP 404).

2. Contributing

The Contributing.md hyperlink does not work (see sentence below).
“If you wish to contribute to the library directly please see Contributing.md. “

3. Learn More

The API documentation hyperlink does not work (see sentence below).

"We also have API documentation that outlines in detail the specifics of the calls available within the library."

4. Introduction --> Getting started: It would be better to provide a link for NVIDIA's Triton Inference Server where it first appears (see sentence below).

"Integration with model serving frameworks like NVidia’s Triton Inference Server to make model deployment easy."

Rossmann example: end-to-end accuracy

In continuation of https://github.com/rapidsai/recsys/issues/75

This has bugged me for a long long long time now, and finally I've made some progress:

The known SOTA, i.e. Kaggle top-10 LB, is 0.108 (RMSPE).
The NVTabular + fastai pipeline can now match this SOTA.

  1. fastai data -> fastai preproc -> fastai tabular model:
    epoch train_loss valid_loss exp_rmspe time
    0 0.009749 0.014706 0.118227 00:17
    1 0.009324 0.012835 0.112409 00:17
    2 0.008561 0.012784 0.110864 00:17
    3 0.007877 0.012329 0.109368 00:17
    4 0.007448 0.012266 0.109173 00:17

  2. fastai data -> nvtab preproc -> back to pandas -> fastai tabular model:
    epoch train_loss valid_loss exp_rmspe time
    0 0.066737 0.049522 0.206644 00:18
    1 0.018532 0.016543 0.132428 00:18
    2 0.012569 0.014468 0.120802 00:18
    3 0.010587 0.012440 0.110982 00:18
    4 0.008863 0.011744 0.108335 00:19

  3. fastai data -> nvtab preproc -> fastai tabular model:
    epoch train_loss valid_loss my_exp_rmspe time
    0 0.011587 0.012796 0.119197 00:12
    1 0.011569 0.013420 0.115161 00:12
    2 0.009520 0.011074 0.106952 00:12
    3 0.007833 0.011341 0.105943 00:12
    4 0.006971 0.010858 0.104360 00:12

There's something funny re. data ordering; in case (3), shuffling the data led to SOTA results:

proc.apply(train_ds_iterator, apply_offline=True, record_stats=True, output_path=PREPROCESS_DIR_TRAIN, shuffle=True, num_out_files=3)
proc.apply(valid_ds_iterator, apply_offline=True, record_stats=False, output_path=PREPROCESS_DIR_VALID, shuffle=True, num_out_files=3)

You can see my notebooks at /mnt/dldata/nvtabular-notebook-share.

[OP] Hash Bucketing

Issue by alecgunny
Wednesday May 20, 2020 at 16:40 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/153


Op for hashing categoricals and taking the modulo to map to some specified number of bins. Important for replacing inefficiencies in TensorFlow's hash bucket feature column.

The implementation could also be used to help with hashing out-of-vocabulary values as referenced in #29, as well as with an alternative frequency thresholding technique (i.e. instead of mapping everything below the threshold to one value, hash those values into bins).

Something like

import cudf
from nvtabular.ops import TransformOperator


class Hash(TransformOperator):
    def __init__(self, num_buckets, columns=None, **kwargs):
        if columns is None and not isinstance(num_buckets, int):
            # this is a potential API issue, and has implications for any op
            # that can take different arguments for each feature:
            # if I specify them individually at construction, I have no way
            # of checking the number of args against the number of
            # columns without just feeding `cat_names` again as `columns`,
            # which feels redundant. Maybe this isn't an issue, since you'll rarely
            # want to hash _everything_ and so would need to provide specific
            # `columns` regardless, but worth thinking about
            raise ValueError("Can't specify individual bucket counts without specifying columns")
        super(Hash, self).__init__(columns=columns, **kwargs)
        self.num_buckets = num_buckets

    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if isinstance(self.num_buckets, int):
            num_buckets = [self.num_buckets for _ in target_columns]
        else:
            num_buckets = self.num_buckets

        # this is the part that would need cudf support
        # I also don't know for sure if modulo is supported like that
        gdf[target_columns] = gdf[target_columns].hash() % num_buckets
        return gdf
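
For reference, a hedged stand-alone version of the same bucketing idea using cudf directly (Series.hash_values is available in more recent cudf releases; column name and bucket count are illustrative):

import cudf

gdf = cudf.DataFrame({"ad_id": [101, 202, 303]})
gdf["ad_id_bucket"] = gdf["ad_id"].hash_values() % 1000   # map hashes into 1000 bins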

[FEA] Custom parquet Metadata for HugeCTR

Issue by benfred
Tuesday May 26, 2020 at 22:58 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/173


In order for the HugeCTR parquet reader to be able to read the output of NVTabular, we're going to need to add custom metadata to the output dataset describing which columns are labels, continuous features, and categorical features.

For this to work in NVTabular, we need to ensure that we can write custom metadata fields from Python with cudf. Likewise, we need to be able to read this metadata from HugeCTR in C++.
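
As a sketch of the Python side, custom key/value metadata can already be attached to a parquet file via pyarrow (the metadata key and column layout below are made up for illustration; a cudf-native path would still be needed for GPU-resident data):

import json
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"label": [0, 1], "I1": [0.5, 0.1], "C1": [3, 7]})
meta = dict(table.schema.metadata or {})
# hypothetical metadata key describing the column roles
meta[b"nvtabular.column_types"] = json.dumps(
    {"labels": ["label"], "conts": ["I1"], "cats": ["C1"]}
).encode()
pq.write_table(table.replace_schema_metadata(meta), "part_0.parquet")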

[FEA] Multi-hot categorical support

Issue by alecgunny
Wednesday May 20, 2020 at 17:25 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/155


Multi-hot categorical features are a ubiquitous and important element of all production deep recommendation workflows. The ability to robustly learn dense representations of these features in an end-to-end manner represents one of the main advantages of deep learning approaches. Supporting representations of and transformations on these features is critical to broader adoption, and complex enough to possibly warrant a dedicated milestone. I'm creating this issue to start the discussion of how/where we want to start mapping out these solutions.

The most obvious first step to me is to decide on a representation pattern, as this will determine how we build op support on those representations. Some options include

  1. Dense dataframes padded with zeroes to some max number of elements
  2. Sparse dataframe version of the above
  3. Ragged-array series elements in dataframes

Option 1 would require the least overhead to extend support to, but obviously wastes memory and could be prohibitive for features that have category counts ranging over orders of magnitude (as is common). It also requires users to specify the max number of elements beforehand, which may not be known (unless we give them an op to compute it) and could change over time, potentially wasting memory or throwing out good data.

Options 2 and 3 would probably end up being pretty similar (I would imagine that specifying a max number of elements would end up being necessary for option 3), but 3 feels cleaner as it keeps features localized to individual columns (instead of spread out over many sparse columns) and keeps us from having to mix sparse and dense dataframes. It's also technically more memory efficient, since instead of each row requiring N (row_index, column_index, value) tuples, where N is the number of categories taken on by a given sample, you just need the array of N values and a single offset int.

One thing worth considering, though, is that if repeated categories are common, the ragged representation can become more memory intensive, since the value int in the sparse tuple would represent the k number of times that category occurs, while you would need k ints in the ragged representation for each time the category occurred.

One deciding factor in all of this is how we expect the APIs for multi-hot embeddings to be implemented. One method is to implement essentially a sparse matrix multiplication against the multi-hot encoded vector for each sample (with possibly some normalization to implement things like mean aggregation instead of just sum), which will be more efficient in the case of commonly repeated categories and, obviously, lends itself to the sparse representation. The other is to just perform a regular lookup on all the values and aggregate in segments using the offsets, which will lend itself to the ragged representation.

Long term, offering options for both representation and embedding choices will probably be most valuable to users. In the short term, it's worth picking one and starting to work on pushing for cudf support for it so we can begin to build op support. My personal vote is the ragged array option, since it will already be consistent with the PyTorch EmbeddingBag API, which we can port to TensorFlow, and seems like it would require the least overhead to support (since the sparse option seems like an extension of its functionality). Either way, even if it's not SOL in all cases, having one version accelerated is better than the existing options.
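
To make the ragged option concrete, here is a small illustration of the values-plus-offsets layout it implies, fed straight into torch.nn.EmbeddingBag (the numbers are made up):

import torch

# three rows of multi-hot categories: [3, 7, 7], [1], [4]
values = torch.tensor([3, 7, 7, 1, 4])
offsets = torch.tensor([0, 3, 4])   # start index of each row in `values`

bag = torch.nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")
row_embeddings = bag(values, offsets)   # shape: (3, 4)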

This requires some support in cudf to add list types:

  • Add nested list types to cudf
  • Read parquet files with nested lists
  • Write parquet files with nested lists
  • Python API support for list dtypes
  • Read/Write access to list values in a cudf dataframe (for instance, to hash/categorify the elements of a list)

NVTabular changes include:

  • Support for list types in Categorify op
  • Support list types in Hashing op
  • TensorFlow dataloader support
  • PyTorch dataloader support

[Task] Remove redundant dataset write operations

Issue by benfred
Saturday May 02, 2020 at 00:31 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/86


What needs doing
There are multiple spots where we're writing out a dataset; we should clean up and remove the redundant code.

Additional context

The current code that is actively writing out a dataset is https://github.com/rapidsai/recsys/blob/31e6620c907782de425d4b2f01adb8e48ae639d5/nvtabular/nvtabular/preproc.py#L884-L886

This switches between the shuffled and non-shuffled code paths.

We also have code in ds_writer that still uses the pyarrow writer: https://github.com/rapidsai/recsys/blob/master/nvtabular/nvtabular/ds_writer.py

We also have an 'Export' op that also writes out the dataset: https://github.com/rapidsai/recsys/blob/31e6620c907782de425d4b2f01adb8e48ae639d5/nvtabular/nvtabular/ops.py#L579-L633

We should clean up so that there is one definitive spot for writing a dataset, probably by adding the missing features to the Export operator and deleting the other references.

Rossmann example: shuffle data failed

Shuffling data without setting num_out_files will crash:

We need to set num_out_files manually, or the argument should be given a default value.

proc.apply(valid_ds_iterator, apply_offline=True, record_stats=False, output_path=PREPROCESS_DIR_VALID, shuffle=True)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-88bd63371ae5> in <module>
      1 proc.apply(train_ds_iterator, apply_offline=True, record_stats=True, output_path=PREPROCESS_DIR_TRAIN, shuffle=True, num_out_files=3)
----> 2 proc.apply(valid_ds_iterator, apply_offline=True, record_stats=False, output_path=PREPROCESS_DIR_VALID, shuffle=True)

/nvtabular/nvtabular/workflow.py in apply(self, dataset, apply_offline, record_stats, shuffle, output_path, num_out_files)
    758             self.finalize()
    759         if shuffle:
--> 760             shuffler = Shuffler(output_path, num_out_files=num_out_files)
    761         if apply_offline:
    762             self.update_stats(

/nvtabular/nvtabular/io.py in __init__(self, out_dir, num_out_files, num_threads)
    364     def __init__(self, out_dir, num_out_files=30, num_threads=4):
    365         self.queue = queue.Queue(num_threads)
--> 366         self.write_locks = [threading.Lock() for _ in range(num_out_files)]
    367         self.writer_files = [os.path.join(out_dir, f"{i}.parquet") for i in range(num_out_files)]
    368         self.writers = [ParquetWriter(f, compression=None) for f in self.writer_files]

TypeError: 'NoneType' object cannot be interpreted as an integer
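
One possible fix, sketched rather than an actual patch: guard the Shuffler construction (or give num_out_files a default in apply) so a missing kwarg never reaches range():

# inside Workflow.apply, before constructing the Shuffler (sketch only)
if shuffle:
    if num_out_files is None:
        num_out_files = 30   # match Shuffler's own default
    shuffler = Shuffler(output_path, num_out_files=num_out_files)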

[BUG] GroupByMoments requires cont_names and has to be a list

Issue by bschifferer
Wednesday May 20, 2020 at 14:35 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/152


The docstring of GroupByMoments is not aligned with the functionality of the operator:

  1. If I use "count" as the only stat, there should be no need to define cont_names; in that case, however, there is an error.
  2. The docstring says that cont_names can be a string, but if cont_names is a single column passed as a string, there is an error.

Steps to reproduce

import numpy as np
import pandas as pd
import cudf
import nvtabular as nvt
from nvtabular.ops import GroupByMoments

cat_1 = np.asarray(['a']*12 + ['b']*10 + ['c']*10)
num_1 = np.asarray([1,1,2,2,2,1,1,5,4,4,4,4, 1,2,3,4,5,6,7,8,9,10, 1,2,3,4,5,6,7,8,9,10])
pdf_1 = pd.DataFrame({'cat': cat_1, 'num': num_1})

gdf = cudf.from_pandas(pdf_1)

proc = nvt.Workflow(cat_names=['cat'], cont_names=['num'], label_name=[])
proc.finalize()

# First bug
gm = GroupByMoments(
    cat_names='cat',
    stats=['count']
)
gm.apply_op(gdf, columns_ctx=proc.columns_ctx, input_cols='all', target_cols=['base'])
gm.read_fin()
#TypeError: only integer scalar arrays can be converted to a scalar index

# Second bug
gm = GroupByMoments(
    cat_names='cat',
    cont_names='num',
    stats=['count']
)
gm.apply_op(gdf, columns_ctx=proc.columns_ctx, input_cols='all', target_cols=['base'])
gm.read_fin()
#AttributeError: 'str' object has no attribute 'copy'

# This works
gm = GroupByMoments(
    cat_names='cat',
    cont_names=['num'],
    stats=['count']
)
gm.apply_op(gdf, columns_ctx=proc.columns_ctx, input_cols='all', target_cols=['base'])
gm.read_fin()

[FEA] make merge operation optional after Groupby

Issue by rnyak
Thursday May 21, 2020 at 17:35 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/163


Is your feature request related to a problem? Please describe.

Currently, after applying the GroupBy operation, the grouped columns are merged with the original dataframe (gdf) and we get a new_gdf (see the op_logic function below). Could we have some flexibility here? For example, adding a merge option (param) such as merge=True or inplace=True to the GroupBy operation: if True, the grouped features would be merged; if False, the user would get a separate dataframe containing only the grouped columns (cats and conts).

class GroupBy(DFOperator):
    """
    One of the ways to create new features is to calculate
    the basic statistics of the data that is grouped by a categorical
    feature. This operator groups the data by the given categorical
    feature(s) and calculates the std, variance, and sum of requested continuous
    features along with count of every group. Then, merges these new statistics
    with the data using the unique ids of categorical data.
    Although you can directly call methods of this class to
    transform your categorical features, it's typically used within a
    Workflow class.
    Parameters
    -----------
   ....
    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if self.cat_names is None:
            raise ValueError("cat_names cannot be None.")

        new_gdf = cudf.DataFrame()
        for name in stats_context["moments"]:
            tran_gdf = stats_context["moments"][name].merge(gdf)
            new_gdf[tran_gdf.columns] = tran_gdf

        return new_gdf

Describe the solution you'd like

This is an example of how we can apply the GroupBy operation:
proc.add_cat_feature(GroupBy(cat_names=cat_names[0], cont_names=cols[0:2], stats=['count', 'sum']))

Can we add a merge param here, like below?

proc.add_cat_feature(GroupBy(cat_names=cat_names[0], cont_names=cols[0:2], stats=['count', 'sum']), merge=True)

One aspect is that, if merge=False, we need to return a separate df as the result of the GroupBy operation. It would be better to think of use cases where this will be practically useful (a rough sketch of the op_logic change follows the list below):

  • we are applying GroupBy and all we want is to use the grouped features.
  • we are applying GroupBy and all we want is the stats that we obtain, maybe to use these stats as a normalization factor for some other feature?
  • we want to merge it with the original gdf, but can we drop the original columns (if we want to)?
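
A rough sketch of what the op_logic change could look like with such a flag (self.merge is hypothetical, and the shape of the non-merged return would need more thought):

def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
    if self.cat_names is None:
        raise ValueError("cat_names cannot be None.")

    if not self.merge:
        # return only the grouped statistics, one row per category value
        return cudf.concat(list(stats_context["moments"].values()), axis=1)

    new_gdf = cudf.DataFrame()
    for name in stats_context["moments"]:
        tran_gdf = stats_context["moments"][name].merge(gdf)
        new_gdf[tran_gdf.columns] = tran_gdf
    return new_gdf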

[OP] Add dropDuplicates() operator

Issue by rnyak
Tuesday May 26, 2020 at 18:00 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/172


Is your operator request related to a problem? Please describe.

The dropDuplicates() method is used in the Outbrain W&D model, and it is one of the commonly used methods in data preprocessing.

Describe the solution you'd like
A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type (Feature Engineering or Preprocessing): Preprocessing
  • input column type(s): Continuous and categorical
  • output column type(s): Continuous and categorical
  • Expected transformation of the data after application: Return DataFrame with duplicate rows removed.

Additional context
cudf has a drop_duplicates() method, applied as below:

cdf.drop_duplicates(keep= 'first', inplace=True)

[FEA] Move or remove get_emb_sz method on Categorify op

Issue by alecgunny
Wednesday May 13, 2020 at 16:53 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/121


It's not entirely clear to me why Categorify.get_emb_sz and its associated helper methods are included as methods on the Categorify object, especially when nothing in the code actually uses attributes from the object itself (with the exception of self.embed_sz, which gets set rather than read):

    def get_emb_sz(self, encoders, cat_names):
        work_in = {}
        for key in encoders.keys():
            work_in[key] = encoders[key] + 1
        # sorted key required to ensure same sort occurs for all values
        ret_list = [
            (n, self.def_emb_sz(work_in, n))
            for n in sorted(cat_names, key=lambda entry: entry.split("_")[0])
        ]
        return ret_list

    def emb_sz_rule(self, n_cat: int) -> int:
        return min(16, round(1.6 * n_cat ** 0.56))

    def def_emb_sz(self, classes, n, sz_dict=None):
        """Pick an embedding size for `n` depending on `classes` if not given in `sz_dict`.
        """
        sz_dict = sz_dict if sz_dict else {}
        n_cat = classes[n]
        sz = sz_dict.get(n, int(self.emb_sz_rule(n_cat)))  # rule of thumb
        self.embed_sz[n] = sz
        return n_cat, sz

I'm personally of the opinion that it's not our library's job to provide rules of thumb for building deep learning models, only to provide the data for whatever rules the user wants to apply. But if we're intent on having this function somewhere, it seems like it would be better served as a standalone function of a single encoder, or as a property of the encoder itself:

emb_szs = [get_embed_sz(proc.stats["categories"][column]) for column in proc.columns_ctx["categorical"]["base"]]

# or
emb_szs = [proc.stats["categories"][column].embed_sz[1] for column in proc.columns_ctx["categorical"]["base"]]
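
For concreteness, the standalone-function version might look like the following, reusing the rule of thumb from the code above (the encoder argument is whatever object exposes the category count; calling len() on it is an assumption):

def get_embed_sz(encoder, name=None, sz_dict=None):
    # +1 for the null/out-of-vocabulary category, as in get_emb_sz above
    n_cat = len(encoder) + 1
    sz_dict = sz_dict or {}
    sz = sz_dict.get(name, min(16, round(1.6 * n_cat ** 0.56)))
    return n_cat, sz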

[OP] Join onto external dataframe, file, dictionary

Issue by EvenOldridge
Thursday Mar 26, 2020 at 02:22 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/32


Is your operator request related to a problem? Please describe.
Outbrain dataset.

Join to external data in multiple formats. This mechanism will also be used / shared by https://github.com/rapidsai/recsys/issues/31 to apply the groupby operations.

Describe the solution you'd like

  • Type: Feature Engineering
  • input column type(s): [Categorical]
  • input options:
    • Columns to join on: [Categorical] (just in case naming isn't the same?)
    • Categorical column names to be joined
    • Continuous column names to be joined
  • Expected transformation of the data after application: Data is joined

Optional: Describe operation stages in detail*
Statistics per chunk: N/A
Statistics combine: N/A
Apply: Join

Context:
Note that this is meant to be an iterator-to-gdf join and not an iterator-to-iterator join, which we'll figure out in the future.
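
A minimal sketch of the apply stage, assuming the external data has already been loaded into a cudf DataFrame (file name and column arguments are illustrative):

import cudf

external = cudf.read_parquet("documents_meta.parquet")

def join_external(gdf, on, cat_cols, cont_cols):
    # left join each processed chunk against the pre-loaded external frame
    keep = [on] + cat_cols + cont_cols
    return gdf.merge(external[keep], on=on, how="left")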

[FEA] Faster HugeCTR output generation

Issue by oyilmaz-nvidia
Thursday May 21, 2020 at 16:03 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/161


Generating HugeCTR outputs is very slow in the current version of the code. We are using Python and frameworks like cudf, pandas, and numpy to do this.

C++/CUDA code can do a much better job here. If we pass the GPU memory references from cudf to a C++/CUDA layer, we can quickly create the HugeCTR file format and write the data out to files.

[BUG] NV-Tabular: end-to-end accuracy on Rossmann dataset not achieved

Issue by vinhngx
Monday Apr 27, 2020 at 22:37 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/75


Describe the bug
Hi folks, I spent some time trying to improve the accuracy of the model on the Rossmann data, but it's not anywhere near competitive. The fastai augmented dataset (as available from https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson6-rossmann.ipynb), when used with the fastai preprocessing pipeline, achieves a top-10 Kaggle leaderboard result. When using the NVTabular pipeline, I tried to mimic the same pipeline, but the end accuracy is far from competitive.

I'm using the same engineered data provided by fastai, including weather, Google Trends, etc. So only the preprocessing and encoding differ between the two prep pipelines; accuracy should be similar.
Steps/Code to reproduce bug

I've shared the notebook and data here on dlcluster: /mnt/dldata/nvtabular-notebook-share if you'd like to have a look

Expected behavior
Since the same engineered data provided by fastai (including weather, Google Trends, etc.) is used, and only the preprocessing and encoding differ between the two prep pipelines, accuracy should be similar.

Environment details (please complete the following information):
On dlcluster

nvidia-docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 -p 3001:3001 --ipc=host --name dev_nite --net=host -v /mnt:/mnt -v /mnt/dldata/vinhn/recsys:/recsys gitlab-master.nvidia.com:5005/rapidsdl/docker/rapidsdl_nite:latest /bin/bash

cd /recsys/nv-tabular
pip install -e .
source activate rapids && jupyter-lab --allow-root --ip='0.0.0.0' --NotebookApp.token=''


[FEA] DLLabelEncoder kwarg for raising errors on out-of-vocabulary

Issue by alecgunny
Tuesday Mar 24, 2020 at 19:44 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/29


Is your feature request related to a problem? Please describe.
DLLabelEncoder by default reserves 0 for missing or out-of-vocabulary entries. While this is sensible default behavior, I can imagine scenarios where you know all the categories explicitly beforehand, and any sample with a value outside of these categories is problematic and should raise an error. In this case, the categories would map to [0, num_categories-1].

Describe the solution you'd like
Add a kwarg to DLLabelEncoder that can toggle this behavior. One possibility, used by TensorFlow's tf.feature_column.categorical_column_with_vocabulary_list, is a num_oov_buckets kwarg that defaults to 1 but can be set to 0, indicating that no out-of-vocabulary inputs should be tolerated.

As a possible, but not strictly necessary, addition, higher values could be used to hash OOV inputs into different bins. In that case, it's unclear whether to assign OOV values to the first num_oov_buckets integers or to the range [num_categories, num_categories+num_oov_buckets-1].

Describe alternatives you've considered
I'm open to the argument that, when out-of-vocabulary values are unacceptable, the onus is on the data scientist to make sure of this when feeding data in. But that feels like a silent failure, which isn't desirable. It also forces them to reserve category 0 for a value that will never come, which can be a minor inconvenience.
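
A sketch of the toggle, not DLLabelEncoder's real internals (_lookup is a hypothetical helper that returns -1 for unseen values):

def transform(self, series, num_oov_buckets=1):
    codes = self._lookup(series)              # hypothetical: -1 for unseen values
    if num_oov_buckets == 0:
        if (codes == -1).any():
            raise ValueError("out-of-vocabulary value encountered")
        return codes                          # categories map to [0, num_categories - 1]
    return codes + num_oov_buckets            # reserve [0, num_oov_buckets) for OOV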

[OP] Add sampling technique that allows stratified sampling

Issue by rnyak
Tuesday May 19, 2020 at 20:38 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/149


Is your operator request related to a problem? Please describe.

This method was used in the W&D Outbrain repo to generate the validation set.

Describe alternatives you've considered
The method used is Spark's sampleBy function:

sampleBy(col, fractions, seed=None)
Parameters:
  col – column that defines strata
  fractions – sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
  seed – random seed
Returns: a new DataFrame that represents the stratified sample

Describe the solution you'd like
A clear and concise description of the operation you'd like to perform on the column. Please include:

  • Type: Preprocessing
  • input column type(s): Dataframe and a specific column used as input- [categorical, continuous].
  • output column type(s): [categorical, continuous]
  • Expected transformation of the data after application: Returns a stratified sample with or without replacement based on the fraction given on each stratum.

Additional context
Sklearn and pandas have existing sampling methods, whereas cudf does not have such an operation yet:

  1. sklearn.model_selection.train_test_split(*arrays, **options)

  2. https://github.com/scikit-learn/scikit-learn/blob/fd237278e/sklearn/model_selection/_split.py#L2029
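
A hedged sketch of a sampleBy-style operation on a cudf DataFrame (DataFrame.sample with frac/random_state is assumed to be available; this is not an existing cudf API):

import cudf

def sample_by(gdf: cudf.DataFrame, col, fractions, seed=None):
    # fractions: {stratum_value: sampling_fraction}; unlisted strata are dropped
    parts = [
        gdf[gdf[col] == value].sample(frac=frac, random_state=seed)
        for value, frac in fractions.items()
    ]
    return cudf.concat(parts)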

[FEA] epsilon kwarg for LogOp and possible rename

Issue by alecgunny
Monday May 11, 2020 at 16:05 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/105


  1. Add an eps kwarg to LogOp such that the math implemented is log(x+eps) instead of only log(x+1). I get the need for numeric stability, but this might be confusing for users who haven't read the source and expect log to just mean log, especially when they try to transform back to the original space (e.g. for metric calculation). Small values like 1e-7 could be useful to users who want "log-like" behavior but still need numeric stability.

  2. LogOp is the only op that has "Op" in its name; it might be worth just calling it Log for consistency.
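
A sketch of point 1, ignoring the op plumbing (CuPy handles the elementwise log; eps=1.0 reproduces the current log1p-style behaviour):

import cupy as cp
import cudf

def log_transform(gdf: cudf.DataFrame, cols, eps=1.0):
    for col in cols:
        # log(x + eps); small eps values give "log-like" output with numeric stability
        gdf[col] = cp.log(gdf[col].astype("float64").values + eps)
    return gdf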

[FEA] Custom Operators

Issue by benfred
Tuesday May 12, 2020 at 17:56 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/113


Is your feature request related to a problem? Please describe.

Creating a custom operator isn't as straightforward as it should be.

  • we should simplify the process to create a custom op
  • and provide an example of how to do this in the examples folder (a rough sketch follows below)
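
A rough sketch of what such an example could show, mirroring the TransformOperator pattern quoted in the issue below (the clip op itself is just a placeholder, and importing CONT from nvtabular.ops is an assumption):

import cudf
from nvtabular.ops import CONT, TransformOperator

class ClipMax(TransformOperator):
    default_in = CONT
    default_out = CONT

    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if not target_columns:
            return gdf
        new_gdf = gdf[target_columns].copy()
        new_gdf[new_gdf > 100] = 100   # cap values at 100
        new_gdf.columns = [f"{col}_{self._id}" for col in new_gdf.columns]
        return new_gdf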

Decoupling ZeroFill from Relu op

Currently the ZeroFill op also implicitly does something akin to a Relu op, i.e. replacing negative values with 0 via z_gdf[z_gdf < 0] = 0. I think these two behaviours should be decoupled and Relu should be made into an explicit op.

class ZeroFill(TransformOperator):
    default_in = CONT
    default_out = CONT

    @annotate("ZeroFill_op", color="darkgreen", domain="nvt_python")
    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        cont_names = target_columns
        if not cont_names:
            return gdf
        z_gdf = gdf[cont_names].fillna(0)
        z_gdf.columns = [f"{col}_{self._id}" for col in z_gdf.columns]
        z_gdf[z_gdf < 0] = 0
        return z_gdf
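
A sketch of the decoupled version: ZeroFill would keep only the fillna(0) step, and the clamping would move into an explicit Relu op in the same style (imports and the CONT constant are assumed to come from nvtabular.ops as above):

import cudf
from nvtabular.ops import CONT, TransformOperator

class Relu(TransformOperator):
    default_in = CONT
    default_out = CONT

    def op_logic(self, gdf: cudf.DataFrame, target_columns: list, stats_context=None):
        if not target_columns:
            return gdf
        r_gdf = gdf[target_columns].copy()
        r_gdf.columns = [f"{col}_{self._id}" for col in r_gdf.columns]
        r_gdf[r_gdf < 0] = 0   # the clamping that ZeroFill currently does implicitly
        return r_gdf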

[FEA] Automatically calculate memory utilization parameters like limit_frac in dl_encoder.py

Issue by oyilmaz-nvidia
Wednesday Feb 26, 2020 at 15:57 GMT
Originally opened as https://github.com/rapidsai/recsys/issues/17


Is your feature request related to a problem? Please describe.
It's not exactly a problem but a suggestion to utilize GPU or host memory better. When we use a constant value for memory utilization such as "limit_frac = 0.05", we might underutilize the memory, depending on the data.

Describe the solution you'd like
Maybe we can find a way to calculate these parameters automatically. If this doesn't seem feasible, we can leave it as it is.
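
One hedged direction for the automatic calculation: derive the fraction from the free GPU memory at runtime instead of hard-coding 0.05 (pynvml usage below is illustrative; the heuristic is an assumption, not the library's logic):

import pynvml

def auto_limit_frac(safety=0.8, device_index=0):
    # fraction of total GPU memory that is currently free, scaled by a safety margin
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return safety * info.free / info.total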
