linkedin / photon-ml Goto Github PK

A scalable machine learning library on Apache Spark

License: Other

Groovy 0.23% Shell 0.09% Python 0.06% XSLT 0.16% Scala 48.13% Java 0.06% Terra 51.28%

photon-ml's Introduction

Photon Machine Learning (Photon ML)

Photon ML is a machine learning library based on Apache Spark. It was originally developed by the LinkedIn Machine Learning Algorithms Team. Currently, Photon ML supports training different types of Generalized Linear Models(GLMs) and Generalized Linear Mixed Models(GLMMs/GLMix model): logistic, linear, and Poisson.

Features
Experimental Features
How to Build
How to Use
Modules and directories
- Source code
- Other
IntelliJ IDEA setup
How to Contribute
Reference

Features

Generalized Linear Models

Linear Regression
Logistic Regression
Poisson Regression

GAME - Generalized Additive Mixed Effects

The GAME algorithm uses coordinate descent to expand beyond traditional GLMs to further provide per-entity (per-user, per-item, per-country, etc.) coefficients (also known as random effects in statistics literature). It manages to scale model training up to hundreds of billions of coefficients, while remaining solvable within Spark's framework.

For example, a GAME model for movie recommendations can be formulated as (fixed effect model + per-user random effect model + per-movie random effect model + user-movie matrix factorization model). More details on GAME models can be found here.

The type of GAME model currently supported by Photon ML is the GLMM or 'GLMix' model. Many of LinkedIn's core products have adopted GLMix models: jobs search and recommendation, news feed ranking, Ads CTR prediction and "People Also Viewed". More details on GLMix models can be found here.

Configurable Optimizers

LBFGS
TRON

Regularization

L1 (LASSO) regularization
L2 (Tikhonov) regularization (only type supported by TRON)
Elastic-net regularization

Feature scaling and normalization

Standardization: Zero-mean, unit-variant normalization
Scaling by standard deviation
Scaling by maximum magnitude to range [-1, 1]

Offset training

A typical naive way of training multi-layer models, it's used to insert another model's response into a global model. For example, when doing a typical binary classification problem, a model could be trained against a subset of all the features. Next, data could be scored with this model and the response scores set as 'offset' values. In this way, future models will learn against the residuals of the 1st layer model's response while having the benefits of combining the two models together.

Feature summarization

Provides typical metrics (mean, min, max, std, variance, etc.) on a per feature basis.

Model validation

Compute evaluation metrics for the trained models over a validation dataset, such as AUC, RMSE, or Precision@k.

Warm-start training

Load existing models and use their coefficients as a starting point for optimization. When training multiple models in succession, use the coefficients of the previous model.

Partial re-training

Load existing models, but lock their coefficients. Allows efficient re-training of portions of a GAME model.

Incremental Learning

Load existing models, use their coefficients and variances to construct an informative prior to train models incrementally. Incremental trained models have comparable performance as the model using both the previous data and current data.

Experimental Features

Photon ML currently contains a number of experimental features that have not been fully tested.

Smoothed Hinge Loss Linear SVM

In addition to the Generalized Linear Models described above, Photon-ML also supports an optimizer-friendly approximation for linear SVMs as described here by Jason D. M. Rennie.

Hyperparameter Auto-Tuning

Automatically explore the hyperparameter space for your GAME model. Two types of search exist:

Random search: Use Sobol sequences to randomly, but evenly, explore the hyperparameter space
Bayesian search: Use a Gaussian process to perform a directed search throughout the hyperparameter space

How to Build

Note: Before building, please make sure environment variable JAVA_HOME is pointed at a Java 8 JDK property. Photon ML is not compatible with JDK < 1.8. The below commands are for Linux/Mac users, for Windows, please use gradlew.bat instead of gradlew.

# Build binary jars only:
./gradlew assemble

# Build with all tests (unit and integration):
./gradlew clean build

# Build with only unit tests:
./gradlew clean build -x integTest

# Build with only integration tests:
./gradlew clean build -x test

# Build with no tests:
./gradlew clean build -x test -x integTest

# Run unit tests:
./gradlew test

# Run integration tests:
./gradlew integTest

# Check License with Apache Rat
./gradlew rat

# Check scala style
./gradlew scalastyle

# Check everything
./gradlew check

How to Use

Drivers

To use Photon ML from the command line, 3 default drivers exist: the Legacy Photon driver for GLM training, the GAME training driver, and the GAME scoring driver. Each of these have their own input parameters. We recommend using the GAME drivers, as a GLM is a special case of GAME model. The Legacy Photon driver has not been developed for some time and is deprecated.

API

Photon ML can be imported just like Spark ML, and the API layer used directly. Where possible, we have tried to make the interfaces identical to those of Spark ML. See the driver source code for examples of how to use the Photon ML API.

Avro Schemas

The currently available drivers read/write data in Apache Avro format. The detailed schemas are declared at photon-avro-schemas module.

What about other formats?

LinkedIn uses primarily Avro formatted data. While Avro does provide a unified and rigorous way of managing all critical data representations, we think it is also important to allow other data formats to make Photon ML more flexible. Contributions of DataReaders for other formats to Photon ML are welcome and encouraged.

Input Data Format

Photon ML reserves the following field names in the Avro input data:

response: double (required)
- The response/label for the event
weight: double (optional)
- The relative weight of a particular sample compared to other samples
- Default = 1.0
offset: double (optional)
- The residual score computed by some other model
- Default = 0.0
- Computed scores always take the form (x * B) + offset, where x is the feature vector and B is the coefficient vector
uid: string, int, or long (optional)
- A unique ID for the sample
metadataMap: map: [string] (optional)
- A map of non-feature metadata for the sample
features: array: [FeatureAvro] (required by Legacy Photon driver)
- An array of features to use for training/scoring

All of these default names can be overwritten using the GAME drivers. However, they are reserved and cannot be used for purposes other than their default usage (e.g. cannot specify response as your weight column).

Additional fields may exist in the record, and in fact are necessary for certain features (e.g. must have ID fields to group data by for random effect models or certain validation metrics).

Features loaded through the existing drivers are expected to follow the LinkedIn naming convention. Each feature must be an Avro record with the following fields:

name: string

The feature name/category

term: string

The feature sub-category

value: double

The feature value

To demonstrate the difference between name and term, consider the following categorical features:

  name = "age"
  term = "0-10"
  value = 1.0
  
  name = "age"
  term = "11-20"
  value = 0.0
  
  ...

Models

Legacy Photon outputs model coefficients directly to text:

# For each line in the text file:
[feature_string]\t[feature_id]\t[coefficient_value]\t[regularization_weight]

GAME models are output using the BayesianLinearModelAvro Avro schema.

Shaded Jar

photon-all module releases a shaded jar containing all the required runtime dependencies of Photon ML, other than Spark and Hadoop. Shading is a robust way of creating fat/uber jars. It does not only package all dependencies into one single place, but also smartly renames a few selected class packages to avoid dependency conflicts. Although photon-all.jar is not a necessity, and it is fine for users to provide their own copies of dependences, it is highly recommended to be used in cluster environment where complex dependency conflicts could happen between system and user jars. (See Gradle Shadow Plugin for more about shading).

Below is a command to build the photon-all jar:

./gradlew :photon-all:assemble

Try It Out!

The easiest way to get started with Photon ML is to try the tutorial we created to demonstrate how GLMix models can be applied to build a personalized recommendation system. You can view the instructions on the wiki here.

Alternatively, you can follow these steps to try Photon ML on your machine.

Install Spark

This step is platform-dependent. On OS X, you can install Spark with Homebrew using the following command:

brew install apache-spark

For more information, see the Spark docs.

Get and Build the Code

git clone [email protected]:linkedin/photon-ml.git
cd photon-ml
./gradlew build -x test -x integTest

Grab a Dataset

For this example, we'll use the "a1a" dataset, acquired from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. Currently the Photon ML dataset converter supports only the LibSVM format.

curl -O https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a
curl -O https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a.t

Convert the data to the Avro format that the Photon ML drivers use.

mkdir -p a1a/train
mkdir -p a1a/test
pip install avro
python dev-scripts/libsvm_text_to_trainingexample_avro.py a1a dev-scripts/TrainingExample.avsc a1a/train/a1a.avro
python dev-scripts/libsvm_text_to_trainingexample_avro.py a1a.t dev-scripts/TrainingExample.avsc a1a/test/a1a.t.avro

The first command might be different, depending on the configuration of your system. If it fails, try your platform's standard approach for installing a Python library.

Train the Model

Now we're ready to train the model with Photon ML on your local dev box. Run the following command from the "photon-ml" directory:

spark-submit \
  --class com.linkedin.photon.ml.Driver \
  --master local[*] \
  --num-executors 4 \
  --driver-memory 1G \
  --executor-memory 1G \
  "./build/photon-all_2.10/libs/photon-all_2.10-1.0.0.jar" \
  --training-data-directory "./a1a/train/" \
  --validating-data-directory "./a1a/test/" \
  --format "TRAINING_EXAMPLE" \
  --output-directory "out" \
  --task "LOGISTIC_REGRESSION" \
  --num-iterations 50 \
  --regularization-weights "0.1,1,10,100" \
  --job-name "demo_photon_ml_logistic_regression"

Alternatively, to run the exact same training using the GAME training driver, use the following command:

spark-submit \
  --class com.linkedin.photon.ml.cli.game.GameTrainingDriver \
  --master local[*] \
  --num-executors 4 \
  --driver-memory 1G \
  --executor-memory 1G \
  "./build/photon-all_2.10/libs/photon-all_2.10-1.0.0.jar" \
  --input-data-directories "./a1a/train/" \
  --validation-data-directories "./a1a/test/" \
  --root-output-directory "out" \
  --feature-shard-configurations "name=globalShard,feature.bags=features" \
  --coordinate-configurations "name=global,feature.shard=globalShard,min.partitions=4,optimizer=LBFGS,tolerance=1.0E-6,max.iter=50,regularization=L2,reg.weights=0.1|1|10|100" \
  --coordinate-update-sequence "global" \
  --coordinate-descent-iterations 1 \
  --training-task "LOGISTIC_REGRESSION"

When this command finishes, you should have a new folder named "out" containing the trained model.

Running Photon ML on Cluster Mode

In general, running Photon ML is no different from running other general Spark applications. As a result, using the spark-submit script in Spark’s bin directory we can run Photon ML on different cluster modes:

Below is a template for running a logistic regression training job with minimal setup on YARN. For running Photon ML using other cluster modes, the relevant arguments to spark-submit can be modified as detailed in http://spark.apache.org/docs/latest/submitting-applications.html.

spark-submit \
  --class com.linkedin.photon.ml.Driver \
  --master yarn \
  --deploy-mode cluster \
  --num-executors $NUM_EXECUTORS \
  --driver-memory $DRIVER_MEMORY \
  --executor-memory $EXECUTOR_MEMORY \
  "./build/photon-all_2.10/libs/photon-all_2.10-1.0.0.jar" \
  --training-data-directory "path/to/training/data" \
  --validating-data-directory "path/to/validating/data" \
  --output-directory "path/to/output/dir" \
  --task "LOGISTIC_REGRESSION" \
  --num-iterations 50 \
  --regularization-weights "0.1,1,10" \
  --job-name "demo_photon_ml_logistic_regression"

TODO: This example should be updated to use the GAME training driver instead. There is also a more complex script demonstrating advanced options and customizations of using Photon ML at example/run_photon_ml.driver.sh.

Detailed usages are described via command:

./run_photon_ml.driver.sh [-h|--help]

Note: Not all configurations are currently exposed as options in the current script. Please directly modify the configurations if any customization is needed.

Modules and directories

Source code

TODO: Photon ML modules are in need of a refactor. Once this is complete, this section will be updated.

Other

build-scripts contains scripts for Gradle tasks
buildSrc contains Gradle plugin source code
dev-scripts contains various scripts which may be useful for development
examples contains a script which demonstrates how to run Photon ML from the command line
gradle contains the Gradle wrapper jar
travis contains scripts for controlling Travis CI test execution

IntelliJ IDEA setup

When set up correctly, all the tests (unit and integration) can be run from IntelliJ IDEA, which is very helpful for development (IntelliJ IDEA's debugger can be used with all the tests).

Run ./gradlew idea
Open project as "New/Project from Existing Source", choose Gradle project, and set Gradle to use the local wrapper.

How to Contribute

We welcome contributions. The following are good ways to get started: reporting an issue, fixing an existing issue, or participating in a discussion. For major functionality changes, it is highly recommended to exchange thoughts and designs with reviewers beforehand. Well communicated changes will have the highest probability of getting accepted.

Reference

XianXing Zhang, Yitong Zhou, Yiming Ma, Bee-Chung Chen, Liang Zhang and Deepak Agarwal. GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016

photon-ml's People

Contributors

Stargazers

Watchers

Forkers

timyitong joshvfleming ashelkovnykov xianxing tddft liguo86 brucezhou2012 ankans yaozhang-cs mindis sergejau djeraseit desperado1992 krasnyanskiy fangzheng354 eliasah linux-devil smarthi rishabh135 mechelix mdbconsulting harishraj tspannhw intellifora yilab hangjun rezn townie codeaudit mbrukman winning1120xx jufangshen bigdata0803 knowledgehacker wtest simonclouds davidmr001 wangcrystal liqingfei tomzhang euwen is00hcw qlw jkerfs haydenliu weichenxu123 zhichao-li ljzzju convexquad liangzhou xtutran hangyao vgmartinez jq pranny himanshuarora viswamy ab212 coloratto lovehoroscoper dandanw924 nagyistge tonytongzhao ferd36 flying0er pranayhasan kioco rugby110 criteo-forks sethah jinyu0310 gengliangwang fastier-li lidi100 harikiranvuyyuru lzhang32 namitk treper zhenchuan strategist922 gth158a melody-xiaomi yourluckier jamesbconner tpnguyen fullofinspiration aprilwebster zhangkunpanda craigbrownphd gdtm86 xkzju hailingc marketingattribution maver1ck leejaywu zhouyonglong lujin77 adam-abdulhamid mirali1984 zhanglae

photon-ml's Issues

Check whether the output already exists or not before running the driver

Currently both the training and scoring driver for GAME and the training driver for Photon would automatically delete the user specified output directory if it exists already, which is equivalent to setting the force overwrite to always true for the output directory.

Such behavior is inconsistent with both Pig and Spark, as both Pig and Spark by default (e.g., http://spark.apache.org/docs/latest/configuration.html) will throw an exception if the output directory already exists.

Since most of our customers are Pig or Spark users as well, it might be a good idea to keep things consistent and have Photon throw an exception if the specified output directory already exists.

Allow better control over whether model diagnostics should be run or not

We only allow trainingDiagnostics to be set to false which disables diagnostics on the training data. In the diagnose method, getFitReport and getLambdaBootstrapmap are the places where the trainingDiagnosticsEnabled flag gets used. However, a whole lot is happening before and after that.

We have had many cases where the consumers want to turn of diagnostics all together or have had failures in those methods that users do not care about. Given this, we should have a parameter like diagnosticsEnabled which should not run the diagnose() method at all or make it a pass-through.

Adding local AUC computation to the evaluation package

Currently the only AUC computation supported in Photon/GAME is RDD based, e.g., the input scores and labels are in the RDD[(Double, Double)] format.

However, in order to enable per random effect AUC computation as in issue #41, computing AUC locally in the Array[(Double, Double)] or Seq[(Double, Double)] format will be a useful feature to have.

Decouple Avro specific functions/classes from the rest of Photon-ML

Currently Avro is tightly coupled with Photon-ML, especially through various type of drivers, as well as the use of "name-term-value" to represent the features.

In order to make Photon-ML more friendly to a wider audience, it's useful to decouple Avro specific functions/classes from the rest of Photon-ML.

We can discuss what are the possible short vs. long solutions.

Add ```equals``` method to Coefficients, FixedEffectModel and RandomEffectModel

Add equals method to Coefficients, FixedEffectModel and RandomEffectModel, in order to facilitate further integration tests for the GAME module.

GAME scoring module improvement

Currently GAME's scoring module is tightly coupled with evaluation, and the output schema of the computed scores are not user friendly, because the computed scores can't be joined back to the original data for downstream use cases.

We would like to have an improved GAME scoring module with the following features:

Evaluation will no longer be part of the scoring module, as we shouldn't assume the label/response is available for evaluation purpose for each input example to be scored.
Input data schema: compared to the Avro schema for training data, the scoring data's schema is almost identical except the following three things:
1 Each input example is expected to contain a String typed field called uid (for Avro record), which should represent the unique id of each input example to be scored
2 The response and weight field is no longer required as part of the input
3 If the offset is included as part of the input, then the final score would be offset + scoreComputedFromModel
Output score schema: will leverage ScoringResultAvro.avsc found in photon-avro-schemas.

GAME's log level is fixed to INFO, which misses DEBUG level information that is necessary for debugging.

This is a follow-up issue of #62. Recently when I was working with @TDDFT to debug some runtime issue with GAME, we realized that the DEBUG level log is gone after we switched to use the latest version of GAME on Github, and this is because the default log level is INFO for GAME.

The DEBUG level log is generally useful because we used to rely on the such log to help our users at LinkedIn identify the issues they had when running GAME.

IntegTest runtime should be reduced

The integTest takes too long for now. @TDDFT made a good point that a majority of them is coming from ObjectveFunctionIntegTest. It's actually also doubled compared to previously since we have scala 2.10, 2.11 crossbuild! We should figure out a way to cut down sampling rate and cut down number of cases to verify.

Personally, I think tests taking too long is affecting productivity.

IOUtils.writeStringsToHDFS output wrong encoding

Currently it is using DataOutputStream.writeBytes API, which doesn't handle international languages (in UTF-8 encoding) well. We should change it to use writeUTF API.

Consistent data structure for GAME model

Current GAME model has several ad-hoc representations in Photon-ML.

For example, in CoordinateDescent GAME model is represented as Map[String, Model], where the key is the coordinate id and the value is the corresponding fixed/random effect model. And in ModelProcessingUtils GAME model is represented as Iterable[Model], where each element represents a fixed/random effect model.

However, it might be helpful to have a consistent representation of GAME model throughout Photon's code base, which takes care of some of the common operations on GAME model, such as scoring GAME data set.

Log level constants in PhotonLogger should be able to be accessible within Photon modules.

Currently the way to set the log level of PhotonLogger is to retrieve the constants in org.slf4j.spi.LocationAwareLogger, such as DEBUG_INT or INFO_INT, and pass the constants to the setLogLevel function, which kind of reveals the implementation details of PhotonLogger.

Since the companion object PhotonLogger already have the log level defined with a few constants such as LogLevelDebug, LogLevelError and LogLevelInfo, it might be a better practice to call these constants defined in PhotonLogger object than calling the constants in LocationAwareLogger.

However, currently the scope of PhotonLogger object is private and nothing can be accessed outside of PhotonLogger. As a result, it's better to change the scope of the PhotonLogger object such that other modules in Photon would be able to access those constants.

Coefficients type should be flexible when loaded from Avro model file

When the model coefficients are loaded from Avro model file via the AvroUtils.loadMeanVectorFromBayesianLinearModelAvro function, the type of the returned coefficients is DenseVector.

However in some cases when the underlying coefficients has large dimension but relatively small support, they should be represented as a SparseVector.

It would be helpful if the coefficients type can be flexible when loaded from Avro model file, instead of always being a DenseVector[Double], for example, the coefficient type could be either dense vector or sparse vector based on the sparsity of the underlying array.

GAME's model selection logic is wrong

Current GAME decides the "best" model by looking at the evaluation metrics computed on the validating data with each model, then select the one with largest value.

Such logic works fine with evaluation metrics like AUC, however, for other evaluation metrics such as RMSE, larger value actually corresponds to worse model behavior.

As a result, GAME is now actually reporting the model with highest RMSE as the best model when the task type is specified to be LINEAR_REGRESSION.

This is a bug and should be fixed.

GLMSuite needs a refactoring to limit the tasks it tries to do

Different tasks GLMSuite is trying to do should be put into separate places. This class currently takes on too much responsibility.

GAME's unique Id type should be configurable

Currently both GAME's model training and data scoring logic depend on a Long typed unique Id associated with each data point (each GameDatum).

More specifically, in GAME's model training logic such unique id is generated automatically in a way similar to RDD's zipWithUniqueId function, however, unique ids generated in this way are not fault tolerant, meaning that the unique id associated with each data point may change after the it get re-computed with a different executor due to node failure.

On the other hand, in GAME's data scoring logic (see #71 for more details), such unique Ids are provided with each data point as part of the input, so potentially GAME can leverage the unique Ids generated by the user, however, in this way we may put too much burden on the user.

We don't have a very good way to address this problem, any suggestions are welcome!

GLMSuite handling of selected features and creating an index mapping has scalability issues

In GLMSuite, we do something like:

@transient var Set[String]  selectedFeatures = getSelectedFeatureSetFromFile(sc, selectedFeaturesFile)
@transient var featureKeyToIdMap: Map[String, Int] =  loadFeatureKeyToIdMap(avroRDD, selectedFeatures)

The featureKeyToIdMap is stored in main memory of the driver and also broadcast to each worker during data conversion.

Storing selectedFeatures in the driver and having featureKeyToIdMap stored in the driver and broadcast can significantly limit the number of features we can gracefully support without large container sizes.

Another way to potentially do this would be to

Scan avro data to generate an RDD that is a list of features
If selected feature file has been specified, load that into an RDD and intersect the two sets using operations on RDDs
Assign unique indices to the features, creating a feature "map" that is an RDD with feature, id tuples
Scan avro data and generate an RDD that is a list of feature, value, item tuples
Join RDD feature mapping from 3 with feature tuples from 4, emit featured, value, item tuples as an RDD.
Scan avro data and generate an item id, label tuples.
Group output of 5 by id, join with output of 6 to reconstruct the labeled dataset.

Change some of the functions/classes in GAME from public to private/protected.

Currently most of the functions/classes in GAME are with scope public. However, many of them are not supposed to be called or used outside of some certain context.

For example, the functions provided within each coordinate will not be used anywhere other than the coordinate descent algorithm.

And be more conservative on the scope of the functions might make the APIs of Photon-ML/GAME more stable.

Functions in AvroUtils and AvroIOUtils are duplicated

Some functions in AvroUtils and AvroIOUtils are duplicated, such as readAvroFiles in AvroUtils and readFromAvro in AvroIOUtils.

Enable AUC per random effect computation.

Currently the AUC evaluation metric is computed on the overall testing data set. However, it's useful to include a per-random effect level AUC computation, e.g., compute AUC at the memberId level for each member's data, and take a weighted average to obtain the final AUC.

This per random effect AUC evaluation metric is similar in spirit to the row/col oriented mean percentile rank proposed in [1] and the row/col oriented AUC found in [2], with row and col Ids being the corresponding random effect Id.

[1] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, Lars Schmidt-Thieme. BPR: Bayesian Personalized Ranking from Implicit Feedback
[2] LibMF: https://www.csie.ntu.edu.tw/~cjlin/libmf/

Model selection codes are using memory excessively

During my testing for pull request #17, I've noticed that even with off-heap map, we can successfully finish training 2M features with ease; we also finished computing all model performance metrics than we'll fail at model selection stage if data validation stage is turned on. The error message suggests that we blow off some container's memory (most likely it's driver's though I'm not entirely sure according to the message). We can still blow off memory even the driver's memory is as high as 10GB.

We should investigate why model selection codes are so hungry for memory. Theoretically, one model (a dense double vector) should only take 100-200MB even if its dimension reaches 2M.

Number of output files is not configurable when writing the random effect models to HDFS.

Currently the number of output files is not configurable in GAME when writing the random effect models to HDFS, as a result, large random effect models may lead to a large number of output files and the quota might be exceeded.

As a result, allowing the number of output files for random effect model to be configurable might a good idea.

Remove legacy intercept code

At one time, Photon added the intercept term during the job preparation stage instead of during the data preparation stage. This involved several checks on a member variable that reflected whether or not an intercept had been added instead of checking the data. Much of this code is still around and should be removed.

No data validators are available for GAME

For Photon's GLM model training, different generalized linear algorithms are equipped with different data validators in DataValidators that perform sanity checks for the input data.

At the same time, No data validators are available for GAME. It would be useful if GAME can leverage some of the existing data validators built for GLM, and add a few more validators at the feature's level, such as checking whether duplicated features exist for each input data point.

Compiled avro classes isn't showing up in integTest folder in Idea IntelliJ

We probably need to figure out what's proper configuration to include the compiled classes into IntelliJ's configuration.

Travis CI needs more rigorous and thorough setups

I quickly set up travis CI to make it work.

The configuration is located at: .travis.yml

Here is the script I used:

Comments are annotated as using "#"

# Linux is more closely to our currently known build env, 
# a TODO item is we might also want to support OS X build properly
os:
  - linux
language: java
# Must choose JDK8, our codes aren't compatible with JDK < 1.8
jdk:
  - oraclejdk8
# These are cache set ups, similar to the local .gradle cache.
# It's used to speed up builds. Travis CI's jar downloading speed seemed to be not so good.
# Also due to this reason, I replaced mavenCentral with jcenter which is said to have better 
# CDN services 
before_cache:
  - rm -f $HOME/.gradle/caches/modules-2/modules-2.lock
cache:
  directories:
    - $HOME/.gradle/caches/
    - $HOME/.gradle/wrapper/
install:
  - ./gradlew assemble
# This is to remove unused services to free memory; free version only allows 3G in total, 
# otherwise, our test process will be killed due to excessive memory usage.
#  Please see this issue: https://discuss.gradle.org/t/gradle-travis-ci/11928/2
before_script:
  - sudo service postgresql stop || true
  - sudo service mysql stop || true
  - sudo service memcached stop || true
  - sudo service bootlogd stop || true
  - sudo service elasticsearch stop || true
  - sudo service mongodb stop || true
  - sudo service neo4j stop || true
  - sudo service cassandra stop || true
  - sudo service riak stop || true
  - sudo service rsync stop || true
  - sudo service x11-common stop || true
# Still concerning here considering that integTest takes a long time, Travis CI will
# kill the container if it runs longer than 120 minutes.
script:
  - ./gradlew check

A few things worth noting:

I replaced mavenCentral with jcenter, because I've encountered jar downloading timeout issues for once. And also jcenter is said to have better CDN;
Scala cross build with 2.11 isn't working. Looks like a Gradle issue, we can also recreate it in OS X. There are some references like this: https://discuss.gradle.org/t/how-can-i-resolve-cannot-read-zip-file/4491
Supposedly this should only occur with earlier versions of Gradle, but somehow it still occurs with our cases.

Adding intercept in GAME should be configurable

Currently GAME adds an intercept term to each fixed and random effect model automatically, no matter whether the intercept term is desired or not.

It will be helpful to have a configurable parameter exposed to the user, that let the user decide whether the intercept term for each fixed/random effect model should be enabled or not.

Getting started documentation

Hi Guys - nice job.

I think the project would benefit from a simple walk through on some public data. I'm just digging into photon now and it takes a little digging to understand the flow correctly.

Default value for ```isNullOK``` in ```getStringAvro``` should not be true.

Currently the default value of isNullOK in the getStringAvro function of com.linkedin.photon.ml.util.Utils is set to true.

However, sometimes Photon's user/developer may not be aware of or overlook such default setting, which would lead to unexpected code behavior, because when the corresponding field of the key is null in the Avro record, an empty string will always be returned as the retrieved result. And such behavior is inconsistent with other get*Avro functions, where an exception will be thrown if the field is null.

As a result, we may want to set the default value of isNullOK in the getStringAvro function to false.

Numerically Unstable Pearson Correlation Computation in LocalDataSet.scala

The Pearson Correlation computation in computePearsonCorrelationScore() uses a naive, one-pass, numerically unstable algorithm.

It should be updated to use a more stable algorithm such as the one presented here: http://prod.sandia.gov/techlib/access-control.cgi/2008/086212.pdf

Bumping up tolerance parameter for GAME's per-random effect feature selection

Currently the per-random effect feature selection algorithm used in GAME is based on computing Pearson correlation score, and in particular, the one pass algorithm is adopted.

However, since the one pass algorithm is numerically unstable for certain cases, the current tolerance parameter used is 1e-12 which might not work super well together with the one pass algorithm.

We may want to bump up the tolerance parameter (to be addressed in the issue) and figure out a numerically more stable algorithm down the road (to be addressed in issue #50).

Allow the start date to be the same as end date when the user specify date range.

Currently in the DateRange class it is required that the startDate always comes strictly before endDate.

However, we still have some use cases out there that require the start date to be the same as the end date (e.g., in some weekly/monthly data used by feed and jymbii).

Naming convention in GAME: globalId to uniqueId

In GAME, a Long typed variable named as globalId is used across different places and act as the unique identifier for each data point (each GameDatum).

As a result, to better reflect what it is used for, it may be a good idea to rename globalId to uniqueId.

Link the published GLMix paper to README.md

Link the GLMix KDD paper to REAM.md as part of the documentation for GAME.

But the problem is, how do we deal with the copy right issue of this KDD paper?

Separate NameAndTerm from the drivers

In GAME, we use NameAndTerm to represent the key of the feature, and it is tightly coupled with both the training and scoring Drivers. However, using Name and Term to represent features is too LinkedIn specific, which makes it difficult to provide a walk through for GAME on some public data (e.g., #76) with formats such as LibSVM.

One proposal is to constrain the usage of NameAndTerm to the Avro I/O utility functions, for example, when (i) reading data from an Avro record and (ii) writing model as an Avro record with NameAndTerm format. And after stage (i) or before stage (ii), Drivers should be agnostic about the upstream/downstream formats of data and model, respectively.

Optimizer is logging huge but unnecessary optimizer state info

Current in Optimizer.scala, the optimize function is logging the current state info, which contains the full coefficients and gradients vector. However, no one would really look into each coefficient/gradient even for debugging purpose. At the same time, summary level logs for the learned coefficients are already provided at the model level.

When the size of the coefficients and gradients is large, logging the current state info not only produces huge amount of logs, but also potentially slows down the overall optimization process.

Fix all Scala style warnings

In order to enable the Scala style check, we need to fix all the existing warnings issues reported by ./gradlew scalaStyle first. Since most of the Scala style warnings are non-trivial to fix, several PRs will be pipelined to address them group by group.

"afterMethod" in TestTemplateWithTmpDir doesn't clean the created tmp directories.

In trait TestTemplateWithTmpDir, the getTmpDir method would create a tmp directory with the current time in milliseconds as part of the directory's name.

As a result, when getTmpDir is called in the afterMethod method, it would create a new directory which has a different name compared to the directories created previously, which are supposed to be cleaned. Consequently, those previously created tmp directories won't be cleaned after the afterMethod is called.

Photon - GAME Optimization Problem Synthesis

Photon and GAME currently exist as distinct entities, aside from a few shared classes (Optimizer, DiffFunction, and descendants of these two), even though there is overlap in the tasks they perform. The area of most overlap is between the GeneralizedLinearAlgorithm and OptimizationProblem classes.

The Photon / GAME Optimization Design Proposal in the design wiki proposes a refactor of the existing code to replace GAME's OptimizationProblem interface with Photon's GeneralizedLinearAlgorithm interface, and build atop it.

Such an interface hierarchy will more accurately reflect the relationship between the two: GAME is an extension of Photon, Photon is GAME with only a single fixed effect model.

Add input argument to overwrite output files

Current logic will fail the driver if the output directory/file exists. This is slightly different from Photon behavior in the past, and causes some issues.
We should have an input argument to let user choose to overwrite the output files if they want. This affect photon driver, game training/scoring driver and name-term container driver.

Output best GAME model even when 'save-models-to-hdfs' is false

Currently when 'save-models-to-hdfs' is false, no models are output at all. The correct behaviour should be to always attempt to output the best model, and instead have 'save-models-to-hdfs' affect whether all models are output or not.

GAME might also want to leverage IndexMap interfaces

So that we could bring in offheap map seemlessly.

Add MatrixFactorizationModel to com.linkedin.photon.ml.model

As a first step of #58, the MatrixFactorizationModel should be added to com.linkedin.photon.ml.model.

Save MatrixFactorizationModel to HDFS as Avro files

This issue is a sub-task of #58.

We may need this issue resolved before adding the actual learning algorithm for MF because that, this functionality can be used to save the MF model learned from org.apache.spark.ml.recommendation.ALS to HDFS in Avro format. And such learned MF model can then be used in other downstream scenarios, for example, when the row and column latent factors of the MF model are used as features in a response prediction model.

Add Matrix Factorization to GAME

Matrix factorization fits naturally to the GAME framework, and adding matrix factorization to GAME is a promising direction to model the interaction among the effects in the latent space.

Since adding matrix factorization to GAME may require some non-trivial amount of work, I'd like to break down this issue to several smaller tasks/issues and address them sequentially with corresponding PRs.

Refactor and add unit tests for AbstractOptimizer

Current there is minimal test coverage for AbstractOptimizer, which acts as one core piece of Photon's optimization. As a result, I think increasing the test coverage for AbstractOptimizer is necessary.

At the same time, AbstractOptimizer has too many side-effect related operations and global states, which might cause some of it's components difficult to test, as a result, some refactoring work might be needed as well.

Add code coverage tag other than build CI

Maybe consider coveralls? Looks like its plugin is a modification of JaCoCo optimized for envs like Travis CI with less memory consumption.

http://blog.eluder.org/2013/06/code-coverage-for-github-hosted-java-projects-with-coveralls

Rename a9a.t file into something else

The file is located here:
photon-ml/src/integTest/resources/DriverIntegTest/input/a9a.t

Looks like Github is mistakenly thinking this is a Perl file, making our project tagged as a Perl project.

Change the random effect id type from String to Long.

Currently GAME uses String to represent each (random) effect id.

Although such choice is convenient for parsing the raw data, as no prior knowledge or restriction on the type of random effect id is required, caching a String based data structure onto memory/disk sometimes is too expensive, especially for applications/models where the effect ids contribute to a large portion of the memory usage such as matrix factorization (each data point is a triplet of rowEffectId, colEffectId and response/label).

As a result, changing the random effect id type from String to Long will be helpful by reducing the amount of memory/disk space required to cache the data greatly (~60 bytes for String to 8 bytes for Long).

One possible solution might be to build an effect index map during the data pre-processing stage, which maps the String typed raw effect ids to a Long based one, similar to what we did for mapping the features index from String to Int. In this way, data structure such as GameDatum will only need to deal with Long typed effect ids. Such refactoring will also keep the core code clean and independent of the type of the effect id provided in the raw data by the user.

Naming convention in GAME: randomEffectId, individualId and randomEffectType

Currently in GAME we name the type of each random effect as randomEffectId, which confusing with the random effect's id given a specific type, which is named as individualId or sometimes just id.

For example, the random effect type could be "memberId" or "itemId", and the random effect id for each random effect type could be "m123" for random effect of type "memberId", and "i123" for random effect of type "itemId".

As a result, we should make the naming convention in GAME clearer, one proposal is to rename the type of each random effect from randomEffectId to randomEffectType, and the id of a random effect of a certain type from individualId or sometimes id to randomEffectId.

Allow a unique Id to be part of the schema of the scoring job's output

Currently the schema of the output of the scoring job for GAME only contains the random effect ids and the computed score, which might not be sufficient for the users to join the computed scores back to the original data set, in order for some other downstream usage of the scores.

If the original data set has unique Id for each data point that can be used for the join purpose, then the scores computed from the scoring job should also keep that unique Id as part of the output schema.