
ipca's People

Contributors

lbybee, matbuechner, steffenwindmueller


ipca's Issues

solution not converging

Hello, I tried to feed in a bunch of Barra factors but ran into the following problems:

  1. Sometimes _numba_chol() reports that the Gamma_New matrix is not "positive definite";
  2. Sometimes the solution fails to converge and stops with a singular-matrix error.

Do you have any intuition as to why? My guess is that the Barra factors are pre-processed to be orthogonal to each other, and somehow the model cannot accommodate a larger/smaller number of latent factors?
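A minimal diagnostic sketch for this kind of failure, assuming chars is a hypothetical plain 2-D array holding just the Barra exposures (no id/date columns): rank deficiency or near-collinearity in the characteristics makes the Cholesky and solve steps in the ALS updates break down.

import numpy as np

# check whether the stacked characteristic matrix is rank deficient or
# badly conditioned before feeding it into the ALS iterations
Z = np.asarray(chars, dtype=float)
print("rank:", np.linalg.matrix_rank(Z), "of", Z.shape[1], "columns")
print("cond(Z'Z):", np.linalg.cond(Z.T @ Z / len(Z)))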

Further Changes Roadmap

I wanted to create a separate issue from #3 to discuss some potential further changes:

  1. In order to fully align with sklearn we need to distinguish the characteristics in X from the indices (time periods/assets).

    1.a Maybe the most natural way is to add an input called something like indices or groups. Then X would just be the chars, y would just be the returns, and indices would hold the time/asset labels. What I'm imagining here then is:

    fit(X=None, y=None, indices=None, ...)

    1.b This might provide a good route to direct integration with pandas. Our indices would align with the MultiIndex in pandas and we could have a method to break things out properly when given a pandas X and y with a MultiIndex (see the sketch after this list).
    1.c sklearn also has a series of "multioutput" methods which somewhat align with what we're doing here. I need to read into this more but this might be one way to go.

  2. Ultimately, we should add an IPCARegressorCV class instead of the current fit_path method. I'll take care of this once we handle the indices issue. This is also where I can add the "hot-start" approach for cross-validation (which should be faster on most machines).

  3. It might make more sense to name the main class something like InstrumentedPCA instead of IPCARegressor. Two reasons for this:

    3.a This aligns with how other packages do it (e.g. IncrementalPCA) and allows us to better distinguish which IPCA we're working with here.
    3.b Regressor doesn't seem to add much information to the name.
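A rough sketch of what 1.a/1.b could look like (a hypothetical signature, not the package's current API):

import numpy as np
import pandas as pd

def fit(self, X=None, y=None, indices=None, **kwargs):
    """Hypothetical sklearn-style fit: X holds only the characteristics,
    y the returns, and indices the (entity, time) labels."""
    if indices is None and isinstance(X, pd.DataFrame) \
            and isinstance(X.index, pd.MultiIndex):
        # break the pandas MultiIndex out into an explicit indices array
        indices = np.asarray(X.index.to_list())
        X, y = X.to_numpy(), np.asarray(y)
    ...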

Pandas error in the example

I'm not sure if this is pandas related or if the statsmodels example dataset changed, but running the example throws the following error:

     29 from ipca import IPCARegressor
     30 regr = IPCARegressor(n_factors=1, intercept=False)
---> 31 regr = regr.fit(X=X, y=y)
     32 Gamma, Factors = regr.get_factors(label_ind=True)

~/anaconda3/lib/python3.7/site-packages/ipca/ipca.py in fit(self, X, y, PSF, Gamma, Factors, data_type, **kwargs)
    170 
    171         # init data dimensions
--> 172         self = self._init_dimensions(X)
    173 
    174         # Handle pre-specified factors

~/anaconda3/lib/python3.7/site-packages/ipca/ipca.py in _init_dimensions(self, X)
    965         """
    966 
--> 967         self.dates = np.unique(X[:, 1])
    968         self.ids = np.unique(X[:, 0])
    969         self.T = np.size(self.dates, axis=0)

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2895                 )
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:
   2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(slice(None, None, None), 1)' is an invalid key

Specifically X[:, 1] throws the error.

My env:
Python 3.7.4
pandas 0.25.1
statsmodels 0.10.1
numpy 1.17.2
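Two possible workarounds, sketched under the assumption that X follows the example's layout (entity and date in the first two columns); whether the installed ipca version accepts a plain array should be verified:

import numpy as np

# NumPy-style X[:, 1] fails on a DataFrame; select positionally instead
dates = np.unique(X.iloc[:, 1])

# or convert up front, so the package's internal X[:, 1] works as intended
regr = regr.fit(X=X.to_numpy(), y=y.to_numpy())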

Matrix is singular to machine precision.

Matrix is singular to machine precision.
Could you please tell me the possible reasons for this error? I know the problem is in my code, but I really can't find the cause. Thank you very much.

Small Typo in Line 1082

Hi Bryan,

Thank you for your reply! It helped me a lot in diving deeper into this code. I think there is a typo in ipca.py concerning the parameters n_jobs and backend. On line 1082 of the ipca.py file, I think it should be n_jobs=self.n_jobs and backend=self.backend.
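Presumably the line in question is a joblib Parallel call; a hedged guess at the intended fix (some_task and args are placeholders, not names from ipca.py):

from joblib import Parallel, delayed

# use the values stored on the estimator rather than hard-coded ones
results = Parallel(n_jobs=self.n_jobs, backend=self.backend)(
    delayed(some_task)(a) for a in args)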

Besides, I wonder whether the estimate this ALS method produces is a global minimum of the objective function. If not, how should we convince ourselves that initializing \Gamma from an SVD decomposition is a good method in specific contexts?

Best,
Kendrick

Problem with data_type='portfolio' in predict function

Hi author, I tried to do out-of-sample prediction for a portfolio but realized that there is no data_type parameter in the predictOOS function, so I tried to run "Ypred = regr.predict(X=data_OOS, data_type='portfolio', mean_factor=True)". Your predict function does have a data_type parameter, but running the code above enters the elif branch where X is not None and then runs pred = self.predict_portfolio(W, L, T, mean_factor), even though the values of L and T are never set in that branch. This makes it impossible to predict the out-of-sample portfolio. My solution is to calculate the values of W, L and T myself and then run the predict_portfolio function, as sketched below.
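A sketch of that workaround, assuming data_OOS is a DataFrame with an (entity, date) MultiIndex and using the managed-portfolio definition W_t = Z_t'Z_t / N_t from Kelly, Pruitt, and Su; the exact construction the package expects should be verified against its source:

import numpy as np

dates = data_OOS.index.get_level_values(1).unique()
T = len(dates)                         # number of out-of-sample periods
L = data_OOS.shape[1]                  # number of characteristics
W = np.full((L, L, T), np.nan)
for t, d in enumerate(dates):
    Z_t = data_OOS.xs(d, level=1).to_numpy(dtype=float)
    W[:, :, t] = Z_t.T @ Z_t / len(Z_t)   # per-period managed-portfolio weights

Ypred = regr.predict_portfolio(W, L, T, mean_factor=True)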

Fitting with only PSF and intercept: redundant factor output

Hi,

Thank you for writing a great package!

I'm fitting a model with only PSFs (e.g. 4 factors) and an intercept. The returned Factors contains 6 factors, the original 4 plus another 2 constant ones, while Gamma correctly gives 5. Do you know why this is the case? Is this case currently handled here?
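For reference, a hedged sketch of the setup being described; the assumed PSF shape (n_PSF x T) and how n_factors interacts with PSF should be checked against the docstring:

import numpy as np
from ipca import InstrumentedPCA

psf = np.random.randn(4, T)                  # 4 pre-specified factors over T periods
regr = InstrumentedPCA(n_factors=4, intercept=True)
regr = regr.fit(X=X, y=y, PSF=psf)
Gamma, Factors = regr.get_factors(label_ind=True)
print(Gamma.shape, Factors.shape)            # the reported mismatch: 5 vs. 6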

LinAlgError: Matrix is singular to machine precision.

Thank you for providing this package.

When applying the fit() function I run into the following error: LinAlgError: Matrix is singular to machine precision. (I've attached the full stack trace below.)

I'm working on a large X matrix with shape (344753, 77). Both the X DataFrame and the y Series have a MultiIndex consisting of a datetime64[ns] and an int64.

X matrix (head of DataFrame):

                   yy       ret       prc      size  q10  q20  q30  q40  q50  \
firm  time                                                                     
19940 1963-07-31 -1.0 -1.000000  1.000000  1.000000 -1.0 -1.0 -1.0 -1.0 -1.0   
25160 1963-07-31 -1.0 -0.578433  0.487544 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0   
25478 1963-07-31 -1.0  1.000000 -1.000000 -0.936839 -1.0 -1.0 -1.0 -1.0 -1.0   
19940 1963-08-31 -1.0 -1.000000  1.000000  1.000000 -1.0 -1.0 -1.0 -1.0 -1.0   
25160 1963-08-31 -1.0  1.000000  0.815603 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0   

                  q60  ...  idio_vol  total_vol  std_volume  std_turn  \
firm  time             ...                                              
19940 1963-07-31 -1.0  ... -1.000000  -1.000000   -1.000000 -1.000000   
25160 1963-07-31 -1.0  ...  1.000000   1.000000   -0.742541  1.000000   
25478 1963-07-31 -1.0  ...  0.922743   0.875899    1.000000  0.396275   
19940 1963-08-31 -1.0  ... -1.000000  -1.000000   -0.916715 -1.000000   
25160 1963-08-31 -1.0  ...  0.165515   0.626392   -1.000000  0.115954   

                   lme_adj  beme_adj   pm_adj    at_adj  mm_sin  mm_cos  
firm  time                                                               
19940 1963-07-31  1.000000 -1.000000  0.29254  1.000000    -1.0    -1.0  
25160 1963-07-31 -1.000000  1.000000 -1.00000 -1.000000    -1.0    -1.0  
25478 1963-07-31 -0.319608 -0.638467  1.00000  0.438206    -1.0    -1.0  
19940 1963-08-31  1.000000 -1.000000  0.29254  1.000000    -1.0    -1.0  
25160 1963-08-31 -1.000000  1.000000 -1.00000 -1.000000    -1.0    -1.0  

[5 rows x 77 columns]

y (head of Series):

firm   date      
19940  1963-07-31   -1.000000
25160  1963-07-31    1.000000
25478  1963-07-31   -0.809350
19940  1963-08-31    1.000000
25160  1963-08-31    0.193671
Name: TARGET, dtype: float64

I use the following code to perform the IPCA:

ipca = InstrumentedPCA(n_factors=4, max_iter=20000)

ipca = ipca.fit(X=X_ipca,y=y_ipca)
gamma, factors = ipca.get_factors(label_ind=True)
factors.head()

The shapes are then recognized correctly:

The panel dimensions are:
n_samples: 3262 , L: 77 , T: 270

Afterwards I get the LinAlgError: Matrix is singular to machine precision.

I already tried increasing max_iter without any luck. Is there an explanation for this behaviour? Are there ways to circumvent it with more pre-processing or additional args?

Please let me know if you need more information. Thank you.

Here is the full stack trace:

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-110-830cb2dc72be> in <module>
      2 ipca = InstrumentedPCA(n_factors=4, max_iter=20000)
      3 
----> 4 ipca = ipca.fit(X=X_ipca,y=y_ipca)
      5 gamma, factors = ipca.get_factors(label_ind=True)
      6 

~\anaconda3\lib\site-packages\ipca\ipca.py in fit(self, X, y, indices, PSF, Gamma, Factors, data_type, label_ind, **kwargs)
    219 
    220         # Run IPCA
--> 221         Gamma, Factors = self._fit_ipca(X=X, y=y, indices=indices, Q=Q,
    222                                         W=W, val_obs=val_obs, PSF=PSF,
    223                                         Gamma=Gamma, Factors=Factors,

~\anaconda3\lib\site-packages\ipca\ipca.py in _fit_ipca(self, X, y, indices, PSF, Q, W, val_obs, Gamma, Factors, quiet, data_type, **kwargs)
   1011         while((iter <= self.max_iter) and (tol_current > self.iter_tol)):
   1012 
-> 1013             Gamma_New, Factor_New = ALS_fit(Gamma_Old, *ALS_inputs,
   1014                                             PSF=PSF, **kwargs)
   1015 

~\anaconda3\lib\site-packages\ipca\ipca.py in _ALS_fit_portfolio(self, Gamma_Old, Q, W, val_obs, PSF, **kwargs)
   1097 
   1098         # ALS Step 2
-> 1099         Gamma_New = _Gamma_fit_portfolio(F_New, Q, W, val_obs, PSF, L, K,
   1100                                          Ktilde, T)
   1101 

~\anaconda3\lib\site-packages\ipca\ipca.py in _Gamma_fit_portfolio(F_New, Q, W, val_obs, PSF, L, K, Ktilde, T)
   1518                                  * val_obs[t]
   1519 
-> 1520     Gamma_New = _numba_solve(Denom, Numer).reshape((L, Ktilde))
   1521 
   1522     return Gamma_New

~\anaconda3\lib\site-packages\numba\np\linalg.py in _inv_err_handler()
    822             assert 0   # unreachable
    823         if r > 0:
--> 824             raise np.linalg.LinAlgError(
    825                 "Matrix is singular to machine precision.")
    826 

LinAlgError: Matrix is singular to machine precision.
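Not an answer, but a diagnostic sketch worth trying on a panel like this (assuming X_ipca is the DataFrame shown above): characteristics that are constant or near-duplicates within the sample make the L x L cross-products inverted by the ALS steps singular.

import numpy as np

Z = X_ipca.to_numpy(dtype=float)
# constant columns contribute zero variance and can make Z'Z singular
constant = X_ipca.columns[Z.std(axis=0) == 0]
print("constant characteristics:", list(constant))
# near-duplicate columns show up as |correlation| close to 1 off the diagonal
corr = np.corrcoef(Z, rowvar=False)
dup = np.argwhere(np.triu(np.abs(corr) > 0.999, k=1))
print("near-duplicate pairs:",
      [(X_ipca.columns[i], X_ipca.columns[j]) for i, j in dup])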

predictOOS when mean_factor=False

Hi,

First, thank you for providing this great package!
I have been playing around with the example code and noticed that the length of the array returned by predictOOS does not match the length of the data when mean_factor is set to False.

One can use the Grunfeld data to see this:

regr = InstrumentedPCA(n_factors=1, intercept=True)
regr = regr.fit(X=data_IS, y=y_IS)
Ypred1 = regr.predictOOS(X=data_OOS, y=y_OOS, mean_factor=True)
Ypred2 = regr.predictOOS(X=data_OOS, y=y_OOS, mean_factor=False)

print(y_OOS, "\n")
print(Ypred1, "\n")
print(Ypred2, "\n")

This is what the output looks like:

firm  year
11    1954    1486.700
14    1954     459.300
10    1954     189.600
8     1954     172.490
7     1954      81.430
13    1954     135.720
15    1954      89.510
16    1954      68.600
12    1954      49.340
9     1954       5.120
6     1954       6.281
Name: invest, dtype: float64 

[780.89716506 291.46947239 380.59479685 101.20025869  65.84867622
 126.53074081  36.65394197 160.14977461  72.50646239   7.91635846
   8.04336863] 

[1244.4912375] 

Apparently the method only returns a prediction for the first entity-time pair.

Not converging solution and LinAlgError: Matrix is singular to machine precision

I encountered these two issues when trying to apply the IPCA model to stock market index options, following the pattern in Büchner and Kelly (2022). To be specific, I construct 15 features as in the paper. However, when fitting the IPCA model, it never converges when the number of factors is larger than 1. In fact, the aggregate update tends to grow at every step and explode in these multi-factor situations. For reference, my feature dataset looks like this:
[screenshot of the feature DataFrame omitted]
The last eight columns are Greeks interacted with a 0-1 variable indicating put options, and the features are rescaled to [-0.5, 0.5].

Next, as I was fitting the IPCA model with characteristic-managed portfolios, not only did it fail to converge, but an error was also raised: LinAlgError: Matrix is singular to machine precision.
Looking into the source code where this error is raised, I found that a singular matrix was being passed into the function _Ft_fit_portfolio during the ALS process.
Traceback (most recent call last):

File "C:\code.py", line 345, in
analyzer.run()

File "C:\code.py", line 298, in run
rgsr = self.IPCA(K, False, 'portfolio', quiet=False)

File "C:\code.py", line 235, in IPCA
rgsr = rgsr.fit(X=self.features, y=self.label, data_type=datatype, **kwargs)

File "D:\ana\lib\site-packages\ipca\ipca.py", line 221, in fit
Gamma, Factors = self._fit_ipca(X=X, y=y, indices=indices, Q=Q,

File "D:\ana\lib\site-packages\ipca\ipca.py", line 1013, in _fit_ipca
Gamma_New, Factor_New = ALS_fit(Gamma_Old, *ALS_inputs,

File "D:\ana\lib\site-packages\ipca\ipca.py", line 1074, in _ALS_fit_portfolio
F_New[:,t] = _Ft_fit_portfolio(Gamma_Old, W[:,:,t],

File "D:\ana\lib\site-packages\ipca\ipca.py", line 1449, in _Ft_fit_portfolio
return np.squeeze(_numba_solve(m1, m2.reshape((-1, 1))))

File "D:\ana\lib\site-packages\numba\np\linalg.py", line 899, in _inv_err_handler
raise np.linalg.LinAlgError(

LinAlgError: Matrix is singular to machine precision.

Combining these two issues, I think something is wrong numerically (or statistically) with my feature data, but I have no clue where exactly it goes wrong or how to fix it so my code works. Additionally, I have seen an earlier issue about the same LinAlgError, but I guess that was a different situation from mine, since none of my columns has all entries equal to 0.

Please tell me how I can fix this issue; any hint or suggestion is welcome as well.
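One check that targets the _Ft_fit_portfolio failure directly, sketched under the assumption that features is a DataFrame with an (id, date) MultiIndex (mirroring self.features in the traceback): interacted Greeks can be collinear within a single date even when no column is all zeros, which makes the per-period system singular.

import numpy as np

dates = features.index.get_level_values(1).unique()
for d in dates:
    Z_t = features.xs(d, level=1).to_numpy(dtype=float)
    W_t = Z_t.T @ Z_t / len(Z_t)          # per-period characteristic cross-product
    if np.linalg.cond(W_t) > 1e12:        # effectively singular at machine precision
        print(d, "rank", np.linalg.matrix_rank(W_t), "of", W_t.shape[0])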

Matrix calculation accelerations on GPU

I noticed we use Numba here to accelerate matrix calculations. Since Q and W could potentially be large matrices (consider 500+ characteristics and 500+ points in time), have we considered using CUDA in the code to enable GPU acceleration?
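One low-friction route would be CuPy as a drop-in for the NumPy solves; a sketch (CuPy is an assumption here, not something the package currently uses, and Denom/Numer mirror the names in the _Gamma_fit_portfolio traceback above):

import cupy as cp

# move the per-period linear system to the GPU and solve it there
Gamma_New = cp.asnumpy(cp.linalg.solve(cp.asarray(Denom), cp.asarray(Numer)))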

Testing the Gamma assumption

Hello,

I don't know if this is the right place for my question. I'm currently writing my master's thesis about IPCA and I want to test whether the assumption that the matrix Gamma is constant over time holds. Does anyone have a clue how this could be tested? Is it reasonable to compare the model with a constant Gamma to a model with a time-varying Gamma? Or would you say that there are only theoretical arguments for and against this assumption?

Thank you in advance.
Simon
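One empirical (rather than formal) check would be to refit on subsamples and measure how much the estimated Gamma moves. A sketch, assuming X and y are the usual multi-indexed panel and noting that Gamma is only identified up to rotation, so column spaces should be compared rather than raw entries (first_years and second_years are user-chosen period sets):

import numpy as np
from ipca import InstrumentedPCA

def gamma_on(years):
    mask = X.index.get_level_values(1).isin(years)
    m = InstrumentedPCA(n_factors=3).fit(X=X[mask], y=y[mask])
    G, _ = m.get_factors(label_ind=True)
    return np.asarray(G, dtype=float)

G1, G2 = gamma_on(first_years), gamma_on(second_years)
Q1, _ = np.linalg.qr(G1)
Q2, _ = np.linalg.qr(G2)
cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)   # cos of principal angles
print(np.degrees(np.arccos(np.clip(cosines, -1, 1))))  # near 0 => stable Gamma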

How can we verify whether the solution of IPCA is correct?

Hi guys,

Thanks for your kind sharing of this awesome package. However, when I checked the test_ipca.py file, it seems the tests only check whether the functions run, not whether their output is correct.

I tried to generate Gamma and Factors myself and then compared them with the Gamma solved by this package, and the two matrices are not identical. Is there any method for me to check the correctness of this package? Looking forward to your reply.

Best.
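One thing worth noting: IPCA's Gamma is only identified up to an invertible rotation, so an element-wise comparison with your simulated truth can fail even when the estimate is fine; a rotation-invariant check compares the spanned subspaces instead. A sketch (Gamma_true and Gamma_hat are your simulated and estimated matrices):

import numpy as np

def subspace_distance(G_true, G_est):
    Q1, _ = np.linalg.qr(np.asarray(G_true, dtype=float))
    Q2, _ = np.linalg.qr(np.asarray(G_est, dtype=float))
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)     # cosines of principal angles
    return float(np.sqrt(max(0.0, 1 - np.min(s) ** 2)))  # sin of largest angle

print(subspace_distance(Gamma_true, Gamma_hat))        # ~0 when the spans agree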

Bootstrap Runtime

@matbuechner, I'm doing some work with @AllenHu95 using the IPCA code here and had a couple of questions I was hoping you could help with, as I'm fiddling around with the backend. In particular, we're doing some work with the bootstrap testing procedure and are hoping to improve the runtime somewhat.

  1. Do you have a good sense of whether 10e-6 is the "right" threshold for convergence? Since we have to refit IPCA for each bootstrap sample, I'm considering using a less conservative threshold but wanted to check whether you foresee any issues there.

  2. Would you be open to a PR that allows users to modify the fit params at run-time (as opposed to using IPCA class values)? What I'm imagining here is adding some additional keywords to _fit_ipca for iter_tol and max_iter (which could be passed through the bootstrap tests), as in the sketch after this list. In cases where these keywords weren't provided we'd default to the class values.

    My use-case here would be a more conservative threshold for the main IPCA run, and then a lighter one for the bootstrap sample runs. Obviously, there are work-arounds (could set the class value after fitting the main run), which might be preferable, so I wanted to get your input.

  3. It looks like most of the work is done in the alternating-least squares component. Off-hand, do you think there are any opportunities for performance improvement there?

    I'll note that when we run each bootstrap iteration sequentially the run-time per iteration is better (about 5 minutes), vs. in parallel (< 1.2 hours). I'm guessing this is because something clever that numpy/numba/etc. are doing gets screwed up when joblib is layered on top, but I'm not familiar with all the details. Let me know if you see an obvious solution based on this result.
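A sketch of what 2 could look like (hypothetical keyword handling; the real _fit_ipca takes more arguments, as the tracebacks elsewhere on this page show):

def _fit_ipca(self, X=None, y=None, indices=None, iter_tol=None,
              max_iter=None, **kwargs):
    # hypothetical run-time overrides: fall back to the class values
    # when no keyword is supplied
    iter_tol = self.iter_tol if iter_tol is None else iter_tol
    max_iter = self.max_iter if max_iter is None else max_iter
    ...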

Thanks, and let me know if there are any other details I can provide.

-Leland
