
data-code's People

Contributors

derobertis, github-actions[bot], nickderobertis, todo-actions[bot]


data-code's Issues

when loading from existing source, handle when indices do not match

Currently the code assumes the existing source and the loaded source share the same index.
Code needs to be added to reconcile the indices. But if the mismatch is due to a desired
aggregation, how should the user select which aggregation to apply?
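One possible shape for this, sketched with pandas: let the caller pass an aggregation name, collapse loaded rows that share a label in the existing index, then reindex to match it exactly. The `align_to_index` helper and its `agg` parameter are hypothetical illustrations, not part of datacode's API.

```python
import pandas as pd

def align_to_index(loaded: pd.DataFrame, target_index: pd.Index,
                   agg: str = "mean") -> pd.DataFrame:
    # Collapse rows sharing an index label using the user-selected
    # aggregation, then reindex so the result matches the existing source
    grouped = loaded.groupby(level=0).agg(agg)
    return grouped.reindex(target_index)

loaded = pd.DataFrame(
    {"sale": [1.0, 3.0, 5.0]},
    index=pd.Index(["a", "a", "b"], name="key"),
)
target = pd.Index(["a", "b", "c"], name="key")
aligned = align_to_index(loaded, target, agg="mean")
# "a" aggregates to 2.0, "b" stays 5.0, "c" becomes NaN
```

Labels present only in the existing source come back as NaN rather than raising, which leaves the "which aggregation?" question as the only decision the user must make.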


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/loader.py:68. It will automatically be closed when the TODO comment is removed from the default branch (master).

more efficient implementation of loading variables for calculations

The DataLoader checks which variables are needed for calculations but not
included in load_variables, and if multiple transformations of a variable are
required, it copies that series once per transformation. A better implementation
would avoid carrying those copies through everything.
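One way to avoid the up-front copies, as a minimal sketch (the `apply_transforms` generator and the transform names are hypothetical, not datacode's internals): derive each transformed series lazily from the single shared base, since pandas operations already return new objects.

```python
import pandas as pd

def apply_transforms(base: pd.Series, transforms):
    # Each transform works from the one shared base series; pandas ops
    # return new objects, so no pre-copy per transformation is needed
    for name, func in transforms:
        yield name, func(base)

sale = pd.Series([100.0, 110.0, 121.0], name="sale")
results = dict(apply_transforms(sale, [
    ("sale_lag", lambda s: s.shift(1)),
    ("sale_growth", lambda s: s.pct_change()),
]))
```

The base series is never mutated or duplicated ahead of time; each consumer materializes only the transformed series it actually needs.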


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/loader.py:203. It will automatically be closed when the TODO comment is removed from the default branch (master).

[Security] Workflow docs.yml is using vulnerable action peaceiris/actions-gh-pages

The workflow docs.yml references the action peaceiris/actions-gh-pages at v2.5.0. That reference is missing commit d2178821cb5968f5b7c818210297f3dbeea3114c, which may contain a vulnerability fix.
The missing fix could be related to:
(1) a CVE fix
(2) an upgrade of a vulnerable dependency
(3) a fix for a secret leak, among others.
Please consider updating the reference to the action.

eliminate repeated from_str methods in dtypes

Currently the same from_str method is duplicated across the subclasses because they have
different __init__ methods. Only int and float have distinct from_str methods, and those
two are identical to each other. Create a mixin or intermediate class to eliminate the
repeated code.
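A minimal sketch of the mixin approach (`FromStrMixin`, `IntType`, and `FloatType` are illustrative stand-ins, not the actual classes in datacode/models/dtypes):

```python
class FromStrMixin:
    # One shared from_str entry point; subclasses only differ
    # in their __init__ details, which cls() dispatches to
    @classmethod
    def from_str(cls, s: str) -> "FromStrMixin":
        return cls()

class IntType(FromStrMixin):
    names = ("int", "int64")

class FloatType(FromStrMixin):
    names = ("float", "float64")

dtype = IntType.from_str("int64")
```

Because `cls` refers to whichever subclass the classmethod is invoked on, each subclass gets the right constructor without restating from_str.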


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/dtypes/base.py:43. It will automatically be closed when the TODO comment is removed from the default branch (master).

more efficient last_modified

last_modified is calculated frequently and traverses the
entire pipeline each time. Caching the result of the
calculation would give a significant speed-up, especially
in DataExplorer.graph. The cache needs to be invalidated
whenever data sources or operations change, and somehow
also when the OS modified time of a file changes (fs events?).
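The memoize-with-explicit-invalidation part could look like the following sketch (`CachedPipeline`, `FakeSource`, and `invalidate_last_modified` are hypothetical names; wiring invalidation to fs events is left out):

```python
from datetime import datetime

class CachedPipeline:
    def __init__(self, sources):
        self.sources = sources  # each source exposes .last_modified
        self._cached_lm = None

    @property
    def last_modified(self) -> datetime:
        if self._cached_lm is None:
            # The full traversal happens only on a cache miss
            self._cached_lm = max(s.last_modified for s in self.sources)
        return self._cached_lm

    def invalidate_last_modified(self):
        # Call whenever sources/operations change or an fs event fires
        self._cached_lm = None

class FakeSource:
    def __init__(self, lm):
        self.last_modified = lm

pipe = CachedPipeline([FakeSource(datetime(2020, 1, 1)),
                       FakeSource(datetime(2021, 6, 1))])
first = pipe.last_modified
pipe.sources.append(FakeSource(datetime(2022, 1, 1)))
stale = pipe.last_modified        # still the cached value
pipe.invalidate_last_modified()
fresh = pipe.last_modified        # re-traverses, picks up the new source
```

The hard part the issue raises remains: making every mutation path call the invalidation hook, since a missed call silently returns stale timestamps.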


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/pipeline/base.py:229. It will automatically be closed when the TODO comment is removed from the default branch (master).

refactor auto-merge workflow once GitHub Actions improves

Entire jobs are being copied between workflow files due to limitations in GitHub Actions.
The only difference between these jobs is that they check out master instead of requiring master.

Possible changes to GitHub Actions that would allow the auto-merge workflow to be refactored:

  • reuse jobs
  • reuse steps
  • trigger workflow from within action/workflow
  • commit triggered by action triggers push event

This issue has been automatically created by todo-actions based on a TODO comment found in .github/workflows/automerge.yml:89. It will automatically be closed when the TODO comment is removed from the default branch (master).

decouple get Compustat from data paths before getting these tests working

import numpy
import pandas as pd
from pandas import Timestamp
from pandas.testing import assert_frame_equal

import datacode

# DataFrameTest is the shared test base class defined elsewhere in this suite

class TestLoadAndMergeCompustat(DataFrameTest):

    def test_freq_a(self):
        expect_df = pd.DataFrame(data=[
            ('001076', Timestamp('1995-03-01 00:00:00'), Timestamp('1994-03-31 00:00:00'),
             185.18400000000003, 112.70299999999999),
            ('001076', Timestamp('1995-04-01 00:00:00'), Timestamp('1995-03-31 00:00:00'),
             228.892, 113.575),
            ('001722', Timestamp('2012-01-01 00:00:00'), Timestamp('2011-06-30 00:00:00'),
             80676.0, 1247.0),
            ('001722', Timestamp('2012-07-01 00:00:00'), Timestamp('2012-06-30 00:00:00'),
             89038.0, 1477.0),
            ('001722', numpy.timedelta64('NaT', 'ns'), numpy.timedelta64('NaT', 'ns'),
             numpy.timedelta64('NaT', 'ns'), numpy.timedelta64('NaT', 'ns')),
            (numpy.datetime64('NaT'), numpy.datetime64('2012-01-01T00:00:00.000000000'), numpy.datetime64('NaT'),
             numpy.datetime64('NaT'), numpy.datetime64('NaT')),
        ], columns=['GVKEY', 'Date', 'datadate', 'sale', 'capx'])

        c_str = datacode.load_and_merge_compustat(self.df_gvkey_str, get=['sale', 'capx'], freq='a',
                                                  gvkeyvar='GVKEY', debug=True)

        c_num = datacode.load_and_merge_compustat(self.df_gvkey_num, get=['sale', 'capx'], freq='a',
                                                  gvkeyvar='GVKEY', debug=True)

        assert_frame_equal(expect_df, c_str, check_dtype=False)
        assert_frame_equal(expect_df, c_num, check_dtype=False)

    def test_freq_q(self):
        expect_df = pd.DataFrame(data=[
            ('001076', Timestamp('1995-03-01 00:00:00'), Timestamp('1994-12-31 00:00:00'),
             56.511, 21.96799999999999),
            ('001076', Timestamp('1995-04-01 00:00:00'), Timestamp('1995-03-31 00:00:00'),
             59.551, 29.421000000000006),
            ('001722', Timestamp('2012-01-01 00:00:00'), Timestamp('2011-12-31 00:00:00'),
             23306.0, 409.0),
            ('001722', Timestamp('2012-07-01 00:00:00'), Timestamp('2012-06-30 00:00:00'),
             22675.0, 284.0),
            ('001722', numpy.timedelta64('NaT', 'ns'), numpy.timedelta64('NaT', 'ns'),
             numpy.timedelta64('NaT', 'ns'), numpy.timedelta64('NaT', 'ns')),
            (numpy.datetime64('NaT'), numpy.datetime64('2012-01-01T00:00:00.000000000'), numpy.datetime64('NaT'),
             numpy.datetime64('NaT'), numpy.datetime64('NaT')),
        ], columns=['GVKEY', 'Date', 'datadate', 'saleq', 'capxq'])

        c_str = datacode.load_and_merge_compustat(self.df_gvkey_str, get=['sale', 'capx'], freq='q',
                                                  gvkeyvar='GVKEY', debug=True)

        c_num = datacode.load_and_merge_compustat(self.df_gvkey_num, get=['sale', 'capx'], freq='q',
                                                  gvkeyvar='GVKEY', debug=True)

        assert_frame_equal(expect_df, c_str, check_dtype=False)
        assert_frame_equal(expect_df, c_num, check_dtype=False)

This issue has been automatically created by todo-actions based on a TODO comment found in tests/test_data.py:493. It will automatically be closed when the TODO comment is removed from the default branch (master).

better tests for graph

Currently the tests just check that the graphs can be generated with no errors.
They should also check the contents of the graphs. Also see TestCreateSource.test_graph.


This issue has been automatically created by todo-actions based on a TODO comment found in tests/pipeline/test_data_merge.py:96. It will automatically be closed when the TODO comment is removed from the default branch (master).

don't trigger extra columns when the extra columns are just the untransformed columns

Extra columns are added here for calculated variables which require variables not
included in load_variables. Currently, extra variables will be loaded even when
the calculation could simply be done before variable transforms. For example, the
test TestLoadSource.test_load_with_calculate_on_transformed_before_transform should be able
to complete without adding any extra columns.


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/source.py:97. It will automatically be closed when the TODO comment is removed from the default branch (master).

better way of storing calculated columns than uuid in columns dictionary

The dictionary of columns has keys that are names in the original source and values that are columns.
A calculated column is not in the original source, so a uuid was used for now just to ensure
that these columns can live in the dictionary, but they should be tracked separately.
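One way to track them separately, as a sketch (`ColumnRegistry` and the field names are hypothetical, not datacode's actual structure):

```python
from dataclasses import dataclass, field

@dataclass
class ColumnRegistry:
    # Source-backed columns keyed by their original-source name;
    # calculated columns live in their own list, so no uuid keys needed
    source_columns: dict = field(default_factory=dict)
    calculated_columns: list = field(default_factory=list)

    def all_columns(self) -> list:
        return list(self.source_columns.values()) + self.calculated_columns

registry = ColumnRegistry(source_columns={"sale": "sale_col", "capx": "capx_col"})
registry.calculated_columns.append("profit_margin_col")
```

Callers that only care about source-backed columns keep a clean name-keyed mapping, while anything that iterates every column uses `all_columns()`.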


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/loader.py:154. It will automatically be closed when the TODO comment is removed from the default branch (master).

add tests for SEM


This issue has been automatically created by todo-actions based on a TODO comment found in tests/init.py:3. It will automatically be closed when the TODO comment is removed from the default branch (master).

more efficient DataExplorer.graph

Examining last_modified or pipeline_last_modified on
a large pipeline structure is extremely slow. The performance
of DataExplorer graphing could be improved if it first found
only the terminal pipelines and sources and used only those,
as the nested structure is included anyway.
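Finding the terminal items can be reduced to a small set operation, sketched here over hypothetical (consumer, input) edge pairs rather than datacode's real pipeline objects:

```python
def terminal_nodes(edges):
    # edges are (consumer, input) pairs; a terminal node is one that
    # no other node consumes, so graphing only terminals still pulls
    # in everything nested beneath them
    inputs = {inp for _, inp in edges}
    nodes = {n for edge in edges for n in edge}
    return nodes - inputs

edges = [
    ("analysis_pipeline", "merge_pipeline"),
    ("merge_pipeline", "source_a"),
    ("merge_pipeline", "source_b"),
]
```

Here only `analysis_pipeline` is terminal; graphing it alone would still render `merge_pipeline` and both sources through nesting.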


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/explorer.py:118. It will automatically be closed when the TODO comment is removed from the default branch (master).

decouple get CRSP from data paths before getting the get_gvkey_or_permno tests working

import pandas as pd
from numpy import nan
from pandas import Timestamp
from pandas.testing import assert_frame_equal

import datacode

# DataFrameTest is the shared test base class defined elsewhere in this suite

class TestGetGvkeyOrPermno(DataFrameTest):

    def test_get_gvkey_with_nan(self):
        expect_df = pd.DataFrame(data=[
            ('a', Timestamp('2000-01-01 00:00:00'), 10516.0, 1722),
            ('a', Timestamp('2000-01-02 00:00:00'), 10516.0, 1722),
            ('a', Timestamp('2000-01-03 00:00:00'), 10516.0, 1722),
            ('a', Timestamp('2000-01-04 00:00:00'), 10516.0, 1722),
            ('b', Timestamp('2000-01-01 00:00:00'), 10516.0, 1722),
            ('b', Timestamp('2000-01-02 00:00:00'), 10516.0, 1722),
            ('b', Timestamp('2000-01-03 00:00:00'), 10516.0, 1722),
            ('b', Timestamp('2000-01-04 00:00:00'), 10516.0, 1722),
            ('a', Timestamp('2008-01-01 00:00:00'), nan, nan),
            ('a', Timestamp('2009-01-02 00:00:00'), nan, nan),
            ('a', Timestamp('2010-01-03 00:00:00'), 78049.0, 1076),
            ('a', Timestamp('2011-01-04 00:00:00'), 10517.0, 1076),
        ], columns=['byvar', 'Date', 'PERMNO', 'GVKEY'])

        # default is: on permno, get gvkey
        ggop = datacode.get_gvkey_or_permno(self.permno_df_with_nan, datevar='Date',
                                            other_byvars='byvar')

        assert_frame_equal(expect_df, ggop)

This issue has been automatically created by todo-actions based on a TODO comment found in tests/test_data.py:141. It will automatically be closed when the TODO comment is removed from the default branch (master).

could make variable collection initialization more efficient

Currently self._set_variables_and_collections() is called before self._create_variable_map()
because the variables need their custom name attributes created first, but it is then still
called afterwards to set the variable attributes correctly. This could be reorganized.


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/variables/collection.py:41. It will automatically be closed when the TODO comment is removed from the default branch (master).

Saving calculated variables

When variables are calculated there is no corresponding column
being passed to DataSource, so it does not have a consistent load_key
for saving purposes. Passing the column results in an error because it does not
exist in the original data. It needs to be possible to pass columns which come
from calculations and should not be loaded from existing data.


This issue has been automatically created by todo-actions based on a TODO comment found in tests/pipeline/test_auto_cache.py:163. It will automatically be closed when the TODO comment is removed from the default branch (master).

Preserving variables in transform apply to source inplace not working

This code is supposed to prevent that but is not working as expected.
The original variables are still being modified. The problem occurs with both
SourceTransform.apply and Transform.apply_to_source. A test which catches this
issue has been added in test_lags_as_source_transform_with_subset, but it has been
commented out for now.
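The intended behavior can be sketched as working on deep copies so the caller's variables survive untouched; `Variable` and `apply_transform_preserving` here are simplified stand-ins for datacode's real classes:

```python
from copy import deepcopy

class Variable:
    def __init__(self, name: str):
        self.name = name

def apply_transform_preserving(variables, transform):
    # Mutate deep copies only, so the Variable objects the caller
    # (and other sources) hold are never modified in place
    copies = [deepcopy(v) for v in variables]
    for var in copies:
        transform(var)
    return copies

original = [Variable("sale")]
transformed = apply_transform_preserving(
    original, lambda v: setattr(v, "name", v.name + "_lag")
)
```

The bug described above amounts to the copy step being skipped, or the copies sharing state with the originals, somewhere along the SourceTransform.apply / Transform.apply_to_source path.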


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/transform/source.py:52. It will automatically be closed when the TODO comment is removed from the default branch (master).
