
data-code's People

Contributors

derobertis, github-actions[bot], nickderobertis, todo-actions[bot]


data-code's Issues

when loading from existing source, handle when indices do not match

Currently the code assumes the existing source and the loaded source share the same index.
Code needs to be added to reconcile the indices. But if the mismatch is due to a desired
aggregation, how should the user select which aggregation to apply?
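One possible shape for this, sketched with pandas: let the caller pass an aggregation name, collapse loaded rows that share a label in the existing index, then reindex to match it exactly. The `align_to_index` helper and its `agg` parameter are hypothetical illustrations, not part of datacode's API.

```python
import pandas as pd

def align_to_index(loaded: pd.DataFrame, target_index: pd.Index,
                   agg: str = "mean") -> pd.DataFrame:
    # Collapse rows sharing an index label using the user-selected
    # aggregation, then reindex so the result matches the existing source
    grouped = loaded.groupby(level=0).agg(agg)
    return grouped.reindex(target_index)

loaded = pd.DataFrame(
    {"sale": [1.0, 3.0, 5.0]},
    index=pd.Index(["a", "a", "b"], name="key"),
)
target = pd.Index(["a", "b", "c"], name="key")
aligned = align_to_index(loaded, target, agg="mean")
# "a" aggregates to 2.0, "b" stays 5.0, "c" becomes NaN
```

Labels present only in the existing source come back as NaN rather than raising, which leaves the "which aggregation?" question as the only decision the user must make.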


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/loader.py:68. It will automatically be closed when the TODO comment is removed from the default branch (master).

more efficient implementation of loading variables for calculations

The DataLoader checks which variables are needed for calculations but not
included in load_variables, and if multiple transformations of a variable are
required, it copies that series once per transformation. A better implementation
would avoid carrying those copies through everything.
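One way to avoid the up-front copies, as a minimal sketch (the `apply_transforms` generator and the transform names are hypothetical, not datacode's internals): derive each transformed series lazily from the single shared base, since pandas operations already return new objects.

```python
import pandas as pd

def apply_transforms(base: pd.Series, transforms):
    # Each transform works from the one shared base series; pandas ops
    # return new objects, so no pre-copy per transformation is needed
    for name, func in transforms:
        yield name, func(base)

sale = pd.Series([100.0, 110.0, 121.0], name="sale")
results = dict(apply_transforms(sale, [
    ("sale_lag", lambda s: s.shift(1)),
    ("sale_growth", lambda s: s.pct_change()),
]))
```

The base series is never mutated or duplicated ahead of time; each consumer materializes only the transformed series it actually needs.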


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/loader.py:203. It will automatically be closed when the TODO comment is removed from the default branch (master).

[Security] Workflow docs.yml is using vulnerable action peaceiris/actions-gh-pages

The workflow docs.yml references the action peaceiris/actions-gh-pages at v2.5.0. That reference is missing commit d2178821cb5968f5b7c818210297f3dbeea3114c, which may contain a vulnerability fix.
The missing fix could be related to:
(1) a CVE fix
(2) an upgrade of a vulnerable dependency
(3) a fix for a secret leak, among others.
Please consider updating the reference to the action.

eliminate repeated from_str methods in dtypes

Currently the same from_str method is duplicated across the subclasses because they have
different __init__ methods. Only int and float have distinct from_str methods, and those
two are identical to each other. Create a mixin or intermediate class to eliminate the
repeated code.
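A minimal sketch of the mixin approach (`FromStrMixin`, `IntType`, and `FloatType` are illustrative stand-ins, not the actual classes in datacode/models/dtypes):

```python
class FromStrMixin:
    # One shared from_str entry point; subclasses only differ
    # in their __init__ details, which cls() dispatches to
    @classmethod
    def from_str(cls, s: str) -> "FromStrMixin":
        return cls()

class IntType(FromStrMixin):
    names = ("int", "int64")

class FloatType(FromStrMixin):
    names = ("float", "float64")

dtype = IntType.from_str("int64")
```

Because `cls` refers to whichever subclass the classmethod is invoked on, each subclass gets the right constructor without restating from_str.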


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/dtypes/base.py:43. It will automatically be closed when the TODO comment is removed from the default branch (master).

more efficient last_modified

last_modified is calculated frequently and traverses the
entire pipeline each time. Caching the result of the
calculation would give a significant speed-up, especially
in DataExplorer.graph. The cache needs to be invalidated
whenever data sources or operations change, and somehow
also when the OS modified time of a file changes (fs events?).
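The memoize-with-explicit-invalidation part could look like the following sketch (`CachedPipeline`, `FakeSource`, and `invalidate_last_modified` are hypothetical names; wiring invalidation to fs events is left out):

```python
from datetime import datetime

class CachedPipeline:
    def __init__(self, sources):
        self.sources = sources  # each source exposes .last_modified
        self._cached_lm = None

    @property
    def last_modified(self) -> datetime:
        if self._cached_lm is None:
            # The full traversal happens only on a cache miss
            self._cached_lm = max(s.last_modified for s in self.sources)
        return self._cached_lm

    def invalidate_last_modified(self):
        # Call whenever sources/operations change or an fs event fires
        self._cached_lm = None

class FakeSource:
    def __init__(self, lm):
        self.last_modified = lm

pipe = CachedPipeline([FakeSource(datetime(2020, 1, 1)),
                       FakeSource(datetime(2021, 6, 1))])
first = pipe.last_modified
pipe.sources.append(FakeSource(datetime(2022, 1, 1)))
stale = pipe.last_modified        # still the cached value
pipe.invalidate_last_modified()
fresh = pipe.last_modified        # re-traverses, picks up the new source
```

The hard part the issue raises remains: making every mutation path call the invalidation hook, since a missed call silently returns stale timestamps.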


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/pipeline/base.py:229. It will automatically be closed when the TODO comment is removed from the default branch (master).

refactor auto-merge workflow once GitHub Actions improves

Entire jobs are being copied between workflow files due to limitations in GitHub Actions.
The only difference between these jobs is that they check out master instead of requiring master.

Possible changes to GitHub Actions that would allow the auto-merge workflow to be refactored:

  • reuse jobs
  • reuse steps
  • trigger workflow from within action/workflow
  • commit triggered by action triggers push event

This issue has been automatically created by todo-actions based on a TODO comment found in .github/workflows/automerge.yml:89. It will automatically be closed when the TODO comment is removed from the default branch (master).

decouple get Compustat from data paths before getting these tests working

import numpy
import pandas as pd
from pandas import Timestamp
from pandas.testing import assert_frame_equal

import datacode

# DataFrameTest is the shared test base class defined elsewhere in this suite

class TestLoadAndMergeCompustat(DataFrameTest):

    def test_freq_a(self):
        expect_df = pd.DataFrame(data=[
            ('001076', Timestamp('1995-03-01 00:00:00'), Timestamp('1994-03-31 00:00:00'),
             185.18400000000003, 112.70299999999999),
            ('001076', Timestamp('1995-04-01 00:00:00'), Timestamp('1995-03-31 00:00:00'),
             228.892, 113.575),
            ('001722', Timestamp('2012-01-01 00:00:00'), Timestamp('2011-06-30 00:00:00'),
             80676.0, 1247.0),
            ('001722', Timestamp('2012-07-01 00:00:00'), Timestamp('2012-06-30 00:00:00'),
             89038.0, 1477.0),
            ('001722', numpy.timedelta64('NaT', 'ns'), numpy.timedelta64('NaT', 'ns'),
             numpy.timedelta64('NaT', 'ns'), numpy.timedelta64('NaT', 'ns')),
            (numpy.datetime64('NaT'), numpy.datetime64('2012-01-01T00:00:00.000000000'), numpy.datetime64('NaT'),
             numpy.datetime64('NaT'), numpy.datetime64('NaT')),
        ], columns=['GVKEY', 'Date', 'datadate', 'sale', 'capx'])

        c_str = datacode.load_and_merge_compustat(self.df_gvkey_str, get=['sale', 'capx'], freq='a',
                                                  gvkeyvar='GVKEY', debug=True)

        c_num = datacode.load_and_merge_compustat(self.df_gvkey_num, get=['sale', 'capx'], freq='a',
                                                  gvkeyvar='GVKEY', debug=True)

        assert_frame_equal(expect_df, c_str, check_dtype=False)
        assert_frame_equal(expect_df, c_num, check_dtype=False)

    def test_freq_q(self):
        expect_df = pd.DataFrame(data=[
            ('001076', Timestamp('1995-03-01 00:00:00'), Timestamp('1994-12-31 00:00:00'),
             56.511, 21.96799999999999),
            ('001076', Timestamp('1995-04-01 00:00:00'), Timestamp('1995-03-31 00:00:00'),
             59.551, 29.421000000000006),
            ('001722', Timestamp('2012-01-01 00:00:00'), Timestamp('2011-12-31 00:00:00'),
             23306.0, 409.0),
            ('001722', Timestamp('2012-07-01 00:00:00'), Timestamp('2012-06-30 00:00:00'),
             22675.0, 284.0),
            ('001722', numpy.timedelta64('NaT', 'ns'), numpy.timedelta64('NaT', 'ns'),
             numpy.timedelta64('NaT', 'ns'), numpy.timedelta64('NaT', 'ns')),
            (numpy.datetime64('NaT'), numpy.datetime64('2012-01-01T00:00:00.000000000'), numpy.datetime64('NaT'),
             numpy.datetime64('NaT'), numpy.datetime64('NaT')),
        ], columns=['GVKEY', 'Date', 'datadate', 'saleq', 'capxq'])

        c_str = datacode.load_and_merge_compustat(self.df_gvkey_str, get=['sale', 'capx'], freq='q',
                                                  gvkeyvar='GVKEY', debug=True)

        c_num = datacode.load_and_merge_compustat(self.df_gvkey_num, get=['sale', 'capx'], freq='q',
                                                  gvkeyvar='GVKEY', debug=True)

        assert_frame_equal(expect_df, c_str, check_dtype=False)
        assert_frame_equal(expect_df, c_num, check_dtype=False)

This issue has been automatically created by todo-actions based on a TODO comment found in tests/test_data.py:493. It will automatically be closed when the TODO comment is removed from the default branch (master).

better tests for graph

Currently the tests just check that the graphs can be generated with no errors.
They should also check the contents of the graphs. Also see TestCreateSource.test_graph.


This issue has been automatically created by todo-actions based on a TODO comment found in tests/pipeline/test_data_merge.py:96. It will automatically be closed when the TODO comment is removed from the default branch (master).

don't trigger extra columns when the extra columns are just the untransformed columns

Extra columns are added here for calculated variables which require variables not
included in load_variables. Currently, extra variables will be loaded even when
the calculation could simply be done before variable transforms. For example, the
test TestLoadSource.test_load_with_calculate_on_transformed_before_transform should be able
to complete without adding any extra columns.


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/source.py:97. It will automatically be closed when the TODO comment is removed from the default branch (master).

better way of storing calculated columns than uuid in columns dictionary

The dictionary of columns has keys that are names in the original source and values that are columns.
A calculated column is not in the original source, so a uuid was used for now just to ensure
that these columns can live in the dictionary, but they should be tracked separately.
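One way to track them separately, as a sketch (`ColumnRegistry` and the field names are hypothetical, not datacode's actual structure):

```python
from dataclasses import dataclass, field

@dataclass
class ColumnRegistry:
    # Source-backed columns keyed by their original-source name;
    # calculated columns live in their own list, so no uuid keys needed
    source_columns: dict = field(default_factory=dict)
    calculated_columns: list = field(default_factory=list)

    def all_columns(self) -> list:
        return list(self.source_columns.values()) + self.calculated_columns

registry = ColumnRegistry(source_columns={"sale": "sale_col", "capx": "capx_col"})
registry.calculated_columns.append("profit_margin_col")
```

Callers that only care about source-backed columns keep a clean name-keyed mapping, while anything that iterates every column uses `all_columns()`.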


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/loader.py:154. It will automatically be closed when the TODO comment is removed from the default branch (master).

add tests for SEM


This issue has been automatically created by todo-actions based on a TODO comment found in tests/init.py:3. It will automatically be closed when the TODO comment is removed from the default branch (master).

more efficient DataExplorer.graph

Examining last_modified or pipeline_last_modified on
a large pipeline structure is extremely slow. The performance
of DataExplorer graphing could be improved if it first found
only the terminal pipelines and sources and used only those,
as the nested structure is included anyway.
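Finding the terminal items can be reduced to a small set operation, sketched here over hypothetical (consumer, input) edge pairs rather than datacode's real pipeline objects:

```python
def terminal_nodes(edges):
    # edges are (consumer, input) pairs; a terminal node is one that
    # no other node consumes, so graphing only terminals still pulls
    # in everything nested beneath them
    inputs = {inp for _, inp in edges}
    nodes = {n for edge in edges for n in edge}
    return nodes - inputs

edges = [
    ("analysis_pipeline", "merge_pipeline"),
    ("merge_pipeline", "source_a"),
    ("merge_pipeline", "source_b"),
]
```

Here only `analysis_pipeline` is terminal; graphing it alone would still render `merge_pipeline` and both sources through nesting.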


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/explorer.py:118. It will automatically be closed when the TODO comment is removed from the default branch (master).

decouple get CRSP from data paths before getting the get_gvkey_or_permno tests working

import pandas as pd
from numpy import nan
from pandas import Timestamp
from pandas.testing import assert_frame_equal

import datacode

# DataFrameTest is the shared test base class defined elsewhere in this suite

class TestGetGvkeyOrPermno(DataFrameTest):

    def test_get_gvkey_with_nan(self):
        expect_df = pd.DataFrame(data=[
            ('a', Timestamp('2000-01-01 00:00:00'), 10516.0, 1722),
            ('a', Timestamp('2000-01-02 00:00:00'), 10516.0, 1722),
            ('a', Timestamp('2000-01-03 00:00:00'), 10516.0, 1722),
            ('a', Timestamp('2000-01-04 00:00:00'), 10516.0, 1722),
            ('b', Timestamp('2000-01-01 00:00:00'), 10516.0, 1722),
            ('b', Timestamp('2000-01-02 00:00:00'), 10516.0, 1722),
            ('b', Timestamp('2000-01-03 00:00:00'), 10516.0, 1722),
            ('b', Timestamp('2000-01-04 00:00:00'), 10516.0, 1722),
            ('a', Timestamp('2008-01-01 00:00:00'), nan, nan),
            ('a', Timestamp('2009-01-02 00:00:00'), nan, nan),
            ('a', Timestamp('2010-01-03 00:00:00'), 78049.0, 1076),
            ('a', Timestamp('2011-01-04 00:00:00'), 10517.0, 1076),
        ], columns=['byvar', 'Date', 'PERMNO', 'GVKEY'])

        # default is: on permno, get gvkey
        ggop = datacode.get_gvkey_or_permno(self.permno_df_with_nan, datevar='Date',
                                            other_byvars='byvar')

        assert_frame_equal(expect_df, ggop)

This issue has been automatically created by todo-actions based on a TODO comment found in tests/test_data.py:141. It will automatically be closed when the TODO comment is removed from the default branch (master).

could make variable collection initialization more efficient

Currently self._set_variables_and_collections() is called before self._create_variable_map()
because the variables need their custom name attributes created first, but it is then still
called afterwards to set the variable attributes correctly. This could be reorganized.


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/variables/collection.py:41. It will automatically be closed when the TODO comment is removed from the default branch (master).

Saving calculated variables

When variables are calculated there is no corresponding column
being passed to DataSource, so it does not have a consistent load_key
for saving purposes. Passing the column results in an error because it does not
exist in the original data. It needs to be possible to pass columns which come
from calculations and should not be loaded from existing data.


This issue has been automatically created by todo-actions based on a TODO comment found in tests/pipeline/test_auto_cache.py:163. It will automatically be closed when the TODO comment is removed from the default branch (master).

Preserving variables in transform apply to source inplace not working

This code is supposed to prevent that but is not working as expected.
The original variables are still being modified. The problem occurs with both
SourceTransform.apply and Transform.apply_to_source. A test which catches this
issue has been added in test_lags_as_source_transform_with_subset, but it has been
commented out for now.
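The intended behavior can be sketched as working on deep copies so the caller's variables survive untouched; `Variable` and `apply_transform_preserving` here are simplified stand-ins for datacode's real classes:

```python
from copy import deepcopy

class Variable:
    def __init__(self, name: str):
        self.name = name

def apply_transform_preserving(variables, transform):
    # Mutate deep copies only, so the Variable objects the caller
    # (and other sources) hold are never modified in place
    copies = [deepcopy(v) for v in variables]
    for var in copies:
        transform(var)
    return copies

original = [Variable("sale")]
transformed = apply_transform_preserving(
    original, lambda v: setattr(v, "name", v.name + "_lag")
)
```

The bug described above amounts to the copy step being skipped, or the copies sharing state with the originals, somewhere along the SourceTransform.apply / Transform.apply_to_source path.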


This issue has been automatically created by todo-actions based on a TODO comment found in datacode/models/transform/source.py:52. It will automatically be closed when the TODO comment is removed from the default branch (master).
