mini-kep / parser-rosstat-kep

Provide clean macroeconomic time series as CSV files at stable URL

pandas-dataframe csv economics statistics time-series-database russian-specific macroeconomics time-series-analysis time-series

parser-rosstat-kep's Issues

sync local 'doc' and 'data' folders with AWS S3

frontend.md

var descriptions from cfg.py
add sections
check all variables from spec have descriptions
generate frontend README.md with sparklines using mako templates
rename current root README to DEV.md
add github.io version

documentation: unclear role of modules.rst

I currently have an issue with the modules.rst file in the documentation. When the docs are built I get this message:

\mini-kep\doc\rst\modules.rst:: WARNING: document isn't included in any toctree

I don't understand why modules.rst is needed at all. Is this a default file to insert into index.rst?

@aptiko, are you familiar with this? @dz0 ?

Imports and usage from src/config.py file

1. config.py `DataFolder` used in:

src/word2csv/word.py

def make_interim_csv(year, month):
    from config import DataFolder
    folder = DataFolder(year, month).get_raw_folder()
    interim_csv = DataFolder(year, month).get_interim_csv()
    folder_to_csv(folder, interim_csv)
    return True

2. config.py `DateHelper` used in:

src/finaliser.py

from config import PathHelper, DateHelper

...

def copy_latest():
    year, month = DateHelper.get_latest_date()
    print(f'Latest date is {year} {month}.')
    src_folder = PathHelper.get_processed_folder(year, month)

src/csv2df/parser.py (under if __main__):

    from config import PathHelper, DateHelper
    ...
    year, month = DateHelper.get_latest_date()
    csv_path = PathHelper.locate_csv(year, month)

src/csv2df/reader.py (under if __main__):

    from config import PathHelper, DateHelper
    ...
    year, month = DateHelper.get_latest_date()
    csv_path = PathHelper.locate_csv(year, month)

src/csv2df/runner.py

from config import DateHelper, PathHelper

...

class Collection:
    """Methods to manipulate entire set of data releases."""

    all_dates = DateHelper.get_supported_dates()

    @staticmethod
    def save_latest():
        year, month = DateHelper.get_latest_date()
        latest_vintage = Vintage(year, month)
        latest_vintage.save()

    @staticmethod
    def approve_latest():
        """Quick check for algorithm on latest available data."""
        year, month = DateHelper.get_latest_date()
        latest_vintage = Vintage(year, month)
        latest_vintage.validate()

3. config.py `PathHelper` used in:

src/finaliser.py

from config import PathHelper, DateHelper

...

def copy_latest():
    year, month = DateHelper.get_latest_date()
    print(f'Latest date is {year} {month}.')
    src_folder = PathHelper.get_processed_folder(year, month)
    dst_folder = PathHelper.get_latest_folder()
    copied = []
    for src in [f for f in src_folder.iterdir() if f.is_file()]:
        dst = dst_folder / src.name
        shutil.copyfile(src, dst)
        copied.append(src)
        print('Copied', src)
    print(f'Updated folder {dst_folder}')
    return copied

...

def save_xls():
    ...
    filepath = PathHelper.get_xl_path()
    to_xls(filepath, dfa, dfq, dfm)
    print('Saved', filepath)

src/getter.py

from config import PathHelper

...

def get_dataframe(freq):
    path = PathHelper.get_latest_csv(freq)
    content = Path(path).read_text()
    ...

src/csv2df/parser.py (under if __main__):

    from config import PathHelper, DateHelper
    ...
    year, month = DateHelper.get_latest_date()
    csv_path = PathHelper.locate_csv(year, month)
    csvfile = reader.open_csv(csv_path)

src/csv2df/reader.py (under if __main__):

    from config import PathHelper, DateHelper
    ...
    year, month = DateHelper.get_latest_date()
    csv_path = PathHelper.locate_csv(year, month)
    csvfile = open_csv(csv_path)

src/csv2df/runner.py (mentioned in module docstring):

from config import DateHelper, PathHelper

...

class Vintage:
    """Represents dataset release for a given year and month."""

    def __init__(self, year, month):
        self.year, self.month = year, month
        csv_path = PathHelper.locate_csv(year, month)
        with open_csv(csv_path) as csvfile:
            self.dfa, self.dfq, self.dfm = get_dataframes(csvfile)

    def dfs(self):
        """Shorthand for obtaining three dataframes."""
        return self.dfa, self.dfq, self.dfm

    def save(self):
        folder_path = PathHelper.get_processed_folder(self.year, self.month)
        self.dfa.to_csv(folder_path / 'dfa.csv')
        self.dfq.to_csv(folder_path / 'dfq.csv')
        self.dfm.to_csv(folder_path / 'dfm.csv')
        print("Saved dataframes to", folder_path)
        return True

src/csv2df/tests/test_various_bugs.py

from config import PathHelper

...

def test_csv_has_no_null_byte():
    csv_path = PathHelper.locate_csv(2015, 2)
    z = csv_path.read_text(encoding='utf-8')
    assert "\0" not in z

src/download/download.py

from config import PathHelper

UNPACK_RAR_EXE = PathHelper.get_unrar_binary()

...

def unrar(path, folder, unrar=UNPACK_RAR_EXE):
    ...

class RemoteFile():

    def __init__(self, year, month):
        self.year, self.month = year, month
        ...
        folder = PathHelper.get_raw_folder(year, month)
        self.folder = str(folder)
        self.local_path = str(folder / 'ind.rar')
    ...

DataRows and Table classes

The problem with the DataRows class is that its data (self.a, self.m, self.q) is not initialized in the constructor (not even with defaults), only when parse_row is called.
The same applies to the Table class (self.a, self.q, self.m).

I do not like the DataRows class as such. All of its work boils down to one function, whose label and splitter_func are available externally in the calling code (from the Table class).
I would move its logic (the parse_row function from DataRows) to the calling side, into the Table class itself, and delete the DataRows class:
def parse_row(self, year, values)
where self.label and self.splitter_func will obviously be available automatically, and where the parsing results can be used right away:

self.a.append(ds.a)
self.q.extend(ds.q)
self.m.extend(ds.m)

The Table class works in a somewhat opaque way. Roughly speaking, it currently has to be used like this:
a) create it, b) call parse, which among other things sets splitter_func and does some extra work with the header, then
calling calc works with the configured self.splitter_func.
In other words, if you do not know this sequence of steps for working with Table, calling its calc fails with something like:
AttributeError: 'Table' object has no attribute 'splitter_func'.

In turn, the Datapoints class relies on being given "already configured tables" at creation,
in which splitter_func has been set, and simply calls the table's calc method.

Isn't that too many assumptions? We could at least hide all this complexity from the outside
by moving the get_all_tables function into the Datapoints class, so that Datapoints does all the necessary
table setup in its constructor and outside code need not worry about it.
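
A minimal sketch of the proposal, not the project's actual code: a Table that sets every attribute in its constructor and absorbs parse_row. The tuple layout of the stored values and the `split_annual_and_quarters` helper are assumptions made for illustration.

```python
class Table:
    """Hypothetical sketch of Table with DataRows.parse_row folded in."""

    def __init__(self, label=None, splitter_func=None):
        self.label = label
        self.splitter_func = splitter_func
        # Containers exist from construction, so reading them cannot raise AttributeError.
        self.a, self.q, self.m = [], [], []

    def parse_row(self, year, values):
        # Fail early with a clear message instead of an AttributeError deep in calc().
        if self.splitter_func is None:
            raise ValueError("splitter_func must be set before parsing rows")
        annual, quarterly, monthly = self.splitter_func(values)
        if annual is not None:
            self.a.append((self.label, year, annual))
        self.q.extend((self.label, year, n, v) for n, v in enumerate(quarterly, 1))
        self.m.extend((self.label, year, n, v) for n, v in enumerate(monthly, 1))


def split_annual_and_quarters(values):
    """Assumed row layout: one annual value followed by four quarterly values."""
    return values[0], values[1:5], []


table = Table("GDP", split_annual_and_quarters)
table.parse_row(1999, [4823, 901, 1102, 1373, 1447])
```

With this shape a freshly created Table is already in a valid (empty) state, and misuse is reported by an explicit ValueError.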

provide code examples for src/kep

Must show sequence of parsing commands for easier understanding of algorithm

make new frontpage after adding new month

Make tests using mock data in temp CSV file

# -*- coding: utf-8 -*-
from pathlib import Path
import cfg

csv_path = cfg.rosstat_folder / "testdata.csv"
doc = """1. Сводные показатели / Aggregated indicators					
1.1. Валовой внутренний продукт1) / Gross domestic product1)					
Объем ВВП, млрд.рублей /GDP, bln rubles					
1999	4823	901	1102	1373	1447
2000	7306	1527	1697	2038	2044
2001	8944	1901	2105	2488	2450
2002	10831	2262	2529	3013	3027
2003	13208	2851	3102	3600	3655
2004	17027	3516	3972	4594	4945
2005	21610	4459	5078	5845	6228
2006	26917	5793	6368	7276	7480
2007	33248	6780	7768	8903	9797
2008	41277	8878	10238	11542	10619
2009	38807	8335	9245	10411	10816
2010	46308	9996	10977	12086	13249
2011	59698	12844	14314	15663	16877
2012	66927	14925	16149	17442	18411
2013	71017	15892	17015	18543	19567
20142)	79200	17139	18884	20407	21515
20152)	83233	18210	19284	21294	22016
20162)	85881	18561	19979	22190	
2017					
Индекс физического объема произведенного ВВП, в % / Volume index of produced GDP, percent					
1999	106,4	98,1	103,1	111,4	112,0
2000	110,0	111,4	110,2	110,5	108,2
2001	105,1	104,7	105,0	106,0	104,5
2002	104,7	103,8	104,4	104,4	106,2
2003	107,3	107,5	107,9	106,1	107,6
2004	107,2	107,2	108,0	107,3	106,2
2005	106,4	105,6	106,0	106,0	107,8
2006	108,2	107,3	108,1	108,2	108,9
2007	108,5	108,1	108,6	108,2	109,2
2008	105,2	109,2	107,9	106,4	98,7
2009	92,2	90,8	88,8	91,4	97,4
2010	104,5	104,1	105,0	103,8	105,1
2011	104,3	103,3	103,3	105,0	105,2
2012	103,5	105,3	104,3	103,1	101,8
2013	101,3	100,6	101,1	101,2	102,1
20142)	100,7	100,6	101,1	100,9	100,2
20152)	97,2	97,2	95,5	96,3	96,2
20162)	99,8	98,8	99,4	99,6	
2017					
	Год / Year	Кварталы / Quarters	Янв. Jan.	Фев. Feb.	Март Mar.	Апр. Apr.	Май May	Июнь June	Июль July	Август Aug.	Сент. Sept.	Окт. Oct.	Нояб. Nov.	Дек. Dec.			
		I	II	III	IV												
1.2. Индекс промышленного производства1) / Industrial Production index1)																	
в % к соответствующему периоду предыдущего года / percent of corresponding period of previous year																	
2015	99,2	99,9	98,3	99,5	99,1	100,0	98,2	101,2	98,2	97,6	99,1	98,5	100,2	99,7	98,4	101,0	98,1
2016	101,3	101,1	101,5	101,0	101,7	99,2	103,8	100,3	101,0	101,5	102,0	101,4	101,5	100,1	101,6	103,4	100,2
2017
1.9. Внешнеторговый оборот – всего1), млрд.долларов США / Foreign trade turnover – total1), bln US dollars																	
1999	115,1	24,4	27,2	28,4	35,1	7,2	7,9	9,3	9,8	8,0	9,3	9,5	9,3	9,6	10,4	11,1	13,7
в % к соответствующему периоду предыдущего года / percent of corresponding period of previous year																	
1999	86,9	68,1	75,5	87,0	125,3	63,5	68,3	72,1	80,5	68,6	77,0	78,4	83,7	102,2	117,9	127,4	129,8
в % к предыдущему периоду / percent of previous period																	
1999		87,0	111,5	104,3	123,9	68,1	109,5	118,0	105,7	81,5	116,7	101,8	97,4	103,2	108,2	106,9	123,9
в том числе:																	
экспорт товаров – всего, млрд.долларов США																	
/ of which: export of goods – total, bln US dollars																	
1999	75,6	15,3	17,1	18,9	24,3	4,5	4,9	5,8	6,6	5,1	5,4	6,3	6,2	6,4	7,0	7,6	9,7
в % к соответствующему периоду предыдущего года / percent of corresponding period of previous year																	
1999	101,5	85,1	91,6	98,5	130,1	79,0	86,8	89,0	106,4	84,9	83,8	94,4	99,8	101,5	119,0	132,3	137,5
в % к предыдущему периоду / percent of previous period																	
1999		81,7	111,9	110,5	128,8	63,7	109,5	118,3	112,4	78,3	105,1	116,4	98,2	104,4	108,4	109,0	127,9
импорт товаров – всего, млрд.долларов США																	
/ import of goods – total, bln US dollars																	
1999	39,5	9,1	10,1	9,5	10,8	2,7	3,0	3,5	3,3	2,9	4,0	3,2	3,1	3,1	3,4	3,5	4,0
в % к соответствующему периоду предыдущего года / percent of corresponding period of previous year																	
1999	68,1	51,1	58,1	70,6	115,8	47,8	50,3	54,7	54,2	51,0	69,3	58,9	63,4	103,5	115,7	117,7	114,3
в % к предыдущему периоду / percent of previous period																	
1999		97,4	110,9	93,8	114,2	77,0	109,7	117,6	94,5	87,8	137,3	81,9	96,0	100,9	107,7	102,5	115,3
1.9.1. Внешнеторговый оборот со странами дальнего зарубежья – всего, млрд.долларов США / Foreign trade turnover with far abroad countries – total, bln US dollars
2.1.1. Доходы (по данным Федерального казначейства)1) / Revenues (data of the Federal Treasury)1)												
Консолидированный бюджет, млрд.рублей / Consolidated budget, bln rubles												
20142)	26766,1	1726,3	3579,8	5960,4	8498,3	10572,3	12671,2	15108,2	17143,4	19221,4	21563,5	23439,4
________________________ 1) Данные по консолидированному бюджету за 2005г. и, начиная с I полугодия 2006г., приведены с учетом бюджетов государственных внебюджетных фондов. / 2005 data and data starting 1st half year of 2006 on consolidated budget are given taking into account budgets of public non-budget funds. 2) Начиная с I квартала 2014г. данные об исполнении бюджета приведены с учетом сведений по Республике Крым и г.Севастополю. / Since the second half of 2014 the budget execution data are prepared using data of the Crimea and city of Sevastopol. 3) Оперативные данные. / Short-term data.												
Федеральный бюджет, млрд.рублей / Federal budget, bln rubles												
2014	14496,9	1326,7	2368,6	3521,4	4754,3	5882,6	7120,9	8255,7	9439,6	10698,3	11891,6	12951,4
Консолидированные бюджеты субъектов Российской Федерации, млрд.рублей / Consolidated budgets of constituent entities of the Russian Federation, bln rubles												
2014	8905,7	295,6	863,1	1790,6	2840,5	3493,1	4052,1	5067,5	5704,7	6325,4	7246,1	7834,5
2.1.2. Расходы (по данным Федерального казначейства)1) / Expenditures (data of the Federal Treasury)1)												
Консолидированный бюджет, млрд.рублей / Consolidated budget, bln rubles												
2016	30888,82)	1095,5	3348,6	6339,1	9029,5	11106,6	13582,9	15784,1	18101,9	20493,6	22875,3	25444,1
Федеральный бюджет, млрд.рублей / Federal budget, bln rubles												
2014	14831,6	761,2	2261,5	3345,7	4626,2	5406,4	6402,1	7516,5	8467,4	9529,0	10713,9	11639,2
	Год Year	Янв. Jan.	Янв-фев. Jan-Feb	I квартал Q1	Янв-апр. Jan-Apr	Янв-май Jan-May	I полугод. 1st half year	Янв-июль Jan-Jul	Янв-авг. Jan-Aug	Янв-cент. Jan-Sent	Янв-окт. Jan-Oct	Янв-нояб. Jan-Nov
Консолидированные бюджеты субъектов Российской Федерации, млрд.рублей / Consolidated budgets of constituent entities of the Russian Federation, bln rubles												
2014	9353,3	405,8	1010,4	1683,2	2501,7	3192,2	3961,9	4758,5	5421,7	6150,4	7023,1	7769,5
	Год Year	Янв. Jan.	Янв-фев. Jan-Feb	I квартал Q1	Янв-апр. Jan-Apr	Янв-май Jan-May	I полугод. 1st half year	Янв-июль Jan-Jul	Янв-авг. Jan-Aug	Янв-cент. Jan-Sent	Янв-окт. Jan-Oct	Янв-нояб. Jan-Nov
"2.1.3. Превышение доходов над расходами /профицит /, расходов над доходами /дефицит "" - ""/ (по данным Федерального казначейства) / Surplus of revenues over expenditures /proficit/, surplus of expenditures over revenues /deficit ‘-‘/ (data of the Federal Treasure)"												
Федеральный бюджет, млрд.рублей / Federal budget, bln rubles												
2014	-334,7	565,5	107,0	175,7	128,1	476,2	718,8	739,1	972,3	1169,3	1177,7	1312,2
Консолидированные бюджеты субъектов Российской Федерации, млрд.рублей / Consolidated budgets of constituent entities of the Russian Federation, bln rubles												
2014	-447,6	-110,2	-147,4	107,4	338,8	300,8	90,2	308,9	283,0	175,0	222,9	65,0
	Год Year	Янв. Jan.	Янв-фев. Jan-Feb	I квартал Q1	Янв-апр. Jan-Apr	Янв-май Jan-May	I полугод. 1st half year	Янв-июль Jan-Jul	Янв-авг. Jan-Aug	Янв-cент. Jan-Sent	Янв-окт. Jan-Oct	Янв-нояб. Jan-Nov
2.2. Сальдированный финансовый результат1) по видам экономической деятельности, млн.рублей / Balanced financial result by economic activity, mln rubles												
"""
   
Path(csv_path).write_text(doc, encoding=cfg.ENC)
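
A test built on such a fixture does not have to write into the repository's data folder. The sketch below writes a shortened fixture to a temporary folder and reads it back; `read_rows` is a hypothetical stand-in for the project's csv reader, and tab-delimited UTF-8 is an assumption about the file format.

```python
import csv
import tempfile
from pathlib import Path

# Shortened version of the fixture text above (tab-separated cells).
DOC = (
    "Объем ВВП, млрд.рублей /GDP, bln rubles\t\t\t\t\t\n"
    "1999\t4823\t901\t1102\t1373\t1447\n"
    "2000\t7306\t1527\t1697\t2038\t2044\n"
)


def read_rows(path, encoding="utf-8"):
    """Hypothetical reader: return csv rows as lists of strings."""
    with Path(path).open(encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter="\t"))


def rows_from_temp_csv(doc):
    """Write *doc* to a temp file, read it back, and clean up the folder."""
    with tempfile.TemporaryDirectory() as folder:
        csv_path = Path(folder) / "testdata.csv"
        csv_path.write_text(doc, encoding="utf-8")
        return read_rows(csv_path)


rows = rows_from_temp_csv(DOC)
```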

download rar file from Rosstat

... from https://github.com/epogrebnyak/data-rosstat-kep

in 'src' add 'data' and 'features' folders + update paths in run files and travis_ci

run files:

cfg.py
vint.py
other?

Refactor cell.filter_value()

cell.filter_value() was built historically on a break-then-fix basis. To improve it we need a stream of all values, including those from tables that are not parsed yet. That will surface values like "93,3,", which are currently not caught.
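
As an illustration only, here is one way a stricter filter could look. This is not the existing cell.filter_value(); the pattern assumes a decimal comma or dot, an optional one-digit footnote mark such as "2)" glued to the value, and an optional stray trailing separator.

```python
import re

# Assumed cell shape: number, optional footnote mark like "2)", optional stray "," or ".".
_VALUE = re.compile(r"(-?\d+(?:[.,]\d+?)?)(?:\d\))?[.,]?")


def filter_value(text):
    """Return the numeric part of a cell as float, or None if the cell is not a number."""
    match = _VALUE.fullmatch(text.strip())
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


samples = ["93,3,", "30888,82)", "4823", "-334,7", ""]
parsed = [filter_value(s) for s in samples]
```

Running such a function over every cell of every table, parsed or not, and logging the cells that return None would give the stream of problem values the issue asks for.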

frontend: Broken cyrillic characters in src/frontend outputs

file: mini-kep/src/frontpage/_latest.md
displays broken characters:

get_dfa, get_dfq, get_dfm methods refactoring

The Frame class has methods get_dfa, get_dfq, get_dfm.

This grew out of a comment in the code:

TO DISCUSS: get_df*() creates data every time we call them,
may create self.dfa, etc and return.

The proposal is to create each dataframe on demand once and return the already created instance afterwards.
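
A sketch of the create-once idea, assuming nothing about the real Frame internals: `_build` is a placeholder for the expensive dataframe construction, and `build_count` exists only to make the caching visible.

```python
class Frame:
    """Hypothetical stand-in for the Frame class discussed above."""

    def __init__(self, datapoints):
        self.datapoints = datapoints
        self._cache = {}
        self.build_count = 0  # illustration only

    def _build(self, freq):
        # Placeholder for the real (expensive) dataframe construction.
        self.build_count += 1
        return [p for p in self.datapoints if p["freq"] == freq]

    def _get(self, freq):
        # Build on first request, then return the same instance.
        if freq not in self._cache:
            self._cache[freq] = self._build(freq)
        return self._cache[freq]

    def get_dfa(self):
        return self._get("a")

    def get_dfq(self):
        return self._get("q")

    def get_dfm(self):
        return self._get("m")


frame = Frame([{"freq": "a", "value": 4823}, {"freq": "q", "value": 901}])
first = frame.get_dfa()
second = frame.get_dfa()
```

One consequence to keep in mind: callers now share one object, so code that mutates a returned dataframe would affect later callers.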

validate methods (related to Specification and Definition classes) questions

This concerns the validate method in the Definition and Specification classes.
Open questions for discussion:

How are the validate methods in Specification and Definition supposed to be used?

How should we react to invalid data: ignore it and emit a warning, fail, or something else?

Maybe this logic should be moved outside, because otherwise we end up in roughly the same situation we had with the Table class (where objects can be in an invalid state).

We could introduce so-called Validators that implement all of this logic, for both Specifications and Definitions.

In addition, obtaining the Specification and Definitions could be wrapped in such a validator, which would output only valid objects and report the problems it found.

This is just an idea.
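
To make the idea concrete, a validator could be a plain function that separates valid objects from problem reports. In this sketch definitions are plain dicts and the required keys are "start" and "end"; both are assumptions, not the project's data model.

```python
def validate_definitions(definitions, required_keys=("start", "end")):
    """Return (valid definitions, list of human-readable problem reports)."""
    valid, problems = [], []
    for index, definition in enumerate(definitions):
        missing = [key for key in required_keys if not definition.get(key)]
        if missing:
            problems.append(f"definition {index}: missing {', '.join(missing)}")
        else:
            valid.append(definition)
    return valid, problems


valid, problems = validate_definitions(
    [{"start": "1.1.", "end": "1.2."}, {"start": "", "end": "2.1."}]
)
```

The caller then decides what to do with `problems` (warn, fail, or log), which keeps that policy out of the Definition and Specification classes.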

automate file exchange from AWS S3 using python boto3

better date handler

csv2df.helpers.DATES should be replaced with a get_supported_dates(), which:

starts at DATES[0]
continues to the current year and month
excludes the one date that is excluded in DATES
very, very loosely follows this schedule, e.g. on any date in August we consider July an allowed date.

Need a test for this and code for get_supported_dates() on a separate branch
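
A possible shape for this function, as a sketch under the rules listed above. The signature (explicit `start`, `excluded`, and an injectable `today` for testing) is an assumption, as is treating "the month before today" as the last allowed month.

```python
from datetime import date


def get_supported_dates(start, excluded=(), today=None):
    """Return (year, month) pairs from *start* up to the month before *today*."""
    today = today or date.today()
    # Loose release schedule: on any day in August, July is the last allowed month.
    end_year, end_month = today.year, today.month - 1
    if end_month == 0:
        end_year, end_month = end_year - 1, 12
    result = []
    year, month = start
    while (year, month) <= (end_year, end_month):
        if (year, month) not in excluded:
            result.append((year, month))
        year, month = (year, month + 1) if month < 12 else (year + 1, 1)
    return result


dates = get_supported_dates((2016, 11), excluded={(2017, 1)}, today=date(2017, 8, 15))
```

Passing `today` explicitly is what makes the function testable without patching the clock.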

REST API for mini-kep

Will put a user case here for the API, to work on a solution. Sorry this takes long to describe.

Style and readability comments

in the pop_segment method there is no need to work with a copy of csv_dicts; you can work directly with self.csv_dicts

    def pop_segment(self, start, end):
        """Pops elements of self.csv_dicts between [start, end). 
           Recognises element occurences by index *i*."""           
        #remaining_csv_dicts = self.csv_dicts.copy()
        we_are_in_segment = False
        segment = []
        i = 0
        while i < len(self.csv_dicts):
            row = self.csv_dicts[i]
            line = row['head']
            if is_matched(start, line):
                we_are_in_segment = True
            if is_matched(end, line):
                break
            if we_are_in_segment:
                segment.append(row)
                del self.csv_dicts[i]
            else:    
                # else is very important, wrong indexing without it
                i += 1
        #self.csv_dicts = remaining_csv_dicts
        return segment

in the calc method the year (obtained via get_year) is needed only if the condition len(row['data'])>0 holds,
so the call can be moved down into that branch.

Also in calc, instead of:
if len(row['data'])>0:
the idiomatic form is:
if row['data']:

In the Table class I would rename the fields:
instead of

self.textrows = textrows
self.datarows = datarows

I would use

self.headers = headers
self.datarows = datarows

especially since the class is instantiated with exactly headers and datarows:

yield Table(headers, datarows)

To my mind this would be clearer.

in fix_multitable_units

def fix_multitable_units(blocks):
    """For those blocks which do not have parameter definition,
       but do not have any unknown rows, copy parameter 
       from previous block
    """
    for prev_block, block in zip(blocks, blocks[1:]):
        if block.header.unknown_lines() == 0 and \
           block.header.varname is None:
           block.header.varname = prev_block.header.varname

it is better to swap the conditions in the if branch, because the is None check is much faster than the unknown_lines() call,
which does a whole lot of work; that is, I suggest:
if block.header.varname is None and block.header.unknown_lines() == 0:

maybe I am missing something, but is the last line of get_all_tables really needed:
return [t for t in all_tables if t.label and t.label not in exclude]
?
We could return all_tables directly, since they are already filtered: in the current implementation
filtering happens on each iteration of "for pdef in spec[1:]:", where get_tables is called and does the filtering.
So by the end of get_all_tables the all_tables list already holds filtered data, and the last line
return [t for t in all_tables if t.label and t.label not in exclude] is redundant.

make new fixtures in csv2df tests

This review is similar to #38.

transform GOV_ACCUM_* variables to levels

Validate 'markers' order in cfg.spec is same as in reference file

The algorithm may break if markers (start and end lines) are not in order, to check for that:
- must read csv file rows, use a reference csv file (use files.get_latest())
- use first pair in start and end markers
- make sure the order of markers in specification is the same order as in csv file

Need addition to test_cfg.py or validation method in cfg.spec:
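
A sketch of such a check, assuming each marker is a dict with 'start' and 'end' strings and that a marker matches a csv row by prefix; both the data shape and the function names are hypothetical.

```python
def first_index(lines, text):
    """Index of the first line starting with *text*; ValueError if there is none."""
    for i, line in enumerate(lines):
        if line.startswith(text):
            return i
    raise ValueError(f"Marker not found: {text!r}")


def markers_in_order(markers, lines):
    """True if start/end markers occur in *lines* in the order given in *markers*."""
    positions = []
    for marker in markers:
        positions.append(first_index(lines, marker["start"]))
        positions.append(first_index(lines, marker["end"]))
    # The end line of one segment may be the start line of the next, so equal positions are fine.
    return positions == sorted(positions)


lines = ["1.1. GDP", "1999", "1.2. Industrial production", "2015", "1.9. Trade", "1999"]
good = [{"start": "1.1.", "end": "1.2."}, {"start": "1.2.", "end": "1.9."}]
bad = list(reversed(good))
```

In test_cfg.py the `lines` argument would come from the first column of the reference csv file.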

add Excel output

Move assertions from modules to test modules

I suggest moving all checks out of the cfg module to join the other cfg checks in test_cfg.py, if there are no objections.
Likewise for the tables module: move them to test_tables.py.

Undetected errors in unused methods/properties in parse.Table

In the current version of the Table class the npoints property is incorrect (this happened after changes to the class in the last few days).
Right now it is used only in __repr__. The flaw is that the class no longer has self.a, self.q, self.m (they moved to Emitter),
so the exception AttributeError: 'Table' object has no attribute 'a' will always be caught and 0 returned (see the npoints implementation).
I suggest removing the npoints property and its use in __repr__.
There is a similar problem in the __str__ method: self.textrows is not defined on the Table class:
msg = "\nTable with {} header(s) and {} datarow(s)".format(len(self.textrows),
self.nrows)
this will raise an exception

Remaining refactoring at parse.py

Some refactoring proposals given at: https://github.com/sdgupta/mini-kep/blob/master/src/kep/parse_refactored.py

Todo @sd:

mark comment with initials and start comment on new line before commented code

Misc todo (from parse.py)

TODO:

Tasks

List of notebooks

Make notebook README.MD:

vintages / revisions (make markdown based on MS Word)
datalab
inflation components
bank reserves vs oil
variations in variables
macro assumptions for stress testing

OUT OF SCOPE

read SEP + update KEP with SEP data
prior to 1999

Datadriven comment

Using incomplete dir structure
DAG for economist

    # REQUIREMENT: 
    # pick new vars by requirement for features   
    # var descriptions by section with latest data
    # detrended data
    # revisions analysis
    # dataset organisation - what we want:
        # latest dataset
        # variable by vintages
        # seasonally adjusted series
        
        
    
    # TODO-PARSING:         
    # convert more existing parsing definitions to cfg.py    
    # add more control datapoints        
    
    #FIXME: generate Excel files and plots
    #def to_excel(dfs, filename):
    #    dfa, dfq, dfm = dfs
    #    with pd.ExcelWriter(filename) as writer:
    #        dfa.to_excel(writer, sheet_name='year')
    #        dfq.to_excel(writer, sheet_name='quarter')
    #        dfm.to_excel(writer, sheet_name='month')
    #        #FIXME - write variable names to sheet, by section 
    #
    ##for file in [config.XLSX_FILE, config.XLS_FILE]:
    ##   to_excel(file)

csv2df design review with pycallgraph

There is a dependency and runtime graph for calling csv2df.runner.Vintage(2017, 5), generated by @dz0 with pycallgraph.

The graph is here ('Open in new tab' will enlarge), and to me it says a lot:

csv2df.parser is overloaded with logic: it both makes tables and parses them
the 8 calls from runner to reader correspond to the 8 segments in the definition
numbers on the upper side of the Emitter graph: "yes, we are making 3 data frames"
lower part of the Emitter graph: the numbers refer to the number of variables and datapoints generated

Pycallgraph itself seems stalled, but I think we can use this chart as a reference.

Inaccurate/misleading exception message in RowStack/pop

This concerns the lines:

elif not s or not e:
    raise ValueError("Single None as a marker cannot be processed")

in the pop method of the RowStack class.

In these cases:
1)
marker['start'] is an empty string, i.e. ""
marker['end'] is the string "abc"
2)
marker['end'] is an empty string, i.e. ""
marker['start'] is the string "abc"

we get a murky exception message claiming that a single None was given, although that is not the case (an empty string is not None).
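
One way to make the message accurate is to name the offending marker and show its actual value. The helper below is a hypothetical sketch, not the RowStack code; treating "both markers missing" as a legitimate case is an assumption.

```python
def check_markers(start, end):
    """Return True if both markers are usable, False if both are missing.

    Raises ValueError naming the marker when exactly one of them is empty or None.
    """
    if not start and not end:
        return False  # assumed to mean "no markers, use the whole file"
    for name, value in (("start", start), ("end", end)):
        if not value:
            raise ValueError(f"Marker {name!r} is empty or missing: {value!r}")
    return True


try:
    check_markers("", "abc")
except ValueError as e:
    message = str(e)
```

The `{value!r}` formatting shows `''` for an empty string and `None` for None, so the two cases are no longer confused.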

structure for src folder

config.py #Get folder locations for repo root, latest and output  
make.helper #Navigate in dates and data folder
make.download #Download and unrar
make.word2csv #make_interim_csv(folder, csvpath)
make.csv2df #...(folder, csvpath)
finaliser.py #scripts to copy year/month to latest  
getter.py # end-use API prototype
images #(folder, csvpath) 
example_parsing.py
example_access.py
legacy.frontpage.markdown

Use 2011 GDP quarterly values at 2011 prices to derive rog

http://www.gks.ru/free_doc/new_site/vvp/kv/tab6.htm

2011:

GDP_2011_Q = [13255.3, 14345.3, 15615.1, 16482.4]
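
One reading of this task, sketched under assumptions: "rog" is taken to mean the value as a percent of the previous period, computed here directly from consecutive quarterly levels. The function name is hypothetical, and the first quarter has no rog because the previous quarter is not in the list.

```python
GDP_2011_Q = [13255.3, 14345.3, 15615.1, 16482.4]


def rog(levels):
    """Percent of previous period for each element after the first."""
    return [cur / prev * 100 for prev, cur in zip(levels, levels[1:])]


gdp_rog_2011 = rog(GDP_2011_Q)  # values for Q2, Q3, Q4 of 2011
```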

Rename packages/modules in /src

After done with refactoring:

rename kep to csv2df
rename modules in src/kep to reader-parser-emitter:
- combine rows and splitter modules, make reader
- rename tables to parser
- make emitter

rst/Sphinx documentation for files module

I made some sphinx documentation for files module:

Data structure for specification which has 'default' and 'other' parsing definitions

The get_all_tables method currently hardcodes the assumption that the default definition comes first, followed by all the extra ones.
If this does not hold somewhere (data set incorrectly in cfg.py, the rule forgotten, etc.), the code will break
(though perhaps that is even good, as the error becomes noticeable earlier). Maybe cfg.py should express this explicitly somehow, with a default definition and all the other extra ones,
so that it is clear and obvious?
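
A sketch of what an explicit structure could look like. The class and field names are hypothetical; the point is only that "default" and "extras" become named attributes instead of positions in a list.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    """Hypothetical parsing definition: a name plus optional segment markers."""
    name: str
    start: str = ""
    end: str = ""


@dataclass
class Specification:
    """One default definition plus any number of extra ones."""
    default: Definition
    extras: list = field(default_factory=list)

    def add_extra(self, definition):
        self.extras.append(definition)
        return self

    def all_definitions(self):
        # The default always comes first, so callers no longer rely on spec[0] / spec[1:].
        return [self.default] + self.extras


spec = Specification(Definition("main"))
spec.add_extra(Definition("budget", start="2.1.1.", end="2.2."))
```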

better look and structure for sphinx documentation

Current documentation:

http://mini-kep-parcer-for-rosstat-kep-publication.readthedocs.io/en/latest/

Doc folder:

https://github.com/epogrebnyak/mini-kep/tree/master/doc

Script to generate docs:

ctx.run("sphinx-apidoc -efM -o doc src\csv2df *test_*")
at tasks.py

My idea of the docs is the following:

documentation page shows intro.rst and glossary.rst for the project
and links to src/packages folders:
- download
- word2csv (will rename, it is now called word)
- csv2df
- frontpage

The link is a module name, possibly followed by a short comment. Can this comment be a module docstring?

At csv2df page I want a listing of modules with their docstrings. In listing I would prefer to control order of modules listed as:
- helpers
- specification
- reader
- parser
- runner
- validator
- util_label
- util_row_splitter

Can this order be specified somewhere, e.g. in an `__all__` constant?

When clicking on module name like 'helpers' I get its documentation.

For this structure I need a new command to put in tasks.py plus manual edits of some .rst files, apparently.

Add more variable definitions

Extend variable definitions in spec.py using https://github.com/epogrebnyak/mini-kep/tree/master/reference/parsing_definitions

Fix unexpected row length warnings in splitter.py

WARNING: unexpected row length 3
WARNING: unexpected row length 11
WARNING: unexpected row length 11

which rows do they occur on?
what should we do about them?

Integrate cell.py to parser.py

in get_year the regular expression can be used ready-made, compiled in advance,
as is done for example with "_COMMENT_CATCHER = re.compile("\D*([\d.,])\s(?=\d))")"

cell.py has a get_year function and parser.py has a get_year function; we need to decide where it should live.
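
For illustration, a get_year with a module-level compiled pattern might look like the sketch below. The pattern and the name `_YEAR_CATCHER` are assumptions; it only relies on year cells starting with four digits, optionally followed by a footnote mark, as in "20142)".

```python
import re

# Compiled once at import time, not on every call.
_YEAR_CATCHER = re.compile(r"(\d{4})")


def get_year(text):
    """Return the leading four-digit year of a cell as int, or None."""
    match = _YEAR_CATCHER.match(text.strip())
    return int(match.group(1)) if match else None


years = [get_year(s) for s in ["1999", "20142)", "в том числе:"]]
```

Note that a four-digit data value would also pass, so this sketch is only meaningful for the first column of a row.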

online coverage badge

https://codecov.io/gh/epogrebnyak/mini-kep or coveralls
https://github.com/epogrebnyak/mini-kep/blob/master/.travis.yml#L11-L12

Need step-by-step procedure resulting in coverage badge in README.md.

frontend: put code for graphs in one place

There are three ways of drawing the data:

https://github.com/epogrebnyak/mini-kep/blob/master/src/frontpage/VALUES.md - splines, the code is somewhere in the frontpage folder
https://github.com/epogrebnyak/data-rosstat-kep and the code is here
https://github.com/epogrebnyak/data-lab (groups of indicators)

I do not like that these are three different renderings with code in different places. The first task is to gather the code for these charts in one place.

Test csv reading

Reading csv files and filtering csv rows are not tested. We may split some functions in order to test them. See Ned's lecture.

how do I reference a module/class/func in docstring?

Here is the code for module docstring:
https://github.com/epogrebnyak/mini-kep/blob/master/src/csv2df/helpers.py

and here is how it looks in the documentation:
http://mini-kep-parcer-for-rosstat-kep-publication.readthedocs.io/en/latest/csv2df.html#module-csv2df.helpers

The problem: function names are in bold, but there are no clickable links.
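
Sphinx turns names into links when they are written with cross-reference roles such as :func:, :class: and :mod:, provided the target is itself documented (for example via autodoc). A small sketch with illustrative function bodies:

```python
def get_supported_dates():
    """Return (year, month) pairs that can be parsed (illustrative values)."""
    return [(2017, 4), (2017, 5)]


def get_latest_date():
    """Return the last pair from :func:`get_supported_dates`.

    Objects in other modules need a dotted path, for example
    :class:`csv2df.runner.Vintage` or :mod:`csv2df.reader`.
    """
    return get_supported_dates()[-1]
```

Plain bold or double-backtick text in a docstring is rendered as-is and never becomes a link; only the role syntax does.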

document and refactor kep.spec

See list of todos in readme.

Files:

automate creation of 'processed/latest' on update

data/processed/latest - make a utility that copies files to latest on update
some code below for updating make_latest.py - see MAKE/docopt syntax for common tasks in repo if necessary

import cfg
import parse 

if __name__ == '__main__':    
    # check latest date data           
    parse.approve_latest()               
    
    # check all dates, runs slow (about 20 sec.) + may fail if dataset not complete      
    parse.approve_all()

    # save dataframes to csv 
    parse.save_all_dfs()
    
    # interim to processed data cycle: (year, month) -> 3 dataframes
    #use None, None for latest values
    year, month = 2017, 4 
    # source csv file
    csv_path = cfg.get_path_csv(year, month)  
    # break csv to tables with variable names
    tables = parse.get_all_valid_tables(csv_path)
    # emit values from tables
    dpoints = parse.Datapoints(tables)    
    # convert stream values to pandas dataframes     
    frame = parse.Frame(datapoints=dpoints)
    # save dataframes to csv files  
    processed_folder = cfg.get_processed_folder(year, month)    
    frame.save(processed_folder)       
    # end of cycle 
    
    # sample access - dataframes
    dfa = frame.get_dfa()
    dfq = frame.get_dfq()
    dfm = frame.get_dfm()

Testing suggestions

to_csv: is not tested at all. The function in its current state is hard to test because it creates a file at the
given path. Consider passing an open file to this function; then in tests you can pass a mocked file
and check what the function wrote to it.

from_csv: is not tested explicitly but it is covered through some tests that use read_csv function

read_csv: is covered but not tested explicitly probably for the same reason as to_csv

get_tables_from_rows_segment: is not tested explicitly but it is extensively used both in code and it is used in
pipeline tests so to my mind it is vital to test this function thoroughly

fix_multitable_units, check_required_labels: are not tested explicitly but are used only in
get_tables_from_rows_segment, so it won't be necessary to test them separately if that function
is tested well enough.

is_year: not tested explicitly but very straightforward and no need to test it.

split_to_tables: looks very complex but it is not tested explicitly.

month_end_day: not tested explicitly but very straightforward and no need to test it.

Row.__str__ is tested but __repr__ is not

RowStack.is_found is not covered fully (path leading to false is not covered)

RowStack.pop_segment is not tested explicitly it is covered by tests of pop method though.

Table.echo_error_table_not_valid: not tested. Better take a second look: 'echo' in the name intuitively means
printing something, but this method raises a whole bunch of errors. Is that really the expected behaviour?

Header.set_varname and Header.set_unit have complex logic but are not tested in isolation (need more tests).

DictMaker is not tested explicitly but it is straightforward class and maybe it is enough that
it is implicitly covered (m_dict method is not covered though)
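
The to_csv suggestion above can be sketched as follows. This is a hypothetical signature, not the current function: it takes an already opened text file object, so a test can hand it an in-memory io.StringIO instead of a real file. The tab delimiter is an assumption.

```python
import csv
import io


def to_csv(rows, file):
    """Write rows to an already opened text file object."""
    writer = csv.writer(file, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


buffer = io.StringIO()
to_csv([["1999", "4823"], ["2000", "7306"]], buffer)
written = buffer.getvalue()
```

Production code would call it inside `with path.open("w", encoding="utf-8", newline="") as f:`, keeping file creation outside the function under test.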



By the way, you can omit scripts from coverage using the --omit=(comma-separated list of file name patterns) parameter,
like so: coverage report -m --omit=test_*.py,tst_draft.py > coverage.output

testing: test.csv2df.reader.open_csv()

Some cleanup / exercise based on the 'dirty' test by @bakakaldsas here. Hope you don't mind me pulling this into an issue; it should help us converge on a common testing style. It extends #24 in a way.

There are two things to 'clean' in the test structure and several options on how to test.

Test structure:

test naming should be observed, as with the rest of python classes. Suggested name is Test_open_csv(). Caps for the test class name, lowercase for the rest because it tests the function open_csv. A test class name should not be all lowercase.
the part below should better go to either setup_method or a separate fixture. As we are not using it outside the class, we can settle for setup_method.

    from pathlib import Path
    class MockPath(type(Path())):
        def open(self, mode='r', buffering=-1, encoding=None, errors=None, newline=None):
            return "test"

    path_good = MockPath()
    path_bad = "This is not pathlib.Path, this is a string"

As for the test essence there are several things we can test:

is open_csv() callable
can it open a file?
can it open a file and the contents of the file is right?

(3) is a bit out of reach of open_csv(), as it must just open the file and is not necessarily responsible for the content. So let's concentrate on (1) and (2).

In brief, @bakakaldsas, you are testing (1), but the test is kind of self-fulfilling with the MockPath you wrote. From the standard library you can use unittest.mock.mock_open() in MockPath to avoid creating a custom-made open() method.

As for (2), one can also create a temp file, pass it to open_csv(), and close/delete the file in teardown. I use that in https://github.com/epogrebnyak/mini-kep/blob/master/src/tests/test_finaliser_and_getter.py (with the tempfile package to get a temp dir / temp file).

What I'm doing in further commits is changing the structure of the submitted test and providing some new versions as described above.
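
A sketch of option (2) in the suggested structure. The `open_csv` below is a stand-in, assumed to take a pathlib.Path and return an open text file; the real function in csv2df.reader may differ.

```python
import tempfile
from pathlib import Path


def open_csv(path):
    """Stand-in for csv2df.reader.open_csv (assumed behaviour)."""
    return path.open(encoding="utf-8")


class Test_open_csv:
    def setup_method(self):
        # A fresh temp folder per test keeps tests independent of repo data.
        self.folder = tempfile.TemporaryDirectory()
        self.path = Path(self.folder.name) / "tab.csv"
        self.path.write_text("1999\t4823\n", encoding="utf-8")

    def teardown_method(self):
        self.folder.cleanup()

    def test_open_csv_returns_readable_file(self):
        with open_csv(self.path) as f:
            assert f.readline().startswith("1999")
```

pytest calls setup_method and teardown_method around each test method of the class, so the temp folder is removed even when the assertion fails.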

vintage.Collection.approve_all() fails on ['UNEMPL_pct', 'WAGE_NOMINAL_rub', 'WAGE_REAL_yoy', 'WAGE_REAL_rog'] in 2009 4

# ERROR: approve all will fail on new definitions in 2009 4
Collection.approve_all()
ValueError: Missed labels: ['UNEMPL_pct', 'WAGE_NOMINAL_rub', 
'WAGE_REAL_yoy', 'WAGE_REAL_rog']

Must check corresponding tab.csv file and edit spec.py parameters for these variables.

clone local cache in example_access_data.py

Comments on the commit below:

reading and then plotting the frames requires a minimal data transformation at the pd.read_csv() level
the annual frame currently contains only the year; it could contain the end-of-year date instead, and then it would not need a special reader; this can be simplified by regenerating the data
reading from the internet every time is slow; we can borrow from other repositories a class that reads from the internet and saves a local copy, possibly in a temp folder, which speeds up reading the CSV files

TODO: check the caching system used in the oil and exchange rate repositories.

absolute values by month/qtr accumulate to qtr/year (with tolerance)
rog rates accumulate to yoy (with tolerance)
other rules for consistency checks

This is also a good point to transform GOV_ACCUM_* variables to non-accumulated values (by subtracting).
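
Both ideas can be sketched in a few lines. The function names and the 1% default tolerance are assumptions; the sample numbers are the 1999 GDP row and the 2014 federal budget revenues (year to date) from the test fixture earlier on this page.

```python
import math


def deaccumulate(values):
    """Turn year-to-date totals into per-period values by subtracting the previous total."""
    return [cur - prev for prev, cur in zip([0] + values[:-1], values)]


def is_consistent(total, parts, rel_tol=0.01):
    """True if the parts add up to the total within a relative tolerance."""
    return math.isclose(total, sum(parts), rel_tol=rel_tol)


# Annual GDP for 1999 against its four quarters.
gdp_ok = is_consistent(4823, [901, 1102, 1373, 1447])

# Jan, Jan-Feb, Q1 accumulated revenues -> monthly values.
monthly = deaccumulate([1326.7, 2368.6, 3521.4])
```

The tolerance is needed because published values are rounded, so exact equality between a total and its parts cannot be expected in general.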