nlpaueb / edgar-crawler

The only open-source toolkit that can download EDGAR financial reports and extract textual data from specific item sections into nice and clean JSON files.

License: GNU General Public License v3.0

Language: Python (100%)
Topics: python, finance, sec, edgar, edgar-crawler, economics, business, natural-language-processing, nlp

edgar-crawler's Introduction

EDGAR-CRAWLER: Unlock the Power of Financial Documents πŸš€

EDGAR-CRAWLER-LOGO

Tired of sifting through endless financial reports of 100+ pages, struggling to extract meaningful insights?

πŸ“Š EDGAR-CRAWLER is an open-source & optimized toolkit that retrieves key information from financial reports. It can crawl any report found in the SEC EDGAR database, the web repository for all publicly traded companies in the USA.

Most importantly, apart from downloading EDGAR filings like other standard toolkits, EDGAR-CRAWLER can also preprocess and convert them from lengthy and unstructured documents into clean and easy-to-use JSON files.

EDGAR-CRAWLER has 2 core modules:

πŸ“₯πŸ•·οΈ Business Documents Crawling: Utilize the power of the edgar_crawler.py module to effortlessly crawl and download financial reports for every publicly-traded company within your specified years.

πŸ”πŸ“‘ Item Extraction: Extract and clean specific text sections such as Risk Factors or Management's Discussion & Analysis from 10-K documents (annual reports) using the extract_items.py module. Get straight to the heart of the information that matters most.

Who Can Benefit from EDGAR-CRAWLER?

πŸ“š Academics: Enhance your NLP research in economics & finance or business management by accessing and analyzing financial data efficiently.

πŸ’Ό Professionals: Strengthen financial analysis, strategic planning, and decision-making with comprehensive, easy-to-interpret financial reports.

πŸ›  Developers: Seamlessly integrate financial data into your models, applications, and experiments using our open-source toolkit.

Star History

Star History Chart

🚨 News

  • 2023-01-16: EDGAR-CORPUS, the biggest financial NLP corpus (generated from EDGAR-CRAWLER), is available as a HuggingFace πŸ€— dataset card. See Accompanying Resources for more details.
  • 2022-10-13: Updated documentation and fixed a minor import bug.
  • 2022-04-03: EDGAR-CRAWLER is available for Windows systems too.
  • 2021-11-11: We presented EDGAR-CORPUS, our sister work that started it all, at ECONLP 2021 (an EMNLP workshop) in the Dominican Republic.


Install

  • Before starting, create a fresh virtual environment with Python 3.8; we recommend installing and using Anaconda for this.
  • Install dependencies via pip install -r requirements.txt

Usage

  • Before running any script, you should edit the config.json file, which configures the behavior of our 2 modules.

    • Arguments for edgar_crawler.py, the module to download financial reports:
      • start_year XXXX: the first year of the range to download (default is 2021).
      • end_year YYYY: the last year of the range to download (default is 2021).
      • quarters: the quarters that you want to download filings from (List).
        Default value is: [1, 2, 3, 4].
      • filing_types: list of filing types to download.
        Default value is: ['10-K', '10-K405', '10-KT'].
      • cik_tickers: a list, or the path of a file, containing CIKs or tickers, e.g. [789019, "1018724", "AAPL", "TWTR"].
        If a file is provided, list each CIK or ticker on a separate line.
        If this argument is not provided, the toolkit downloads annual reports for all U.S. publicly traded companies.
      • user_agent: the User-agent (name/email) that will be declared to SEC EDGAR.
      • raw_filings_folder: the name of the folder where downloaded filings will be stored.
        Default value is 'RAW_FILINGS'.
      • indices_folder: the name of the folder where EDGAR TSV files will be stored. These are used to locate the annual reports. Default value is 'INDICES'.
      • filings_metadata_file: CSV filename to save metadata from the reports.
      • skip_present_indices: Whether to skip already downloaded EDGAR indices or download them nonetheless.
        Default value is True.
    • Arguments for extract_items.py, the module to clean and extract textual data from already-downloaded 10-K reports:
      • raw_filings_folder: the name of the folder where the downloaded documents are stored.
        Default value is 'RAW_FILINGS'.
      • extracted_filings_folder: the name of the folder where extracted documents will be stored.
        Default value is 'EXTRACTED_FILINGS'.
        For each downloaded report, a corresponding JSON file will be created containing the item sections as key-value pairs.
      • filings_metadata_file: CSV filename to load report metadata from (provide the same CSV file as in edgar_crawler.py).
      • items_to_extract: a list of the item sections to extract.
        e.g. ['7','8'] to extract 'Management’s Discussion and Analysis' and 'Financial Statements' section items.
        The default list contains all item sections.
      • remove_tables: Whether to remove tables containing mostly numerical (financial) data. This is mainly to facilitate NLP research, where numerical tables are often not useful.
      • skip_extracted_filings: Whether to skip already extracted filings or extract them nonetheless.
        Default value is True.
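Putting the defaults above together, a minimal config.json might look like the sketch below. The user_agent value is a placeholder you must replace with your own name/email, and the cik_tickers entries are just examples:

```json
{
  "edgar_crawler": {
    "start_year": 2021,
    "end_year": 2021,
    "quarters": [1, 2, 3, 4],
    "filing_types": ["10-K", "10-K405", "10-KT"],
    "cik_tickers": [789019, "1018724", "AAPL"],
    "user_agent": "Your Name (your.email@example.com)",
    "raw_filings_folder": "RAW_FILINGS",
    "indices_folder": "INDICES",
    "filings_metadata_file": "FILINGS_METADATA.csv",
    "skip_present_indices": true
  },
  "extract_items": {
    "raw_filings_folder": "RAW_FILINGS",
    "extracted_filings_folder": "EXTRACTED_FILINGS",
    "filings_metadata_file": "FILINGS_METADATA.csv",
    "items_to_extract": ["1A", "7"],
    "remove_tables": true,
    "skip_extracted_filings": true
  }
}
```

Note that filing_types must be a JSON list (e.g. ["10-K"]), not a bare string; passing a string raises a TypeError from pandas' isin().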
  • To download financial reports from EDGAR, run python edgar_crawler.py.

  • To clean and extract specific item sections from already-downloaded 10-K documents, run python extract_items.py.

    • Reminder: item extraction is currently supported for 10-K documents only.
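As a sketch of consuming the extracted output, the snippet below builds a tiny stand-in JSON file mirroring the expected structure (item sections stored as key-value pairs) and reads one section back. The key names ("item_1A", "item_7") and metadata fields are assumptions for illustration; inspect your own files in the extracted_filings_folder for the exact keys.

```python
import json

# Build a tiny stand-in for an extracted filing; real files live in
# the extracted_filings_folder (default: 'EXTRACTED_FILINGS').
# Key names below ("item_1A", "item_7") are assumptions -- check your output.
sample = {
    "cik": "789019",
    "filing_type": "10-K",
    "item_1A": "Risk factors discussed here ...",
    "item_7": "Management's discussion and analysis ...",
}
with open("example_10K.json", "w") as f:
    json.dump(sample, f)

# Load the filing and pull one item section for downstream NLP work.
with open("example_10K.json") as f:
    filing = json.load(f)

risk_factors = filing.get("item_1A", "")
print(f"Risk Factors section: {len(risk_factors.split())} words")
```

Because each filing is a flat JSON object, selecting a section for NLP pipelines is a single dictionary lookup rather than a pass over a 100+ page document.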

Citation

An EDGAR-CRAWLER paper is on its way. Until then, please cite the relevant EDGAR-CORPUS paper published at the 3rd Economics and Natural Language Processing (ECONLP) workshop at EMNLP 2021 (Punta Cana, Dominican Republic):

@inproceedings{loukas-etal-2021-edgar,
    title = "{EDGAR}-{CORPUS}: Billions of Tokens Make The World Go Round",
    author = "Loukas, Lefteris  and
      Fergadiotis, Manos  and
      Androutsopoulos, Ion  and
      Malakasiotis, Prodromos",
    booktitle = "Proceedings of the Third Workshop on Economics and Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.econlp-1.2",
    pages = "13--18",
}

Read the paper here: https://aclanthology.org/2021.econlp-1.2/

Accompanying Resources

Contributing

PRs and contributions are welcome.

Please use the Feature Branch Workflow.

Issues

Please create an issue on GitHub instead of emailing us directly so all possible users can benefit from the troubleshooting.

License

Please see the GNU General Public License v3.0.

edgar-crawler's People

Contributors

dependabot[bot], eloukas, manosfer



edgar-crawler's Issues

Check downloaded filings and parsed items

Hi,
Thanks for sharing a great package.
It's not a real issue, but I hope a future version can check whether a filing has already been downloaded and an item has already been parsed, and then download or parse only what is missing.

Sometimes my computer crashes and I have to restart the session. Some filings and items are already in the output folders, but after a restart the script begins again from the beginning. Checking for already-downloaded filings would also be helpful when we want to update the data.

Best regards,
hhn

Super new to this

I am receiving an error when I run init.py. Can I have some help? I am incredibly new to this. This is the error I am seeing: /usr/local/bin/python3: can't open file '/Users/jakobwirthwein/Documents/GitHub/Edgar-Fillings/edgar-crawler/init.py': [Errno 1] Operation not permitted

Start/end year not working correctly

I'm trying to use edgar-crawler to create a data source of all 10-Ks for NLP embedding purposes. The default start-year and end-year are both 2021. When I attempted to change the start-year to 1993 and 1994, and the end-year to 2023, it still only downloaded the files from 2021. This happened both when I changed them from the command line and from the config.json file. Is there a way to fix this?

errno 22

with open(filepath, 'wb') as f:
OSError: [Errno 22] Invalid argument: 'C:\Users\Amir\PycharmProjects\edgar-crawler\datasets\RAW_FILINGS\1020214_10K_1999_https://www.txt'

Hi eloukas, I got an error when using crawler.py

When I run the crawler.py code in VS Code, it returns an error message; it seems like an OS problem. Could you please help me tackle it? Many thanks in advance.

OSError: [Errno 22] Invalid argument: 'C:\Users\liyif\Documents\GitHub\edgar-crawler\datasets\RAW_FILINGS\1000209_10K_2019_https://www.htm'

Parallelism not working

In edgar_crawler.py, it appears we try to issue a number of downloads in parallel by creating a list_of_series in main. However, the way get_specific_indices() is coded, it returns a DataFrame with one entry per filing, so list_of_series is really a list of single-item entries. Hence crawl is called separately for each filing to download, which seems very slow. Am I missing something? Is this a bug, or intentional for some reason?

edgar_crawler.py filing_types: TypeError: only list-like objects are allowed to be passed to isin(), you passed a str

Hi,

I edited the config.json file as suggested, as follows. I only chose two companies (CIKs) to test, because I would like to see the format of the raw filings and where they are located.

{'edgar_crawler': {'start_year': 2021, 'end_year': 2021, 'quarters': [1, 2, 3, 4], 'filing_types': '10-K', 'cik_tickers': ['1318605', '1018724'], 'user_agent': '[email protected]', 'raw_filings_folder': 'RAW_FILINGS', 'indices_folder': 'INDICES', 'filings_metadata_file': 'FILINGS_METADATA.csv', 'skip_present_indices': True}, 'extract_items': {'raw_filings_folder': 'RAW_FILINGS', 'extracted_filings_folder': 'EXTRACTED_FILINGS', 'filings_metadata_file': 'FILINGS_METADATA.csv', 'items_to_extract': ['1', '1A', '7', '8'], 'remove_tables': True, 'skip_extracted_filings': True}}

Next, I executed the edgar_crawler.py file and an error popped up. I am not sure I understand the error here. Could you please help me or give me some suggestions to debug?

run_edgar_crawler - Jupyter Notebook.pdf

Thank you so much for your help.
Jun
