sagemaker-jumpstart-industry-pack's Introduction

SageMaker JumpStart Industry Python SDK

The SageMaker JumpStart Industry Python SDK is a client library of Amazon SageMaker JumpStart. The library provides tools for feature engineering, training, and deploying industry-focused machine learning models on SageMaker JumpStart. With this industry-focused SDK, you can curate text datasets, and train and deploy language models.

In particular, for the financial services industry, you can use a new set of multimodal (long-form text, tabular) financial analysis tools within Amazon SageMaker JumpStart. With these new tools, you can enhance your tabular ML workflows with new insights from financial text documents and help save weeks of development time. By using the SDK, you can directly retrieve financial documents such as SEC filings, and further process financial text documents with features such as summarization and scoring for sentiment, litigiousness, risk, and readability.

In addition, you can access language models pretrained on financial texts for transfer learning, and use example notebooks for data retrieval, feature engineering of text data, enhancing the data into multimodal datasets, and improving model performance.

SageMaker JumpStart Industry also provides prebuilt solutions for specific use cases (for example, credit scoring), which are fully customizable and showcase the use of AWS CloudFormation templates and reference architectures to accelerate your machine learning journey.

For detailed documentation, including the API reference, see ReadTheDocs.

Installing the SageMaker JumpStart Industry Python SDK

The SageMaker JumpStart Industry Python SDK is released to PyPI and can be installed with pip as follows:

pip install smjsindustry

You can also install from source by cloning this repository and running a pip install command in the root directory of the repository:

git clone https://github.com/aws/sagemaker-jumpstart-industry-python-sdk.git
cd sagemaker-jumpstart-industry-python-sdk
pip install .

Supported Operating Systems

The SageMaker JumpStart Industry Python SDK supports Unix/Linux and Mac.

Supported Python Versions

The SageMaker JumpStart Industry Python SDK is tested on:

  • Python 3.6
  • Python 3.7
  • Python 3.8
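
Before installing, it can be convenient to confirm that the running interpreter is one of the tested versions listed above. The following is a minimal sketch using only the standard library; the names `TESTED_VERSIONS` and `is_tested_version` are hypothetical helpers for illustration and are not part of the SDK.

```python
import sys

# Versions this README lists as tested, as (major, minor) tuples.
TESTED_VERSIONS = {(3, 6), (3, 7), (3, 8)}


def is_tested_version(version_info=None):
    """Return True if the (major, minor) pair is one of the tested versions.

    Defaults to the running interpreter when no version is given.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in TESTED_VERSIONS
```

Calling `is_tested_version()` with no arguments checks the current interpreter. A `False` result only means the version is not in the tested list, not that the SDK is known to fail on it.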

AWS Permissions

The SageMaker JumpStart Industry Python SDK runs on Amazon SageMaker. As a managed service, Amazon SageMaker performs operations on your behalf on the AWS hardware that is managed by Amazon SageMaker. Amazon SageMaker can perform only operations that the user permits. You can read more about which permissions are necessary in the Amazon SageMaker Documentation.

The SageMaker JumpStart Industry Python SDK should not require any additional permissions aside from what is required for using SageMaker. However, if you are using an IAM role with a path in it, you should grant permission for iam:GetRole.
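
To tell whether the `iam:GetRole` note applies to you, you can inspect the role ARN for a path component. The sketch below assumes the standard IAM role ARN layout (`arn:aws:iam::<account-id>:role/<optional/path/><role-name>`); `role_arn_has_path` is a hypothetical helper, not an SDK function.

```python
def role_arn_has_path(role_arn):
    """Return True if the IAM role ARN contains a path before the role name."""
    # An ARN has six colon-separated fields; the last one is the resource,
    # for example "role/SageMakerRole" or "role/service-role/MyRole".
    resource = role_arn.split(":", 5)[-1]
    if not resource.startswith("role/"):
        raise ValueError("not an IAM role ARN: {}".format(role_arn))
    # Any further "/" after the "role/" prefix means a path is present.
    return "/" in resource[len("role/"):]
```

For example, an ARN ending in `role/service-role/MyRole` has the path `service-role/`, while one ending in `role/SageMakerRole` has none.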

Licensing

The SageMaker JumpStart Industry Python SDK is licensed under the Apache 2.0 License. It is copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. The license is available at Apache License.

Legal Notes

  1. The SageMaker JumpStart Industry solutions, notebooks, demos, and examples are for demonstrative purposes only. They are not financial advice and should not be relied on as financial or investment advice.
  2. The SageMaker JumpStart Industry solutions, notebooks, demos, and examples use data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions located in the Accessing EDGAR Data page.

Running Tests

The SageMaker JumpStart Industry SDK has unit tests and integration tests.

You can install the libraries needed to run the tests by running pip install --upgrade .[test] or, for Zsh users: pip install --upgrade .\[test\]

Unit tests

We use tox to run unit tests. Tox is an automated test tool that helps you run unit tests easily on multiple Python versions, and also checks that the code style meets our standards. We run tox with all of our supported Python versions (Python 3.6, Python 3.7, Python 3.8). To run unit tests with the same configuration as we do, you need to have interpreters for those Python versions installed.

To run the unit tests with tox, run:

tox tests/unit

Integration tests

To run the integration tests, you need to first prepare an AWS account with certain configurations:

  1. AWS account credentials are available in the environment for the boto3 client to use.
  2. The AWS account has an IAM role named SageMakerRole. It should have the AmazonSageMakerFullAccess policy attached as well as a policy with the necessary permissions to use Elastic Inference.
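
The two prerequisites above can be checked programmatically before starting a long test run. This is a minimal sketch, not part of the SDK: `missing_prerequisites` and `REQUIRED_POLICY` are hypothetical names, and `session` is assumed to behave like a `boto3.Session`. It only looks at the first page of attached policies and does not verify the Elastic Inference permissions, which depend on how your policy is named.

```python
REQUIRED_POLICY = "AmazonSageMakerFullAccess"


def missing_prerequisites(session, role_name="SageMakerRole"):
    """Return a list of problems found; an empty list means the checks passed.

    `session` is expected to behave like a boto3.Session.
    """
    problems = []

    # Prerequisite 1: credentials must be resolvable from the environment.
    if session.get_credentials() is None:
        problems.append("no AWS credentials found")
        return problems  # the IAM call below would fail without credentials

    # Prerequisite 2: the role must exist and have the managed policy attached.
    iam = session.client("iam")
    try:
        response = iam.list_attached_role_policies(RoleName=role_name)
    except iam.exceptions.NoSuchEntityException:
        problems.append("IAM role {} does not exist".format(role_name))
        return problems

    # Only the first page of results is examined in this sketch.
    attached = {policy["PolicyName"] for policy in response["AttachedPolicies"]}
    if REQUIRED_POLICY not in attached:
        problems.append(
            "{} is not attached to {}".format(REQUIRED_POLICY, role_name)
        )
    return problems
```

With boto3 installed, `missing_prerequisites(boto3.Session())` returns the list of problems for the credentials currently in the environment.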

We recommend selectively running just those integration tests you would like to run. You can filter by individual test function names with:

tox -- -k 'test_function_i_care_about'

You can also run all of the integration tests with the following command. The tests run in sequence, so this may take a while:

tox -- tests/integ

Building Sphinx Docs Locally

Install the dev version of the library:

pip install -e .\[all\]

Install Sphinx and the dependencies listed in sagemaker-jumpstart-industry-python-sdk/docs/requirements.txt:

pip install sphinx
pip install -r sagemaker-jumpstart-industry-python-sdk/docs/requirements.txt

Then cd into the sagemaker-jumpstart-industry-python-sdk/docs directory and run:

make html && open build/html/index.html

sagemaker-jumpstart-industry-pack's People

Contributors

amazon-auto, derrickzhang123, hehehe47, johnhe-dev, jsspric, mchoi8739, sophiayue1116

sagemaker-jumpstart-industry-pack's Issues

Missing functions and objects in notebook 4

Hi,

I'm trying to run the lines of code in: https://github.com/aws/sagemaker-jumpstart-industry-pack/blob/main/docs/source/notebooks/finance/notebook4/SEC_10K_10Q_8K_section_extraction.rst

However, I am missing the function "get_form_items" and the objects "columns_10K" and "header_mappings_10K" in the following block

I could not find these objects in the package either. Any help appreciated.

MDNA not being parsed for some filings

First off, thank you for this well-documented and highly functional SDK. It works well for some stock tickers / CIKs, but per the attachment, the CSV produced by following the "Download the filings you wish to work with" doc appears to be a bit off for other companies. Moreover, the mdna column doesn't appear to be extracted. Is there a way to resolve this?

Below is the config I used when running in SageMaker Studio for some 10Ks just filed this year. Thanks again for all of this and making it easy to run in AWS.

%%time

dataset_config = EDGARDataSetConfig(
    tickers_or_ciks=['syf', 'moh', 'dal', 'wm'],     # list of stock tickers or CIKs
    form_types=['10-K'],                             # list of SEC form types
    filing_date_start='2022-01-01',                  # starting filing date
    filing_date_end='2022-12-31',                    # ending filing date
    email_as_user_agent='[email protected]')        # user agent email
    
data_loader = DataLoader(
    role=sagemaker.get_execution_role(),    # loading job execution role
    instance_count=1,                       # instances number, limit varies with instance type
    instance_type='ml.c5.2xlarge',          # instance type
    volume_size_in_gb=30,                   # size in GB of the EBS volume to use
    volume_kms_key=None,                    # KMS key for the processing volume
    output_kms_key=None,                    # KMS key ID for processing job outputs
    max_runtime_in_seconds=None,            # timeout in seconds. Default is 24 hours.
    sagemaker_session=sagemaker.Session(),  # session object
    tags=None)                              # a list of key-value pairs
    
data_loader.load(
    dataset_config,
    's3://{}/{}/{}'.format(bucket, secdashboard_processed_folder, 'output'),      # output s3 prefix (both bucket and folder names are required)
    'dataset_10k_2022.csv',                                                       # output file name
    wait=True,
    logs=True)

Attachment: dataset_10k_2022.csv
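
One way to narrow the problem down is to list which rows of the downloaded CSV have an empty mdna field. The following is a minimal sketch using only the standard library. It assumes the output file has a header row with an `mdna` column, as described above; the `ticker` and `filing_date` column names used to identify rows are assumptions and may need adjusting to match the actual file. `rows_missing_column` is a hypothetical helper, not an SDK function.

```python
import csv

# Filing text can exceed the csv module's default field size limit,
# so raise it to a value that is safe on common platforms.
csv.field_size_limit(2**31 - 1)


def rows_missing_column(csv_path, column="mdna",
                        id_columns=("ticker", "filing_date")):
    """Return the identifying fields of rows whose `column` is empty or absent."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value = row.get(column) or ""
            if not value.strip():
                # id_columns are assumed names; absent ones come back as None.
                missing.append(tuple(row.get(name) for name in id_columns))
    return missing
```

Running `rows_missing_column('dataset_10k_2022.csv')` on a local copy of the output would show whether the missing MD&A text is limited to particular companies or affects every filing.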
