dxc-technology / dxc-industrialized-ai-starter

26.0 26.0 39.0 29.23 MB

Industrialized AI Starter

License: Apache License 2.0

Python 1.97% Jupyter Notebook 98.03% CSS 0.01%

automl data-science python

dxc-industrialized-ai-starter's People

Contributors

Stargazers

Watchers

Forkers

fsiddiqi ompsingh mandeepdxc vamsi7behara praveenanantharaman jonfernandes abhay-channe pradyutdec maniac0r openbsod madhucsc koverholt spandana-bendi miggytrinidad lwrnnglyflchr jemusni07 amarify priyatj rameshwargupta97 sureshathanti knmitri clu25 vivekbachala philipeldh aumerhadi hercolubus itsmemarty likhil jdamascoty madhu407 karthikreddyks75 soujanya8977 longs madhubandru kishorpulagam92 karthikreddykuna giuseppecozza bfrichot roodk

dxc-industrialized-ai-starter's Issues

Metrics or statistics of differences between raw data and clean data.

name	title	about	labels	assignees
Transparency Request	metrics or statistics of differences between raw data and clean data.

Describe the area of code that needs more transparency:
Display metrics or statistics that show the difference between raw data and clean data.
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Show the stats of raw and clean data as different columns
We should have metrics for both categorical and numerical data. We should also think about how to provide usable metrics for data sets with many features.
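
One possible shape for such a report is sketched below using only the Python standard library. The function names `summarize_column` and `compare_raw_and_clean`, the choice of statistics, and the sample data are all assumptions for illustration; none of them are part of the `ai` package. Missing values are assumed to be represented as `None`.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_column(values: Sequence[Optional[float]]) -> dict:
    """Return simple summary statistics for one numeric column."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": mean(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


def compare_raw_and_clean(raw, clean) -> list:
    """Build rows of (metric, raw value, clean value) for side-by-side display."""
    raw_stats = summarize_column(raw)
    clean_stats = summarize_column(clean)
    return [(name, raw_stats[name], clean_stats[name]) for name in raw_stats]


# Hypothetical data: the cleaner is assumed to have dropped the missing values.
raw_ages = [34, None, 29, 41, None]
clean_ages = [34, 29, 41]

for metric, raw_value, clean_value in compare_raw_and_clean(raw_ages, clean_ages):
    print(f"{metric:<8} raw={raw_value!s:<8} clean={clean_value}")
```

For data sets with many features, the same rows could be computed per column and filtered to the columns where the raw and clean values differ.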

Provide Auto-ML documentation link in user guide for running models

name	title	about	labels	assignees
Transparency Request	Provide Auto-ML documentation link

Describe the area of code that needs more transparency:
Provide Auto-ML documentation link in the user guide for running models
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
If possible, we should provide an easy way to expose deep links to the specific algorithms, as part of supporting data scientists in making their work explainable.

distplot is deprecated

Describe the bug
distplot is a deprecated function and will be removed in a future version. We need to find an alternative to replace it.

To Reproduce
Steps to reproduce the behavior:

In DXC-Industrialized-AI-Starter.ipynb in Colab, execute the command ai.plot_distributions(data1).
You will get this warning:
distplot is a deprecated function and will be removed in a future version. Please adapt your code to use either displot (a figure-level function with similar flexibility) or histplot (an axes-level function for histograms). Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be data, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

Expected behavior
We need to find an alternative for this function.

Screenshots

Additional context
N/A

Is it possible to include a function to test api keys for Mongodb?

I am wondering how to test the API keys obtained from MongoDB using this library. Is such a function included?

By "test" I mean checking that the API keys work and can successfully establish a connection.

Research- Add encryption to the published microservice

name	title	about	labels	assignees
Transparency Request	Add encryption to the published microservice		Research

Describe the area of code that needs more transparency:
Add encryption to the published microservice
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Research levels of best-practice security for different types of data. We could offer a parameter mapped to predefined configurations for low, medium, high, and/or extra-high levels of security.

Read data from local excel file in Colab

Describe the bug
Uploading an Excel file in Colab fails with the error below:

read_data_frame_from_local_excel_file()
29 uploaded = files.upload()
30 excel_file_name = list(uploaded.keys())[0]
---> 31 df = pd.read_excel(io.BytesIO(uploaded[excel_file_name]))
32 return(df)
33

NameError: name 'io' is not defined
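
The traceback indicates that the module defining the function never imports `io`. A minimal sketch of the buffer handling follows. The helper name `first_upload_as_buffer` is hypothetical, and the `uploaded` dictionary is a stand-in for the mapping of file names to bytes that `files.upload()` returns in Colab.

```python
import io  # the import the traceback reports as missing


def first_upload_as_buffer(uploaded: dict) -> tuple:
    """Return (file name, in-memory binary buffer) for the first uploaded file."""
    file_name = list(uploaded.keys())[0]
    return file_name, io.BytesIO(uploaded[file_name])


# Stand-in for the result of files.upload(); the bytes are placeholder content.
uploaded = {"example.xlsx": b"placeholder bytes"}
name, buffer = first_upload_as_buffer(uploaded)
print(name, buffer.read())
# In the library function, the buffer would then be passed to pd.read_excel(buffer).
```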

To Reproduce
Steps to reproduce the behavior:

dataframe = ai.read_data_frame_from_local_excel_file()
Browse the excel file (in Colab)
See error

Expected behavior
Dataframe with the Excel data

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

Publish DXC-Industrialized-AI-Starter notebook file to GitHub.

Publish DXC-Industrialized-AI-Starter google colab notebook file to GitHub.

Verbose and succinct mode for running an experiment

By default ai.run_experiment() produces a lot of output. Can we add a parameter to the function that defines a verbose and succinct mode? In verbose mode, the function produces the full output, but in succinct mode, only a summary is output. I recommend making succinct mode the default.
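
A rough sketch of what such a switch could look like is shown below. The function `run_steps` and its `verbose` parameter are hypothetical and only illustrate the idea; they do not reflect how `ai.run_experiment()` is implemented.

```python
def run_steps(steps, verbose=False):
    """Run each (name, callable) step; print per-step detail only when verbose."""
    results = {}
    for name, step in steps:
        results[name] = step()
        if verbose:
            print(f"[{name}] -> {results[name]}")
    # The one-line summary is printed in both modes.
    print(f"Completed {len(results)} step(s): {', '.join(results)}")
    return results


# Succinct mode is the default, as the request recommends.
run_steps([("load", lambda: 150), ("train", lambda: "ok")])
run_steps([("load", lambda: 150), ("train", lambda: "ok")], verbose=True)
```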

"as matrix" error in explore_features function.

Users in the boot camp are encountering an "as_matrix" error when calling ‘ai.explore_features(raw_data)’ in the AI Workbook.

cannot import name 'six' from 'sklearn.externals' error while installing DXC-Industrialized-AI-Starter

The sklearn.externals.six module was deprecated in scikit-learn 0.21 and removed in 0.23, so a hotfix is needed to avoid problems during boot camp use.

Cannot upload a local file into Colab

Problem:
In the DXC_Industrialized_AI_Starter.ipynb notebook, there are a couple of places where it suggests you can upload a local file to Colab. This doesn't work. The error message I get is:

MessageError: TypeError: Cannot read property '_uploadFiles' of undefined

ai.clean_dataframe() unable to parse date formatted in MM/D/YYYY

MM/D/YYYY and MM/DD/YYYY are two popular US date formats, and it appears that the ai.clean_dataframe() function cannot recognize them.

ParserError: Could not match input '10/1/2020' to any of the following formats: YYYY-MM-DD, YYYY-M-DD, YYYY-M-D, YYYY/MM/DD, YYYY/M/DD, YYYY/M/D, YYYY.MM.DD, YYYY.M.DD, YYYY.M.D, YYYYMMDD, YYYY-DDDD, YYYYDDDD, YYYY-MM, YYYY/MM, YYYY.MM, YYYY, W
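
For illustration, here is a minimal sketch of a fallback parser that accepts US-style dates, written with the standard library's `datetime.strptime` rather than whichever parser `ai.clean_dataframe()` uses internally. The function name `parse_date` and the format list are assumptions. The sketch also assumes month-first order, since MM/DD and DD/MM cannot be told apart automatically.

```python
from datetime import date, datetime

# Candidate formats, tried in order. When parsing, %m and %d accept both
# zero-padded and unpadded numbers, so "10/1/2020" and "10/01/2020" both match.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text: str) -> date:
    """Try each candidate format and return the first successful parse."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Could not match input {text!r} to any known format")


print(parse_date("10/1/2020"))
```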

Add test cases links in contributing guide

name	title	about	labels	assignees
Transparency Request	Add test cases links in contributing guide

Describe the area of code that needs more transparency:
Add links to the test cases in the contributing guide, revisit the guide, and make any necessary changes.
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:

Unclear "Set up the development environment" instructions

Most of those attending the DXC-Industrialized AI course have not used Colab before.
Some of the getting-started instructions in "Set up the development environment" could be clearer,
e.g. "This code installs all the packages you'll need. Run it first."
Most people who have not used Colab before would not know how to do this.

Time series model option

Can you include a time series model option in the experiment design?

Indicate the completeness or correctness of the data and show the outliers

name	title	about	labels	assignees
Transparency Request	indicate the completeness or correctness of the data and show the outliers

Describe the area of code that needs more transparency:
indicate the completeness or correctness of the data and show the outliers
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Logan comments:
Not just visualization: completeness, correctness, outliers, and other metrics should be saved as statistics, too. We can’t force others to make their work explainable to an end-user, but we can ensure that their work is capable of being explained if they use our package.
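
As a sketch of statistics that could be saved alongside any plots, the snippet below computes a completeness ratio and flags outliers with the common 1.5 × IQR rule. The function names, the `None` convention for missing values, and the choice of the IQR rule are assumptions for illustration and are not taken from the package.

```python
from statistics import quantiles


def completeness(values) -> float:
    """Fraction of entries that are not None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def iqr_outliers(values, k: float = 1.5) -> list:
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing entries.

    statistics.quantiles needs at least two data points.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in present if v < low or v > high]


# Hypothetical column with one missing entry and one extreme value.
column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100, None]
saved_stats = {
    "completeness": completeness(column),
    "outliers": iqr_outliers(column),
}
print(saved_stats)
```

Returning a plain dictionary keeps the metrics easy to store next to the data, which is the point of the comment above.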

Refactoring the code into sub modules

Refactor the initial version of the code into sub modules.

Generalizing the conversion of Column names issue in clean_dataframe, write_raw_data, pipeline functions

Generalizing the conversion of Column names issue in clean_dataframe, write_raw_data, pipeline functions.

Publishing a microservice of a custom model

In the documentation, can you show an example of publishing a microservice from a custom model instead of a model generated by the run_experiment() function? The custom model could be something as simple as a function that adds two numbers or appends text to an input.
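
A custom model of the kind described could be as small as the sketch below. Algorithmia's Python algorithms conventionally expose an `apply(input)` entry point; the input schema used here (`a` and `b` keys) is an assumption, and the source does not say how `ai.publish_microservice()` would accept such a function.

```python
def apply(input):
    """Entry point: expects {"a": number, "b": number} and returns their sum.

    The parameter is named `input` to follow the hosting convention, even
    though it shadows the Python built-in of the same name.
    """
    return {"sum": input["a"] + input["b"]}


print(apply({"a": 2, "b": 3}))
```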

'int' object has no attribute 'split' Error in data pipeline function

A user encounters the error 'int' object has no attribute 'split' in the data pipeline function while working with numeric data operators.

Create an Issue template for AI Ethics & Trustworthiness

Create an Issue template to capture the issues/changes for AI Ethics & Trustworthiness

List of available data sets not available

The link in the example that is supposed to list the available data sets is broken.

[BUG] Error when importing ai package

Describe the bug
Error occurred when trying to import ai from dxc

To Reproduce
Steps to reproduce the behavior:

used pip to install ai starter package
from dxc import ai
error

Expected behavior
The ai module should load as it did previously.

Screenshots

There is an identification issue in publish microservice function.

When the function writes source code to Algorithmia, that source code contains an issue.

[BUG] Unable to run AI experiments for Time-Series problems

Describe the bug
I am unable to run AI experiments for Time-Series problems

To Reproduce
When I passed timeseries as a parameter, I got an error as below:

I tried to pick up an example from https://nbviewer.jupyter.org/github/dxc-technology/DXC-Industrialized-AI-Starter/blob/c58754247060262ac0949396e48f71861cb79d4e/Examples/Time_series_Model.ipynb

on setting the value : "model": 'timeseries',
The timeseries values are not displaying as expected. Instead it shows the same value for all predictions

Please let me know a way to handle time-series problems.

Expected behavior
Please create a function for time-series problems, or restore the functionality, since it looks like it was already implemented.

Screenshots
Added the images

Additional context
Add any other context about the problem here.

custom dataset creation for images

Resize the pictures
Convert all images into the same file format
Merging the images into a single file
Convert images into a CSV file
Few Changes to the CSV file
Loading the CSV file

Add separator parameter for reading csv files with ai.read_data_frame_from_local_csv()

ai.read_data_frame_from_local_csv() needs to be able to accept files with other types of separators. I tried to import a pipe delimited csv file and could not set sep = '|' to be able to read the file
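
To illustrate the requested behavior, the sketch below parses pipe-delimited text with a configurable separator. It uses the standard library's `csv` module to stay dependency-free; in pandas, the analogous parameter is `sep` on `pandas.read_csv`, so the change would presumably amount to forwarding such a parameter. The function name `read_delimited` and the sample data are hypothetical.

```python
import csv
import io


def read_delimited(text: str, sep: str = ",") -> list:
    """Parse delimited text into a list of row dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))


# Hypothetical pipe-delimited content standing in for an uploaded file.
pipe_text = "name|score\nalice|10\nbob|12\n"
rows = read_delimited(pipe_text, sep="|")
print(rows)
```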

Code failed when try to import dxc ai - bug level : Blocker

Describe the bug
The code fails when I run the part responsible for importing ai from DXC (from dxc import ai)

To Reproduce
Steps to reproduce the behavior:

pip install DXC library (First block of code)
run the second block of code (from dxc import ai)

Expected behavior
The code should import the ai from DXC library without any issue

Screenshots

Additional context
None

Handle column names in data pipeline

Users face issues with column names in the "access_data_from_pipeline" function in the following scenarios:

When column names are case sensitive.
When column names contain a SPACE.

A fix is needed to:

Handle case-sensitive column names.
Handle column names containing a SPACE by replacing it with “_”.
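
One way the normalization could be sketched is shown below. The function names are hypothetical, and lowercasing is an assumed way of handling case sensitivity, since the issue does not say which casing should win. The collision check guards against two distinct names mapping to the same result.

```python
import re


def normalize_column_name(name: str) -> str:
    """Trim, replace runs of whitespace with "_", and lowercase (assumed policy)."""
    return re.sub(r"\s+", "_", name.strip()).lower()


def normalize_columns(names) -> list:
    """Normalize every name and fail loudly if two names collide afterwards."""
    normalized = [normalize_column_name(n) for n in names]
    if len(set(normalized)) != len(normalized):
        raise ValueError("column names collide after normalization")
    return normalized


print(normalize_columns(["First Name", "AGE", " zip code "]))
```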

[BUG] pandas.io.json.json_normalize is deprecated (Build Data Pipeline)

Describe the bug
The function pandas.io.json.json_normalize is deprecated; the recommendation is to use pandas.json_normalize instead.

To Reproduce
Steps to reproduce the behavior:

Go to the "Build data pipeline" section in Colab
Click on the cell #TODO: Define the code needed to refine the raw data
Run cell
See error

Expected behavior
no error from Pandas lib

Screenshots

[BUG] Loose Dependency Resolver Constraints

Describe the bug
Pip install DXC_Industrialized_AI_Starter-2.3.9-py3-none-any.whl takes several minutes and times out in Google Colab due to multiple versions of libraries being downloaded

To Reproduce
Steps to reproduce the behavior:
run pip install

Expected behavior
The starter should install within 10 minutes at most.

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

Clustering model option

Can you include a clustering model option in the experiment design?

Develop a function to download datasets directly from DXC public github

Reasons to implement

Users have the flexibility to download only the datasets they need.
Reduce the size of the AI Starter library by moving all datasets to the DXC public GitHub and letting users download them on demand.

Deep learning model option

Can you include a deep learning model option in the experiment design?

Rename the variable name in notebook

name	title	about	labels	assignees
Transparency Request	Rename the variable name in notebook

Describe the area of code that needs more transparency:
Rename the variable name representing the dataset after cleaning the data (ex: Raw_data, Clean_data)
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:

Can't resubmit after rework on badge

After getting a "rework" status on a badge, making the required changes and resubmitting produces an error response saying that an assertion already exists:

<Response [400]> { "errorMessage": { "statusCode": 400, "exception": "BadRequest", "message": "Assertion already exists", "payload": { "evidence": "https://colab.research.google.com/drive/1oMUEVLS5x1netqWaL0kAt8FVR-ABeJ4U", "lastUpdated": "2021-02-10T15:37:33Z", "badge": "Create a Data Story", "status": "rework", "created": "2021-02-09T16:06:25Z", "comments": [ { "date": "2021-02-10T15:37:33Z", "comment": "Hello, nice work, but for the sample data set area, please upload your data set, not the iris file.", "email": "[email protected]" } ], "email": "[email protected]", "d1": "user:6366a530-391b-41be-bbff-6f372658afef", "d2": "badge:dd05bbdf-ad5b-469d-ab2c-4dd218fd68fe", "salt": "ecaa9028a8feb321be864bf98ac1ebe6", "reviewer": "[email protected]", "sk": "assertion", "pk": "assertion:93372602-81eb-47dc-bf05-5bc475c276b6" } } }

Add the drift function after pipeline

name	title	about	labels	assignees
Transparency Request	Add the drift function after pipeline

Describe the area of code that needs more transparency:
Add the drift function after pipeline
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Calculate the drift between two given data sets over a period of time, or between training sets.
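
One common way to quantify drift for a single numeric feature is the Population Stability Index (PSI). The sketch below is a swapped-in illustration, not the package's drift function: the function name, the equal-width binning, and the small floor used to avoid taking the logarithm of zero are all assumptions.

```python
import math


def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between two numeric samples, using shared equal-width bins."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    if hi == lo:
        return 0.0  # every value is identical, so there is nothing to compare
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # The maximum value would land in bin `bins`; clamp it to the last bin.
            counts[min(int((v - lo) / width), bins - 1)] += 1
        floor = 1e-6  # keeps empty bins from producing log(0)
        return [max(c / len(values), floor) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical samples: the second is the first shifted upward by 50.
baseline = list(range(100))
shifted = [v + 50 for v in baseline]
print(population_stability_index(baseline, baseline))
print(population_stability_index(baseline, shifted))
```

Each term of the sum is non-negative, so the index is zero for identical distributions and grows as the two samples diverge.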

Column name issues with ai.clean_dataframe

Users face issues with the ai.clean_dataframe function when:

Column names are in upper case.
Column names contain a SPACE.

Creation of new algorithm through API.

In publish microservice, we need to create a new algorithm if it does not already exist in Algorithmia. This did not work in our previous code, and changes were made to make it run.

Verbose and succinct mode for publishing a micro-service

By default ai.publish_microservice() produces a lot of output. Can we add a parameter to the function that defines a verbose and succinct mode? In verbose mode, the function produces the full output, but in succinct mode, only a summary is output. I recommend making succinct mode the default.

Include metadata details

name	title	about	labels	assignees
Transparency Request	Include metadata details

Describe the area of code that needs more transparency:
Include metadata details while writing data into MongoDB
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Include in the code as a minimum, but best in Mongo
Where did we get the data from, and did we run the cleaner on the data?
What version of the data is it?
Where did we store the data?
This is a manual entry

Add logs to the crucial steps

name	title	about	labels	assignees
Transparency Request	Add logs to the crucial steps

Describe the area of code that needs more transparency:
Add logs to the crucial steps so that the user can revert changes
Describe the solution you'd like:
Describe the alternatives you've considered:
Additional context or comments:
Add logs to the crucial steps (such as aggregation) so that the user can revert changes if something goes wrong.
For log storage, we recommend providing configuration options for (at least) both MongoDB Atlas and local storage.
