DaPy is a data analysis library designed for ease of use. It lets you implement your ideas smoothly by providing well-designed data structures and a broad set of professional ML models. There are already many well-known data manipulation modules, such as Pandas, but no module that
- lets you write code in a chained (chain programming) style;
- lets you do simple feature engineering quickly through simple APIs;
- lets you operate on the data easily, row by row;
- shows the log of each step on the console, as MySQL does.
This example briefly shows three characteristics of DaPy: chain programming, the working log and simple feature engineering methods. Our goal in this example is to train a classifier for the Iris classification task. Detailed information can be read from here.
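The chaining and working-log ideas can be sketched in plain Python. The `Pipeline` class and its methods below are hypothetical names invented for illustration, not DaPy's API; the sketch only assumes that every operation returns the object itself and reports one log line per step.

```python
class Pipeline:
    """Toy chainable container; a hypothetical illustration, NOT DaPy's API."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.log = []

    def _record(self, step):
        # Keep a working log and echo it to the console.
        message = '%s: %d rows' % (step, len(self.rows))
        self.log.append(message)
        print(' - ' + message)
        return self  # returning self is what makes chaining possible

    def dropna(self):
        self.rows = [r for r in self.rows if None not in r]
        return self._record('dropna')

    def select(self, predicate):
        self.rows = [r for r in self.rows if predicate(r)]
        return self._record('select')

    def sort(self, key):
        self.rows.sort(key=key)
        return self._record('sort')


result = (Pipeline([(3, 'c'), (1, 'a'), (None, 'x'), (2, 'b')])
          .dropna()
          .select(lambda r: r[0] >= 2)
          .sort(key=lambda r: r[0]))
```

Because each step returns the same object, the whole cleaning sequence reads top to bottom as one expression, and the log records what each step did.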
We already have plenty of great libraries for data science, such as Numpy and Pandas, so why do we need DaPy?
The answer is that DaPy is designed for data analysts, not for coders. In DaPy, users only need to focus on their ideas for handling the data, and can pay less attention to coding tricks.
For example, although manipulating data row by row fits people's habits, it is not a good idea in Pandas. Because Pandas is built for operating on time series data, you should not modify the rows obtained from `DataFrame.iterrows()`. DaPy, however, relies on the concept of "views" to solve this problem, making it easy to process data by rows in a way that suits people's habits.
>>> import DaPy as dp
>>> sheet = dp.SeriesSet({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> for row in sheet:
...     print(row.A, row[0])  # get the value by column name or index
...     row[1] = 'b'  # assign value by index
...
1 1
2 2
3 3
>>> sheet.show()
A | B
---+---
1 | b
2 | b
3 | b
>>> row0 = sheet[0] # get row view object
>>> row0
[1, 'b']
>>> sheet.append_col(series=[7, 8, 9], variable_name='newColumn') # operate the sheet
>>> sheet.show()
A | B | newColumn
---+---+-----------
1 | b | 7
2 | b | 8
3 | b | 9
>>> row0
[1, 'b', 7]
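The "view" behaviour shown in this session can be approximated in a few lines of plain Python. This is a minimal sketch under stated assumptions, not DaPy's implementation: the `Sheet` and `RowView` classes are hypothetical, and the sketch assumes the data is stored column by column while a row object keeps only a reference to the sheet plus a row index.

```python
class Sheet:
    """Toy column-oriented table; hypothetical, NOT DaPy's implementation."""

    def __init__(self, columns):
        # Data is stored column by column: {name: [values, ...]}
        self.columns = {name: list(values) for name, values in columns.items()}

    def append_col(self, name, values):
        self.columns[name] = list(values)

    def __getitem__(self, index):
        return RowView(self, index)


class RowView:
    """Holds a reference to the sheet and a row index, never a copy."""

    def __init__(self, sheet, index):
        self._sheet = sheet
        self._index = index

    def __getitem__(self, position):
        name = list(self._sheet.columns)[position]
        return self._sheet.columns[name][self._index]

    def __setitem__(self, position, value):
        # Writes go straight through to the underlying column.
        name = list(self._sheet.columns)[position]
        self._sheet.columns[name][self._index] = value

    def tolist(self):
        return [col[self._index] for col in self._sheet.columns.values()]


sheet = Sheet({'A': [1, 2, 3], 'B': [4, 5, 6]})
row0 = sheet[0]
row0[1] = 'b'                              # modifies the sheet itself
sheet.append_col('newColumn', [7, 8, 9])   # change the sheet after taking the view
print(row0.tolist())                       # the view reflects the new column
```

Since the view reads from the sheet on every access, later changes to the sheet are visible through it, and assignments through it change the sheet.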
We hope DaPy is a user-friendly tool. Therefore, we put effort into the design of its APIs so that you can adapt to it quickly and use it flexibly. Here are just a few of the things that make DaPy simple:
- A variety of ways to visualize data in the console (CMD)
- 2D data sheet structures following Python syntax habits
- SQL-like APIs to process data
- A variety of functions for preprocessing and feature engineering
- Flexible IO tools for loading and saving data (e.g. Website, Excel, Sqlite3, SPSS, Text)
- Built-in basic models (e.g. Decision Tree, Multilayer Perceptron, Linear Regression, ...)
We also hope it can be used in real-world tasks, so we keep an eye on its efficiency. Although DaPy is implemented in pure Python, its efficiency is comparable to that of some existing libraries. The following diagram shows a test result on a dataset with 4.32 million rows and 7 columns.
The performance tests were defined as follows.
- Task 1: Load
  Libraries have to load the original data from a CSV file whose columns have different data types. Each library must automatically infer the best-matching data type for each column and convert the values accordingly. We recorded the time each library spent on the task. The commands we used are listed below.

      >>> pandas.read_csv(addr)
      >>> numpy.genfromtxt(addr, dtype=None, delimiter=',', encoding=None, names=True)
      >>> DaPy.read(addr)
- Task 2: Traverse
  Libraries have to traverse each row of the data loaded in Task 1. We recorded the time each library spent on the task. The commands we used are listed below.

      >>> for row in pd_DataFrame.itertuples(): pass
      >>> for row in np_Ndarray: pass
      >>> for row in dp_SeriesSet.iter_rows(): pass
- Task 3: Sort
  Libraries have to sort the records loaded in Task 1 by the column named "Price". We recorded the time each library spent on the task. The commands we used are listed below.

      >>> pd_DataFrame.sort_values(by='Price')
      >>> np_Ndarray.sort(axis=0, order='Price')
      >>> dp_SeriesSet.sort('Price')
- Task 4: Query
  Libraries have to select the records whose "Price" is greater than 99999. We recorded the time each library spent on the task. The commands we used are listed below.

      >>> pd_DataFrame.query('Price >= 99999')
      >>> numpy.extract(tuple(_['Price'] > 99999 for _ in np_Ndarray), np_Ndarray)
      >>> dp_SeriesSet.query('Price >= 99999', limit=None)
- Task 5: Groupby
  Libraries have to separate the records into groups according to the "Date" keyword, then calculate the mean of each column for each group. Because `numpy.ndarray` doesn't support the `groupby` operation, Numpy skips this task. We recorded the time each library spent on the task. The commands we used are listed below.

      >>> pd_DataFrame.groupby('Date')[['Price', 'Volume', 'Token', 'LastToken', 'LastMaxVolume']].mean()
      >>> dp_SeriesSet.groupby('Date', np.mean, apply_col=['Price', 'Volume', 'Token', 'LastToken', 'LastMaxVolume'])
- Task 6: Save
  Libraries have to save their data into a CSV file. We recorded the time each library spent on the task. The commands we used are listed below.

      >>> pd_DataFrame.to_csv('test_Pandas.csv', index=0)
      >>> np.savetxt('test_numpy.csv', np_Ndarray, delimiter=',', fmt='%s%s%s%s%s%s%s')
      >>> dp_SeriesSet.save('test_Numpy.csv')
The latest version, 1.10.1, has been published on PyPI.
pip install DaPy
Some functions in DaPy depend on third-party packages.
- xlrd: loading data from .xls files【Necessary】
- xlwt: exporting data to .xls files【Necessary】
- repoze.lru: speeding up loading data from .csv files【Necessary】
- savReaderWriter: loading data from .sav files【Optional】
- bs4.BeautifulSoup: automatically downloading data from a website【Optional】
- numpy: dramatically increasing the efficiency of ML models【Recommended】
- Load & Explore Data
  - Load data from a local csv, sav, sqlite3, mysql server, mysql dump file or xls file: `sheet = DaPy.read(file_addr)`
  - Display the first five and the last five records: `sheet.show(lines=5)`
  - Summarize the statistical information of each column: `sheet.info`
  - Count the distribution of a categorical variable: `sheet.count_values('gender')`
  - Find differences between the labels of a categorical variable: `sheet.groupby('city')`
  - Calculate the correlation between continuous variables: `sheet.corr(['age', 'income'])`
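To make two of the exploration steps concrete, here is a minimal standard-library sketch of what counting a categorical distribution and computing a Pearson correlation involve. It does not call DaPy; the `pearson` helper and the sample columns are made up for illustration, and how DaPy computes these values internally is not stated in this document.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up sample columns.
gender = ['F', 'M', 'F', 'F', 'M']
age = [20, 30, 40, 50, 60]
income = [2000, 3000, 4000, 5000, 6000]

counts = Counter(gender)   # distribution of a categorical variable
r = pearson(age, income)   # linear association between two continuous variables
print(counts, r)
```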
- Preprocessing & Clean Up Data
  - Remove duplicate records: `sheet.drop_duplicates(col, keep='first')`
  - Use linear interpolation to fill in NaN: `sheet.fillna(method='linear')`
  - Remove records in which more than 50% of the variables are NaN: `sheet.dropna(axis=0, how=0.5)`
  - Remove meaningless columns (e.g. ID): `sheet.drop('ID', axis=1)`
  - Sort records by some columns: `sheet = sheet.sort('Age', 'DESC')`
  - Merge external features from another table: `sheet.merge(sheet2, left_key='ID', other_key='ID', keep_key='self', keep_same=False)`
  - Merge external records from another table: `sheet.join(sheet2)`
  - Append records one by one: `sheet.append_row(new_row)`
  - Append new variables one by one: `sheet.append_col(new_col)`
  - Get parts of the records by index: `sheet[:10, 20:30, 50:100]`
  - Get parts of the columns by column name: `sheet['age', 'income', 'name']`
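Two of the clean-up steps, linear interpolation and dropping records with too many missing values, can be illustrated without any library. The sketch below is an assumption-laden illustration rather than DaPy's algorithm: `fill_linear` and `drop_sparse_rows` are hypothetical helpers, `None` stands in for NaN, and gaps at the edges of a column are left untouched.

```python
def fill_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def drop_sparse_rows(rows, max_missing=0.5):
    """Keep only rows whose share of None values does not exceed max_missing."""
    return [row for row in rows
            if sum(v is None for v in row) / len(row) <= max_missing]


column = fill_linear([1.0, None, None, 4.0, None])
kept = drop_sparse_rows([[1, None, None], [1, 2, None], [None, None, None]])
print(column)
print(kept)
```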
- Feature Engineering
  - Transform a date-time into categorical variables: `sheet.get_date_label('birth')`
  - Transform numerical variables into categorical variables: `sheet.get_categories(cols='age', cutpoints=[18, 30, 50], group_name=['Juveniles', 'Adults', 'Wrinkly', 'Old'])`
  - Transform categorical variables into dummy variables: `sheet.get_dummies(['city', 'education'])`
  - Create higher-order interaction terms between selected variables: `sheet.get_interactions(n_power=3, col=['income', 'age', 'gender', 'education'])`
  - Add the rank of each record: `sheet.get_ranks(cols='income', duplicate='mean')`
  - Standardize ordinary continuous variables: `sheet.normalized(col='age')`
  - Apply special processing to special variables: `sheet.normalized('log', col='salary')`
  - Create new variables from business-logic formulas: `sheet.apply(func=tax_rate, col=['salary', 'income'])`
  - Difference a time series to make it stationary: `DaPy.diff(sheet.income)`
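The ideas behind three of these transformations (binning by cut points, dummy variables and differencing) fit in a short standard-library sketch. The helper names `categorize`, `dummies` and `diff` are hypothetical, and the boundary rule used here (a value equal to a cut point falls into the upper group) is an assumption; this document does not say how DaPy treats boundaries.

```python
from bisect import bisect_right


def categorize(values, cutpoints, labels):
    """Map each number to the label of the interval it falls into."""
    if len(labels) != len(cutpoints) + 1:
        raise ValueError('need exactly one more label than cut points')
    return [labels[bisect_right(cutpoints, v)] for v in values]


def dummies(values):
    """Build one 0/1 indicator column per distinct category."""
    return {cat: [int(v == cat) for v in values] for cat in sorted(set(values))}


def diff(values, lag=1):
    """Difference between each value and the value `lag` steps before it."""
    return [b - a for a, b in zip(values, values[lag:])]


ages = [12, 18, 25, 47, 80]
groups = categorize(ages, [18, 30, 50], ['Juveniles', 'Adults', 'Wrinkly', 'Old'])
city_dummies = dummies(['a', 'b', 'a'])
income_diff = diff([1, 4, 9, 16])
print(groups, city_dummies, income_diff)
```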
- Developing Models
  - Choose a model and initialize it: `m = MLP()`, `m = LinearRegression()`, `m = DecisionTree()` or `m = DiscriminantAnalysis()`
  - Train the model parameters: `m.fit(X_train, Y_train)`
- Model Evaluation
  - Evaluate the model with parameter tests: `m.report.show()`
  - Evaluate the model with visualization: `m.plot_error()` or `DecisionTree.export_graphviz()`
  - Evaluate the model with a test set: `DaPy.methods.Performance(m, X_test, Y_test, mode)`.
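As a rough illustration of what a test-set evaluation of a classifier typically reports, the sketch below builds a confusion matrix and derives accuracy and Cohen's kappa from it. It uses only the standard library; the function names and sample labels are made up, and it is an assumption that these are the kinds of quantities a performance report would contain.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Rows are actual labels, columns are predicted labels."""
    labels = sorted(set(actual) | set(predicted))
    pairs = Counter(zip(actual, predicted))
    matrix = [[pairs[(a, p)] for p in labels] for a in labels]
    return labels, matrix


def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def kappa(matrix):
    """Cohen's kappa: agreement corrected for chance (undefined if chance agreement is 1)."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Made-up test-set labels and predictions.
y_true = ['a', 'a', 'b', 'b', 'b', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'c']
labels, matrix = confusion_matrix(y_true, y_pred)
print(labels, matrix, accuracy(matrix), kappa(matrix))
```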
- Saving Result
  - Save the model: `m.save(addr)`
  - Save the final dataset: `sheet.save(addr)`
✔️ = Done 🏃 = In Development 📆 = Put On the Agenda 🤔 = Not Sure
- Data Structures
  - DataSet (3-D data structure) ✔️
  - Frame (2-D general data structure) ✔️
  - SeriesSet (2-D general data structure) ✔️
  - Matrix (2-D mathematical data structure) ✔️
  - Row (1-D general data structure) ✔️
  - Series (1-D general data structure) ✔️
  - TimeSeries (1-D time sequence data structure) 🏃
- Statistics
  - Basic Statistics (mean, std, skewness, kurtosis, frequency, quantiles) ✔️
  - Correlation (Spearman & Pearson) ✔️
  - Analysis of variance ✔️
  - Compare Means (simple T-test, independent T-test) ✔️
- Operations
  - Beautiful CRUD APIs (Create, Retrieve, Update, Delete) ✔️
  - Flexible I/O Tool (supporting multiple data sources for input and output) ✔️
  - Dummy Variables (automatically parse nominal variables into dummy variables) ✔️
  - Difference Sequence Data ✔️
  - Normalize Data (log, normal, standard, box-cox) ✔️
  - Drop Duplicate Records ✔️
  - Group By (analyze the dataset while controlling for a group variable) ✔️
- Methods
  - LDA (Linear Discriminant Analysis) ✔️
  - LR (Linear Regression) ✔️
  - ANOVA (Analysis of Variance) ✔️
  - MLP (Multi-Layer Perceptron) ✔️
  - DT (Decision Tree) ✔️
  - K-Means 🏃
  - PCA (Principal Component Analysis) 🏃
  - ARIMA (Autoregressive Integrated Moving Average) 📆
  - SVM (Support Vector Machine) 🤔
  - Bayes Classifier 🤔
- Others
  - Manual 🏃
  - Example Notebook 🏃
  - Unit Test 🏃
- Xuansheng WU (@JacksonWoo: [email protected])
-
- Xuansheng WU
- Feichi YANG (@Nick Yang: [email protected])
- V1.10.1 (2019-08-22)
  - Added `SeriesSet.update()`, update some values of specific records;
  - Added `BaseSheet.tolist()` and `BaseSheet.toarray()`, transfer your data to a list or numpy.array;
  - Added `BaseSheet.query()`, select records with a Python statement given as a string;
  - Added `SeriesSet.dropna()`, drop rows or variables which contain NaN;
  - Added `SeriesSet.fillna()`, fill missing values in the dataset with a constant value or a linear model;
  - Added `SeriesSet.label_date()`, transfer a datetime object into several columns;
  - Added `DaPy.Row`, a view of a row record of the original data;
  - Added `DaPy.methods.DecitionTree`, a classifier implemented with the C4.5 algorithm;
  - Added `DaPy.methods.SignTest`, supporting some sign test algorithms;
  - Refactored the structure of the `DaPy.core.base` package;
  - Optimized `BaseSheet.groupby()`, 18 times faster than before;
  - Optimized `BaseSheet.select()`, 14 times faster than before;
  - Optimized `BaseSheet.sort()`, 2 times faster than before;
  - Optimized `dp.save()`, 1.6 times faster than before when saving data to a .csv file;
  - Optimized `dp.read()`, 10% faster than before when loading data from a .csv file;
- V1.9.2 (2019-04-23)
  - Added `BaseSheet.groupby()`, regroup your observations by specific columns;
  - Added `DataSet.apply()`, map a function onto the dataset along an axis;
  - Added `DataSet.drop_duplicates()`, automatically drop the duplicate records in the dataset;
  - Added `DaPy.Series`, a new data structure to hold a sequence of data;
  - Added `DaPy.methods.Performance()`, automatically test the performance of ML models;
  - Added `DaPy.methods.Kappa()`, calculate the Kappa index from a confusion matrix;
  - Added `DaPy.methods.ConfuMat()`, calculate the confusion matrix from your result;
  - Added `DaPy.methods.DecitionTree()`, implement the C4.5 decision tree algorithm;
  - Refactored the structure of the `DaPy.core.base` package;
  - More on `BaseSheet.select()`, supports the new keywords "limit" and "columns";
- V1.7.2 Beta (2019-01-01)
  - Added `get_dummies()`, supports automatic processing of nominal variables;
  - Added `show_time` attribute, an automatic timer for DataSet objects;
  - Added `boxcox()`, supports the Box-Cox transformation of sequence data;
  - Added `diff()`, supports calculating the differences of sequence data;
  - Added `DaPy.methods.LDA`, supports Discriminant Analysis with two methods (Fisher & Linear);
  - Added `row_stack()`, supports combining multiple data structures without DataSet;
  - Added `Row` structure for handling a record in a sheet;
  - Added `report` attribute to all classes in `methods`, so you can read a statistical report after training a model;
  - More on `read()`, supports automatically parsing data from a web address;
  - More on `SeriesSet.merge()`, more options when merging two SeriesSets;
  - Renamed `DataSet.pop_miss_value()` to `DataSet.dropna()`;
  - Refactored `methods`, more stable and more scalable in the future;
  - Refactored `methods.LinearRegression`, it can prepare a statistical report for you after training;
  - Refactored `BaseSheet.select()`, 5 times faster and a more pythonic API design;
  - Refactored `BaseSheet.replace()`, 20 times faster and a more pythonic API design;
  - Supported Python 3.x platform;
  - Fixed a lot of bugs;
- V1.5.3 (2018-11-17)
  - Added `select()`, quickly access partial data with some conditions;
  - Added `delete()`, delete data along an axis from a non-DaPy object;
  - Added `column_stack()`, merge several non-DaPy objects together;
  - Added `P()` & `C()`, calculate permutation numbers and combination numbers;
  - Added new syntax, so users can view the values of a column with a statement such as `data.title`;
  - Optimized `DaPy.save()`, supports additional output formats: html and SQLite3;
  - Refactored `BaseSheet`, less code and more flexible in the future;
  - Refactored `DataSet.save()`, more stable and more flexible in the future;
  - Rewrote a part of the basic mathematical functions;
  - Fixed some bugs;
- V1.4.1 (2018-08-19)
  - Added `replace()` for high-speed transferring of your data;
  - Optimized the speed of reading .csv files;
  - Refactored `methods.MLP`, it can now be customized with any layers, any activation functions and any cells;
  - Refactored `Frame` and `SeriesSet` to improve the efficiency;
  - Supported initialization from Pandas and Numpy data structures;
  - Fixed some bugs;
- V1.3.3 (2018-06-20)
  - Added `methods.LinearRegression` and `methods.ANOVA`;
  - Added `io.encode()` for better adaptation to Chinese;
  - Optimized `SeriesSet.__repr__()` and `Frame.__repr__()` to show data in a beautiful way;
  - Optimized `Matrix`, so that calculation is two times faster;
  - More on `read()`, supports external files such as Excel, SPSS, SQLite3 and CSV;
  - Replaced `DataSet.read_col()`, `DataSet.read_frame()` and `DataSet.read_matrix()` with `DataSet.read()`;
  - Refactored `DataSet`, which can now manage multiple sheets at the same time;
  - Refactored `Frame` and `SeriesSet`, removing the limitations on attributes;
  - Removed `DaPy.Table`;
- V1.3.2 (2018-04-26)
  - Added more useful functions for `DaPy.DataSet`;
  - Added a new data structure called `DaPy.Matrix`;
  - Added some mathematical formulas (e.g. corr, dot, exp);
  - Added `Multi-Layers Perceptrons` to DaPy.machine_learn;
  - Added some standard datasets;
  - Optimized the loading function significantly;
- V1.3.1 (2018-03-19)
  - Added a function that supports saving data as a csv file;
  - Fixed some bugs in the data loading function;
- V1.2.5 (2018-03-15)
  - First public beta version of DaPy!
Copyright (C) 2018 - 2019 Xuansheng Wu
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
# datapy
A light Python library for data processing and analysing.