pierrepo / autoclasswrapper

AutoClassWrapper: a Python :snake: wrapper for AutoClass C classification :package: :rocket:

Home Page: https://autoclasswrapper.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Python 16.91% Makefile 0.56% Jupyter Notebook 62.80% SQLPL 17.01% Shell 0.31% TeX 2.42%

autoclasswrapper's Introduction

Pierre Poulain, associate professor

I am an associate professor at Université Paris Cité, France. I currently perform my research at the Laboratory of Theoretical Biochemistry. I study the sharing and reuse of molecular dynamics simulation data and code. For this, I develop new methods and software, mostly in Python, using traditional machine learning techniques or deep learning. More generally, I have a strong interest in data analysis and data management in biology and bioinformatics.

Projects

Bayesian clustering

AutoClassWrapper AutoClassWeb

Jupyter-based platform for e-learning

Have a look at the Plasma project website.

Plasma AutoClassWeb

Structural bioinformatics

PBxplore Principal Axes

Misc

Bibliomarklets

autoclasswrapper's People

Contributors

arfon, kyleniemeyer, pierrepo, trallard

autoclasswrapper's Issues

Download URL is not working

Hi,
Thibault L. told me about the AutoClass algorithm. Unfortunately, the URL to the source code does not lead to the source code. Could you update the URL?
Best.

Output autoclass-c status

Maybe with a file:

  • autoclass-search-succeeded and autoclass-report-succeeded if OK
  • autoclass-search-failed and autoclass-report-failed if not OK
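
One way this proposal could look is sketched below. The helper name write_status_files and its arguments are hypothetical and are not part of AutoClassWrapper; only the marker file names come from the list above. The sketch also removes a stale marker from a previous run, which is an assumption about the desired behaviour.

```python
from pathlib import Path


def write_status_files(workdir, search_ok, report_ok):
    """Create empty marker files telling how each AutoClass C step ended.

    Hypothetical helper: the names follow the proposal above
    (autoclass-<step>-succeeded / autoclass-<step>-failed).
    """
    workdir = Path(workdir)
    created = []
    for step, ok in (("search", search_ok), ("report", report_ok)):
        wanted = "succeeded" if ok else "failed"
        stale = "failed" if ok else "succeeded"
        # Remove the opposite marker left by a previous run, if any
        (workdir / f"autoclass-{step}-{stale}").unlink(missing_ok=True)
        marker = workdir / f"autoclass-{step}-{wanted}"
        marker.touch()
        created.append(marker.name)
    return created
```

A caller could then test for the presence of a single file instead of parsing logs.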

Check AutoClass C is working

Hi Pierre,

I ran into the issue of also needing to install libc6-i386 for my AutoClass C installation. Without it, autoclass is recognized by wrapper.search_autoclass_in_path() but it never actually executes. As a result, the loop while not Path("autoclass-run-success").exists(): never exits.

I wonder whether wrapper.search_autoclass_in_path() could also run autoclass once to extract the version information. Here's an example of autoclass output.

autoclass-c/autoclass


AUTOCLASS C (version 3.3.6unx)

 AutoClass Search:
      > autoclass -search <.db2[-bin] file path> <.hd2 file path>
             <.model file path> <.s-params file path>

 AutoClass Reports:
      > autoclass -reports <.results[-bin] file path> <.search file path>
             <.r-params file path>

 AutoClass Prediction:
      > autoclass -predict <test.. .db2 file path>
             <training.. .results[-bin] file path>
             <training.. .search file path> <training.. .r-params file path>

This check could even be implemented in wrapper.Run(), which could return the Run object if autoclass is valid, or raise an exception if it isn't found or doesn't work.
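
A minimal sketch of such a check, using only the Python standard library, might look like this. The function names parse_autoclass_version and check_autoclass are hypothetical and do not exist in AutoClassWrapper; the banner format is taken from the output shown above, and the sketch assumes that running autoclass without arguments prints that banner and exits.

```python
import re
import shutil
import subprocess

# Banner format taken from the output above: "AUTOCLASS C (version 3.3.6unx)"
VERSION_PATTERN = re.compile(r"AUTOCLASS C \(version ([^)]+)\)")


def parse_autoclass_version(banner):
    """Return the version string found in the banner, or None."""
    match = VERSION_PATTERN.search(banner)
    return match.group(1) if match else None


def check_autoclass(executable="autoclass", timeout=10):
    """Run the executable once and return its version string.

    Raises RuntimeError if the executable is not in PATH, cannot be
    started (for instance when the 32-bit C libraries are missing),
    or does not print the expected banner.
    """
    path = shutil.which(executable)
    if path is None:
        raise RuntimeError(f"{executable} not found in PATH")
    try:
        completed = subprocess.run(
            [path], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        raise RuntimeError(f"{path} could not be executed: {exc}") from exc
    version = parse_autoclass_version(completed.stdout + completed.stderr)
    if version is None:
        raise RuntimeError(f"{path} did not print the AutoClass C banner")
    return version
```

Raising here, before any run file is written, would turn the silent hang described above into an immediate and explicit failure.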

Cheers,
Robert

ERROR AutoClass C executable not found in path

I have a problem with the error: AutoClass C executable not found in path!
My OS is Ubuntu 20.04. The autoclass file exists (/autoclass-c$ autoclass) and I can call it:

AUTOCLASS C (version 3.3.6unx)

 AutoClass Search: 
      > autoclass -search <.db2[-bin] file path> <.hd2 file path>
             <.model file path> <.s-params file path> 

 AutoClass Reports: 
      > autoclass -reports <.results[-bin] file path> <.search file path> 
             <.r-params file path> 

 AutoClass Prediction: 
      > autoclass -predict <test.. .db2 file path>
             <training.. .results[-bin] file path>
             <training.. .search file path> <training.. .r-params file path> 

but the script can't see the executable file. I ran chmod 777 on it, but nothing changes. I will be grateful for any suggestions.

Peter

Python: 3.8.10 (default, Jul 14 2021, 14:06:22) 
[GCC 9.3.0]
matplotlib: 3.4.3
numpy: 1.21.2
pandas: 1.3.2
AutoClassWrapper: 1.5.1
2021-08-26 19:36:36 INFO     Reading data file 'demo_real_scalar.tsv' as 'real scalar' with error 0.01
2021-08-26 19:36:36 INFO     Detected encoding: ascii
2021-08-26 19:36:36 INFO     Found 300 rows and 2 columns
2021-08-26 19:36:36 DEBUG    Checking column names
2021-08-26 19:36:36 DEBUG    Index name 'name'
2021-08-26 19:36:36 DEBUG    Column name 'x'
2021-08-26 19:36:36 INFO     Checking data format
2021-08-26 19:36:36 INFO     Column 'x'
2021-08-26 19:36:36 INFO     count    300.000000
2021-08-26 19:36:36 INFO     mean       4.321810
2021-08-26 19:36:36 INFO     std        1.410835
2021-08-26 19:36:36 INFO     min        1.604192
2021-08-26 19:36:36 INFO     50%        3.983164
2021-08-26 19:36:36 INFO     max        7.377156
2021-08-26 19:36:36 INFO     ---
2021-08-26 19:36:36 INFO     Reading data file 'demo_real_location.tsv' as 'real location' with error 0.01
2021-08-26 19:36:36 INFO     Detected encoding: ascii
2021-08-26 19:36:36 INFO     Found 300 rows and 2 columns
2021-08-26 19:36:36 DEBUG    Checking column names
2021-08-26 19:36:36 DEBUG    Index name 'name'
2021-08-26 19:36:36 DEBUG    Column name 'y'
2021-08-26 19:36:36 INFO     Checking data format
2021-08-26 19:36:36 INFO     Column 'y'
2021-08-26 19:36:36 INFO     count    300.000000
2021-08-26 19:36:36 INFO     mean       2.985426
2021-08-26 19:36:36 INFO     std        2.313562
2021-08-26 19:36:36 INFO     min       -1.679489
2021-08-26 19:36:36 INFO     50%        3.965726
2021-08-26 19:36:36 INFO     max        6.399967
2021-08-26 19:36:36 INFO     ---
2021-08-26 19:36:36 INFO     Preparing input data
2021-08-26 19:36:36 INFO     Final dataframe has 300 lines and 3 columns
2021-08-26 19:36:36 INFO     Searching for missing values
2021-08-26 19:36:36 INFO     No missing values found
2021-08-26 19:36:36 INFO     Writing autoclass.db2 file
2021-08-26 19:36:36 INFO     If any, missing values will be encoded as '?'
2021-08-26 19:36:36 DEBUG    Writing autoclass.tsv file [for later use]
2021-08-26 19:36:36 INFO     Writing .hd2 file
2021-08-26 19:36:36 INFO     Writing .model file
2021-08-26 19:36:36 INFO     Writing .s-params file
2021-08-26 19:36:36 INFO     Writing .r-params file
2021-08-26 19:36:36 ERROR    AutoClass C executable not found in path!
2021-08-26 19:36:36 INFO     Writing run file
2021-08-26 19:36:36 ERROR    AutoClass C executable not found in path!
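
One possible cause, offered here only as a guess, is that the autoclass-c directory is in the PATH of the interactive shell but not in the PATH seen by the Python process. The standard-library sketch below lists every PATH directory that contains a file with the given name and tells whether that file is executable. The function name diagnose_executable is hypothetical and is not part of AutoClassWrapper.

```python
import os
from pathlib import Path


def diagnose_executable(name="autoclass", path_value=None):
    """List (file path, is_executable) for each PATH directory holding `name`.

    `path_value` defaults to the PATH of the current Python process,
    which may differ from the PATH of the shell used for manual tests.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    findings = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = Path(directory) / name
        if candidate.is_file():
            findings.append((str(candidate), os.access(candidate, os.X_OK)))
    return findings
```

An empty list returned from inside the failing script would suggest that PATH needs to be exported in the environment that launches Python.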

merge_dataframes() necessary even when only one dataframe is present?

Hi there,

I've been playing around with some toy datasets, and from what I can tell it's necessary to call merge_dataframes() even when only one dataframe is part of my dataset.

For example, calling the following ends up leading to an error ERROR 'NoneType' object has no attribute 'to_csv'. Additionally, when run as part of a script this error isn't actually raised, so downstream parts of the script try to run, making it more difficult to figure out the actual source of the problem. Adding in a clust.merge_dataframes() after adding input data fixes the problem, but it's not clear to me why I should have to merge dataframes if I only have one dataframe.

import autoclasswrapper as wrapper
clust = wrapper.Input()
clust.add_input_data('test_data.tsv', 'real scalar', 0.01)
clust.create_db2_file()
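
The behaviour asked for here could be obtained in two ways: merge implicitly when it has not been done, and raise an explicit exception when there is nothing to export. The toy class below illustrates both ideas with plain dictionaries. It is a hypothetical stand-in written for this illustration, not the AutoClassWrapper implementation, and none of its method names belong to the library.

```python
class ToyInput:
    """Toy stand-in for an input builder (NOT the autoclasswrapper code)."""

    def __init__(self):
        self.tables = []    # one dict {column name: list of values} per input
        self.merged = None  # filled by merge_tables()

    def add_table(self, table):
        """Register one input table."""
        self.tables.append(dict(table))

    def merge_tables(self):
        """Combine the columns of all registered tables."""
        merged = {}
        for table in self.tables:
            merged.update(table)
        self.merged = merged

    def export_columns(self):
        """Return the sorted column names that would be written to disk."""
        if not self.tables:
            # Fail loudly instead of logging and carrying on
            raise ValueError("no input data: call add_table() first")
        if self.merged is None:
            # Merge implicitly, so a single table needs no extra call
            self.merge_tables()
        return sorted(self.merged)
```

With this design, a script that forgets the merge step still works with one table, and a script with no data stops at the first faulty call.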

Add AutoClass C as a dependency

Hi Pierre!

I followed the demo (successfully!) and it includes a few steps to install AutoClass C.

wget https://ti.arc.nasa.gov/m/project/autoclass/autoclass-c-3-3-6.tar.gz
tar zxvf autoclass-c-3-3-6.tar.gz
rm -f autoclass-c-3-3-6.tar.gz
export PATH=$PATH:$(pwd)/autoclass-c

# if you use a 64-bit operating system,
# you also need to install the standard 32-bit C libraries:
# sudo apt-get install -y libc6-i386

I would imagine some users might overlook the docs and demo. Considering this, it might be useful to also include these steps under Installation and dependencies in the README.

Cheers,
Robert

API documentation - clarify required parameters.

Reading through the API documentation I'm often unclear on what parameters are required and what aren't - for example, for the Input() class the default value for input_error on add_input_data() is listed as None, so I assumed that I could leave the value blank. However, the following code ends up failing, and it was only through trial and error that I was able to pinpoint the source of the error. Having the API documentation modified so that it's more clear what parameters are required throughout and/or throwing errors when required parameters aren't entered would make it much easier for users to know what's going wrong.

import autoclasswrapper as wrapper
import time

clust = wrapper.Input()
clust.add_input_data('test.tsv', 'real scalar')

clust.merge_dataframes()
clust.create_db2_file()
clust.create_hd2_file()
clust.create_model_file()
clust.create_sparams_file()
clust.create_rparams_file()

wrapper.search_autoclass_in_path()
run = wrapper.Run()
run.create_run_file()
run.run()
time.sleep(20)
results = wrapper.Output()
results.extract_results()
results.aggregate_input_data()
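
The kind of early check requested above could be sketched as follows. The function name check_input_error is hypothetical, and the rule it encodes (real-valued types need an error value, other types do not) is an assumption inferred from the failure described in this issue, not a statement about the library's actual rules.

```python
# Assumption: these input types require an error value
REAL_TYPES = ("real scalar", "real location")


def check_input_error(input_type, input_error=None):
    """Return input_error, or raise if it is missing for a real-valued type."""
    if input_type in REAL_TYPES and input_error is None:
        raise ValueError(
            f"input_error is required for input type '{input_type}'"
        )
    return input_error
```

Called at the start of add_input_data(), a check like this would report the missing parameter at the line that caused it, rather than several steps later.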

Suggestion - use sklearn.cluster class

My suggestion is to create an sklearn-like class with the following methods:

fit - saves data from the dataset as class attributes; these attributes can be used to create the tsv files when the predict or transform methods are called
predict - returns the cluster number
transform - returns the probabilities of belonging to each class
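
A bare skeleton of that interface, without any dependency on scikit-learn, might look like the sketch below. The class name ClusterEstimatorSketch is hypothetical, and the clustering itself is delegated to a user-supplied function (membership_func, also hypothetical) standing in for a call to AutoClass C. The sketch shows only how the three methods relate to each other.

```python
class ClusterEstimatorSketch:
    """Skeleton of an sklearn-like estimator (illustration only).

    membership_func(train_rows, rows) must return, for each row, a list
    of probabilities of belonging to each class. It stands in for the
    real clustering backend, which is not implemented here.
    """

    def __init__(self, membership_func):
        self.membership_func = membership_func
        self.train_rows_ = None

    def fit(self, rows):
        """Store the training data as an attribute and return self."""
        self.train_rows_ = [list(row) for row in rows]
        return self

    def transform(self, rows):
        """Return the class membership probabilities for each row."""
        if self.train_rows_ is None:
            raise RuntimeError("call fit() before transform() or predict()")
        return self.membership_func(self.train_rows_, rows)

    def predict(self, rows):
        """Return the index of the most probable class for each row."""
        return [
            max(range(len(probs)), key=probs.__getitem__)
            for probs in self.transform(rows)
        ]
```

Defining predict as the most probable class from transform keeps the two outputs consistent with each other.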
