uio-bmi / immuneml

immuneML is a platform for machine learning analysis of adaptive immune receptor repertoire data.

Home Page: https://immuneml.uio.no

License: GNU Affero General Public License v3.0

Dockerfile 0.04% Python 97.72% Shell 0.01% HTML 2.15% CSS 0.08%

adaptive-immune-receptors airr bcr benchmarking classification immune-repertoire machine-learning-analysis tcr

immuneml's Issues

error when exporting evenness encoded data as design matrix

Hi!

I am trying to export evenness encoding as design matrix using the following yaml specification:

I get the following error:

Could you help me figure out what is wrong here? Are the evenness encoding and the design matrix exporter incompatible? I am using version 1.02. Thanks!

Error while running the quickstart analysis

Hello,
I've just started using this package and the installation went well. When I tried to run the quickstart analysis I kept running into the error shown below.

I think the error happens when the program tries to read the synthetic dataset in AIRR format, but there is some issue with the way the columns are specified.

I investigated the synthetic airr file rep_0.tsv and found that the sequence_id column has some weird issues. This is an example of the file contents:

sequence_id	sequence	rev_comp	productive	v_call	d_call	j_call	sequence_alignment	germline_alignment	junction	junction_aa	v_cigar	d_cigar	j_cigar	cdr3_aa	locus	duplicate_count	vj_in_frame	stop_codon	my_signal
6			T	TRBV1-1*01		TRBJ1-1*01								FYRVSIWQQENE	TRB	1	T	F	False
95208f3bd4b24b45b5120567057adffe			T	TRBV1-1*01		TRBJ1-1*01								LWAARKFVRG	TRB	1	T	F	True
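A hedged aside on reading the error: the traceback further down reports `Bool column has NA values in column 2`. If pandas counts columns from zero in that message (an assumption), column 2 would be `rev_comp` rather than `sequence_id`, and `rev_comp` is empty in both example rows. The stdlib-only sketch below counts empty cells per column in tab-separated text. The function name `count_empty_cells`, the list of boolean columns, and the abbreviated sample are illustrative and not part of immuneML.

```python
import csv
import io
from collections import Counter


def count_empty_cells(tsv_text, columns):
    """Return {column: number of rows in which that column is empty}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    empty = Counter()
    for row in reader:
        for col in columns:
            if (row.get(col) or "").strip() == "":
                empty[col] += 1
    return dict(empty)


# Abbreviated sample modelled on the two rows above (only a few columns kept).
SAMPLE = (
    "sequence_id\tsequence\trev_comp\tproductive\tvj_in_frame\tstop_codon\n"
    "6\t\t\tT\tT\tF\n"
    "95208f3bd4b24b45b5120567057adffe\t\t\tT\tT\tF\n"
)

# Columns assumed to be parsed as booleans; adjust to the file at hand.
BOOL_COLUMNS = ["rev_comp", "productive", "vj_in_frame", "stop_codon"]

print(count_empty_cells(SAMPLE, BOOL_COLUMNS))
```

For a real file, the text of rep_0.tsv can be read with `open(path).read()` and passed in place of `SAMPLE`.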

This is my output.
Any help is appreciated.

(immuneml_env) [immuneML]$ immune-ml-quickstart ./quickstart_results/
immuneML quickstart: generating a synthetic dataset...
2024-05-05 20:22:13.029352: Setting temporary cache path to quickstart_results/synthetic_dataset/result/cache
2024-05-05 20:22:13.029383: ImmuneML: parsing the specification...

2024-05-05 20:22:13.752929: Imported repertoire dataset my_synthetic_dataset with 100 examples.
2024-05-05 20:22:13.876557: Full specification is available at quickstart_results/synthetic_dataset/result/full_simulation_specs.yaml.

2024-05-05 20:22:13.876602: ImmuneML: starting the analysis...

2024-05-05 20:22:13.876629: Instruction 1/1 has started.
2024-05-05 20:22:15.137774: Instruction 1/1 has finished.
2024-05-05 20:22:15.151792: Generating HTML reports...
2024-05-05 20:22:15.194902: HTML reports are generated.
2024-05-05 20:22:15.195323: ImmuneML: finished analysis.

immuneML quickstart: finished generating a synthetic dataset.
immuneML quickstart: training a machine learning model...
2024-05-05 20:22:15.201168: Setting temporary cache path to quickstart_results/machine_learning_analysis/result/cache
2024-05-05 20:22:15.201184: ImmuneML: parsing the specification...

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 183, in load_sequence_dataframe
    df = alternative_load_func(filepath, params)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/IO/dataset_import/AIRRImport.py", line 159, in alternative_load_func
    df = airr.load_rearrangement(filename)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/airr/interface.py", line 103, in load_rearrangement
    df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1036, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1075, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1220, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Bool column has NA values in column 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 164, in load_repertoire_as_object
    dataframe = ImportHelper.load_sequence_dataframe(filename, params, alternative_load_func)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 187, in load_sequence_dataframe
    raise Exception(f"{ex}\n\nImportHelper: an error occurred during dataset import while parsing the input file: {filepath}.\n"
Exception: Bool column has NA values in column 2

ImportHelper: an error occurred during dataset import while parsing the input file: quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/rep_0.tsv.
Please make sure this is a correct immune receptor data file (not metadata).
The parameters used for import are DatasetImportParams(path=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), is_repertoire=True, metadata_file=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), paired=False, receptor_chains=None, result_path=PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1'), columns_to_load=None, separator='\t', column_mapping={'junction': 'sequences', 'junction_aa': 'sequence_aas', 'v_call': 'v_alleles', 'j_call': 'j_alleles', 'locus': 'chains', 'duplicate_count': 'counts', 'sequence_id': 'sequence_identifiers'}, column_mapping_synonyms=None, region_type=<RegionType.IMGT_CDR3: 'IMGT_CDR3'>, import_productive=True, import_unproductive=None, import_with_stop_codon=False, import_out_of_frame=False, import_illegal_characters=False, metadata_column_mapping=None, number_of_processes=1, sequence_file_size=50000, organism=None, import_empty_nt_sequences=True, import_empty_aa_sequences=False).
For technical description of the error, see the log above. For details on how to specify the dataset import, see the documentation.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 177, in load_repertoire_as_object
    raise RuntimeError(f"{ImportHelper.__name__}: error when importing file {metadata_row['filename']}.") from exception
RuntimeError: ImportHelper: error when importing file rep_0.tsv.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 60, in _parse_dataset
    dataset = import_cls.import_dataset(params, key)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/IO/dataset_import/AIRRImport.py", line 109, in import_dataset
    return ImportHelper.import_dataset(AIRRImport, params, dataset_name)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 49, in import_dataset
    dataset = ImportHelper.import_repertoire_dataset(import_class, processed_params, dataset_name)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 95, in import_repertoire_dataset
    repertoires = pool.starmap(ImportHelper.load_repertoire_as_object, arguments)
  File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
RuntimeError: ImportHelper: error when importing file rep_0.tsv.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 10, in wrapped
    return func(*args, **kwargs)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 70, in _parse_dataset
    raise Exception(f"{ex}\n\nAn error occurred while parsing the dataset {key}. See the log above for more details.")
Exception: ImportHelper: error when importing file rep_0.tsv.

An error occurred while parsing the dataset d1. See the log above for more details.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".conda/envs/immuneml_env/bin/immune-ml-quickstart", line 11, in <module>
    sys.exit(main())
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/workflows/instructions/quickstart.py", line 167, in main
    quickstart.run(sys.argv[1] if len(sys.argv) == 2 else None)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/workflows/instructions/quickstart.py", line 160, in run
    app.run()
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 44, in run
    symbol_table, self._specification_path = ImmuneMLParser.parse_yaml_file(self._specification_path, self._result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 119, in parse_yaml_file
    symbol_table, path = ImmuneMLParser.parse(workflow_specification, file_path, result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 141, in parse
    def_parser_output, specs_defs = DefinitionParser.parse(workflow_specification, symbol_table, result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/definition_parsers/DefinitionParser.py", line 48, in parse
    symbol_table, specs_import = ImportParser.parse(specs, symbol_table, result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 27, in parse
    symbol_table = ImportParser._parse_dataset(key, workflow_specification[ImportParser.keyword][key], symbol_table, result_path)
  File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 14, in wrapped
    raise Exception(f"{e}\n\n"
Exception: ImportHelper: error when importing file rep_0.tsv.

An error occurred while parsing the dataset d1. See the log above for more details.

ImmuneMLParser: an error occurred during parsing in function _parse_dataset  with parameters: ('d1', {'format': 'AIRR', 'params': {'is_repertoire': True, 'path': PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), 'paired': False, 'import_productive': True, 'import_with_stop_codon': False, 'import_out_of_frame': False, 'import_illegal_characters': False, 'region_type': 'IMGT_CDR3', 'separator': '\t', 'column_mapping': {'junction': 'sequences', 'junction_aa': 'sequence_aas', 'v_call': 'v_alleles', 'j_call': 'j_alleles', 'locus': 'chains', 'duplicate_count': 'counts', 'sequence_id': 'sequence_identifiers'}, 'import_empty_nt_sequences': True, 'import_empty_aa_sequences': False, 'metadata_file': PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), 'result_path': PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1')}}, SymbolTable(), PosixPath('quickstart_results/machine_learning_analysis/result')).

For more details on how to write the specification, see the documentation. For technical description of the error, see the log above.

ImmuneML: parsing the specification...

I attempted to train a model with ImmuneSEQRearrangement data using ImmuneML. The process has been running for more than 5 hours without producing any additional details or output. It seems to be stuck, and I'm unable to determine the status of the task.

Logs in docker:
2024-01-15 18:14:11 2024-01-15 12:44:11.790241: Setting temporary cache path to data/results/cache
2024-01-15 18:14:11 2024-01-15 12:44:11.791161: ImmuneML: parsing the specification...
2024-01-15 18:14:11

Log.txt in results folder:
2024-01-15 12:44:11,789 INFO: Setting temporary cache path to data/results/cache
2024-01-15 12:44:11,790 INFO: ImmuneML: parsing the specification...

2024-01-15 12:44:11,803 INFO: --- Entering: parse with parameters ({}, SymbolTable())
2024-01-15 12:44:11,803 INFO: --- Exiting: parse
2024-01-15 12:44:11,803 INFO: --- Entering: parse_encoder with parameters ('encoding_1', {'KmerFrequency': {'k': 3, 'reads': 'all', 'sequence_encoding': 'CONTINUOUS_KMER'}})
2024-01-15 12:44:11,970 INFO: --- Exiting: parse_encoder
2024-01-15 12:44:11,970 INFO: --- Entering: _parse_ml_method with parameters ('k_nearest_neighbors', {'KNN': {'n_neighbors': [3, 5, 7], 'show_warnings': False}, 'model_selection_cv': True, 'model_selection_n_folds': 5})

please add a version parameter

Please add a parameter for immuneml to output its version. This is useful for provenance.

immuneml -v
immuneml --version

As you are using Python and git, maybe use versioneer, like the airr-standards library.
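For illustration, a minimal sketch of how such a flag could look using only the Python standard library (`argparse` with `action="version"`, and `importlib.metadata` to look up the installed version). The program name `example-tool` and the helper names are hypothetical; this is not immuneML's actual command-line code, and it does not use versioneer.

```python
import argparse
from importlib import metadata


def installed_version(dist_name):
    """Look up a distribution's version, or 'unknown' if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "unknown"


def build_parser(version_string):
    # "example-tool" is a placeholder program name, not immuneML's real CLI.
    parser = argparse.ArgumentParser(prog="example-tool")
    parser.add_argument(
        "-v", "--version",
        action="version",  # prints the version string and exits
        version=f"%(prog)s {version_string}",
    )
    return parser
```

Calling `build_parser(installed_version("immuneML")).parse_args(["--version"])` would print the program name followed by the version and then exit, which covers both the `-v` and `--version` forms requested above.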

More YAML examples, please

Dear immuneML team:

Thanks for your amazing immuneML tool. I am using the local version of immuneML for repertoire classification. In the official tutorial (https://docs.immuneml.uio.no/latest/installation.html) I found only a few complete example YAML files, using k-mer encoding or sequence abundance. There are detailed introductions to the various encodings and ML methods, but how to correctly combine different 'encodings' with compatible 'ml_methods' and 'instructions' into a complete working YAML file is still confusing to me, especially when using Word2Vec/DeepRC/TCRdist encoding for repertoire classification.

If you could provide as many YAML files as possible that combine different encodings and ML methods for repertoire classification, I would very much appreciate your help.

Best regards,

Matches report with MatchedRegex encoder encountered an error and could not be generated

I am trying out the Matches report with the MatchedRegex encoding. According to the attached log.txt file, an error was encountered and no report was created; the stdout.txt file did not show any error. The YAML file (small_emerson_dat_exploratory_yaml.txt) and a report-specific motif file (motif_file.txt, https://github.com/uio-bmi/immuneML/files/5471005/motif_file.txt) are attached, along with the immuneML imported data (immuneml_imported_data). Could you help me figure out what I am doing wrong here? Thanks!

Galaxy interface trims CDR3 residues in create dataset

Hello, when generating immuneML datasets through the Galaxy interface, I am unable to specify trim_leading_trailing: false in the 'simple' parameter mode. I would like more control by using my own .yaml file in the Galaxy Create Dataset interface.

However, I am also unable to upload my own .yaml specification for this purpose, because it seems the data files are loaded from some temporary cache whose file path I do not know.

Please let me know if this query makes sense; if not, I am happy to elaborate.

perhaps not fail when not supplying sequence position weights in full sequence implantation

Add check to ML parser for positive class setting

When using ProbabilisticBinaryClassifier, whether the positive class is set for the label should be checked at parsing time rather than after encoding, since encoding can take a long time. The documentation for the classifier should also state that it relies on the positive class parameter being set under the label.
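A minimal sketch of what such a fail-fast check could look like, assuming the label settings are available at parse time as a plain dictionary and that the key is called `positive_class`. Both the function name `check_positive_class` and the dictionary layout are hypothetical and do not reflect immuneML's actual parser.

```python
def check_positive_class(label_name, label_params):
    """Raise at parse time if a label has no positive class configured."""
    # label_params is assumed to be the dict found under the label in the
    # specification; the key name "positive_class" is an assumption.
    if not isinstance(label_params, dict) or label_params.get("positive_class") is None:
        raise ValueError(
            f"Label '{label_name}' has no positive class set, but the chosen "
            f"classifier requires one. Set it under the label before running."
        )
    return label_params["positive_class"]
```

Running this kind of check while the specification is parsed would surface the misconfiguration before any encoding work starts.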

KeyError in exploratory analysis

I am trying to run this YAML instruction (repertoire_implanting_rate__0.001_-_ml_instruction_yaml.txt) using this dataset and a file required by a specific report (motif_file_2.txt). I encounter the following error (stderr.txt). Attached is immuneML's log file (log.txt). Could you help me figure out what is going wrong here? Thanks!

ImmuneML requires scikit-learn==1.2.2 to work, the default pip install uses scikit-learn==1.3.0

Hello,
I noticed that the default process of installing immuneML with pip did not work at first because the statement "from sklearn.metrics import SCORERS" (in the file /immuneML/ml_methods/SklearnMethod.py) causes a runtime error (shown in the attached photo).

It appears that this import is no longer supported by scikit-learn. After changing my scikit-learn version from the default (most recent) version 1.3.0 to version 1.2.2, immuneML appeared to work as intended.

I was using Python 3.9.6 in a virtual environment on a Linux machine (Red Hat Enterprise server), with pip version 23.2.1. I believe all the other installed packages were the most recent available versions.

Just wanted to let you guys know, as it had me confused for a minute,
Thanks!
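As a rough illustration of the report above, the sketch below checks the installed scikit-learn version with the standard library only and flags versions from 1.3 onward. The 1.3 threshold is taken from this report (1.2.2 worked, 1.3.0 did not); the helper names are hypothetical and nothing here is part of immuneML.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '1.3.0' into (1, 3, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def scorers_import_expected_to_fail(sklearn_version):
    # Assumption based on the report above: 1.2.2 works, 1.3.0 does not.
    return version_tuple(sklearn_version) >= (1, 3)


def installed_sklearn_version():
    """Return the installed scikit-learn version, or None if it is absent."""
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None


found = installed_sklearn_version()
if found is not None and scorers_import_expected_to_fail(found):
    print(f"scikit-learn {found} found; version 1.2.2 was reported to work.")
```

The workaround described in the report corresponds to pinning the dependency, for example with `pip install scikit-learn==1.2.2`.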

WARNING: ABCMeta: chain was not set for sequence 0, skipping the sequence for matching...

I am trying to use the MatchedRegex encoding with the Matches report, using the attached YAML specification file, the immuneML imported data to reproduce the problem, and the motif file for the MatchedRegex encoding. The log file shows WARNING: ABCMeta: chain was not set for sequence 0, skipping the sequence for matching..., apparently for all sequences. I attach the output files of the Matches report, which show zero counts across all repertoires. Could you help me figure out the issue?
complete_match_count_table_csv.txt
repertoire_sizes_csv.txt

perhaps add a warning to users in docs in relevant sections of full sequence implantation that data import by default trims first and last amino acid residues

If the full sequences being implanted contain these first and last residues, ML models will perhaps just learn that.

ValueError: RegionType NaN

Hi @pavlovicmilena! I am using the attached YAML specification file and I get a ValueError in the RegionType check.
The stderr is attached. Could you help me figure out the error? Thanks.

ImmuneML galaxy-error in training model

cp: cannot stat '/storage/007/dataset_7020_files/result/': No such file or directory
or
cp: cannot stat '/result/': No such file or directory

An error occurred while running the tool toolshed.g2.bx.psu.edu/repos/immuneml/immuneml_tools/immuneml_train_classifiers/2.2.0.0.

Details
Tool generated the following standard error:

cp: cannot stat '/storage/007/dataset_7020_files/result/*': No such file or directory

IMGT positions are computed wrong

The correct numbering is shown here: https://www.imgt.org/IMGTScientificChart/Numbering/IMGTIGVLsuperfamily.html

We add 0.001 for each position (we get e.g. [112.005, 112.01, 112.15] when it should be [112.1, 112.10, 112.15...]).
This should be a small fix, but we need to take a thorough look to see whether any other parts of the code depend on the way we currently do it.
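One thing the expected list hints at is that 112.1 and 112.10 are meant to be different labels, which a float cannot represent. The sketch below is one possible reading only: it assumes insertion positions are labelled as `<position>.<index>` strings. The helper `insertion_labels` is hypothetical, and the actual ordering of insertions in IMGT numbering is not modelled here.

```python
def insertion_labels(position, count):
    """Build string labels such as '112.1', '112.2', ... for one position."""
    return [f"{position}.{i}" for i in range(1, count + 1)]


labels = insertion_labels(112, 10)

# As strings, '112.1' and '112.10' stay distinct ...
distinct_as_strings = len(set(labels))

# ... but converted to floats they collapse into the same value.
distinct_as_floats = len({float(label) for label in labels})

print(distinct_as_strings, distinct_as_floats)
```

Keeping the labels as strings (or as integer pairs of position and insertion index) avoids this collision, at the cost of needing an explicit sort key wherever positions are ordered.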
