
semantra's Introduction

Semantra

(Demo video: semantra_demo_vid.mov)

Semantra is a multipurpose tool for semantically searching documents. Query by meaning rather than just by matching text.

The tool, made to run on the command line, analyzes specified text and PDF files on your computer and launches a local web search application for interactively querying them. The purpose of Semantra is to make running a specialized semantic search engine easy, friendly, configurable, and private/secure.

Semantra is built for individuals seeking needles in haystacks — journalists sifting through leaked documents on deadline, researchers seeking insights within papers, students engaging with literature by querying themes, historians connecting events across books, and so forth.

Resources

  • Tutorial: a gentle introduction to getting started with Semantra — everything from installing the tool to hands-on examples of analyzing documents with it
  • Guides: practical guides on how to do more with Semantra
  • Concepts: Explainers on some concepts to better understand how Semantra works
  • Using the web interface: A reference on how to use the Semantra web app

This page gives a high-level overview of Semantra and a reference of its features. It's also available in other languages: Semantra en español, Semantra 中文说明

Installation

Ensure you have Python >= 3.9.

The easiest way to install Semantra is via pipx. If you do not have pipx installed, run:

python3 -m pip install --user pipx

Or, if you have Homebrew installed, you can run brew install pipx.

Once pipx is installed, run:

python3 -m pipx ensurepath

Open a new terminal window so the path settings pipx configures take effect. Then run:

pipx install semantra

This will install Semantra on your path. You should be able to run semantra in the terminal and see output.

Note: if the above steps don't work or you'd like a more granular installation, you can install Semantra in a virtual environment (though it will only be accessible while the virtual environment is activated):

python3 -m venv venv
source venv/bin/activate
pip install semantra

Usage

Semantra operates on collections of documents — text or PDF files — stored on your local computer.

At its simplest, you can run Semantra over a single document by running:

semantra doc.pdf

You can run Semantra over multiple documents, too:

semantra report.pdf book.txt

Semantra will take some time to process the input documents. This is a one-time operation per document (subsequent runs over the same document collection will be near-instantaneous).
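
For example, to see where Semantra caches these processed results, or to force a document to be reprocessed, you can use the --show-semantra-dir and --force flags from the command-line reference below (doc.pdf stands in for any document name):

semantra --show-semantra-dir
semantra --force doc.pdf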

Once processing is complete, Semantra will launch a local webserver, by default at localhost:8080. On this web page, you can interactively query the passed-in documents semantically.

Quick notes:

When you first run Semantra, it may take several minutes and several hundred megabytes of hard disk space to download a local machine learning model that can process the documents you passed in. The model used can be customized, but the default is a good mix of fast, lean, and effective.

If you want to process documents quickly without using your own computational resources and don't mind paying or sharing data with external services, you can use OpenAI's embedding model.
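
As a minimal sketch (your-key-here is a placeholder for your actual OpenAI API key):

export OPENAI_API_KEY=your-key-here
semantra --model openai doc.pdf

Semantra will show the estimated cost and ask for confirmation before processing (see the --no-confirm flag in the reference below).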

Quick tour of the web app

When you first navigate to the Semantra web interface, you will see a screen like this:

Semantra web interface

Type something into the search box to start querying semantically. Hit Enter or click the search icon to execute the query.

Search results will appear in the left pane, ordered by the most relevant documents:

Semantra search results

The yellow scores show relevance from 0 to 1.00; anything in the 0.50 range indicates a strong match. Lighter brown highlights will stream in over the search results, marking the portions most relevant to your query.

Clicking on a search result's text will navigate to the relevant section of the associated document.

Highlighted search result in document

Clicking the plus/minus buttons associated with a search result will positively/negatively tag that result. Re-running the query will cause these additional query parameters to take effect.

Positively/negatively tagging search results

Finally, text queries can be added and subtracted with plus/minus signs in the query text to sculpt a precise semantic meaning.

Adding and subtracting text queries
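
For instance, a purely illustrative query like the one below steers results toward the concept of flooding while steering them away from insurance-related passages:

flood damage + rising rivers - insurance claims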

For a more in-depth walkthrough of the web app, check out the tutorial or the web app reference.

Quick concepts

Using a semantic search engine is fundamentally different from using an exact text-matching algorithm.

For starters, there will always be search results for a given query, no matter how irrelevant the query is. The scores may be very low, but the results will never disappear entirely. This is because semantic searching, especially with query arithmetic, often reveals useful results amid very minor score differences. Results are always sorted by relevance, and only the top 10 results per document are shown, so lower-scoring results are cut off automatically.

Another difference is that Semantra will not necessarily find exact text matches if you query something that directly appears in the document. At a high level, this is because words can mean different things in different contexts, e.g. the word "leaves" can refer to the leaves on trees or to someone leaving. The embedding models that Semantra uses convert all the text and queries you enter into long sequences of numbers that can be mathematically compared, and an exact substring match is not always significant in this sense. See the embeddings concept doc for more information on embeddings.
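
To make this concrete, here is a minimal sketch of embedding comparison using the sentence-transformers package and the sentence-transformers/all-mpnet-base-v2 model (the model behind the default mpnet preset). This is an illustration, not Semantra's internal code:

# Illustrative sketch of embedding similarity -- not Semantra's actual implementation
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

passages = [
    "The leaves turned red and fell from the trees.",
    "She leaves the office every day at five.",
]
query = "autumn foliage"

# Texts and queries become vectors that can be compared mathematically
passage_embeddings = model.encode(passages)
query_embedding = model.encode(query)

# Cosine similarity scores each passage against the query; the botanical
# "leaves" sentence scores higher even though both contain the word "leaves"
print(util.cos_sim(query_embedding, passage_embeddings))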

Command-line reference

semantra [OPTIONS] [FILENAME(S)]...

Options

  • --model [openai|minilm|mpnet|sgpt|sgpt-1.3B]: Preset model to use for embedding. See the models guide for more info (default: mpnet)
  • --transformer-model TEXT: Custom Hugging Face transformers model name to use for embedding (only one of --model and --transformer-model should be specified). See the models guide for more info
  • --windows TEXT: Embedding windows to extract. A comma-separated list of the format "size[_offset=0][_rewind=0]". A window with size 128, offset 0, and rewind of 16 (128_0_16) will embed the document in chunks of 128 tokens that partially overlap by 16 tokens. Only the first window is used for search. See the windows concept doc for more information, and the example following this list (default: 128_0_16)
  • --encoding TEXT: Encoding to use for reading text files (default: utf-8)
  • --no-server: Do not start the UI server (only process)
  • --port INTEGER: Port to use for embedding server (default: 8080)
  • --host TEXT: Host to use for embedding server (default: 127.0.0.1)
  • --pool-size INTEGER: Max number of embedding tokens to pool together in requests
  • --pool-count INTEGER: Max number of embeddings to pool together in requests
  • --doc-token-pre TEXT: Token to prepend to each document in transformer models (default: None)
  • --doc-token-post TEXT: Token to append to each document in transformer models (default: None)
  • --query-token-pre TEXT: Token to prepend to each query in transformer models (default: None)
  • --query-token-post TEXT: Token to append to each query in transformer models (default: None)
  • --num-results INTEGER: Number of results (neighbors) to retrieve per file for queries (default: 10)
  • --annoy: Use approximate kNN via Annoy for queries (faster querying at a slight cost of accuracy); if false, use exact exhaustive kNN (default: True)
  • --num-annoy-trees INTEGER: Number of trees to use for approximate kNN via Annoy (default: 100)
  • --svm: Use SVM instead of any kind of kNN for queries (slower and only works on symmetric models)
  • --svm-c FLOAT: SVM regularization parameter; higher values penalize mispredictions more (default: 1.0)
  • --explain-split-count INTEGER: Number of splits on a given window to use for explaining a query (default: 9)
  • --explain-split-divide INTEGER: Factor to divide the window size by to get each split length for explaining a query (default: 6)
  • --num-explain-highlights INTEGER: Number of split results to highlight for explaining a query (default: 2)
  • --force: Force process even if cached
  • --silent: Do not print progress information
  • --no-confirm: Do not show cost and ask for confirmation before processing with OpenAI
  • --version: Print version and exit
  • --list-models: List preset models and exit
  • --show-semantra-dir: Print the directory semantra will use to store processed files and exit
  • --semantra-dir PATH: Directory to store semantra files in
  • --help: Show this message and exit
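
These options compose like ordinary command-line flags. For illustration (the file names here are hypothetical):

# Embed in 256-token chunks that overlap by 32 tokens and serve on port 9000
semantra --windows 256_0_32 --port 9000 report.pdf notes.txt

# Process documents without launching the web server
semantra --no-server book.txt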

Frequently asked questions

Can it use ChatGPT?

No, and this is by design.

Semantra does not use any generative models like ChatGPT. It is built only to query text semantically, without any layers on top that attempt to explain, summarize, or synthesize results. Generative language models occasionally produce outwardly plausible but ultimately incorrect information, placing the burden of verification on the user. Semantra treats primary source material as the only source of truth and endeavors to show that a human-in-the-loop search experience on top of simpler embedding models serves users better.

Development

The Python app is in src/semantra/semantra.py and is managed as a standard Python command-line project with pyproject.toml.

The local web app is written in Svelte and managed as a standard npm application.

To develop the web app, cd into client and then run npm install.

To build the web app, run npm run build. To build the web app in watch mode, rebuilding when there are changes, run npm run build:watch.

Contributions

The app is still in early stages, but contributions are welcome. Please feel free to submit an issue for any bugs or feature requests.

semantra's People

Contributors: boterocamilo, freedmand, islq, kianmeng, yych42

semantra's Issues

File encoding issues

I reported a GBK encoding error. I've already used the command chcp 65001 to switch to UTF-8, but the error still exists.

C:\Users\Ye>semantra C:\Users\Ye\Documents\jqac033.pdf
jqac033.pdf:   0%|                                                                               | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Ye\.local\bin\semantra.exe\__main__.py", line 7, in <module>
  File "C:\Users\Ye\.local\pipx\venvs\semantra\Lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ye\.local\pipx\venvs\semantra\Lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Ye\.local\pipx\venvs\semantra\Lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ye\.local\pipx\venvs\semantra\Lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ye\.local\pipx\venvs\semantra\Lib\site-packages\semantra\semantra.py", line 594, in main
    documents[fn] = process(
                    ^^^^^^^^
  File "C:\Users\Ye\.local\pipx\venvs\semantra\Lib\site-packages\semantra\semantra.py", line 146, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ye\.local\pipx\venvs\semantra\Lib\site-packages\semantra\semantra.py", line 45, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ye\.local\pipx\venvs\semantra\Lib\site-packages\semantra\pdf.py", line 79, in get_pdf_content
    position += f.write(pagetext)
                ^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'gbk' codec can't encode character '\ufffe' in position 1353: illegal multibyte sequence

Flatpak or AppImage installer for Linux

Thanks for this great project! For Linux users, please consider offering a Flatpak or AppImage installer (or even a snap package); this would make it much more accessible. With Python and other key dependencies bundled in the installer, this can avoid issues in contexts where there is no current or future flexibility with regard to the system Python.

Colab and Languages

Hey there! Is it possible to run this on Google Colab? And does this support other languages, like Portuguese?

Newbie question. I'm not sure how to run the Web UI server without the embedding process.

Hello.

I've already embedded all the necessary documents, and now I just need to turn on the web UI without going through the embedding process. Even after looking at the --help output, I couldn't find an option for 'running the server without the embedding process'; it seems to only provide the opposite, 'embedding without running the server' (--no-server).

I'm sorry for asking such a basic question, but if someone could kindly provide an answer, I would greatly appreciate it. (I understand that it might be a well-known method that doesn't require explanation, but as a true beginner, I need the explanation.)

Thanks in advance.

Show sources of results

I love the program, but how can I quickly find matching results on the right side? Is it only possible to use Ctrl+F in the browser to search for keywords in the entire text? Can I click on the left-hand result to locate its paragraph on the right? Also, browsing can be slow for larger texts.

PDF viewer support higher resolution rendering

The present PDF viewer displays PDF files at a comparatively low resolution. It would be advantageous for users with high PPI monitors if the resolution could be adjusted adaptively or configured manually, in order to optimize the viewing experience.

No such option: --query_token_pre

I'm following instructions to load:

(venv) PS D:\AI\Semantra> semantra --transformer-model Muennighoff/SGPT-1.3B-weightedmean-msmarco-specb-bitfit --query_token_pre='[' --query_token_post=']' --doc_token_pre='{' --doc_token_post='}' url

but getting error:

Usage: semantra [OPTIONS] [FILENAME]...
Try 'semantra --help' for help.

Error: No such option: --query_token_pre (Possible options: --doc-token-pre, --query-token-post, --query-token-pre)

Windows 11

Watch folder

Nice job, app works nicely!
Some questions & suggestions; maybe I missed some answers in the docs:

It would be wonderful to have a "watch folder", where documents can be added and then are automatically added to next search results.
It would be useful to have embeddings for the documents permanently stored, especially those from openai, to avoid duplicate charges.
It would be nice to have models, document embeddings in user-specified directory.

Import PDF files from a dir

@freedmand

Good job! Semantra runs smoothly on my Linux PC!

I think command options such as:

semantra [dir]
semantra [dir1] [dir2] [....]

which can import one or more dirs containing many PDF files, would be useful and helpful.

OpenAI key problem?

Error:
Traceback (most recent call last):
  File "/Users/kidthecat/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
  File "/Users/kidthecat/.local/pipx/venvs/semantra/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/kidthecat/.local/pipx/venvs/semantra/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/kidthecat/.local/pipx/venvs/semantra/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/kidthecat/.local/pipx/venvs/semantra/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/kidthecat/.local/pipx/venvs/semantra/lib/python3.9/site-packages/semantra/semantra.py", line 581, in main
    model: BaseModel = model_config["get_model"]()
  File "/Users/kidthecat/.local/pipx/venvs/semantra/lib/python3.9/site-packages/semantra/models.py", line 320, in <lambda>
    "get_model": lambda: OpenAIModel(
  File "/Users/kidthecat/.local/pipx/venvs/semantra/lib/python3.9/site-packages/semantra/models.py", line 116, in __init__
    raise Exception(
Exception: OpenAI API key not set. Please set the OPENAI_API_KEY environment variable or create a .env file with the key in the same directory.

I tried .env and exporting the key, but something is wrong. I have other OpenAI keys working with Auto-GPT, so the keys do work.

Default literature folder

Hi

I was wondering if it is possible to create a 'default' location where Semantra looks for files if no argument is given on startup. I would like to just run semantra instead of specifying semantra ~/literature/* every time.

Thanks!

Support Microsoft Office file formats

Most of the documents I would like to search are in ppt or pptx format (PowerPoint).
It would be nice if PowerPoint and Word documents could be indexed, even without a preview option.

Error encountered while using model "sentence-transformers/all-mpnet-base-v2"

Hello, I am encountering an error while trying to use the model "sentence-transformers/all-mpnet-base-v2" in a script.

(base) PS F:\Download> semantra hamlet.pdf
Traceback (most recent call last):
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\utils\_errors.py", line 259, in hf_raise_for_status
    response.raise_for_status()
  File "E:\software\Anaconnda\lib\site-packages\requests\models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\software\Anaconnda\lib\site-packages\transformers\utils\hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\utils\_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\file_download.py", line 1195, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\utils\_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\file_download.py", line 1541, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\utils\_errors.py", line 291, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-64599c33-4e9b1c4a0839489551a6eee6)

Repository Not Found for url: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/tokenizer_config.json.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\software\Anaconnda\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "E:\software\Anaconnda\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "E:\software\Anaconnda\Scripts\semantra.exe\__main__.py", line 7, in <module>
  File "E:\software\Anaconnda\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "E:\software\Anaconnda\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "E:\software\Anaconnda\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\semantra\semantra.py", line 598, in main
    model: BaseModel = model_config["get_model"]()
  File "E:\software\Anaconnda\lib\site-packages\semantra\models.py", line 334, in <lambda>
    "get_model": lambda: TransformerModel(model_name=mpnet_model_name),
  File "E:\software\Anaconnda\lib\site-packages\semantra\models.py", line 166, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
  File "E:\software\Anaconnda\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 642, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 486, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "E:\software\Anaconnda\lib\site-packages\transformers\utils\hub.py", line 424, in cached_file
    raise EnvironmentError(
OSError: sentence-transformers/all-mpnet-base-v2 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True.

I have attempted to resolve the issue by ensuring I am using the correct model identifier and checking my internet access. I have also tried logging in with the Hugging Face CLI before running the script.

However, the error persists. Any assistance in resolving this issue would be greatly appreciated.

Environment information:

  • Operating System: [Windows11]

PDF encoding error

Windows 10 system

❯ semantra.exe 38411-f00-NG-layer-1.pdf
38411-f00-NG-layer-1.pdf:   0%|                                                                                                                                                                                      | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\users\loseblue\.local\bin\semantra.exe\__main__.py", line 7, in <module>
  File "C:\Users\loseblue\.local\pipx\venvs\semantra\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\loseblue\.local\pipx\venvs\semantra\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\loseblue\.local\pipx\venvs\semantra\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\loseblue\.local\pipx\venvs\semantra\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\loseblue\.local\pipx\venvs\semantra\lib\site-packages\semantra\semantra.py", line 594, in main
    documents[fn] = process(
  File "C:\Users\loseblue\.local\pipx\venvs\semantra\lib\site-packages\semantra\semantra.py", line 146, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent)
  File "C:\Users\loseblue\.local\pipx\venvs\semantra\lib\site-packages\semantra\semantra.py", line 45, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
  File "C:\Users\loseblue\.local\pipx\venvs\semantra\lib\site-packages\semantra\pdf.py", line 79, in get_pdf_content
    position += f.write(pagetext)
UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 422: illegal multibyte sequence


Plug-in model

I think it would be great if the document format support was done through a plug-in system. I'd like to add support for epub, and could do a PR here, but having a plugin API seems like a better approach. The PDF support could be done that way too, as well as the requested MS Word support, and maybe even things like GitHub issue support.

Improve large document collection search performance

I recently conducted a search using a collection of over 8,000 PDF files (primarily 8-10 pages in length), and I am happy to report that the search results were successfully generated. However, I noticed that upon clicking on an item in the left panel displaying the search result entries, the browser becomes unresponsive. I believe that addressing this issue would further enhance the already valuable and fantastic experience that your application provides.

I'd like to express my gratitude for creating such a wonderful app, which I now use on a daily basis. I'm excited about the possibilities that resolving this issue could bring. Thank you for your continued dedication and hard work!

System-wide .env

https://github.com/freedmand/semantra/blob/main/docs/guide_openai.md

Semantra will look in the .env file in the directory it's run in and load the environment variable if found.

Which directory is this on Windows? semantra.exe exists in two places:

  • %USERPROFILE%\.local\bin\semantra.exe
  • %USERPROFILE%\.local\pipx\venvs\semantra\Scripts\semantra.exe

But putting the .env file in either of these folders does not work.

The address cannot be accessed

hi:

The address cannot be accessed.


user@lsp-ws:~$ semantra /ark-contexts/data/hamlet.pdf
hamlet.pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 23.64it/s]

 * Serving Flask app 'semantra.semantra'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8080
 * Running on http://172.29.60.49:8080

Auto-detect encoding?

Documentation says

  • --encoding: Encoding to use for reading text files [default: utf-8]

But different files have different encodings. Chinese PDF is being read correctly and characters are showing up correctly, but a .txt file in the same folder that's encoded in GB2312 is being garbled in both the search results and the file display.

Probably it should default to detecting the encoding for each file independently and then converting it internally to whatever the embedding expects (UTF-8?).

https://pypi.org/project/chardet/
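
A minimal sketch of the suggested per-file detection, assuming the chardet package linked above (read_text_auto is a hypothetical helper, not part of Semantra):

# Hypothetical sketch of per-file encoding auto-detection -- not current Semantra behavior
import chardet

def read_text_auto(path):
    raw = open(path, "rb").read()
    guess = chardet.detect(raw)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}
    return raw.decode(guess["encoding"] or "utf-8")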

PDF parsing error handling

Hi, it would be useful if some error handling were added in case a PDF fails to parse. I got this error after parsing thousands of PDFs and had to restart from scratch (not a big deal since I used a small model for embedding, but annoying if a large OpenAI model had been used).

(semantra) rico@xxx:~/src/semantra$ semantra --model sgpt-1.3B data/*pdf
semantra --model sgpt-1.3B data/test.pdf 
test.pdf:   0%|  | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/rico/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 594, in main
    documents[fn] = process(
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 146, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 45, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/pdf.py", line 53, in get_pdf_content
    pdf = pdfium.PdfDocument(filename)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 86, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "/home/rico/.local/pipx/venvs/semantra/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 721, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {consts.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

'charmap' codec can't encode character

Hi, I am a PhD researcher really excited about this tool.

Unfortunately, I am not able to get Semantra to run. It shows the following error (something to do with charmap not being able to encode a Unicode character):

PS C:\Users\nicho> semantra "C:\Users\nicho\Downloads\Chrome Downloads\hamlet (1).pdf"
hamlet (1).pdf: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\users\nicho\.local\bin\semantra.exe\__main__.py", line 7, in <module>
  File "C:\Users\nicho\.local\pipx\venvs\semantra\Lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\nicho\.local\pipx\venvs\semantra\Lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\nicho\.local\pipx\venvs\semantra\Lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\nicho\.local\pipx\venvs\semantra\Lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\nicho\.local\pipx\venvs\semantra\Lib\site-packages\semantra\semantra.py", line 594, in main
    documents[fn] = process(
  File "C:\Users\nicho\.local\pipx\venvs\semantra\Lib\site-packages\semantra\semantra.py", line 146, in process
    content = get_text_content(md5, filename, semantra_dir, force, silent)
  File "C:\Users\nicho\.local\pipx\venvs\semantra\Lib\site-packages\semantra\semantra.py", line 45, in get_text_content
    return get_pdf_content(md5, filename, semantra_dir, force, silent)
  File "C:\Users\nicho\.local\pipx\venvs\semantra\Lib\site-packages\semantra\pdf.py", line 79, in get_pdf_content
    position += f.write(pagetext)
  File "C:\Users\nicho\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffe' in position 565: character maps to <undefined>

ValueError: Loading openbmb/cpm-bee-10b requires you to execute the tokenizer file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.

An error occurred when executing the command "semantra --transformer-model openbmb/cpm-bee-10b xxx.txt" to use the Hugging Face model openbmb/cpm-bee-10b. Detailed message shown below:
Traceback (most recent call last):
  File "/Users/macbookuser/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
  File "/Users/macbookuser/.local/pipx/venvs/semantra/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/macbookuser/.local/pipx/venvs/semantra/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/macbookuser/.local/pipx/venvs/semantra/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/macbookuser/.local/pipx/venvs/semantra/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/macbookuser/.local/pipx/venvs/semantra/lib/python3.11/site-packages/semantra/semantra.py", line 583, in main
    model = TransformerModel(
  File "/Users/macbookuser/.local/pipx/venvs/semantra/lib/python3.11/site-packages/semantra/models.py", line 166, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
  File "/Users/macbookuser/.local/pipx/venvs/semantra/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 669, in from_pretrained
    raise ValueError(
ValueError: Loading openbmb/cpm-bee-10b requires you to execute the tokenizer file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.

Could anyone tell me how to fix this error? Thanks a lot!

PyTorch no longer supports GPU because it is too old

After following this tutorial to install pytorch: https://www.linode.com/docs/guides/pytorch-installation-ubuntu-2004/

print(torch.cuda.is_available)

gives:

<function is_available at 0x7f1517429700>

as an answer. I suppose I can understand it as "true".

But when running semantra, it gives me some errors; I can't make it work, and I don't know what to do with these messages:

/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/torch/cuda/__init__.py:152: UserWarning: 
Found GPU0 NVIDIA GeForce GT 740M which is of cuda capability 3.5.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 3.7.

warnings.warn(old_gpu_warn % (d, name, major, minor, min_arch // 10, min_arch % 10))
powershell.pdf:   0%|                                     | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/francisco/.local/bin/semantra", line 8, in <module>
    sys.exit(main())
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 619, in main
    documents[fn] = process(
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 304, in process
    flush_pool()
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/semantra.py", line 272, in flush_pool
    embedding_results = model.embed(tokens, pool)
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/semantra/models.py", line 309, in embed
    model_output = self.model(
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 827, in forward
    extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)
  File "/home/francisco/.local/pipx/venvs/semantra/lib/python3.8/site-packages/transformers/modeling_utils.py", line 911, in get_extended_attention_mask
    extended_attention_mask = extended_attention_mask.to(dtype=dtype)  # fp16 compatibility
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Turn on Discussions?

Can you turn on GitHub's Discussions feature so we can talk about models and use cases etc. without filing issues about them?

Fantastic. Enhancement ideas

I like the idea of semantic search. To expand on this (and I am sure some of it has been said before), it would be good to add:

UI / Input

  1. Options to interact from the web UI instead of the command line: adding files, etc.
  2. Increased filetype coverage - include office, markdown, images (?)
  3. Input a folder - with all files covered

Processing
4) Specify where embeddings are stored from UI (or make it clear in readme)
5) Save embeddings to avoid repeat processing
6) Auto-update embeddings at startup for any new files that may have been added to the folder
7) Enable GPU mode

Summarization (additional)
8) Perhaps use Vicuna (quite good already) or another local LLM to process the results from the search query into a coherent reply (with sources and links to the source documents).
9) This reply can be adjacent to a "source window" (in case of multiple documents), keeping full visibility into the source text for the user.

Error installing semantra.

hi:

  The following error was found, please help.

Fatal error from pip prevented installation. Full pip output in file:
C:\Users\lh_ti.local\pipx\logs\cmd_2023-04-25_15.39.21_pip_errors.log

pip failed to build package:
annoy-fixed

Some possibly relevant errors from pip install:
error: subprocess-exited-with-error
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/


Non-English models

semantra works very well on English articles.
But I encountered some obstacles with Chinese articles.

I installed a Chinese BERT model via PowerShell:

PS D:\semantra> semantra.exe --transformer-model ckiplab/bert-base-chinese C:\Users\xxx\Desktop\booktxt\*.txt

but I got this:

[screenshot 2023-04-28]

What should I fix? Thank you!

Semantra Install Failure on Fedora 37

Hello! I tried to install semantra on an x86 machine running Fedora 37. The install failed. Has anyone had a similar experience?
Please see the link to the error logs below; any suggestions as to a solution?
https://github.com/Rui-E-Rodrigues/SemantraFedora36InstallErrors/tree/570095e2ee9f80c8fa5b8a8cd6c97f6070a8b32b
Please note: I also updated my installation of the GNU C and C++ compilers and then re-ran pipx install semantra; however, this did not result in a fix. Furthermore, I also tried to install Semantra on a machine running Debian 11 and had the same issues as on Fedora 36 and Fedora 37.
Thank you, regards!

Re-run queries with tags on new documents

First off, thanks for building this fantastic tool!

One enhancement I’m considering is allowing users to save queries they have written that contain document tags such that these queries can be reused across future sets of documents.

I’m proposing this enhancement because I think building queries from tags (as described in 'Step 6: tagging search results') is a process of optimisation that has value worth persisting. We go through a process of discovery in Semantra figuring out which combination of tags and semantic arithmetic will lead us to the most relevant results. It is easy to copy and paste simple queries across separate document searches but replicating queries with tags on new sets of documents is more involved as these queries reference files (embeddings and PDFs) that would need to be saved somewhere sensible to retrieve them. At present, to replicate a query with tags we would also need to re-tag all the document sections in Semantra.

I would love to get your general thoughts on whether or not you think this could be a worthwhile enhancement?

I’ve done some basic exploration so far and been able to export a tagged query’s POST request payload which I can re-use by making new requests to the api/query endpoint. I would ideally then like to open Semantra and actually view the results through the interface (effectively reloading the tagged query results) but I haven’t found a good way of doing this.

How to know which Hugging Face Hub models can be used?

According to guide_models.md, many models on the Hugging Face Hub can be used by Semantra. But how can we know which Hugging Face Hub models can be used?
How can we determine whether an error encountered when using the --transformer-model option is due to the unavailability of the model or to other factors, such as computer configuration?

OCR pdf

Hey, thanks for this incredible project,

Does this support scanned PDF files (that need OCR) so they can be searched by text? That would be super helpful.

Thank you :)

ModuleNotFoundError: No module named 'semantra.semantra'; 'semantra' is not a package

I am trying to run the src/semantra.py file directly after cloning this repo, and I am getting the following error; kindly let me know how to mitigate it:

  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/madhurima/semantra/src/semantra/semantra.py", line 650, in base
    pkg_resources.resource_filename("semantra.semantra", "client_public"),
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1211, in resource_filename
    return get_provider(package_or_requirement).get_resource_filename(
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/pkg_resources/__init__.py", line 402, in get_provider
    __import__(moduleOrReq)


ModuleNotFoundError: No module named 'semantra.semantra'; 'semantra' is not a package


125.18.52.138 - - [06/Jun/2023 16:28:02] "GET / HTTP/1.1" 500 -
[2023-06-06 16:28:02,714] ERROR in app: Exception on /favicon.ico [GET]
Traceback (most recent call last):
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/madhurima/semantra/src/semantra/semantra.py", line 658, in home
    pkg_resources.resource_filename("semantra.semantra", "client_public"),
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/pkg_resources/__init__.py", line 1211, in resource_filename
    return get_provider(package_or_requirement).get_resource_filename(
  File "/rapid_data/miniconda/envs/sem_9/lib/python3.9/site-packages/pkg_resources/__init__.py", line 402, in get_provider
    __import__(moduleOrReq)

ModuleNotFoundError: No module named 'semantra.semantra'; 'semantra' is not a package

openai.error.InvalidRequestError: '$.input' is invalid.

I tried running it on ~50 files from a grep result, of types:

  • pdf
  • csv
  • txt
  • md
  • html
  • py

Pressing y for each file to be processed by OpenAI was annoying, so I cancelled and tried again with --no-confirm and got this error. I then tried again without --no-confirm and still got the same error:

  File "C:\Users\endolith\anaconda3\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\endolith\anaconda3\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "c:\users\endolith\.local\bin\semantra.exe\__main__.py", line 7, in <module>
    try:
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\semantra\semantra.py", line 619, in main
    documents[fn] = process(
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\semantra\semantra.py", line 307, in process
    flush_pool()
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\semantra\semantra.py", line 272, in flush_pool
    embedding_results = model.embed(tokens, pool)
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\semantra\models.py", line 144, in embed
    response = openai.Embedding.create(model=self.model_name, input=texts)
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\openai\api_resources\embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\openai\api_resources\abstract\engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\openai\api_requestor.py", line 298, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\openai\api_requestor.py", line 700, in _interpret_response
    self._interpret_response_line(
  File "C:\Users\endolith\.local\pipx\venvs\semantra\lib\site-packages\openai\api_requestor.py", line 763, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.

requirement torch>=2.0.0

On a Raspberry Pi 4 I am receiving the following error message when installing via pip and pipx:
ERROR: Could not find a version that satisfies the requirement torch>=2.0.0 (from semantra) (from versions: none)
ERROR: No matching distribution found for torch>=2.0.0

Python version: 3.11.3
Also tried with 3.7, 3.9, 3.11.2

The same install is working fine on M1 Mac ... both are arm64 ...

Failed to install with pipx

Reason:

pipx install semantra

No apps associated with package semantra or its dependencies. If you are attempting to install a library, pipx should not be used. Consider using
pip or a similar tool instead.

I also tried to install using pip install. The package is installed, but the semantra command is not found when run in cmd.

All result scores are "nan"

Hello, thank you for sharing your work. Unfortunately I have an issue that prevents me from using it.
After installing and running Semantra, all indexed files in the web interface report a NaN score for every query.

Location of the index

Is it possible to choose another location for the indexes? Currently Semantra seems to write the data to the user directory on Windows. I regularly restore images of the C: drive, so the indexes will get lost. Moreover, I want to keep the data on C: as small as possible.

How to use Semantra with GPU via PyTorch

Hi, thanks for the great work and wonderful tool, and my apologies for asking two different questions in one thread.
I was following the guides on how to use Semantra and got it running, but I couldn't find a guide on how to utilize my GPU (RTX 3070) instead of my old CPU.
I have followed the link to get PyTorch running, but Semantra is still only utilizing my CPU.

  1. Is there some special command/code or model that should be used to utilize the GPU? It would be very nice to have a guide section for this, or if you could kindly share how to utilize the GPU to speed up Semantra.

I have also noticed that it takes up some disk space (~10 GB) after running quite a few PDF and text tests, and I was trying to figure out whether this space was taken up by cached embeddings and/or by trying different models. So far I couldn't find all the locations related to Semantra.

%AppData%\Semantra  (~444 MB)
%UserProfile%\.cache    (~5.92 GB)
%UserProfile%\.local      (~1.1 GB)
  2. Which location is safe to clear of old searches if I want to free some space (I am using Semantra on Windows 10)?

I really appreciate your help. Thanks

Support full system-wide document searching

The current version of Semantra (0.1.3) is a good start, but unfortunately cannot be used as a generic PDF search engine (though it is very close!).

I indexed every PDF on my linux machine by running the following command:
find / -iname "*.pdf" -type f -print0| xargs -0 semantra --no-server

I note that:

  • Only one CPU core is used.
  • It is not clear to me that I could rewrite my find command to safely run 10 copies of Semantra in parallel, so I didn't.
  • Once I had everything indexed, there is no way to tell Semantra to just search over all already indexed files.
  • Using the same command for indexing and viewing violates separation of concerns.

Also, I note that when run without arguments, Semantra indicates that it accepts an optional filename (using []) but does not actually accept such input:

$ semantra
Usage: semantra [OPTIONS] [FILENAME]...
Try 'semantra --help' for help.

Error: Must provide a filename to process/query

I appreciate the work done so far, and have the following suggestions:

  • If nothing else, provide a command line option to Semantra to make it search all already-indexed files, or default to this behaviour when no file name is provided.
  • Separate Semantra into two files. One for indexing and one for searching. Allow indexers to run in parallel.
  • Allow the search webapp to run independently of indexers, so I can add files to the index without fear of breaking the webapp's search capabilities, and can leave the search window open 24/7. This could hopefully be as simple as only moving the index files out of a temporary directory and into the semantra search directory after indexing is completed.
