upstash / degree-guru
AI chatbot for expert answers on university degrees
Home Page: https://degreeguru.vercel.app/
Remove overflow-y-scroll from the main tag in page.tsx, as it shows an additional scroll bar area that isn't being used.
I tried to get Scrapy to crawl a basic website, but it doesn't seem to crawl anything. At first I thought it was due to the Vercel deploy, but even on a basic droplet nothing happens. The documentation is also a bit sparse. Any idea what could be wrong?
2024-05-13 17:24:26 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: degreegurucrawler)
2024-05-13 17:24:26 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0], pyOpenSSL 24.1.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.7, Platform Linux-6.8.0-31-generic-x86_64-with-glibc2.39
2024-05-13 17:24:26 [httpx] DEBUG: load_ssl_context verify=True cert=None trust_env=True http2=False
2024-05-13 17:24:26 [httpx] DEBUG: load_verify_locations cafile='/root/scrape/venv/lib/python3.12/site-packages/certifi/cacert.pem'
2024-05-13 17:24:26 [httpx] DEBUG: load_ssl_context verify=True cert=None trust_env=True http2=False
2024-05-13 17:24:26 [httpx] DEBUG: load_verify_locations cafile='/root/scrape/venv/lib/python3.12/site-packages/certifi/cacert.pem'
2024-05-13 17:24:26 [httpcore.connection] DEBUG: connect_tcp.started host='adjusted-quagga-67119-eu1-vector.upstash.io' port=443 local_address=None timeout=5.0 socket_options=None
2024-05-13 17:24:26 [httpcore.connection] DEBUG: connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7189d254c6b0>
2024-05-13 17:24:26 [httpcore.connection] DEBUG: start_tls.started ssl_context=<ssl.SSLContext object at 0x7189d252c750> server_hostname='adjusted-quagga-67119-eu1-vector.upstash.io' timeout=5.0
2024-05-13 17:24:26 [httpcore.connection] DEBUG: start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7189d2921070>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: send_request_headers.started request=<Request [b'POST']>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: send_request_headers.complete
2024-05-13 17:24:26 [httpcore.http11] DEBUG: send_request_body.started request=<Request [b'POST']>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: send_request_body.complete
2024-05-13 17:24:26 [httpcore.http11] DEBUG: receive_response_headers.started request=<Request [b'POST']>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Mon, 13 May 2024 17:24:26 GMT'), (b'Content-Type', b'application/json'), (b'Content-Length', b'270'), (b'Connection', b'keep-alive'), (b'Strict-Transport-Security', b'max-age=31536000; includeSubDomains')])
2024-05-13 17:24:26 [httpx] INFO: HTTP Request: POST https://MYVECTORURL.vector.upstash.io/info "HTTP/1.1 200 OK"
2024-05-13 17:24:26 [httpcore.http11] DEBUG: receive_response_body.started request=<Request [b'POST']>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: receive_response_body.complete
2024-05-13 17:24:26 [httpcore.http11] DEBUG: response_closed.started
2024-05-13 17:24:26 [httpcore.http11] DEBUG: response_closed.complete
Creating a vector index at https://MYVECTORURL.vector.upstash.io.
Vector store info before crawl: InfoResult(vector_count=0, pending_vector_count=0, index_size=0, dimension=1536, similarity_function='DOT_PRODUCT', namespaces={'': NamespaceInfo(vector_count=0, pending_vector_count=0)})
2024-05-13 17:24:26 [scrapy.addons] INFO: Enabled addons:
[]
2024-05-13 17:24:26 [asyncio] DEBUG: Using selector: EpollSelector
2024-05-13 17:24:26 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-05-13 17:24:26 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-05-13 17:24:26 [scrapy.extensions.telnet] INFO: Telnet Password: a8d1a25a67da58af
2024-05-13 17:24:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2024-05-13 17:24:26 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'degreegurucrawler',
'DEPTH_LIMIT': 3,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'degreegurucrawler.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['degreegurucrawler.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-05-13 17:24:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-05-13 17:24:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-05-13 17:24:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-05-13 17:24:26 [scrapy.core.engine] INFO: Spider opened
2024-05-13 17:24:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-05-13 17:24:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-05-13 17:24:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://WEBSITEURL.com> (referer: None)
2024-05-13 17:24:26 [scrapy.core.engine] INFO: Closing spider (finished)
2024-05-13 17:24:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 217,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 11688,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.223009,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 5, 13, 17, 24, 26, 770066, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 65674,
'httpcompression/response_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 10,
'memusage/max': 89600000,
'memusage/startup': 89600000,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2024, 5, 13, 17, 24, 26, 547057, tzinfo=datetime.timezone.utc)}
2024-05-13 17:24:26 [scrapy.core.engine] INFO: Spider closed (finished)
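The stats above show the spider fetched exactly one page (the start URL) and then finished, which usually means no extracted link survived the crawler's link filters. A minimal sketch of that filtering behavior, assuming the crawler restricts links with regex `allow` patterns the way Scrapy's LinkExtractor does (the patterns and URLs below are hypothetical):

```python
import re

# Illustration only: Scrapy's LinkExtractor keeps a link only if its URL
# matches one of the `allow` patterns. If the patterns in the crawler
# config were written for a different site, the spider fetches the start
# URL, finds zero allowed links, and closes immediately with
# 'finish_reason': 'finished' and request_count = 1.
def allowed_links(links, allow_patterns):
    compiled = [re.compile(p) for p in allow_patterns]
    return [url for url in links if any(c.search(url) for c in compiled)]

links = [
    "https://example.com/programs/math",
    "https://example.com/about",
]

# A pattern written for another site matches nothing -> empty frontier.
print(allowed_links(links, [r"https://someuniversity\.edu/.*"]))  # []

# A pattern matching the actual site keeps the relevant links.
print(allowed_links(links, [r"https://example\.com/programs/.*"]))
```

If this is the cause, checking that the crawler config's link patterns actually match the target site's URLs should make the frontier non-empty.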
I have followed all the steps in the README, but I end up with this error every time I run this command from the README:
scrapy crawl configurable --logfile degreegurucrawl.log
Do you know why? I haven't found a solution yet.
httpx.UnsupportedProtocol: Request URL is missing an 'http://' or 'https://' protocol.
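httpx raises UnsupportedProtocol when a request URL lacks an http:// or https:// scheme, which commonly happens when a REST URL environment variable is set without the scheme or is empty. A minimal sketch of that check (the variable name and URL are illustrative assumptions, not taken from the repo):

```python
from urllib.parse import urlparse

# Hypothetical sanity check: if e.g. UPSTASH_VECTOR_REST_URL is set to the
# bare hostname instead of the full https:// URL, every httpx request made
# with it fails with UnsupportedProtocol before anything is sent.
def has_http_scheme(url: str) -> bool:
    return urlparse(url).scheme in ("http", "https")

print(has_http_scheme("adjusted-quagga-67119-eu1-vector.upstash.io"))          # False
print(has_http_scheme("https://adjusted-quagga-67119-eu1-vector.upstash.io"))  # True
```

If this is the cause, copying the full REST URL (including https://) into the environment variable should resolve the error.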
This repo is amazing, thank you for it. My suggestions are:
Somehow merge or refactor this repo with Vercel's own https://github.com/vercel/ai-chatbot, as it has some features and UI improvements. They are already using Vercel KV (which is Upstash), but they are missing the Upstash vector database for RAG. Combining KV for login and saved chats with the vector database for RAG would be very powerful.
Instead of just providing code for scraping, I think it would be better to store the data to be vectorized in Upstash Redis. That would allow editing the content and then updating the corresponding vector. I don't think scraping is the best approach, especially if you can't edit the content before creating the vector entries.
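The suggested flow can be sketched with plain dicts standing in for Upstash Redis (the editable source content) and the vector index; the embedding function is a placeholder, not a real model. The point is that content lives in a key-value store first, so it can be edited and re-embedded, instead of vectorizing scraped pages directly:

```python
kv_store = {}      # stands in for Redis: doc_id -> editable source text
vector_index = {}  # stands in for the vector index: doc_id -> embedding

def fake_embed(text):
    # Placeholder embedding; a real pipeline would call an embedding model.
    return [float(len(text))]

def save_and_index(doc_id, text):
    # Write the editable source of truth first, then (re)embed it, so the
    # vector entry can always be regenerated after a content edit.
    kv_store[doc_id] = text
    vector_index[doc_id] = fake_embed(text)

save_and_index("cs-degree", "Original scraped text")
save_and_index("cs-degree", "Edited, corrected text")  # edit re-embeds the doc
```

With this shape, fixing a scraping mistake is just an edit in the KV store followed by a re-embed, rather than a full re-crawl.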
After several crawls of the same website, the answers are not accurate at all. I have tried changing models, but I have not achieved good results. Thanks!