upstash / degree-guru
AI chatbot for expert answers on university degrees
Home Page: https://degreeguru.vercel.app/
Remove overflow-y-scroll from the main tag in page.tsx, as it shows an additional scroll bar area that isn't being used.
I tried to get Scrapy to crawl a basic website, but it doesn't seem to crawl anything. At first I thought it was due to the Vercel deploy, but even on a basic droplet nothing happens. The documentation is also a bit sparse. Any idea what could be wrong?
2024-05-13 17:24:26 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: degreegurucrawler)
2024-05-13 17:24:26 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0], pyOpenSSL 24.1.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.7, Platform Linux-6.8.0-31-generic-x86_64-with-glibc2.39
2024-05-13 17:24:26 [httpx] DEBUG: load_ssl_context verify=True cert=None trust_env=True http2=False
2024-05-13 17:24:26 [httpx] DEBUG: load_verify_locations cafile='/root/scrape/venv/lib/python3.12/site-packages/certifi/cacert.pem'
2024-05-13 17:24:26 [httpx] DEBUG: load_ssl_context verify=True cert=None trust_env=True http2=False
2024-05-13 17:24:26 [httpx] DEBUG: load_verify_locations cafile='/root/scrape/venv/lib/python3.12/site-packages/certifi/cacert.pem'
2024-05-13 17:24:26 [httpcore.connection] DEBUG: connect_tcp.started host='adjusted-quagga-67119-eu1-vector.upstash.io' port=443 local_address=None timeout=5.0 socket_options=None
2024-05-13 17:24:26 [httpcore.connection] DEBUG: connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7189d254c6b0>
2024-05-13 17:24:26 [httpcore.connection] DEBUG: start_tls.started ssl_context=<ssl.SSLContext object at 0x7189d252c750> server_hostname='adjusted-quagga-67119-eu1-vector.upstash.io' timeout=5.0
2024-05-13 17:24:26 [httpcore.connection] DEBUG: start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7189d2921070>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: send_request_headers.started request=<Request [b'POST']>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: send_request_headers.complete
2024-05-13 17:24:26 [httpcore.http11] DEBUG: send_request_body.started request=<Request [b'POST']>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: send_request_body.complete
2024-05-13 17:24:26 [httpcore.http11] DEBUG: receive_response_headers.started request=<Request [b'POST']>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Mon, 13 May 2024 17:24:26 GMT'), (b'Content-Type', b'application/json'), (b'Content-Length', b'270'), (b'Connection', b'keep-alive'), (b'Strict-Transport-Security', b'max-age=31536000; includeSubDomains')])
2024-05-13 17:24:26 [httpx] INFO: HTTP Request: POST https://MYVECTORURL.vector.upstash.io/info "HTTP/1.1 200 OK"
2024-05-13 17:24:26 [httpcore.http11] DEBUG: receive_response_body.started request=<Request [b'POST']>
2024-05-13 17:24:26 [httpcore.http11] DEBUG: receive_response_body.complete
2024-05-13 17:24:26 [httpcore.http11] DEBUG: response_closed.started
2024-05-13 17:24:26 [httpcore.http11] DEBUG: response_closed.complete
Creating a vector index at https://MYVECTORURL.vector.upstash.io.
Vector store info before crawl: InfoResult(vector_count=0, pending_vector_count=0, index_size=0, dimension=1536, similarity_function='DOT_PRODUCT', namespaces={'': NamespaceInfo(vector_count=0, pending_vector_count=0)})
2024-05-13 17:24:26 [scrapy.addons] INFO: Enabled addons:
[]
2024-05-13 17:24:26 [asyncio] DEBUG: Using selector: EpollSelector
2024-05-13 17:24:26 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-05-13 17:24:26 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-05-13 17:24:26 [scrapy.extensions.telnet] INFO: Telnet Password: a8d1a25a67da58af
2024-05-13 17:24:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2024-05-13 17:24:26 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'degreegurucrawler',
'DEPTH_LIMIT': 3,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'degreegurucrawler.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['degreegurucrawler.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-05-13 17:24:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-05-13 17:24:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-05-13 17:24:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-05-13 17:24:26 [scrapy.core.engine] INFO: Spider opened
2024-05-13 17:24:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-05-13 17:24:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-05-13 17:24:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://WEBSITEURL.com> (referer: None)
2024-05-13 17:24:26 [scrapy.core.engine] INFO: Closing spider (finished)
2024-05-13 17:24:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 217,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 11688,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.223009,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 5, 13, 17, 24, 26, 770066, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 65674,
'httpcompression/response_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 10,
'memusage/max': 89600000,
'memusage/startup': 89600000,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2024, 5, 13, 17, 24, 26, 547057, tzinfo=datetime.timezone.utc)}
2024-05-13 17:24:26 [scrapy.core.engine] INFO: Spider closed (finished)
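The stats above show the spider fetched exactly one page (the start URL) and then finished, which usually means no extracted link survived the crawler's link filters. A minimal sketch of that filtering behavior, assuming the crawler restricts links with regex `allow` patterns the way Scrapy's LinkExtractor does (the patterns and URLs below are hypothetical):

```python
import re

# Illustration only: Scrapy's LinkExtractor keeps a link only if its URL
# matches one of the `allow` patterns. If the patterns in the crawler
# config were written for a different site, the spider fetches the start
# URL, finds zero allowed links, and closes immediately with
# 'finish_reason': 'finished' and request_count = 1.
def allowed_links(links, allow_patterns):
    compiled = [re.compile(p) for p in allow_patterns]
    return [url for url in links if any(c.search(url) for c in compiled)]

links = [
    "https://example.com/programs/math",
    "https://example.com/about",
]

# A pattern written for another site matches nothing -> empty frontier.
print(allowed_links(links, [r"https://someuniversity\.edu/.*"]))  # []

# A pattern matching the actual site keeps the relevant links.
print(allowed_links(links, [r"https://example\.com/programs/.*"]))
```

If this is the cause, checking that the crawler config's link patterns actually match the target site's URLs should make the frontier non-empty.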
I have followed all the steps in the README, but I end up with this error every time I run this command from the README:
scrapy crawl configurable --logfile degreegurucrawl.log
Do you know why? I haven't found a solution yet.
httpx.UnsupportedProtocol: Request URL is missing an 'http://' or 'https://' protocol.
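httpx raises UnsupportedProtocol when a request URL lacks an http:// or https:// scheme, which commonly happens when a REST URL environment variable is set without the scheme or is empty. A minimal sketch of that check (the variable name and URL are illustrative assumptions, not taken from the repo):

```python
from urllib.parse import urlparse

# Hypothetical sanity check: if e.g. UPSTASH_VECTOR_REST_URL is set to the
# bare hostname instead of the full https:// URL, every httpx request made
# with it fails with UnsupportedProtocol before anything is sent.
def has_http_scheme(url: str) -> bool:
    return urlparse(url).scheme in ("http", "https")

print(has_http_scheme("adjusted-quagga-67119-eu1-vector.upstash.io"))          # False
print(has_http_scheme("https://adjusted-quagga-67119-eu1-vector.upstash.io"))  # True
```

If this is the cause, copying the full REST URL (including https://) into the environment variable should resolve the error.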
This repo is amazing, thank you for it. My suggestions are:
Somehow merge or refactor this repo with Vercel's own https://github.com/vercel/ai-chatbot, as it has some features and UI improvements. They are already using Vercel KV (which is Upstash), but they are missing the Upstash vector database for RAG. Combining KV for login and saved chats with the vector database for RAG would be very powerful.
Instead of just providing code for scraping, I think it would be better to store the data to be vectorized in Upstash Redis. That would allow editing the content and then updating the corresponding vector. I don't think scraping is the best approach, especially if you can't edit the content before creating the vector entries.
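The suggested flow can be sketched with plain dicts standing in for Upstash Redis (the editable source content) and the vector index; the embedding function is a placeholder, not a real model. The point is that content lives in a key-value store first, so it can be edited and re-embedded, instead of vectorizing scraped pages directly:

```python
kv_store = {}      # stands in for Redis: doc_id -> editable source text
vector_index = {}  # stands in for the vector index: doc_id -> embedding

def fake_embed(text):
    # Placeholder embedding; a real pipeline would call an embedding model.
    return [float(len(text))]

def save_and_index(doc_id, text):
    # Write the editable source of truth first, then (re)embed it, so the
    # vector entry can always be regenerated after a content edit.
    kv_store[doc_id] = text
    vector_index[doc_id] = fake_embed(text)

save_and_index("cs-degree", "Original scraped text")
save_and_index("cs-degree", "Edited, corrected text")  # edit re-embeds the doc
```

With this shape, fixing a scraping mistake is just an edit in the KV store followed by a re-embed, rather than a full re-crawl.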
After several crawls of the same website, the answers are not accurate at all. I have tried changing models, but I have not achieved good results. Thanks!