- 🔭 I'm currently working on ss.lv Web Scraper
- 🤝 I'm open to collaborating on interesting projects
My GitHub Stats
The ss.lv web scraping app automates the scraping and filtering of classifieds information, emails the results, and stores the scraped data in a database
License: GNU General Public License v3.0
Result files are not cleaned up
Requirements:
Data formatter module has an implementation
Implement logging for the analitics.py module
Only 5 URLs are scraped during development; increase to the maximum URL count
-- sleep values might need to be tuned in the webscraper.py module
-- changing the URL count requires a container restart to take effect
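The URL-count and sleep notes above could be sketched like this; the environment-variable names and the `fetch` callback are illustrative assumptions, not the project's actual API, but reading the limit from the environment matches the note that a container restart is needed for changes to take effect:

```python
import os
import time

# Assumed variable names: WS_URL_LIMIT caps how many URLs are scraped
# (5 during development), WS_REQUEST_DELAY is the per-request sleep.
URL_LIMIT = int(os.getenv("WS_URL_LIMIT", "5"))
REQUEST_DELAY_SEC = float(os.getenv("WS_REQUEST_DELAY", "2.0"))

def limit_urls(all_urls):
    """Return at most URL_LIMIT URLs (all of them when the limit is 0)."""
    return list(all_urls) if URL_LIMIT == 0 else list(all_urls)[:URL_LIMIT]

def polite_fetch(urls, fetch):
    """Call fetch(url) for each URL, sleeping between requests so the
    scrape rate stays tunable without code changes."""
    for url in urls:
        fetch(url)
        time.sleep(REQUEST_DELAY_SEC)
```

Setting `WS_URL_LIMIT=0` would then mean "scrape the maximum URL count" once development is done.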
Restore the temp file cleanup functionality
File examples: Ogre-report*.txt, pandas-df.csv, cleaned-df.csv
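A minimal sketch of the cleanup to restore, using the file patterns listed above; whether more patterns belong on the list is an assumption left to the project:

```python
import glob
import os

# Patterns taken from the file examples above.
TEMP_FILE_PATTERNS = ["Ogre-report*.txt", "pandas-df.csv", "cleaned-df.csv"]

def clean_temp_files(patterns=TEMP_FILE_PATTERNS):
    """Delete matching result files from the working directory and
    return the names that were removed (useful for logging)."""
    removed = []
    for pattern in patterns:
        for path in glob.glob(pattern):
            os.remove(path)
            removed.append(path)
    return removed
```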
Insert dictionary into the delisted_ads table
Analytics module should run and create a PDF report
Email should include the PDF report as an attachment and text in the body
Requires logging to the Docker console and to a file
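The dual console-plus-file logging requirement could be sketched as below; the format string mirrors the record layout seen in the db_worker log excerpt, while the logfile name is an assumption:

```python
import logging
import sys

def setup_logging(logfile="db_worker.log"):
    """Log to stdout (visible via `docker logs`) and to a file.

    The format reproduces the fields seen in the existing logs:
    timestamp, logger name, level, function, line number, message.
    """
    fmt = "%(asctime)s: %(name)s: %(levelname)s: %(funcName)s: %(lineno)d: %(message)s"
    logging.basicConfig(
        level=logging.INFO,
        format=fmt,
        handlers=[logging.StreamHandler(sys.stdout), logging.FileHandler(logfile)],
        force=True,  # replace any pre-existing handlers (Python 3.8+)
    )
    return logging.getLogger("db_worker")
```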
2021-07-23 17:13:28,419: db_worker: INFO: db_worker_main: 45: --- Satrting db_worker module ---
2021-07-23 17:13:28,421: db_worker: INFO: get_data_frame_hashes: 93: Extracted 35 hashes from pandas data frame
2021-07-23 17:13:28,454: db_worker: INFO: clean_db_hashes: 137: Extracted and cleaned 35 hashes from listed_ads table
2021-07-23 17:13:28,454: db_worker: INFO: categorize_hashes: 148: Categorizing hashes based on listed_ads table hashes and and new df hashes
2021-07-23 17:13:28,456: db_worker: INFO: get_data_frame_hashes: 93: Extracted 35 hashes from pandas data frame
2021-07-23 17:13:28,467: db_worker: INFO: clean_db_hashes: 137: Extracted and cleaned 35 hashes from listed_ads table
Implement a function that creates email_body_txt_m4.txt for the SendGrid mailer
2021-08-01 12:13:14,164: db_worker: INFO: update_dlv_in_db_table: 460: Updated days_listed value for 1800 messages in listed_ads table
Extract data as a dictionary from the listed_ads table
Move the data-removal code to sendgrid_mailer.py
Increment the days-listed value in the listed_ads table
Affected release: 1.4.3
Traceback (most recent call last):
File "./app.py", line 18, in <module>
import db_worker
File "/home/ec2-user/sslv_web_scraper/sslv_web_scraper/db_worker.py", line 549, in <module>
db_worker_main()
File "/home/ec2-user/sslv_web_scraper/sslv_web_scraper/db_worker.py", line 74, in db_worker_main
update_dlv_in_db_table(to_increment_msg_data, todays_date)
File "/home/ec2-user/sslv_web_scraper/sslv_web_scraper/db_worker.py", line 461, in update_dlv_in_db_table
if int(correct_dlv) > days_listed:
ValueError: invalid literal for int() with base 10: '21:53:32.177067'
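The traceback above shows a time string ('21:53:32.177067') reaching `int()` where a days-listed count was expected, which suggests a column got misaligned when the row data was built. A hedged sketch of a defensive fix (the function names here are illustrative, not the actual db_worker code):

```python
def parse_days_listed(value):
    """Return the days-listed value as an int, or None when the field
    holds something else (e.g. a time string like '21:53:32.177067')."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def should_increment(correct_dlv, days_listed):
    """Decide whether to update, skipping rows with bad data instead of
    crashing the whole db_worker run (hypothetical caller-side helper)."""
    parsed = parse_days_listed(correct_dlv)
    return parsed is not None and parsed > days_listed
```

Skipped rows should still be logged so the underlying column-order bug can be found rather than silently tolerated.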
Major refactoring to improve code readability
listed_ads table needs:
Add better handling of missing database.ini file
Extract ad hashes from database listed_ads table for db_worker
A linting error needs to be fixed:
 8  def scrape_website():
 7      """ Main function of module calls all sub-functions"""
 6
53      if task_run_state == True:    <<< flagged by flake8 (E712)
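The flagged comparison is the classic E712 ("comparison to True"); a sketch of the idiomatic fix, with a simplified, hypothetical signature since the real `scrape_website()` takes no arguments:

```python
def scrape_website(task_run_state: bool) -> str:
    """Illustrative only: use a truthiness test instead of `== True`."""
    if task_run_state:          # was: if task_run_state == True:
        return "already ran today"
    return "running scrape"
```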
Extract data as dictionary from data frame
db_worker: INFO: compare_df_to_db_hashes: 164: Result 15 new, 29 still_listed, 1 to_remove hashes
Need to log exactly which 15 ads were new
It seems suspicious that there were 15 new ads but not 15 with the same listed date; they may have been edited without being relisted on the same day
This could be triaged by comparing two consecutive daily database backups
Monthly activity (count of new ads inserted, count of removed ads,
average days in the listed state for removed ads)
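The monthly-activity report could start from a query like the one below; the column names (`listed_date`, `removed_date`, `days_listed`) are assumptions based on the notes in this document, and the PostgreSQL `date_trunc` call assumes the Postgres backend the tracebacks suggest:

```python
# Hedged SQL sketch for the monthly activity numbers; adjust column
# names to the real listed_ads / delisted_ads schema before use.
MONTHLY_ACTIVITY_SQL = """
SELECT
    (SELECT count(*) FROM listed_ads
      WHERE listed_date >= date_trunc('month', current_date)) AS new_ads,
    (SELECT count(*) FROM delisted_ads
      WHERE removed_date >= date_trunc('month', current_date)) AS removed_ads,
    (SELECT avg(days_listed) FROM delisted_ads
      WHERE removed_date >= date_trunc('month', current_date)) AS avg_days_listed;
"""
```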
.env.prod file example in README.md
2021-07-25 23:17:30,881: db_worker: INFO: insert_data_to_listed_table: 236: 15
2021-07-25 23:17:30,881: db_worker: INFO: insert_data_to_listed_table: 236: 19
2021-07-25 23:17:30,882: db_worker: INFO: insert_data_to_listed_table: 236: 23:17:30.748172
2021-07-25 23:17:30,882: db_worker: ERROR: insert_data_to_listed_table: 262: invalid input syntax for integer: "23:17:30.748172"
LINE 12: ...00, 35.0, 971.43, 'Skolas iela 1b', '2021.07.25', '23:17:30....
Extract ad hashes from data frame for db_worker
Load daily csv data to data frame
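A sketch of loading the daily CSV defensively; returning `None` when the file is missing (instead of letting `FileNotFoundError` kill the run, as in the `cleaned-sorted-df.csv` traceback later in these notes) is a design choice assumed here, not the project's current behavior:

```python
from pathlib import Path

import pandas as pd

def load_daily_csv(csv_name="cleaned-sorted-df.csv"):
    """Load the daily CSV into a DataFrame, or return None when the
    file does not exist so the caller can log and skip the run."""
    if not Path(csv_name).exists():
        return None
    return pd.read_csv(csv_name)
```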
Affected file: app/main.py
Current problems:
-- no logging in the Docker console
-- wording in the file logging needs improvement
INFO: Started server process [1]
From ws DOCKER console log
Debug info: Started dat_formater module ... <<< need to improve
Debug info: Ended data_formater module ...
Debug info: Starting data frame cleaning module ...
Debug info: Completed dat_formater module ...
Error: 1_rooms_tmp.txt : No such file or directory
Error: Mailer_report.txt : No such file or directory
Error: basic_price_stats.txt : No such file or directory
Error: 1-4_rooms.png : No such file or directory
Error: 1_rooms.png : No such file or directory
Error: 2_rooms.png : No such file or directory
Error: test.png : No such file or directory
Error: mrv2.txt : No such file or directory
Error: Ogre_city_report.pdf : No such file or directory
2022-01-06 16:16:26,071 [MainThread ] [INFO ] : serve: 84: Started server process [1]
INFO: 192.168.176.1:57528 - "GET / HTTP/1.1" 200 OK
INFO: 192.168.176.1:57528 - "GET /favicon.ico HTTP/1.1" 404 Not Found
DEBUG: sleeping 90 sec ... waiting for srape task to complete
DEBUG: sleeping 5 sec .. waiting for dataformater task to complete
DEBUG: sleeping 3 sec
DEBUG: sleeping 5 sec
INFO: 192.168.176.4:43588 - "GET /run-task/ogre HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
result = await app(self.scope, self.receive, self.send)
File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 64, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/responses.py", line 142, in __call__
await self.background()
File "/usr/local/lib/python3.8/site-packages/starlette/background.py", line 35, in __call__
await task()
File "/usr/local/lib/python3.8/site-packages/starlette/background.py", line 20, in __call__
await run_in_threadpool(self.func, *self.args, **self.kwargs)
File "/usr/local/lib/python3.8/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/usr/local/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 754, in run
result = context.run(func, *args)
File "/./app/wsmodules/web_scraper.py", line 57, in scrape_website
exit(0)
File "/usr/local/lib/python3.8/_sitebuiltins.py", line 26, in __call__
raise SystemExit(code)
SystemExit: 0
Debug info: Starting website parsing module ...
Checking if job: scrape OGRE apartments has run today
Job did run today state: True
--- Finished ws_worker module because job was run today state: true ---
Implement database insert functionality
Insert dictionary into the listed_ads table
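A sketch of inserting a dict into listed_ads with psycopg2-style `%s` placeholders; the ad's keys are assumed to match the table's column names, and the values go through query parameters so only trusted column names reach the SQL string:

```python
def build_insert_sql(table: str, ad: dict):
    """Build a parameterized INSERT statement plus its value list
    from a dict whose keys are (trusted) column names."""
    columns = ", ".join(ad)
    placeholders = ", ".join(["%s"] * len(ad))
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return sql, list(ad.values())

def insert_ad(conn, ad: dict) -> None:
    """Insert one ad record using an open psycopg2 connection."""
    sql, values = build_insert_sql("listed_ads", ad)
    with conn.cursor() as cur:
        cur.execute(sql, values)
    conn.commit()
```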
./db_worker.py
DEBUG: Satrting db_worker module ...
DEBUG: Loaded cleaned-sorted-df.csv to dataframe in memory ...
Traceback (most recent call last):
File "./db_worker.py", line 145, in <module>
db_worker_main()
File "./db_worker.py", line 36, in db_worker_main
df_hashes = get_data_frame_hashes('cleaned-sorted-df.csv')
File "./db_worker.py", line 54, in get_data_frame_hashes
df = pd.read_csv(df_filename)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 948, in __init__
self._make_engine(self.engine)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 2010, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: 'cleaned-sorted-df.csv'
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/./app/main.py", line 7, in <module>
from app.wsmodules.db_worker import db_worker_main
File "/./app/wsmodules/db_worker.py", line 547, in <module>
db_worker_main()
File "/./app/wsmodules/db_worker.py", line 75, in db_worker_main
update_dlv_in_db_table(to_increment_msg_data, todays_date)
File "/./app/wsmodules/db_worker.py", line 459, in update_dlv_in_db_table
if int(correct_dlv) > days_listed:
ValueError: invalid literal for int() with base 10: '19:42:31.992576'
2022-01-16 21:18:12,243 [MainThread ] [INFO ] : : 49: ts_loop: checking every 300 sec if cheduled task needs to run again...
2022-01-16 21:23:12,327 [MainThread ] [INFO ] : : 49: ts_loop: checking every 300 sec if cheduled task needs to run again...
2022-01-16 21:28:12,425 [MainThread ] [INFO ] : : 49: ts_loop: checking every 300 sec if cheduled task needs to run again...
Change the check interval from once every 5 min to once every 50 min (3000 sec)
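The ts_loop interval change above could be sketched as follows; the function signature is illustrative (the real loop runs forever, so the `iterations` and injectable `sleep` parameters here exist only to make the sketch testable):

```python
import time

# 50 minutes instead of the previous 5-minute (300 s) interval.
CHECK_INTERVAL_SEC = 50 * 60  # 3000 seconds

def ts_loop(task_needs_run, run_task, iterations, sleep=time.sleep):
    """Periodically check whether the scheduled task needs to run again,
    sleeping CHECK_INTERVAL_SEC between checks."""
    for _ in range(iterations):
        if task_needs_run():
            run_task()
        sleep(CHECK_INTERVAL_SEC)
```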
Categorize hashes into 3 categories
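The three categories match the log line "Result 15 new, 29 still_listed, 1 to_remove hashes" earlier in these notes, so a set-based sketch could look like this (category key names taken from that log; sorting is just for deterministic output):

```python
def categorize_hashes(df_hashes, db_hashes):
    """Split scraped (data-frame) vs stored (listed_ads) hashes into
    the three categories the db_worker log reports."""
    df_set, db_set = set(df_hashes), set(db_hashes)
    return {
        "new": sorted(df_set - db_set),           # scraped, not yet in DB
        "still_listed": sorted(df_set & db_set),  # present in both
        "to_remove": sorted(db_set - df_set),     # in DB, no longer scraped
    }
```

Logging the contents of `result["new"]`, not just its length, would also answer the open question above about which 15 ads were new.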
Improve code quality so the module is ready for import into FastAPI
Create a Makefile for shortcuts like:
-- docker compose up
-- docker clean unused images
-- pytest
-- flake8
-- docker build and push
-- deploy to AWS
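A minimal Makefile sketch covering the shortcuts above; the image name and the `app/` path are assumptions, and the AWS deploy target is left as a stub since the deployment mechanism is not described here:

```make
# Illustrative sketch only; adjust names and paths to the real project.
.PHONY: up clean test lint build-push deploy

up:
	docker compose up -d

clean:
	docker image prune -f

test:
	pytest

lint:
	flake8 app/

build-push:
	docker build -t sslv_web_scraper:latest .
	docker push sslv_web_scraper:latest

deploy:
	@echo "TODO: deploy to AWS (mechanism not specified in these notes)"
```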
Extract ad hashes from database listed_ads table
Remove delisted ads from listed_ads based on hashes
Goal: daily mail must be sent via the SendGrid email API
Subgoals: