vfedotovs / sslv_web_scraper


The ss.lv web scraping app automates scraping and filtering of classified listings, emails the results, and stores the scraped data in a database.

License: GNU General Public License v3.0

Languages: Python 84.84%, Dockerfile 0.59%, Makefile 2.20%, Shell 12.37%
Topics: analytics, beautifulsoup4, docker, email, email-sender, fpdf-library, pandas-library, postgresql, python, requests, scraper, sendgrid-api, webscraping

sslv_web_scraper's Introduction

Hi! My name is Valentins Fedotovs


  • 🚀  I'm currently working on ss.lv Web Scraper
  • 🤝  I'm open to collaborating on interesting projects


sslv_web_scraper's People

Contributors: vfedotovs
Stargazers: 5
Watchers: 1

sslv_web_scraper's Issues

Implement logging for db_worker module

Requirements:

  • should create a log file named db_worker.log
  • each log entry should include:
    • a timestamp
    • the source module, function, line number, and message
  • logs should rotate at a max size of 1 MB, keeping 5 history files
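These requirements map directly onto Python's standard logging module. A minimal sketch, assuming the logger name, file name, and format string (chosen to match the log lines quoted in the issues on this page):

```python
import logging
from logging.handlers import RotatingFileHandler

# Sketch: rotating log file for db_worker (names and format are assumptions)
logger = logging.getLogger("db_worker")
logger.setLevel(logging.INFO)

handler = RotatingFileHandler("db_worker.log",
                              maxBytes=1_000_000,  # rotate at ~1 MB
                              backupCount=5)       # keep 5 history files
handler.setFormatter(logging.Formatter(
    "%(asctime)s: %(module)s: %(levelname)s: %(funcName)s: %(lineno)d: %(message)s"))
logger.addHandler(handler)

logger.info("--- Starting db_worker module ---")
```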

Implement 3 features for MVP

  • Implement basic web scraper functionality
  • Implement data formatting for a basic text email
  • Implement sending the formatted text data as a daily email

Refactor duplicate function call in db_worker.py

2021-07-23 17:13:28,419: db_worker: INFO: db_worker_main: 45: --- Satrting db_worker module ---
2021-07-23 17:13:28,421: db_worker: INFO: get_data_frame_hashes: 93: Extracted 35 hashes from pandas data frame
2021-07-23 17:13:28,454: db_worker: INFO: clean_db_hashes: 137: Extracted and cleaned 35 hashes from listed_ads table
2021-07-23 17:13:28,454: db_worker: INFO: categorize_hashes: 148: Categorizing hashes based on listed_ads table hashes and and new df hashes
2021-07-23 17:13:28,456: db_worker: INFO: get_data_frame_hashes: 93: Extracted 35 hashes from pandas data frame
2021-07-23 17:13:28,467: db_worker: INFO: clean_db_hashes: 137: Extracted and cleaned 35 hashes from listed_ads table

Fix incorrect handling when inserting new messages that are less than 1 day old

Traceback (most recent call last):
File "./app.py", line 18, in
import db_worker
File "/home/ec2-user/sslv_web_scraper/sslv_web_scraper/db_worker.py", line 549, in
db_worker_main()
File "/home/ec2-user/sslv_web_scraper/sslv_web_scraper/db_worker.py", line 74, in db_worker_main
update_dlv_in_db_table(to_increment_msg_data, todays_date)
File "/home/ec2-user/sslv_web_scraper/sslv_web_scraper/db_worker.py", line 461, in update_dlv_in_db_table
if int(correct_dlv) > days_listed:
ValueError: invalid literal for int() with base 10: '21:53:32.177067'
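The traceback shows that ads listed for under 24 hours carry a time string (e.g. '21:53:32.177067') where an integer day count is expected. One hedged fix, using a hypothetical helper around the int() conversion, is to treat such values as 0 days listed:

```python
def parse_days_listed(raw: str) -> int:
    """Hypothetical helper: ads listed for less than 24 hours report a
    time string like '21:53:32.177067' instead of an integer day count;
    treat those as 0 days listed."""
    try:
        return int(raw)
    except ValueError:
        return 0
```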

Add debug logging of all new messages in dbworker.log

db_worker: INFO: compare_df_to_db_hashes: 164: Result 15 new, 29 still_listed, 1 to_remove hashes
We need to log exactly which 15 ads were new.

It seems suspicious that there were 15 new ads, yet I don't have 15 ads with the same listed date; they may have been edited without being listed the same day.

This could be triaged by comparing two consecutive daily database backups.

Fix: table insert fails for all rows if a listed-days value is below 1 day (a time string instead of a day count)

2021-07-25 23:17:30,881: db_worker: INFO: insert_data_to_listed_table: 236: 15
2021-07-25 23:17:30,881: db_worker: INFO: insert_data_to_listed_table: 236: 19
2021-07-25 23:17:30,882: db_worker: INFO: insert_data_to_listed_table: 236: 23:17:30.748172
2021-07-25 23:17:30,882: db_worker: ERROR: insert_data_to_listed_table: 262: invalid input syntax for integer: "23:17:30.748172"
LINE 12: ...00, 35.0, 971.43, 'Skolas iela 1b', '2021.07.25', '23:17:30....

FEAT: Implement logging for formater, cleaner, file_remover

INFO: Started server process [1]

From the ws Docker console log:
Debug info: Started dat_formater module ... <<< need to improve
Debug info: Ended data_formater module ...
Debug info: Starting data frame cleaning module ...
Debug info: Completed dat_formater module ...
Error: 1_rooms_tmp.txt : No such file or directory
Error: Mailer_report.txt : No such file or directory
Error: basic_price_stats.txt : No such file or directory
Error: 1-4_rooms.png : No such file or directory
Error: 1_rooms.png : No such file or directory
Error: 2_rooms.png : No such file or directory
Error: test.png : No such file or directory
Error: mrv2.txt : No such file or directory
Error: Ogre_city_report.pdf : No such file or directory
2022-01-06 16:16:26,071 [MainThread ] [INFO ] : serve: 84: Started server process [1]
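Besides logging, the "No such file or directory" errors above suggest the file_remover step could guard each removal instead of erroring on missing files. A minimal sketch with a hypothetical helper:

```python
import os

def remove_if_exists(path: str) -> bool:
    """Hypothetical file_remover helper: skip missing files quietly
    instead of emitting 'No such file or directory' errors."""
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```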

BUG: ERROR: Exception in ASGI application in web_scraper container

INFO: 192.168.176.1:57528 - "GET / HTTP/1.1" 200 OK
INFO: 192.168.176.1:57528 - "GET /favicon.ico HTTP/1.1" 404 Not Found
DEBUG: sleeping 90 sec ... waiting for srape task to complete
DEBUG: sleeping 5 sec .. waiting for dataformater task to complete
DEBUG: sleeping 3 sec
DEBUG: sleeping 5 sec
INFO: 192.168.176.4:43588 - "GET /run-task/ogre HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
result = await app(self.scope, self.receive, self.send)
File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/fastapi/applications.py", line 208, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/applications.py", line 112, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in call
await self.app(scope, receive, sender)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 656, in call
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 64, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/starlette/responses.py", line 142, in call
await self.background()
File "/usr/local/lib/python3.8/site-packages/starlette/background.py", line 35, in call
await task()
File "/usr/local/lib/python3.8/site-packages/starlette/background.py", line 20, in call
await run_in_threadpool(self.func, *self.args, **self.kwargs)
File "/usr/local/lib/python3.8/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/usr/local/lib/python3.8/site-packages/anyio/to_thread.py", line 28, in run_sync
return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 754, in run
result = context.run(func, *args)
File "/./app/wsmodules/web_scraper.py", line 57, in scrape_website
exit(0)
File "/usr/local/lib/python3.8/_sitebuiltins.py", line 26, in call
raise SystemExit(code)
SystemExit: 0
Debug info: Starting website parsing module ...
Checking if job: scrape OGRE apartments has run today
Job did run today state: True
--- Finished ws_worker module because job was run today state: true ---
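The traceback points at exit(0) in web_scraper.py line 57: a SystemExit raised inside a Starlette background task propagates up and surfaces as "Exception in ASGI application". A sketch of the likely fix (the helper names and return values are hypothetical) is to return from the task function instead of exiting the process:

```python
def job_already_ran_today(city: str) -> bool:
    # Hypothetical stand-in for the repo's "job did run today" state check
    return True

def scrape_website(city: str) -> str:
    # Returning from the function (instead of calling exit(0)) avoids
    # raising SystemExit inside the Starlette background task, which is
    # what surfaces as "ERROR: Exception in ASGI application".
    if job_already_ran_today(city):
        return "skipped: job already ran today"
    return f"scraped {city}"
```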

Fix bug FileNotFoundError: [Errno 2] No such file or directory: 'cleaned-sorted-df.csv'

./db_worker.py
DEBUG: Satrting db_worker module ...
DEBUG: Loaded cleaned-sorted-df.csv to dataframe in memory ...
Traceback (most recent call last):
File "./db_worker.py", line 145, in
db_worker_main()
File "./db_worker.py", line 36, in db_worker_main
df_hashes = get_data_frame_hashes('cleaned-sorted-df.csv')
File "./db_worker.py", line 54, in get_data_frame_hashes
df = pd.read_csv(df_filename)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 948, in init
self._make_engine(self.engine)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Users/vfedotovs/Library/Python/3.8/lib/python/site-packages/pandas/io/parsers.py", line 2010, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.cinit
File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: 'cleaned-sorted-df.csv'
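Since the failure happens deep inside pandas' read_csv, a guard with a clearer message could make the missing-file case easier to diagnose (hypothetical helper; the pipeline-stage hint in the message is an assumption):

```python
import os

def require_file(filename: str) -> None:
    """Hypothetical guard: raise a descriptive error before handing the
    file to pandas, instead of the deep read_csv traceback above."""
    if not os.path.exists(filename):
        raise FileNotFoundError(
            f"{filename} is missing; run the web_scraper and cleaner "
            "stages first to produce it")
```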

BUG: container keeps crashing if an inserted ad's listed time is less than 24 hours

File "", line 219, in _call_with_frames_removed
File "/./app/main.py", line 7, in
from app.wsmodules.db_worker import db_worker_main
File "/./app/wsmodules/db_worker.py", line 547, in
db_worker_main()
File "/./app/wsmodules/db_worker.py", line 75, in db_worker_main
update_dlv_in_db_table(to_increment_msg_data, todays_date)
File "/./app/wsmodules/db_worker.py", line 459, in update_dlv_in_db_table
if int(correct_dlv) > days_listed:
ValueError: invalid literal for int() with base 10: '19:42:31.992576'

Reduce log event count in task scheduler by 10x

2022-01-16 21:18:12,243 [MainThread ] [INFO ] : : 49: ts_loop: checking every 300 sec if cheduled task needs to run again...
2022-01-16 21:23:12,327 [MainThread ] [INFO ] : : 49: ts_loop: checking every 300 sec if cheduled task needs to run again...
2022-01-16 21:28:12,425 [MainThread ] [INFO ] : : 49: ts_loop: checking every 300 sec if cheduled task needs to run again...

From once every 5 min (300 sec) to once every 50 min (3000 sec).
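The arithmetic behind the reduction: raising the check interval from 300 sec to 3000 sec cuts the "checking..." log events roughly tenfold, as a small sketch shows (the constant name is an assumption):

```python
CHECK_INTERVAL_SEC = 3000  # was 300: once per 50 min instead of every 5 min

def events_per_day(interval_sec: int) -> int:
    # Number of scheduler-check log lines emitted in 24 hours
    return 86_400 // interval_sec
```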

Categorize hashes into 3 categories

  • new hashes (for insert to listed_ads table)
  • seen hashes but not delisted yet (increment listed days value)
  • delisted hashes (for insert to delisted_ads and remove from listed_ads table)
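The three-way split maps naturally onto set operations; a minimal sketch (the function signature is an assumption, and the real module works against the database):

```python
def categorize_hashes(df_hashes, db_hashes):
    """Split scraped-ad hashes against listed_ads table hashes."""
    new = set(df_hashes) - set(db_hashes)           # insert into listed_ads
    still_listed = set(df_hashes) & set(db_hashes)  # increment listed-days
    delisted = set(db_hashes) - set(df_hashes)      # move to delisted_ads
    return new, still_listed, delisted
```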

FEAT: Add Makefile

Create a Makefile with shortcut targets such as:

  • docker compose up
  • clean unused Docker images
  • pytest
  • flake8
  • docker build and push
  • deploy to AWS

Restore daily email functionality with report in attachment

Goal: the daily email must be sent via the SendGrid email API.

Subgoals:

  1. A text report should be included in the email (limited to single- and double-room apartments)
  2. The analytics module should run and create a PDF report
  3. The email should include the PDF report as an attachment plus the text (not limited to single- or double-room apartments)
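The attachment half of the third subgoal can be sketched with the standard library's EmailMessage. The project sends through SendGrid's API, but building a PDF attachment follows the same structure; the helper name and subject line here are assumptions:

```python
import os
from email.message import EmailMessage

def build_report_email(pdf_path: str, body_text: str) -> EmailMessage:
    """Sketch of subgoal 3: text body plus the PDF report attached.
    Hypothetical helper; subject and names are not from the repo."""
    msg = EmailMessage()
    msg["Subject"] = "Daily ss.lv report"
    msg.set_content(body_text)
    with open(pdf_path, "rb") as f:
        msg.add_attachment(f.read(), maintype="application",
                           subtype="pdf",
                           filename=os.path.basename(pdf_path))
    return msg
```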
