prospect-scraper-mddb-2022's Introduction

leagify

Leagify - complete with new name!

Most of the code is in the repository with the old name Leagueify

prospect-scraper-mddb-2022's People

Contributors

ashwin003, christian-oleson, iotalex, zo0o0ot

prospect-scraper-mddb-2022's Issues

Update program to take an array of drafts to scrape from scraper.conf

Currently, the program reads this line in scraper.conf to determine what draft year should be scraped:

YearToScrape = 2022

The year is loaded in the program here:

var scraperConfig = new Configuration();
scraperConfig = Configuration.LoadFromFile("scraper.conf");
var pageSection = scraperConfig["Pages"];
var generalSection = scraperConfig["General"];
AnsiConsole.Status()
    .Start("Thinking...", ctx =>
    {
        ctx.Spinner(Spinner.Known.Star);
        var webGet = new HtmlWeb();
        webGet.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0";
        string scrapeYear = generalSection["YearToScrape"].StringValue;
        string urlToScrape = scrapeYear + "Url";
        var document = webGet.Load(pageSection[urlToScrape].StringValue);

It would be nice to read this setting as an array, then run the scraping job once for each value in the array.

An example of how an array is stored in the configuration file is right here:

SomeArray = { 1, 2, 3 } # Potentially use this when I start doing multiple years

The configuration uses SharpConfig to read values.
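
A minimal sketch of how the per-year loop might look, assuming a hypothetical setting named YearsToScrape and keeping the existing "<year>Url" naming for the Pages section. The class and method names are made up for illustration, and the URL lookup is passed in as a delegate so the sketch does not depend on any particular configuration library:

```csharp
using System;
using System.Collections.Generic;

public static class ScrapeTargets
{
    // Builds one (year, url) pair per configured draft year.
    // urlLookup stands in for reading a key from the [Pages] section.
    public static List<(string Year, string Url)> Build(
        IEnumerable<int> years, Func<string, string> urlLookup)
    {
        var targets = new List<(string Year, string Url)>();
        foreach (int year in years)
        {
            // Same key convention the program already uses: "2022" + "Url".
            string urlKey = year + "Url";
            targets.Add((year.ToString(), urlLookup(urlKey)));
        }
        return targets;
    }
}
```

With SharpConfig, the years could probably come from generalSection["YearsToScrape"].IntValueArray and the lookup could be key => pageSection[key].StringValue. Each resulting pair would then be passed to the existing scraping job in turn.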

Update .NET version used in dev container and in GitHub actions

.NET is getting updated more often. This repo currently uses .NET 6 for both the runtime and the GitHub Action that verifies that everything builds properly in pull requests.
Example of the PR action: [image]

Here's the GitHub Actions code:

name: .NET 6 Build & Test
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Setup .NET 6
      uses: actions/setup-dotnet@v1
      with:
        dotnet-version: 6.0.x
    - name: Restore Dependencies
      run: dotnet restore
    - name: Build
      run: dotnet build --no-restore
    - name: Test
      run: dotnet test --no-build --verbosity normal

Here's the commit that changed the action from .NET 5 to .NET 6: 38dd342

It would be nice to change the devcontainer as well.
Here's the commit that changed the devcontainer from .NET 5 to .NET 6: 7662f93

It seems that .NET 8 will come soon. Maybe November 13-14?
https://devblogs.microsoft.com/dotnet/announcing-dotnet-8-rc1/

We should move to .NET 7 for now, then make similar changes for .NET 8 once version 8 is officially released.

Format change - scrape crash.

An exception of type 'System.NullReferenceException' occurred in prospect-scraper-mddb-2022.dll but was not handled in user code: 'Object reference not set to an instance of an object.'
at prospect_scraper_mddb_2022.Extensions.HtmlNodeCollectionExtensions.FindProspects(HtmlNodeCollection nodes, String todayString, Dictionary`2& schoolImages) in /workspaces/prospect-scraper-mddb-2022/Extensions/HtmlNodeCollectionExtensions.cs:line 56
at prospect_scraper_mddb_2022.Extensions.StatusContextExtensions.ScrapeYear(StatusContext ctx, HtmlWeb webGet, String scrapeYear, String urlToScrape) in /workspaces/prospect-scraper-mddb-2022/Extensions/StatusContextExtensions.cs:line 77
at prospect_scraper_mddb_2022.Program.<>c__DisplayClass0_0.b__0(StatusContext ctx) in /workspaces/prospect-scraper-mddb-2022/Program.cs:line 30
at Spectre.Console.Status.<>c__DisplayClass14_0.b__0(StatusContext ctx) in //src/Spectre.Console/Live/Status/Status.cs:line 47
at Spectre.Console.Status.<>c__DisplayClass16_0.<b__0>d.MoveNext() in //src/Spectre.Console/Live/Status/Status.cs:line 82

Add better logging - clean up console output.

The previous version of this program was https://github.com/Leagify/prospect-scraper-dt2021.

In that program, console output for "chatty" stuff was handled better, although it was done by manually writing to files outside of any official logging framework, I think.

The goal here: Send the chatty output to a log file that is overwritten on each run, so the most recent log contains only the info from the last run.

Manual writing to files is probably OK, but if you're looking for a logging framework to use, here is an article discussing possible options.

https://stackify.com/nlog-vs-log4net-vs-serilog/
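
If manual file writing is kept, the "only the last run" behavior could be sketched roughly as below. This is an illustration only: the RunLog class and its Info method are hypothetical names, not part of the existing program, and the sketch uses nothing beyond System.IO.

```csharp
using System;
using System.IO;

public sealed class RunLog : IDisposable
{
    private readonly StreamWriter _writer;

    public RunLog(string path)
    {
        // append: false truncates any existing file, so each run starts
        // with an empty log and only the latest run's output survives.
        _writer = new StreamWriter(path, append: false) { AutoFlush = true };
    }

    // Writes one timestamped line to the log file instead of the console.
    public void Info(string message)
    {
        _writer.WriteLine($"{DateTime.Now:HH:mm:ss} {message}");
    }

    public void Dispose() => _writer.Dispose();
}
```

The chatty messages would go through Info, leaving the console free for the status spinner and a short summary. A logging framework would add levels and rolling files, at the cost of one more dependency.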

School/state mismatches for 2/14 ranks.

I don't see any record of Culver Stockton College.
Missouri Western should be changed to Missouri Western State.
I don't see any record of West Florida.

Maybe these are JUCOs?

Get to the bottom of this before committing these ranks.

Some Positional data is wrong.

Ben Brown of Ole Miss appears to be one of the offenders.

Darnell Jefferies appears to be the other.

I'm not sure whether this issue is happening at the scraping level or was introduced at some later point.

Looking at Ben Brown, I think the issue might just be in the 12-31 ranks (Line 410):

409,102,Ben Brown,Ole Miss,IOL Mississippi,2021-12-31,,,Mississippi,SEC,1

Darnell Jefferies's problems seem to go back potentially all the way to the beginning.

207,183,Darnell Jefferies,Clemson,DL Clemson,2021-07-21,,,South Carolina,ACC,5

Looking at the older ranks, these may not be the only players with this positional issue.

This makes me think that when I'm scraping, I'm getting an element that sometimes has more than one thing inside of it. I'll need to debug to verify.

There are two steps to this:

  • Fix the data
  • Fix the underlying issue, possibly by having the scraper check for position values that contain extra text like the ones above.
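
Both steps could lean on a small check like the one sketched here. It assumes that the real position is always the first whitespace-separated token of the scraped value (as in "IOL Mississippi" and "DL Clemson"), which still needs to be verified by debugging. The PositionCleaner class is a hypothetical helper, not existing code.

```csharp
using System;

public static class PositionCleaner
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // True when the scraped value holds more than one token,
    // e.g. "IOL Mississippi" instead of "IOL".
    public static bool HasExtraText(string rawPosition)
    {
        return rawPosition
            .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
            .Length > 1;
    }

    // Assumption: the position is the first token; anything after it is noise.
    public static string FirstToken(string rawPosition)
    {
        string[] parts = rawPosition
            .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);
        return parts.Length > 0 ? parts[0] : string.Empty;
    }
}
```

Running HasExtraText over the position column of the existing rank files would list the rows that need fixing, and calling FirstToken (or logging a warning) during scraping would keep new rows clean.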

Verify local time before putting a date on the ranks. Date sometimes one day into the future.

One issue that I've noticed is that when scraping, sometimes I'll end up with a date that's in the future.

I assume that at some point I'm using GMT for the time, and I'm working in the evening, so GMT shows a date that is technically tomorrow.

For example, it's still 10/2 where I am, but the scraper assigned the date of 10/3 to the most recent scrape from #60.

It would be nice not to have the future date, and to instead have the date from the local time, if possible.

My thought is that it may be happening here:

// Get today's date in the format of yyyy-mm-dd
string today = DateTime.Now.ToString("yyyy-MM-dd");
// Create a ConsensusBigBoardInfo object from the parsed values.
var bigBoardInfo = new ConsensusBigBoardInfo(today, bigBoards, mockDrafts, teamMockDrafts, bigBoard.Count, lastUpdated);

I'm not 100% sure, though.
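
One possible explanation, offered as an assumption: DateTime.Now returns the local time of the machine running the code, so if the scraper runs in a container whose time zone is UTC (the /workspaces paths in the stack trace above suggest a dev container), "local" time is effectively GMT. A sketch of pinning the date to an explicit zone follows. The ScrapeDate class is a hypothetical helper, and "America/Chicago" in the usage note is only a placeholder zone ID.

```csharp
using System;
using System.Globalization;

public static class ScrapeDate
{
    // Converts a UTC instant to the given time zone and formats the
    // resulting calendar date as yyyy-MM-dd.
    public static string ForZone(DateTime utcNow, string timeZoneId)
    {
        TimeZoneInfo zone = TimeZoneInfo.FindSystemTimeZoneById(timeZoneId);
        DateTime local = TimeZoneInfo.ConvertTimeFromUtc(utcNow, zone);
        return local.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
    }
}
```

The existing line could then become something like string today = ScrapeDate.ForZone(DateTime.UtcNow, "America/Chicago"); with the zone ID ideally read from scraper.conf rather than hard-coded.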
