lorey / mlscraper

🤖 Scrape data from HTML websites automatically by just providing examples

Home Page: https://pypi.org/project/mlscraper/
Often, users do not want to match the full attribute or text of a node, but a specific substring.
Solutions:
Rather too narrow than too broad, e.g. bs4's text= could cause trouble with older versions.
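One way the substring idea could look (a hypothetical sketch, not an existing mlscraper API): find the sample value inside a node's text, and turn the surrounding prefix and suffix into a regex rule that can be reapplied on other pages.

```python
import re

# Hypothetical sketch (not part of mlscraper): match a sample value as a
# substring of a node's text and derive a prefix/suffix rule that can be
# reapplied to other pages.
def derive_substring_rule(node_text, value):
    start = node_text.find(value)
    if start == -1:
        return None  # value is not a substring of this node's text
    prefix = node_text[:start]
    suffix = node_text[start + len(value):]
    return re.compile(re.escape(prefix) + r"(.*?)" + re.escape(suffix) + r"$")

rule = derive_substring_rule("Case Number: BC211612", "BC211612")
print(rule.match("Case Number: BC999999").group(1))  # -> BC999999
```

A real implementation would additionally need to decide how much of the prefix/suffix to keep, since overly long contexts will not generalize across pages.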
I want to save the model so I can reuse it next time.
I can't find how to do this in the examples.
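There is no official persistence API documented, but assuming a trained scraper is a plain Python object, pickle may work as a workaround. This is an assumption, not a confirmed mlscraper feature:

```python
import pickle

# Assumption: a trained mlscraper scraper is a plain Python object, so
# pickle may be able to serialize it. This is a workaround sketch, not an
# official mlscraper persistence API.
def save_scraper(scraper, path):
    with open(path, "wb") as f:
        pickle.dump(scraper, f)

def load_scraper(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

Usage would be `save_scraper(scraper, "scraper.pkl")` after training, then `scraper = load_scraper("scraper.pkl")` in a later session. The usual pickle caveats apply (same library version, trusted files only).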
People want to extract proper integers; a straightforward way would be to implement items and extractors that return integers.
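Such an extractor could be a thin post-processing step over the string values the scraper already returns. A hypothetical sketch (nothing here is an existing mlscraper API):

```python
import re

# Hypothetical sketch: a post-processing step that turns scraped strings
# into proper integers. Nothing here is an existing mlscraper API.
def extract_int(text):
    """Pull the first integer out of scraped text, ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no integer found in {text!r}")
    return int(match.group().replace(",", ""))

print(extract_int("  5,190 views "))  # -> 5190
```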
Does not return results for sites like bbc.com, dnevnik.bg, etc., even when I try to scrape only the titles of the articles.
The page does contain this value 5190. Is the failure caused by spaces or blank lines in the page's data? How can this be handled; can blank lines be ignored so that only the data is extracted?
My code:
einstein_url = 'http://www.i001.com/main1.shtml'
resp = requests.get(einstein_url)
assert resp.status_code == 200
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': '5190'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)
Error returned:
ValueError:
5190
</td> is not in list
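A possible workaround (an assumption, not an mlscraper feature) is to collapse the whitespace around tags in the raw HTML before constructing the Page, so node text like "\n  5190\n" matches the sample value '5190' exactly:

```python
import re

# Workaround sketch: collapse whitespace next to tags in the raw HTML
# before building the Page, so "\n  5190\n" becomes "5190" and matches
# the sample value exactly. Note this can mangle <pre> blocks, so apply
# with care.
def collapse_whitespace(html: bytes) -> bytes:
    return re.sub(rb">\s+|\s+<", lambda m: m.group().strip(), html)

print(collapse_whitespace(b"<td>\n  5190\n</td>"))  # -> b'<td>5190</td>'
```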
ModuleNotFoundError: No module named 'mlscraper.html'; 'mlscraper' is not a package. I have installed the package using pip install --pre 'mlscraper==1.0.0rc3' in a conda env.
Options:
Does mlscraper still work? I cannot get it to run (not even the sample code). I always get a ModuleNotFoundError:
ModuleNotFoundError: No module named 'mlscraper.html'
import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper
# fetch the page to train
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
resp = requests.get(einstein_url)
assert resp.status_code == 200
# create a sample for Albert Einstein
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
training_set.add_sample(sample)
# train the scraper with the created training set
scraper = train_scraper(training_set)
# scrape another page
resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(resp.content))
print(result)
# returns {'name': 'J.K. Rowling', 'born': 'July 31, 1965'}
The package is definitely installed, though:
Or what am I missing?
Specifically for text matching, something fuzzy would be great to reduce errors, e.g. checking the similarity of long texts to avoid whitespace-based mismatches.
Options
It also needs to be considered when checking for correctness later, since scraper.get(page) == expected_result could turn out to be false.
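A fuzzy comparison could be as simple as normalizing whitespace and applying a similarity ratio. A sketch using the standard library (not part of mlscraper):

```python
from difflib import SequenceMatcher

# Sketch of a fuzzy comparison (not part of mlscraper) that tolerates
# whitespace noise when checking scraped values against expected ones.
def texts_match(expected, found, threshold=0.95):
    a = " ".join(expected.split())  # collapse runs of whitespace
    b = " ".join(found.split())
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(texts_match("Albert  Einstein\n", "Albert Einstein"))  # -> True
```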
If the code changes, examples break. Maybe there's a way to test example code easily during automated testing.
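One possible approach: extract the fenced Python blocks from the README and execute them in a test, so documentation examples fail loudly when the API changes. A minimal sketch (the README format is an assumption):

```python
import re

# Sketch: pull fenced python blocks out of README.md and exec them in a
# test, so documentation examples fail loudly when the API changes.
def extract_code_blocks(markdown):
    return re.findall(r"```python\n(.*?)```", markdown, re.DOTALL)

readme = "# demo\n```python\nx = 1 + 1\n```\n"
for block in extract_code_blocks(readme):
    exec(block)  # raises if the example code is broken
```

Existing tools such as doctest cover the same idea for docstring examples.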
Found solutions:
Currently, the rule-based scraper tries all potential CSS selectors at once. It would be more performant to increase the CSS selector complexity step by step: first try single-node rules like div.item, and only fall back to more complex rules like .menu > div.item.company if the simpler rules don't work.
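The stepwise idea could be sketched as a generator that yields candidate selectors ordered by complexity (an illustration, not mlscraper's actual implementation):

```python
import itertools

# Sketch: generate candidate selectors ordered by complexity, so cheap
# single-node rules are tried before expensive descendant combinations.
# Not mlscraper's actual implementation.
def selectors_by_complexity(node_selectors, max_depth=3):
    for depth in range(1, max_depth + 1):
        for combo in itertools.product(node_selectors, repeat=depth):
            yield " > ".join(combo)
```

Training would consume this generator lazily and stop at the first complexity level that yields a working rule.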
Showing progress is not easy, but it should somehow be possible, to give users a sense of how long training might take.
import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper
# fetch the page to train
einstein_url = 'https://github.com/lorey/mlscraper/issues/38'
resp = requests.get(einstein_url)
assert resp.status_code == 200
# create a sample for Albert Einstein
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'title': 'Scraper not found error'})
training_set.add_sample(sample)
# train the scraper with the created training set
scraper = train_scraper(training_set)
# scrape another page
resp = requests.get('https://github.com/lorey/mlscraper/issues/27')
result = scraper.get(Page(resp.content))
print(result)
Maybe remove samples altogether and just let Matches deal with extraction? Or just use samples on the surface level?
Currently, mlscraper has issues scraping Spiegel Online's authors if defined as a set. See https://gist.github.com/lorey/fdb88d6c8e41b9b6bc8df264cffc68e1
The following code,
training_file = BeautifulSoup("<p>with a question mark?</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {
'title': 'with a question mark?'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)
throws this error:
mlscraper.samples.NoMatchFoundException: No match found on page (self.page=<Page self.soup.name='[document]' classes=None, text=with a que...>, self.value='with a question mark?')
But the same code works if the question mark is removed from the HTML:
training_file = BeautifulSoup("<p>with a question mark</p>", features="lxml").prettify()
training_set = TrainingSet()
page = Page(training_file)
sample = Sample(page, {
'title': 'with a question mark'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)
Gave this a try :-)
Feedback:
mlscraper.html is missing from the PyPI package.
mlscraper.training.NoScraperFoundException: did not find scraper
import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper
jonas_url = "https://github.com/jonashaag"
resp = requests.get(jonas_url)
resp.raise_for_status()
page = Page(resp.content)
sample = Sample(
page,
{
"name": "Jonas Haag",
"followers": "329", # Note that this doesn't work if 329 passed as an int.
#'company': '@QuantCo', # Does not work.
"twitter": "@_jonashaag", # Does not work without the "@".
"username": "jonashaag",
"nrepos": "282",
},
)
training_set = TrainingSet()
training_set.add_sample(sample)
scraper = train_scraper(training_set)
resp = requests.get("https://github.com/lorey")
result = scraper.get(Page(resp.content))
print(result)
This is the code:
import logging
import requests
from mlscraper import SingleItemPageSample, RuleBasedSingleItemScraper
items = {
"https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array": {
"title": "Why is processing a sorted array faster than processing an unsorted array?"
},
"https://stackoverflow.com/questions/927358/how-do-i-undo-the-most-recent-local-commits-in-git": {
"title": "How do I undo the most recent local commits in Git?"
},
"https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do": {
"title": "What does the “yield” keyword do?"
},
}
results = {url: requests.get(url) for url in items.keys()}
# train scraper
samples = [
SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
]
scraper = RuleBasedSingleItemScraper.build(samples)
print("Scraping new question")
html = requests.get(
"https://stackoverflow.com/questions/2003505/how-do-i-delete-a-git-branch-locally-and-remotely"
).content
result = scraper.scrape(html)
print("Result: %s" % result)
Output
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-9f646dab1fca> in <module>()
24 SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
25 ]
---> 26 scraper = RuleBasedSingleItemScraper.build(samples)
27
28 print("Scraping new question")
4 frames
/usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in build(samples)
89 matches_per_page_right = [
90 len(m) == 1 and m[0].get_text() == s.item[attr]
---> 91 for m, s in zip(matches_per_page, samples)
92 ]
93 score = sum(matches_per_page_right) / len(samples)
/usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <listcomp>(.0)
88 matches_per_page = (s.page.select(selector) for s in samples)
89 matches_per_page_right = [
---> 90 len(m) == 1 and m[0].get_text() == s.item[attr]
91 for m, s in zip(matches_per_page, samples)
92 ]
/usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <genexpr>(.0)
86 if selector not in selector_scoring:
87 logging.info("testing %s (%d/%d)", selector, i, len(selectors))
---> 88 matches_per_page = (s.page.select(selector) for s in samples)
89 matches_per_page_right = [
90 len(m) == 1 and m[0].get_text() == s.item[attr]
/usr/local/lib/python3.7/dist-packages/mlscraper/parser.py in select(self, css_selector)
28 def select(self, css_selector):
29 try:
---> 30 return [SoupNode(res) for res in self._soup.select(css_selector)]
31 except NotImplementedError:
32 logging.warning(
/usr/local/lib/python3.7/dist-packages/bs4/element.py in select(self, selector, _candidate_generator, limit)
1495 if tag_name == '':
1496 raise ValueError(
-> 1497 "A pseudo-class must be prefixed with a tag name.")
1498 pseudo_attributes = re.match(r'([a-zA-Z\d-]+)\(([a-zA-Z\d]+)\)', pseudo)
1499 found = []
ValueError: A pseudo-class must be prefixed with a tag name.
I am trying to scrape the Case Number from the following HTML File.
mlscraper: pip install --pre mlscraper
python: 3.9
import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper
HTMLFile = open("/content/PUS06037-BC2116122017-06-12 14_07_29.976088.txt", "r")
index = HTMLFile.read()
training_set = TrainingSet()
index = index.replace(u'\xa0', u' ')
page = Page(index)
sample = Sample(page, {'Filing Date:': '06/08/1999','Case Number:': 'BC211612'})
training_set.add_sample(sample)
scraper = train_scraper(training_set)
NoScraperFoundException Traceback (most recent call last)
in <cell line: 20>()
18
19 # train the scraper with the created training set
---> 20 scraper = train_scraper(training_set)
21
22 # scrape another page
/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in train_scraper(training_set, complexity)
72 f"({complexity=}, {match_combination=})"
73 )
---> 74 raise NoScraperFoundException("did not find scraper")
75
76
NoScraperFoundException: did not find scraper
Scrapers should be able to deal with URLs, HTML, and parsed DOMs (even requests response objects?) to enable flexible library usage.
scrape_soup, scrape_html, scrape_url, SingleItemPageSample.from_soup, etc.

Even though response.status_code is 200, can we still train the model based on manually extracted content from a website that has anti-scraping measures? (I am a beginner)
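The flexible-input idea above (scrape_soup, scrape_html, scrape_url) could also be a single dispatcher that inspects its input. A sketch with injected fetch/parse hooks, standing in for requests.get and mlscraper's Page, so none of these names are real mlscraper APIs:

```python
# Sketch of a flexible scrape() entry point; the fetch/parse hooks stand
# in for requests.get and mlscraper.html.Page. They are injected so the
# dispatcher itself stays dependency-free.
def scrape(scraper, source, *, fetch, parse):
    if isinstance(source, str) and source.startswith(("http://", "https://")):
        source = fetch(source)   # URL -> raw HTML
    if isinstance(source, (str, bytes)):
        source = parse(source)   # raw HTML -> parsed page
    return scraper.get(source)   # parsed page -> result
```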
Currently, we just use the next best selector we find, going from generic to specific. But overly generic selectors are bad, e.g. div most likely carries no meaning; on the other hand, overly specific selectors like the full path will likely break.
Maybe there's a heuristic for good selectors. An idea:
What if we computed selectivity for each selector, i.e. how unique the selector is on the whole page? This would prefer ids and unique classes and discourage generic selectors. We would then take the most selective but simplest selector.
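The heuristic could be sketched like this; `page` is any object exposing a CSS `.select()` method (e.g. a BeautifulSoup document), and nothing here is mlscraper's actual scoring code:

```python
# Sketch of the selectivity heuristic; `page` is any object with a CSS
# .select() method (e.g. a BeautifulSoup document). Not mlscraper's
# actual scoring code.
def selectivity(selector, page):
    matches = page.select(selector)
    return 1 / len(matches) if matches else 0.0  # 1.0 == unique on the page

def best_selector(candidates, page):
    # Most selective selector wins; ties go to the shortest (simplest) one.
    return max(candidates, key=lambda s: (selectivity(s, page), -len(s)))
```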
Same interfaces should produce same results -> same test set for same types of scrapers
Training uses unbounded RAM; the program gets killed by the system during high RAM usage, and it even crashed my Google Colab with 12 GB of RAM.
mlscraper==1.0.0rc3
This example from the README unfortunately does not work. Perhaps I'm doing something wrong.
Example:
import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper
# fetch the page to train
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
resp = requests.get(einstein_url)
assert resp.status_code == 200
# create a sample for Albert Einstein
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
training_set.add_sample(sample)
# train the scraper with the created training set
scraper = train_scraper(training_set)
# scrape another page
resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(resp.content))
print(result)
Error:
File ~/miniconda3/envs/colbert/lib/python3.8/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:133, in _make_cell_set_template_code()
116 return types.CodeType(
117 co.co_argcount,
118 co.co_nlocals,
(...)
130 (),
131 )
132 else:
--> 133 return types.CodeType(
134 co.co_argcount,
135 co.co_kwonlyargcount,
136 co.co_nlocals,
137 co.co_stacksize,
138 co.co_flags,
139 co.co_code,
140 co.co_consts,
141 co.co_names,
142 co.co_varnames,
143 co.co_filename,
144 co.co_name,
145 co.co_firstlineno,
146 co.co_lnotab,
147 co.co_cellvars, # this is the trickery
148 (),
149 )
TypeError: an integer is required (got type bytes)
Rule extraction can be separated from scrapers.
Rule extraction:
scraper:
I followed the README and tested the code after pip install --pre mlscraper, but got a ModuleNotFoundError:
from mlscraper.html import Page
ModuleNotFoundError: No module named 'mlscraper.html'
Checking the installed library, only the following files were present:
ml.py parser.py training.py util.py
For people checking out the library, it would be convenient if the README listed all required imports:
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper