danieldotnl / ha-multiscrape

Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.

License: MIT License


ha-multiscrape's Introduction

HA Multiscrape


Important note: troubleshooting

If you don't manage to scrape the value you are looking for, please enable debug logging and log_response. This will provide you with a lot of information for continued investigation. log_response will write all responses to files. If the value you want to scrape is not in the files with the output from BeautifulSoup (*-soup.txt), Multiscrape will not be able to scrape it. Most likely it is retrieved in the background by JavaScript. Your best chance in this case is to investigate the network traffic in the developer tools of your browser, and try to find a JSON response containing the value you are looking for.

If all of this doesn't help, use the Home Assistant forum. I cannot give everyone personal assistance, so please don't create GitHub issues unless you are sure there is a bug. Check the wiki for a scraping guide and other details on the functionality of this component.
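
For reference, log_response is a single option on the scraper configuration; the files end up in /config/multiscrape/<name of the config>. A minimal sketch (the URL and selector below are placeholders):

multiscrape:
  - name: Example scraper
    resource: https://example.com
    log_response: true
    sensor:
      - name: Example value
        select: ".some-value"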

Important note: be a good citizen and be aware of your responsibility

You, and you alone, are accountable for your scraping activities. Be a good (web) citizen. Set reasonable scan_interval timings, seek explicit permission before scraping, and adhere to local and international laws. Respect website policies, handle data ethically, mind resource usage, and regularly monitor your actions. Uphold these principles to ensure ethical and sustainable scraping practices.

HA MultiScrape custom component

This Home Assistant custom component can scrape multiple fields (using CSS selectors) from a single HTTP request (the existing scrape sensor can scrape a single field only). The scraped data becomes available in separate sensors.

It is based on both the existing Rest sensor and the Scrape sensor. Most properties of the Rest and Scrape sensor apply.


Installation


Install via HACS (default store), or install manually by copying the files into a new 'custom_components/multiscrape' directory.

Example configuration (YAML)

multiscrape:
  - name: HA scraper
    resource: https://www.home-assistant.io
    scan_interval: 3600
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".release-date"
        value_template: "{{ value | trim }}"
      - unique_id: ha_release_date
        icon: >-
          {% if is_state('binary_sensor.ha_version_check', 'on') %}
            mdi:alarm-light
          {% else %}
            mdi:bat
          {% endif %}
        name: Release date
        select: ".release-date"
        attribute: "title"
        value_template: "{{ (value.split('released')[1]) }}"
    binary_sensor:
      - unique_id: ha_version_check
        name: Latest version == 2021.7.0
        select: ".release-date"
        value_template: '{{ value | trim == "2021.7.0" }}'
        attributes:
          - name: Release notes link
            select: ".release-date"
            attribute: href

Options

Based on latest (pre) release.

name description required default type
name The name for the integration. False string
resource The URL for retrieving the site, or a template that will output a URL. Not required when resource_template is provided. True string
resource_template A template that will output a URL after being rendered. Only required when resource is not provided. True template
authentication Configure HTTP authentication. basic or digest. Use this with username and password fields. False string
username The username for accessing the url. False string
password The password for accessing the url. False string
headers The headers for the requests. False template - list
params The query params for the requests. False template - list
method The method for the request. Either POST or GET. False GET string
payload Optional payload to send with a POST request. False string
verify_ssl Verify the SSL certificate of the endpoint. False True boolean
log_response Log the HTTP responses and the HTML parsed by BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config) False False boolean
timeout Defines the maximum time to wait for data from the endpoint. False 10 int
scan_interval Determines how often the URL will be requested. False 60 int
parser Determines the parser to be used with BeautifulSoup. Either lxml or html.parser. False lxml string
list_separator Separator to be used in combination with select_list features. False , string
form_submit See Form-submit False
sensor See Sensor False list
binary_sensor See Binary sensor False list
button See Refresh button False list
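
As an illustration of the list options above, a sketch combining select_list and list_separator (the URL and selector are hypothetical); all matched values are joined into one sensor value using the separator:

multiscrape:
  - resource: https://example.com
    list_separator: "; "
    sensor:
      - name: Item titles
        # all elements matched by select_list are joined with the separator
        select_list: "li.item h2"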

Sensor/Binary Sensor

Configure the sensors that will scrape the data.

name description required default type
unique_id Will be used as entity_id and enables editing the entity in the UI False string
name Friendly name for the sensor False string
See Selector fields True
attributes See Sensor attributes False list
unit_of_measurement Defines the units of measurement of the sensor False string
device_class Sets the device_class for sensors or binary sensors False string
state_class Defines the state class of the sensor, if any. (measurement, total or total_increasing) (not for binary_sensor) False None string
icon Defines the icon or a template for the icon of the sensor. The value of the selector (or value_template when given) is provided as input for the template. For binary sensors, the value is parsed into a boolean. False string/template
picture Contains a path to a local image and will set it as entity picture False string
force_update Sends update events even if the value hasn’t changed. Useful if you want to have meaningful value graphs in history. False False boolean
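
A sketch of a sensor using several of the fields above (the URL and selector are hypothetical):

multiscrape:
  - resource: https://example.com
    sensor:
      - unique_id: outside_temperature
        name: Outside temperature
        select: ".temperature"
        value_template: "{{ value | float }}"
        unit_of_measurement: "°C"
        device_class: temperature
        state_class: measurement
        force_update: true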

Refresh button

Configure a refresh button to manually trigger scraping.

name description required default type
unique_id Will be used as entity_id and enables editing the entity in the UI False string
name Friendly name for the button False string
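
A minimal sketch of a refresh button added to a scraper configuration (the names are examples):

multiscrape:
  - name: HA scraper
    resource: https://www.home-assistant.io
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".release-date"
    button:
      - unique_id: ha_scraper_refresh
        name: Refresh HA scraper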

Sensor attributes

Configure the attributes on the sensor that can be set with additional scraping values.

name description required default type
name Name of the attribute (will be slugified) True string
See Selector fields True

Form-submit

Configure the form-submit functionality which enables you to submit a (login) form before scraping a site. More details on how this works can be found on the wiki.

name description required default type
resource The url for the site with the form False string
select CSS selector used for selecting the form in the html. When omitted, the input fields are directly posted. False string
input A dictionary with name/values which will be merged with the input fields on the form False string - dictionary
input_filter A list of input fields that should not be submitted with the form False string - list
submit_once Submit the form only once on startup instead of each scan interval False False boolean
resubmit_on_error Resubmit the form after a scraping error is encountered False True boolean
variables See Form Variables False list

Form Variables

Configure the variables that will be scraped from the form_submit response. These variables can be used in the main configuration of the integration: in the value_template of a selector in sensors/attributes, or in a header. A common use case is to populate an X-Login-Token header with the result of the login.

name description required default type
name Name of the variable True string
See Selector fields True

Example:

multiscrape:
  - resource: "https://somesiteyouwanttoscrape.com"
    form_submit:
      submit_once: True
      resource: "https://authforsomesiteyouwanttoscrape.com"
      input:
        email: "<email>"
        password: "<password>"
      variables:
        - name: token
          value_template: "{{ ... }}"
    headers:
      X-Login-Token: "{{ token }}"
    sensor: ...

Selector

Used to configure scraping options.

name description required default type
select CSS selector used for retrieving the value of the attribute. Only required when select_list or value_template is not provided. False string/template
select_list CSS selector for multiple values of multiple elements which will be returned as csv. Only required when select or value_template is not provided. False string/template
attribute Attribute from the selected element to read as value. False string
value_template Defines a template applied to extract the value from the result of the selector (if provided) or raw page (if selector not provided) False string/template
on_error See On-error False

On-error

Configure what should happen in case of a scraping error (the CSS selector does not return a value).

name description required default type
log Determines if and how something should be logged in case of a scraping error. Value can be either 'false', 'info', 'warning' or 'error'. False error string
value Determines what value the sensor/attribute should get in case of a scraping error. The value can be 'last', meaning that the value does not change, 'none', which results in HA showing 'Unknown' on the sensor, or 'default', which will show the specified default value. False none string
default The default value to be used when the on-error value is set to 'default'. False string
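
A sketch of a sensor with on_error configured (the selector is hypothetical): when the selector returns nothing, a warning is logged and the sensor falls back to the default value 0.

multiscrape:
  - resource: https://example.com
    sensor:
      - name: Example value
        select: ".some-value"
        on_error:
          log: warning
          value: default
          default: 0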

Services

For each multiscrape instance, a service is created so that a scrape run can be triggered from an automation. (For manual triggering, the button entity can now be configured.) The services are named multiscrape.trigger_{name of integration}.
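
For example, an automation could call the trigger service of the 'HA scraper' instance from the example above (assuming the generated service name is the slugified integration name):

automation:
  - alias: Scrape HA site every morning
    trigger:
      - platform: time
        at: "06:00:00"
    action:
      - service: multiscrape.trigger_ha_scraper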

Multiscrape also offers a get_content and a scrape service. get_content retrieves the content of the website you want to scrape. It returns the same data for which you would otherwise need to enable log_response and open the page_soup.txt file.
scrape does what it says: it scrapes a website and provides the sensors and attributes.

Both services accept the same configuration as what you would provide in your configuration YAML (described above), with a small but important caveat: if the service input contains templates, those are automatically parsed by Home Assistant when the service is called. That is fine for templates like resource and select, but templates that need to be applied to the scraped data itself (like value_template) cannot be parsed when the service is called. Therefore you need to slightly alter the syntax and add a ! in the middle. E.g. {{ becomes {!{ and %} becomes %!}. Multiscrape will then understand that this string needs to be handled as a template after the service has been called.
If someone has a better solution, please let me know!

To call one of these services, go to 'Developer tools' in Home Assistant and then to 'Services'. Find the multiscrape.get_content or multiscrape.scrape service and switch to YAML mode. There you enter your configuration. Example:

service: multiscrape.scrape
data:
  name: HA scraper
  resource: https://www.home-assistant.io
  sensor:
    - unique_id: ha_latest_version
      name: Latest version
      select: ".release-date"
      value_template: "{!{ value | trim }!}"
    - unique_id: ha_release_date
      name: Release date
      select: ".release-date"
      attribute: "title"
      value_template: "{!{ (value.split('released')[1]) }!}"

Debug logging

Debug logging can be enabled as follows:

logger:
  default: info
  logs:
    custom_components.multiscrape: debug

Depending on your issue, also consider enabling log_response.

Contributions are welcome!

If you want to contribute to this, please read the Contribution guidelines.

Credits

This project was generated from @oncleben31's Home Assistant Custom Component Cookiecutter template.

The code template was mainly taken from @Ludeeus's integration_blueprint template.


ha-multiscrape's People

Contributors

danieldotnl, dependabot[bot], hmmbob, jeremicmilan, ludeeus, noxhirsch, renovate[bot]


ha-multiscrape's Issues

Control Time of Multiscrape Reload

Version of the custom_component

5.5.50

Configuration

I have a bunch of sensors defined in this way (they work ok)

- name: CAQI Riegrove Sady
  resource: "https://www.chmi.cz/files/portal/docs/uoco/web_generator/aqindex_slide3h1/mp_ARIEA_CZ.html"
  scan_interval: 900
  sensor:
    - unique_id: caqi_time
      name: CAQI Riegrove Sady Update Time
      select: ".list-row-even:nth-child(27) td:nth-child(1)"
      value_template: "{{ relative_time(strptime( value[0:10]+value[18:-5], '%d.%m.%Y %H:%M' )) }}"
      state_class: measurement
    - unique_id: caqi_so3
      force_update: true
      name: CAQI Riegrove Sady SO3
      unit_of_measurement: " "
      state_class: measurement
      # levels: 0, 50, 100, 350, 500
      select: "#content > table >tr:nth-child(27) > td:nth-child(3)"
      value_template: >
        {% macro calc_caqi(val, range_top, range_bot, caqi_top, caqi_bot) -%}
        {{ '%0.1f'|format(caqi_bot + (caqi_top-caqi_bot)/(range_top-range_bot)*(val-range_bot)) }}
        {%- endmacro %}
        {% set levels = [500,350,100,50,0] %}
        {% set caqi_levels = [100,75,50,25,0] %}
        {% set sensor = float(value) %}
        {% if sensor >levels[0] -%}
        {{'%0.1f'|format(caqi_levels[0] + (caqi_levels[0]-caqi_levels[1])/(levels[0]-levels[1])*(val-levels[1])) }}
        {%- elif sensor >= levels[1] -%}
        {{calc_caqi(sensor,levels[3],levels[4],caqi_levels[3],caqi_levels[4]) }}
        {%- elif sensor >= levels[2] -%}
        {{calc_caqi(sensor,levels[1],levels[2],caqi_levels[1],caqi_levels[2]) }}
        {%- elif sensor >= levels[3] -%}
        {{calc_caqi(sensor,levels[2],levels[3],caqi_levels[2],caqi_levels[3]) }}
        {%- elif sensor >= levels[4] -%}
        {{calc_caqi(sensor,levels[3],levels[4],caqi_levels[3],caqi_levels[4]) }}
        {%- else -%}
        N/A
        {%- endif %}

and then I create a min_max sensor out of them

sensor:
  # Aggregator to work out overall CAQI
  - platform: min_max
    entity_ids:
      - sensor.caqi_so3
      - sensor.caqi_no2
      - sensor.caqi_pm10
      - sensor.caqi_o3
      - sensor.caqi_pm25
    round_digits: 0
    type: max
    name: CAQI Riegrové Sady Spikey

Describe the bug

Whenever multiscrape reloads, I get a spike in the min_max sensor: this is presumably caused by the min_max sensor reporting e.g. just the first sensor before all the sensors are ready. I guess this is unavoidable and I can live with it if it were to happen only on HASS restart -- I send out notifications when the pollution is high, but reboots are rare enough not to worry about it.

The problem is that somehow I also get this spike from the reload every day around 7:30. I presume this is triggered by multiscrape resetting on a 24-hour cadence. I would like to move this reset to some earlier time during the night, when my notifications are switched off, so I do not get a random message every morning.

Debug log

Exception within exception when not using form-submit

Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 134, in _handle_refresh_interval
await self._async_refresh(log_failures=True, scheduled=True)
File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 265, in _async_refresh
update_callback()
File "/usr/src/homeassistant/homeassistant/components/rest/entity.py", line 73, in _handle_coordinator_update
self._update_from_rest_data()
File "/config/custom_components/multiscrape/sensor.py", line 154, in _update_from_rest_data
value = self._scrape(
File "/config/custom_components/multiscrape/entity.py", line 75, in _scrape
self.rest.notify_scrape_exception()
File "/config/custom_components/multiscrape/data.py", line 66, in notify_scrape_exception
if self._form_resubmit_error:
AttributeError: 'ScrapedRestData' object has no attribute '_form_resubmit_error'

Need to update to handle breaking changes in Home Assistant 2021.9

Problem

The SensorEntity base component will change in HomeAssistant core 2021.9 such that classes extending it should no longer directly set the state or unit_of_measurement, background in home-assistant/core#48261.

Classes extending SensorEntity should now instead set _attr_native_value or override the native_value method, as well as set the _attr_native_unit_of_measurement or override the native_unit_of_measurement method.

As part of those changes, the _attr_state shorthand is planned to be removed from SensorEntity, but it will still work to override the state method (although not recommended), see home-assistant/core#54624.

The changes in home-assistant/core#54624 will most likely land in HomeAssistant Core 2021.9.

Auto-create sensors based on select_list

First of all THANK YOU SOO MUCH for this nice project!

Is your feature request related to a problem? Please describe.
Manually setting a sensor for each list-item can be exhausting.
Also, it's problematic if the list is changing.

Describe the solution you'd like
Auto-create and remove sensors based on select_list.
Maybe add a list_item key next to select_list to define name, select and attribute for all items.
Note: name should accept a select-string here.

Additional context
Maybe it would be convenient to have all select-strings inside list_item scoped to the element:

flowchart TB
A[select_list: li.item] -->|Getting 3 list items| B(list_item)
B -->| item1 | D[Sensor 1]
B -->| item2 | E[Sensor 2]
B -->| item3 | F[Sensor 3]
  - resource: XXX
    scan_interval: 3600
    sensor:
      - unique_id: XXXX
        name: Items
        select_list: "li.item"
        value_template: |
          {%-set value = value.split(",")-%}
          {{value|count}}
        list_items:
          name: "h2"                            # Should be working like `li.item[i] h2`
          select: ".details > .magicValue"      # Should be working like `li.item[i] .details > .magicValue`
          value_template: "{{ value|int }}"
          attributes:
            - name: Other Value
              select: ".details > .otherValue"  # Should be working like `li.item[i] .details > .otherValue`

Please help with Error: RemoteProtocolError("illegal chunk header: bytearray(b'167 \r\n')")

My Home Assistant version: 2022.7.5

Layout-card version (FROM BROWSER CONSOLE): 2.4.2

Newest version of Multiscrape installed

What I am doing:

I try to read my ESP32 HTML sensor.
HTML Code is as following:

<html lang="en">
	<head>
		<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
		<meta HTTP-EQUIV="refresh" CONTENT="30"/>
	</head>

	<body>
		AQUARIUM Temperatur: 25.00 &deg;C <br>Battery: 4.10V, 100%, -<br>BootTime: CEST, 18.Jul 2022, Mon, 18:44:54, NowTime: CEST, 20.Jul 2022, Wed, 16:55:14
 	</body>
 </html>

my multiscrape config is like this:

    scan_interval: 60
    verify_ssl: false
    log_response: true
    parser: html.parser
    sensor:
    - unique_id: body_temperature
      name: HTML MultiScrape Temperature
      select: "body"
      value_template: '{{ value.split(": ")[1].split(" °C")[0] }}'
     unit_of_measurement: "°C"

What I expected to happen:

I want to get the temperature value 25.00

What happened instead:

I get the following error code but I am not sure what it means. Is it a bug or some other problem? Hopefully someone can help me here:

2022-07-20 17:04:46 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Error executing get request to url: http://192.168.1.114/.
Error message:
RemoteProtocolError("illegal chunk header: bytearray(b'167 \r\n')")
2022-07-20 17:04:46 ERROR (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Updating failed with exception: illegal chunk header: bytearray(b'167 \r\n')

Multiscrape stopped working when previous repo version replaced with HACS version

Version of the custom_component

3.0.1

Configuration

This is just an excerpt

sensor:
  - platform: multiscrape
    resource: https://www.wunderground.com/dashboard/pws/IPRAGU342?cm_ven=localwx_pwsdash
    scan_interval: 300
    selectors:
      wu_dejvice_status:
        name: WU Dejvice Status
        select: "#inner-content > section:nth-child(2) > div:nth-child(1) > div > div > div > div > div.dashboard__title.ng-star-inserted > div > span:nth-child(2)"
        value_template: "{{ value|lower }}"
      wu_dejvice_temperature:
        name: WU Dejvice Temperature
        select: "#inner-content > section:nth-child(2) > div:nth-child(1) > div > div > div > div > div:nth-child(2) > div > lib-tile-current-conditions > div > div.module__body > div > div.small-4.columns.text-left.conditions-temp > div.main-temp > lib-display-unit > span > span.wu-value.wu-value-to"
        unit_of_measurement: °C
        value_template: "{{'%0.1f'|format((float(value)-32)/1.8)}}"

Describe the bug

This was working fine under the multiscrape from the previous repository. I've uninstalled that one and installed the HACS version and now all my multiscrape sensors fail.

Debug log

Logger: homeassistant.config
Source: config.py:849
First occurred: 9:06:37 (5 occurrences)
Last logged: 9:06:47

Platform error: sensor - Integration 'multiscrape' not found.

Unable to get value

I installed the latest version via HACS. I log into Amerigas at the URL given below and input my username and password. The next page that loads returns the Estimatedtank percentage. I used Chrome's Inspect > Copy selector to get the select code below. I just want the value located in "Estimatedtank", in gallons. I am not sure what I am doing wrong, but I get 'unknown'. The error says the following...

Logger: custom_components.multiscrape.sensor
Source: custom_components/multiscrape/sensor.py:139
Integration: Multiscrape scraping component (documentation, issues)
First occurred: 10:20:05 PM (1 occurrences)
Last logged: 10:20:05 PM

Sensor Tankpercentage was unable to extract data from HTML

Here is my configuration code...

multiscrape:

  - resource: https://www.myamerigas.com/Login/Login?BrandId=002
    scan_interval: 3600
    headers:
      User-Agent: Mozilla/5.0
    form_submit:
      submit_once: True
      select: "form-control-valid"
      input:
        email: myloginemail
        password: mypassword
    sensor:
      - select: "#layoutDiv > main > div.container.pl-0.pr-0.pl-xl-3.pr-xl-3.pl-lg-3.pr-lg-3.pl-md-3.pr-md-3.pl-sm-0.pr-sm-0 > div:nth-child(2) > div.col-12.col-xl-6.col-lg-6.col-md-12.col-sm-12.pl-0.pr-0.pr-xl-3.pr-lg-3.pr-md-0.pr-sm-0 > div.col-12.bg-white.tankanddeliveries-padding.top-margin > div:nth-child(3) > div.col-12.col-xl-4.col-lg-4.col-md-12.col-sm-12.p-0.mt-3.EstimatedTankDiv > div > div.col-12.p-0.lblvalue-Estimatedtank"
        name: Tankpercentage

Thanks in advance

Error adding entities for domain sensor with platform multiscrape

Hi Daniel,

First of all, many thanks for your great module multiscrape. After updating Home Assistant and multiscrape, I got an error message. Home Assistant does not show the entities I declared with multiscrape. After checking my logs, I traced this behaviour to the line where I defined the unit of measurement. I think this is caused by changes in Home Assistant core 2022.4.1.

thanks

Erich

Version of the custom_component

v5.7.0

Configuration

If unit_of_measurement is used

Describe the bug

When I use unit_of_measurement in my configuration.yaml I get the error. When it is commented out, everything works fine.

Debug log

Logger: homeassistant.components.sensor
Source: components/sensor/init.py:454
Integration: Sensor (documentation, issues)
First occurred: 10:33:24 (2 occurrences)
Last logged: 10:33:24

Error adding entities for domain sensor with platform multiscrape
Error while setting up multiscrape platform for sensor
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 382, in async_add_entities
await asyncio.gather(*tasks)
File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 614, in _async_add_entity
await entity.add_to_platform_finish()
File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 799, in add_to_platform_finish
self.async_write_ha_state()
File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 532, in async_write_ha_state
self._async_write_ha_state()
File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 570, in _async_write_ha_state
state = self._stringify_state(available)
File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 538, in _stringify_state
if (state := self.state) is None:
File "/usr/src/homeassistant/homeassistant/components/sensor/init.py", line 454, in state
assert native_unit_of_measurement
AssertionError

Scrape into sensor attribute a data table row/column

Is your feature request related to a problem? Please describe.
I would like to scrape e.g. the forecast for temperature for the next 24 hours as an attribute of my temperature sensor (also being scraped). In the local weather service I use, this is presented as a 2-row table with 24 columns with pairs of values for time and temperature.

See the detailed link to my use-case in https://community.home-assistant.io/t/scrape-sensor-improved-scraping-multiple-values/218350/113?u=wigster.

A CSS selector which picks the whole data row, when processed by HA-MultiScrape returns only the first string.

This sort of thing goes a little against HA philosophy, which is constructed to record a time series of data for the sensor in real-time. But I think in any situation with data about the future (e.g. weather forecast, tides, astronomical events, sporting events, TV schedules), it seems to me an ability to load a table of data would be useful.

Describe the solution you'd like
Ideally, it would be possible to return the data row as a sensor attribute in the form of a list. One would obtain this using some alternative version of value_template. It would be very helpful if the same transformation in this value_template (or one using the data item index as a variable) could be applied to each of the data items, e.g. converting the units of all of them, or manipulating the string in the same manner.

In principle, performing statistical/mathematical operations on the list and returning just a single item could also be useful: e.g. picking the maximum expected tide from the list of the next 10, or returning tomorrow's low temperature.

Describe alternatives you've considered

Currently, it is possible to create separate attributes, one for each data point, but this is very inefficient and makes further manipulation of these data difficult.

Additional context

My aim is to use these data to plot the forecast in a lovelace card, using the ApexChart data generator feature
https://github.com/RomRider/apexcharts-card#data_generator-option, which works easily with a particular list format for the attributes.

Cannot resolve "button" component

Currently I have:
Home Assistant Core: core-2021.11.1
Operating System: 6.6

The component fails to load, claiming that it cannot find homeassistant.components.button
After I removed all the lines related to this, everything loads OK.
My guess is that the refresh-button functionality was added, but not all (older?) Home Assistant versions can work with it.

attribute names are not well-formed

Version of the custom_component

pre-4.1.1

This is a follow-on issue to #27

Configuration

# Dell Printer
multiscrape:
  - resource: !secret printer_status
    scan_interval: 900
    sensor:
      - name: Printer Toner Level Cyan
        select: "body > table > tr > td > table:nth-child(6) > tr > td > table > tr:nth-child(2) > td:nth-child(1) > b"
        value_template: '{{ value|regex_findall_index(find="(\d+)\%", index=0, ignorecase=False) }}'
        unit_of_measurement: "%"
      - name: Printer Toner Level Magenta
        select: "body > table > tr > td > table:nth-child(6) > tr > td > table > tr:nth-child(4) > td:nth-child(1) > b"
        value_template: '{{ value|regex_findall_index(find="(\d+)\%", index=0, ignorecase=False) }}'
        unit_of_measurement: "%"
      - name: Printer Toner Level Yellow
        select: "body > table > tr > td > table:nth-child(6) > tr > td > table > tr:nth-child(6) > td:nth-child(1) > b"
        value_template: '{{ value|regex_findall_index(find="(\d+)\%", index=0, ignorecase=False) }}'
        unit_of_measurement: "%"
      - name: Printer Toner Level Black
        select: "body > table > tr > td > table:nth-child(6) > tr > td > table > tr:nth-child(8) > td:nth-child(1) > b"
        value_template: '{{ value|regex_findall_index(find="(\d+)\%", index=0, ignorecase=False) }}'
        unit_of_measurement: "%"
      - name: Printer Toner Level
        select: "body > table > tr > td > table:nth-child(10) > tr > td > table > tr:nth-child(2) > td:nth-child(2) > b"
        unit_of_measurement: "%"
        attributes:
          - name: Cyan Toner
            select: "body > table > tr > td > table:nth-child(6) > tr > td > table > tr:nth-child(2) > td:nth-child(1) > b"
            value_template: '{{ value|regex_findall_index(find="(\d+)\%", index=0, ignorecase=False)|int }}'
          - name: Magenta
            select: "body > table > tr > td > table:nth-child(6) > tr > td > table > tr:nth-child(4) > td:nth-child(1) > b"
            value_template: '{{ value|regex_findall_index(find="(\d+)\%", index=0, ignorecase=False)|int }}'
          - name: Yellow
            select: "body > table > tr > td > table:nth-child(6) > tr > td > table > tr:nth-child(6) > td:nth-child(1) > b"
            value_template: '{{ value|regex_findall_index(find="(\d+)\%", index=0, ignorecase=False)|int }}'
          - name: Black
            select: "body > table > tr > td > table:nth-child(6) > tr > td > table > tr:nth-child(8) > td:nth-child(1) > b"
            value_template: '{{ value|regex_findall_index(find="(\d+)\%", index=0, ignorecase=False)|int }}'

Describe the bug

Attribute names are not well-formed, i.e.

  • they are not converted to all lower case
  • spaces are not converted to underscore

In the example above, the attribute named Cyan Toner does not show up as an attribute cyan_toner in the sensor, which is what I would expect.


Disable scraping during startup

Is your feature request related to a problem? Please describe.
I have noticed the integration takes a long time to complete during startup (~600s) of my HA instance. I guess this will continue to increase as more and more sensors are added.

Describe the solution you'd like
Is it possible to disable the integration/stop the scrape process during startup and perform the scraping once HA has started?



Multiscrape not working after most recent update

I just upgraded to the latest HACS version of multiscrape and my sensors no longer work.

Version of the custom_component

v6.2.1

Configuration

multiscrape:
  - resource: https://covidlive.com.au/report/daily-cases/nsw
    scan_interval: 3600
    headers: 
      User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9
    sensor:
      - unique_id: nsw_covid_total 
        name: NSW COVID Cases - Total
        select: ".DAILY-CASES tr:nth-child(2) .COL3.CASES"
        icon: mdi:virus
        #value_template: "{{ value and value or '' }}"

      - unique_id: nsw_covid_today
        name: NSW COVID Cases - Today
        select: ".DAILY-CASES tr:nth-child(2) .COL5.NET"
        icon: mdi:virus
        #value_template: "{{ value and value or '' }}"

      - unique_id: nsw_covid_last_update
        name: NSW COVID Cases - Last Update
        select: ".DAILY-CASES tr:nth-child(2) .COL1.DATE"
        icon: mdi:virus
        #value_template: "{{ value and value or '' }}"

Describe the bug

I've been running multiscrape for some time and the above sensors have been working fine. After upgrading to v6.2.1, multiscrape is no longer functioning and errors as per the error below.

Debug log


This error originated from a custom integration.

Logger: homeassistant.loader
Source: custom_components/multiscrape/sensor.py:40 
Integration: Multiscrape scraping component (documentation, issues) 
First occurred: 8:15:46 PM (3 occurrences) 
Last logged: 8:15:46 PM

Unexpected exception importing platform custom_components.multiscrape.sensor
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/loader.py", line 603, in get_platform
    cache[full_name] = self._import_platform(platform_name)
  File "/usr/src/homeassistant/homeassistant/loader.py", line 620, in _import_platform
    return importlib.import_module(f"{self.pkg_path}.{platform_name}")
  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/config/custom_components/multiscrape/sensor.py", line 40, in <module>
    discovery_info: DiscoveryInfoType | None = None,
TypeError: unsupported operand type(s) for |: 'types.GenericAlias' and 'NoneType'

How to read attribute from tag?

In the previous version, I could read the value of tag attributes, as shown below.

- platform: multiscrape
  name: ralay device
  username: !secret pwr_sw_username
  password: !secret pwr_sw_password
  resource: !secret pwr_sw_address
  scan_interval: 5
  selectors:
    rele_1:
      name: relay_1
      select: "button[name=relayon1] ~ img"
      attribute: "src"
      value_template: '{{ value.split("light")[1].split(".jpg")[0] }}'
    rele_2:
      name: relay_2
      select: "button[name=relayon2] ~ img"
      attribute: "src"
      value_template: '{{ value.split("light")[1].split(".jpg")[0] }}'

This option does not work now, and when I try to specify such a parameter in the new configuration format, I get a message that it does not exist.

multiscrape:
  - resource: !secret pwr_sw_address
    username: !secret pwr_sw_username
    password: !secret pwr_sw_password
    scan_interval: 5
    sensor:
      - name: relay_1
        select: "button[name=relayon1] ~ img"
        attribute: "src"
        value_template: '{{ value.split("light")[1].split(".jpg")[0] }}'
      - name: relay_2
        select: "button[name=relayon2] ~ img"
        attribute: "src"
        value_template: '{{ value.split("light")[1].split(".jpg")[0] }}'


Logger: homeassistant.components.websocket_api.http.connection
Source: components/homeassistant/__init__.py:161
Integration: Home Assistant WebSocket API (documentation, issues)
First occurred: 11:53:36 (1 occurrences)
Last logged: 11:53:36

[281471024226560] The system cannot restart because the configuration is not valid: Invalid config for [multiscrape]: [attribute] is an invalid option for [multiscrape]. Check: multiscrape->multiscrape->0->sensor->0->attribute. (See /home/homeassistant/.homeassistant/configuration.yaml, line 160).
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.9/site-packages/homeassistant/components/websocket_api/commands.py", line 185, in handle_call_service
    await hass.services.async_call(
  File "/srv/homeassistant/lib/python3.9/site-packages/homeassistant/core.py", line 1491, in async_call
    task.result()
  File "/srv/homeassistant/lib/python3.9/site-packages/homeassistant/core.py", line 1526, in _execute_service
    await handler.job.target(service_call)
  File "/srv/homeassistant/lib/python3.9/site-packages/homeassistant/helpers/service.py", line 728, in admin_handler
    await result
  File "/srv/homeassistant/lib/python3.9/site-packages/homeassistant/components/homeassistant/__init__.py", line 161, in async_handle_core_service
    raise HomeAssistantError(
homeassistant.exceptions.HomeAssistantError: The system cannot restart because the configuration is not valid: Invalid config for [multiscrape]: [attribute] is an invalid option for [multiscrape]. Check: multiscrape->multiscrape->0->sensor->0->attribute. (See /home/homeassistant/.homeassistant/configuration.yaml, line 160). 

What do I need to do to get the value of the tag attributes?

when multiscrape can't find its select

If the sensor is trying to scrape a div that isn't there, it breaks all the other sensors as well.

example:

  - platform: multiscrape
    name: home assistant scraper
    resource: https://www.tides.gc.ca/eng/station?sid=4422
    scan_interval: 30
    selectors:
    
      firsthightidetime:
        name: First High Tide
        select: "div.stationTables > div.grid-12.indent-medium > div:nth-child(1) > table > tbody > tr:nth-child(1) > td.time"

      secondhightidetime:
        name: Second High Tide
        select: "div.stationTables > div.grid-12.indent-medium > div:nth-child(1) > table > tbody > tr:nth-child(3) > td.time"

some days don't have the second high or second low, so the last cell of the table doesn't exist. The result is:

This error originated from a custom integration.

Logger: custom_components.multiscrape.sensor
Source: custom_components/multiscrape/sensor.py:149
Integration: multiscrape (documentation, issues)
First occurred: May 24, 2021, 11:08:26 PM (1827 occurrences)
Last logged: 2:30:18 PM

Sensor Second Low Tide was unable to extract data from HTML


How can I check for this in advance? Is there a way to let one sensor fail and keep the others?

Problems after upgrade from old version, No module named 'bs4'

Version of the custom_component

5.1.1

Configuration

multiscrape:
  - name: home assistant scraper
    resource: https://www.home-assistant.io
    scan_interval: 3600
    sensor:
      - unique_id: version
        name: Latest version
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) }}'

Describe the bug

After upgrading from the old version, I'm having a problem with dependencies. I get an error that bs4 is missing. I am running Python 3.9 and Debian Bullseye.
If I install beautifulsoup4, I get a new error saying "AttributeError: 'ScrapedRestData' object has no attribute 'soup'"
I am trying with your example code just to make sure it's not my configuration.
Do you have any idea what I should do?

Debug log

2021-07-25 18:49:30 ERROR (MainThread) [homeassistant.config] Package temperaturarena setup failed. Integration multiscrape No module named 'bs4' (See /root/.homeassistant/packages/temperaturarena.yaml:0).
File "/usr/local/lib/python3.9/dist-packages/homeassistant/helpers/entity_platform.py", line 383, in async_add_entities
await asyncio.gather(*tasks)
File "/usr/local/lib/python3.9/dist-packages/homeassistant/helpers/entity_platform.py", line 588, in _async_add_entity
await entity.add_to_platform_finish()
File "/usr/local/lib/python3.9/dist-packages/homeassistant/helpers/entity.py", line 665, in add_to_platform_finish
await self.async_added_to_hass()
File "/usr/local/lib/python3.9/dist-packages/homeassistant/components/rest/entity.py", line 64, in async_added_to_hass
self._update_from_rest_data()
File "/root/.homeassistant/custom_components/multiscrape/sensor.py", line 155, in _update_from_rest_data
self.rest.soup,
AttributeError: 'ScrapedRestData' object has no attribute 'soup'

log_response is an invalid option for multiscrape

Version of the custom_component: 5.7.0 (Installed via HACS)

Version of Home Assistant Core: core-2022.4.1
Installation Type Home Assistant OS
Development false
Supervisor true
Docker true
Virtual Environment false
Python Version 3.9.9
Operating System Family Linux
Operating System Version 5.10.103-v8
CPU Architecture aarch64

Configuration

multiscrape:
  - resource_template: >
      {{ "https://flightaware.com/live/flight/" + states('sensor.opensky_flight_in_vicinity') }}
    name: OpenSky FlightAware Tracker
    log_response: true
    scan_interval: 360000
    sensor:
      - unique_id: opensky_flight_friendly_name
        name: Flight Friendly Name
        select: ".flightPageSummaryContainer > .flightPageFlightIdentifier > .flightPageSummary > .flightPageFriendlyIdent > .flightPageFriendlyIdentLbl > h1"
        icon: mdi:airplane
      - unique_id: opensky_origin_airport_code
        name: Origin Airport Code
        select: "#flightPageTourStep1 > div.flightPageSummaryAirports > div.flightPageSummaryOrigin > span.flightPageSummaryAirportCode > span"
        icon: mdi:airport
      - unique_id: opensky_origin_city
        name: Origin City
        icon: mdi:city
        select: "#flightPageTourStep1 > div.flightPageSummaryAirports > div.flightPageSummaryOrigin > span.flightPageSummaryCity"
      - unique_id: opensky_destination_airport_code
        name: Destination Airport Code
        select: ".flightPageSummaryAirports > .flightPageSummaryDestination > .flightPageSummaryAirportCode > span"
        icon: mdi:airport
      - unique_id: opensky_destination_city
        name: Destination City
        select: ".flightPageSummaryAirports > .flightPageSummaryDestination > .flightPageSummaryCity > span"
        icon: mdi:city

Describe the bug

When I check my configuration before restarting, Home Assistant warns me that adding the line "log_response: true" to my configuration.yaml is invalid.

Debug log

The system cannot restart because the configuration is not valid: Invalid config for [multiscrape]: [log_response] is an invalid option for [multiscrape]. Check: multiscrape->multiscrape->0->log_response. (See /config/configuration.yaml, line 569).

Unique ID is an invalid option

Version of the custom_component

v4.0.1

Configuration

multiscrape:
  - resource: https://www.home-assistant.io
    scan_interval: 3600
    sensor:
      - unique_id: 197410fe-726e-45b2-b758-6050bd55c6a9
        name: Latest version
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) }}'
      - unique_id: 3d2e81ba-3e45-42e2-bfad-d5543609ef0a
        name: Release date
        select: ".release-date"
    binary_sensor:
      - name: Latest version == 2021.6.0
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) | trim == "2021.6.0" }}'

Describe the bug

Simply adding the provided example or any other configuration results in the error that unique_id is not a valid option.

Invalid config for [multiscrape]: [unique_id] is an invalid option for [multiscrape]. Check: multiscrape->multiscrape->1->sensor->1->unique_id. (See /config/configuration.yaml, line 51).

Debug log


N/A

Resource offline

Is your feature request related to a problem? Please describe.
I have an IoT device that is only online every once in a while.

Describe the solution you'd like
I would like the capability to quiet the logs when the resource is offline, and to use on_error to still have the entities default to a value.

Describe alternatives you've considered
on_error doesn't take effect when the resource is unreachable.

Additional context
None

Add the ability to add entity_picture:

First of all, this multiscrape is so awesome!!!!

When I'm scraping something like the geolocation of a car or police or whatever, I want it to show a picture on the map without doing it manually.

Now it shows the first letter of the device I just scraped on my map, and there is no way to add a picture.

I would like to add this to my scrape as an attribute, but it won't let me:

entity_picture: "/local/test.jpg"

This is how I scrape now:

sensor:
  - name: sysmic_activity_italy
    select: "description:nth-child(1) >text"
    attributes:
      - name: Time
        select: "creationInfo > creationTime"
      - name: Magnitude
        select: "mag> value"
      - name: Latitude
        select: "lat> value"
      - name: Longitude
        select: "lat> value"

It shows the correct location on the map, but with a big round S, because the name starts with 'sysmic'.
With entity_picture it would show a picture instead, just like for a person.

Regards, and keep this project up!!

add icon or icon template to the sensors.

Do you ever plan to add icons to sensors, or better, icon templates?
Because now, after creating the sensors, you have to create template sensors and add the icons there.

multiscrape:
  - resource: https://www.home-assistant.io
    scan_interval: 3600
    sensor:
      - name: Latest version
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) }}'
      - name: Release date
        select: ".release-date"
    binary_sensor:
      - name: Latest version == 2021.7.0
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) | trim == "2021.7.0" }}'

sensor:
  - platform: template
    sensors:
      latest_version_good:
        friendly_name: Latest version
        value_template: "{{ states('sensor.latest_version') }}"
        icon_template: >-
          {% if is_state('sensor.latest_version', states('sensor.version')) %}
            mdi:movie-star-outline
          {% else %}
            mdi:history
          {% endif %}

binary_sensor:
  - platform: template
    sensors:
      latest_version_good:
        friendly_name: Latest version
        value_template: "{{ is_state('binary_sensor.latest_version', 'on') }}"
        icon_template: >-
          {% if is_state('binary_sensor.latest_version', 'on') %}
            mdi:check-circle-outline
          {% else %}
            mdi:checkbox-blank-circle-outline
          {% endif %}

After that, duplicate sensors appear with and without icons.

Latest releases disappeared from repository

Version of the custom_component

v6.0.1

Describe the bug

It seems that the newer versions of multiscrape have disappeared from this repository -- I "updated" from 6.2.x to pre-6.0.1 when it was offered last night without looking and now my sensors no longer have units. I have just noticed that anything newer than 6.0.1 is gone from the releases, which is what caused this to be offered in the first place, as far as I can surmise.

Namespace in sensor definition doesn't seem to work

Version of the custom_component

6.2.0

Configuration

- name: Wolfram Sunburn Time
  resource: https://api.wolframalpha.com/v2/query?input=time+to+sunburn+skin+type+i+prague+cz&format=plaintext&output=JSON&appid=[id]
  scan_interval: 1500
  method: GET
  log_response: true
  sensor:
    - unique_id: wolfram_prague_sunburn_time_skin_i
      name: Sunburn Time Prague Skin Type I
      value_template: >
        {%- set bla = value_json.queryresult.pods[1].subpods[0].plaintext.split('|')[5] %}
        {%- set timestring =  bla.split(' ') %}
        {%- set ns = namespace(minutes = 0) %}
        {%- for i in range((timestring | length / 2) | int) %}
        {%- if timestring[i * 2 + 1] == 'h' %}
          {%- set ns.minutes = ns.minutes + (timestring[i * 2] | float) * 60 %}
        {%- else %}
          {%- set ns.minutes = ns.minutes + (timestring[i * 2] | float) %}
        {%- endif %}
        {%- endfor %}
        {{ns.minutes}}
      state_class: measurement
      attributes:
        - name: SPF15
          value_template: "{{value_json.queryresult.pods[1].subpods[0].plaintext.split('|')[6]}}"
        - name: SPF30
          value_template: "{{value_json.queryresult.pods[1].subpods[0].plaintext.split('|')[7]}}"
        - name: SPF45
          value_template: "{{value_json.queryresult.pods[1].subpods[0].plaintext.split('|')[8]}}"

Describe the bug

Multiscrape fetches the data without issues, but the template does not get processed in the way that I would like.

For some reason the value of the ns.minutes variable that is returned by this sensor is always what I set it to in the line where I define ns above the for/if loop (i.e. in this case 0). The output is correct in the Templates Developer Tools.

Does multiscrape not support writing into a namespace in a loop in its value_template?

(Just for concreteness, the template is complex, because the server returns the time in one of the formats (3.5 h, 1 h 50 min or 50 min) and I want to convert it to minutes for the output of the sensor)

Debug log



Multiscrape broken in newest update?

Version of the custom_component

Installed component version: 4.0.1
Core Version: 2021.6.4
Supervisor: 2021.06.0
Host: Home Assistant OS 6.0

Configuration

- resource: http://192.168.178.45/livedata.htm
  scan_interval: 15
  sensor:
    - name: weatherstation_windspeed
      unit_of_measurement: km/h
      select: "input[name='avgwind'][type='text']"
      attribute: value
    - name: weatherstation_winddirection
      unit_of_measurement: °
      select: "input[name='windir'][type='text']"
      attribute: value
    - name: weatherstation_temperature
      unit_of_measurement: °C
      select: "input[name='outTemp'][type='text']"
      attribute: value
    - name: weatherstation_battery
      select: "input[name='outBattSta1'][type='text']"
      attribute: value
    - name: weatherstation_humidity
      unit_of_measurement: '%'
      select: "input[name='outHumi'][type='text']"
      attribute: value
    - name: weatherstation_uvi
      select: "input[name='uvi'][type='text']"
      attribute: value
    - name: weatherstation_solarradiation
      unit_of_measurement: w/m2
      select: "input[name='solarrad'][type='text']"
      attribute: value
    - name: weatherstation_eventrain
      unit_of_measurement: mm
      select: "input[name='eventr

Describe the bug

Multiscrape is just not starting up anymore for me since the latest update. Before the update, everything worked fine with this configuration.

Debug log



Logger: custom_components.multiscrape
Source: custom_components/multiscrape/data.py:45
Integration: Multiscrape scraping component (documentation, issues)
First occurred: 13:52:46 (8 occurrences)
Last logged: 13:59:53

Unexpected error fetching rest data data: object of type 'NoneType' has no len()
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 187, in _async_refresh
    self.data = await self._async_update_data()
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 147, in _async_update_data
    return await self.update_method()
  File "/config/custom_components/multiscrape/data.py", line 45, in async_update
    self.soup = BeautifulSoup(self.data, self._parser)
  File "/usr/local/lib/python3.8/site-packages/bs4/__init__.py", line 310, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()


Log response is an invalid option for multiscrape

Version 5.7.0

Configuration

logger:
  default: info
  logs:
    custom_components.multiscrape: debug

multiscrape:
  - resource: https://url.com/
    scan_interval: 3600
    log_response: True
    sensor:
      - unique_id: scrape_price
        name: Price of Item
        select: "#ProductSection- > div > div.one-quarter > div.product-cta > div.pricing-block > div > span"
        value_template: '{{ (value.split("$")[1]) | float }}'
  - resource: ...

Describe the bug

When checking the configuration, Home Assistant returns an error that log_response is an invalid option for multiscrape. I tried restarting with just the logger block first and log_response commented out. HA still says it is an invalid option when I uncomment it after the restart.

Am I specifying log_response incorrectly?

Invalid config for [multiscrape]: [log_response] is an invalid option for [multiscrape]. Check: multiscrape->multiscrape->0->log_response. (See /config/configuration.yaml, line 166).

Debug log


Can't include as I can't enable.

Add feature to allow testing/templating of multiscrape sensors

Is your feature request related to a problem? Please describe.
When I set up a new multiscrape sensor where I have to perform a template operation on the scraped output, there isn't really any way to test this apart from editing the configuration.yaml and reloading everything. This is a bit slow.

Describe the solution you'd like
I'd love some way of playing with multiscrape's output in the Developer Tools -- maybe some sort of optional dummy sensor which would contain the raw scraped data, for which the value_template could be edited and then pasted back into the final configuration.yaml.

Additional context
I scrape multiple sensors from a single webpage, together with attributes to them, so maybe that's my difficulty.

Sensors have stopped scraping after updating to 2021.9.x

Version of the custom_component

Configuration


Describe the bug

After updating from 2021.8.7 to 2021.9.1, I have noticed all of my multiscrape sensors have stopped scraping.

Debug log

2021-09-03 11:38:36 ERROR (MainThread) [custom_components.multiscrape.sensor] Sensor Grass Pollen Forecast Day0 was unable to extract data from HTML
2021-09-03 11:38:37 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day0 attribute pollen_station was unable to extract data from HTML
2021-09-03 11:38:37 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day0 attribute day was unable to extract data from HTML
2021-09-03 11:38:37 ERROR (MainThread) [custom_components.multiscrape.sensor] Sensor Grass Pollen Forecast Day1 was unable to extract data from HTML
2021-09-03 11:38:37 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day1 attribute pollen_station was unable to extract data from HTML
2021-09-03 11:38:37 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day1 attribute day was unable to extract data from HTML
2021-09-03 11:38:37 ERROR (MainThread) [custom_components.multiscrape.sensor] Sensor Grass Pollen Forecast Day2 was unable to extract data from HTML
2021-09-03 11:38:38 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day2 attribute pollen_station was unable to extract data from HTML
2021-09-03 11:38:38 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day2 attribute day was unable to extract data from HTML
2021-09-03 11:38:38 ERROR (MainThread) [custom_components.multiscrape.sensor] Sensor Grass Pollen Forecast Day3 was unable to extract data from HTML
2021-09-03 11:38:38 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day3 attribute pollen_station was unable to extract data from HTML
2021-09-03 11:38:39 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day3 attribute day was unable to extract data from HTML
2021-09-03 11:38:39 ERROR (MainThread) [custom_components.multiscrape.sensor] Sensor Grass Pollen Forecast Day4 was unable to extract data from HTML
2021-09-03 11:38:39 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day4 attribute pollen_station was unable to extract data from HTML
2021-09-03 11:38:39 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day4 attribute day was unable to extract data from HTML
2021-09-03 11:38:39 ERROR (MainThread) [custom_components.multiscrape.sensor] Sensor Grass Pollen Forecast Day5 was unable to extract data from HTML
2021-09-03 11:38:39 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day5 attribute pollen_station was unable to extract data from HTML
2021-09-03 11:38:39 ERROR (MainThread) [custom_components.multiscrape.entity] Sensor Grass Pollen Forecast Day5 attribute day was unable to extract data from HTML
2021-09-03 11:35:06 ERROR (MainThread) [custom_components.multiscrape.scraper] Error fetching data: https://www.coronavirus.vic.gov.au/covid-19-vaccine-data failed with
2021-09-03 11:35:06 ERROR (MainThread) [custom_components.multiscrape.scraper] Unable to parse response.
2021-09-03 11:35:08 ERROR (MainThread) [custom_components.multiscrape.scraper] Error fetching data: https://www.coronavirus.vic.gov.au/victorian-coronavirus-covid-19-data failed with
2021-09-03 11:35:08 ERROR (MainThread) [custom_components.multiscrape.scraper] Unable to parse response.
2021-09-03 11:35:08 ERROR (MainThread) [custom_components.multiscrape.scraper] Error fetching data: https://www.melbournepollen.com.au/ failed with
2021-09-03 11:35:08 ERROR (MainThread) [custom_components.multiscrape.scraper] Unable to parse response.
2021-09-03 11:35:10 ERROR (MainThread) [custom_components.multiscrape.scraper] Error fetching data: https://www.sunsmart.com.au/uvalert/default.asp?version=australia&locationid=160 failed with
2021-09-03 11:35:10 ERROR (MainThread) [custom_components.multiscrape.scraper] Unable to parse response.

Invalid config for Multiscrape

Version

v6.3.2

Configuration

multiscrape:

Logs:

Logger: homeassistant.components.websocket_api.http.connection
Source: components/hassio/init.py:679
Integration: Home Assistant WebSocket API (documentation, issues)
First occurred: 4:45:38 PM (1 occurrences)
Last logged: 4:45:38 PM
[140607003037904] The system cannot restart because the configuration is not valid: Invalid config for [multiscrape]: string value is None for dictionary value @ data['multiscrape'][0]['form_submit']['input']['extra']. Got None string value is None for dictionary value @ data['multiscrape'][0]['form_submit']['select']. Got None. (See /config/configuration.yaml, line 137).

Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/components/websocket_api/commands.py", line 189, in handle_call_service
await hass.services.async_call(
File "/usr/src/homeassistant/homeassistant/core.py", line 1627, in async_call
task.result()
File "/usr/src/homeassistant/homeassistant/core.py", line 1664, in _execute_service
await cast(Callable[[ServiceCall], Awaitable[None]], handler.job.target)(
File "/usr/src/homeassistant/homeassistant/components/hassio/init.py", line 679, in async_handle_core_service
raise HomeAssistantError(
homeassistant.exceptions.HomeAssistantError: The system cannot restart because the configuration is not valid: Invalid config for [multiscrape]: string value is None for dictionary value @ data['multiscrape'][0]['form_submit']['input']['extra']. Got None
string value is None for dictionary value @ data['multiscrape'][0]['form_submit']['select']. Got None. (See /config/configuration.yaml, line 137).

Describe the bug

When I add this to configuration.yaml and check the configuration, I get the error above.
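
For context, the validation message complains that form_submit.select and the extra entry under form_submit.input have no value (None). A minimal sketch of the shape the validator expects, with hypothetical URLs, selector and credentials:

multiscrape:
  - resource: https://example.com/data        # hypothetical target page
    form_submit:
      resource: https://example.com/login     # hypothetical page containing the form
      select: "#login_form"                   # must not be left empty
      input:
        username: my_user                     # every input key needs a non-empty value
        password: my_pass
    sensor:
      - unique_id: example_value
        select: ".some-class"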

Add Attributes to a sensor device

It would be really cool to have the ability to add attributes from the scrape to the sensor, as well as having multiple sensors.

For example, if the webpage had a table of data, you might want other columns in the table to become attributes of the sensor; see the sketch below.
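
A sketch of what such a configuration could look like; attribute support with its own selectors appears in other configurations in this document, and the selectors below are hypothetical:

sensor:
  - unique_id: table_row_value
    name: Table value
    select: "table > tbody > tr:nth-child(1) > td:nth-child(2)"   # the column used as the state
    attributes:
      - name: Other column                                        # another column of the same row, exposed as an attribute
        select: "table > tbody > tr:nth-child(1) > td:nth-child(3)"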

Error while setting up multiscrape platform for sensor.

Version of the custom_component

pre-v5.5.0

Configuration

Not yet

Home Assistant Version

version core-2021.9.2
installation_type Home Assistant Core
dev false
hassio false
docker false
user homeassistant
virtualenv true
python_version 3.8.5
os_name Linux
os_version 5.4.0-1028-raspi
arch aarch64
timezone Asia/Taipei
Home Assistant Cloud
logged_in false
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Lovelace
dashboards 1
resources 0
views 4
mode storage

Debug log


Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.8/site-packages/homeassistant/helpers/entity_platform.py", line 249, in _async_setup_platform
    await asyncio.shield(task)
  File "/home/homeassistant/.homeassistant/custom_components/multiscrape/sensor.py", line 42, in async_setup_platform
    if scraper.data is None:
UnboundLocalError: local variable 'scraper' referenced before assignment

Error 500 when using multiscrape

I've set up a multiscrape sensor that uses a login form before scraping data.
When using the multiscrape sensor's form_submit option, the debug log gives me "Internal Server Error 500" when connecting to the login page.
However, when using a regular browser and copy-pasting the exact same URL, the site works normally.
The browser trace doesn't show anything about being redirected first.
I've also written a .NET application myself that uses the exact same URL as provided to multiscrape, and it still works fine.
The reason I want to use multiscrape is to avoid having a separate application for scraping the data.

Version of the custom_component

"version": "6.2.0"

Configuration

- resource: 'https://formresourceurl/schedule#my_schedule'
  name: scrape1
  log_response: True
  scan_interval: 3600
  form_submit:
    submit_once: True
    resource: 'https://formresourceurl/u'
    select: "#frm_login"
    input:
      user_id: 'MYSECRETUSERNAME'
      password: 'MYSECRETPASSWORD'
  sensor:
      - unique_id: scrape1_sensor1
        select: 'div.users-full-row:nth-child(1) > div:nth-child(2) > div:nth-child(1) > div:nth-child(1)'

Debug log

2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape] # Start loading multiscrape
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape] # Reload service registered
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape] # Start processing config from configuration.yaml
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape] scrape1 # Setting up multiscrape with config:
 OrderedDict([('resource', 'https://formresourceurl/schedule#my_schedule'), ('name', 'scrape1'), ('log_response', True), ('scan_interval', datetime.timedelta(seconds=3600)), ('form_submit', OrderedDict([('submit_once', True), ('resource', 'https://formresourceurl/u'), ('select', '#frm_login'), ('input', OrderedDict([('user_id', 'MYSECRETUSERNAME'), ('password', 'MYSECRETPASSWORD!')])), ('resubmit_on_error', True)])), ('sensor', [OrderedDict([('unique_id', 'scrape1_sensor1'), ('select', Template("div.users-full-row:nth-child(1) > div:nth-child(2) > div:nth-child(1) > div:nth-child(1)")), ('name', 'Multiscrape Sensor'), ('force_update', False)])]), ('timeout', 10), ('method', 'GET'), ('parser', 'lxml'), ('verify_ssl', True)])
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape] scrape1 # Log responses enabled, creating logging folder: /config/multiscrape/scrape1/
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape.http] scrape1 # Initializing http wrapper
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape.form] scrape1 # Initializing form submitter
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape] scrape1 # Initializing scraper
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape.scraper] scrape1 # Initializing scraper
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape.http] scrape1 # Initializing http wrapper
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape] scrape1 # Initializing coordinator
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape.coordinator] scrape1 # New run: start (re)loading data from resource
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape.coordinator] scrape1 # Deleting logging files from previous run
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape.form] scrape1 # Starting with form-submit
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape.form] scrape1 # Requesting page with form from: https://formresourceurl/u
2022-04-18 09:30:26 DEBUG (MainThread) [custom_components.multiscrape.http] scrape1 # Executing form_page-request with a GET to url: https://formresourceurl/u.
2022-04-18 09:30:27 DEBUG (MainThread) [custom_components.multiscrape.http] scrape1 # Response status code received: 500
2022-04-18 09:30:27 DEBUG (MainThread) [custom_components.multiscrape.http] scrape1 # response_headers written to file: form_page_response_headers.txt
2022-04-18 09:30:27 DEBUG (MainThread) [custom_components.multiscrape.http] scrape1 # response_body written to file: form_page_response_body.txt
2022-04-18 09:30:27 ERROR (MainThread) [custom_components.multiscrape.http] scrape1 # Error executing GET request to url: https://formresourceurl/u.
 Error message:
 HTTPStatusError("Server error '500 Internal Server Error' for url 'https://formresourceurl/u'\nFor more information check: https://httpstatuses.com/500")
2022-04-18 09:30:27 ERROR (MainThread) [custom_components.multiscrape.coordinator] scrape1 # Exception in form-submit feature. Will continue trying to scrape target page.


Removal of Index from the config broke my sensors

Version of the custom_component

6.2.0

Configuration

multiscrape:
  - resource: https://meteoregionelazio.it/rete/stazione.php?id=RM-139
    scan_interval: 600
    sensor:
      - unique_id: infernetto_hum
        name: Umidità Infernetto
        select: ".dato_grande"
        index: 1
        value_template: '{{ (value.split("%")[0]) }}'
        device_class: "humidity"
        state_class: "measurement"
        unit_of_measurement: "%"

Describe the bug

The removal of index from the config broke all my sensors. It was a useful feature; how can I get the same result now?

Also, the README.md still lists index as a parameter.
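
One possible workaround, purely as a sketch, is to move the index into the CSS selector itself. Whether this matches the old index behaviour depends on the page structure; :nth-of-type counts siblings of the same element type rather than matches of the class, so the number may need adjusting:

sensor:
  - unique_id: infernetto_hum
    name: Umidità Infernetto
    # the old config used index: 1; adjust the nth-of-type number until it hits the same element
    select: ".dato_grande:nth-of-type(2)"
    value_template: '{{ (value.split("%")[0]) }}'
    device_class: "humidity"
    state_class: "measurement"
    unit_of_measurement: "%"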

Unable to scrape

I'm trying to scrape my Social Blade page.
I followed the wiki https://github.com/danieldotnl/ha-multiscrape/wiki/Scraping-guide and this is what I tried:

  - resource: https://socialblade.com/youtube/channel/UCb9Iz9w_jEq3iXIVrQqwdsQ
    scan_interval: 600
    sensor:
      - unique_id: youtube_channel_subs
        name: Youtube Channel Subscribers
        select: "#YouTubeUserTopInfoBlock > div:nth-child(3) > span:nth-child(3)"
      - unique_id: youtube_channel_views
        name: Youtube Channel Views
        select: "#YouTubeUserTopInfoBlock > div:nth-child(4) > span:nth-child(3)"

I also tried:

select: "#youtube-stats-header-subs" (which works in the regular scrape platform)
select: ".youtube-stats-header-subs"
select: "youtube-stats-header-subs"

but none of them work.

I thought maybe the site was blocking me, so I tried adding:

    headers:
      User-Agent: Mozilla/5.0

but that didn't work either.

What am I missing?

Data Pulled but Field Not Seen

Version of the custom_component

pre-v6.0.1

Configuration

- resource: https://www.mudomaha.com/node/239
  scan_interval: 14400
  sensor:
    - unique_id: mud_gas_service_charge
      name: Utility - MUD - Gas Service Charge
      icon: mdi:currency-usd
      select: ".field-node--body > div:nth-child(1) > div:nth-child(1) > table:nth-child(13) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2)"
      value_template: "{{ value.replace('$', '') }}"
    - unique_id: mud_gas_infrastructure_charge
      name: Utility - MUD - Gas Infrastructure Charge
      icon: mdi:currency-usd
      select: ".field-node--body > div:nth-child(1) > div:nth-child(1) > table:nth-child(13) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(2)"
      value_template: "{{ value.replace('$', '') }}"
    - unique_id: mud_gas_base_unit_charge
      name: Utility - MUD - Gas Base Unit Charge
      icon: mdi:currency-usd
      select: ".field-node--body > div:nth-child(1) > div:nth-child(1) > table:nth-child(13) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(2)"
      value_template: "{{ value.replace('$', '0') }}"
    - unique_id: mud_gas_monthly_unit_charge
      name: Utility - MUD - Gas Monthly Unit Charge
      icon: mdi:currency-usd
      select: ".field-node--body > div:nth-child(1) > div:nth-child(1) > table:nth-child(30) > tbody:nth-child(3) > tr:nth-child(4) > td:nth-child(6)"
      value_template: "{{ value.replace('$', '') }}"
  log_response: true
  #parser: html.parser
  headers:
    User-Agent: Mozilla/5.0

Describe the bug

Not sure if this is a bug or something odd with the webpage, but the data shows up in the response body file written by the component, yet Multiscrape is not "finding" it, even though it finds other values on the same page just a few lines up. I have tried everything I could find by searching around. Is there a limit to how many items it can pull, or to how far down the page it looks?

Debug log


2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_1 # Utility - MUD - Gas Infrastructure Charge # Updated sensor and attributes, now adding to HA
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_1 # Utility - MUD - Gas Base Unit Charge # Start scraping to update sensor
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_1 # Utility - MUD - Gas Base Unit Charge # Select selected tag: <td class="text-align-right">$.1283</td>
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_1 # Utility - MUD - Gas Base Unit Charge # Selector result: $.1283
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_1 # Utility - MUD - Gas Base Unit Charge # Applying value_template on selector result
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_1 # Utility - MUD - Gas Base Unit Charge # Final selector value: 0.1283
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_1 # Utility - MUD - Gas Base Unit Charge # Selected: 0.1283
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_1 # Utility - MUD - Gas Base Unit Charge # Icon template rendered and set to: mdi:currency-usd
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_1 # Utility - MUD - Gas Base Unit Charge # Updated sensor and attributes, now adding to HA
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_1 # Utility - MUD - Gas Monthly Unit Charge # Start scraping to update sensor
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_1 # Utility - MUD - Gas Monthly Unit Charge # Select selected tag: None
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_1 # Utility - MUD - Gas Monthly Unit Charge # Exception selecting sensor data: 'NoneType' object has no attribute 'name'
HINT: Use debug logging and log_response for further investigation!
2022-04-11 11:20:12 ERROR (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_1 # Utility - MUD - Gas Monthly Unit Charge # Unable to extract data
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_1 # Utility - MUD - Gas Monthly Unit Charge # On-error, set value to None
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_1 # Utility - MUD - Gas Monthly Unit Charge # Updated sensor and attributes, now adding to HA
2022-04-11 11:20:12 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_2 # Price - Yardmatery - Stress Fert # Start scraping to update sensor

EDIT: updated the log above to include a successful scrape of another sensor on the same page.

Handling intermediate pages after form submit

Version of the custom_component

6.0.0

Configuration

multiscrape:
  - resource: "https://www.amaysim.com.au/my-account/my-amaysim/products"
    name: Amaysim
    scan_interval: 30
    log_response: true
    method: GET
    form_submit:
      submit_once: true
      resubmit_on_error: false
      resource: "https://accounts.amaysim.com.au/identity/login"
      select: "#new_session"
      input:
        username: !secret amaysim_username
        password: !secret amaysim_password
    sensor:
      - select: "#outer_wrap > div.inner-wrap > div.page-container > div:nth-child(2) > div.row.margin-bottom > div.small-12.medium-6.columns > div > div > div:nth-child(2) > div:nth-child(2)"
        name: amaysim_remaining_data
        value_template: "{{ value }}"

Describe the bug

I can see from the logs that after successfully submitting the form, it reports fetching the data from the resource URL with a response code of 200.

I have pasted the contents from the log_response file page_soup.txt below

<html><body><p>/**/('OK')</p></body></html>

Below is the content from the form_submit_response_body.txt

<html><body>You are being <a href="https://accounts.amaysim.com.au/identity">redirected</a>.</body></html>

It seems that after submitting the form, the sensor is scraping data from the intermediate page rather than from the resource URL.

Debug log


2022-02-01 18:51:06 DEBUG (MainThread) [custom_components.multiscrape] Amaysim # Setting up multiscrape with config:
OrderedDict([('resource', 'https://www.amaysim.com.au/my-account/my-amaysim/products'), ('name', 'Amaysim'), ('scan_interval', datetime.timedelta(seconds=30)), ('log_response', True), ('method', 'GET'), ('form_submit', OrderedDict([('submit_once', True), ('resubmit_on_error', False), ('resource', 'https://accounts.amaysim.com.au/identity/login'), ('select', '#new_session'), ('input', OrderedDict([('username', 'xxxx'), ('password', 'yyyy')]))])), ('sensor', [OrderedDict([('select', Template("#outer_wrap > div.inner-wrap > div.page-container > div:nth-child(2) > div.row.margin-bottom > div.small-12.medium-6.columns > div > div > div:nth-child(2) > div:nth-child(2)")), ('name', 'amaysim_remaining_data'), ('value_template', Template("{{ value }}")), ('force_update', False)])]), ('parser', 'lxml'), ('timeout', 10), ('verify_ssl', True)])
2022-02-01 18:51:06 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Initializing scraper
2022-02-01 18:51:06 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Found form-submit config
2022-02-01 18:51:07 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Refresh triggered
2022-02-01 18:51:07 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Continue with form-submit
2022-02-01 18:51:07 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Requesting page with form from: https://accounts.amaysim.com.au/identity/login
2022-02-01 18:51:07 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Executing form_page-request with a GET to url: https://accounts.amaysim.com.au/identity/login.
2022-02-01 18:51:17 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response status code received: 200
2022-02-01 18:51:17 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response headers written to file: /config/multiscrape/amaysim/form_page_response_headers.txt
2022-02-01 18:51:17 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response headers written to file: /config/multiscrape/amaysim/form_page_response_body.txt
2022-02-01 18:51:17 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Start trying to capture the form in the page
2022-02-01 18:51:17 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Parse HTML with BeautifulSoup parser lxml
2022-02-01 18:51:18 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # The page with the form parsed by BeautifulSoup has been written to file: /config/multiscrape/amaysim/form_page_soup.txt
2022-02-01 18:51:18 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Try to find form with selector #new_session
2022-02-01 18:51:18 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Found the form, now finding all input fields
2022-02-01 18:51:18 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Found the following fields: {'utf8': '✓', 'authenticity_token': 'abc123', 'continue': None, 'ga_client_id': '', 'username': None, 'password': ''}
2022-02-01 18:51:18 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Found form action /identity/sessions and method post
2022-02-01 18:51:18 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Determined the url to submit the form to: https://accounts.amaysim.com.au/identity/sessions
2022-02-01 18:51:18 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Merged input fields with input data in config. Result: {'utf8': '✓', 'authenticity_token': 'abc123', 'continue': None, 'ga_client_id': '', 'username': 'xxxx', 'password': 'yyyy'}
2022-02-01 18:51:18 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Going now to submit the form
2022-02-01 18:51:18 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Executing form_submit-request with a post to url: https://accounts.amaysim.com.au/identity/sessions.
2022-02-01 18:51:19 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response status code received: 302
2022-02-01 18:51:19 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response headers written to file: /config/multiscrape/amaysim/form_submit_response_headers.txt
2022-02-01 18:51:19 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response headers written to file: /config/multiscrape/amaysim/form_submit_response_body.txt
2022-02-01 18:51:19 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Form seems to be submitted succesfully! Now continuing to update data for sensors
2022-02-01 18:51:19 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Updating data from https://www.amaysim.com.au/my-account/my-amaysim/products
2022-02-01 18:51:19 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Executing page-request with a get to url: https://www.amaysim.com.au/my-account/my-amaysim/products.
2022-02-01 18:51:20 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response status code received: 200
2022-02-01 18:51:20 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response headers written to file: /config/multiscrape/amaysim/page_response_headers.txt
2022-02-01 18:51:20 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response headers written to file: /config/multiscrape/amaysim/page_response_body.txt
2022-02-01 18:51:20 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Data succesfully refreshed. Sensors will now start scraping to update.
2022-02-01 18:51:20 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Start loading the response in BeautifulSoup.
2022-02-01 18:51:20 DEBUG (MainThread) [custom_components.multiscrape.scraper] Amaysim # Response headers written to file: /config/multiscrape/amaysim/page_soup.txt

Input id/name with colon

Version of the custom_component

v6.3.2

Configuration

multiscrape:
  - resource: 'https://czview.contazara.es/private/czview/meter_list.xhtml'
    scan_interval: 86400
    form_submit:
      submit_once: True
      resource: 'https://czview.contazara.es/login.xhtml'
      select: "#loginForm"
      input:
        loginForm:username: '*******'
        loginForm:password: '*************'
    sensor:
      - name: 'Consumo agua'
        select: '.ui-datatable-data > tr > td:nth-child(4)'
        value_template: '{{ value | float }}'
        unit_of_measurement: 'm³'

Describe the bug

Hello! I think the problem comes from the fact that the id/name of the username and password inputs in the form contains a colon.

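Before looking at the component, it may be worth ruling out a plain YAML issue: keys that contain a colon generally need to be quoted, otherwise the parser can misread the mapping. A sketch, assuming quoting is the only change needed:

      input:
        "loginForm:username": '*******'
        "loginForm:password": '*************'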

Debug log

2022-07-18 11:37:50 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Consumo agua # Setting up sensor
2022-07-18 11:37:50 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Consumo agua # Start scraping to update sensor
2022-07-18 11:37:50 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Consumo agua # Tag selected: None
2022-07-18 11:37:50 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Exception occurred while scraping, will try to resubmit the form next interval.
2022-07-18 11:37:50 ERROR (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Consumo agua # Unable to scrape data: Could not find a tag for given selector.
Consider using debug logging and log_response for further investigation.
2022-07-18 11:37:50 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # Consumo agua # On-error, set value to None
2022-07-18 11:37:50 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_0 # Consumo agua # Updated sensor and attributes, now adding to HA

Thank you for your work!

Add capability to include multi-scraped sensor in statistics graph

Is your feature request related to a problem? Please describe.
Currently, scraped sensors are not considered measurements by HA and therefore long-term data are not collected. This means that one cannot make long-term plots on the new statistics graph card. I scrape a whole bunch of numerical data into HA and it would be great to be able to look at long-term trends.

Describe the solution you'd like
It would be great if it were possible to set a property for a sensor such that HA would collect statistics on it, according to the spec here.

Describe alternatives you've considered
Extending the database is not really an option beyond some 10s of days and is completely unnecessary for most cases.
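
For what it's worth, other configurations in this document already set state_class: measurement on a multiscrape sensor (see the humidity example earlier), which is the property Home Assistant's long-term statistics rely on. A sketch with hypothetical names, assuming the installed version supports the option:

sensor:
  - unique_id: example_numeric_value
    name: Example numeric value
    select: ".some-number"
    value_template: "{{ value | float }}"
    unit_of_measurement: "W"
    device_class: power
    state_class: measurement    # marks the sensor for long-term statistics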

Can't replace the single quote character ' in the scraped value

EDIT: Solved the problem on my own!
I just needed to use these lines (have a great day, all):

      - unique_id: zoo_brasilia_imagem_6
        icon: mdi:elephant
        name: Zoo de Brasília - Img 6
        select: '.img-destaque'
        attribute: 'style'
        value_template: 'https://www.zoo.df.gov.br/{{value.split(".")[4] | replace("br/","")}}.jpg' #png
        #'https://www.zoo.df.gov.br/{{value.split("/")[3]}}/{{value.split("/")[4]}}/{{value.split("/")[5]}}/{{value.split("/")[6]}}/{{value.split("/")[7] | replace (")","") | replace(".jpg", ".jpg                                               ")}}' 
        #'https://www.zoo.df.gov.br{{value.split("https://www.zoo.df.gov.br")[1] | replace(")", "") }}'
        #'{{value.split(" ")[1] | replace("(", "") | replace(")", "") | replace(";", "") | replace("url", "") }}'
        #'{{ value.split("*")[0] | replace("g')", "g") | replace("background-image: ", "") | replace("url('h", "h") }}'
        #'{{value.split(" ")[1] | replace("(", "") | replace(")", "") | replace(";", "") | replace("url", "") }}'
        #'{{value.split(" ")[1] | replace("(", " ") | replace(")", " ") | replace(";", " ") | replace("url", " ") }}'

I need the image URL without the single quotes so I can feed the link from multiscrape into my generic cameras, but when I scrape and strip out all the unneeded parts, I end up with a result that still contains the two single quotes:


My scrape sensor that returns the value with the two single quotes:
### ZOO DE BRASÍLIA ###

  - resource: https://www.zoo.df.gov.br/
    scan_interval: 3600
    sensor:
      - unique_id: zoo_brasilia_imagem_7
        icon: mdi:elephant
        name: Zoo de Brasília - Img 7
        select: '.img-destaque'
        attribute: 'style'
        index: 1
        value_template: '{{value.split(" ")[1] | replace("(", "") | replace(")", "") | replace(";", "") | replace("url", "") }}'

The sensor with the problem when I try to remove the quotes:
### ZOO DE BRASÍLIA ###

  - resource: https://www.zoo.df.gov.br/
    scan_interval: 3600
    sensor:
      - unique_id: zoo_brasilia_imagem_6
        icon: mdi:elephant
        name: Zoo de Brasília - Img 6
        select: '.img-destaque'
        attribute: 'style'
        value_template: '{{value.split("*")[0] | replace("g')", "g") | replace("background-image: ", "") | replace("url('h", "h") }}'

The error in the log:
can not read an implicit mapping pair; a colon is missed at line 228, column 117:
... ground-image: ", "") | replace("url('h", "h") }}'
^

Humble request: can I get the sensor value without those two single quotes around the image URL?
Please don't close it; I think it's simple. Help me out.

Unable to parse response

Hey

Trying to parse the date for my next bin collection. I've tried a few different configurations, but all of them fail with "Unable to parse response".

I'm able to make a simple curl or Postman request to this site and a response is returned. No authentication or custom headers are required.

I have also tried with the older parser: html.parser.

Version of the custom_component = 5.2.0

Configuration

multiscrape:
  - resource: https://wasteservices.sheffield.gov.uk/property/100051011212
    scan_interval: 86400
    sensor:
      - unique_id: bins_black_nextcollection
        name: Next Black Bin Collection
        select: "#main > div.container.results-table-wrapper > div:nth-child(1) > div > table > tbody > tr.service-id-1.task-id-1.complete > td.next-service"
        value_template: '{{ (value.split(",")[1]) }}'

Error/Logs

2021-08-04 20:48:08 ERROR (MainThread) [custom_components.multiscrape.data] Error fetching data: https://wasteservices.sheffield.gov.uk/property/100051011212 failed with
2021-08-04 20:48:08 ERROR (MainThread) [custom_components.multiscrape.data] Unable to parse response.

Select JSON response returned from POST

Version 6.0.0

Configuration

multiscrape:
  - resource: https://www.url.com/file.js
    scan_interval: 3600
    name: scrape_item
    method: POST
    payload: "properties%5B_uuid%5D=123&properties%5B_linkedLineItemUuid%5D=369&properties%5B_suggestion%5D=&properties%5B_maxQty%5D=5&properties%5B_freightPrice%5D=&properties%5B_productGroup%5D=&id=123&quantity=1"
    log_response: True
    sensor:
      - unique_id: scrape_price_item
        name: Price of Item
        select: "*"
        value_template: '{{ value.price | float/100 }}'

Describe the bug

I am using multiscrape to POST a payload to a JS script, which returns a JSON object like the one below (shortened for this issue). I would like to extract the "price" attribute from this JSON object. Is this possible with multiscrape? The multiscrape sensor requires a CSS selector; what should it be if the entire response is a JSON object? I thought select: "*" and value.price would work, but it doesn't.

Am I missing a text-to-JSON step in the value_template?

Multiscrape may not be a perfect fit for this purpose, but it works right up until the extraction, so it would be good to get this working without having to use a different integration; see the sketch below.

Thanks!
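
One thing that may be worth trying, assuming the whole JSON body ends up as the selector value, is parsing it inside the template with Home Assistant's from_json filter; a sketch:

    sensor:
      - unique_id: scrape_price_item
        name: Price of Item
        select: "*"
        # hypothetical: parse the raw body as JSON, then divide the price by 100
        value_template: "{{ (value | from_json).price | float / 100 }}"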

Debug log

page_response_body.txt from multiscrape logs.

{
  "id": 123,
  "quantity": 1,
  "variant_id": 123,
  "key": "123",
  "title": "Text title",
  "price": 123,
  "original_price": 123,
  "discounted_price": 123,
  "line_price": 123,
  "original_line_price": 123,
  "total_discount": 0,
  "discounts": [],
  "sku": "123",
  "grams": 123,
  "vendor": "ABC",
  "taxable": true,
  "product_id": 123,
  "product_has_only_default_variant": true,
  "gift_card": false,
  "final_price": 123,
  "final_line_price": 123,
  "url": "/url"
}

on_error issue: it triggers my event every time

I'm using the latest update and HASS OS 7.5.

The issue is that on_error is not working, or I'm doing something wrong.

So this is the situation:

  1. It scrapes every, let's say, 10 minutes.
     Sometimes the scrape comes back with an empty value.
  2. I have set up an automation with a trigger on entity > attribute > above 0.

So when the scrape fails and then succeeds again, it triggers my event every time. How can I solve this?

I checked my log, and before it happens all sensors/attributes return (none).


multiscrape:
  - resource: http://
    scan_interval: 300
    verify_ssl: false
    sensor:
        name: sysmic_activity
        picture: "/local/sysmic.png"
        select: "description:nth-child(1) >text"
        attributes:
          - name: Time
            select: "creationInfo > creationTime"
            on_error:
              value: last
          - name: Magnitude
            select: "mag> value"
            value_template: '{{ value | round(1) }}'
            on_error:
              value: last
          - name: Depth
            select: "depth >value"
            on_error:
              value: last
          - name: city
            select: "description:nth-child(1) >text"
            on_error:
              value: last
          - name: address
            select: "description:nth-child(1) >text"
            on_error:
              value: last
          - name: latitude
            select: "latitude> value"
            on_error:
              value: last
          - name: longitude
            select: "longitude> value"
            on_error:
              value: last

The trigger part of the automation:

platform: numeric_state
entity_id: sensor.sysmic_activity
above: '0'
attribute: magnitude
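
One way to avoid firing on the brief none -> value transition, assuming the attribute recovers shortly after a failed scrape, is to require the value to hold for some time before the trigger fires; a sketch:

platform: numeric_state
entity_id: sensor.sysmic_activity
attribute: magnitude
above: 0
for:
  minutes: 5    # only fire if the attribute stays above 0 for 5 minutes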

Execute a real-time scrape when someone is looking at the data

Is your feature request related to a problem? Please describe.
I would like to see up-to-date data when I actively look at the dashboard containing the output of a multiscrape sensor, but would like a long scraping interval when I don't.

Describe the solution you'd like
I'm not sure this is at all possible, but maybe there is some API in HASS that tells you whether someone is logged in and looking at the dashboard in the browser/app. During that time, the scraping interval could be shortened to a user-set alternative so that up-to-date data appear in the plots and entity cards; otherwise the interval could stay long. Currently I have an unsatisfying compromise interval of 15 minutes, which is not up to date when something sudden happens that makes me log in, and yet more detailed than I need for a historical record.

Describe alternatives you've considered
Cannot think of any.

[feature request] Is login successful? Why does scrape fail?

Version of the custom_component

v5.5.0

Configuration

multiscrape:
  #https://thepagewiththedatathatyouwant.com
  - resource: "https://customer.xfinity.com/#/services/internet#usage"
    scan_interval: 3600
    form_submit:
      submit_once: False
      #https://thesitewiththeform.com
      resource: "https://login.xfinity.com/login"
      select: "#right > div > form"
      input:
        user: !secret xfinity_username
        passwd: !secret xfinity_password
    sensor:
      - select: "#usage > div > div:nth-child(2) > div > div > div > p > span > b:nth-child(1)"
        name: scraped-value-after-form-submit

Describe the bug

Failed to extract data, but why? Is the selector not present on the page? Did the login succeed? I would like more debug info to troubleshoot the issue; see the sketch after the log below.

Debug log

Logger: custom_components.multiscrape.sensor
Source: custom_components/multiscrape/sensor.py:139
Integration: Multiscrape scraping component (documentation, issues)
First occurred: 6:36:38 PM (1 occurrences)
Last logged: 6:36:38 PM

Sensor scraped-value-after-form-submit was unable to extract data from HTML
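
For more detail, one option (assuming a component version that supports log_response) is to enable debug logging for the component and response logging for the scraper; log lines elsewhere in this document show the responses being written under /config/multiscrape/. A sketch, with the form_submit block omitted for brevity:

logger:
  default: warning
  logs:
    custom_components.multiscrape: debug     # detailed component logging

multiscrape:
  - resource: "https://customer.xfinity.com/#/services/internet#usage"
    scan_interval: 3600
    log_response: true                        # writes responses and the parsed soup to files
    sensor:
      - select: "#usage > div > div:nth-child(2) > div > div > div > p > span > b:nth-child(1)"
        name: scraped-value-after-form-submit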

Can you use Multiscrape to scrape info from 2 or more URLs? I tried adding another name: section but it didn't seem to work

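Based on how multiscrape is configured elsewhere in this document (it takes a list of scrapers), the usual pattern is one list item per URL, each with its own name, scan_interval and sensors, rather than a second name: key inside the same item. A sketch with hypothetical URLs and selectors:

multiscrape:
  - resource: https://example.com/page-one    # first site
    scan_interval: 600
    sensor:
      - unique_id: first_value
        select: ".value-one"
  - resource: https://example.com/page-two    # second site
    scan_interval: 3600
    sensor:
      - unique_id: second_value
        select: ".value-two"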

Schedule for scraping interval

Is your feature request related to a problem? Please describe.
I would like to be able to vary how often multiscrape makes a page request, depending on a schedule. This is for those websites where I can query the API for free a limited number of times per day. E.g. I would like to get weather updates once every 15 minutes during the day but once an hour at night.

Describe the solution you'd like
One idea would be to let the scan_interval be read from a helper, which I could adjust as I see fit with my own automations. If that's not possible, then maybe a peak/off-peak scanning interval with some start and end times for these?

Describe alternatives you've considered
I could set the scan_interval to once every 24 hours and write an automation to trigger the scan manually, but this gets bulky quickly; see the sketch below.
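
A sketch of that alternative (a long scan_interval plus an automation that forces an update on a schedule), using the standard homeassistant.update_entity service; names and URL are hypothetical:

multiscrape:
  - resource: https://example.com/weather
    scan_interval: 86400                      # effectively "rarely"
    sensor:
      - unique_id: example_weather
        select: ".temp"

automation:
  - alias: Refresh scrape during the day
    trigger:
      - platform: time_pattern
        minutes: "/15"                        # every 15 minutes
    condition:
      - condition: time
        after: "07:00:00"
        before: "22:00:00"
    action:
      - service: homeassistant.update_entity
        target:
          entity_id: sensor.example_weather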

Preserve sensor state after HA restart

Is your feature request related to a problem? Please describe.
I have two sensors set up with ha-multiscrape:

- name: Katya Yandex Rain State
  resource_template: URL_HERE
  scan_interval: 120
  headers: 
    user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36
  sensor:
    - unique_id: katya_yandex_rain_state
      name: Katya Yandex Rain State
      select: div.weather-maps-fact__nowcast-alert
      on_error:
        value: last
      value_template: >
        {{ (value.split(".")[1]) }}

- name: Katya Home to Work Commute State
  resource_template: URL_HERE
  scan_interval: 120
  headers: 
    user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36
  sensor:
    - unique_id: katya_home_to_work_commute_state
      name: Katya Home to Work Commute State
      select: div.auto-route-snippet-view__route-title-primary
      unit_of_measurement: min
      on_error:
        value: last
      value_template: >
        {% if "мин" in value and "ч" in value %}
          {{ value|regex_findall_index(find="\d+.", index=0, ignorecase=False)|int * 60 +
            value|regex_findall_index(find="\d+.", index=1, ignorecase=False)|int }}
        {% elif "ч" in value %}
          {{ value|regex_findall_index(find="\d+.", index=0, ignorecase=False)|int * 60 }}
        {% elif "мин" in value %}
          {{ value|regex_findall_index(find="\d+.", index=0, ignorecase=False)|int }}
        {% else %}
          unavailable
        {% endif %}

They both work fine with one exception: after a Home Assistant restart their values are lost. The first sensor ends up in the "unknown" state, the second in a "1" state.

Describe the solution you'd like
I would like to have sensors that will preserve their state after HA restart.

Add default value if element not available

I am selecting values from a very dynamic page, and some of the time certain values are not included in the HTML. This results in a rather ugly error in the log ("Sensor {name} unable to extract data from HTML") and a value of unknown for the sensor.

What I would like is some kind of default value to be used when the data cannot be extracted, or at least an option to hide the error from the log. With multiple sensors, the log collects a lot of false errors.

Using resource_template, there are unimportant errors in the log at Home Assistant startup

Version of the custom_component

v5.7.0

System Health

version core-2022.3.8
installation_type Home Assistant Container
dev false
hassio false
docker true
user root
virtualenv false
python_version 3.9.9
os_name Linux
os_version 5.10.103-v7l+
arch armv7l
timezone Europe/Rome
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 3957
Installed Version 1.24.1
Stage running
Available Repositories 1006
Downloaded Repositories 33
Home Assistant Cloud
logged_in false
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Lovelace
dashboards 1
resources 19
views 24
mode storage

Configuration

sensor:
  - platform: file
    file_path: /config/filestatisensori/linkautoletturagas.txt
    name: link gas da file

multiscrape:
  - resource_template: "{{ states('sensor.link_gas_da_file')}}"
    scan_interval: 3600
    name: letture date gas
    sensor:
      - unique_id: fai_la_lettura_del_gas_dal
        name:  Fai la lettura del gas dal
        select: 'script[type="text\/plain"]'
        value_template: "{{ (value | regex_findall_index('data_da..............', index=0)).split('\"')[-2].strip()  }}"
      - unique_id: fai_la_lettura_del_gas_al
        name:  Fai la lettura del gas al
        select: 'script[type="text\/plain"]'
        value_template: "{{ (value | regex_findall_index('data_a..............', index=0)).split('\"')[-2].strip()  }}"

Describe the bug

During Home Assistant startup there are errors in the log saying that the resource_template does not produce a link starting with http:// or https://.
As you can see from my configuration.yaml, I use a file sensor whose txt file contains the URL.
I don't know whether it's possible to add a delay (standard or user-configurable) for the component/sensor during startup, or a template that waits.
It's not a big problem; maybe there is already a solution that I don't know about. Even with this log error it works great, so thanks a lot for your fantastic component! One possible workaround is sketched below.
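
One possible workaround, purely as a sketch, is to guard the template so it always produces something that looks like a URL until the file sensor is ready; example.invalid is just a placeholder, and the scrape will still fail quietly until the real URL is available:

multiscrape:
  - resource_template: >-
      {% set url = states('sensor.link_gas_da_file') %}
      {{ url if url.startswith('http') else 'https://example.invalid' }}
    scan_interval: 3600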

Debug log


2022-04-01 22:21:19 ERROR (MainThread) [custom_components.multiscrape.scraper] Error fetching data: unknown failed with Request URL is missing an 'http://' or 'https://' protocol.
2022-04-01 22:21:19 ERROR (MainThread) [custom_components.multiscrape.scraper] Unable to parse response.
2022-04-01 22:21:19 WARNING (MainThread) [homeassistant.components.sensor] Platform multiscrape not ready yet: Request URL is missing an 'http://' or 'https://' protocol.; Retrying in background in 30 seconds

