darrenjennings / algolia-docsearch-action
runs the docsearch scraper and updates an index
Hi, I'm getting an error when running an indexing job:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb6 in position 8: ordinal not in range(128)
Do you have any hints?
Here is the full log:
(...)
Successfully installed certifi-2022.5.18.1 distlib-0.3.4 filelock-3.4.1 importlib-metadata-4.8.3 importlib-resources-5.4.0 pipenv-2022.4.8 platformdirs-2.4.0 six-1.16.0 typing-extensions-4.1.1 virtualenv-20.14.1 virtualenv-clone-0.5.7 zipp-3.6.0
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Installing dependencies from Pipfile.lock (aabb41)...
2022-05-31 16:57:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://dev.decipad.com/docs/language/numbers/> (referer: https://dev.decipad.com/docs/sitemap.xml)
Traceback (most recent call last):
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/abstract_strategy.py", line 40, in get_dom
    body = response.body.decode(response.encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9c in position 4: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/documentation_spider.py", line 169, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/documentation_spider.py", line 148, in add_records
    records = self.strategy.get_records_from_response(response)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/default_strategy.py", line 39, in get_records_from_response
    self.dom = self.get_dom(response)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/abstract_strategy.py", line 43, in get_dom
    result = lxml.html.fromstring(response.body)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 764, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty
2022-05-31 16:57:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://dev.decipad.com/docs/language/> (referer: https://dev.decipad.com/docs/sitemap.xml)
Traceback (most recent call last):
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/abstract_strategy.py", line 40, in get_dom
    body = response.body.decode(response.encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb6 in position 8: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 662, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/documentation_spider.py", line 169, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/documentation_spider.py", line 148, in add_records
    records = self.strategy.get_records_from_response(response)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/default_strategy.py", line 39, in get_records_from_response
    self.dom = self.get_dom(response)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/strategies/abstract_strategy.py", line 43, in get_dom
    result = lxml.html.fromstring(response.body)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 764, in document_fromstring
    "Document is empty")
This does not appear to work; you get a "run command not found" error, and I can't figure out why.
For the past two days, the workflow has been crashing with the following error:
Cloning into 'docsearch-scraper'...
Collecting pipenv
Downloading pipenv-2021.5.29-py2.py3-none-any.whl (3.9 MB)
Collecting virtualenv-clone>=0.2.5
Downloading virtualenv_clone-0.5.7-py3-none-any.whl (6.6 kB)
Requirement already satisfied: pip>=18.0 in /usr/local/lib/python3.6/site-packages (from pipenv) (21.2.4)
Collecting certifi
Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
Requirement already satisfied: setuptools>=36.2.1 in /usr/local/lib/python3.6/site-packages (from pipenv) (57.5.0)
Collecting virtualenv
Downloading virtualenv-20.10.0-py2.py3-none-any.whl (5.6 MB)
Collecting importlib-resources>=1.0
Downloading importlib_resources-5.4.0-py3-none-any.whl (28 kB)
Collecting filelock<4,>=3.2
Downloading filelock-3.3.2-py3-none-any.whl (9.7 kB)
Collecting six<2,>=1.9.0
Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting distlib<1,>=0.3.1
Downloading distlib-0.3.3-py2.py3-none-any.whl (496 kB)
Collecting platformdirs<3,>=2
Downloading platformdirs-2.4.0-py3-none-any.whl (14 kB)
Collecting importlib-metadata>=0.12
Downloading importlib_metadata-4.8.1-py3-none-any.whl (17 kB)
Collecting backports.entry-points-selectable>=1.0.4
Downloading backports.entry_points_selectable-1.1.0-py2.py3-none-any.whl (6.2 kB)
Collecting zipp>=0.5
Downloading zipp-3.6.0-py3-none-any.whl (5.3 kB)
Collecting typing-extensions>=3.6.4
Downloading typing_extensions-3.10.0.2-py3-none-any.whl (26 kB)
Installing collected packages: zipp, typing-extensions, importlib-metadata, six, platformdirs, importlib-resources, filelock, distlib, backports.entry-points-selectable, virtualenv-clone, virtualenv, certifi, pipenv
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Successfully installed backports.entry-points-selectable-1.1.0 certifi-2021.10.8 distlib-0.3.3 filelock-3.3.2 importlib-metadata-4.8.1 importlib-resources-5.4.0 pipenv-2021.5.29 platformdirs-2.4.0 six-1.16.0 typing-extensions-3.10.0.2 virtualenv-20.10.0 virtualenv-clone-0.5.7 zipp-3.6.0
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Installing dependencies from Pipfile.lock (aabb41)...
WARNING: The directory '/github/home/.cache/pipenv' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Collecting incremental==21.3.0
Downloading incremental-21.3.0-py2.py3-none-any.whl (15 kB)
Installing collected packages: incremental
WARNING: Ignoring invalid distribution -mportlib-metadata (/usr/local/lib/python3.6/site-packages)
Successfully installed incremental-21.3.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Traceback (most recent call last):
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/config/config_loader.py", line 101, in _load_config
    data = json.loads(config, object_pairs_hook=OrderedDict)
  File "/usr/local/lib/python3.6/json/__init__.py", line 367, in loads
    return cls(**kw).decode(s)
  File "/usr/local/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "docsearch", line 5, in <module>
    run()
  File "/github/workspace/docsearch-scraper/cli/src/index.py", line 161, in run
    exit(command.run(sys.argv[2:]))
  File "/github/workspace/docsearch-scraper/cli/src/commands/run_config.py", line 21, in run
    return run_config(args[0])
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/index.py", line 33, in run_config
    config = ConfigLoader(config)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/config/config_loader.py", line 69, in __init__
    data = self._load_config(config)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/config/config_loader.py", line 106, in _load_config
    raise ValueError('CONFIG is not a valid JSON')
ValueError: CONFIG is not a valid JSON
Here is my config:
name: Docsearch Scrap
on:
  schedule:
    - cron: "0 8 * * *"
jobs:
  scrap:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: darrenjennings/algolia-docsearch-action@master
        with:
          algolia_application_id: "MY_ID"
          algolia_api_key: ${{secrets.DOCSEARCH_API_KEY}}
          file: "$GITHUB_WORKSPACE/docsearch-scrapper-config.json"
First of all, excellent action for uploading search indexes to Algolia!
I am using it for a documentation repo.
Currently, the action takes around 1.5 minutes to complete, which seems fine to me but could certainly be improved.
If we could speed up the build using caching, that would be awesome.
I found one example: https://pythonspeed.com/articles/speeding-up-docker-ci/
If this idea can work out, I am willing to collaborate.
Thank you.
Please see the action log: https://github.com/ant-design-blazor/ant-design-blazor/runs/6823621463?check_suite_focus=true
Installing dependencies from Pipfile.lock (aabb41)...
Traceback (most recent call last):
  File "docsearch", line 5, in <module>
    run()
  File "/github/workspace/docsearch-scraper/cli/src/index.py", line 161, in run
    exit(command.run(sys.argv[2:]))
  File "/github/workspace/docsearch-scraper/cli/src/commands/run_config.py", line 21, in run
    return run_config(args[0])
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/index.py", line 33, in run_config
    config = ConfigLoader(config)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/config/config_loader.py", line 78, in __init__
    self.user_agent)
  File "/github/workspace/docsearch-scraper/cli/../scraper/src/config/browser_handler.py", line 34, in init
    CHROMEDRIVER_PATH))
Exception: Env CHROMEDRIVER_PATH='/usr/bin/chromedriver' is not a path to a file
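Judging by the traceback, the failing check in `browser_handler.py` is a simple file-existence test on the env var, so the image the action ran in evidently had no chromedriver binary at `/usr/bin/chromedriver`. A standalone sketch of that check (hypothetical helper name; only the env var name and message come from the log):

```python
import os

def check_chromedriver(path: str) -> None:
    # Mirrors the check that raises in browser_handler.py: the path must
    # point at an existing file, or config loading aborts with an Exception.
    if not os.path.isfile(path):
        raise Exception(
            "Env CHROMEDRIVER_PATH='{}' is not a path to a file".format(path)
        )

# The log above corresponds to something like:
# check_chromedriver(os.environ.get("CHROMEDRIVER_PATH", "/usr/bin/chromedriver"))
```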