kunansy / rnc
Home Page: https://kunansy.github.io/RNC
License: MIT License
API for Russian National Corpus
Make the Corpus and Example classes abstract.
Implement these methods; use generators.
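A minimal sketch of what abstract base classes with generator-based methods could look like. The method names (make_example, examples, items) and the Plain* subclasses are hypothetical and are not taken from the library; only the class names Corpus and Example come from the issue.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class Example(ABC):
    """Hypothetical abstract base for a single corpus example."""

    def __init__(self, text: str) -> None:
        self.text = text

    @abstractmethod
    def items(self) -> Iterator[str]:
        """Yield the fields of the example one by one."""


class Corpus(ABC):
    """Hypothetical abstract base for a corpus."""

    def __init__(self, raw: List[str]) -> None:
        self._raw = raw

    @abstractmethod
    def make_example(self, text: str) -> Example:
        """Build a concrete Example from raw text."""

    def examples(self) -> Iterator[Example]:
        # Generator: each example is built only when it is requested
        for text in self._raw:
            yield self.make_example(text)


class PlainExample(Example):
    def items(self) -> Iterator[str]:
        yield self.text


class PlainCorpus(Corpus):
    def make_example(self, text: str) -> Example:
        return PlainExample(text)
```

With this layout, Corpus and Example cannot be instantiated directly, and each concrete corpus only has to supply its own example factory.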
Currently it is set as mode=para-eng/fra/ger, etc.
Corpus returns 'Not found...' if the query is a dict.
But there are results: corp.open_url() shows the page with them.
Set it via the Pages class. Number of pages: ≤, all, a random page
Line 735 in 2a3e5a4
Here are some minor changes, fixes, and improvements
Path.mkdir(exist_ok=True) instead of Line 47 in 0ef7f28
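A short sketch of the suggested call. The code at the referenced line is not shown in this thread, so the helper name ensure_dir and the folder name "data" used below are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def ensure_dir(path: Path) -> Path:
    # parents=True also creates missing intermediate folders;
    # exist_ok=True makes the call a no-op if the folder already exists
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    data_dir = ensure_dir(Path(tmp) / "data")
    ensure_dir(data_dir)  # the second call does not raise FileExistsError
```

This removes the need for a separate existence check before creating the folder.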
bug
[ERROR] [corpora_requests:whether_result_found] The request is not correct
code
import rnc

ru = rnc.MainCorpus(
    query='корпус',
    p_count=5,
    marker=str.upper)
ru.request_examples()
logs
[28.03.2023 15:24:07,632] [DEBUG] [corpora_requests:is_request_correct] Validating that everything is OK
[28.03.2023 15:24:07,634] [DEBUG] [corpora_requests:whether_result_found] Validating that the request is OK
[28.03.2023 15:24:07,635] [INFO] [corpora_requests:get_htmls] Requested to 'https://processing.ruscorpora.ru/search.xml' [0;1) with params {'env': 'alpha', 'api': '1.0', 'lang': 'en', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 1, 'lex1': 'корпус', 'mode': 'main'}
[28.03.2023 15:24:07,636] [ERROR] [corpora_requests:whether_result_found] The request is not correct: {'env': 'alpha', 'api': '1.0', 'lang': 'en', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 1, 'lex1': 'корпус', 'mode': 'main'}
[28.03.2023 15:24:07,636] [ERROR] [corpora_requests:is_request_correct] HTTP request is wrong
[28.03.2023 15:24:07,637] [ERROR] [corpora:request_examples] Query = ['корпус'], 5, {'env': 'alpha', 'api': '1.0', 'lang': 'en', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 1, 'lex1': 'корпус', 'mode': 'main'}
e = {'env': 'alpha', 'api': '1.0', 'lang': 'en', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 1, 'lex1': 'корпус', 'mode': 'main'}
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File ~/.conda-envs/venv2/lib/python3.8/site-packages/rnc/corpora_requests.py:200, in whether_result_found(url, **kwargs)
199 try:
--> 200 page_html = get_htmls(url, **kwargs)[0]
201 except Exception:
File ~/.conda-envs/venv2/lib/python3.8/site-packages/rnc/corpora_requests.py:164, in get_htmls(url, start, stop, **kwargs)
162 coro_start = time.time()
--> 164 html_codes = asyncio.run(
165 get_htmls_coro(url, start, stop, **kwargs)
166 )
168 logger.info("Request was successfully completed")
File ~/.conda-envs/venv2/lib/python3.8/asyncio/runners.py:33, in run(main, debug)
32 if events._get_running_loop() is not None:
---> 33 raise RuntimeError(
34 "asyncio.run() cannot be called from a running event loop")
36 if not coroutines.iscoroutine(main):
RuntimeError: asyncio.run() cannot be called from a running event loop
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
...
--> 294 raise WrongHTTPRequest(f"{kwargs}")
295 logger.debug("HTTP request is correct, result found")
297 logger.debug("Validating that the last page exists")
WrongHTTPRequest: {'env': 'alpha', 'api': '1.0', 'lang': 'en', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 1, 'lex1': 'корпус', 'mode': 'main'}
As I understand it, the library does not work with the current version of the site? The error occurs when accessing https://processing.ruscorpora.ru/search.xml
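Note that the first exception in the traceback is "asyncio.run() cannot be called from a running event loop", which usually appears when code is executed where an event loop is already running, for example in a notebook. The sketch below shows one general workaround under that assumption; it is not the library's own fix, and run_coro_sync and fetch_stub are hypothetical names.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine


def run_coro_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion, even if a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread: asyncio.run() is safe
        return asyncio.run(coro)
    # A loop is already running here, so asyncio.run() would raise
    # RuntimeError; run the coroutine in a worker thread that creates
    # its own event loop. This blocks the caller until the result is ready.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fetch_stub() -> str:
    # Stand-in for the real HTTP request
    await asyncio.sleep(0)
    return "ok"
```

Running the same script from a plain Python interpreter, outside a notebook, is another way to check whether the request itself is accepted by the site.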
aiojobs uses deprecated event-loop arguments. They will be removed in Python 3.10.
Like
[date time] [level] [line] [message]
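Assuming this describes the desired log record layout, it could be sketched with the standard logging module as follows. Reading "line" as the source line number (%(lineno)d) is an assumption, as are the date format and the helper name build_formatter.

```python
import logging

# Assumption: "line" means the source line number of the logging call
LOG_FORMAT = "[%(asctime)s] [%(levelname)s] [%(lineno)d] [%(message)s]"
DATE_FORMAT = "%d.%m.%Y %H:%M:%S"


def build_formatter() -> logging.Formatter:
    return logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT)


def make_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        # Attach the handler only once to avoid duplicated records
        handler = logging.StreamHandler()
        handler.setFormatter(build_formatter())
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger
```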
RNC should accept only these params in the query and raise an exception if something goes wrong.
Use ujson instead of json; it is faster.
Use
    ... open() as f:
        for line in f:
            ...
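The pattern above can be sketched as a small generator. The function name iter_lines and the utf-8 default are assumptions; the point is only that iterating over the file object reads lines lazily instead of loading the whole file at once.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_lines(path: Path, encoding: str = "utf-8") -> Iterator[str]:
    # The file object is an iterator over its lines, so the file is
    # read lazily rather than loaded whole (as f.readlines() would do)
    with path.open(encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("first\nsecond\n", encoding="utf-8")
    lines = list(iter_lines(sample))
```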
Let the user turn off logging to a file.
Do not create the 'data' folder until it is needed.
Parsing in MultilingualParaCorpus takes a lot of time.
Which parse method should be moved to multiprocessing: parse_doc or parse_example? Profile the project to find out.
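One way to answer this is to run the parser under cProfile and compare cumulative times. In the sketch below, parse_doc and parse_example are stand-ins with made-up bodies, not the library's real methods, and profile_call is a hypothetical helper.

```python
import cProfile
import io
import pstats
from typing import Any, Callable, List


def parse_example(text: str) -> List[str]:
    # Stand-in for the real per-example parser
    return text.split()


def parse_doc(doc: List[str]) -> List[List[str]]:
    # Stand-in for the real per-document parser
    return [parse_example(text) for text in doc]


def profile_call(func: Callable[..., Any], *args: Any) -> str:
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return stream.getvalue()


report = profile_call(parse_doc, ["a b c", "d e f"] * 100)
print(report)
```

If most of the cumulative time sits inside the per-example function, splitting work per document is likely the coarser and cheaper unit to hand to worker processes, but that should be confirmed on real data.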
Add cases for when the result is not found, the last page doesn't exist, etc.
Get full info about the document
A Query class will be used instead of a dict with tags.