fuyb1992 / es_pandas

35.0 7.0 11.0 90 KB

Read, write and update large scale pandas DataFrame with Elasticsearch

License: MIT License

Python 100.00%

pandas elasticsearch large-scale

es_pandas's People

xuehh zhangbk920209 virtustate gxflove307 oskrdt robomotic asmitaccenture gnandaki mrandyaswin shuguangbo

es_pandas's Issues

ModuleNotFoundError: No module named 'progressbar'

progressbar2 is not being installed when installing the package

to_pandas raises an error: None of ['_id'] are in the columns

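Support loading data from multiple indices with the same mapping into one DataFrame

For indices that share the same mapping, such as data stored by date, it should be possible to load the data from several indices into a single DataFrame.
For example: index-2022-01, index-2022-02, index-2022-03 ...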
df = ep.to_pandas('index-2022*',...)

User credentials

Can I pass a username and password to the Elasticsearch connection?
Thanks.

Version check fails using SNAPSHOT

The version string of my Elasticsearch instance ends with SNAPSHOT, which causes es_pandas to fail during initialization.
Version:

7.9.1-SNAPSHOT

Error I'm getting

ValueError: invalid literal for int() with base 10: '1-SNAPSHOT'
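
The error suggests that the version string is split on dots and each piece is passed to int(), so the suffix "1-SNAPSHOT" cannot be parsed. A minimal sketch of a more tolerant parser follows, assuming only the leading major.minor.patch numbers matter. The function name `parse_es_version` is hypothetical and is not part of es_pandas.

```python
import re


def parse_es_version(version):
    """Return (major, minor, patch) from strings such as '7.9.1-SNAPSHOT'."""
    # Match only the leading numeric triple and ignore any suffix.
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Elasticsearch version: {version!r}")
    return tuple(int(part) for part in match.groups())


print(parse_es_version("7.9.1-SNAPSHOT"))
```

Matching with a regular expression instead of splitting on dots keeps pre-release or build suffixes from reaching int().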

sql_query fetch size

Hi there,
which parameter should I pass to set the fetch size, as described here:

https://www.elastic.co/guide/en/elasticsearch/reference/current/sql-translate.html

POST /_sql/translate
{
  "query": "SELECT * FROM library ORDER BY page_count DESC",
  "fetch_size": 10
}

AttributeError: module 'progressbar' has no attribute '__version__'

Summary

In a Python 3.6 virtual environment on Ubuntu 14.04.5 LTS, after installing es_pandas and progressbar2, I get the error "AttributeError: module 'progressbar' has no attribute '__version__'" when running:
from es_pandas import es_pandas

Details

root@ns502245:~# source p36/bin/activate
(p36) root@ns502245:~# pip install progressbar2
Requirement already satisfied: progressbar2 in ./p36/lib/python3.6/site-packages (3.50.0)
Requirement already satisfied: six in ./p36/lib/python3.6/site-packages (from progressbar2) (1.13.0)
Requirement already satisfied: python-utils>=2.3.0 in ./p36/lib/python3.6/site-packages (from progressbar2) (2.4.0)
WARNING: You are using pip version 19.3.1; however, version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(p36) root@ns502245:~# python
Python 3.6.9 (default, Nov 19 2019, 14:10:59)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from es_pandas import es_pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/p36/lib/python3.6/site-packages/es_pandas/__init__.py", line 1, in <module>
    from .es_pandas import es_pandas
  File "/root/p36/lib/python3.6/site-packages/es_pandas/es_pandas.py", line 7, in <module>
    if not progressbar.__version__.startswith('3.'):
AttributeError: module 'progressbar' has no attribute '__version__'
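
The traceback shows the import failing because `progressbar.__version__` is read unconditionally. A minimal sketch of a defensive check follows, assuming the goal is only to learn the major version when it is available. The helper `major_version` and the stand-in modules are hypothetical and are not taken from es_pandas or progressbar2.

```python
import types


def major_version(module):
    """Return the module's major version as an int, or None if unknown."""
    # getattr with a default avoids AttributeError when the attribute is absent.
    version = getattr(module, "__version__", None)
    if not isinstance(version, str):
        return None
    head = version.split(".", 1)[0]
    return int(head) if head.isdigit() else None


# Stand-in modules used only for this demonstration.
with_version = types.ModuleType("fake_progressbar")
with_version.__version__ = "3.50.0"
without_version = types.ModuleType("fake_progressbar")

print(major_version(with_version))
print(major_version(without_version))
```

A caller can then decide how to treat an unknown version instead of crashing at import time.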

Is there any way to force all columns to be pushed as strings?

While importing, pandas turns phone numbers into floats, so converting them to strings adds ".0" at the end.
I decided to check every row for a trailing ".0" and strip it if present, but now the import is 100x slower.
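
Two vectorized alternatives to a per-row check are sketched below in plain pandas, independent of es_pandas. The sketch assumes the data comes from a CSV file and that the column is named `phone`; both are illustrative assumptions, since the issue does not say where the data is read from.

```python
import io

import numpy as np
import pandas as pd

# Option 1: read every column as text so the numbers never become floats.
csv_text = "name,phone\nAnn,5551234567\nBob,\n"
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Option 2: the column is already float64; convert it in one vectorized pass.
# Going through the nullable Int64 dtype drops the ".0" and keeps missing values.
floats = pd.Series([5551234567.0, np.nan])
fixed = floats.astype("Int64").astype("string")

print(df["phone"].tolist())
print(fixed.tolist())
```

Option 2 only works when every non-missing value is a whole number, which is the case for phone numbers that pandas parsed as floats.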

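If progress is not shown, the documents in the index should not be counted

If progress is not shown, the number of documents in the index should not be computed.
This saves overhead, especially when an index holds a very large number of documents or when there are many indices.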

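Why is the _id generated automatically on the first write?

On the first write, Elasticsearch generates a random _id. When I later want to update a document, how can I identify it uniquely? (Querying the documents first and then updating them works, but is sometimes unnecessary.)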
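
One common way to make later updates addressable is to derive the _id from the fields that identify a record, so that the same record always maps to the same document. A minimal sketch follows, using the action dict format accepted by elasticsearch.helpers bulk functions. The helpers `stable_id` and `to_actions` and the field names are hypothetical; whether es_pandas offers its own option for this is not covered here.

```python
import hashlib


def stable_id(record, key_fields):
    """Build a repeatable document ID from the fields that identify a record."""
    raw = "|".join(str(record[field]) for field in key_fields)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def to_actions(records, index, key_fields, op_type="index"):
    """Yield bulk actions in the dict format used by elasticsearch.helpers."""
    for record in records:
        action = {
            "_op_type": op_type,
            "_index": index,
            "_id": stable_id(record, key_fields),
        }
        if op_type == "update":
            action["doc"] = record  # partial-document update
        else:
            action["_source"] = record  # full document
        yield action


rows = [{"user": "ann", "day": "2022-01-01", "clicks": 3}]
first = list(to_actions(rows, "demo-index", ["user", "day"]))
later = list(to_actions(rows, "demo-index", ["user", "day"], op_type="update"))
print(first[0]["_id"] == later[0]["_id"])
```

Because the ID depends only on the key fields, an update can target the document without querying for it first.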

to_es error with show_progress=False

With version 0.17, to_es raises an error when show_progress=False:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/site-packages/elasticsearch/helpers/__init__.py", line 304, in parallel_bulk
    for result in pool.imap(
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 144, in _helper_reraises_exception
    raise ex
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/multiprocessing/pool.py", line 388, in _guarded_task_generation
    for i, x in enumerate(iterable):
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/site-packages/elasticsearch/helpers/__init__.py", line 58, in _chunk_actions
    for action, data in actions:
  File "/opt/anaconda3/envs/algorithms/lib/python3.8/site-packages/es_pandas/es_pandas.py", line 136, in rec_to_actions
    bar.update(i)
TypeError: update() takes 1 positional argument but 2 were given
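
The TypeError indicates that the object used as `bar` when progress is disabled has an update() method that accepts no value, while the loop calls `bar.update(i)`. A minimal sketch of a stand-in that tolerates the same calls as a real progress bar follows. The class `NullProgressBar` and the function `process` are hypothetical and are not the es_pandas implementation.

```python
class NullProgressBar:
    """Hypothetical stand-in used when progress display is disabled."""

    def update(self, *args, **kwargs):
        # Accept and ignore whatever a real progress bar would receive.
        pass

    def finish(self, *args, **kwargs):
        pass


def process(items, bar):
    """Consume items, reporting progress to the bar, and return the count."""
    count = 0
    for _ in items:
        count += 1
        bar.update(count)  # works for both real and null bars
    bar.finish()
    return count


print(process(range(5), NullProgressBar()))
```

Giving the disabled path the same call signature as the enabled one keeps the loop free of show_progress checks.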

_op_type='update' not working

Running the command below does not update the records in Elasticsearch.

ep.to_es(df.iloc[:1000, 1:], index, doc_type=doc_type, _op_type='update')

N/A% (0 of 1000) | | Elapsed Time: 0:00:00 ETA: --:--:--
1000

Unable to upload array as value

(Edited)
I get the following failure when trying to upload a value that is an array.

>>> response = ep.to_es(df, index='myindex', _op_type='update')
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

It seems that the serialize function fails because pd.isna returns an array when its input is an array.
Could you please consider wrapping the output of pd.isna in np.all, so that it always produces a single boolean and arrays can be processed?
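
The suggestion above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual es_pandas serialize function: the names `is_missing` and `serialize` and the record layout are hypothetical.

```python
import numpy as np
import pandas as pd


def is_missing(value):
    """Return one boolean: True only if every element of value is missing."""
    # pd.isna returns an array for array input; np.all collapses it.
    # Note that an empty array counts as missing, because np.all([]) is True.
    return bool(np.all(pd.isna(value)))


def serialize(record):
    """Drop missing fields and turn NumPy arrays into plain lists."""
    out = {}
    for key, value in record.items():
        if is_missing(value):
            continue
        out[key] = value.tolist() if isinstance(value, np.ndarray) else value
    return out


print(serialize({"tags": np.array([1, 2]), "score": np.nan, "name": "x"}))
```

With np.all, an array that is only partly missing is kept as a whole, so individual NaN entries inside it would still need separate handling.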

Something goes wrong when running the template code

I installed es_pandas with pip, along with the other required packages including progressbar2 (> 3), but it does not work.

I get the following error message:
Incorrect version of progerssbar package, please do pip install progressbar2
However, the version that Python detects is that of the python_utils package. After I fixed that, the following error appeared:
TypeError: __init__() got an unexpected keyword argument 'max_value'

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers.errors import BulkIndexError
import time
import pandas as pd
from es_pandas import es_pandas


import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# create es_pandas instance
es = es_pandas(es_url, verify_certs=False, ssl_show_warn=False)

es.to_pandas(index='priam_unified_host-2021-05-03', query_sql='select top 10 * from day-2021-05-03 WHERE EventID=4688')

I get this error:

TypeError: search() got an unexpected keyword argument 'query_sql'

In the to_pandas function, set_index should be called after the dtype is set; otherwise resetting the type of '_id' through dtype fails

    df = pd.DataFrame(self.get_source(anl, show_progress=show_progress, count=count)).set_index('_id')
    if infer_dtype:
        dtype = self.infer_dtype(index, df.columns.values)
    if len(dtype):
        df = df.astype(dtype)
    return df

    df = pd.DataFrame(self.get_source(anl, show_progress=show_progress, count=count))
    if infer_dtype:
        dtype = self.infer_dtype(index, df.columns.values)
    if len(dtype):
        df = df.astype(dtype)
    df = df.set_index('_id')  # call set_index just before returning
    return df
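
The ordering problem can be reproduced in plain pandas, independent of es_pandas. The sketch below assumes a small illustrative frame and dtype mapping: once '_id' has been moved into the index it is no longer a column, so a dtype mapping that names it raises KeyError.

```python
import pandas as pd

df = pd.DataFrame({"_id": [1, 2], "value": ["3", "4"]})
dtype = {"_id": str, "value": int}

# Order used in the first snippet: '_id' is already the index, so astype fails.
try:
    df.set_index("_id").astype(dtype)
    failed = False
except KeyError:
    failed = True

# Order used in the second snippet: cast first, then move '_id' into the index.
fixed = df.astype(dtype).set_index("_id")

print(failed)
print(fixed.index.tolist(), fixed["value"].tolist())
```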