milvus-io / milvus-tools
A data migration tool for Milvus.
License: Apache License 2.0
May I ask if there are any tools available to help me import data from the standalone version of Milvus into the cluster version of Milvus?
Hello guys, here is my question:
If I migrate a collection from Milvus to HDF5 and then from HDF5 to another Milvus, the Milvus auto ID generation breaks. So if I want to insert a new vector into the migrated collection on the other Milvus, it produces the following error message:
Status(code=12, message='Entities IDs are user-defined. Please provide IDs for all entities of the collection.')
[]
I want to keep auto ID generation on the migrated collection. Maybe I am doing something wrong?
The error occurs when segment_list and row_list are empty: total_vectors and total_ids are then used before ever being assigned. This happened to me when I created partition tags for all of my data but passed None for the partition tags during the Milvus-to-HDF5 step; since the default partition tag None holds no data, the error is triggered.
Here is the pull request: #42
When I execute milvusdm --yaml M2M.yaml, I encounter an error.
2021-04-15 19:50:23,301 | ERROR | milvus_to_milvus.py | transform_milvus_data | 44 | Error with: cannot reshape array of size 350208 into shape (171,64)
My milvus version is 1.0.0.
Please provide an example h5 file for float vectors and for binary vectors, for the case where the dimensionality of the vectors is not known.
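A minimal sketch of what such HDF5 files might look like, written with h5py. The dataset names ("embeddings", "ids"), the float32 layout, and the bit-packed binary layout are assumptions, not the tool's documented format; the point is that the dimensionality need not be known up front, since it can be read back from the dataset shape:

```python
import numpy as np
import h5py

# Float-vector file: each row is one float32 vector.
with h5py.File("float_vectors.h5", "w") as f:
    f.create_dataset("embeddings",
                     data=np.random.rand(100, 128).astype(np.float32))
    f.create_dataset("ids", data=np.arange(100, dtype=np.int64))

# Binary-vector file: each row packs dim bits into dim // 8 uint8 bytes.
with h5py.File("binary_vectors.h5", "w") as f:
    bits = np.random.randint(0, 2, size=(100, 512))
    f.create_dataset("embeddings", data=np.packbits(bits, axis=1))  # (100, 64)
    f.create_dataset("ids", data=np.arange(100, dtype=np.int64))

# Recover the dimensionality from the shape instead of knowing it up front.
with h5py.File("float_vectors.h5", "r") as f:
    rows, dim = f["embeddings"].shape
    print(rows, dim)  # → 100 128
```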
Migrating a collection with milvusdm.
Both the source and destination nodes are version 0.10.3 (the source is standalone, the destination is a cluster).
Error message: Error with: local variable 'total_vectors' referenced before assignment
The M2M configuration is as follows:
M2M:
  # The dest-milvus version.
  milvus_version: 0.10.3
  # Working directory of the source Milvus.
  source_milvus_path: '/data0/milvus'
  mysql_parameter:
    host: '172.18.248.189'
    user: 'root'
    port: 3306
    password: '123456'
    database: 'milvus'
  source_collection: # specify the 'partition_1' and 'partition_2' partitions of the 'test' collection.
    tidea_is_sample:
      - ''
  dest_host: '172.18.151.165'
  dest_port: 19531
  mode: 'skip' # 'skip/append/overwrite'
Error log:
2021-11-05 21:18:58,140 | DEBUG | read_milvus_meta.py | connect_mysql | 20 | Successfully connect mysql
2021-11-05 21:18:58,142 | INFO | milvus_to_milvus.py | transform_milvus_data | 38 | Ready to transform all data of collection: tidea_is_sample/partitions: ['']
2021-11-05 21:18:58,143 | DEBUG | read_milvus_meta.py | get_collection_info | 72 | Get collection info(dimension, index_file_size, metric_type, version):((512, 1073741824, 1, '0.10.3'),)
2021-11-05 21:18:58,147 | DEBUG | read_milvus_data.py | read_milvus_file | 89 | Reading milvus/db data from collection: tidea_is_sample/partition:
2021-11-05 21:18:58,148 | DEBUG | read_milvus_meta.py | get_collection_dim_type | 96 | Get meta data about dimension and types: ((512, 1),)
2021-11-05 21:18:58,148 | DEBUG | read_milvus_meta.py | get_collection_segments_rows | 109 | Get meta data about segment and rows: ()
2021-11-05 21:18:58,149 | ERROR | milvus_to_milvus.py | transform_milvus_data | 44 | Error with: local variable 'total_vectors' referenced before assignment
When I run the collection_prepare.py file to run the milvus benchmark, I get this error
AttributeError: 'Collection' object has no attribute 'flush'
How can we fix this?
My commands:
export MILVUSDM_PATH="/home/${MY_USER_NAME}/milvusdm"
export LOGS_NUM=0
pip3 install pymilvusdm
and then:
pymilvusdm
pymilvusdm: command not found
Anything wrong?
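One common cause of "command not found" right after a pip install (an assumption here, since the environment details aren't shown) is that pip dropped the console script into a per-user scripts directory that isn't on PATH. A quick sketch of the workaround:

```shell
# pip frequently installs console scripts into ~/.local/bin, which many
# distros leave off PATH by default. Prepend it before invoking pymilvusdm.
export PATH="$HOME/.local/bin:$PATH"
echo "$PATH" | grep -q "$HOME/.local/bin" && echo "scripts dir on PATH"
```

If `pip3 show -f pymilvusdm` lists the script under a different directory, prepend that directory instead.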
It was found that, in the Milvus-to-Milvus process, if one of the Milvus instances was indexing data, the transfer would fail.
OS: CentOS7.4,
Milvus old version: 0.10.3
Milvus new version: 1.1.1
We migrated from version 0.10.3 to version 1.1.1 with milvusdm:
yaml file:
M2M:
  milvus_version: 1.1.1
  source_milvus_path: '/data0/milvus'
  mysql_parameter:
  source_collection: # specify the 'partition_1' and 'partition_2' partitions of the 'test' collection
    intelligence_picture_v1:
  dest_host: '10.11.205.18'
  dest_port: 19530
  mode: 'skip' # 'skip/append/overwrite'
The error is as follows:
2022-05-06 16:53:26,929 | ERROR | grpc_handler.py | handler | 72 |
Addr [10.11.205.18:19530] fake_register_link
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNIMPLEMENTED
details = ""
debug_error_string = "{"created":"@1651827206.928677178","description":"Error received from peer ipv4:10.11.205.18:19530","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"","grpc_status":12}"
{'API start': '2022-05-06 16:53:26.927801', 'RPC start': '2022-05-06 16:53:26.928089', 'RPC error': '2022-05-06 16:53:26.928964'}
It seems that HDF5-to-Milvus does not support a custom schema. I have 3 columns: "embedding", "id", "other".
It seems that the DM tool only imports the hardcoded groups "embeddings" and "ids".
Migrating a collection with milvusdm.
Destination: version 1.x
Source: version 0.10.x
I ran into exceptions with all three versions of milvusdm; details below:
Version 0.1:
2021-08-23 16:30:23,042 | ERROR | milvus_to_milvus.py | transform_milvus_data | 44 | Error with: cannot reshape array of size 357564416 into shape (43648,256)
Version 1.0:
2021-08-23 16:20:14,277 | ERROR | milvus_client.py | insert | 98 | The amount of data inserted each time cannot exceed 256 MB
0%| | 0/1 [00:09<?, ?it/s]
Version 2.0:
2021-08-23 16:16:19,198 | ERROR | grpc_handler.py | handler | 71 |
Addr [xx.xx.xx.xx:19530] (IP address redacted) fake_register_link
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNIMPLEMENTED
details = ""
debug_error_string = "{"created":"@1629706579.197927267","description":"Error received from peer ipv4:xx.xx.xx.xx:19530","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"","grpc_status":12}"
The tool does not follow the yaml file's collection_parameter: dimension: 256 when the Faiss file has dimension 128. After creating the collection, the dimension is 128 instead of the 256 set in the yaml file.
My files are 11 GB in size. When I perform the migration, I am told that a single insert is too large.
Hi,
I've tried to export Milvus data to HDF5 when running Milvus in a standalone setup using dockerized milvus, etcd, and minio.
The yaml configuration looks like:
M2H:
  milvus_version: 2.0.0
  source_milvus_path: '<directory-where-milvus-volume-is-mapped>'
  mysql_parameter:
  source_collection:
    <my-collection-name>:
      - '_default'
  data_dir: '<data-directory-where-to-export>'
However this fails with:
ERROR | read_milvus_meta.py | connect_sqlite | 31 | SQLite ERROR: connect failed with unable to open database file
So I wonder whether it is possible to use the tool with a Milvus deployment running in standalone mode?
Thanks!
How do I pass the search param for the IVF_FLAT index type exactly with your benchmark code? Assume that I have successfully created an IVF_FLAT index on the dataset. Passing
search_parameters = {
"anns_field": anns_field,
"metric_type": metric_type,
"param": {
"nprobe": 32,
},
"limit": topk,
"expression": expression,
}
gives error:
Traceback (most recent call last):
File "go_benchmark.py", line 167, in <module>
go_search(go_benchmark=go_benchmark, uri=uri, user=user, password=password, collection_name=collection_name,
File "go_benchmark.py", line 113, in go_search
raise ValueError(msg)
ValueError: The type of go_benchmark response is not json: panic: nprobe not valid
I appreciate your help. Thanks.
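One thing that may be worth checking (an assumption, since go_benchmark's expected format isn't documented here): with the plain pymilvus 2.x client, index-specific knobs such as nprobe are nested under a "params" key inside the search param rather than placed directly in it. A sketch of that shape, with hypothetical field values:

```python
# Hypothetical search-parameter layout mirroring pymilvus 2.x, where the
# index-specific settings (nprobe) sit under a nested "params" key
# instead of directly inside "param".
search_parameters = {
    "anns_field": "embedding",       # assumed vector field name
    "metric_type": "L2",             # assumed metric
    "param": {
        "params": {"nprobe": 32},    # nested, rather than {"nprobe": 32}
    },
    "limit": 10,
    "expression": "",
}
print(search_parameters["param"]["params"]["nprobe"])  # → 32
```

If go_benchmark forwards the "param" object verbatim to pymilvus, the missing nesting could explain the "nprobe not valid" panic.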
When I execute milvusdm --yaml M2H.yaml, I get:
2021-04-23 14:28:21,740 | INFO | milvus_to_hdf5.py | read_milvus_data | 50 | Ready to read all data of collection: ann_1m_sq8/partitions: [None] 0%| | 0/1 [00:00<?, ?it/s] 2021-04-23 14:28:21,908 | ERROR | milvus_to_hdf5.py | read_milvus_data | 56 | Error with: cannot reshape array of size 307200000 into shape (600000,16)
Is there any wrong with the data volume?
from milvus import Milvus, IndexType, MetricType, Status
milvus = Milvus(host='milvusv2.local', port='19530')
param = {'collection_name':'test01', 'dimension':256, 'index_file_size':1024, 'metric_type':MetricType.L2}
milvus.create_collection(param)
milvus.create_partition('test01', 'tag01')
import random
vectors = [[random.random() for _ in range(256)] for _ in range(20)]
vector_ids = [id for id in range(20)]
milvus.insert(collection_name='test01', records=vectors, ids=vector_ids)
milvus.insert('test01', vectors, partition_tag="tag01")
ivf_param = {'nlist': 16384}
milvus.create_index('test01', IndexType.IVF_FLAT, ivf_param)
services:
  milvus:
    image: 'milvusdb/milvus:1.0.0-cpu-d030521-1ea92e'
    hostname: milvus.local
    networks:
      binhbtn:
        ipv4_address: 172.23.0.3
    volumes:
      - /tmp/db:/var/lib/milvus/db
      - /tmp/logs:/var/lib/milvus/logs
      - /tmp/wal:/var/lib/milvus/wal
  milvusv2:
    image: 'milvusdb/milvus:1.0.0-cpu-d030521-1ea92e'
    hostname: milvusv2.local
    networks:
      binhbtn:
        ipv4_address: 172.23.0.4
    volumes:
      - /tmp/2/db:/var/lib/milvus/db
      - /tmp/2/logs:/var/lib/milvus/logs
      - /tmp/2/wal:/var/lib/milvus/wal
  python37:
    image: 'python:3.7.13'
    tty: true
    networks:
      binhbtn:
        ipv4_address: 172.23.0.5
    volumes:
      - /tmp/2/db:/var/lib/milvus/db
      - /tmp/2/logs:/var/lib/milvus/logs
      - /tmp/2/wal:/var/lib/milvus/wal
      - /tmp/db:/var/lib/milvus/dest/db
      - /tmp/logs:/var/lib/milvus/dest/logs
      - /tmp/wal:/var/lib/milvus/dest/wal
    depends_on:
      - milvus
      - milvusv2
networks:
  binhbtn:
    driver: bridge
    ipam:
      config:
        - subnet: 172.23.0.0/16
M2M:
  milvus_version: 1.0.0
  source_milvus_path: '/var/lib/milvus'
  mysql_parameter:
  source_collection:
    test01:
  dest_host: 'milvus.local'
  dest_port: 19530
  mode: 'overwrite'
H2M:
  milvus_version: 1.x
  data_path:
  data_dir: '/var/lib/milvus/backup'
  dest_host: '172.23.0.3'
  dest_port: 19530
  mode: 'overwrite'
  dest_collection_name: 'test01'
  dest_partition_name: 'tag01'
  collection_parameter:
    dimension:
    index_file_size:
    metric_type:
M2H:
  milvus_version: 1.0.0
  source_milvus_path: '/var/lib/milvus'
  mysql_parameter:
  source_collection:
    test01:
  data_dir: '/var/lib/milvus/backup'
ERROR | milvus_to_milvus.py | transform_milvus_data | 44 | Error with: local variable 'total_vectors' referenced before assignment
I'm wondering how we can backup index and vector data from a Milvus cluster to an HDF5 file for HA purposes.
When I transform Milvus data, in some situations certain vectors cannot be found in the new Milvus.
origin milvus:
version=0.10.4,
index_type=IndexType.IVF_FLAT
index_param={'nlist': 16384}
new milvus:
version=1.0.0,
index_type=IndexType.IVF_FLAT
index_param={'nlist': 16384}
2021-02-08 11:52:48,802 | DEBUG | data_to_milvus.py | insert_data | 69 | Successfuly insert collection: test_bina/partition: , total num: 5000
Only the total number of vectors is printed; hopefully all the information about what was inserted into the Milvus collection can be printed.
milvusdm --yaml H2M.yaml
0%| | 0/1 [00:00<?, ?it/s]2023-01-09 21:10:13,679 | ERROR | grpc_handler.py | handler | 72 |
Addr [192.168..:19530] bulk_insert
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.RESOURCE_EXHAUSTED
details = "grpc: received message larger than max (77985775 vs. 67108864)"
debug_error_string = "{"created":"@1673316613.678811865","description":"Error received from peer
ipv4:192.168..:19530","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"grpc: received message larger than max (77985775 vs. 67108864)","grpc_status":8}"
{'API start': '2023-01-09 21:10:11.611670', 'RPC start': '2023-01-09 21:10:11.612275', 'RPC error': '2023-01-09 21:10:13.679653'}
2023-01-09 21:10:13,680 | ERROR | milvus_client.py | insert | 86 | <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.RESOURCE_EXHAUSTED
details = "grpc: received message larger than max (77985775 vs. 67108864)"
debug_error_string = "{"created":"@1673316613.678811865","description":"Error received from peer
ipv4:192.168..:19530","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"grpc: received message larger than max (77985775 vs. 67108864)","grpc_status":8}"
0%| | 0/1 [00:04<?, ?it/s]
How can I solve this problem? Thanks.
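The RESOURCE_EXHAUSTED error says one gRPC message exceeded the default 64 MB cap (77985775 vs. 67108864 bytes), so a workaround is to split the insert into batches that stay under the limit. A minimal sketch; the function names are hypothetical, not milvusdm's actual API:

```python
def batched_insert(insert_fn, vectors, ids,
                   max_bytes=64 * 1024 * 1024, bytes_per_vector=None):
    # Split one oversized insert into chunks under the gRPC message cap.
    # bytes_per_vector approximates the wire size of one float32 vector.
    if bytes_per_vector is None:
        bytes_per_vector = len(vectors[0]) * 4
    batch_size = max(1, max_bytes // bytes_per_vector)
    for start in range(0, len(vectors), batch_size):
        insert_fn(vectors[start:start + batch_size],
                  ids[start:start + batch_size])

# Example: 10 vectors of dim 8 (32 bytes each), with a tiny 100-byte cap
# forcing batches of 100 // 32 = 3 vectors.
batches = []
vecs = [[0.0] * 8 for _ in range(10)]
batched_insert(lambda v, i: batches.append(len(v)),
               vecs, list(range(10)), max_bytes=100)
print(batches)  # → [3, 3, 3, 1]
```

Alternatively, raising the server-side max message size avoids splitting, but batching keeps memory use bounded on both ends.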
(Status(code=0, message='Show collections successfully!'), ['milvus_datas'])
2021-03-25 16:50:16,962 | ERROR | milvus_to_milvus.py | transform_milvus_data | 47 | Error with: The sour collection: milvus_datas does not exists.
An error encountered when using DM to migrate data from Milvus to HDF5.
22-06-07 16:21:30,979 | INFO | milvus_to_hdf5.py | read_milvus_data | 49 | Ready to read all data of collection: video_fingerprint/partitions: [None]
0%| | 0/1 [00:03<?, ?it/s]
2022-06-07 16:21:34,230 | ERROR | milvus_to_hdf5.py | read_milvus_data | 56 | Error with: name 'delids' is not defined
I have tried different versions of milvusdm (1.0, 2.0) and the results are the same. It also happened with another Milvus server.
Version of milvusdm: 2.0
Version of Milvus: 1.1.1
Configuration is shown below (M2H.yaml):
M2H:
  milvus_version: 1.1.1
  source_milvus_path: '/data1/milvus_1.x_uni_video'
  mysql_parameter:
    host: '127.0.0.1'
    user: 'root'
    port: 3376
    password: 'xxxxxxxxxxxx'
    database: 'milvus'
  source_collection:
    video_fingerprint:
  data_dir: '/data1/milvus_migration/uni_video'
  mode: 'overwrite'
When trying to read an empty collection, milvusdm fails saying that the total_vectors variable is referenced before assignment. This error originates from the get_files_data function in read_milvus_data.py:
milvus-tools/pymilvusdm/core/read_milvus_data.py
Lines 55 to 80 in 41143e5
When either segment_list or row_list is empty, the for loop won't run, and thus an attempt is made to return total_vectors and total_ids, which haven't been initialized.
I've created a PR with a simple fix/workaround for this issue: #33
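The fix can be sketched as follows: initialize both accumulators before the loop so an empty segment list returns empty results instead of raising UnboundLocalError. This is a simplified stand-in for the real get_files_data, with a stub segment reader; it is not the actual pymilvusdm code:

```python
def read_segment(segment, rows):
    # Stand-in for pymilvusdm's real segment reader.
    return [[0.0] * 4] * rows, list(range(rows))

def get_files_data(segment_list, row_list):
    # Initialize the accumulators up front: if segment_list/row_list are
    # empty, the loop body never runs, and returning uninitialized names
    # is exactly the UnboundLocalError the issue reports.
    total_vectors, total_ids = [], []
    for segment, rows in zip(segment_list, row_list):
        vectors, ids = read_segment(segment, rows)
        total_vectors += vectors
        total_ids += ids
    return total_vectors, total_ids

# An empty collection now yields empty lists instead of raising.
print(get_files_data([], []))  # → ([], [])
```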
Using Milvus 1.1.1, pymilvus==1.1.1 and pymilvusdm==2.0.
pymilvus:
_DIM = 8
from milvus import Milvus, IndexType, MetricType, Status
milvus = Milvus('127.0.0.1', '19530')
collection_name = 'example_collection'
param = {'collection_name': collection_name, 'dimension': _DIM}
milvus.create_collection(param)
milvus.flush([collection_name])
M2H.yml:
M2H:
  milvus_version: 1.1.1
  source_milvus_path: '<SOURCE_MILVUS_PATH>'
  mysql_parameter:
    host: '127.0.0.1'
    user: 'root'
    port: 3306
    password: 'password'
    database: 'milvus'
  source_collection:
    example_collection:
  data_dir: 'backup'
milvusdm --yaml M2H.yml:
<TIMESTAMP> | INFO | milvus_to_hdf5.py | read_milvus_data | 49 | Ready to read all data of collection: example_collection/partitions: [None]
0%| | 0/1 [00:00<?, ?it/s]
<TIMESTAMP>| ERROR | milvus_to_hdf5.py | read_milvus_data | 56 | Error with: local variable 'total_vectors' referenced before assignment
Same error happens when non-default partition is used and contains some vectors, while the default partition stays empty.
Why is the benchmark in a binary, and not in actual code? This is not a transparent way to share and replicate benchmark results.
Why is Milvus 2.x not supported now?
It hasn't been updated for so long.
I want to test the search performance of the annoy index.
After I ingested data into Milvus and built the index, I ran go_benchmark and received the following exception:
It seems that the 'benchmark' binary currently does not support the annoy index.
Does Milvus have a plan to open-source the benchmark code?
We are going to add a new field such as "hash", but I have read in some articles that Milvus doesn't support altering a schema yet.
So we're testing the milvusDM tool.
Is it possible to migrate data from the original collection to a new collection on the same host, like below?
The original collection schema is id / image_url / embeddings.
The new collection schema is id / image_url / embeddings / hash.
Thank you.
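Since schemas can't be altered in place, the usual route is copy-and-backfill: read all entities from the original collection, compute the new "hash" field, and insert into a freshly created collection with the extended schema. A rough sketch with stubbed I/O; the real read and insert calls depend on your Milvus/pymilvus version, and the SHA-1-of-URL hash is only an example:

```python
import hashlib

def migrate(read_batches, insert_fn):
    # read_batches yields (ids, image_urls, embeddings) tuples from the
    # old collection; insert_fn writes rows into the new collection whose
    # schema adds the "hash" field. Both are stand-ins for real client calls.
    for ids, urls, embeddings in read_batches:
        hashes = [hashlib.sha1(u.encode()).hexdigest() for u in urls]
        insert_fn(ids, urls, embeddings, hashes)

# Example with an in-memory "collection":
out = []
batches = [(list(range(3)), ["a.png", "b.png", "c.png"], [[0.0] * 4] * 3)]
migrate(iter(batches), lambda *cols: out.append(cols))
print(len(out[0][3]))  # → 3
```

Batching the reads keeps memory bounded for large collections, which also sidesteps the gRPC message-size limit on insert.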
I tested milvusdm with 100 million and with 10 million records, respectively. When the amount of data is large, it simply does not run on small machines. In summary: milvusdm uses a very large amount of memory; please fix this.
2022-07-25 11:26:22,835 | ERROR | grpc_handler.py | ping | 338 | Retry to connect server 127.0.0.1:19530 failed.
2022-07-25 11:26:22,835 | ERROR | main.py | fai2mil | 72 | Fail connecting to server on 127.0.0.1:19530. Timeout
When I execute 'milvusdm --yaml M2M.yaml', I get the exception: Error with: name 'delids' is not defined
As the title says, our Milvus uses etcd to manage metadata, but the Milvus-to-Milvus migration has no configuration option for connecting to etcd.
Can the milvus benchmark 2.1 measure recall? I find that it can only measure throughput now.
For now, "pymilvusdm only supports faiss flat and ivf_flat index files", so is there a plan to support faiss ivf_pq index files?
Or is there any clue or documentation for starting this work by loading the index?
The header is 'IxPT'.
Migrating a collection with milvusdm.
Both the source and destination nodes are version 0.10.5.
Since version 0.10.5 has no partitions, the configuration is as follows:
source_collection:
  collection_name_xxx:
    - ''
The following exception is raised during the migration. Is this a limitation of the destination cluster? How can I resolve it?
ERROR | milvus_to_milvus.py | transform_milvus_data | 44 | Error with: cannot reshape array of size 100335616 into shape (48992,64)
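For what it's worth, every reshape failure reported against milvusdm in these issues shows the same pattern: the flat buffer is exactly 32 times larger than the target rows × dim, which hints at one consistent unit mismatch (for instance, raw bytes of float32 data paired with a bit-packed dimension) rather than corrupt data. This is an observation from the logs, not a confirmed diagnosis. A quick check:

```python
# (buffer size, target rows, target dim) from the reshape errors quoted
# in this issue list; each buffer is exactly 32x the target element count.
cases = [
    (350208, 171, 64),        # M2M, 2021-04-15 log
    (357564416, 43648, 256),  # M2M, 2021-08-23 log (milvusdm 0.1)
    (307200000, 600000, 16),  # M2H, 2021-04-23 log
    (100335616, 48992, 64),   # M2M, 0.10.5 migration log
]
for size, rows, dim in cases:
    print(size // (rows * dim), size % (rows * dim))  # → 32 0 each time
```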
When migrating Milvus to HDF5 with no partition set, the generated file is named None.h5; I don't know whether that works.
When migrating HDF5 to Milvus, I don't know what to specify for the 'dest_partition_name' attribute when I don't have a partition; in the end, I gave an empty string.
Finally, the following error occurred:
2022-06-21 16:44:23,939 | ERROR | grpc_handler.py | handler | 72 |
Addr [192.168.23.131:19530] fake_register_link
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNIMPLEMENTED
details = ""
debug_error_string = "{"created":"@1655801063.939163866","description":"Error received from peer ipv4:192.168.23.131:19530","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"","grpc_status":12}"
{'API start': '2022-06-21 16:44:23.937994', 'RPC start': '2022-06-21 16:44:23.938438', 'RPC error': '2022-06-21 16:44:23.939350'}
2022-06-21 16:47:28,638 | ERROR | main.py | execute | 139 | server is not healthy, please try again later
I'm really going crazy. Can you tell me how to specify the configuration items when I don't have a specific partition?
Hi, I'm using "/" to build a hierarchical structure in the partition tag, and saving with M2H fails.
In save_data.py, hdf5_filename and yaml_filename are created by simply concatenating the partition_tag.
I think "/" should be escaped, or os.makedirs should be called in save_hdf5_data (not in __init__).
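The two options above can be sketched as follows, with a hypothetical save path; neither is the actual save_data.py code:

```python
import os

def safe_hdf5_path(data_dir, collection, partition_tag):
    # Option 1: escape path separators, so the tag "a/b" becomes the
    # single filename "a__b.h5" instead of a nested path.
    safe_tag = partition_tag.replace(os.sep, "__").replace("/", "__")
    return os.path.join(data_dir, collection, safe_tag + ".h5")

def ensure_parent_dirs(path):
    # Option 2: keep "/" in the tag and create the intermediate
    # directories just before writing (i.e. in save_hdf5_data,
    # not in __init__).
    os.makedirs(os.path.dirname(path), exist_ok=True)

p = safe_hdf5_path("backup", "my_collection", "level1/level2")
print(p)  # e.g. backup/my_collection/level1__level2.h5
```

Option 1 keeps one file per partition tag; option 2 preserves the hierarchy on disk but means the reader must walk subdirectories.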