Coder Social home page Coder Social logo

learnedindex_yang's Introduction

Learned Indexes

Implementation of BTree part for paper 'The Case for Learned Index Structures'.

T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The Case for Learned Index Structures. https://arxiv.org/abs/1712.01208, 2017

Language: Python
Support content: Integer values, Random and Exponential distribution

Files Structures

data/: test data
model/: learned NN model
perfromance/:NN and BTree performance
Learned_BTree.py: main file
Trained_NN.py: NN structures

HOW TO RUN

First, you need to install python2.7.x and package tensorflow, pandas, numpy, enum.
Second, use command to run the Learned_BTree.py fule, that is,
python Learned_BTree.py -t <Type> -d <Distribution> [-p|-n] [Percent]|[Number] [-c] [New data] [-h].

Parameters:
'type': 'Type: sample, full',
'distribution': 'Distribution: random, exponential',
'percent': 'Percent: 0.1-1.0, default value = 0.5; sample train data size = 300,000',
'number': 'Number: 10,000-10,000,000, default value = 300,000',
'new data' 'New Data: INTEGER, 0 for no creating new data file, others for creating'

Example:
python Learned_BTree.py -t full -d random -n 100000 -c 1

Other Content

Sample Training

Sample training is also included in this project, you can use parameter 'sample' for '-t' to test sample training, while '-p' is used for change the sample training percent.

Example:
python Learned_BTree.py -t sample -d random -p 0.3 -c 0

Storage Optimization

More Information will be added soon.


中文

本项目代码是对《The Case For Learned Index Structures》一文的简单实现,实现了文章中BTree的部分,目前支持整数测试集,可以选用随机分布或者指数分布进行测试。

T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The Case for Learned Index Structures. https://arxiv.org/abs/1712.01208, 2017

此外,项目还对有新数据插入的场景进行了探索。

数据索引

主要步骤

  1. 依据论文中**,搭建混合多级神经网络架构 Stage Models
 Input: int threshold, int stages[]
 Data: record data[]
 Result: trained index
1 M = stages.size;
2 tmp_records[][];
3 tmp_records[1][1] = all_data;
4 for i ← 1 to M do
5   for j ← 1 to stages[i] do
6     index[i][j] = new NN trained on tmp.records[i][j];
7     if i < M then
8       for r ∈ tmp.records[i][j] do
9         𝑞 = f(r.key) / stages[i + 1];
10        tmp.records[i + 1][ 𝑞].add(r);
11 for j ← 1 to index[M].size do
12   index[M][j].calc_err(tmp.records[M][j]);
13   if index[M][j].max_abs_err > threshold then
14     index[M][j] = new B-Tree trained on tmp_records[M][j];
15 return index;

在以上程序中,从整个数据集开始(第3行)开始,首先训练第1级模型。基于第1级模型的预测,从下一级挑选模型,并添加相应的关键字到该模型训练集(第9行和第10行),然后训练这个模型。最后,检查最后一级的模型,如果平均误差高于预定义的阈值,则用B树替换神经网络模型(第11-14行)。 所使用的模型均为全连接网络,随机分布用的是没有隐藏的全连接网络;指数分布用的是有1个有8个核的隐藏层的全连接网络

  1. 使用数据测试神经网络索引和B树索引,对比两者性能

抽样学习

神经网络模型的训练需要较长的时间,通过抽取一部分数据训练的方式,加快训练的速度。


存储优化

基于后续插入数据的分布与现有分布相近的观点。

主要步骤

  1. 根据建立的数据索引估计数据分布,并移动数据的位置,预留出空间。比如原先0-100的数据占据100个BLOCK,预计最终存储数据是现在的2倍,则预留100个BLOCK。

  2. 插入数据,与不进行优化比较。

优势

  1. 新插入数据冲突少,加快插入速度。
  2. 无需重新调整索引,降低索引维护代价,支持了新数据插入场景。

learnedindex_yang's People

Contributors

yangjufo avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.