sesame's Introduction

Sesame

This project aims to build a scalable stream mining library on modern hardware.

  • The repo currently contains several representative real-world stream clustering algorithms and several synthetic algorithms.
  • We welcome your contributions. If you are interested in contributing to the project, please fork the repo and submit a PR. If you have questions, feel free to open an issue.

Build Dependency

  • GCC-11 (In our paper, we use gcc-11.2.0)
  • Boost: 1.78.0
  • GFLAGS: 2.2.0

Real-world algorithms

Algorithm    Window Model  Outlier Detection  Summarizing Data Structure  Offline Refinement
BIRCH        LandmarkWM    OutlierD           CFT
CluStream    LandmarkWM    OutlierD-T         MCs
DenStream    DampedWM      OutlierD-BT        MCs
DStream      DampedWM      OutlierD-T         Grids
StreamKM++   LandmarkWM    NoOutlierD         CoreT
DBStream     DampedWM      OutlierD-T         MCs
EDMStream    DampedWM      OutlierD-BT        DPT
SL-KMeans    SlidingWM     NoOutlierD         AMS

Synthetic algorithms

Algorithm    Window Model  Outlier Detection  Summarizing Data Structure  Offline Refinement
G1           LandmarkWM    OutlierD           MCs
G2           LandmarkWM    OutlierD           MCs
G3           LandmarkWM    OutlierD           CFT
G4           SlidingWM     OutlierD           MCs
G5           DampedWM      OutlierD-B         MCs
G6           LandmarkWM    NoOutlierD         MCs
G8           LandmarkWM    OutlierD           MCs
G9           LandmarkWM    OutlierD           Grids
G10          LandmarkWM    OutlierD           DPT
G11          LandmarkWM    OutlierD-T         MCs
G12          LandmarkWM    OutlierD-B         MCs
G13          LandmarkWM    OutlierD-BT        MCs
G14          LandmarkWM    OutlierD           AMS
G15          LandmarkWM    OutlierD           CoreT

Datasets

DataSet    Length                                 Dimension  Cluster Number
CoverType  581012                                 54         7
KDD-99     4898431                                41         23
Insects    905145                                 33         24
Sensor     2219803                                5          55
EDS        45690, 100270, 150645, 200060, 245270  2          75, 145, 218, 289, 363
ODS        94720, 97360, 100000                   2          90, 90, 90

You may download the datasets here: https://zenodo.org/records/8210331

How to Cite Sesame

  • [SIGMOD 2023] Xin Wang, Zhengru Wang, Zhenyu Wu, Shuhao Zhang, Xuanhua Shi, and Li Lu. Data Stream Clustering: An In-depth Empirical Study. SIGMOD, 2023.
@inproceedings{wang2023sesame,
	title        = {Data Stream Clustering: An In-depth Empirical Study},
	author       = {Xin Wang and Zhengru Wang and Zhenyu Wu and Shuhao Zhang and Xuanhua Shi and Li Lu},
	year         = 2023,
	booktitle    = {Proceedings of the 2023 International Conference on Management of Data (SIGMOD)},
	location     = {Seattle, WA, USA},
	publisher    = {Association for Computing Machinery},
	address      = {New York, NY, USA},
	series       = {SIGMOD '23},
	abbr         = {SIGMOD},
	bibtex_show  = {true},
	selected     = {true},
	pdf          = {papers/Sesame.pdf},
	code         = {https://github.com/intellistream/Sesame},
	doi          = {10.1145/3589307},
	url          = {https://doi.org/10.1145/3589307}
}

sesame's People

Contributors

gabrielwunr, shuhaozhangtony, tuidan, wzru, zhonghao-yang

sesame's Issues

Data Generator

Many existing works use their own synthetic data generators. We need to design our own comprehensive synthetic data generator to evaluate different workload aspects.

DataSource.cpp

The load method still contains hard-coded components. Remove them and clean up the code.

Make print_help more useful.

void BenchmarkUtils::print_help(char *string) {
  SESAME_ERROR("Usage: " << string << " [options]");
  SESAME_ERROR(" Available options: ");
}

Den-stream

  • Need to summarize Den-stream in paper
  • Complete coding of Den-stream

Implement STREAM algorithm

Clustering Data Streams, by Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan, which appeared in FOCS 2000.

Synthetic DataSet

The previous synthetic dataset is not suitable, so we need to design a new one. Some aspects to consider:
length / cluster number / dimension / arrival rate / frequency of concept evolution / outliers
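
As a starting point for discussion, here is a minimal sketch of a generator that covers four of the aspects listed above: length, cluster number, dimension, and outliers. Arrival rate and concept evolution are not modeled. All names (`SyntheticConfig`, `LabeledPoint`, `GenerateStream`) and the value ranges are assumptions made for illustration and do not come from the Sesame code base.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical configuration; field names and defaults are illustrative only.
struct SyntheticConfig {
  std::size_t length = 1000;    // number of points in the stream
  std::size_t numClusters = 5;  // number of Gaussian clusters
  std::size_t dimension = 2;    // dimensionality of each point
  double outlierRatio = 0.05;   // probability that a point is an outlier
  unsigned seed = 42;           // fixed seed for reproducible streams
};

struct LabeledPoint {
  std::vector<double> features;
  int label;  // cluster index, or -1 for an outlier
};

// Generates Gaussian clusters around random centers in [0, 100)^dimension,
// mixed with uniformly scattered outliers.
std::vector<LabeledPoint> GenerateStream(const SyntheticConfig &cfg) {
  std::vector<LabeledPoint> stream;
  if (cfg.numClusters == 0 || cfg.dimension == 0) return stream;

  std::mt19937 rng(cfg.seed);
  std::uniform_real_distribution<double> uniform(0.0, 100.0);
  std::normal_distribution<double> noise(0.0, 1.0);
  std::bernoulli_distribution isOutlier(cfg.outlierRatio);
  std::uniform_int_distribution<std::size_t> pickCluster(0, cfg.numClusters - 1);

  // Draw one random center per cluster.
  std::vector<std::vector<double>> centers(
      cfg.numClusters, std::vector<double>(cfg.dimension));
  for (auto &center : centers)
    for (double &value : center) value = uniform(rng);

  stream.reserve(cfg.length);
  for (std::size_t i = 0; i < cfg.length; ++i) {
    LabeledPoint p;
    p.features.resize(cfg.dimension);
    if (isOutlier(rng)) {
      p.label = -1;
      for (double &value : p.features) value = uniform(rng);
    } else {
      const std::size_t k = pickCluster(rng);
      p.label = static_cast<int>(k);
      for (std::size_t d = 0; d < cfg.dimension; ++d)
        p.features[d] = centers[k][d] + noise(rng);
    }
    stream.push_back(std::move(p));
  }
  return stream;
}
```

Keeping the seed in the configuration makes a generated stream repeatable across runs on the same platform, which helps when comparing algorithms on identical input.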

Remove hard-coded queue initialization

For example:

SESAME::DataSource::DataSource() {
  inputQueue = std::make_shared<rigtorp::SPSCQueue>(1000); // TODO: remove hard-coded queue initialization.
  threadPtr = std::make_shared();
}
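
One possible direction, sketched under assumptions: pass the queue capacity into the constructor and keep the current value only as a named default. `DataSourceSketch`, `BoundedQueueSketch`, and `kDefaultQueueCapacity` are hypothetical names, and the stand-in queue type replaces `rigtorp::SPSCQueue` only so that the example compiles on its own.

```cpp
#include <cstddef>
#include <memory>

// Stand-in for the real queue type; only the capacity matters here.
struct BoundedQueueSketch {
  explicit BoundedQueueSketch(std::size_t capacity) : capacity(capacity) {}
  std::size_t capacity;
};

class DataSourceSketch {
 public:
  // The former magic number becomes a named, overridable default.
  static constexpr std::size_t kDefaultQueueCapacity = 1000;

  explicit DataSourceSketch(std::size_t queueCapacity = kDefaultQueueCapacity)
      : inputQueue(std::make_shared<BoundedQueueSketch>(queueCapacity)) {}

  std::size_t QueueCapacity() const { return inputQueue->capacity; }

 private:
  std::shared_ptr<BoundedQueueSketch> inputQueue;
};
```

With this shape, a caller could construct `DataSourceSketch(64)` for a small test run, while existing call sites that pass no argument keep the capacity of 1000.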

Add SSE evaluation measurement
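
As a rough sketch of what this measurement could compute: the sum of squared errors (SSE) adds up, over all points, the squared Euclidean distance from each point to its nearest cluster center. The names `Point`, `SquaredDistance`, and `ComputeSSE` are assumptions for illustration and are not taken from the Sesame code base. The sketch also assumes that all points and centers have the same dimension.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type, not Sesame's

// Squared Euclidean distance between two points of equal dimension.
double SquaredDistance(const Point &a, const Point &b) {
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

// SSE: for each point, add the squared distance to its nearest center.
// Returns 0 when there are no points or no centers.
double ComputeSSE(const std::vector<Point> &points,
                  const std::vector<Point> &centers) {
  if (centers.empty()) return 0.0;
  double sse = 0.0;
  for (const Point &p : points) {
    double best = std::numeric_limits<double>::max();
    for (const Point &c : centers) {
      const double d = SquaredDistance(p, c);
      if (d < best) best = d;
    }
    sse += best;
  }
  return sse;
}
```

For example, the points (0, 0), (2, 0), and (10, 0) with centers (1, 0) and (10, 0) give an SSE of 1 + 1 + 0 = 2.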

Add an external evaluation measurement that can assess stream clustering on evolving streams

Reference: ‘An Effective Evaluation Measure for Clustering on Evolving Data Streams’ (CMM).

An existing implementation is available in MOA.

Compile on Mac

Log4cxx is not available on Mac.
Use a macro to control whether it is used.
If Log4cxx is not available, change Log4cxx_LOG to cout.