bon's People

Contributors

tiendung

bon's Issues

An Adaptive Radix Tree for storing the dictionary

https://github.com/travisstaloch/art.zig

https://db.in.tum.de/~leis/papers/ART.pdf

This library provides a Zig implementation of the Adaptive Radix Tree (ART). The ART operates similarly to a traditional radix tree but avoids the wasted space of internal nodes by changing the node size. It uses four node sizes (4, 16, 48, 256) and can guarantee that the overhead is no more than 52 bytes per key, though in practice it is much lower. As a radix tree, it provides the following:

  • O(k) operations. In many cases, this can be faster than a hash table since the hash function is an O(k) operation, and hash tables have very poor cache locality.
  • Minimum / Maximum value lookups
  • Prefix compression
  • Ordered iteration
  • Prefix based iteration
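The adaptive node sizes mentioned above can be sketched in C++ as follows. This is a minimal illustration of the idea from the ART paper, not the art.zig API: the names `node_capacity_for`, `Node4` and `Node48` are hypothetical, and Node16 and Node256 are left out for brevity. A small node searches a short key array; a Node48 uses a 256-entry byte index that points into 48 child slots.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Smallest ART node capacity (4, 16, 48 or 256) that can hold `children` child pointers.
inline int node_capacity_for(int children) {
    assert(children >= 0 && children <= 256);
    if (children <= 4) return 4;
    if (children <= 16) return 16;
    if (children <= 48) return 48;
    return 256;
}

// Node4: up to 4 (key byte, child) pairs, found by a short linear scan.
struct Node4 {
    std::uint8_t count = 0;
    std::array<std::uint8_t, 4> keys{};
    std::array<void*, 4> children{};

    void* find(std::uint8_t byte) const {
        for (int i = 0; i < count; ++i)
            if (keys[i] == byte) return children[i];
        return nullptr;
    }
    // Returns false when the node is full; a real tree would then grow it to a Node16.
    bool add(std::uint8_t byte, void* child) {
        if (count == 4) return false;
        keys[count] = byte;
        children[count] = child;
        ++count;
        return true;
    }
};

const std::uint8_t kNoSlot = 0xFF;  // marker for "no child for this key byte"

// Node48: a 256-entry index maps each key byte to one of 48 child slots.
struct Node48 {
    std::uint8_t count = 0;
    std::array<std::uint8_t, 256> slot_of;
    std::array<void*, 48> children{};

    Node48() { slot_of.fill(kNoSlot); }

    void* find(std::uint8_t byte) const {
        std::uint8_t s = slot_of[byte];
        return s == kNoSlot ? nullptr : children[s];
    }
    // Returns false when the node is full; a real tree would then grow it to a Node256.
    bool add(std::uint8_t byte, void* child) {
        if (count == 48) return false;
        assert(slot_of[byte] == kNoSlot);  // this sketch does not handle overwrites
        slot_of[byte] = count;
        children[count] = child;
        ++count;
        return true;
    }
};
```

Each lookup step consumes one key byte, which is where the O(k) bound comes from: a key of k bytes visits at most k nodes.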

Request: instructions for building telexify with Zig

Hello team.
I read through the repo but could not find any section explaining how to build and use telexify.
I installed Zig (Windows x86) but could not use telexify (the prebuilt binary), and rebuilding it with
zig build or zig build -Drelease-fast=true in the folder bon\simdify also failed.

error: no field or member function named 'standardReleaseOptions' in 'Build'
const mode = b.standardReleaseOptions();
~^~~~~~~~~~~~~~~~~~~~~~~
D:\setup\zig-windows-x86_64-0.11.0-dev.3704+729a051e9\lib\std\Build.zig:1:1: note: struct declared here
const std = @import("std.zig");
^~~~~
referenced by:
runBuild__anon_7220: D:\setup\zig-windows-x86_64-0.11.0-dev.3704+729a051e9\lib\std\Build.zig:1602:27
steps__anon_6992: D:\setup\zig-windows-x86_64-0.11.0-dev.3704+729a051e9\lib\build_runner.zig:914:20
remaining reference traces hidden; use '-freference-trace' to see all reference traces

Could you please share usage documentation for this part?

Minimal perfect hash function for Zig

https://github.com/judofyr/zini

Given a set of n elements, with the only requirement being that you can hash them, it generates a hash function which maps each element to a distinct number between 0 and n - 1. The generated hash function is extremely small, typically consuming less than 4 bits per element, regardless of the size of the input type.
The algorithm provides multiple parameters to tune making it possible to optimize for (small) size, (short) construction time, or (short) lookup time.

To give a practical example:

In ~0.6 seconds Zini was able to create a hash function for /usr/share/dict/words containing 235886 words.
The resulting hash function required in total 865682 bits in memory. This corresponds to 108.2 kB in total or 3.67 bits per word.

In comparison, the original file was 2.49 MB and compressing it with gzip -9 only gets it down to 754 kB (which you can't use directly in memory without decompressing it).

It should of course be noted that they don't store the equivalent data as you can't use the generated hash function to determine if a word is present or not in the list. The comparison is mainly useful to get a feeling of the magnitudes.

In addition, Zini provides various functionality for dealing with arrays of numbers:

  • zini.CompactArray stores n-bit numbers tightly packed, leaving no bits unused.
    If the largest value in an array is m then you actually only need n = floor(log2(m)) + 1 bits per element.
    E.g. if the largest value is 270, you will get 7x compression using CompactArray over []u64 as it stores each element using only 9 bits (and 64 divided by 9 is roughly 7).
  • zini.DictArray finds all distinct elements in the array, stores each once into a CompactArray (the dictionary), and creates a new CompactArray containing indexes into the dictionary.
    This will give excellent compression if there's a lot of repetition in the original array.
  • zini.EliasFano stores increasing 64-bit numbers in a compact manner.
  • zini.darray provides constant-time support for the select1(i) operation which returns the i-th set bit in a std.DynamicBitSetUnmanaged.
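The bit-width arithmetic behind CompactArray can be checked with a small C++ sketch. It is not Zini's implementation or API: `bits_needed` and `PackedArray` are hypothetical names, and the bit-at-a-time packing is chosen for clarity rather than speed. It shows that a maximum value of 270 needs 9 bits, and that n-bit values can be stored back to back in 64-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bits needed to represent every value in [0, max_value]:
// floor(log2(max_value)) + 1, or 1 bit when max_value is 0.
inline unsigned bits_needed(std::uint64_t max_value) {
    unsigned bits = 1;
    while (max_value >>= 1) ++bits;
    return bits;
}

// Fixed-size array of `bits`-wide unsigned values packed with no padding between elements.
class PackedArray {
public:
    PackedArray(std::size_t size, unsigned bits)
        : bits_(bits), words_((size * bits + 63) / 64, 0) {
        assert(bits >= 1 && bits <= 64);
    }

    // Stores the low `bits` bits of value at index i, one bit at a time.
    void set(std::size_t i, std::uint64_t value) {
        for (unsigned b = 0; b < bits_; ++b) {
            std::size_t pos = i * bits_ + b;
            std::uint64_t mask = std::uint64_t{1} << (pos % 64);
            if ((value >> b) & 1u) words_[pos / 64] |= mask;
            else words_[pos / 64] &= ~mask;
        }
    }

    std::uint64_t get(std::size_t i) const {
        std::uint64_t value = 0;
        for (unsigned b = 0; b < bits_; ++b) {
            std::size_t pos = i * bits_ + b;
            value |= ((words_[pos / 64] >> (pos % 64)) & 1u) << b;
        }
        return value;
    }

    // Bits of backing storage actually allocated (rounded up to whole 64-bit words).
    std::size_t storage_bits() const { return words_.size() * 64; }

private:
    unsigned bits_;
    std::vector<std::uint64_t> words_;
};
```

With 9 bits per element, ten values fit in two 64-bit words instead of the ten words a plain array of 64-bit integers would use.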

(almost) concurrent hash table for short byte strings, applied to counting elements (insert & lookup only)

The ideal hash table can be used by many threads without conflicts, is cache friendly (flat_map), and takes advantage of SIMD intrinsics for comparing, hashing, probing, ...

SOURCES: https://github.com/search?q=simd+hash+table

https://github.com/matmuher/hash_table_optimize experiments with many hash functions and several ways of optimizing a hash table, including SIMD

https://github.com/michaelvlach/ADbHash
ADbHash is a hash table inspired by Google's "Swiss table", presented at CppCon 2017 by Matt Kulukundis. It is an open-addressing hash table that stores one extra byte per element (key-value pair). This byte holds a control bit (indicating whether the element is empty or full), and the rest of the byte is taken from the hash of the key. When searching the table, 16 of these control bytes are loaded starting at the position where the element is expected to be. They are then compared to a byte constructed from the hash being searched for. This is done with Single Instruction Multiple Data (SIMD), so a typical search in this hash table takes exactly two instructions:

  • Compare 16 bytes with 1 byte.

  • Jump to the matching element.
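The two steps above can be sketched in C++ with SSE2 intrinsics. This is an illustration under assumptions, not ADbHash's code: the control-byte layout (top bit set means empty, low 7 bits come from the hash) and the names `kEmptyCtrl`, `tag_from_hash`, `match_group` and `first_match` are hypothetical. A scalar fallback is included so the sketch also compiles on targets without SSE2.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

const std::uint8_t kEmptyCtrl = 0x80;  // assumed layout: top bit set means "slot is empty"

// Control byte of a full slot: top bit clear, low 7 bits taken from the key's hash.
inline std::uint8_t tag_from_hash(std::uint64_t hash) {
    return static_cast<std::uint8_t>(hash & 0x7F);
}

// Compares 16 control bytes with one tag; bit i of the result is set when ctrl[i] == tag.
inline std::uint32_t match_group(const std::uint8_t* ctrl, std::uint8_t tag) {
    assert(ctrl != nullptr);
#ifdef SKETCH_HAS_SSE2
    __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
    __m128i wanted = _mm_set1_epi8(static_cast<char>(tag));
    return static_cast<std::uint32_t>(_mm_movemask_epi8(_mm_cmpeq_epi8(group, wanted)));
#else
    std::uint32_t mask = 0;
    for (int i = 0; i < 16; ++i)
        if (ctrl[i] == tag) mask |= 1u << i;
    return mask;
#endif
}

// Index of the lowest set bit in a 16-bit match mask, or -1 when nothing matched.
inline int first_match(std::uint32_t mask) {
    for (int i = 0; i < 16; ++i)
        if (mask & (1u << i)) return i;
    return -1;
}
```

Because only 7 hash bits are compared, a match is a candidate rather than a hit: the caller still has to compare the full key stored at that position.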

https://github.com/telekons/42nd-at-threadmill A nearly lock-free* hash table based on Cliff Click's NonBlockingHashMap and Abseil's flat_hash_map. It uses SSE2 intrinsics for fast probing and can optionally use AVX2 for faster byte broadcasting.

https://github.com/efficient/libcuckoo

BPE: parsing fails with the error "Ko tìm thấy count của candidate" (count of the candidate not found)

Ko tìm thấy count của nearby symbol 62010d:ael 2:par
Ko tìm thấy count của nearby symbol 710102:par
Ko tìm thấy count của nearby symbol 710102:par 3:Bel 4:den
Ko tìm thấy count của nearby symbol 650105:den 5:uT 6:rit 7:Oc
Ko tìm thấy count của nearby symbol 730118:rit

=> The error is caused by the hash count

Fix the HashCount bug where a key's value is duplicated

f64eb80

TOTAL 1118271 entries, max_probs: 16, avg_probs: 1 (1321322 / 1118271).

Hash Count Validation: false

count[–]=70342
count[Internet]=42891
count[ km]=39915
count[km²]=39343
count[Geometridae]=38669
count[album]=37671
count[Internet]=36997
count[Internet]=1
count[ km]=1
count[km²]=1

use MMAP to optimize data reading

https://github.com/Highload-fun/platform/wiki/How-to-use-MMAP-to-optimize-data-reading

On the Highload.fun platform, STDIN is a file in RAM, so you can use STDIN_FILENO to access it as a file.
You can then map this file into the virtual address space of the calling process with the mmap function.

C++:

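    #include <unistd.h>    // lseek, STDIN_FILENO
    #include <sys/mman.h>  // mmap, PROT_READ, MAP_PRIVATE, MAP_POPULATE (Linux-specific)
    ...
    // Size of the file behind stdin; this only works when stdin is a real file, not a pipe.
    off_t fsize = lseek(STDIN_FILENO, 0, SEEK_END);
    // Map the whole file read-only; MAP_POPULATE pre-faults the pages. Returns MAP_FAILED on error.
    char* buffer = (char*)mmap(nullptr, fsize, PROT_READ, MAP_PRIVATE | MAP_POPULATE, STDIN_FILENO, 0);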

Also, the Huge Pages feature is available.
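A fuller version of the mapping step, with error handling, might look like the C++ sketch below. It is an illustration, not code from the wiki: `count_newlines_mmap` is a hypothetical helper, it uses fstat instead of lseek to get the size (so the file offset is left untouched), and MAP_POPULATE is applied only where the platform defines it. Counting newlines stands in for whatever parsing the mapped buffer is meant for.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Counts '\n' bytes in the regular file behind fd by mapping it read-only.
// Returns -1 when fd is not a regular file (for example a pipe) or the mapping fails.
inline long count_newlines_mmap(int fd) {
    assert(fd >= 0);
    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) return -1;
    if (st.st_size == 0) return 0;  // mmap rejects zero-length mappings

    int flags = MAP_PRIVATE;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;  // Linux: pre-fault the pages instead of faulting on first touch
#endif
    std::size_t length = static_cast<std::size_t>(st.st_size);
    void* mapping = mmap(nullptr, length, PROT_READ, flags, fd, 0);
    if (mapping == MAP_FAILED) return -1;

    const char* data = static_cast<const char*>(mapping);
    long lines = 0;
    for (std::size_t i = 0; i < length; ++i)
        if (data[i] == '\n') ++lines;

    munmap(mapping, length);
    return lines;
}
```

Calling `count_newlines_mmap(STDIN_FILENO)` works when the program is started with its input redirected from a file, and returns -1 when stdin is a pipe or a terminal.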

Use pre-computed filters to quickly check whether a token is a Vietnamese syllable

Use 65kb bit filters to screen the first 3 bytes or the last 3 bytes of a token and see whether it could be Vietnamese. Only if it passes is the token analyzed further.

For something more sophisticated, use an XorFilter to screen the whole token.

Or use a HashMap, DaTrie, ... to map a token string directly to a syllable_id
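The first idea, a bit filter over the first 3 bytes of a token, can be sketched in C++ as follows. Several details are assumptions, not taken from the issue: the 65kb figure is read as a 64 KiB bitmap (2^19 bits), so the 24-bit prefix has to be folded down to 19 bits, and the XOR fold used here is a hypothetical choice. `PrefixBitFilter` and its methods are illustrative names, and only the first-3-bytes variant is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

const std::uint32_t kFilterBitsLog2 = 19;                 // assumption: 2^19 bits = 64 KiB
const std::uint32_t kFilterBits = 1u << kFilterBitsLog2;

class PrefixBitFilter {
public:
    PrefixBitFilter() : words_(kFilterBits / 64, 0) {}

    // Marks the prefix of a known syllable; done once, ahead of time.
    void add(const std::string& token) {
        std::uint32_t s = slot(token);
        words_[s / 64] |= std::uint64_t{1} << (s % 64);
    }

    // False means the token cannot be a known syllable; true means "analyze further".
    bool may_contain(const std::string& token) const {
        std::uint32_t s = slot(token);
        return (words_[s / 64] >> (s % 64)) & 1u;
    }

    std::size_t size_bytes() const { return words_.size() * sizeof(std::uint64_t); }

private:
    // Packs up to the first 3 bytes into 24 bits, then XOR-folds them into 19 bits.
    static std::uint32_t slot(const std::string& token) {
        assert(!token.empty());
        std::uint32_t key = 0;
        for (std::size_t i = 0; i < 3 && i < token.size(); ++i)
            key = (key << 8) | static_cast<unsigned char>(token[i]);
        return (key ^ (key >> kFilterBitsLog2)) & (kFilterBits - 1);
    }

    std::vector<std::uint64_t> words_;
};
```

The filter can give false positives (two prefixes folding to the same bit, or a non-syllable sharing a prefix with a syllable) but never false negatives, which is why a positive answer still needs the full analysis.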
