telexyz / bon

Optimizing Vietnamese corpus processing

C 0.84% Zig 98.51% Shell 0.65%

bon's Issues

Fast hash table for counting elements (`insert` & `lookup` only)

Combine:

Note: writing an efficient hash function yourself is very hard, so use the hash functions that ship with Zig.
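A minimal sketch of such an insert-and-lookup-only counter, written in Python for illustration rather than taken from the repo. The class name `HashCount`, the power-of-two capacity, linear probing, and the use of CRC-32 as the ready-made hash are all assumptions; the point is only that the table reuses a library hash instead of a hand-written one.

```python
import zlib


class HashCount:
    """Open-addressing counter that supports only insert (count += 1) and lookup."""

    def __init__(self, capacity_pow2: int = 1 << 16):
        # Capacity is a power of two so the slot index is a cheap bit mask.
        assert capacity_pow2 & (capacity_pow2 - 1) == 0
        self.mask = capacity_pow2 - 1
        self.keys = [None] * capacity_pow2
        self.counts = [0] * capacity_pow2
        self.size = 0

    def _slot(self, key: bytes) -> int:
        # Reuse a library hash (CRC-32 here, an assumption) instead of hand-rolling one.
        i = zlib.crc32(key) & self.mask
        # Linear probing: walk forward until the key or an empty slot is found.
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) & self.mask
        return i

    def insert(self, key: bytes) -> int:
        i = self._slot(key)
        if self.keys[i] is None:
            # This sketch never resizes, so one slot always stays empty
            # to guarantee that probing terminates.
            if self.size == self.mask:
                raise MemoryError("table full")
            self.keys[i] = key
            self.size += 1
        self.counts[i] += 1
        return self.counts[i]

    def lookup(self, key: bytes) -> int:
        # A missing key lands on an empty slot whose count is 0.
        return self.counts[self._slot(key)]
```

Because nothing is ever deleted, the table needs no tombstones, which is what keeps both operations this short.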

`char_stream.zig`: split and classify non-alphabetic tokens

An Adaptive Radix Tree for storing the dictionary

https://github.com/travisstaloch/art.zig

https://db.in.tum.de/~leis/papers/ART.pdf

This library provides a zig implementation of the Adaptive Radix Tree or ART. The ART operates similar to a traditional radix tree but avoids the wasted space of internal nodes by changing the node size. It makes use of 4 node sizes (4, 16, 48, 256), and can guarantee that the overhead is no more than 52 bytes per key, though in practice it is much lower. As a radix tree, it provides the following:

O(k) operations. In many cases, this can be faster than a hash table since the hash function is an O(k) operation, and hash tables have very poor cache locality.
Minimum / Maximum value lookups
Prefix compression
Ordered iteration
Prefix based iteration
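The operations listed above can be illustrated with a plain byte trie. This Python sketch is not ART and not the art.zig API: it has no adaptive node sizes and no prefix compression, and the names `Trie`, `iter_prefix`, and `minimum` are made up for the example. It only shows why lookups cost O(k) in the key length and how ordered and prefix-based iteration fall out of the structure.

```python
class TrieNode:
    __slots__ = ("children", "is_key")

    def __init__(self):
        self.children = {}    # byte value -> TrieNode
        self.is_key = False   # True if a stored key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: bytes) -> None:
        node = self.root
        for b in key:  # one step per key byte: O(k)
            node = node.children.setdefault(b, TrieNode())
        node.is_key = True

    def _find(self, prefix: bytes):
        node = self.root
        for b in prefix:
            node = node.children.get(b)
            if node is None:
                return None
        return node

    def contains(self, key: bytes) -> bool:
        node = self._find(key)
        return node is not None and node.is_key

    def iter_prefix(self, prefix: bytes = b""):
        """Yield every stored key that starts with `prefix`, in byte order."""
        start = self._find(prefix)
        if start is None:
            return
        stack = [(start, prefix)]
        while stack:
            node, key = stack.pop()
            if node.is_key:
                yield key
            # Push children in descending order so the smallest byte is popped first.
            for b in sorted(node.children, reverse=True):
                stack.append((node.children[b], key + bytes([b])))

    def minimum(self):
        # The first key in byte order, or None for an empty trie.
        return next(self.iter_prefix(), None)
```

A real ART replaces the per-node dictionary with the four fixed node layouts (4, 16, 48, 256) mentioned above, which is where its memory savings come from.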

`char_stream.zig`: skip tokens longer than buff.len (64 bytes)

Note: overly long tokens are usually meaningless text, so they can be skipped, which simplifies the code.
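The skipping rule could look roughly like the Python sketch below. It is an illustration, not the logic of `char_stream.zig`: the function name `tokens`, whitespace-only splitting, and the `too_long` flag are assumptions; only the 64-byte limit comes from the issue.

```python
MAX_TOKEN_LEN = 64  # mirrors buff.len from the issue title


def tokens(data: bytes, max_len: int = MAX_TOKEN_LEN):
    """Yield whitespace-separated tokens, dropping any token longer than max_len."""
    buf = bytearray()
    too_long = False
    for byte in data:
        if byte in b" \t\r\n":
            # A separator ends the current token; emit it unless it overflowed.
            if buf and not too_long:
                yield bytes(buf)
            buf.clear()
            too_long = False
        elif too_long:
            continue  # still inside an oversized token: ignore bytes until a separator
        elif len(buf) == max_len:
            too_long = True  # one more byte would overflow the buffer: skip this token
            buf.clear()
        else:
            buf.append(byte)
    if buf and not too_long:
        yield bytes(buf)
```

The buffer never grows past `max_len`, so the tokenizer needs no reallocation path and no special handling for partial long tokens.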

Find a suitable data structure for storing BPE symbols

See #18 #19

Request: instructions for building telexify with zig

Hello team.
I read through the repo but could not find any section explaining how to build and use telexify.
I installed zig (Windows x86), but I could not run telexify (the prebuilt binary), and rebuilding it in the folder bon\simdify with:
zig build
zig build -Drelease-fast=true
also failed.

error: no field or member function named 'standardReleaseOptions' in 'Build'
const mode = b.standardReleaseOptions();
~^~~~~~~~~~~~~~~~~~~~~~~
D:\setup\zig-windows-x86_64-0.11.0-dev.3704+729a051e9\lib\std\Build.zig:1:1: note: struct declared here
const std = @import("std.zig");
^~~~~
referenced by:
runBuild__anon_7220: D:\setup\zig-windows-x86_64-0.11.0-dev.3704+729a051e9\lib\std\Build.zig:1602:27
steps__anon_6992: D:\setup\zig-windows-x86_64-0.11.0-dev.3704+729a051e9\lib\build_runner.zig:914:20
remaining reference traces hidden; use '-freference-trace' to see all reference traces

Could you please share usage documentation for this part?

Implement BPE apply

Use macOS perf tools

https://gist.github.com/loderunner/36724cc9ee8db66db305 measures branch misses, cache misses, ...

Minimal perfect hash function for Zig

https://github.com/judofyr/zini

Given a set of n elements, with the only requirement being that you can hash them, it generates a hash function which maps each element to a distinct number between 0 and n - 1. The generated hash function is extremely small, typically consuming less than 4 bits per element, regardless of the size of the input type.
The algorithm provides multiple parameters to tune making it possible to optimize for (small) size, (short) construction time, or (short) lookup time.

To give a practical example:

In ~0.6 seconds Zini was able to create a hash function for /usr/share/dict/words containing 235886 words.
The resulting hash function required in total 865682 bits in memory. This corresponds to 108.2 kB in total or 3.67 bits per word.

In comparison, the original file was 2.49 MB and compressing it with gzip -9 only gets it down to 754 kB (which you can't use directly in memory without decompressing it).

It should of course be noted that they don't store the equivalent data as you can't use the generated hash function to determine if a word is present or not in the list. The comparison is mainly useful to get a feeling of the magnitudes.

In addition, Zini provides various functionality for dealing with arrays of numbers:

zini.CompactArray stores n-bit numbers tightly packed, leaving no bits unused.
If the largest value in an array is m then you actually only need n = log2(m) + 1 bits per element.
E.g. if the largest value is 270, you will get 7x compression using CompactArray over []u64 as it stores each element using only 9 bits (and 64 divided by 9 is roughly 7).
zini.DictArray finds all distinct elements in the array, stores each once into a CompactArray (the dictionary), and creates a new CompactArray containing indexes into the dictionary.
This will give excellent compression if there's a lot of repetition in the original array.
zini.EliasFano stores increasing 64-bit numbers in a compact manner.
zini.darray provides constant-time support for the select1(i) operation which returns the i-th set bit in a std.DynamicBitSetUnmanaged.
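The bit-width rule quoted for `zini.CompactArray` can be checked with a small Python sketch. This is not Zini's API: the class below is a hypothetical stand-in that packs non-negative integers into one arbitrary-precision Python integer, purely to show that `floor(log2(m)) + 1` bits per element are enough.

```python
class CompactArray:
    """Stores fixed-width non-negative integers back to back inside one big int."""

    def __init__(self, values):
        values = list(values)
        largest = max(values, default=0)
        # floor(log2(m)) + 1 bits per element; int.bit_length() computes exactly
        # that for m > 0, and at least 1 bit is kept for an all-zero array.
        self.width = max(1, largest.bit_length())
        self.length = len(values)
        self.bits = 0
        for i, v in enumerate(values):
            self.bits |= v << (i * self.width)

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, i: int) -> int:
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Shift the wanted element down and mask off its neighbours.
        return (self.bits >> (i * self.width)) & ((1 << self.width) - 1)

    def bit_size(self) -> int:
        return self.length * self.width
```

With a largest value of 270 the width comes out as 9 bits, matching the example in the quoted README.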

Dictionary-based token repair

https://github.com/telexyz/engine/blob/main/docs/_token_repair.md

(Almost) concurrent hash table for short byte strings, applied to counting elements (insert & lookup only)

The ideal hash table can be used by many threads without conflicts, is cache friendly (flat_map), and exploits SIMD intrinsics for comparing, hashing, probing, ...

SOURCES: https://github.com/search?q=simd+hash+table

https://github.com/matmuher/hash_table_optimize experiments with many hash functions and several ways of optimizing a hash table, including SIMD

https://github.com/michaelvlach/ADbHash
The ADbHash is a hash table inspired by google's "Swiss table" presented at CppCon 2017 by Matt Kulukundis. It is based on open-addressing hash table storing extra byte per element (key-value pair). In this byte there are stored a control bit (controlling whether the element is empty or full) and the rest of the byte is taken from the hash of the key. When searching through the table 16 of these control bytes are loaded from the position the element we look for is supposed to be. Then they are compared to a byte constructed from the hash we search for. This is achieved by using Single Instruction Multiple Data (SIMD) and thus a typical search in this hash table will take exactly two instructions:

Compare 16 bytes with 1 byte.
Jump to the matching element.
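The control-byte idea can be sketched without SIMD. The Python below is an emulation under stated assumptions, not ADbHash's code: the 16-byte group comparison that real implementations do in one SIMD instruction is a plain loop here, CRC-32 stands in for the hash, deletions are not supported, and the names `ControlByteTable`, `put`, and `get` are invented for the example.

```python
import zlib

GROUP = 16    # control bytes examined per probe step
EMPTY = 0x80  # control bit set: the slot is empty


class ControlByteTable:
    def __init__(self, groups: int = 64):
        self.groups = groups
        self.ctrl = bytearray([EMPTY]) * (groups * GROUP)  # one control byte per slot
        self.slots = [None] * (groups * GROUP)             # (key, value) pairs
        self.size = 0

    @staticmethod
    def _split(key: bytes):
        h = zlib.crc32(key)
        return h >> 7, h & 0x7F  # group selector, 7-bit tag kept in the control byte

    def _probe(self, key: bytes):
        """Return (slot index, found)."""
        h1, tag = self._split(key)
        g = h1 % self.groups
        for _ in range(self.groups):
            base = g * GROUP
            group = self.ctrl[base:base + GROUP]
            # Step 1: compare the 16 control bytes with the tag (SIMD in real code).
            for off, c in enumerate(group):
                # Step 2: on a tag match, confirm with a full key comparison.
                if c == tag and self.slots[base + off][0] == key:
                    return base + off, True
            # Without deletions, an empty slot in this group proves the key is absent.
            off = group.find(EMPTY)
            if off != -1:
                return base + off, False
            g = (g + 1) % self.groups
        raise MemoryError("table full")

    def put(self, key: bytes, value) -> None:
        i, found = self._probe(key)
        if not found:
            self.ctrl[i] = self._split(key)[1]
            self.size += 1
        self.slots[i] = (key, value)

    def get(self, key: bytes, default=None):
        i, found = self._probe(key)
        return self.slots[i][1] if found else default
```

Most lookups are decided by the cheap control bytes alone, so full keys are only touched on a tag match.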

https://github.com/telekons/42nd-at-threadmill A nearly lock-free* hash table based on Cliff Click's NonBlockingHashMap and Abseil's flat_hash_map. It uses SSE2 intrinsics for fast probing and can optionally use AVX2 for faster byte broadcasting.

https://github.com/efficient/libcuckoo

BPE: parsing hits the error "Ko tìm thấy count của candidate" ("count of candidate not found")

Ko tìm thấy count của nearby symbol 62010d:ael 2:par
Ko tìm thấy count của nearby symbol 710102:par
Ko tìm thấy count của nearby symbol 710102:par 3:Bel 4:den
Ko tìm thấy count của nearby symbol 650105:den 5:uT 6:rit 7:Oc
Ko tìm thấy count của nearby symbol 730118:rit

=> The error is caused by hash count

BPE for tokens that are not syllables

Docs: https://github.com/telexyz/turbo/blob/main/docs/byte_pair_encoding.md

Competes with https://github.com/VKCOM/YouTokenToMe
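For orientation, the textbook form of byte pair encoding fits in a few lines of Python. This is a generic sketch, not the repo's or YouTokenToMe's implementation; the function names `learn_bpe`, `merge_pair`, and `apply_bpe`, the character-level start symbols, and the tie-breaking rule are all assumptions.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every non-overlapping occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts: dict, num_merges: int) -> list:
    """Learn up to `num_merges` merges from a {word: count} table."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): c for symbols, c in vocab.items()}
    return merges


def apply_bpe(word: str, merges: list) -> list:
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Recounting all pairs on every merge is the slow part of this version, which is why the pair-count table is the structure worth optimizing.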

Fix the HashCount bug where a key's value is duplicated

f64eb80

TOTAL 1118271 entries, max_probs: 16, avg_probs: 1 (1321322 / 1118271).

Hash Count Validation: false

count[–]=70342
count[Internet]=42891
count[ km]=39915
count[km²]=39343
count[Geometridae]=38669
count[album]=37671
count[Internet]=36997
count[Internet]=1
count[ km]=1
count[km²]=1

`syllable_count.zig` should count syllables with 100% accuracy

Either add a lock so that all threads share one hash count, or give each thread its own hash count and merge them afterwards!
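The two options can be compared side by side. The Python sketch below is an illustration with hypothetical function names, using `collections.Counter` in place of the repo's hash count; both variants should return identical totals, which is the 100% accuracy requirement.

```python
import threading
from collections import Counter


def _run_threads(target, arg_tuples):
    threads = [threading.Thread(target=target, args=args) for args in arg_tuples]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_shared_with_lock(chunks):
    """Option 1: every thread updates one shared counter under a lock."""
    total = Counter()
    lock = threading.Lock()

    def work(chunk):
        for syllable in chunk:
            with lock:  # serializes the read-modify-write on the shared table
                total[syllable] += 1

    _run_threads(work, [(chunk,) for chunk in chunks])
    return total


def count_per_thread_then_merge(chunks):
    """Option 2: each thread owns a private counter; merge once at the end."""
    partials = [Counter() for _ in chunks]

    def work(chunk, local):
        for syllable in chunk:
            local[syllable] += 1  # no lock needed: no other thread touches `local`

    _run_threads(work, list(zip(chunks, partials)))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total
```

The second variant trades extra memory for zero contention, since the only shared step is the final single-threaded merge.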

Reimplement `libdatrie` in zig

https://github.com/tlwg/libdatrie

use MMAP to optimize data reading

https://github.com/Highload-fun/platform/wiki/How-to-use-MMAP-to-optimize-data-reading

STDIN is a file in RAM. That's why you can use STDIN_FILENO to get access to the file.
Now you can create a mapping to this file in the virtual address space of the calling process using the mmap function.

C++:

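    #include <unistd.h>
    #include <sys/mman.h>
    ...
    // Size of the input: seek to the end of stdin (fd 0).
    off_t fsize = lseek(STDIN_FILENO, 0, SEEK_END);
    // Map the whole input read-only; MAP_POPULATE (Linux-specific) pre-faults the pages.
    char* buffer = (char*)mmap(nullptr, fsize, PROT_READ, MAP_PRIVATE | MAP_POPULATE, STDIN_FILENO, 0);
    if (buffer == MAP_FAILED) { /* mapping failed, e.g. stdin is a pipe rather than a file */ }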

Also, the Huge Pages feature is available.
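The same idea can be tried from Python's standard `mmap` module. This sketch maps a regular file given by path rather than stdin, which is a simplification; the function name `count_lines_mmap` is made up, and counting newlines is just a stand-in for real processing.

```python
import mmap
import os


def count_lines_mmap(path: str) -> int:
    """Count newline bytes in a file by scanning a read-only memory mapping."""
    with open(path, "rb") as f:
        if f.seek(0, os.SEEK_END) == 0:
            return 0  # mmap cannot map an empty file
        # Length 0 means "map the whole file".
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count
```

The scan reads straight from the page cache through the mapping, so no intermediate read buffer is copied into the process.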

Reorder the enum indices in AmDau, AmGiua, AmCuoi, Tone so that the emitted SyllableID order is as close to alphabetical order as possible

To make the algorithm easier to implement, the enums above are grouped by property (length, closed sounds, ...). When emitting an ID, their positions need to be permuted into alphabetical order, and the permutation reverted when converting from an ID back to a Syllable.
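One way to do the swap is a pair of lookup tables per enum: one from internal index to alphabetical rank, and its inverse. The Python sketch below uses made-up, heavily shortened value lists (`INITIALS`, `FINALS`) and plain code-point ordering as a stand-in for Vietnamese alphabetical order; none of these names or values come from the repo.

```python
def build_permutation(names):
    """Return (to_alpha, from_alpha) tables for a list indexed by internal enum value."""
    from_alpha = sorted(range(len(names)), key=lambda i: names[i])  # rank -> internal idx
    to_alpha = [0] * len(names)
    for rank, internal_idx in enumerate(from_alpha):
        to_alpha[internal_idx] = rank                               # internal idx -> rank
    return to_alpha, from_alpha


# Hypothetical internal orders, grouped by length rather than alphabetically.
INITIALS = ["b", "c", "d", "ch", "gh", "kh", "ngh"]
FINALS = ["a", "i", "an", "am"]

I_TO_ALPHA, I_FROM_ALPHA = build_permutation(INITIALS)
F_TO_ALPHA, F_FROM_ALPHA = build_permutation(FINALS)


def encode(initial_idx: int, final_idx: int) -> int:
    """Internal enum indices -> ID whose numeric order follows alphabetical order."""
    return I_TO_ALPHA[initial_idx] * len(FINALS) + F_TO_ALPHA[final_idx]


def decode(syllable_id: int):
    """ID -> internal enum indices (reverts the permutation)."""
    initial_rank, final_rank = divmod(syllable_id, len(FINALS))
    return I_FROM_ALPHA[initial_rank], F_FROM_ALPHA[final_rank]
```

The internal enums stay grouped the way the parsing algorithm wants them; only the two table lookups at the ID boundary know about alphabetical order.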

Use pre-computed filters to quickly check whether a token is a Vietnamese syllable

Use 65 KB bit filters on the first 3 bytes or the last 3 bytes of a token to check whether it could be Vietnamese. Only if it passes is the token analyzed further.

For something more sophisticated, use an XorFilter to filter the whole token.

Or use a HashMap, DaTrie, ... to map a token string directly to a syllable_id.
