update README, and consider repository finished (at least for now)

2024-09-11 13:19:41 +01:00
parent db8fa8ae80
commit 0dda4dc974
15 changed files with 154 additions and 8 deletions

@@ -69,12 +69,80 @@ end
The reason for implementing a new trie specialised to strings rather than using Chris Okasaki's IntMap data structure is to enable prefix searching, where it is possible to get a list of all keys matching a certain prefix. The reason for implementing a new trie specialised to strings rather than using Chris Okasaki's IntMap data structure is to enable prefix searching, where it is possible to get a list of all keys matching a certain prefix.
# To-do # Benchmarks
- [x] Add `foldl`, `foldr`, `foldlWithPrefix`, `foldrWithPrefix` functions to string set There are a few benchmarks in the `bench` folder, comparing three operations (insertion, lookup and retrieval of keys matching a prefix).
- [ ] Benchmarks (possibly comparing to a set of strings in a balanced binary tree)
- [ ] Use unrolled linked list with zipper in trie, to limit size of vector as allocating large vectors repeatedly is expensive The two data structures compared include:
- [ ] Implement StringMap, containing both keys and values
- An implementation of 1-2 Brother Trees described by Ralf Hinze
- The compressed string tries implemented in this repository, not based on an existing paper
## Insertion
- `bench/insert-string-set`
- 247.5 milliseconds
- `bench/insert-bro-tree`
- 183.9 milliseconds
The `insertion` benchmarks insert every word from `bench/words.sml` into the respective data structure, in order.
StringSet is 1.3x slower than BroTree here.
## Exists
- `bench/build-exists-string-set`
- 48 milliseconds
- `bench/build-exists-bro-tree`
- 16 milliseconds
These benchmarks involve:
- Inserting every word from `bench/words.sml` to build a data structure with keys to look for
- Then testing to see if every key from `bench/words.sml` exists in the data structure
The reported times only measure the time taken for the second bullet point; the first bullet point was already measured in the `insertion` benchmark.
StringSet is 3x slower than BroTree here.
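The two-phase measurement described above can be sketched roughly as follows. This is only an illustration: `timeMillis` is a hypothetical helper that does not exist in the repository, and a plain list stands in for `StringSet`/`BroTree` so that the snippet is self-contained.

```sml
(* Hypothetical helper (not in the repository): runs f and reports
   how long that call alone took, in milliseconds. *)
fun timeMillis (f : unit -> 'a) : 'a * LargeInt.int =
  let
    val startTime = Time.now ()
    val result = f ()
    val finishTime = Time.now ()
  in
    (result, Time.toMilliseconds (Time.- (finishTime, startTime)))
  end

(* Phase 1 (untimed): build the structure.
   A plain list is used here as a stand-in for the real set. *)
val keys = Vector.fromList ["apple", "avocado", "banana"]
val built = Vector.foldl (fn (k, acc) => k :: acc) [] keys

(* Phase 2 (timed): check that every key exists in the structure. *)
val (allFound, millis) =
  timeMillis (fn () =>
    Vector.all (fn k => List.exists (fn x => x = k) built) keys)

val () = print (LargeInt.toString millis ^ " ms\n")
```

Because the clock starts only after the structure has been built, the construction cost does not leak into the lookup figure.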
## Get prefix list
- `bench/build-get-prefix-string-set`
- 310,000 nanoseconds
- `bench/build-get-prefix-bro-tree`
- 3,477,000 nanoseconds
These benchmarks involve:
- Inserting every word from `bench/words.sml` to build a data structure with keys to look for
- Creating a list containing every word in the data structure that starts with "a"
As with the `exists` benchmark, only the time for the second bullet point is measured.
StringSet is 11x faster than BroTree here.
This result shouldn't be a surprise.
A binary tree needs to fold over every node in the tree, checking whether the key in each node starts with the prefix. That takes O(n) time.
A trie is smarter about this. It only needs to travel to a specific prefix and get the subtrie for that prefix. Then one can fold over the subtrie rather than the whole trie, which takes much less time.
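The idea of descending to the subtrie first and folding only over that subtrie can be sketched as below. This is a deliberately simple, uncompressed trie written for illustration; it is not the compressed representation used by `StringSet` in this repository, and all names here (`trie`, `descend`, `collect`) are assumptions made for the example.

```sml
(* Illustrative uncompressed trie: a flag saying whether the path to
   this node is a stored key, plus one child per next character. *)
datatype trie = Node of bool * (char * trie) list

val empty = Node (false, [])

fun insert (key, t) =
  let
    fun go (pos, Node (isKey, children)) =
      if pos = String.size key then
        Node (true, children)
      else
        let
          val c = String.sub (key, pos)
          val child =
            case List.find (fn (c', _) => c' = c) children of
              SOME (_, sub) => sub
            | NONE => empty
          val others = List.filter (fn (c', _) => c' <> c) children
        in
          Node (isKey, (c, go (pos + 1, child)) :: others)
        end
  in
    go (0, t)
  end

(* Follow the prefix one character at a time.
   Returns NONE when no stored key starts with the prefix. *)
fun descend (prefix, t) =
  let
    fun go (pos, node as Node (_, children)) =
      if pos = String.size prefix then
        SOME node
      else
        case List.find (fn (c, _) => c = String.sub (prefix, pos)) children of
          SOME (_, sub) => go (pos + 1, sub)
        | NONE => NONE
  in
    go (0, t)
  end

(* Collect every key below a node. path holds the characters
   leading to the node, in reverse order. *)
fun collect (path, Node (isKey, children), acc) =
  let
    val acc =
      List.foldl (fn ((c, sub), a) => collect (c :: path, sub, a)) acc children
  in
    if isKey then String.implode (List.rev path) :: acc else acc
  end

(* Only the subtrie under the prefix is visited, not the whole trie. *)
fun getPrefixList (prefix, t) =
  case descend (prefix, t) of
    SOME sub => collect (List.rev (String.explode prefix), sub, [])
  | NONE => []

val t = List.foldl insert empty ["apple", "avocado", "banana"]
val aWords = getPrefixList ("a", t)
```

Here `getPrefixList ("a", t)` never touches the nodes that spell out `"banana"`, whereas a fold over a binary tree would have to test that key against the prefix as well.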
## Benchmarks conclusion
The benchmarks have a clear similarity to those in [Chris Okasaki's paper on Fast Mergeable Integer Maps](https://ia600204.us.archive.org/0/items/djoyner-papers/SHA256E-s118221--efee082ebebce89bebdbc041ab9bf8cbd2bcb91e48809a204318e1a89bf15435.pdf).
- The insertion and lookup/exists operations are both faster on balanced binary trees
- The trie-specific operation (in this repository: search by prefix, in the paper: merge tries together) is much faster for tries than for binary trees.
As the paper says, a trie is probably worth using only if you rely heavily on the trie-specific operation.
The description of Data.IntMap for Haskell seems to disagree with the first bullet point, stating:
> my benchmarks show that it is also (much) faster on insertions and deletions when compared to a generic size-balanced map implementation (see Data.Map).
This statement surprises me. It's not the case that IntMap was faster for insertion and lookup in the aforementioned paper, and an IntMap implementation I coded in F# was also slower for these operations.
I would be interested to know whether these operations really are faster in Haskell. Lazy evaluation might help somehow, or the Haskell implementation might use tricks not described in the paper.
# Credits # Credits

@@ -78,9 +78,25 @@ struct
end end
| _ => raise Match | _ => raise Match
fun helpStartsWith (pos, prefix, key) =
if pos = String.size prefix then
true
else
let
val prefixChr = String.sub (prefix, pos)
val keyChr = String.sub (key, pos)
in
if keyChr = prefixChr then helpStartsWith (pos + 1, prefix, key)
else false
end
fun startsWith (prefix, key) =
if String.size prefix > String.size key then false
else helpStartsWith (0, prefix, key)
fun getPrefixList (prefix, tree) = fun getPrefixList (prefix, tree) =
foldr foldr
( (fn (k, acc) => if String.isSubstring prefix k then k :: acc else acc) ( (fn (k, acc) => if startsWith (prefix, k) then k :: acc else acc)
, [] , []
, tree , tree
) )

BIN
bench/build-exists-bro-tree Executable file

Binary file not shown.

@@ -23,7 +23,8 @@ struct
val finishTime = Time.now () val finishTime = Time.now ()
val searchDuration = Time.- (finishTime, startTime) val searchDuration = Time.- (finishTime, startTime)
val searchDuration = Time.toString searchDuration ^ "\n" val searchDuration = Time.toMilliseconds searchDuration
val searchDuration = LargeInt.toString searchDuration ^ "\n"
in in
print searchDuration print searchDuration
end end

BIN
bench/build-exists-string-set Executable file

Binary file not shown.

@@ -24,7 +24,8 @@ struct
val finishTime = Time.now () val finishTime = Time.now ()
val searchDuration = Time.- (finishTime, startTime) val searchDuration = Time.- (finishTime, startTime)
val searchDuration = Time.toString searchDuration ^ "\n" val searchDuration = Time.toMilliseconds searchDuration
val searchDuration = LargeInt.toString searchDuration ^ "\n"
in in
print searchDuration print searchDuration
end end

BIN
bench/build-get-prefix-bro-tree Executable file

Binary file not shown.

@@ -0,0 +1,10 @@
$(SML_LIB)/basis/basis.mlb
ann
"allowVectorExps true"
in
words.sml
end
bro-tree.sml
build-get-prefix-bro-tree.sml

@@ -0,0 +1,20 @@
structure BuildGetPrefixBroTree =
struct
fun main () =
let
val endTrie =
Vector.foldl BroTree.insert BroTree.empty WordsList.words
val startTime = Time.now ()
val lst = BroTree.getPrefixList ("a", endTrie)
val finishTime = Time.now ()
val searchDuration = Time.- (finishTime, startTime)
val searchDuration = Time.toNanoseconds searchDuration
val searchDuration = LargeInt.toString searchDuration ^ " ns\n"
in
print searchDuration
end
end
val _ = BuildGetPrefixBroTree.main ()

BIN
bench/build-get-prefix-string-set Executable file

Binary file not shown.

@@ -0,0 +1,10 @@
$(SML_LIB)/basis/basis.mlb
ann
"allowVectorExps true"
in
words.sml
end
../src/string-set.sml
build-get-prefix-string-set.sml

@@ -0,0 +1,20 @@
structure BuildGetPrefixStringSet =
struct
fun main () =
let
val endTrie =
Vector.foldl StringSet.insert StringSet.empty WordsList.words
val startTime = Time.now ()
val lst = StringSet.getPrefixList ("a", endTrie)
val finishTime = Time.now ()
val searchDuration = Time.- (finishTime, startTime)
val searchDuration = Time.toNanoseconds searchDuration
val searchDuration = LargeInt.toString searchDuration ^ " ns\n"
in
print searchDuration
end
end
val _ = BuildGetPrefixStringSet.main ()

BIN
bench/conv-words Executable file

Binary file not shown.

BIN
bench/insert-bro-tree Executable file

Binary file not shown.

BIN
bench/insert-string-set Executable file

Binary file not shown.