diff --git a/README.md b/README.md
index 1d98b9f..d0de4bd 100644
--- a/README.md
+++ b/README.md
@@ -69,12 +69,80 @@ end
 
 The reason for implementing a new trie specialised to strings rather than using Chris Okasaki's IntMap data structure is to enable prefix searching, where it is possible to get a list of all keys matching a certain prefix.
 
-# To-do
+# Benchmarks
 
-- [x] Add `foldl`, `foldr`, `foldlWithPrefix`, `foldrWithPrefix` functions to string set
-- [ ] Benchmarks (possibly comparing to a set of strings in a balanced binary tree)
-- [ ] Use unrolled linked list with zipper in trie, to limit size of vector as allocating large vectors repeatedly is expensive
-- [ ] Implement StringMap, containing both keys and values
+There are a few benchmarks in the `bench` folder, comparing three operations (insertion, lookup and retrieval of keys matching a prefix).
+
+The two data structures compared are:
+
+- An implementation of 1-2 Brother Trees described by Ralf Hinze
+- The compressed string tries implemented in this repository, not based on an existing paper
+
+## Insertion
+
+- `bench/insert-string-set`
+  - 247.5 milliseconds
+- `bench/insert-bro-tree`
+  - 183.9 milliseconds
+
+The `insertion` benchmarks insert every word from `bench/words.sml` into the respective data structure, in order.
+
+StringSet is 1.3x slower than BroTree here.
+
+## Exists
+
+- `bench/build-exists-string-set`
+  - 48 milliseconds
+- `bench/build-exists-bro-tree`
+  - 16 milliseconds
+
+These benchmarks involve:
+
+- Inserting every word from `bench/words.sml` to build a data structure with keys to look for
+- Then testing to see if every key from `bench/words.sml` exists in the data structure
+
+The reported times only measure the time taken for the second bullet point; the first bullet point was already measured in the `insertion` benchmark.
+
+StringSet is 3x slower than BroTree here.
+
+## Get prefix list
+
+- `bench/build-get-prefix-string-set`
+  - 310,000 nanoseconds
+- `bench/build-get-prefix-bro-tree`
+  - 3,477,000 nanoseconds
+
+These benchmarks involve:
+
+- Inserting every word from `bench/words.sml` to build a data structure with keys to look for
+- Creating a list containing every word in the data structure that starts with "a"
+
+As with the `exists` benchmark, only the time for the second bullet point is measured.
+
+StringSet is 11x faster than BroTree here.
+
+This result shouldn't be a surprise.
+
+The binary tree implementation benchmarked here folds over every node in the tree, checking whether each node's key starts with the prefix. That takes O(n) time.
+
+A trie is smarter about this. It only needs to travel to a specific prefix and get the subtrie for that prefix. Then one can fold over the subtrie rather than the whole trie, which takes much less time.
+
+## Benchmarks conclusion
+
+The benchmarks have a clear similarity to those in [Chris Okasaki's paper on Fast Mergeable Integer Maps](https://ia600204.us.archive.org/0/items/djoyner-papers/SHA256E-s118221--efee082ebebce89bebdbc041ab9bf8cbd2bcb91e48809a204318e1a89bf15435.pdf).
+
+- The insertion and lookup/exists operations are both faster on balanced binary trees
+- The trie-specific operation (in this repository: search by prefix, in the paper: merge tries together) is much faster for tries than for binary trees.
+
+Like the paper says, it's probably worth using a trie only if you care about using the trie-specific operation a lot.
+
+The description of Data.IntMap for Haskell seems to disagree with the first bullet point, stating:
+
+> my benchmarks show that it is also (much) faster on insertions and deletions when compared to a generic size-balanced map implementation (see Data.Map).
+
+This statement surprises me. It's not the case that IntMap was faster for insertion and lookup in the aforementioned paper, and an IntMap implementation I coded in F# was also slower for these operations.
+
+I would be interested in whether it is true for Haskell that these operations were faster. Lazy evaluation might help somehow, or the Haskell implementation might use tricks not described in the paper.
 
 # Credits
 
diff --git a/bench/bro-tree.sml b/bench/bro-tree.sml
index 55bafec..c86595d 100644
--- a/bench/bro-tree.sml
+++ b/bench/bro-tree.sml
@@ -78,9 +78,25 @@ struct
         end
     | _ => raise Match
 
+  fun helpStartsWith (pos, prefix, key) =
+    if pos = String.size prefix then
+      true
+    else
+      let
+        val prefixChr = String.sub (prefix, pos)
+        val keyChr = String.sub (key, pos)
+      in
+        if keyChr = prefixChr then helpStartsWith (pos + 1, prefix, key)
+        else false
+      end
+
+  fun startsWith (prefix, key) =
+    if String.size prefix > String.size key then false
+    else helpStartsWith (0, prefix, key)
+
   fun getPrefixList (prefix, tree) =
     foldr
-      ( (fn (k, acc) => if String.isSubstring prefix k then k :: acc else acc)
+      ( (fn (k, acc) => if startsWith (prefix, k) then k :: acc else acc)
       , []
       , tree
       )
diff --git a/bench/build-exists-bro-tree b/bench/build-exists-bro-tree
new file mode 100755
index 0000000..1504273
Binary files /dev/null and b/bench/build-exists-bro-tree differ
diff --git a/bench/build-exists-bro-tree.sml b/bench/build-exists-bro-tree.sml
index 8a23b5c..786d20c 100644
--- a/bench/build-exists-bro-tree.sml
+++ b/bench/build-exists-bro-tree.sml
@@ -23,7 +23,8 @@ struct
       val finishTime = Time.now ()
 
       val searchDuration = Time.- (finishTime, startTime)
-      val searchDuration = Time.toString searchDuration ^ "\n"
+      val searchDuration = Time.toMilliseconds searchDuration
+      val searchDuration = LargeInt.toString searchDuration ^ "\n"
     in
       print searchDuration
     end
diff --git a/bench/build-exists-string-set b/bench/build-exists-string-set
new file mode 100755
index 0000000..9a00ba8
Binary files /dev/null and b/bench/build-exists-string-set differ
diff --git a/bench/build-exists-string-set.sml b/bench/build-exists-string-set.sml
index aab17e8..6351d86 100644
--- a/bench/build-exists-string-set.sml
+++ b/bench/build-exists-string-set.sml
@@ -24,7 +24,8 @@ struct
       val finishTime = Time.now ()
 
       val searchDuration = Time.- (finishTime, startTime)
-      val searchDuration = Time.toString searchDuration ^ "\n"
+      val searchDuration = Time.toMilliseconds searchDuration
+      val searchDuration = LargeInt.toString searchDuration ^ "\n"
     in
       print searchDuration
     end
diff --git a/bench/build-get-prefix-bro-tree b/bench/build-get-prefix-bro-tree
new file mode 100755
index 0000000..8a1775f
Binary files /dev/null and b/bench/build-get-prefix-bro-tree differ
diff --git a/bench/build-get-prefix-bro-tree.mlb b/bench/build-get-prefix-bro-tree.mlb
new file mode 100644
index 0000000..28e1ebd
--- /dev/null
+++ b/bench/build-get-prefix-bro-tree.mlb
@@ -0,0 +1,10 @@
+$(SML_LIB)/basis/basis.mlb
+
+ann
+  "allowVectorExps true"
+in
+  words.sml
+end
+
+bro-tree.sml
+build-get-prefix-bro-tree.sml
diff --git a/bench/build-get-prefix-bro-tree.sml b/bench/build-get-prefix-bro-tree.sml
new file mode 100644
index 0000000..4bae726
--- /dev/null
+++ b/bench/build-get-prefix-bro-tree.sml
@@ -0,0 +1,20 @@
+structure BuildGetPrefixBroTree =
+struct
+  fun main () =
+    let
+      val endTrie =
+        Vector.foldl BroTree.insert BroTree.empty WordsList.words
+
+      val startTime = Time.now ()
+      val lst = BroTree.getPrefixList ("a", endTrie)
+      val finishTime = Time.now ()
+
+      val searchDuration = Time.- (finishTime, startTime)
+      val searchDuration = Time.toNanoseconds searchDuration
+      val searchDuration = LargeInt.toString searchDuration ^ " ns\n"
+    in
+      print searchDuration
+    end
+end
+
+val _ = BuildGetPrefixBroTree.main ()
diff --git a/bench/build-get-prefix-string-set b/bench/build-get-prefix-string-set
new file mode 100755
index 0000000..f15b4ac
Binary files /dev/null and b/bench/build-get-prefix-string-set differ
diff --git a/bench/build-get-prefix-string-set.mlb b/bench/build-get-prefix-string-set.mlb
new file mode 100644
index 0000000..d3ea5e8
--- /dev/null
+++ b/bench/build-get-prefix-string-set.mlb
@@ -0,0 +1,10 @@
+$(SML_LIB)/basis/basis.mlb
+
+ann
+  "allowVectorExps true"
+in
+  words.sml
+end
+
+../src/string-set.sml
+build-get-prefix-string-set.sml
diff --git a/bench/build-get-prefix-string-set.sml b/bench/build-get-prefix-string-set.sml
new file mode 100644
index 0000000..c97572e
--- /dev/null
+++ b/bench/build-get-prefix-string-set.sml
@@ -0,0 +1,20 @@
+structure BuildGetPrefixStringSet =
+struct
+  fun main () =
+    let
+      val endTrie =
+        Vector.foldl StringSet.insert StringSet.empty WordsList.words
+
+      val startTime = Time.now ()
+      val lst = StringSet.getPrefixList ("a", endTrie)
+      val finishTime = Time.now ()
+
+      val searchDuration = Time.- (finishTime, startTime)
+      val searchDuration = Time.toNanoseconds searchDuration
+      val searchDuration = LargeInt.toString searchDuration ^ " ns\n"
+    in
+      print searchDuration
+    end
+end
+
+val _ = BuildGetPrefixStringSet.main ()
diff --git a/bench/conv-words b/bench/conv-words
new file mode 100755
index 0000000..ad69a89
Binary files /dev/null and b/bench/conv-words differ
diff --git a/bench/insert-bro-tree b/bench/insert-bro-tree
new file mode 100755
index 0000000..0994a27
Binary files /dev/null and b/bench/insert-bro-tree differ
diff --git a/bench/insert-string-set b/bench/insert-string-set
new file mode 100755
index 0000000..083f756
Binary files /dev/null and b/bench/insert-string-set differ