Add 'string-tries-sml/' from commit 'd056e08ce768e014ab409c7f63e8fd0adfc1dff2'
git-subtree-dir: string-tries-sml git-subtree-mainline:dba78da7ecgit-subtree-split:d056e08ce7
This commit is contained in:
5
string-tries-sml/LICENSE
Normal file
5
string-tries-sml/LICENSE
Normal file
@@ -0,0 +1,5 @@
|
|||||||
|
Copyright (C) 2024 by Humza Shahid <humzasaur@gmail.com>
|
||||||
|
|
||||||
|
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.
|
||||||
|
|
||||||
|
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
|
||||||
150
string-tries-sml/README.md
Normal file
150
string-tries-sml/README.md
Normal file
@@ -0,0 +1,150 @@
|
|||||||
|
# string-trie
|
||||||
|
|
||||||
|
This repository implements a set over strings in Standard ML using a trie/prefix tree.
|
||||||
|
|
||||||
|
The signature provided is:
|
||||||
|
|
||||||
|
```
|
||||||
|
signature STRING_SET =
|
||||||
|
sig
|
||||||
|
signature STRING_SET =
|
||||||
|
sig
|
||||||
|
(* the type of tries *)
|
||||||
|
type t
|
||||||
|
|
||||||
|
(* the empty trie *)
|
||||||
|
val empty: t
|
||||||
|
|
||||||
|
(* StringSet.isEmpty trie
|
||||||
|
* returns true if the trie is empty *)
|
||||||
|
val isEmpty: t -> bool
|
||||||
|
|
||||||
|
(* StringSet.fromString "hello world"
|
||||||
|
* creates a trie containing just a string *)
|
||||||
|
val fromString: string -> t
|
||||||
|
|
||||||
|
(* StringSet.exists ("hello world", trie)
|
||||||
|
* returns true if the key was inserted into the trie *)
|
||||||
|
val exists: string * t -> bool
|
||||||
|
|
||||||
|
(* StringSet.insert ("myNewString", trie)
|
||||||
|
* inserts a new string into the trie, returning a new trie *)
|
||||||
|
val insert: string * t -> t
|
||||||
|
|
||||||
|
(* StringSet.remove ("stringToRemove", trie)
|
||||||
|
* removes the key from the trie, returning a new trie *)
|
||||||
|
val remove: string * t -> t
|
||||||
|
|
||||||
|
(* StringSet.getPrefixList ("myPrefix", trie)
|
||||||
|
* returns a list of all keys matching the specified prefix *)
|
||||||
|
val getPrefixList: string * t -> string list
|
||||||
|
|
||||||
|
(* StringSet.toList trie
|
||||||
|
* returns a list containing all keys in the trie *)
|
||||||
|
val toList: t -> string list
|
||||||
|
|
||||||
|
(* StringSet.fromList ["hello", "world"]
|
||||||
|
* returns a trie containing all keys in the string list *)
|
||||||
|
val fromList: string list -> t
|
||||||
|
|
||||||
|
(* StringSet.foldl (fn (key, acc) => String.size key + acc) 0 trie
|
||||||
|
* folds a value through the trie, from lowest to highest. *)
|
||||||
|
val foldl: (string * 'b -> 'b) -> 'b -> t -> 'b
|
||||||
|
|
||||||
|
(* StringSet.foldlWithPrefix (fn (key, acc) => String.size key + acc) 0 trie "myPrefix"
|
||||||
|
* folds a value through a subset of the trie containing the specified prefix,
|
||||||
|
* from lowest to highest. *)
|
||||||
|
val foldlWithPrefix: (string * 'b -> 'b) -> 'b -> t -> string -> 'b
|
||||||
|
|
||||||
|
(* StringSet.foldr (fn (key, acc) => String.size key + acc) 0 trie
|
||||||
|
* folds a value through the trie, from highest to lowest. *)
|
||||||
|
val foldr: (string * 'b -> 'b) -> 'b -> t -> 'b
|
||||||
|
|
||||||
|
(* StringSet.foldrWithPrefix (fn (key, acc) => String.size key + acc) 0 trie "myPrefix"
|
||||||
|
* folds a value through a subset of the trie containing the specified prefix,
|
||||||
|
* from highest to lowest. *)
|
||||||
|
val foldrWithPrefix: (string * 'b -> 'b) -> 'b -> t -> string -> 'b
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
The reason for implementing a new trie specialised to strings rather than using Chris Okasaki's IntMap data structure is to enable prefix searching, where it is possible to get a list of all keys matching a certain prefix.
|
||||||
|
|
||||||
|
# Benchmarks
|
||||||
|
|
||||||
|
There are a few benchmarks in the `bench` folder, comparing three operations (insertion, lookup and retrieval of keys matching a prefix).
|
||||||
|
|
||||||
|
The two data structures compared include:
|
||||||
|
|
||||||
|
- An implementation of 1-2 Brother Trees described by Ralf Hinze
|
||||||
|
- The compressed string tries implemented in this repository, not based on an existing paper
|
||||||
|
|
||||||
|
## Insertion
|
||||||
|
|
||||||
|
- `bench/insert-string-set`
|
||||||
|
- 247.5 milliseconds
|
||||||
|
- `bench/insert-bro-tree`
|
||||||
|
- 183.9 milliseconds
|
||||||
|
|
||||||
|
The `insertion` benchmarks inserting every word from `bench/words.sml` into the respective data structure, in order.
|
||||||
|
|
||||||
|
StringSet is 1.3x slower than BroTree here.
|
||||||
|
|
||||||
|
## Exists
|
||||||
|
|
||||||
|
- `bench/build-exists-string-set`
|
||||||
|
- 48 milliseconds
|
||||||
|
- `bench/build-exists-bro-tree`
|
||||||
|
- 16 milliseconds
|
||||||
|
|
||||||
|
These benchmarks involve:
|
||||||
|
|
||||||
|
- Inserting every word from `bench/words.sml` to build a data structure with keys to look for
|
||||||
|
- Then testing to see if every key from `bench/words.sml` exists in the data structure
|
||||||
|
|
||||||
|
The reported times only measure the time taken for the second bullet point; the first bullet point was already measured in the `insertion` benchmark.
|
||||||
|
|
||||||
|
StringSet is 3x slower than BroTree here.
|
||||||
|
|
||||||
|
## Get prefix list
|
||||||
|
|
||||||
|
- `bench/build-get-prefix-string-set`
|
||||||
|
- 310,000 nanoseconds
|
||||||
|
- `bench/build-get-prefix-bro-tree`
|
||||||
|
- 3,477,000 nanoseconds
|
||||||
|
|
||||||
|
These benchmarks involve:
|
||||||
|
|
||||||
|
- Inserting every word from `bench/words.sml` to build a data structure with keys to look for
|
||||||
|
- Creating a list containing every word in the data structure that starts with "a"
|
||||||
|
|
||||||
|
As with the `exists` benchmark, only the time for the second bullet point is measured.
|
||||||
|
|
||||||
|
StringSet is 11x faster than BroTree here.
|
||||||
|
|
||||||
|
This result shouldn't be a surprise.
|
||||||
|
|
||||||
|
A binary tree needs to fold over every node in the tree, checking if the keys in node starts with the prefix. That takes O(n) time.
|
||||||
|
|
||||||
|
A trie is smarter about this. It only needs to travel to a specific prefix and get the subtrie for that prefix. Then one can fold over the subtrie rather than the whole trie, which takes much less time.
|
||||||
|
|
||||||
|
## Benchmarks conclusion
|
||||||
|
|
||||||
|
The benchmarks have a clear similarity to those in [Chris Okasaki's paper on Fast Mergeable Integer Maps](https://ia600204.us.archive.org/0/items/djoyner-papers/SHA256E-s118221--efee082ebebce89bebdbc041ab9bf8cbd2bcb91e48809a204318e1a89bf15435.pdf).
|
||||||
|
|
||||||
|
- The insertion and lookup/exists operations are both faster on balanced binary trees
|
||||||
|
- The trie-specific operation (in this repository: search by prefix, in the paper: merge tries together) is much faster for tries than for binary trees.
|
||||||
|
|
||||||
|
Like the paper says, it's probably worth using a trie only if you care about using the trie-specific operation a lot.
|
||||||
|
|
||||||
|
The description of Data.IntMap for Haskell seems to disagree with the first bullet point, stating:
|
||||||
|
|
||||||
|
> my benchmarks show that it is also (much) faster on insertions and deletions when compared to a generic size-balanced map implementation (see Data.Map).
|
||||||
|
|
||||||
|
This statement surprises me. It's not the case that IntMap was faster for insertion and lookup in the aforementioned paper, and an IntMap implementation I coded in F# was also slower for these operations.
|
||||||
|
|
||||||
|
I would be interested in whether it is true for Haskell that these operations were faster. Lazy evaluation might help somehow, or the Haskell implementation might use tricks not described in the paper.
|
||||||
|
|
||||||
|
# Credits
|
||||||
|
|
||||||
|
- The tests in `tests/string-set-tests.sml` were ported from [kpol's Trie data structure in C3](https://github.com/kpol/trie).
|
||||||
|
- The words.txt dataset in `bench/words.txt` is from [this repository](https://github.com/dwyl/english-words).
|
||||||
103
string-tries-sml/bench/bro-tree.sml
Normal file
103
string-tries-sml/bench/bro-tree.sml
Normal file
@@ -0,0 +1,103 @@
|
|||||||
|
structure BroTree =
|
||||||
|
struct
|
||||||
|
(* implementation of 1-2 brother tree ported from:
|
||||||
|
* https://www.cl.cam.ac.uk/research/hvg/Isabelle/dist/library/HOL/HOL-Data_Structures/document.pdf
|
||||||
|
* *)
|
||||||
|
datatype bro =
|
||||||
|
N0
|
||||||
|
| N1 of bro
|
||||||
|
| N2 of bro * string * bro
|
||||||
|
| L2 of string
|
||||||
|
| N3 of bro * string * bro * string * bro
|
||||||
|
|
||||||
|
val empty = N0
|
||||||
|
|
||||||
|
fun n1 bro =
|
||||||
|
case bro of
|
||||||
|
L2 str => N2 (N0, str, N0)
|
||||||
|
| N3 (t1, a1, t2, a2, t3) => N2 (N2 (t1, a1, t2), a2, N1 t3)
|
||||||
|
| t => N1 t
|
||||||
|
|
||||||
|
fun n2Left (t1, str, t2) =
|
||||||
|
case (t1, str, t2) of
|
||||||
|
(L2 a1, a2, t) => N3 (N0, a2, N0, a2, t)
|
||||||
|
| (N3 (t1, a1, t2, a2, t3), a3, N1 t4) =>
|
||||||
|
N2 (N2 (t1, a1, t2), a2, N2 (t2, a3, t4))
|
||||||
|
| (N3 (t1, a1, t2, a2, t3), a3, t4) =>
|
||||||
|
N3 (N2 (t1, a1, t2), a2, N1 t3, a3, t4)
|
||||||
|
| (t1, a1, t2) => N2 (t1, a1, t2)
|
||||||
|
|
||||||
|
fun n2Right (t1, str, t2) =
|
||||||
|
case (t1, str, t2) of
|
||||||
|
(t1, a1, L2 a2) => N3 (t1, a2, N0, a2, N0)
|
||||||
|
| (N1 t1, a1, N3 (t2, a2, t3, a3, t4)) =>
|
||||||
|
N2 (N2 (t1, a1, t2), a2, N2 (t3, a3, t4))
|
||||||
|
| (t1, a1, N3 (t2, a2, t3, a3, t4)) =>
|
||||||
|
N3 (t1, a2, N1 t2, a2, N2 (t3, a3, t4))
|
||||||
|
| (t1, a1, t2) => N2 (t1, a1, t2)
|
||||||
|
|
||||||
|
fun ins (str, tree) =
|
||||||
|
case tree of
|
||||||
|
N0 => L2 str
|
||||||
|
| N1 t => n1 (ins (str, t))
|
||||||
|
| N2 (l, a, r) =>
|
||||||
|
if str < a then n2Left (ins (str, l), a, r)
|
||||||
|
else if str > a then n2Right (l, a, ins (str, r))
|
||||||
|
else N2 (l, a, r)
|
||||||
|
| _ => raise Match (*impossible case*)
|
||||||
|
|
||||||
|
fun insRoot tree =
|
||||||
|
case tree of
|
||||||
|
L2 str => N2 (N0, str, N0)
|
||||||
|
| N3 (t1, a1, t2, a2, t3) => N2 (N2 (t1, a2, t2), a2, N1 t3)
|
||||||
|
| tree => tree
|
||||||
|
|
||||||
|
fun insert (str, tree) =
|
||||||
|
insRoot (ins (str, tree))
|
||||||
|
|
||||||
|
fun exists (str, tree) =
|
||||||
|
case tree of
|
||||||
|
N0 => false
|
||||||
|
| N1 t => exists (str, t)
|
||||||
|
| N2 (l, k, r) =>
|
||||||
|
if str < k then exists (str, l)
|
||||||
|
else if str > k then exists (str, r)
|
||||||
|
else true
|
||||||
|
| _ => raise Match
|
||||||
|
|
||||||
|
fun foldr (f, acc, tree: bro) =
|
||||||
|
case tree of
|
||||||
|
N0 => acc
|
||||||
|
| N1 t => foldr (f, acc, t)
|
||||||
|
| N2 (l, k, r) =>
|
||||||
|
let
|
||||||
|
val acc = foldr (f, acc, r)
|
||||||
|
val acc = f (k, acc)
|
||||||
|
in
|
||||||
|
foldr (f, acc, l)
|
||||||
|
end
|
||||||
|
| _ => raise Match
|
||||||
|
|
||||||
|
fun helpStartsWith (pos, prefix, key) =
|
||||||
|
if pos = String.size prefix then
|
||||||
|
true
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val prefixChr = String.sub (prefix, pos)
|
||||||
|
val keyChr = String.sub (key, pos)
|
||||||
|
in
|
||||||
|
if keyChr = prefixChr then helpStartsWith (pos + 1, prefix, key)
|
||||||
|
else false
|
||||||
|
end
|
||||||
|
|
||||||
|
fun startsWith (prefix, key) =
|
||||||
|
if String.size prefix > String.size key then false
|
||||||
|
else helpStartsWith (0, prefix, key)
|
||||||
|
|
||||||
|
fun getPrefixList (prefix, tree) =
|
||||||
|
foldr
|
||||||
|
( (fn (k, acc) => if startsWith (prefix, k) then k :: acc else acc)
|
||||||
|
, []
|
||||||
|
, tree
|
||||||
|
)
|
||||||
|
end
|
||||||
BIN
string-tries-sml/bench/build-exists-bro-tree
Executable file
BIN
string-tries-sml/bench/build-exists-bro-tree
Executable file
Binary file not shown.
10
string-tries-sml/bench/build-exists-bro-tree.mlb
Normal file
10
string-tries-sml/bench/build-exists-bro-tree.mlb
Normal file
@@ -0,0 +1,10 @@
|
|||||||
|
$(SML_LIB)/basis/basis.mlb
|
||||||
|
|
||||||
|
ann
|
||||||
|
"allowVectorExps true"
|
||||||
|
in
|
||||||
|
words.sml
|
||||||
|
end
|
||||||
|
|
||||||
|
bro-tree.sml
|
||||||
|
build-exists-bro-tree.sml
|
||||||
33
string-tries-sml/bench/build-exists-bro-tree.sml
Normal file
33
string-tries-sml/bench/build-exists-bro-tree.sml
Normal file
@@ -0,0 +1,33 @@
|
|||||||
|
structure BuildExistsBroTree =
|
||||||
|
struct
|
||||||
|
fun helpExists (pos, tree, acc) =
|
||||||
|
if pos = Vector.length WordsList.words then
|
||||||
|
acc
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val word = Vector.sub (WordsList.words, pos)
|
||||||
|
val newAcc = BroTree.exists (word, tree)
|
||||||
|
val acc = newAcc orelse acc
|
||||||
|
in
|
||||||
|
helpExists (pos + 1, tree, acc)
|
||||||
|
end
|
||||||
|
|
||||||
|
fun exists (tree) = helpExists (0, tree, true)
|
||||||
|
|
||||||
|
fun main () =
|
||||||
|
let
|
||||||
|
val endTree = Vector.foldl BroTree.insert BroTree.empty WordsList.words
|
||||||
|
|
||||||
|
val startTime = Time.now ()
|
||||||
|
val wordsExist = exists endTree
|
||||||
|
val finishTime = Time.now ()
|
||||||
|
|
||||||
|
val searchDuration = Time.- (finishTime, startTime)
|
||||||
|
val searchDuration = Time.toMilliseconds searchDuration
|
||||||
|
val searchDuration = LargeInt.toString searchDuration ^ "\n"
|
||||||
|
in
|
||||||
|
print searchDuration
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
val _ = BuildExistsBroTree.main ()
|
||||||
BIN
string-tries-sml/bench/build-exists-string-set
Executable file
BIN
string-tries-sml/bench/build-exists-string-set
Executable file
Binary file not shown.
10
string-tries-sml/bench/build-exists-string-set.mlb
Normal file
10
string-tries-sml/bench/build-exists-string-set.mlb
Normal file
@@ -0,0 +1,10 @@
|
|||||||
|
$(SML_LIB)/basis/basis.mlb
|
||||||
|
|
||||||
|
ann
|
||||||
|
"allowVectorExps true"
|
||||||
|
in
|
||||||
|
words.sml
|
||||||
|
end
|
||||||
|
|
||||||
|
../src/string-set.sml
|
||||||
|
build-exists-string-set.sml
|
||||||
34
string-tries-sml/bench/build-exists-string-set.sml
Normal file
34
string-tries-sml/bench/build-exists-string-set.sml
Normal file
@@ -0,0 +1,34 @@
|
|||||||
|
structure BuildExistsStringSet =
|
||||||
|
struct
|
||||||
|
fun helpExists (pos, trie, acc) =
|
||||||
|
if pos = Vector.length WordsList.words then
|
||||||
|
acc
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val word = Vector.sub (WordsList.words, pos)
|
||||||
|
val newAcc = StringSet.exists (word, trie)
|
||||||
|
val acc = newAcc orelse acc
|
||||||
|
in
|
||||||
|
helpExists (pos + 1, trie, acc)
|
||||||
|
end
|
||||||
|
|
||||||
|
fun exists trie = helpExists (0, trie, true)
|
||||||
|
|
||||||
|
fun main () =
|
||||||
|
let
|
||||||
|
val endTrie =
|
||||||
|
Vector.foldl StringSet.insert StringSet.empty WordsList.words
|
||||||
|
|
||||||
|
val startTime = Time.now ()
|
||||||
|
val wordsExist = exists endTrie
|
||||||
|
val finishTime = Time.now ()
|
||||||
|
|
||||||
|
val searchDuration = Time.- (finishTime, startTime)
|
||||||
|
val searchDuration = Time.toMilliseconds searchDuration
|
||||||
|
val searchDuration = LargeInt.toString searchDuration ^ "\n"
|
||||||
|
in
|
||||||
|
print searchDuration
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
val _ = BuildExistsStringSet.main ()
|
||||||
BIN
string-tries-sml/bench/build-get-prefix-bro-tree
Executable file
BIN
string-tries-sml/bench/build-get-prefix-bro-tree
Executable file
Binary file not shown.
10
string-tries-sml/bench/build-get-prefix-bro-tree.mlb
Normal file
10
string-tries-sml/bench/build-get-prefix-bro-tree.mlb
Normal file
@@ -0,0 +1,10 @@
|
|||||||
|
$(SML_LIB)/basis/basis.mlb
|
||||||
|
|
||||||
|
ann
|
||||||
|
"allowVectorExps true"
|
||||||
|
in
|
||||||
|
words.sml
|
||||||
|
end
|
||||||
|
|
||||||
|
bro-tree.sml
|
||||||
|
build-get-prefix-bro-tree.sml
|
||||||
20
string-tries-sml/bench/build-get-prefix-bro-tree.sml
Normal file
20
string-tries-sml/bench/build-get-prefix-bro-tree.sml
Normal file
@@ -0,0 +1,20 @@
|
|||||||
|
structure BuildGetPrefixBroTree =
|
||||||
|
struct
|
||||||
|
fun main () =
|
||||||
|
let
|
||||||
|
val endTrie =
|
||||||
|
Vector.foldl BroTree.insert BroTree.empty WordsList.words
|
||||||
|
|
||||||
|
val startTime = Time.now ()
|
||||||
|
val lst = BroTree.getPrefixList ("a", endTrie)
|
||||||
|
val finishTime = Time.now ()
|
||||||
|
|
||||||
|
val searchDuration = Time.- (finishTime, startTime)
|
||||||
|
val searchDuration = Time.toNanoseconds searchDuration
|
||||||
|
val searchDuration = LargeInt.toString searchDuration ^ " ns\n"
|
||||||
|
in
|
||||||
|
print searchDuration
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
val _ = BuildGetPrefixBroTree.main ()
|
||||||
BIN
string-tries-sml/bench/build-get-prefix-string-set
Executable file
BIN
string-tries-sml/bench/build-get-prefix-string-set
Executable file
Binary file not shown.
10
string-tries-sml/bench/build-get-prefix-string-set.mlb
Normal file
10
string-tries-sml/bench/build-get-prefix-string-set.mlb
Normal file
@@ -0,0 +1,10 @@
|
|||||||
|
$(SML_LIB)/basis/basis.mlb
|
||||||
|
|
||||||
|
ann
|
||||||
|
"allowVectorExps true"
|
||||||
|
in
|
||||||
|
words.sml
|
||||||
|
end
|
||||||
|
|
||||||
|
../src/string-set.sml
|
||||||
|
build-get-prefix-string-set.sml
|
||||||
20
string-tries-sml/bench/build-get-prefix-string-set.sml
Normal file
20
string-tries-sml/bench/build-get-prefix-string-set.sml
Normal file
@@ -0,0 +1,20 @@
|
|||||||
|
structure BuildGetPrefixStringSet =
|
||||||
|
struct
|
||||||
|
fun main () =
|
||||||
|
let
|
||||||
|
val endTrie =
|
||||||
|
Vector.foldl StringSet.insert StringSet.empty WordsList.words
|
||||||
|
|
||||||
|
val startTime = Time.now ()
|
||||||
|
val lst = StringSet.getPrefixList ("a", endTrie)
|
||||||
|
val finishTime = Time.now ()
|
||||||
|
|
||||||
|
val searchDuration = Time.- (finishTime, startTime)
|
||||||
|
val searchDuration = Time.toNanoseconds searchDuration
|
||||||
|
val searchDuration = LargeInt.toString searchDuration ^ " ns\n"
|
||||||
|
in
|
||||||
|
print searchDuration
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
val _ = BuildGetPrefixStringSet.main ()
|
||||||
BIN
string-tries-sml/bench/conv-words
Executable file
BIN
string-tries-sml/bench/conv-words
Executable file
Binary file not shown.
59
string-tries-sml/bench/conv-words.sml
Normal file
59
string-tries-sml/bench/conv-words.sml
Normal file
@@ -0,0 +1,59 @@
|
|||||||
|
(* generate a words.sml file with a vector of strings,
|
||||||
|
* from a line-delimited words.txt file *)
|
||||||
|
val inIo = TextIO.openIn "words.txt"
|
||||||
|
val outIO = TextIO.openOut "words.sml"
|
||||||
|
|
||||||
|
fun consWordChrs (wordChrs, acc) =
|
||||||
|
case wordChrs of
|
||||||
|
[] => acc
|
||||||
|
| _ => (String.concat wordChrs) :: acc
|
||||||
|
|
||||||
|
fun helpTokeniseLine (pos, wordChrs, line, acc) =
|
||||||
|
if pos < 0 then
|
||||||
|
consWordChrs (wordChrs, acc)
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val chr = String.sub (line, pos)
|
||||||
|
(* using Char.toString is necessary because it escapes double-quotes.
|
||||||
|
* Without escaping, there's a chance we will produce an invalid .sml file
|
||||||
|
* *)
|
||||||
|
val sChr = Char.toString chr
|
||||||
|
in
|
||||||
|
if Char.isPrint chr andalso not (Char.isSpace chr) then
|
||||||
|
helpTokeniseLine (pos - 1, sChr :: wordChrs, line, acc)
|
||||||
|
else
|
||||||
|
helpTokeniseLine (pos - 1, [], line, consWordChrs (wordChrs, acc))
|
||||||
|
end
|
||||||
|
|
||||||
|
fun tokeniseLine (line, acc) =
|
||||||
|
helpTokeniseLine (String.size line - 1, [], line, acc)
|
||||||
|
|
||||||
|
fun readLines (inIo, acc) =
|
||||||
|
case TextIO.inputLine inIo of
|
||||||
|
SOME line => readLines (inIo, tokeniseLine (line, acc))
|
||||||
|
| NONE => List.rev acc
|
||||||
|
|
||||||
|
fun writeLines (outIO, lst) =
|
||||||
|
case lst of
|
||||||
|
[] => ()
|
||||||
|
| word :: tl =>
|
||||||
|
let
|
||||||
|
val isLast = tl = []
|
||||||
|
val word = if isLast then "\"" ^ word ^ "\"" else "\"" ^ word ^ "\",\n"
|
||||||
|
val _ = TextIO.output (outIO, word)
|
||||||
|
in
|
||||||
|
writeLines (outIO, tl)
|
||||||
|
end
|
||||||
|
|
||||||
|
fun main () =
|
||||||
|
let
|
||||||
|
val lst = readLines (inIo, [])
|
||||||
|
val _ = TextIO.output
|
||||||
|
(outIO, "structure WordsList = \nstruct \n val words = #[\n")
|
||||||
|
val _ = writeLines (outIO, lst)
|
||||||
|
val _ = TextIO.output (outIO, "]\n end")
|
||||||
|
in
|
||||||
|
()
|
||||||
|
end
|
||||||
|
|
||||||
|
val _ = main ()
|
||||||
BIN
string-tries-sml/bench/insert-bro-tree
Executable file
BIN
string-tries-sml/bench/insert-bro-tree
Executable file
Binary file not shown.
10
string-tries-sml/bench/insert-bro-tree.mlb
Normal file
10
string-tries-sml/bench/insert-bro-tree.mlb
Normal file
@@ -0,0 +1,10 @@
|
|||||||
|
$(SML_LIB)/basis/basis.mlb
|
||||||
|
|
||||||
|
ann
|
||||||
|
"allowVectorExps true"
|
||||||
|
in
|
||||||
|
words.sml
|
||||||
|
end
|
||||||
|
|
||||||
|
bro-tree.sml
|
||||||
|
insert-bro-tree.sml
|
||||||
11
string-tries-sml/bench/insert-bro-tree.sml
Normal file
11
string-tries-sml/bench/insert-bro-tree.sml
Normal file
@@ -0,0 +1,11 @@
|
|||||||
|
structure InsertBroTree =
|
||||||
|
struct
|
||||||
|
fun main () =
|
||||||
|
let
|
||||||
|
val endTree = Vector.foldl BroTree.insert BroTree.empty WordsList.words
|
||||||
|
in
|
||||||
|
()
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
val _ = InsertBroTree.main ()
|
||||||
BIN
string-tries-sml/bench/insert-string-set
Executable file
BIN
string-tries-sml/bench/insert-string-set
Executable file
Binary file not shown.
10
string-tries-sml/bench/insert-string-set.mlb
Normal file
10
string-tries-sml/bench/insert-string-set.mlb
Normal file
@@ -0,0 +1,10 @@
|
|||||||
|
$(SML_LIB)/basis/basis.mlb
|
||||||
|
|
||||||
|
ann
|
||||||
|
"allowVectorExps true"
|
||||||
|
in
|
||||||
|
words.sml
|
||||||
|
end
|
||||||
|
|
||||||
|
../src/string-set.sml
|
||||||
|
insert-string-set.sml
|
||||||
11
string-tries-sml/bench/insert-string-set.sml
Normal file
11
string-tries-sml/bench/insert-string-set.sml
Normal file
@@ -0,0 +1,11 @@
|
|||||||
|
structure InsertStringSet =
|
||||||
|
struct
|
||||||
|
fun main () =
|
||||||
|
let
|
||||||
|
val endTrie = Vector.foldl StringSet.insert StringSet.empty WordsList.words
|
||||||
|
in
|
||||||
|
()
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
val _ = InsertStringSet.main ()
|
||||||
466553
string-tries-sml/bench/words.sml
Normal file
466553
string-tries-sml/bench/words.sml
Normal file
File diff suppressed because it is too large
Load Diff
466549
string-tries-sml/bench/words.txt
Normal file
466549
string-tries-sml/bench/words.txt
Normal file
File diff suppressed because it is too large
Load Diff
805
string-tries-sml/src/string-set.sml
Normal file
805
string-tries-sml/src/string-set.sml
Normal file
@@ -0,0 +1,805 @@
|
|||||||
|
signature STRING_SET =
|
||||||
|
sig
|
||||||
|
(* the type of tries *)
|
||||||
|
type t
|
||||||
|
|
||||||
|
(* the empty trie *)
|
||||||
|
val empty: t
|
||||||
|
|
||||||
|
(* StringSet.isEmpty trie
|
||||||
|
* returns true if the trie is empty *)
|
||||||
|
val isEmpty: t -> bool
|
||||||
|
|
||||||
|
(* StringSet.fromString "hello world"
|
||||||
|
* creates a trie containing just a string *)
|
||||||
|
val fromString: string -> t
|
||||||
|
|
||||||
|
(* StringSet.exists ("hello world", trie)
|
||||||
|
* returns true if the key was inserted into the trie *)
|
||||||
|
val exists: string * t -> bool
|
||||||
|
|
||||||
|
(* StringSet.insert ("myNewString", trie)
|
||||||
|
* inserts a new string into the trie, returning a new trie *)
|
||||||
|
val insert: string * t -> t
|
||||||
|
|
||||||
|
(* StringSet.remove ("stringToRemove", trie)
|
||||||
|
* removes the key from the trie, returning a new trie *)
|
||||||
|
val remove: string * t -> t
|
||||||
|
|
||||||
|
(* StringSet.getPrefixList ("myPrefix", trie)
|
||||||
|
* returns a list of all keys matching the specified prefix *)
|
||||||
|
val getPrefixList: string * t -> string list
|
||||||
|
|
||||||
|
(* StringSet.toList trie
|
||||||
|
* returns a list containing all keys in the trie *)
|
||||||
|
val toList: t -> string list
|
||||||
|
|
||||||
|
(* StringSet.fromList ["hello", "world"]
|
||||||
|
* returns a trie containing all keys in the string list *)
|
||||||
|
val fromList: string list -> t
|
||||||
|
|
||||||
|
(* StringSet.foldl (fn (key, acc) => String.size key + acc) 0 trie
|
||||||
|
* folds a value through the trie, from lowest to highest. *)
|
||||||
|
val foldl: (string * 'b -> 'b) -> 'b -> t -> 'b
|
||||||
|
|
||||||
|
(* StringSet.foldlWithPrefix (fn (key, acc) => String.size key + acc) 0 trie "myPrefix"
|
||||||
|
* folds a value through a subset of the trie containing the specified prefix,
|
||||||
|
* from lowest to highest. *)
|
||||||
|
val foldlWithPrefix: (string * 'b -> 'b) -> 'b -> t -> string -> 'b
|
||||||
|
|
||||||
|
(* StringSet.foldr (fn (key, acc) => String.size key + acc) 0 trie
|
||||||
|
* folds a value through the trie, from highest to lowest. *)
|
||||||
|
val foldr: (string * 'b -> 'b) -> 'b -> t -> 'b
|
||||||
|
|
||||||
|
(* StringSet.foldrWithPrefix (fn (key, acc) => String.size key + acc) 0 trie "myPrefix"
|
||||||
|
* folds a value through a subset of the trie containing the specified prefix,
|
||||||
|
* from highest to lowest. *)
|
||||||
|
val foldrWithPrefix: (string * 'b -> 'b) -> 'b -> t -> string -> 'b
|
||||||
|
end
|
||||||
|
|
||||||
|
structure StringSet: STRING_SET =
|
||||||
|
struct
|
||||||
|
datatype t =
|
||||||
|
CHILDREN of {keys: string vector, children: t vector}
|
||||||
|
| FOUND_WITH_CHILDREN of {keys: string vector, children: t vector}
|
||||||
|
| FOUND
|
||||||
|
|
||||||
|
val empty =
|
||||||
|
CHILDREN {keys = Vector.fromList [], children = Vector.fromList []}
|
||||||
|
|
||||||
|
fun isEmpty trie =
|
||||||
|
case trie of
|
||||||
|
CHILDREN {keys, ...} => Vector.length keys = 0
|
||||||
|
| _ => false
|
||||||
|
|
||||||
|
fun fromString str =
|
||||||
|
if String.size str > 0 then
|
||||||
|
CHILDREN
|
||||||
|
{keys = Vector.fromList [str], children = Vector.fromList [FOUND]}
|
||||||
|
else
|
||||||
|
empty
|
||||||
|
|
||||||
|
fun helpBinSearch (findChr, keyPos, children, low, high) =
|
||||||
|
if high >= low then
|
||||||
|
let
|
||||||
|
val mid = low + ((high - low) div 2)
|
||||||
|
val midStr = Vector.sub (children, mid)
|
||||||
|
val midChr = String.sub (midStr, keyPos)
|
||||||
|
in
|
||||||
|
if midChr = findChr then
|
||||||
|
SOME mid
|
||||||
|
else if midChr < findChr then
|
||||||
|
helpBinSearch (findChr, keyPos, children, mid + 1, high)
|
||||||
|
else
|
||||||
|
helpBinSearch (findChr, keyPos, children, low, mid - 1)
|
||||||
|
end
|
||||||
|
else
|
||||||
|
NONE
|
||||||
|
|
||||||
|
fun findBinSearch (findChr, keyPos, children) =
|
||||||
|
if Vector.length children > 0 then
|
||||||
|
helpBinSearch (findChr, keyPos, children, 0, Vector.length children - 1)
|
||||||
|
else
|
||||||
|
NONE
|
||||||
|
|
||||||
|
datatype search_string_match =
|
||||||
|
NO_SEARCH_MATCH
|
||||||
|
| FULL_SEARCH_MATCH
|
||||||
|
| SEARCH_KEY_CONTAINS_TRIE_KEY
|
||||||
|
| TRIE_KEY_CONTAINS_SEARCH_KEY
|
||||||
|
|
||||||
|
fun searchKeyMatch (searchKey, trieKey, keyPos) =
|
||||||
|
if
|
||||||
|
keyPos < String.size searchKey
|
||||||
|
then
|
||||||
|
if keyPos < String.size trieKey then
|
||||||
|
let
|
||||||
|
val searchChr = String.sub (searchKey, keyPos)
|
||||||
|
val trieChr = String.sub (trieKey, keyPos)
|
||||||
|
in
|
||||||
|
if searchChr = trieChr then
|
||||||
|
searchKeyMatch (searchKey, trieKey, keyPos + 1)
|
||||||
|
else
|
||||||
|
NO_SEARCH_MATCH
|
||||||
|
end
|
||||||
|
else
|
||||||
|
SEARCH_KEY_CONTAINS_TRIE_KEY
|
||||||
|
else if
|
||||||
|
keyPos = String.size searchKey
|
||||||
|
then
|
||||||
|
if keyPos < String.size trieKey then TRIE_KEY_CONTAINS_SEARCH_KEY
|
||||||
|
else if keyPos = String.size trieKey then FULL_SEARCH_MATCH
|
||||||
|
else SEARCH_KEY_CONTAINS_TRIE_KEY
|
||||||
|
else (* implicit: keyPos > String.size searchKey *) if
|
||||||
|
keyPos <= String.size trieKey
|
||||||
|
then
|
||||||
|
TRIE_KEY_CONTAINS_SEARCH_KEY
|
||||||
|
else
|
||||||
|
NO_SEARCH_MATCH
|
||||||
|
|
||||||
|
fun isFoundNode node =
|
||||||
|
case node of
|
||||||
|
FOUND => true
|
||||||
|
| FOUND_WITH_CHILDREN _ => true
|
||||||
|
| CHILDREN _ => false
|
||||||
|
|
||||||
|
fun checkChildrenExists (searchKey, keyPos, keys, children) =
|
||||||
|
let
|
||||||
|
val findChr = String.sub (searchKey, keyPos)
|
||||||
|
in
|
||||||
|
(case findBinSearch (findChr, keyPos, keys) of
|
||||||
|
SOME idx =>
|
||||||
|
let
|
||||||
|
val trieKey = Vector.sub (keys, idx)
|
||||||
|
in
|
||||||
|
(case searchKeyMatch (searchKey, trieKey, keyPos + 1) of
|
||||||
|
NO_SEARCH_MATCH => false
|
||||||
|
| FULL_SEARCH_MATCH =>
|
||||||
|
let val trieChild = Vector.sub (children, idx)
|
||||||
|
in isFoundNode trieChild
|
||||||
|
end
|
||||||
|
| SEARCH_KEY_CONTAINS_TRIE_KEY =>
|
||||||
|
let val trieChild = Vector.sub (children, idx)
|
||||||
|
in recurseExists (searchKey, String.size trieKey, trieChild)
|
||||||
|
end
|
||||||
|
| TRIE_KEY_CONTAINS_SEARCH_KEY => false)
|
||||||
|
end
|
||||||
|
| NONE => false)
|
||||||
|
end
|
||||||
|
|
||||||
|
and recurseExists (searchKey, keyPos, trie) =
|
||||||
|
case trie of
|
||||||
|
CHILDREN {keys, children} =>
|
||||||
|
checkChildrenExists (searchKey, keyPos, keys, children)
|
||||||
|
| FOUND_WITH_CHILDREN {keys, children} =>
|
||||||
|
checkChildrenExists (searchKey, keyPos, keys, children)
|
||||||
|
| FOUND =>
|
||||||
|
(*
|
||||||
|
* This case should only occur if we recurse in a node
|
||||||
|
* when there is a partial, but not full, string match.
|
||||||
|
* This is because of the isFoundNode helper function
|
||||||
|
* which is called by the parent.
|
||||||
|
* In other words, only the parent node returns true,
|
||||||
|
* by checking if the child is a found node and has
|
||||||
|
* a full string match.
|
||||||
|
*)
|
||||||
|
false
|
||||||
|
|
||||||
|
fun exists (searchKey, trie) =
|
||||||
|
if isEmpty trie orelse String.size searchKey = 0 then false
|
||||||
|
else recurseExists (searchKey, 0, trie)
|
||||||
|
|
||||||
|
datatype prefix_result =
|
||||||
|
PREFIX_FOUND of string * t
|
||||||
|
| NO_PREFIX_FOUND
|
||||||
|
| PREFIX_MATCHES_WHOLE_TRIE
|
||||||
|
|
||||||
|
fun helpGetPrefixSubtrieChildren (prefix, keyPos, keys, children, trie) =
|
||||||
|
let
|
||||||
|
val findChr = String.sub (prefix, keyPos)
|
||||||
|
in
|
||||||
|
case findBinSearch (findChr, keyPos, keys) of
|
||||||
|
SOME idx =>
|
||||||
|
let
|
||||||
|
val trieKey = Vector.sub (keys, idx)
|
||||||
|
in
|
||||||
|
(case searchKeyMatch (prefix, trieKey, keyPos + 1) of
|
||||||
|
NO_SEARCH_MATCH => NO_PREFIX_FOUND
|
||||||
|
| SEARCH_KEY_CONTAINS_TRIE_KEY =>
|
||||||
|
let
|
||||||
|
val trieChild = Vector.sub (children, idx)
|
||||||
|
in
|
||||||
|
helpGetPrefixSubtrie (prefix, String.size trieKey, trieChild)
|
||||||
|
end
|
||||||
|
| FULL_SEARCH_MATCH =>
|
||||||
|
let val node = Vector.sub (children, idx)
|
||||||
|
in PREFIX_FOUND (prefix, node)
|
||||||
|
end
|
||||||
|
| TRIE_KEY_CONTAINS_SEARCH_KEY =>
|
||||||
|
let val node = Vector.sub (children, idx)
|
||||||
|
in PREFIX_FOUND (trieKey, node)
|
||||||
|
end)
|
||||||
|
end
|
||||||
|
| NONE => NO_PREFIX_FOUND
|
||||||
|
end
|
||||||
|
|
||||||
|
and helpGetPrefixSubtrie (prefix, keyPos, trie) =
|
||||||
|
case trie of
|
||||||
|
CHILDREN {keys, children} =>
|
||||||
|
helpGetPrefixSubtrieChildren (prefix, keyPos, keys, children, trie)
|
||||||
|
| FOUND_WITH_CHILDREN {keys, children} =>
|
||||||
|
helpGetPrefixSubtrieChildren (prefix, keyPos, keys, children, trie)
|
||||||
|
| FOUND => NO_PREFIX_FOUND
|
||||||
|
|
||||||
|
fun getPrefixSubtrie (prefix, trie) =
|
||||||
|
if String.size prefix > 0 then
|
||||||
|
if isEmpty trie then NO_PREFIX_FOUND
|
||||||
|
else helpGetPrefixSubtrie (prefix, 0, trie)
|
||||||
|
else
|
||||||
|
PREFIX_MATCHES_WHOLE_TRIE
|
||||||
|
|
||||||
|
fun helpFoldlTrieVector (f, pos, keys, children, acc) =
|
||||||
|
if pos = Vector.length children then
|
||||||
|
acc
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val curChild = Vector.sub (children, pos)
|
||||||
|
val acc = helpFoldl (f, curChild, acc)
|
||||||
|
val acc =
|
||||||
|
if isFoundNode curChild then f (Vector.sub (keys, pos), acc) else acc
|
||||||
|
in
|
||||||
|
helpFoldlTrieVector (f, pos + 1, keys, children, acc)
|
||||||
|
end
|
||||||
|
|
||||||
|
and helpFoldl (f, trie, acc) =
|
||||||
|
case trie of
|
||||||
|
CHILDREN {keys, children} =>
|
||||||
|
helpFoldlTrieVector (f, 0, keys, children, acc)
|
||||||
|
| FOUND_WITH_CHILDREN {keys, children} =>
|
||||||
|
helpFoldlTrieVector (f, 0, keys, children, acc)
|
||||||
|
| FOUND => acc
|
||||||
|
|
||||||
|
fun foldl f initial trie = helpFoldl (f, trie, initial)
|
||||||
|
|
||||||
|
fun foldlWithPrefix f initial trie prefix =
|
||||||
|
case getPrefixSubtrie (prefix, trie) of
|
||||||
|
PREFIX_FOUND (prefix, subtrie) =>
|
||||||
|
let val acc = helpFoldl (f, subtrie, initial)
|
||||||
|
in if isFoundNode subtrie then f (prefix, acc) else acc
|
||||||
|
end
|
||||||
|
| NO_PREFIX_FOUND => initial
|
||||||
|
| PREFIX_MATCHES_WHOLE_TRIE => helpFoldl (f, trie, initial)
|
||||||
|
|
||||||
|
fun helpFoldrTrieVector (f, pos, keys, children, acc) =
|
||||||
|
if pos < 0 then
|
||||||
|
acc
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val curChild = Vector.sub (children, pos)
|
||||||
|
val acc = helpFoldr (f, curChild, acc)
|
||||||
|
val acc =
|
||||||
|
if isFoundNode curChild then f (Vector.sub (keys, pos), acc) else acc
|
||||||
|
in
|
||||||
|
helpFoldrTrieVector (f, pos - 1, keys, children, acc)
|
||||||
|
end
|
||||||
|
|
||||||
|
and helpFoldr (f, trie, acc) =
|
||||||
|
case trie of
|
||||||
|
CHILDREN {keys, children} =>
|
||||||
|
helpFoldrTrieVector (f, Vector.length keys - 1, keys, children, acc)
|
||||||
|
| FOUND_WITH_CHILDREN {keys, children} =>
|
||||||
|
helpFoldrTrieVector (f, Vector.length keys - 1, keys, children, acc)
|
||||||
|
| FOUND => acc
|
||||||
|
|
||||||
|
fun foldr f initial trie = helpFoldr (f, trie, initial)
|
||||||
|
|
||||||
|
fun foldrWithPrefix f initial trie prefix =
|
||||||
|
case getPrefixSubtrie (prefix, trie) of
|
||||||
|
PREFIX_FOUND (prefix, subtrie) =>
|
||||||
|
let val acc = helpFoldr (f, subtrie, initial)
|
||||||
|
in if isFoundNode subtrie then f (prefix, acc) else acc
|
||||||
|
end
|
||||||
|
| NO_PREFIX_FOUND => initial
|
||||||
|
| PREFIX_MATCHES_WHOLE_TRIE => helpFoldr (f, trie, initial)
|
||||||
|
|
||||||
|
(* recurseHelpGetPrefixList and helpGetPrefixList are basically manually coded
|
||||||
|
* foldr functions over the trie, applying the accumuluator to every found
|
||||||
|
* node, from right to left.
|
||||||
|
* No need to recode it as a usage of a generic foldr function though,
|
||||||
|
* because lower dispatch cost this way. *)
|
||||||
|
fun recurseHelpGetPrefixList (pos, keys, children, acc) =
|
||||||
|
if pos < 0 then
|
||||||
|
acc
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val curChild = Vector.sub (children, pos)
|
||||||
|
val acc = helpGetPrefixList (curChild, acc)
|
||||||
|
val acc =
|
||||||
|
if isFoundNode curChild then Vector.sub (keys, pos) :: acc else acc
|
||||||
|
in
|
||||||
|
recurseHelpGetPrefixList (pos - 1, keys, children, acc)
|
||||||
|
end
|
||||||
|
|
||||||
|
and helpGetPrefixList (trie, acc) =
|
||||||
|
case trie of
|
||||||
|
CHILDREN {keys, children} =>
|
||||||
|
recurseHelpGetPrefixList (Vector.length keys - 1, keys, children, acc)
|
||||||
|
| FOUND_WITH_CHILDREN {keys, children} =>
|
||||||
|
recurseHelpGetPrefixList (Vector.length keys - 1, keys, children, acc)
|
||||||
|
| FOUND => acc
|
||||||
|
|
||||||
|
fun getPrefixList (prefix, trie) =
|
||||||
|
case getPrefixSubtrie (prefix, trie) of
|
||||||
|
PREFIX_FOUND (prefix, subtrie) =>
|
||||||
|
let val lst = helpGetPrefixList (subtrie, [])
|
||||||
|
in if isFoundNode subtrie then prefix :: lst else lst
|
||||||
|
end
|
||||||
|
| NO_PREFIX_FOUND => []
|
||||||
|
| PREFIX_MATCHES_WHOLE_TRIE => helpGetPrefixList (trie, [])
|
||||||
|
|
||||||
|
fun toList trie = helpGetPrefixList (trie, [])
|
||||||
|
|
||||||
|
datatype insert_string_match =
|
||||||
|
NO_INSERT_MATCH
|
||||||
|
(* may need to split string if difference found but prefix matched *)
|
||||||
|
| DIFFERENCE_FOUND_AT of int
|
||||||
|
(* may not need to do anything if insert key matched,
|
||||||
|
* as this is a set where only strings are stored.
|
||||||
|
* however, if this is a non-found node, then I need to change
|
||||||
|
* the tag/case. *)
|
||||||
|
| FULL_INSERT_MATCH
|
||||||
|
(* if insert key contains trie key, may need to recurse down node *)
|
||||||
|
| INSERT_KEY_CONTAINS_TRIE_KEY
|
||||||
|
(* if trie key contains insert key, need to split node *)
|
||||||
|
| TRIE_KEY_CONTAINS_INSERT_KEY
|
||||||
|
|
||||||
|
fun insertKeyMatch (insertKey, trieKey, keyPos) =
|
||||||
|
if
|
||||||
|
keyPos < String.size insertKey
|
||||||
|
then
|
||||||
|
if keyPos < String.size trieKey then
|
||||||
|
let
|
||||||
|
val searchChr = String.sub (insertKey, keyPos)
|
||||||
|
val trieChr = String.sub (trieKey, keyPos)
|
||||||
|
in
|
||||||
|
if searchChr = trieChr then
|
||||||
|
insertKeyMatch (insertKey, trieKey, keyPos + 1)
|
||||||
|
else
|
||||||
|
DIFFERENCE_FOUND_AT keyPos
|
||||||
|
end
|
||||||
|
else
|
||||||
|
INSERT_KEY_CONTAINS_TRIE_KEY
|
||||||
|
else if
|
||||||
|
keyPos = String.size insertKey
|
||||||
|
then
|
||||||
|
if keyPos < String.size trieKey then TRIE_KEY_CONTAINS_INSERT_KEY
|
||||||
|
else if keyPos = String.size trieKey then FULL_INSERT_MATCH
|
||||||
|
else INSERT_KEY_CONTAINS_TRIE_KEY
|
||||||
|
else (* implicit: keyPos > String.size insertKey *) if
|
||||||
|
keyPos <= String.size trieKey
|
||||||
|
then
|
||||||
|
TRIE_KEY_CONTAINS_INSERT_KEY
|
||||||
|
else
|
||||||
|
NO_INSERT_MATCH
|
||||||
|
|
||||||
|
datatype insert_bin_search_result =
|
||||||
|
INSERT_NEW_CHILD of int
|
||||||
|
| FOUND_INSERT_POS of int
|
||||||
|
| APPEND_NEW_CHILD
|
||||||
|
|
||||||
|
fun linearSearch (findChr, keyPos, idx, children) =
|
||||||
|
if idx = Vector.length children then
|
||||||
|
APPEND_NEW_CHILD
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val curStr = Vector.sub (children, idx)
|
||||||
|
val curChr = String.sub (curStr, keyPos)
|
||||||
|
in
|
||||||
|
if curChr > findChr then INSERT_NEW_CHILD idx
|
||||||
|
else linearSearch (findChr, keyPos, idx + 1, children)
|
||||||
|
end
|
||||||
|
|
||||||
|
fun helpInsertBinSearch (findChr, keyPos, children, low, high) =
|
||||||
|
let
|
||||||
|
val mid = low + ((high - low) div 2)
|
||||||
|
in
|
||||||
|
if high >= low then
|
||||||
|
let
|
||||||
|
val midStr = Vector.sub (children, mid)
|
||||||
|
val midChr = String.sub (midStr, keyPos)
|
||||||
|
in
|
||||||
|
if midChr = findChr then
|
||||||
|
FOUND_INSERT_POS mid
|
||||||
|
else if midChr < findChr then
|
||||||
|
helpInsertBinSearch (findChr, keyPos, children, mid + 1, high)
|
||||||
|
else
|
||||||
|
helpInsertBinSearch (findChr, keyPos, children, low, mid - 1)
|
||||||
|
end
|
||||||
|
else
|
||||||
|
linearSearch (findChr, keyPos, if mid >= 0 then mid else 0, children)
|
||||||
|
end
|
||||||
|
|
||||||
|
fun insertBinSearch (findChr, keyPos, children) =
|
||||||
|
helpInsertBinSearch
|
||||||
|
(findChr, keyPos, children, 0, Vector.length children - 1)
|
||||||
|
|
||||||
|
fun insertDifferenceFoundAt
|
||||||
|
( insKey
|
||||||
|
, insIdx
|
||||||
|
, splitTrieKeyStart
|
||||||
|
, trieChild
|
||||||
|
, childKeys
|
||||||
|
, childChildren
|
||||||
|
, parentKeys
|
||||||
|
, parentChildren
|
||||||
|
, parentConstructor
|
||||||
|
) =
|
||||||
|
let
|
||||||
|
(* child node should always have CHILDREN case,
|
||||||
|
* because we are splitting prefix into two,
|
||||||
|
* when neither trieKey nor insKey match the prefix. *)
|
||||||
|
val childNode = CHILDREN {keys = childKeys, children = childChildren}
|
||||||
|
val keys =
|
||||||
|
Vector.mapi
|
||||||
|
(fn (idx, key) => if idx <> insIdx then key else splitTrieKeyStart)
|
||||||
|
parentKeys
|
||||||
|
|
||||||
|
val children =
|
||||||
|
Vector.mapi (fn (idx, elt) => if idx <> insIdx then elt else childNode)
|
||||||
|
parentChildren
|
||||||
|
in
|
||||||
|
parentConstructor {keys = keys, children = children}
|
||||||
|
end
|
||||||
|
|
||||||
|
fun insertNewChild (keys, insIdx, insKey, children, constructor) =
|
||||||
|
let
|
||||||
|
val newLength = Vector.length keys + 1
|
||||||
|
val newKeys = Vector.tabulate (newLength, fn idx =>
|
||||||
|
if idx < insIdx then Vector.sub (keys, idx)
|
||||||
|
else if idx > insIdx then Vector.sub (keys, idx - 1)
|
||||||
|
else insKey)
|
||||||
|
|
||||||
|
val newChildren = Vector.tabulate (newLength, fn idx =>
|
||||||
|
if idx < insIdx then Vector.sub (children, idx)
|
||||||
|
else if idx > insIdx then Vector.sub (children, idx - 1)
|
||||||
|
else FOUND)
|
||||||
|
in
|
||||||
|
constructor {keys = newKeys, children = newChildren}
|
||||||
|
end
|
||||||
|
|
||||||
|
fun appendNewChild (keys, insKey, children, constructor) =
|
||||||
|
let
|
||||||
|
val newKeys = Vector.concat [keys, Vector.fromList [insKey]]
|
||||||
|
val newChildren = Vector.concat [children, Vector.fromList [FOUND]]
|
||||||
|
in
|
||||||
|
constructor {keys = newKeys, children = newChildren}
|
||||||
|
end
|
||||||
|
|
||||||
|
fun foundInsertPos (keys, children, keyPos, insKey, insIdx, trie, constructor) =
|
||||||
|
let
|
||||||
|
val trieKey = Vector.sub (keys, insIdx)
|
||||||
|
in
|
||||||
|
(case insertKeyMatch (insKey, trieKey, keyPos + 1) of
|
||||||
|
(* may need to split string if difference found but prefix matched *)
|
||||||
|
DIFFERENCE_FOUND_AT diffIdx =>
|
||||||
|
let
|
||||||
|
val splitTrieKeyStart = String.substring (trieKey, 0, diffIdx)
|
||||||
|
val trieChild = Vector.sub (children, insIdx)
|
||||||
|
in
|
||||||
|
if String.sub (trieKey, diffIdx) > String.sub (insKey, diffIdx) then
|
||||||
|
(* place insKey before trieKey *)
|
||||||
|
let
|
||||||
|
val childKeys = Vector.fromList [insKey, trieKey]
|
||||||
|
val childChildren = Vector.fromList [FOUND, trieChild]
|
||||||
|
in
|
||||||
|
insertDifferenceFoundAt
|
||||||
|
( insKey
|
||||||
|
, insIdx
|
||||||
|
, splitTrieKeyStart
|
||||||
|
, trieChild
|
||||||
|
, childKeys
|
||||||
|
, childChildren
|
||||||
|
, keys
|
||||||
|
, children
|
||||||
|
, constructor
|
||||||
|
)
|
||||||
|
end
|
||||||
|
else
|
||||||
|
(* place trieKey before insKey *)
|
||||||
|
let
|
||||||
|
val childKeys = Vector.fromList [trieKey, insKey]
|
||||||
|
val childChildren = Vector.fromList [trieChild, FOUND]
|
||||||
|
in
|
||||||
|
insertDifferenceFoundAt
|
||||||
|
( insKey
|
||||||
|
, insIdx
|
||||||
|
, splitTrieKeyStart
|
||||||
|
, trieChild
|
||||||
|
, childKeys
|
||||||
|
, childChildren
|
||||||
|
, keys
|
||||||
|
, children
|
||||||
|
, constructor
|
||||||
|
)
|
||||||
|
end
|
||||||
|
end
|
||||||
|
(* may not need to do anything if insert key matched,
|
||||||
|
* as this is a set where only strings are stored.
|
||||||
|
* however, if this is a non-found node, then I need to change
|
||||||
|
* the tag/case. *)
|
||||||
|
| FULL_INSERT_MATCH =>
|
||||||
|
(* in case of a full match,
|
||||||
|
* need to turn child into FOUND_WITH_CHILDREN
|
||||||
|
* or FOUND case, if it not already either
|
||||||
|
*)
|
||||||
|
(case Vector.sub (children, insIdx) of
|
||||||
|
CHILDREN {keys = childKeys, children = childChildren} =>
|
||||||
|
let
|
||||||
|
val newTrieChild =
|
||||||
|
FOUND_WITH_CHILDREN
|
||||||
|
{keys = childKeys, children = childChildren}
|
||||||
|
val newParentChildren =
|
||||||
|
Vector.mapi
|
||||||
|
(fn (childIdx, elt) =>
|
||||||
|
if insIdx <> childIdx then elt else newTrieChild)
|
||||||
|
children
|
||||||
|
in
|
||||||
|
constructor {keys = keys, children = newParentChildren}
|
||||||
|
end
|
||||||
|
| _ => trie)
|
||||||
|
(* if insert key contains trie key, need to recurse down node *)
|
||||||
|
| INSERT_KEY_CONTAINS_TRIE_KEY =>
|
||||||
|
let
|
||||||
|
val trieChild = Vector.sub (children, insIdx)
|
||||||
|
val newTrieChild =
|
||||||
|
helpInsert (insKey, String.size trieKey, trieChild)
|
||||||
|
val newChildren =
|
||||||
|
Vector.mapi
|
||||||
|
(fn (idx, elt) => if idx <> insIdx then elt else newTrieChild)
|
||||||
|
children
|
||||||
|
in
|
||||||
|
constructor {keys = keys, children = newChildren}
|
||||||
|
end
|
||||||
|
(* if trie key contains insert key, need to split node *)
|
||||||
|
| TRIE_KEY_CONTAINS_INSERT_KEY =>
|
||||||
|
let
|
||||||
|
val trieChild = Vector.sub (children, insIdx)
|
||||||
|
val newKeys =
|
||||||
|
Vector.mapi
|
||||||
|
(fn (idx, key) => if idx <> insIdx then key else insKey) keys
|
||||||
|
|
||||||
|
(* newTrieChild should always be FOUND_WITH_CHILDREN,
|
||||||
|
* because previous part matches insert key,
|
||||||
|
* and esecond part matches trieKey *)
|
||||||
|
val newTrieChild = FOUND_WITH_CHILDREN
|
||||||
|
{ keys = Vector.fromList [trieKey]
|
||||||
|
, children = Vector.fromList [trieChild]
|
||||||
|
}
|
||||||
|
|
||||||
|
val newChildren =
|
||||||
|
Vector.mapi
|
||||||
|
(fn (idx, elt) => if idx <> insIdx then elt else newTrieChild)
|
||||||
|
children
|
||||||
|
in
|
||||||
|
constructor {keys = newKeys, children = newChildren}
|
||||||
|
end
|
||||||
|
| NO_INSERT_MATCH => trie)
|
||||||
|
end
|
||||||
|
|
||||||
|
and helpInsertChildren
|
||||||
|
(insKey, keyPos, keys, children, trie, parentConstructor) =
|
||||||
|
let
|
||||||
|
val findChr = String.sub (insKey, keyPos)
|
||||||
|
in
|
||||||
|
(case insertBinSearch (findChr, keyPos, keys) of
|
||||||
|
INSERT_NEW_CHILD insIdx =>
|
||||||
|
insertNewChild (keys, insIdx, insKey, children, parentConstructor)
|
||||||
|
| FOUND_INSERT_POS insIdx =>
|
||||||
|
foundInsertPos
|
||||||
|
(keys, children, keyPos, insKey, insIdx, trie, parentConstructor)
|
||||||
|
| APPEND_NEW_CHILD =>
|
||||||
|
appendNewChild (keys, insKey, children, parentConstructor))
|
||||||
|
end
|
||||||
|
|
||||||
|
and helpInsert (insKey, keyPos, trie) : t =
|
||||||
|
case trie of
|
||||||
|
FOUND =>
|
||||||
|
if keyPos = String.size insKey then
|
||||||
|
FOUND
|
||||||
|
else
|
||||||
|
FOUND_WITH_CHILDREN
|
||||||
|
{ keys = Vector.fromList [insKey]
|
||||||
|
, children = Vector.fromList [FOUND]
|
||||||
|
}
|
||||||
|
| CHILDREN {keys, children} =>
|
||||||
|
helpInsertChildren (insKey, keyPos, keys, children, trie, CHILDREN)
|
||||||
|
| FOUND_WITH_CHILDREN {keys, children} =>
|
||||||
|
helpInsertChildren
|
||||||
|
(insKey, keyPos, keys, children, trie, FOUND_WITH_CHILDREN)
|
||||||
|
|
||||||
|
fun insert (insKey, trie) =
|
||||||
|
if String.size insKey > 0 then
|
||||||
|
if isEmpty trie then fromString insKey else helpInsert (insKey, 0, trie)
|
||||||
|
else
|
||||||
|
trie
|
||||||
|
|
||||||
|
fun helpAddList (str, acc) = insert (str, acc)
|
||||||
|
|
||||||
|
fun addList (lst, trie) =
|
||||||
|
List.foldl helpAddList trie lst
|
||||||
|
|
||||||
|
fun fromList (hd :: tl) =
|
||||||
|
let val trie = fromString hd
|
||||||
|
in addList (tl, trie)
|
||||||
|
end
|
||||||
|
| fromList ([]) = empty
|
||||||
|
|
||||||
|
datatype remove_result = UNCHANGED | MADE_EMPTY | CHANGED of t
|
||||||
|
|
||||||
|
(* should be called when there is a FULL_SEARCH_MATCH
|
||||||
|
* and child is a terminal FOUND node *)
|
||||||
|
fun removeWhenChildIsMadeEmpty
|
||||||
|
(idx, keys, children, isFoundWithChildren, parentConstructor) =
|
||||||
|
(* if child was made empty, then:
|
||||||
|
* - if the parent only has 1 child, it should be MADE_EMPTY too
|
||||||
|
* - otherwise, just remove the key and child at this idx from parent
|
||||||
|
* *)
|
||||||
|
if Vector.length keys > 1 then
|
||||||
|
if idx > 0 then
|
||||||
|
let
|
||||||
|
val newKeys = Vector.tabulate (Vector.length keys - 1, fn keyIdx =>
|
||||||
|
Vector.sub (keys, if keyIdx >= idx then keyIdx - 1 else keyIdx))
|
||||||
|
|
||||||
|
val newChildren =
|
||||||
|
Vector.tabulate (Vector.length keys - 1, fn childIdx =>
|
||||||
|
Vector.sub
|
||||||
|
(children, if childIdx >= idx then childIdx - 1 else childIdx))
|
||||||
|
|
||||||
|
val newNode =
|
||||||
|
parentConstructor {keys = newKeys, children = newChildren}
|
||||||
|
in
|
||||||
|
CHANGED newNode
|
||||||
|
end
|
||||||
|
else
|
||||||
|
(* if idx = 0, then have to slice first element off from vector *)
|
||||||
|
let
|
||||||
|
val keySlice = VectorSlice.slice (keys, 1, SOME
|
||||||
|
(Vector.length keys - 1))
|
||||||
|
val newKeys = VectorSlice.vector keySlice
|
||||||
|
|
||||||
|
val childrenSlice = VectorSlice.slice (children, 1, SOME
|
||||||
|
(Vector.length children - 1))
|
||||||
|
val newChildren = VectorSlice.vector childrenSlice
|
||||||
|
|
||||||
|
val newNode =
|
||||||
|
parentConstructor {keys = newKeys, children = newChildren}
|
||||||
|
in
|
||||||
|
CHANGED newNode
|
||||||
|
end
|
||||||
|
else (* if the caller was from the FOUND_WITH_CHILDREN case,
|
||||||
|
* then, instead of deleting this node entirely,
|
||||||
|
* we only delete this node's children, and mark this node
|
||||||
|
* as found.
|
||||||
|
* However, in general case where this is a CHILDREN node
|
||||||
|
* whose key was not inserted into the trie,
|
||||||
|
* we should fully delete this node as well. *) if isFoundWithChildren then
|
||||||
|
CHANGED FOUND
|
||||||
|
else
|
||||||
|
MADE_EMPTY
|
||||||
|
|
||||||
|
(* should be called when searchKeyMatch returns FULL_MATCH
|
||||||
|
*in helpRemove function *)
|
||||||
|
fun removeWhenFullMatch
|
||||||
|
(idx, keys, children, isFoundWithChildren, parentConstructor) =
|
||||||
|
(* matching over the child at this idx *)
|
||||||
|
case Vector.sub (children, idx) of
|
||||||
|
(* CHILDREN is a not-found case, so have to leave parent unchanged
|
||||||
|
* as there is no key to delete. *)
|
||||||
|
CHILDREN _ => UNCHANGED
|
||||||
|
(* FOUND_WITH_CHILDREN is a found case containing links to other nodes
|
||||||
|
* so we just need to change the tag from FOUND_WITH_CHILDREN to CHILDREN *)
|
||||||
|
| FOUND_WITH_CHILDREN {keys = childKeys, children = childChildren} =>
|
||||||
|
let
|
||||||
|
val newChild = CHILDREN {keys = childKeys, children = childChildren}
|
||||||
|
val newParentChildren =
|
||||||
|
Vector.mapi
|
||||||
|
(fn (mapIdx, elt) => if mapIdx <> idx then elt else newChild)
|
||||||
|
children
|
||||||
|
|
||||||
|
val newParent =
|
||||||
|
parentConstructor {keys = keys, children = newParentChildren}
|
||||||
|
in
|
||||||
|
CHANGED newParent
|
||||||
|
end
|
||||||
|
| FOUND =>
|
||||||
|
removeWhenChildIsMadeEmpty
|
||||||
|
(idx, keys, children, isFoundWithChildren, parentConstructor)
|
||||||
|
|
||||||
|
fun removeWhenSearchKeyContainsTrieKey
|
||||||
|
(childResult, idx, keys, children, isFoundWithChildren, parentConstructor) =
|
||||||
|
case childResult of
|
||||||
|
(* if result is UNCHANGED, let UNCHANGED bubble to the top.
|
||||||
|
* At the top, can return same trie given as input as there was no
|
||||||
|
* change. *)
|
||||||
|
UNCHANGED => UNCHANGED
|
||||||
|
(* if child was changed, allocate new vector where child at this idx
|
||||||
|
* is updated with newChild, and use it in parent node. *)
|
||||||
|
| CHANGED newChild =>
|
||||||
|
let
|
||||||
|
val newChildren =
|
||||||
|
Vector.mapi
|
||||||
|
(fn (childIdx, elt) => if idx <> childIdx then elt else newChild)
|
||||||
|
children
|
||||||
|
|
||||||
|
val newNode = parentConstructor {keys = keys, children = newChildren}
|
||||||
|
in
|
||||||
|
CHANGED newNode
|
||||||
|
end
|
||||||
|
| MADE_EMPTY =>
|
||||||
|
removeWhenChildIsMadeEmpty
|
||||||
|
(idx, keys, children, isFoundWithChildren, parentConstructor)
|
||||||
|
|
||||||
|
fun helpRemoveChildren
|
||||||
|
(removeKey, keyPos, keys, children, isFoundWithChildren, parentConstructor) =
|
||||||
|
let
|
||||||
|
val findChr = String.sub (removeKey, keyPos)
|
||||||
|
in
|
||||||
|
(case findBinSearch (findChr, keyPos, keys) of
|
||||||
|
SOME idx =>
|
||||||
|
let
|
||||||
|
val trieKey = Vector.sub (keys, idx)
|
||||||
|
in
|
||||||
|
(case searchKeyMatch (removeKey, trieKey, keyPos + 1) of
|
||||||
|
(* no search match means nothing to delete *)
|
||||||
|
NO_SEARCH_MATCH => UNCHANGED
|
||||||
|
| FULL_SEARCH_MATCH =>
|
||||||
|
removeWhenFullMatch
|
||||||
|
( idx
|
||||||
|
, keys
|
||||||
|
, children
|
||||||
|
, isFoundWithChildren
|
||||||
|
, parentConstructor
|
||||||
|
)
|
||||||
|
| SEARCH_KEY_CONTAINS_TRIE_KEY =>
|
||||||
|
removeWhenSearchKeyContainsTrieKey
|
||||||
|
( helpRemove
|
||||||
|
( removeKey
|
||||||
|
, String.size trieKey
|
||||||
|
, Vector.sub (children, idx)
|
||||||
|
)
|
||||||
|
, idx
|
||||||
|
, keys
|
||||||
|
, children
|
||||||
|
, isFoundWithChildren
|
||||||
|
, parentConstructor
|
||||||
|
)
|
||||||
|
| TRIE_KEY_CONTAINS_SEARCH_KEY => UNCHANGED)
|
||||||
|
end
|
||||||
|
| NONE => UNCHANGED)
|
||||||
|
end
|
||||||
|
|
||||||
|
and helpRemove (removeKey, keyPos, trie) =
|
||||||
|
case trie of
|
||||||
|
CHILDREN {keys, children} =>
|
||||||
|
helpRemoveChildren (removeKey, keyPos, keys, children, false, CHILDREN)
|
||||||
|
| FOUND_WITH_CHILDREN {keys, children} =>
|
||||||
|
helpRemoveChildren
|
||||||
|
(removeKey, keyPos, keys, children, true, FOUND_WITH_CHILDREN)
|
||||||
|
| FOUND =>
|
||||||
|
(*
|
||||||
|
* This case should only occur if we recurse in a node
|
||||||
|
* when there is a partial, but not full, string match.
|
||||||
|
* Since there was no full string match,
|
||||||
|
* key doesn't exist in trie and so we must leave it unchanged.
|
||||||
|
*)
|
||||||
|
UNCHANGED
|
||||||
|
|
||||||
|
fun remove (removeKey, trie) =
|
||||||
|
if String.size removeKey = 0 orelse isEmpty trie then
|
||||||
|
trie
|
||||||
|
else
|
||||||
|
case helpRemove (removeKey, 0, trie) of
|
||||||
|
CHANGED trie => trie
|
||||||
|
| MADE_EMPTY => empty
|
||||||
|
| UNCHANGED => trie
|
||||||
|
end
|
||||||
202
string-tries-sml/src/ternary-string-set.sml
Normal file
202
string-tries-sml/src/ternary-string-set.sml
Normal file
@@ -0,0 +1,202 @@
|
|||||||
|
(* An attempt at implementing terneary search trees,
|
||||||
|
* except that each node contains a full string instead of a char *)
|
||||||
|
structure TernaryStringSet =
|
||||||
|
struct
|
||||||
|
datatype t =
|
||||||
|
NODE of {left: t, key: string, follow: t, right: t}
|
||||||
|
(* follow is if insString contains nodeString *)
|
||||||
|
| FOUND_NODE of {left: t, key: string, follow: t, right: t}
|
||||||
|
| LEAF of string
|
||||||
|
| EMPTY
|
||||||
|
|
||||||
|
val empty = EMPTY
|
||||||
|
|
||||||
|
fun isEmpty t = t = EMPTY
|
||||||
|
|
||||||
|
fun fromString str =
|
||||||
|
if String.size str > 0 then LEAF str else EMPTY
|
||||||
|
|
||||||
|
val nodeStringContainsInsString = ~1
|
||||||
|
val insStringContainsNodeString = ~2
|
||||||
|
|
||||||
|
fun getBreakPos (pos, insString, nodeString) =
|
||||||
|
if pos = String.size insString andalso pos = String.size nodeString then
|
||||||
|
pos
|
||||||
|
else if pos = String.size insString then
|
||||||
|
(* if nodeString contains insString, return ~1 *)
|
||||||
|
nodeStringContainsInsString
|
||||||
|
else if pos = String.size nodeString then
|
||||||
|
(* if insString contains nodeString, return ~2 *)
|
||||||
|
insStringContainsNodeString
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val insChr = String.sub (insString, pos)
|
||||||
|
val nodeChr = String.sub (nodeString, pos)
|
||||||
|
in
|
||||||
|
if insChr = nodeChr then
|
||||||
|
(* continue *)
|
||||||
|
getBreakPos (pos + 1, insString, nodeString)
|
||||||
|
else
|
||||||
|
(* return break position *)
|
||||||
|
pos
|
||||||
|
end
|
||||||
|
|
||||||
|
fun helpIns (pos, insString, t) =
|
||||||
|
case t of
|
||||||
|
NODE {left, key = nodeString, follow, right} =>
|
||||||
|
let
|
||||||
|
val breakPos = getBreakPos (pos, insString, nodeString)
|
||||||
|
in
|
||||||
|
if breakPos = String.size insString then
|
||||||
|
(* change node tag *)
|
||||||
|
FOUND_NODE
|
||||||
|
{left = left, key = nodeString, follow = follow, right = right}
|
||||||
|
else if breakPos = nodeStringContainsInsString then
|
||||||
|
(* add new node at current position,
|
||||||
|
* making current node a child of the new node *)
|
||||||
|
FOUND_NODE
|
||||||
|
{left = EMPTY, key = insString, follow = t, right = EMPTY}
|
||||||
|
else if breakPos = insStringContainsNodeString then
|
||||||
|
(* follow *)
|
||||||
|
let
|
||||||
|
val follow = helpIns (breakPos, insString, follow)
|
||||||
|
in
|
||||||
|
NODE
|
||||||
|
{left = left, key = nodeString, follow = follow, right = right}
|
||||||
|
end
|
||||||
|
else
|
||||||
|
(* we have a difference: do we want to go left or right? *)
|
||||||
|
let
|
||||||
|
val insChr = String.sub (insString, breakPos)
|
||||||
|
val nodeChr = String.sub (nodeString, breakPos)
|
||||||
|
in
|
||||||
|
if insChr < nodeChr then
|
||||||
|
let
|
||||||
|
val left = helpIns (breakPos, insString, left)
|
||||||
|
in
|
||||||
|
NODE
|
||||||
|
{ left = left
|
||||||
|
, key = nodeString
|
||||||
|
, follow = follow
|
||||||
|
, right = right
|
||||||
|
}
|
||||||
|
end
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val right = helpIns (breakPos, insString, right)
|
||||||
|
in
|
||||||
|
NODE
|
||||||
|
{ left = left
|
||||||
|
, key = nodeString
|
||||||
|
, follow = follow
|
||||||
|
, right = right
|
||||||
|
}
|
||||||
|
end
|
||||||
|
end
|
||||||
|
end
|
||||||
|
| FOUND_NODE {left, key = nodeString, follow, right} =>
|
||||||
|
let
|
||||||
|
val breakPos = getBreakPos (pos, insString, nodeString)
|
||||||
|
in
|
||||||
|
if breakPos = String.size insString then
|
||||||
|
(* return original tree as nothing to do, since key already exists *)
|
||||||
|
t
|
||||||
|
else if breakPos = nodeStringContainsInsString then
|
||||||
|
(* add new node at current position,
|
||||||
|
* making current node a child of the new node *)
|
||||||
|
FOUND_NODE
|
||||||
|
{left = EMPTY, key = insString, follow = t, right = EMPTY}
|
||||||
|
else if breakPos = insStringContainsNodeString then
|
||||||
|
(* follow *)
|
||||||
|
let
|
||||||
|
val follow = helpIns (breakPos, insString, follow)
|
||||||
|
in
|
||||||
|
FOUND_NODE
|
||||||
|
{left = left, key = nodeString, follow = follow, right = right}
|
||||||
|
end
|
||||||
|
else
|
||||||
|
(* we have a difference: do we want to go left or right? *)
|
||||||
|
let
|
||||||
|
val insChr = String.sub (insString, breakPos)
|
||||||
|
val nodeChr = String.sub (nodeString, breakPos)
|
||||||
|
in
|
||||||
|
if insChr < nodeChr then
|
||||||
|
let
|
||||||
|
val left = helpIns (breakPos, insString, left)
|
||||||
|
in
|
||||||
|
FOUND_NODE
|
||||||
|
{ left = left
|
||||||
|
, key = nodeString
|
||||||
|
, follow = follow
|
||||||
|
, right = right
|
||||||
|
}
|
||||||
|
end
|
||||||
|
else
|
||||||
|
let
|
||||||
|
val right = helpIns (breakPos, insString, right)
|
||||||
|
in
|
||||||
|
FOUND_NODE
|
||||||
|
{ left = left
|
||||||
|
, key = nodeString
|
||||||
|
, follow = follow
|
||||||
|
, right = right
|
||||||
|
}
|
||||||
|
end
|
||||||
|
end
|
||||||
|
end
|
||||||
|
| EMPTY => LEAF insString
|
||||||
|
| LEAF nodeString =>
|
||||||
|
let
|
||||||
|
val breakPos = getBreakPos (pos, insString, nodeString)
|
||||||
|
in
|
||||||
|
if breakPos = String.size insString then
|
||||||
|
(* no change as user tried to insert string that already exists *)
|
||||||
|
t
|
||||||
|
else if breakPos = nodeStringContainsInsString then
|
||||||
|
(* add new node at current position,
|
||||||
|
* making current node a child of the new node *)
|
||||||
|
FOUND_NODE
|
||||||
|
{left = EMPTY, key = insString, follow = t, right = EMPTY}
|
||||||
|
else if breakPos = insStringContainsNodeString then
|
||||||
|
(* transform this node to a FOUND_NODE,
|
||||||
|
* and add a new LEAF at follow position *)
|
||||||
|
let
|
||||||
|
val newLeaf = LEAF insString
|
||||||
|
in
|
||||||
|
FOUND_NODE
|
||||||
|
{ key = nodeString
|
||||||
|
, follow = newLeaf
|
||||||
|
, left = EMPTY
|
||||||
|
, right = EMPTY
|
||||||
|
}
|
||||||
|
end
|
||||||
|
else
|
||||||
|
(* we have a difference: break string and add a new NODE.
|
||||||
|
* The way we insert has implications for searching.
|
||||||
|
* Search should descend down on all three children
|
||||||
|
* if searchKey contains nodeKey.
|
||||||
|
* *)
|
||||||
|
let
|
||||||
|
val breakString = String.substring (insString, 0, breakPos)
|
||||||
|
val insChr = String.sub (insString, breakPos)
|
||||||
|
val nodeChr = String.sub (nodeString, breakPos)
|
||||||
|
in
|
||||||
|
if insChr < nodeChr then
|
||||||
|
(* insString on left *)
|
||||||
|
NODE
|
||||||
|
{ left = LEAF insString
|
||||||
|
, right = t
|
||||||
|
, key = breakString
|
||||||
|
, follow = EMPTY
|
||||||
|
}
|
||||||
|
else
|
||||||
|
(* insString on right *)
|
||||||
|
NODE
|
||||||
|
{ left = t
|
||||||
|
, right = LEAF insString
|
||||||
|
, key = breakString
|
||||||
|
, follow = EMPTY
|
||||||
|
}
|
||||||
|
end
|
||||||
|
end
|
||||||
|
end
|
||||||
4
string-tries-sml/tests/string-set-tests.mlb
Normal file
4
string-tries-sml/tests/string-set-tests.mlb
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
$(SML_LIB)/basis/basis.mlb
|
||||||
|
|
||||||
|
../src/string-set.sml
|
||||||
|
string-set-tests.sml
|
||||||
209
string-tries-sml/tests/string-set-tests.sml
Normal file
209
string-tries-sml/tests/string-set-tests.sml
Normal file
@@ -0,0 +1,209 @@
|
|||||||
|
structure StringSetTests =
|
||||||
|
struct
|
||||||
|
fun assertTrue (isTrue, msg) =
|
||||||
|
if isTrue then ()
|
||||||
|
else
|
||||||
|
(print (msg ^ "\n"); raise Empty)
|
||||||
|
|
||||||
|
fun assertFalse (isFalse, msg) = assertTrue (not isFalse, msg)
|
||||||
|
|
||||||
|
(* below tests ported from https://github.com/kpol/trie/blob/master/src/KTrie.Tests/TrieTests.cs *)
|
||||||
|
|
||||||
|
fun testExists () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.fromList ["abc", "abde"]
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("abc", trie), "abc should exist")
|
||||||
|
val _ = assertTrue (StringSet.exists ("abde", trie), "abde should exist")
|
||||||
|
|
||||||
|
val _ = assertFalse (StringSet.exists ("a", trie), "a should not exist")
|
||||||
|
val _ = assertFalse (StringSet.exists ("ab", trie), "ab should not exist")
|
||||||
|
val _ = assertFalse (StringSet.exists ("abcd", trie), "abcd should not exist")
|
||||||
|
val _ = assertFalse (StringSet.exists ("abce", trie), "abce should not exist")
|
||||||
|
val _ = assertFalse (StringSet.exists ("x", trie), "x should not exist")
|
||||||
|
in
|
||||||
|
print "StringSet.exists passed all tests\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun testGetPrefixList () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.fromList ["abc", "abde", "abx", "abxx"]
|
||||||
|
|
||||||
|
val aMatches = StringSet.getPrefixList ("a", trie) = ["abc", "abde", "abx", "abxx"]
|
||||||
|
val _ = assertTrue (aMatches, "a matches")
|
||||||
|
|
||||||
|
val xMatches = StringSet.getPrefixList ("x", trie) = []
|
||||||
|
val _ = assertTrue (xMatches, "x matches")
|
||||||
|
|
||||||
|
val abMatches = StringSet.getPrefixList ("ab", trie) = ["abc", "abde", "abx", "abxx"]
|
||||||
|
val _ = assertTrue (abMatches, "ab matches")
|
||||||
|
|
||||||
|
val abcMatches = StringSet.getPrefixList ("abc", trie) = ["abc"]
|
||||||
|
val _ = assertTrue (abcMatches, "abc matches")
|
||||||
|
|
||||||
|
val abccMatches = StringSet.getPrefixList ("abcc", trie) = []
|
||||||
|
val _ = assertTrue (abccMatches, "abcc matches")
|
||||||
|
|
||||||
|
val abxMatches = StringSet.getPrefixList ("abx", trie) = ["abx", "abxx"]
|
||||||
|
val _ = assertTrue (abxMatches, "abx matches")
|
||||||
|
in
|
||||||
|
print "StringSet.getPrefixList passed all tests\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun remove1 () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.fromList ["a", "ab", "abc"]
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("a", trie), "a exists before remove1")
|
||||||
|
|
||||||
|
val trie = StringSet.remove ("a", trie)
|
||||||
|
val _ = assertFalse (StringSet.exists ("a", trie), "a does not exist after remove1")
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("ab", trie), "ab still exists after remove1")
|
||||||
|
val _ = assertTrue (StringSet.exists ("abc", trie), "abc still exists after remove1")
|
||||||
|
in
|
||||||
|
print "StringSet.remove: passed remove1\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun remove2 () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.fromList ["a", "ab", "abc", "abd"]
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("ab", trie), "ab exists before remove2")
|
||||||
|
val trie = StringSet.remove ("ab", trie)
|
||||||
|
val _ = assertFalse (StringSet.exists ("ab", trie), "ab no longer exists after remove2")
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("a", trie), "remove2 contains a")
|
||||||
|
val _ = assertTrue (StringSet.exists ("abc", trie), "remove2 contains abc")
|
||||||
|
val _ = assertTrue (StringSet.exists ("abd", trie), "remove2 contains abd")
|
||||||
|
in
|
||||||
|
print "StringSet.remove: passed remove2\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun remove3 () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.fromList ["abc"]
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("abc", trie), "abc exists before remove3")
|
||||||
|
val trie = StringSet.remove ("abc", trie)
|
||||||
|
val _ = assertFalse (StringSet.exists ("abc", trie), "abc no longer exists after remove3")
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.isEmpty trie, "trie is empty after remove3")
|
||||||
|
|
||||||
|
val trie = StringSet.insert ("abc", trie)
|
||||||
|
val _ = assertTrue (StringSet.exists ("abc", trie), "abc exists after insertion in remove3")
|
||||||
|
in
|
||||||
|
print "StringSet.remove: passed remove3\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun remove4 () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.fromList ["abc", "abcd"]
|
||||||
|
val _ = assertTrue (StringSet.exists ("abc", trie), "abc exists before remove4")
|
||||||
|
|
||||||
|
val trie = StringSet.remove ("abc", trie)
|
||||||
|
val _ = assertTrue (StringSet.exists ("abcd", trie), "abcd still exists after removing abc in remove4")
|
||||||
|
val _ = assertFalse (StringSet.exists ("abc", trie), "abc doesn't exist after remove in remove4")
|
||||||
|
in
|
||||||
|
print "StringSet.remove: passed remove4\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun remove5 () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.fromList ["abc", "ab", "ade", "abcde", "x"]
|
||||||
|
|
||||||
|
val trie2 = StringSet.remove ("xy", trie)
|
||||||
|
val _ = assertTrue (trie = trie2, "removing key (xy) which doesn't exist in trie returns same trie")
|
||||||
|
|
||||||
|
val trie3 = StringSet.remove ("abcd", trie)
|
||||||
|
val _ = assertTrue (trie = trie3, "removing key (abcd) which doesn't exist in trie returns same trie")
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("abcde", trie), "abcde exists before remove in remove5")
|
||||||
|
val trie = StringSet.remove ("abcde", trie)
|
||||||
|
val _ = assertFalse (StringSet.exists ("abcde", trie), "abcde does not exist after remove in remove5")
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("x", trie), "x exists before remove in remove5")
|
||||||
|
val trie = StringSet.remove ("x", trie)
|
||||||
|
val _ = assertFalse (StringSet.exists ("x", trie), "x does not exist after remove in remove5")
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("abc", trie), "abc exists before remove in remove5")
|
||||||
|
val trie = StringSet.remove ("abc", trie)
|
||||||
|
val _ = assertFalse (StringSet.exists ("abc", trie), "abc does not exist after remove in remove5")
|
||||||
|
|
||||||
|
val _ = assertTrue (StringSet.exists ("ab", trie), "trie still contains ab after removals in remove5")
|
||||||
|
val _ = assertTrue (StringSet.exists ("ade", trie), "trie still contains ade after removals in remove5")
|
||||||
|
in
|
||||||
|
print "StringSet.remove: passed remove5\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun insert1 () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.empty
|
||||||
|
val _ = assertFalse (StringSet.exists ("abc", trie), "abc does not exist before it is added to trie")
|
||||||
|
|
||||||
|
val trie = StringSet.insert ("abc", trie)
|
||||||
|
val _ = assertTrue (StringSet.exists ("abc", trie), "abc exists after being addedd to trie")
|
||||||
|
in
|
||||||
|
print "StringSet.insert: passed insert1\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun insert2 () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.fromList ["abc"]
|
||||||
|
val trie2 = StringSet.insert ("abc", trie)
|
||||||
|
|
||||||
|
val _ = assertTrue (trie = trie2, "trie handles duplicate insertion by returning same trie")
|
||||||
|
in
|
||||||
|
print "StringSet.insert: passed insert2\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun insert3 () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.empty
|
||||||
|
val _ = assertFalse (StringSet.exists ("abcd", trie), "empty trie does not contain abcd")
|
||||||
|
|
||||||
|
val trie = StringSet.insert ("abcd", trie)
|
||||||
|
val _ = assertTrue (StringSet.exists ("abcd", trie), "abcd exists after inserting it into empty trie")
|
||||||
|
val _ = assertFalse (StringSet.exists ("abc", trie), "abc does not exist before inserting it")
|
||||||
|
|
||||||
|
val trie = StringSet.insert ("abc", trie)
|
||||||
|
val _ = assertTrue (StringSet.exists ("abc", trie), "abc exists after inserting it into trie")
|
||||||
|
|
||||||
|
val trie = StringSet.remove ("abc", trie)
|
||||||
|
val _ = assertFalse (StringSet.exists ("abc", trie), "abc does not exist after removing it from trie")
|
||||||
|
in
|
||||||
|
print "StringSet.insert: passed insert3\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun getPrefixList () =
|
||||||
|
let
|
||||||
|
val trie = StringSet.fromList ["z", "xc", "x", "zzz", "abc", "abcde"]
|
||||||
|
val lst = StringSet.getPrefixList ("ab", trie)
|
||||||
|
|
||||||
|
val _ = assertTrue (lst = ["abc", "abcde"], "prefix list contains abc and abcde when prefix is 'ab'")
|
||||||
|
in
|
||||||
|
print "StringSet.getPrefixList: passed\n"
|
||||||
|
end
|
||||||
|
|
||||||
|
fun run () =
|
||||||
|
let
|
||||||
|
val _ = testExists ()
|
||||||
|
val _ = testGetPrefixList ()
|
||||||
|
|
||||||
|
val _ = remove1 ()
|
||||||
|
val _ = remove2 ()
|
||||||
|
val _ = remove3 ()
|
||||||
|
val _ = remove4 ()
|
||||||
|
val _ = remove5 ()
|
||||||
|
|
||||||
|
val _ = insert1 ()
|
||||||
|
val _ = insert2 ()
|
||||||
|
val _ = insert3 ()
|
||||||
|
|
||||||
|
val _ = getPrefixList ()
|
||||||
|
in
|
||||||
|
()
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
val _ = StringSetTests.run ()
|
||||||
Reference in New Issue
Block a user