use larger set of words in words.txt dataset

2024-09-09 15:19:18 +01:00
parent 7d707c9af4
commit fc0fd6923b
6 changed files with 886133 additions and 69250 deletions


@@ -78,4 +78,5 @@ The reason for implementing a new trie specialised to strings rather than using
 # Credits
-The tests in `tests/string-set-tests.sml` were ported from [kpol's Trie data structure in C3](https://github.com/kpol/trie), although this is not true for the files in the `src/` directory.
+- The tests in `tests/string-set-tests.sml` were ported from [kpol's Trie data structure in C3](https://github.com/kpol/trie).
+- The words.txt dataset in `bench/words.txt` is from [this repository](https://github.com/dwyl/english-words).


@@ -1,3 +1,5 @@
+(* generate a words.sml file with a vector of strings,
+ * from a line-delimited words.txt file *)
 val inIo = TextIO.openIn "words.txt"
 val outIO = TextIO.openOut "words.sml"
@@ -11,11 +13,14 @@ fun writeLines (outIO, lst) =
     [] => ()
   | word :: tl =>
       let
-        val word = String.substring (word, 0, String.size word - 2)
-        val isLast = tl = []
-        val word =
-          if isLast then "\"" ^ word ^ "\""
-          else "\"" ^ word ^ "\",\n"
+        (* remove \r and \n from the word *)
+        val word = Substring.full word
+        val word =
+          Substring.dropr (fn chr => chr = #"\n" orelse chr = #"\r") word
+        val word = Substring.string word
+        val isLast = tl = []
+        val word = if isLast then "\"" ^ word ^ "\"" else "\"" ^ word ^ "\",\n"
         val _ = TextIO.output (outIO, word)
       in
         writeLines (outIO, tl)

Binary file not shown.

Binary file not shown.

File diff suppressed because it is too large

File diff suppressed because it is too large