reorganise repository

2024-05-27 13:28:09 +01:00
parent 1bc468238e
commit b5c70772fa
14 changed files with 3 additions and 459833 deletions

2
.gitignore vendored

@@ -8,4 +8,4 @@
 /examples.du
 /examples.ud
-/out
+/bench/out


@@ -2,44 +2,6 @@
 ## Introduction
-Standard ML port of [this](https://github.com/hummy123/brolib) rope implementation.
-This particular rope uses the balancing scheme described in the [Purely Functional 1-2 Brother Trees paper authored by Ralf Hinze](https://www.cs.ox.ac.uk/ralf.hinze/publications/Brother12.pdf). It tries to keep the number of nodes to a minimum by joining the strings in adjacent leaf nodes, if joining would not be too expensive.
+Implementations of various data structures for manipulating text (inserting into the data structure and removing from it).
+Currently experimenting; not for public use.
-## Usage
-The two files are `rope.sml` and `tiny_rope.sml`.
-`rope.sml` contains a rope that tracks line metadata (which has a small performance and memory penalty). This is useful if you have line-based operations in mind.
-`tiny_rope.sml` doesn't track line metadata, and is useful when line queries aren't needed.
-Except for the line-based operations `appendLine` and `foldLines`, all functions are the same between the two (aside from `verifyLines`, which is just for testing purposes).
-Examples of usage can be found in [`examples.sml`](https://github.com/hummy123/brolib-sml/blob/main/examples.sml).
-## Performance
-These two ropes are both quite fast.
-I compared the OCaml port with the other text data structures in OCaml, and it beat those handily when processing the datasets from [here](https://github.com/josephg/editing-traces), which just test insertion and deletion. It was also faster than the others at taking substrings.
-I don't know of other Standard ML libraries to compare it to, but with MLton, this rope implementation beats [the fastest ropes in Rust](https://github.com/josephg/jumprope-rs#benchmarks) at insertion and deletion quite easily, never exceeding 1 ms, even on the slowest dataset.
-I don't know how to explain this surprising result, but most of the credit must go to the MLton compiler. This result might also be explained by some entirely untested theories that may or may not be true:
-- MLton may have optimised the dataset (which is pure Standard ML).
-- These benchmarks have an unfair advantage because the datasets are cache-friendly vectors/arrays.
-- These ropes are likely slower on queries (those Rust ropes use B-Trees, which are more cache-friendly).
-- The other ropes may track more metadata (like UTF-8/16/32 indices), which would take a little more time.
-Here are some numbers in nanoseconds, running on a single core of a Raspberry Pi 5 with 8 GB of RAM:
-| Dataset | rope.sml time | tiny_rope.sml time |
-|-----------------|---------------|--------------------|
-| automerge-paper | 10,018 ns | 9,726 ns |
-| rustcode | 79,896 ns | 74,479 ns |
-| sveltecomponent | 280,654 ns | 250,744 ns |
-| seph-blog1 | 703,868 ns | 589,501 ns |
-The relevant Rust rope libraries have benchmarks [here](https://github.com/josephg/jumprope-rs/blob/master/README.md#benchmarks) for reference.
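
The leaf-joining behaviour described in the removed introduction (adjacent leaf strings are concatenated when that is cheap enough) can be illustrated with a minimal sketch. It assumes a hypothetical, unbalanced `rope` datatype and a `targetLength` of 1024. Neither is taken from `rope.sml`, and the 1-2 brother tree balancing is deliberately left out.

```sml
(* Hypothetical, simplified rope; NOT the repository's Rope structure.
 * A Node stores the size of its left subtree so that sizes can be
 * computed without walking the left side. *)
datatype rope =
    Leaf of string
  | Node of rope * int * rope

(* Assumed threshold above which two strings are no longer joined. *)
val targetLength = 1024

fun size (Leaf s) = String.size s
  | size (Node (_, leftSize, r)) = leftSize + size r

fun toString (Leaf s) = s
  | toString (Node (l, _, r)) = toString l ^ toString r

fun countLeaves (Leaf _) = 1
  | countLeaves (Node (l, _, r)) = countLeaves l + countLeaves r

(* Append at the end. If the rightmost leaf stays within targetLength
 * after joining, concatenate instead of allocating a new node. *)
fun append (str, Leaf s) =
      if String.size s + String.size str <= targetLength then
        Leaf (s ^ str)
      else
        Node (Leaf s, String.size s, Leaf str)
  | append (str, Node (l, leftSize, r)) =
      Node (l, leftSize, append (str, r))

(* Two short appends collapse into one leaf. *)
val small = append ("world", append ("hello, ", Leaf ""))

(* A leaf already at the threshold is not joined further. *)
val full = CharVector.tabulate (targetLength, fn _ => #"a")
val large = append ("b", Leaf full)
```

Here `small` ends up as a single leaf, while `large` is split into two leaves because joining would exceed the assumed threshold.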

259783
automerge.sml

File diff suppressed because it is too large


@@ -1,17 +0,0 @@
$(SML_LIB)/basis/basis.mlb
ann
  "allowVectorExps true"
in
  svelte.sml
  rust.sml
  seph.sml
  automerge.sml
end
tiny_rope.sml
rope.sml
gap_buffer.sml
utils.sml


@@ -1,6 +0,0 @@
$(SML_LIB)/basis/basis.mlb
tiny_rope.sml
tiny_rope23.sml
rope.sml
examples.sml


@@ -1,139 +0,0 @@
(* An empty rope, containing no strings. *)
val rope = Rope.empty;
(* Initialise rope from a string.
*
* You probably want to avoid initialising the rope with very long strings,
* because a rope is meant to represent a long string
* by holding nodes that contain smaller strings in a binary tree.
* The implementation avoids building strings that are ever larger than 1024,
* but that was done in an attempt to find the ideal length for performance.
* A user shouldn't notice any delays in larger lengths like 65535 either.
*
* In their text buffer (a piece-tree, which is slower than a rope),
* the VS Code team had other issues with excessively large strings.
* https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation#_avoid-the-string-concatenation-trap *)
val rope = Rope.fromString "hello, world!\n";
(* Convert a rope to a string.
*
* This may involve allocating an extremely large string in some cases,
* which should be avoided for the reason mentioned in the above comment. *)
val str = Rope.toString rope;
(* Insert a string into the rope.
*
* There isn't any validation to check that you inserted at a reasonable
* position.
* If you insert at an index lower than 0, your inserted string is just
* prepended to the start.
* If you insert at an index greater than the length, your inserted string is
* just appended to the end.
*
* One thing to watch out for if you are using the line-rope is making sure
* that you don't insert in the middle of a \r\n pair, separating \r from \n.
* That would mess up the line metadata the rope contains and make the line
* metadata invalid. *)
val rope = Rope.insert (14, "goodbye, world!", rope);
(* Append a string into the rope. *)
val rope = Rope.append ("hello again\n", rope);
(* Append a string into the rope, providing line metadata with it.
*
* The point of this function is for performance: the other insertion functions
* calculate the line metadata by scanning the string itself, but in some cases
* this is already known. The larger example below is such a case. *)
val rope = Rope.appendLine ("my new line", Vector.fromList [], rope);
(** Second, larger example motivating Rope.appendLine. *)
(*** Returns the index of the line break that ends a line,
 *** which is the index of \r if the line ends with a \r\n pair. *)
fun getLineStart line =
  let
    val lastIdx = String.size line - 1
    val lastChr = String.sub (line, lastIdx)
  in
    if lastChr = #"\n" andalso lastIdx - 1 >= 0 then
      if String.sub (line, lastIdx - 1) = #"\r" then lastIdx - 1 else lastIdx
    else
      lastIdx
  end;
(*** Appends the lines in a file to a rope. *)
fun readLines (rope, file) =
  case TextIO.inputLine file of
    SOME line =>
      let
        (* No need to scan the string for line breaks,
         * because we already know where the line ends. *)
        val lineIdx = getLineStart (line)
        val vec = Vector.fromList [lineIdx]
        val rope = Rope.appendLine (line, vec, rope)
      in
        readLines (rope, file)
      end
  | NONE => rope;
val licenseRope = readLines (Rope.empty, TextIO.openIn "LICENSE");
(* Deletes the given range from rope, from the start index to the end index.
*
* As with insert, one should make sure they don't corrupt the line metadata.
* Specifically, in a \r\n pair, the line metadata points to \r.
* Deleting \r would corrupt it, but deleting \n would be fine.
* In general, if you want to delete a line break, you would want to delete both
* \r and \n. The user thinks of the \r\n pair as a single character so they are
* expecting the whole line break to be deleted. *)
(** Initialise new rope. *)
val rope = Rope.fromString "hello, world!";
(** New rope contains "hello world!" without comma. *)
val rope = Rope.delete (5, 1, rope);
(* Folds over the characters in a rope, starting from the given index.
*
* This is meant to be an alternative to queries for a specific line or a
* substring.
* If a rope is meant to avoid allocating large strings, then it seems more
* performant to query its contents through higher-order functions rather than
* allocating substrings and querying the substring. *)
val rope = Rope.fromString "hello!";
fun apply (chr, lst) = chr :: lst;
(** val result = [#"!",#"o",#"l",#"l",#"e"] : char list *)
val result = Rope.foldFromIdx (apply, 1, rope, []);
(* Folds over the characters in a rope, accepting a predicate function
* that terminates the fold when it returns true. *)
fun apply (chr, acc) =
  (print (Char.toString chr); acc + 1);
fun term acc = acc = 3;
(** The call below prints the first three letters, "hel",
 ** and then stops folding. *)
val _ = Rope.foldFromIdxTerm (apply, term, 0, rope, 0);
(* Folds over the characters in a rope, starting from the given line number.
*
* This is just like the foldFromIdxTerm function, except that it starts folding
* from the given line number instead. *)
val rope = Rope.fromString "hello, world!\ngoodbye, world!\nhello again!";
fun apply (chr, _) =
print (Char.toString chr);
fun term _ = false;
(** Below line prints the whole string, one character at a time. *)
Rope.foldLines (apply, term, 0, rope, ());
(** Prints starting from #"g" in "goodbye". *)
Rope.foldLines (apply, term, 1, rope, ());
(** Prints the very last line. *)
Rope.foldLines (apply, term, 2, rope, ());
(** Prints the whole string if specifying a line before 0, which doesn't exist. *)
Rope.foldLines (apply, term, ~3, rope, ());
(** Raises a subscript exception: there is no corresponding line in the rope. *)
Rope.foldLines (apply, term, 4, rope, ());
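
The `foldLines` behaviour shown in these examples (start at a line, fold until the terminating predicate holds, treat lines before 0 as line 0, raise `Subscript` for a line that does not exist) can be mimicked on a plain string. The following is only a sketch under assumptions: `lineStart` and `foldLinesStr` are hypothetical names, it only recognises `\n` as a line break, and it says nothing about how `rope.sml` walks its tree or uses its line metadata.

```sml
(* Hypothetical helpers operating on a plain string, not on a rope. *)

(* Index of the first character of line lineNum (0-based).
 * Line numbers below 0 are treated as line 0.
 * Raises Subscript if the string has no such line. *)
fun lineStart (s, lineNum) =
  let
    val n = String.size s
    fun loop (i, remaining) =
      if remaining <= 0 then i
      else if i >= n then raise Subscript
      else if String.sub (s, i) = #"\n" then loop (i + 1, remaining - 1)
      else loop (i + 1, remaining)
  in
    loop (0, lineNum)
  end

(* Fold over characters from the start of lineNum, stopping early
 * once term returns true for the accumulator. *)
fun foldLinesStr (apply, term, lineNum, s, acc) =
  let
    val n = String.size s
    fun loop (i, acc) =
      if i >= n orelse term acc then acc
      else loop (i + 1, apply (String.sub (s, i), acc))
  in
    loop (lineStart (s, lineNum), acc)
  end

val text = "hello, world!\ngoodbye, world!\nhello again!"
fun collect (chr, acc) = chr :: acc
fun never _ = false
fun collected lineNum =
  String.implode (List.rev (foldLinesStr (collect, never, lineNum, text, [])))

val lastLine = collected 2     (* "hello again!" *)
val wholeText = collected ~3   (* the whole string *)
val firstThree =
  List.length (foldLinesStr (collect, fn acc => List.length acc = 3, 0, text, []))
```

Checking the predicate before each `apply` call is a design choice made for this sketch. The real function may check at a different point.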

41081
rust.sml

File diff suppressed because one or more lines are too long

138556
seph.sml

File diff suppressed because one or more lines are too long

19994
svelte.sml

File diff suppressed because one or more lines are too long

216
utils.sml

@@ -1,216 +0,0 @@
fun runTxns arr =
  Vector.foldl
    (fn ((pos, delNum, insStr), rope) =>
       let
         val rope =
           if delNum > 0 then GapBuffer.delete (pos, delNum, rope) else rope
         val strSize = String.size insStr
         val rope =
           if strSize > 0 then GapBuffer.insert (pos, insStr, rope) else rope
       in
         rope
       end) GapBuffer.empty arr
fun runTxnsTime arr =
  let
    val startTime = Time.now ()
    val startTime = Time.toMilliseconds startTime
    val x = runTxns arr
    val endTime = Time.now ()
    val endTime = Time.toMilliseconds endTime
    val timeDiff = endTime - startTime
    val timeDiff = LargeInt.toString timeDiff
    val timeTook = String.concat ["took ", timeDiff, " ms\n"]
    val _ = (print timeTook)
  in
    x
  end
fun compareTxns arr =
  Vector.foldli
    (fn (idx, (pos, delNum, insStr), (rope, gapBuffer)) =>
       let
         val oldRope = rope
         val strSize = String.size insStr
         val rope =
           if delNum > 0 then TinyRope.delete (pos, delNum, rope) else rope
         val rope =
           if strSize > 0 then TinyRope.insert (pos, insStr, rope) else rope
         val gapBuffer =
           if delNum > 0 then GapBuffer.delete (pos, delNum, gapBuffer)
           else gapBuffer
         val gapBuffer =
           if strSize > 0 then GapBuffer.insert (pos, insStr, gapBuffer)
           else gapBuffer
         val ropeString = TinyRope.toString rope
         val gapBufferString = GapBuffer.toString gapBuffer
       in
         if ropeString = gapBufferString then
           (rope, gapBuffer)
         else
           let
             val _ = print
               ("difference detected at txn number: " ^ (Int.toString idx)
                ^ "\n")
             val txn = String.concat
               [ "offending txn: \n"
               , "pos: "
               , Int.toString pos
               , ", delNum: "
               , Int.toString delNum
               , ", insStr: |"
               , insStr
               , "|\n"
               ]
             val _ = print txn
             val _ = print "before offending string: \n"
             val _ = print (TinyRope.toString oldRope)
             val _ = print "\n"
             val _ = print "rope string: \n"
             val _ = print (ropeString ^ "\n")
             val _ = print "gap string: \n"
             val _ = print (gapBufferString ^ "\n")
             val _ = raise Empty
           in
             (rope, gapBuffer)
           end
       end) (TinyRope.empty, GapBuffer.empty) arr
fun runToString rope = GapBuffer.toString rope
fun writeFile filename acc =
  let
    val str = String.concatWith "," acc
    val fd = TextIO.openOut filename
    val _ = TextIO.output (fd, str) handle e => (TextIO.closeOut fd; raise e)
    val _ = TextIO.closeOut fd
  in
    ()
  end
fun write (fileName, rope) =
  let
    val str = GapBuffer.toString rope
    val io = TextIO.openOut fileName
    val _ = TextIO.output (io, str)
    val _ = TextIO.closeOut io
  in
    ()
  end
fun runTxnsStats (ins, del, empty, arr) =
  Vector.foldl
    (fn ((pos, delNum, insStr), (buffer, lst)) =>
       let
         val startTime = Time.now ()
         val startTime = Time.toMilliseconds startTime
         val buffer = if delNum > 0 then del (pos, delNum, buffer) else buffer
         val strSize = String.size insStr
         val buffer = if strSize > 0 then ins (pos, insStr, buffer) else buffer
         val endTime = Time.now ()
         val endTime = Time.toMilliseconds endTime
         val timeDiff = endTime - startTime
         val lst = timeDiff :: lst
       in
         (buffer, lst)
       end) (empty, []) arr
fun printListStats (lst, min, max) =
  case lst of
    [] =>
      let
        val _ = print ("minimum time: " ^ LargeInt.toString min ^ "\n")
        val _ = print ("maximum time: " ^ LargeInt.toString max ^ "\n")
        val _ = print "\n"
      in
        ()
      end
  | hd :: tl =>
      let
        val min = LargeInt.min (min, hd)
        val max = LargeInt.max (max, hd)
      in
        printListStats (tl, min, max)
      end
fun runTxnsAndGetStats (ins, del, empty, arr) =
  let
    val (buffer, lst) = runTxnsStats (ins, del, empty, arr)
    val _ = printListStats (lst, LargeInt.fromInt 1000, LargeInt.fromInt ~1000)
  in
    buffer
  end
fun printBufferStats (title, ins, del, empty) =
  let
    val _ = print (title ^ "\n")
    val _ = runTxnsAndGetStats (ins, del, empty, SvelteComponent.txns)
    val _ = runTxnsAndGetStats (ins, del, empty, RustCode.txns)
    val _ = runTxnsAndGetStats (ins, del, empty, SephBlog.txns)
    val _ = runTxnsAndGetStats (ins, del, empty, AutomergePaper.txns)
  in
    ()
  end
fun main () =
  let
    (* Timing benchmarks. *)
    val svelte = runTxnsTime SvelteComponent.txns
    val rust = runTxnsTime RustCode.txns
    val seph = runTxnsTime SephBlog.txns
    val automerge = runTxnsTime AutomergePaper.txns
    val _ = print "\n"
    val _ =
      printBufferStats
        ( "GAP BUFFER STATS: "
        , GapBuffer.insert
        , GapBuffer.delete
        , GapBuffer.empty
        )
    val _ =
      printBufferStats
        ("TINY ROPE STATS: ", TinyRope.insert, TinyRope.delete, TinyRope.empty)
    (* Tests for correctness; will fail if incorrect. *)
    (** Tests for insertion correctness (compare against rope). **)
    val _ = compareTxns SvelteComponent.txns
    val _ = print "svelte test passed\n"
    val _ = compareTxns RustCode.txns
    val _ = print "rust test passed\n"
    val _ = compareTxns SephBlog.txns
    val _ = print "seph test passed\n"
    val _ = compareTxns AutomergePaper.txns
    val _ = print "automerge test passed\n"
    (* Tests for line metadata. *)
    (*
    val _ = Rope.verifyLines svelte
    val _ = Rope.verifyLines rust
    val _ = Rope.verifyLines seph
    val _ = Rope.verifyLines automerge
    *)
    val _ = write ("out/svelte_gap.txt", svelte)
    val _ = write ("out/rust23_gap.txt", rust)
    val _ = write ("out/seph23_gap.txt", seph)
    val _ = write ("out/automerge_gap.txt", automerge)
  in
    ()
  end
val _ = main ()
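
`utils.sml` measures elapsed time in milliseconds and tracks the minimum and maximum with fixed starting values (1000 and ~1000). A finer-grained variant might look like the sketch below. `timeNanos` and `minMax` are hypothetical helper names that do not exist in the repository; the sketch relies only on `Time.now`, `Time.-` and `Time.toNanoseconds` from the Standard ML Basis Library, and wall-clock resolution still depends on the platform.

```sml
(* Hypothetical helpers, not part of utils.sml. *)

(* Run f once and return its result together with the elapsed
 * wall-clock time in nanoseconds. *)
fun timeNanos (f : unit -> 'a) : 'a * LargeInt.int =
  let
    val start = Time.now ()
    val result = f ()
    val finish = Time.now ()
  in
    (result, Time.toNanoseconds (Time.- (finish, start)))
  end

(* Minimum and maximum of a list of timings, seeded from the first
 * element so that no fixed starting values are needed.
 * Returns NONE for an empty list. *)
fun minMax ([] : LargeInt.int list) = NONE
  | minMax (x :: xs) =
      SOME
        (List.foldl
           (fn (t, (lo, hi)) => (LargeInt.min (lo, t), LargeInt.max (hi, t)))
           (x, x) xs)

(* Usage: time a small computation, then summarise some sample timings. *)
val (sum, elapsed) = timeNanos (fn () => List.foldl (op +) 0 [1, 2, 3])
val stats = minMax (List.map LargeInt.fromInt [3, 1, 2])
```

Seeding from the first element means a run where every timing exceeds 1000 still reports its true minimum.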