CCParse
Very Simple Parser Combinators
These combinators can be used to write very simple parsers, for example to extract data from a line-oriented file, or as a replacement to Scanf
.
Some more advanced example(s) can be found in the /examples
directory.
open CCParse;;
type tree = L of int | N of tree * tree;;
let mk_leaf x = L x
let mk_node x y = N(x,y)
let ptree = fix @@ fun self ->
skip_space *>
( (char '(' *> (pure mk_node <*> self <*> self) <* char ')')
<|>
(U.int >|= mk_leaf) )
;;
parse_string_exn ptree "(1 (2 3))" ;;
parse_string_exn ptree "((1 2) (3 (4 5)))" ;;
open Containers.Parse;;
let p = U.list ~sep:"," U.word;;
parse_string_exn p "[abc , de, hello ,world ]";;
This makes a list of 100_000 integers, prints it and parses it back.
let p = CCParse.(U.list ~sep:"," U.int);;
let l = CCList.(1 -- 100_000);;
let l_printed =
CCFormat.(to_string (within "[" "]" (list ~sep:(return ",@,") int))) l;;
let l' = CCParse.parse_string_exn p l_printed;;
assert (l=l');;
Some functions are marked "experimental" and are still subject to change.
module Position : sig ... end
module Error : sig ... end
type +'a or_error = ('a, Error.t) Stdlib.result
'a or_error
is either Ok x
for some result x : 'a
, or an error Error.t
.
See stringify_result
and Error.to_string
to print the error message.
exception ParseError of Error.t
val return : 'a -> 'a t
Always succeeds, without consuming its input.
bind f p
results in a new parser which behaves as p
then, in case of success, applies f
to the result.
val eoi : unit t
Expect the end of input, fails otherwise.
val empty : unit t
Succeed with ()
.
val fail : string -> 'a t
fail msg
fails with the given message. It can trigger a backtrack.
parsing s p
behaves the same as p
, with the information that we are parsing s
, if p
fails. The message s
is added to the error, it does not replace it, not does the location change (the error still points to the same location as in p
).
set_error_message msg p
behaves like p
, but if p
fails, set_error_message msg p
fails with msg
instead and at the current position. The internal error message of p
is just discarded.
with_pos p
behaves like p
, but returns the (starting) position along with p
's result.
EXPERIMENTAL
val any_char : char t
any_char
parses any character. It still fails if the end of input was reached.
val any_char_n : int -> string t
any_char_n len
parses exactly len
characters from the input. Fails if the input doesn't contain at least len
chars.
val char : char -> char t
char c
parses the character c
and nothing else.
A slice of the input, as returned by some combinators such as split_1
or split_list
or take
.
The idea is that one can use some parsers to cut the input into slices, e.g. split into lines, or split a line into fields (think CSV or TSV). Then a variety of parsers can be used on each slice to extract data from it using recurse
.
Slices contain enough information to make it possible for recurse slice p
to report failures (if p
fails) using locations from the original input, not relative to the slice. Therefore, even after splitting the input into lines using, say, each_line
, a failure to parse the 500th line will be reported at line 500 and not at line 1.
EXPERIMENTAL
module Slice : sig ... end
Functions on slices.
recurse slice p
parses the slice
(most likely obtained via another combinator, such as split_1
or split_list
), using p
.
The slice contains a position which is used to relocate error messages to their position in the whole input, not just relative to the slice.
EXPERIMENTAL
set_current_slice slice
replaces the parser's state with slice
.
EXPERIMENTAL
val chars_fold :
f:
('acc ->
char ->
[ `Continue of 'acc
| `Consume_and_stop of 'acc
| `Stop of 'acc
| `Fail of string ]) ->
'acc ->
('acc * slice) t
chars_fold f acc0
folds over characters of the input. Each char c
is passed, along with the current accumulator, to f
; f
can either:
`Stop acc
. In this case the final accumulator acc
is returned, and c
is not consumed.`Consume_and_stop acc
.`Fail msg
. In this case the parser fails with the given message.`Continue acc
. The parser continues to the next char with the new accumulator.This is a generalization of of chars_if
that allows one to transform characters on the fly, skip some, handle escape sequences, etc. It can also be useful as a base component for a lexer.
val chars_fold_transduce :
f:
('acc ->
char ->
[ `Continue of 'acc
| `Yield of 'acc * char
| `Consume_and_stop
| `Stop
| `Fail of string ]) ->
'acc ->
('acc * string) t
Same as chars_fold
but with the following differences:
`Continue _
. The string is built from characters returned by `Yield
.`Yield (acc, c)
adds c
to the returned string and continues parsing with acc
.take_until_success p
accumulates characters of the input into a slice, until p
successfully parses a value x
; then it returns slice, x
.
NOTE performance wise, if p
does a lot of work at each position, this can be costly (thing naive substring search if p
is string "very long needle"
).
take len
parses exactly len
characters from the input. Fails if the input doesn't contain at least len
chars.
take_if f
takes characters as long as they satisfy the predicate f
.
take1_if f
takes characters as long as they satisfy the predicate f
. Fails if no character satisfies f
.
val char_if : ?descr:string -> (char -> bool) -> char t
char_if f
parses a character c
if f c = true
. Fails if the next char does not satisfy f
.
val chars_if : (char -> bool) -> string t
chars_if f
parses a string of chars that satisfy f
. Cannot fail.
val chars1_if : ?descr:string -> (char -> bool) -> string t
Like chars_if
, but accepts only non-empty strings. chars1_if p
fails if the string accepted by chars_if p
is empty. chars1_if p
is equivalent to take1_if p >|= Slice.to_string
.
val endline : char t
Parse '\n'.
val space : char t
Tab or space.
val white : char t
Tab or space or newline.
val skip_chars : (char -> bool) -> unit t
Skip 0 or more chars satisfying the predicate.
val skip_space : unit t
Skip ' ' and '\t'.
val skip_white : unit t
Skip ' ' and '\t' and '\n'.
suspend f
is the same as f ()
, but evaluates f ()
only when needed.
A practical use case is to implement recursive parsers manually, as described in fix
. The parser is let rec p () = …
, and suspend p
can be used in the definition to use p
.
val string : string -> string t
string s
parses exactly the string s
, and nothing else.
many p
parses p
repeatedly, until p
fails, and collects the results into a list.
optional p
tries to parse p
, and return ()
whether it succeeded or failed. Cannot fail itself. It consumes input if p
succeeded (as much as p
consumed), but consumes not input if p
failed.
try_ p
is just like p
(it used to play a role in backtracking semantics but no more).
try_opt p
tries to parse using p
, and return Some x
if p
succeeded with x
(and consumes what p
consumed). Otherwise it returns None
and consumes nothing. This cannot fail.
many_until ~until p
parses as many p
as it can until the until
parser successfully returns. If p
fails before that then many_until ~until p
fails as well. Typically until
can be a closing ')' or another termination condition, and what is consumed by until
is also consumed by many_until ~until p
.
EXPERIMENTAL
try_or p1 ~f ~else_:p2
attempts to parse x
using p1
, and then becomes f x
. If p1
fails, then it becomes p2
. This can be useful if f
is expensive but only ever works if p1
matches (e.g. after an opening parenthesis or some sort of prefix).
try_or_l ?else_ l
tries each pair (test, p)
in order. If the n-th test
succeeds, then try_or_l l
behaves like n-th p
, whether p
fails or not. If test
consumes input, the state is restored before calling p
. If they all fail, and else_
is defined, then it behaves like else_
. If all fail, and else_
is None
, then it fails as well.
This is a performance optimization compared to (<|>)
. We commit to a branch if the test succeeds, without backtracking at all. It can also provide better error messages, because failures in the parser will not be reported as failures in try_or_l
.
See lookahead_ignore
for a convenient way of writing the test conditions.
or_ p1 p2
tries to parse p1
, and if it fails, tries p2
from the same position.
both a b
parses a
, then b
, then returns the pair of their results.
many1 p
is like many p
excepts it fails if the list is empty (i.e. it needs p
to succeed at least once).
skip p
parses zero or more times p
and ignores its result. It is eager, meaning it will continue as long as p
succeeds. As soon as p
fails, skip p
stops consuming any input.
Same as sep
but stop when until
parses successfully.
lookahead p
behaves like p
, except it doesn't consume any input.
EXPERIMENTAL
lookahead_ignore p
tries to parse input with p
, and succeeds if p
succeeds. However it doesn't consume any input and returns ()
, so in effect its only use-case is to detect whether p
succeeds, e.g. in try_or_l
.
EXPERIMENTAL
Fixpoint combinator. fix (fun self -> p)
is the parser p
, in which self
refers to the parser p
itself (which is useful to parse recursive structures.
An alternative, manual implementation to let p = fix (fun self -> q)
is:
let rec p () =
let self = suspend p in
q
val line_str : string t
line_str
is line >|= Slice.to_string
. It parses the next line and turns the slice into a string. The state points to the character immediately after the '\n'
character.
split_1 ~on_char
looks for on_char
in the input, and returns a pair sl1, sl2
, where:
sl1
is the slice of the input the precedes the first occurrence of on_char
, or the whole input if on_char
cannot be found. It does not contain on_char
.sl2
is the slice that comes after on_char
, or None
if on_char
couldn't be found. It doesn't contain the first occurrence of on_char
(if any).The parser is now positioned at the end of the input.
EXPERIMENTAL
split_list ~on_char
splits the input on all occurrences of on_char
, returning a list of slices.
EXPERIMENTAL
split_list_at_most ~on_char n
applies split_1 ~on_char
at most n
times, to get a list of n+1
elements. The last element might contain on_char
. This is useful to limit the amount of work done by split_list
.
EXPERIMENTAL
split_2 ~on_char
splits the input into exactly 2 fields, and fails if the split yields less or more than 2 items. EXPERIMENTAL
split_list_map ~on_char p
uses split_list ~on_char
to split the input, then parses each chunk of the input thus obtained using p
.
The difference with sep ~by:(char on_char) p
is that sep
calls p
first, and only tries to find on_char
after p
returns. While it is more flexible, this technique also means p
has to be careful not to consume on_char
by error.
A useful specialization of this is each_line
, which is basically each_split ~on_char:'\n' p
.
EXPERIMENTAL
all
returns all the unconsumed input as a slice, and consumes it. Use Slice.to_string
to turn it into a string.
Note that lookahead all
can be used to peek at the rest of the input without consuming anything.
val all_str : string t
all_str
accepts all the remaining chars and extracts them into a string. Similar to all
but with a string.
EXPERIMENTAL
Memoize the parser. memo p
will behave like p
, but when called in a state (read: position in input) it has already processed, memo p
returns a result directly. The implementation uses an underlying hashtable. This can be costly in memory, but improve the run time a lot if there is a lot of backtracking involving p
.
Do not call memo
inside other functions, especially with (>>=)
, map
, etc. being so prevalent. Instead the correct way to use it is in a toplevel definition:
let my_expensive_parser = memo (foo *> bar >>= fun i -> …)
This function is not thread-safe.
module Infix : sig ... end
include module type of Infix
Alias to map
. p >|= f
parses an item x
using p
, and returns f x
.
Alias to bind
. p >>= f
results in a new parser which behaves as p
then, in case of success, applies f
to the result.
a <* b
parses a
into x
, parses b
and ignores its result, and returns x
.
a *> b
parses a
, then parses b
into x
, and returns x
. The result of a
is ignored.
Alias to or_
.
a <|> b
tries to parse a
, and if a
fails without consuming any input, backtracks and tries to parse b
, otherwise it fails as a
.
a <?> msg
behaves like a
, but if a
fails, a <?> msg
fails with msg
instead. Useful as the last choice in a series of <|>
. For example: a <|> b <|> c <?> "expected one of a, b, c"
.
Alias to both
. a ||| b
parses a
, then b
, then returns the pair of their results.
Let operators on OCaml >= 4.08.0, nothing otherwise
val stringify_result : 'a or_error -> ('a, string) Stdlib.result
Turn a Error.t
-oriented result into a more basic string result.
val parse_string : 'a t -> string -> ('a, string) Stdlib.result
Parse a string using the parser.
Version of parse_string
that returns a more detailed error.
val parse_string_exn : 'a t -> string -> 'a
val parse_file : 'a t -> string -> ('a, string) Stdlib.result
parse_file p filename
parses file named filename
with p
by opening the file and reading it whole.
Version of parse_file
that returns a more detailed error.
val parse_file_exn : 'a t -> string -> 'a
Same as parse_file
, but
module U : sig ... end
module Debug_ : sig ... end
Debugging utils. EXPERIMENTAL