Module CCParse

Very Simple Parser Combinators

These combinators can be used to write very simple parsers, for example to extract data from a line-oriented file, or as a replacement to Scanf.

A few examples

Some more advanced example(s) can be found in the /examples directory.

Parse a tree
open CCParse;;

type tree = L of int | N of tree * tree;;

let mk_leaf x = L x
let mk_node x y = N(x,y)

let ptree = fix @@ fun self ->
  skip_space *>
    ( (char '(' *> (pure mk_node <*> self <*> self) <* char ')')
      <|>
        (U.int >|= mk_leaf) )
;;

parse_string_exn ptree "(1 (2 3))" ;;
parse_string_exn ptree "((1 2) (3 (4 5)))" ;;
Parse a list of words
open Containers.Parse;;
let p = U.list ~sep:"," U.word;;
parse_string_exn p "[abc , de, hello ,world  ]";;
Stress Test

This makes a list of 100_000 integers, prints it and parses it back.

let p = CCParse.(U.list ~sep:"," U.int);;

let l = CCList.(1 -- 100_000);;
let l_printed =
  CCFormat.(to_string (within "[" "]" (list ~sep:(return ",@,") int))) l;;

let l' = CCParse.parse_string_exn p l_printed;;

assert (l=l');;

Stability guarantees

Some functions are marked "experimental" and are still subject to change.

type position

A position in the input. Typically it'll point at the beginning of an error location.

module Position : sig ... end
module Error : sig ... end
type +'a or_error = ('a, Error.t) Stdlib.result

'a or_error is either Ok x for some result x : 'a, or an error Error.t.

See stringify_result and Error.to_string to print the error message.

exception ParseError of Error.t

Input

Combinators

type 'a t

The abstract type of parsers that return a value of type 'a (or fail).

  • since 3.6 the type is private.
val return : 'a -> 'a t

Always succeeds, without consuming its input.

val pure : 'a -> 'a t

Synonym to return.

val map : ('a -> 'b) -> 'a t -> 'b t
val map2 : ('a -> 'b -> 'c) -> 'a t -> 'b t -> 'c t
val map3 : ('a -> 'b -> 'c -> 'd) -> 'a t -> 'b t -> 'c t -> 'd t
val bind : ('a -> 'b t) -> 'a t -> 'b t

bind f p results in a new parser which behaves as p then, in case of success, applies f to the result.

  • since 3.6
val ap : ('a -> 'b) t -> 'a t -> 'b t

Applicative.

  • since 3.6
val eoi : unit t

Expect the end of input, fails otherwise.

val empty : unit t

Succeed with ().

  • since 3.6
val fail : string -> 'a t

fail msg fails with the given message. It can trigger a backtrack.

val failf : ('a, unit, string, 'b t) Stdlib.format4 -> 'a

Format.sprintf version of fail.

val fail_lazy : (unit -> string) -> 'a t

Like fail, but only produce an error message on demand.

  • since 3.6
val parsing : string -> 'a t -> 'a t

parsing s p behaves the same as p, with the information that we are parsing s, if p fails. The message s is added to the error, it does not replace it, not does the location change (the error still points to the same location as in p).

val set_error_message : string -> 'a t -> 'a t

set_error_message msg p behaves like p, but if p fails, set_error_message msg p fails with msg instead and at the current position. The internal error message of p is just discarded.

  • since 3.6
val pos : position t

pos returns the current position in the buffer.

EXPERIMENTAL

  • since 3.7
val with_pos : 'a t -> ('a * position) t

with_pos p behaves like p, but returns the (starting) position along with p's result.

EXPERIMENTAL

  • since 3.6
val any_char : char t

any_char parses any character. It still fails if the end of input was reached.

  • since 3.6
val any_char_n : int -> string t

any_char_n len parses exactly len characters from the input. Fails if the input doesn't contain at least len chars.

  • since 3.6
val char : char -> char t

char c parses the character c and nothing else.

type slice

A slice of the input, as returned by some combinators such as split_1 or split_list or take.

The idea is that one can use some parsers to cut the input into slices, e.g. split into lines, or split a line into fields (think CSV or TSV). Then a variety of parsers can be used on each slice to extract data from it using recurse.

Slices contain enough information to make it possible for recurse slice p to report failures (if p fails) using locations from the original input, not relative to the slice. Therefore, even after splitting the input into lines using, say, each_line, a failure to parse the 500th line will be reported at line 500 and not at line 1.

EXPERIMENTAL

  • since 3.6
module Slice : sig ... end

Functions on slices.

val recurse : slice -> 'a t -> 'a t

recurse slice p parses the slice (most likely obtained via another combinator, such as split_1 or split_list), using p.

The slice contains a position which is used to relocate error messages to their position in the whole input, not just relative to the slice.

EXPERIMENTAL

  • since 3.6
val set_current_slice : slice -> unit t

set_current_slice slice replaces the parser's state with slice.

EXPERIMENTAL

  • since 3.6
val chars_fold : f: ('acc -> char -> [ `Continue of 'acc | `Consume_and_stop of 'acc | `Stop of 'acc | `Fail of string ]) -> 'acc -> ('acc * slice) t

chars_fold f acc0 folds over characters of the input. Each char c is passed, along with the current accumulator, to f; f can either:

  • stop, by returning `Stop acc. In this case the final accumulator acc is returned, and c is not consumed.
  • consume char and stop, by returning `Consume_and_stop acc.
  • fail, by returning `Fail msg. In this case the parser fails with the given message.
  • continue, by returning `Continue acc. The parser continues to the next char with the new accumulator.

This is a generalization of of chars_if that allows one to transform characters on the fly, skip some, handle escape sequences, etc. It can also be useful as a base component for a lexer.

  • returns

    a pair of the final accumular, and the slice matched by the fold.

  • since 3.6
val chars_fold_transduce : f: ('acc -> char -> [ `Continue of 'acc | `Yield of 'acc * char | `Consume_and_stop | `Stop | `Fail of string ]) -> 'acc -> ('acc * string) t

Same as chars_fold but with the following differences:

  • returns a string along with the accumulator, rather than the slice of all the characters accepted by `Continue _. The string is built from characters returned by `Yield.
  • new case `Yield (acc, c) adds c to the returned string and continues parsing with acc.
  • since 3.6
val take_until_success : 'a t -> (slice * 'a) t

take_until_success p accumulates characters of the input into a slice, until p successfully parses a value x; then it returns slice, x.

NOTE performance wise, if p does a lot of work at each position, this can be costly (thing naive substring search if p is string "very long needle").

  • since 3.12
val take : int -> slice t

take len parses exactly len characters from the input. Fails if the input doesn't contain at least len chars.

  • since 3.6
val take_if : (char -> bool) -> slice t

take_if f takes characters as long as they satisfy the predicate f.

  • since 3.6
val take1_if : ?descr:string -> (char -> bool) -> slice t

take1_if f takes characters as long as they satisfy the predicate f. Fails if no character satisfies f.

  • parameter descr

    describes what kind of character was expected, in case of error

  • since 3.6
val char_if : ?descr:string -> (char -> bool) -> char t

char_if f parses a character c if f c = true. Fails if the next char does not satisfy f.

  • parameter descr

    describes what kind of character was expected, in case of error

val chars_if : (char -> bool) -> string t

chars_if f parses a string of chars that satisfy f. Cannot fail.

val chars1_if : ?descr:string -> (char -> bool) -> string t

Like chars_if, but accepts only non-empty strings. chars1_if p fails if the string accepted by chars_if p is empty. chars1_if p is equivalent to take1_if p >|= Slice.to_string.

  • parameter descr

    describes what kind of character was expected, in case of error

val endline : char t

Parse '\n'.

val space : char t

Tab or space.

val white : char t

Tab or space or newline.

val skip_chars : (char -> bool) -> unit t

Skip 0 or more chars satisfying the predicate.

val skip_space : unit t

Skip ' ' and '\t'.

val skip_white : unit t

Skip ' ' and '\t' and '\n'.

val is_alpha : char -> bool

Is the char a letter?

val is_num : char -> bool

Is the char a digit?

val is_alpha_num : char -> bool

Is the char a letter or a digit?

val is_space : char -> bool

True on ' ' and '\t'.

val is_white : char -> bool

True on ' ' and '\t' and '\n'.

val suspend : (unit -> 'a t) -> 'a t

suspend f is the same as f (), but evaluates f () only when needed.

A practical use case is to implement recursive parsers manually, as described in fix. The parser is let rec p () = …, and suspend p can be used in the definition to use p.

val string : string -> string t

string s parses exactly the string s, and nothing else.

val exact : string -> string t

Alias to string.

  • since 3.6
val many : 'a t -> 'a list t

many p parses p repeatedly, until p fails, and collects the results into a list.

val optional : _ t -> unit t

optional p tries to parse p, and return () whether it succeeded or failed. Cannot fail itself. It consumes input if p succeeded (as much as p consumed), but consumes not input if p failed.

  • since 3.6
val try_ : 'a t -> 'a t

try_ p is just like p (it used to play a role in backtracking semantics but no more).

  • deprecated

    since 3.6 it can just be removed. See try_opt if you want to detect failure.

val try_opt : 'a t -> 'a option t

try_opt p tries to parse using p, and return Some x if p succeeded with x (and consumes what p consumed). Otherwise it returns None and consumes nothing. This cannot fail.

  • since 3.6
val many_until : until:_ t -> 'a t -> 'a list t

many_until ~until p parses as many p as it can until the until parser successfully returns. If p fails before that then many_until ~until p fails as well. Typically until can be a closing ')' or another termination condition, and what is consumed by until is also consumed by many_until ~until p.

EXPERIMENTAL

  • since 3.6
val try_or : 'a t -> f:('a -> 'b t) -> else_:'b t -> 'b t

try_or p1 ~f ~else_:p2 attempts to parse x using p1, and then becomes f x. If p1 fails, then it becomes p2. This can be useful if f is expensive but only ever works if p1 matches (e.g. after an opening parenthesis or some sort of prefix).

  • since 3.6
val try_or_l : ?msg:string -> ?else_:'a t -> (unit t * 'a t) list -> 'a t

try_or_l ?else_ l tries each pair (test, p) in order. If the n-th test succeeds, then try_or_l l behaves like n-th p, whether p fails or not. If test consumes input, the state is restored before calling p. If they all fail, and else_ is defined, then it behaves like else_. If all fail, and else_ is None, then it fails as well.

This is a performance optimization compared to (<|>). We commit to a branch if the test succeeds, without backtracking at all. It can also provide better error messages, because failures in the parser will not be reported as failures in try_or_l.

See lookahead_ignore for a convenient way of writing the test conditions.

  • parameter msg

    error message if all options fail

    EXPERIMENTAL

  • since 3.6
val or_ : 'a t -> 'a t -> 'a t

or_ p1 p2 tries to parse p1, and if it fails, tries p2 from the same position.

  • since 3.6
val both : 'a t -> 'b t -> ('a * 'b) t

both a b parses a, then b, then returns the pair of their results.

  • since 3.6
val many1 : 'a t -> 'a list t

many1 p is like many p excepts it fails if the list is empty (i.e. it needs p to succeed at least once).

val skip : _ t -> unit t

skip p parses zero or more times p and ignores its result. It is eager, meaning it will continue as long as p succeeds. As soon as p fails, skip p stops consuming any input.

val sep : by:_ t -> 'a t -> 'a list t

sep ~by p parses a list of p separated by by.

val sep_until : until:_ t -> by:_ t -> 'a t -> 'a list t

Same as sep but stop when until parses successfully.

  • since 3.6
val sep1 : by:_ t -> 'a t -> 'a list t

sep1 ~by p parses a non empty list of p, separated by by.

val lookahead : 'a t -> 'a t

lookahead p behaves like p, except it doesn't consume any input.

EXPERIMENTAL

  • since 3.6
val lookahead_ignore : 'a t -> unit t

lookahead_ignore p tries to parse input with p, and succeeds if p succeeds. However it doesn't consume any input and returns (), so in effect its only use-case is to detect whether p succeeds, e.g. in try_or_l.

EXPERIMENTAL

  • since 3.6
val fix : ('a t -> 'a t) -> 'a t

Fixpoint combinator. fix (fun self -> p) is the parser p, in which self refers to the parser p itself (which is useful to parse recursive structures.

An alternative, manual implementation to let p = fix (fun self -> q) is:

let rec p () =
 let self = suspend p in
 q
val line : slice t

Parse a line, '\n' excluded, and position the cursor after the '\n'.

  • since 3.6
val line_str : string t

line_str is line >|= Slice.to_string. It parses the next line and turns the slice into a string. The state points to the character immediately after the '\n' character.

  • since 3.6
val each_line : 'a t -> 'a list t

each_line p runs p on each line of the input. EXPERIMENTAL

  • since 3.6
val split_1 : on_char:char -> (slice * slice option) t

split_1 ~on_char looks for on_char in the input, and returns a pair sl1, sl2, where:

  • sl1 is the slice of the input the precedes the first occurrence of on_char, or the whole input if on_char cannot be found. It does not contain on_char.
  • sl2 is the slice that comes after on_char, or None if on_char couldn't be found. It doesn't contain the first occurrence of on_char (if any).

The parser is now positioned at the end of the input.

EXPERIMENTAL

  • since 3.6
val split_list : on_char:char -> slice list t

split_list ~on_char splits the input on all occurrences of on_char, returning a list of slices.

EXPERIMENTAL

  • since 3.6
val split_list_at_most : on_char:char -> int -> slice list t

split_list_at_most ~on_char n applies split_1 ~on_char at most n times, to get a list of n+1 elements. The last element might contain on_char. This is useful to limit the amount of work done by split_list.

EXPERIMENTAL

  • since 3.6
val split_2 : on_char:char -> (slice * slice) t

split_2 ~on_char splits the input into exactly 2 fields, and fails if the split yields less or more than 2 items. EXPERIMENTAL

  • since 3.6
val split_3 : on_char:char -> (slice * slice * slice) t

See split_2 EXPERIMENTAL

  • since 3.6
val split_4 : on_char:char -> (slice * slice * slice * slice) t

See split_2 EXPERIMENTAL

  • since 3.6
val each_split : on_char:char -> 'a t -> 'a list t

split_list_map ~on_char p uses split_list ~on_char to split the input, then parses each chunk of the input thus obtained using p.

The difference with sep ~by:(char on_char) p is that sep calls p first, and only tries to find on_char after p returns. While it is more flexible, this technique also means p has to be careful not to consume on_char by error.

A useful specialization of this is each_line, which is basically each_split ~on_char:'\n' p.

EXPERIMENTAL

  • since 3.6
val all : slice t

all returns all the unconsumed input as a slice, and consumes it. Use Slice.to_string to turn it into a string.

Note that lookahead all can be used to peek at the rest of the input without consuming anything.

  • since 3.6
val all_str : string t

all_str accepts all the remaining chars and extracts them into a string. Similar to all but with a string.

EXPERIMENTAL

  • since 3.6
val memo : 'a t -> 'a t

Memoize the parser. memo p will behave like p, but when called in a state (read: position in input) it has already processed, memo p returns a result directly. The implementation uses an underlying hashtable. This can be costly in memory, but improve the run time a lot if there is a lot of backtracking involving p.

Do not call memo inside other functions, especially with (>>=), map, etc. being so prevalent. Instead the correct way to use it is in a toplevel definition:

let my_expensive_parser = memo (foo *> bar >>= fun i -> …)

This function is not thread-safe.

val fix_memo : ('a t -> 'a t) -> 'a t

Like fix, but the fixpoint is memoized.

Infix

module Infix : sig ... end
include module type of Infix
val (>|=) : 'a t -> ('a -> 'b) -> 'b t

Alias to map. p >|= f parses an item x using p, and returns f x.

val (>>=) : 'a t -> ('a -> 'b t) -> 'b t

Alias to bind. p >>= f results in a new parser which behaves as p then, in case of success, applies f to the result.

val (<*>) : ('a -> 'b) t -> 'a t -> 'b t

Applicative.

val (<*) : 'a t -> _ t -> 'a t

a <* b parses a into x, parses b and ignores its result, and returns x.

val (*>) : _ t -> 'a t -> 'a t

a *> b parses a, then parses b into x, and returns x. The result of a is ignored.

val (<|>) : 'a t -> 'a t -> 'a t

Alias to or_.

a <|> b tries to parse a, and if a fails without consuming any input, backtracks and tries to parse b, otherwise it fails as a.

val (<?>) : 'a t -> string -> 'a t

a <?> msg behaves like a, but if a fails, a <?> msg fails with msg instead. Useful as the last choice in a series of <|>. For example: a <|> b <|> c <?> "expected one of a, b, c".

val (|||) : 'a t -> 'b t -> ('a * 'b) t

Alias to both. a ||| b parses a, then b, then returns the pair of their results.

  • since 3.6
val let+ : 'a t -> ('a -> 'b) -> 'b t
val and+ : 'a t -> 'b t -> ('a * 'b) t
val let* : 'a t -> ('a -> 'b t) -> 'b t
val and* : 'a t -> 'b t -> ('a * 'b) t

Parse input

val stringify_result : 'a or_error -> ('a, string) Stdlib.result

Turn a Error.t-oriented result into a more basic string result.

  • since 3.6
val parse_string : 'a t -> string -> ('a, string) Stdlib.result

Parse a string using the parser.

val parse_string_e : 'a t -> string -> 'a or_error

Version of parse_string that returns a more detailed error.

val parse_string_exn : 'a t -> string -> 'a
val parse_file : 'a t -> string -> ('a, string) Stdlib.result

parse_file p filename parses file named filename with p by opening the file and reading it whole.

val parse_file_e : 'a t -> string -> 'a or_error

Version of parse_file that returns a more detailed error.

val parse_file_exn : 'a t -> string -> 'a

Same as parse_file, but

module U : sig ... end
module Debug_ : sig ... end

Debugging utils. EXPERIMENTAL