read
read
This manual introduces parsing in MIT/GNU Scheme by giving a tutorial on how to parse string literals.
Impatient readers can skip to the entire source code file of the final parser implementation in parse-string.scm. The tutorial could then function to help explain confusing aspects of the source code file.
Copyleft © 2005 Aaron S. Hawley
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with the Front-Cover Texts being “Free Documentation,” and with the Back-Cover Texts as in (a) below. A copy of the license is available from the Free Software Foundation Web site at http://www.fsf.org/licenses/fdl.html.(a) The Back-Cover Text is: “You have freedom to copy and modify this free document, as you would free software.”
The document was typeset with
GNU Texinfo.
The document's source file is parsing-strings.texi.
It is available from
<http://agave.garden.org/~aaronh/scheme/>.
$Date: 2005/10/06 15:52:06 $
$Revision: 1.12 $
The article begins and concludes with arguments for writing a parser in the Scheme programming language. After the tutorial's scope and intended audience are also introduced, pointers are given to technical and other relevant sources of documentation useful for understanding the tutorial. Information on executing the examples is provided for readers viewing the tutorial from inside an Emacs editor.
The fun in using the Scheme programming language doesn't come from only repeating the same tired pedagogical examples, like writing a recursive solution for printing Fibonacci's numbers or writing yet another Scheme interpreter. Scheme can accomplish worthwhile tasks. And counter to the rumor, some Scheme implementations come with libraries to help programmers complete such tasks. These libraries are often as high-level, graceful and useful as the Scheme language itself. One such Scheme implementation—that also comes documented quite well—is MIT/GNU Scheme (hereafter “MIT Scheme”).
The following is a tutorial to writing a parser in MIT Scheme. Parsers are defined in computer science as software that can extract information from a stream of data. For example, reading the words in a sentence requires a parser. Reading the structur and keywords of computer software code requires a parser. Besides MIT Scheme, the human brain has parsing facilities for reading words and for identifying objects (like shapes or colors).
Most introductions to parsing start with infix arithmetic expressions like 1 + 1. Instead, we shall introduce parsing a simpler construct, the string literal. String literals are any number of characters found between two quotation marks (“"”). Special scenarios usually exist for representing a quotation mark character inside of string literals. Those are explained later.
By parsing only string literals, the tutorial covers what is more often termed lexing by classical computer science. A lexer lexes by converting a stream of data into lexemes. Lexemes are the individual tokens matched by a parser. In this tutorial, the lexemes are string literals. When a literate human brain reads sentences, it lexes words. This distinction exists likely because conventional programming languages needed the tasks separated to handle parsing and to handle left recursion for parsing languages that use infix notation. The distinction between lexing and parsing is not acknowledged any further in this document.
To create a parser there must be a defined grammar. A grammar describes some desired thing that can be parsed. The desired thing may be a valid sentence in the english language, but is a string literal in this tutorial. Only with a well-defined grammar, the task of designing a parser can be accomplished. Writing a parser
The tutorial assumes you understand the basic concepts of a grammar and how to represent them. Two common and related notations for grammars, Regular Expressions and Backus-Naur Form (BNF), are useful to know in this tutorial.
It is assumed you are familiar with programming in Scheme, too.
Information on the Scheme programming language available from MIT Scheme and how to install and use MIT Scheme is available elsewhere as free documentation. See Overview (MIT/GNU Scheme Reference Manual), or Introduction (MIT/GNU Scheme Reference Manual), respectively.
Developed since at least the 1980s, MIT Scheme is a complete programming environment. It is now distributed as free software, was adopted as a GNU package, and has been used for decades to teach programming. More information on MIT Scheme is available at the MIT Scheme Web site.
Readers reading the tutorial in the GNU Info format
from within GNU Emacs or the Edwin editor (the Emacs-like
editor that comes with MIT Scheme) are encouraged to
evaluate each piece of example code with the command C-x C-e.
For Emacs, evaluating Scheme expressions in Info-mode
requires running an inferior Scheme process in a separate buffer
followed by binding the key to the scheme-send-last-sexp
command.
M-x run-scheme RET
C-x b RET
M-x local-set-key RET C-x C-e scheme-send-last-sexp RET
M-x local-set-key RET C-M-x scheme-send-definition RET
Typing C-x C-e with the point (cursor) at the end of the last parenthesis of a Scheme expression evaluates the expression in the *scheme* buffer and print its evaluated value. Typing C-M-x anywhere inside a Scheme definition evaluates the definition.
In MIT Scheme, the parser and matcher are provided as separate libraries to avoid conflicting with the default standard Scheme language provided by MIT Scheme. This is the scenario for creating parsers in other programming languages as well.
Writing parsers in other programming languages often requires writing in a separate language that must be run through a separate tool. In MIT Scheme, a parser is written entirely in a Scheme-like syntax, does not require any separate utilities and can be combined with any other MIT Scheme code or features of MIT Scheme.
The parsing and matching libraries of MIT Scheme are called the
star-parser (*parser) and star-matcher
(*matcher). The syntax for the *parser and
*matcher are inspired by and therefore similar to Regular
Expressions and BNF. This makes MIT Scheme's
parser and matcher languages high-level and simple to use. The
matcher returns true (#t) on success and false (#f) for
a failure to match. The parser differs from this by returning a
vector containing each successfully found token. Besides the parsers
enhancements to modify the parsed value, the syntaxes of the parser
and matcher systems are almost entirely interchangeable.
To use the parser in MIT Scheme you must load the
*parser using the following:
(load-option '*parser)
The string literal is an object in Scheme and many other programming languages to represent strings of characters. Here are four examples of string literals (comments are inline at the right of each example beyond semicolons as Scheme comments):
"foo"
"foo bar"
"" ;; The empty string
"\"foo\"" ;; The string "foo"
Looking at these examples, a parser of string literals would need to to find characters with a start quotation mark character and end with a quotation mark character. The actual string literal between the quotation marks could contain zero or more characters.
Notice that putting quotation mark characters in a string literal requires escaping them with the backslash character (\). This is an idiom understood in most programming languages including Scheme, and will later be understood by the parser.
At first, the parser overlooks the escape character issue and therefore ignore string literals containing quotation marks. Instead, any characters between quotation marks is accepted by the parser. We could describe this simple parser as a procedure that:
This parser is described with MIT Scheme in the following definition:
;; parse-string : parser-buffer -> ( vector | #f )
(define parse-string
(*parser
(seq
(match "\"")
(* (match (not-char #\")))
(match "\""))))
Notice that parse-string is not defined as a typical Scheme
function with specified function arguments (historically called a
“defun”) nor assigned a lambda expression. It is not
immediately clear how to use the parse-string function created
with the *parser. Instead of a function, parse-string
looks like a variable assigned the value resulting from the
*parser-syntax. The single argument to *parser is
called a “syntax” even though it appears *parser is a
function taking a single argument. According to the documentation in
MIT Scheme Documentation (MIT/GNU Scheme Reference Manual), *parser is
implemented as a macro.
What does it mean that parse-string is a macro? The parser
definition parse-string could be thought of as a description of
another parser. The *parser macro uses the parser language
syntax in the sub-expression to create a corresponding larger (thus
“macro”), parser function. The macro parser created
handles the specific details of reading and matching input, and is
applied to a parser buffer—a buffer of data that parser
functions are capable of operating on (This is what is hinted in the
source code comment before the definitions of parse-string).
Intelligently, all of these complicated details of actually parsing
the data in the buffer are abstracted and hidden. Before explaining
how to use the parse-string parser function on an parser buffer
and testing it, the parser language syntax is explained first.
Inside the above *parser expression is a seq expression.
The seq expression guarantees the sub-expressions are matched
sequentially on the data. The seq used above matches
a quotation mark followed by anything but a quotation mark followed by
a closing quotation mark. The order of this sequence is critical.
The match expressions, as intuition hints, match their
sub-expression. If the sub-expression is a string, then the data are
matched against the string, but other *matcher expressions are
allowed in match. For instance, the not-char expression
matches any character other than the character provided to
not-char.
The star (*) expression matches its sub-expression zero or more
times. In the parser above, the * expression matches any
character—that is not a quotation mark—zero or more times.
For those familiar with BNF, the source code for the parser maps quite closely to a BNF representation of the grammar. The syntax differs with its “reverse Polish notation” as adopted generally by Scheme.
<string> ::= "\"" <not-quote>* "\""
<not-quote> ::= "a" | "b" | "c" | ... | (anything but a "\"" .....)
Not even Extended BNF (EBNF) can easily capture
the grammar the parse-string parser accepts, without adding the
ellipsis “hack”.
The small parser above attempts to match a simple string literal.
Testing the parser requires having a parser buffer. Fortunately, the
string->parser-buffer procedure can create parser buffers from
strings. The parser buffer created by string->parser-buffer
can then be passed as an argument to the parse-string
procedure.
(parse-string (string->parser-buffer "\"foo\""))
=> #("\"" "f" "o" "o" "\"") ;; "foo"
(parse-string (string->parser-buffer "foo"))
=> #f
(parse-string (string->parser-buffer "foo\""))
=> #f
(parse-string (string->parser-buffer "\"foo"))
=> #f
(parse-string (string->parser-buffer "\"\""))
=> #("\"" "\"") ;; ""
(parse-string (string->parser-buffer ""))
=> #f
(parse-string (string->parser-buffer "\"foo\" \"foo\""))
=> #("\"" "f" "o" "o" "\"") ;; "foo"
(parse-string (string->parser-buffer "bar \"foo\""))
=> #f
(parse-string (string->parser-buffer "\"foo\" bar"))
=> #("\"" "f" "o" "o" "\"") ;; "foo"
The tests of the initial parser displays some noteworthy behavior.
When a leading quotation mark was not found right away or a closing
quotation mark was never found the parser fails and returns false.
When there were extra characters, including string literals, after the
string literal, the parsing would succeed and ignore the trailing
characters. On success, the parser function returns a vector an
element for each successful match with the match function. The
element returned is each character matched by the three uses of
match in parse-string. To the right of the result in
Scheme comments is a more readable version of the result.
These tests, although maybe not exhaustive, have verified that the
parse-string parser works for simple string literals. The
tutorial will improve the parser by having it accept a broader
definition of string literals and tune the values returned from
parsing successfully.
The next logical step is to allow quotation marks to exist in a string literal. To have the parser accept quotation characters in strings, we need to match “\"”—an escaped quotation mark. When writing an escaped quotation mark in Scheme, quotation marks and backslashes need to be escaped. The escaped quotation mark is represented in a string value as “\\\"” in Scheme—two backslashes to make a backslash, and another backlash to escape the quotation mark.
Where the parser used to match zero or more non-quotation mark characters
needs to match both escaped-quotation characters and non-quotation
characters. The “both” relation can be handled with the alt
parser expression. Unlike seq, alt allows either
subexpression to match and satisfies the first one that matches.
;; parse-string : parser-buffer -> ( vector | #f )
(define parse-string
(*parser
(seq
(match "\"")
(* (alt (match "\\\"")
(match (not-char #\"))))
(match "\""))))
Tests of the new definition confirm that it works.
(parse-string (string->parser-buffer "\"\\\"foo\\\"\""))
=> #("\"" "\\\"" "f" "o" "o" "\\\"" "\"") ;; "\"foo\""
Running the above test on the old definition of parse-string
would return a different value and an incorrect parsing of the string
literal.
(parse-string (string->parser-buffer "\"\\\"foo\\\"\"\"))
=> #("\"" "\\" "\"") ;; "\"
Incorrectly, it returns only the starting quotation mark, the escaping backslash, and then ends prematurely on the escaped quotation mark of the string literal.
The parser matches the leading and ending quotation mark characters
and returns a vector value containing the delimiting quotation marks.
Really, only the contained string of characters should be returned.
To avoid this annoyance we can use the noise expression. The
noise parser expression is equivalent to the match
expression, except the match isn't included in the returned vector
value. The use of match is replaced with noise for the
start end ending quotation marks.
;; parse-string : parser-buffer -> ( vector | #f )
(define parse-string
(*parser
(seq
(noise "\"")
(* (alt (match "\\\"")
(match (not-char #\"))))
(noise "\""))))
Here are some tests to make sure it worked.
(parse-string (string->parser-buffer "\"foo\""))
=> #("f" "o" "o") ;; foo
(parse-string (string->parser-buffer "\"\\\"foo\\\"\""))
=> #("\\\"" "f" "o" "o" "\\\"") ;; \"foo\"
The parser now allows the quotation mark and removes the delimiting quotation marks, but it is returning any escaped quotation marks (\\\") with slashes. The backslash before the quote should be removed in the returned value.
This should even be generalized to all escaped characters. For instance, the backslash character even needs to be escaped by another backslash character. This is not generalized for all characters. Some backslash character sequences represent special characters with specific meanings in computing. These include the newline (\n) and the tab (\t) characters.
To accept the sequence of a backslash character followed by either
another backslash character or a quotation mark, the escaping
backslash character is discarded using noise.
;; parse-string : parser-buffer -> ( vector | #f )
(define parse-string
(*parser
(seq
(noise "\"")
(* (alt (seq (noise "\\")
(alt (match "\\")
(match "\"")))
(match (not-char #\"))))
(noise "\""))))
In the following tests, the resulting vector has strings printed by MIT Scheme with escaped quotation mark characters The backslashes used for escaping in the original string shall no longer be present in the parsed string.
(parse-string (string->parser-buffer "\"\\\"foo\\\"\""))
=> #("\"" "f" "o" "o" "\") ;; "foo"
(parse-string (string->parser-buffer "\"\\\\\"\""))
=> #("\\") ;; \
Another common expectation of a parser is the ability to determine the
returned value of the parsed input. The parser currently returns a
vector composed of each individually matched element. This is not a
useful return value for use by other programs. Usually, the atomic
“token” of the grammar should be the return value, not some
representation determined by the underlying MIT Scheme
parsing system. In this tutorial, a token should be the matched
string literal, not each result coming from the match call.
Provided by the *parser, the encapsulate parser
expression can modify a return value by applying a function to the
vector returned by a parser function. For parsing string literals,
encapsulate will need a function for creating a single string
from each of the matched string elements in the vector.
To convert a vector of strings, the vector needs to be converted to a
list with the function vector->list. Then, the list of strings
created from the vector are converted into a single string. Lists of
strings can't be converted into strings automatically in Scheme (
Lists of characters can). Reducing a list of strings to a single
string is done with the higher-order function reduce. The
reduce procedure is often introduced with the example of adding
a list of numbers together.
(reduce + 0 '(1 2 3 4))
=> 10
Instead of addition, the list of strings created with
vector->list will be concatenated by the reduce function
to “sum” the strings into a single final string.
(lambda (v) (reduce string-append "" (vector->list v)))
This function can be inserted in the parser as part of the
encapsulate expression.
;; parse-string : parser-buffer -> ( vector | #f )
(define parse-string
(*parser
(encapsulate
(lambda (v) (reduce string-append "" (vector->list v)))
(seq
(noise "\"")
(* (alt (seq (noise "\\")
(alt (match "\\")
(match "\"")))
(match (not-char #\"))))
(noise "\"")))))
This allows parse-string to give a clear result.
(parse-string (string->parser-buffer "\"foo\""))
=> #("foo")
The procedure to encapsulate the vector can be alternatively
defined as a function. It will be called
vector-string->string.
;; vector-string->string : string vector -> string
(define (vector-string->string v)
(reduce string-append "" (vector->list v)))
With this definition we can define a more concise parser composed of reusable parts.
;; parse-string : parser-buffer -> ( vector | #f )
(define parse-string
(*parser
(encapsulate vector-string->string
(seq
(noise "\"")
(* (alt (seq (noise "\\")
(alt (match "\\")
(match "\"")))
(match (not-char #\"))))
(noise "\"")))))
readIronically, the utility of this parsing tutorial emphasized in the introduction (see Introduction) was misleading. The example parser can successfully parse string literals. The parser is not an astonishing breakthrough. It is actually duplicating what is is already available in Scheme.
Programming languages, including Scheme, parse string literals all the time. Scheme, unlike other programming languages, makes the internal Scheme parser available to the user, to even use in their program. Therefore, somewhere in Scheme there exists the string literal parser, the exact capability of this tutorial's parser.
The read procedure available in MIT Scheme and all
Scheme implementations can parse Scheme “objects” (a string literal
is a Scheme object). Instead of reading from a parser buffer,
read reads from a port. A port can be created from a string
with MIT Scheme's open-input-string procedure. So,
the tutorial's parser is entirely unoriginal research.
(read (open-input-string "\"foo\""))
=> "foo"
The parser defined in this tutorial is still useful, though. For
instance, some programming tasks require parsers for parsing
only strings (or numbers or some other value). Scheme's
read function doesn't satisfy such a requirement. It
accepts objects other than strings, including symbols, characters,
numbers, lists and all other valid Scheme values and expressions. It
also returns the objects as their Scheme types (strings as string type,
symbols as symbol type, ...) and not as a vector of strings.
(read (open-input-string "foo"))
=> foo
A string literal parser written with MIT Scheme's
*parser syntax would return false if it met these objects.
Further, read does not return false for failed matches but on
occasion will give a parse error.
(read (open-input-string ")"))
=> PARSE-ERROR: Unmatched close paren #\)
The parser for literal strings could be rewritten using read.
We've added the code for doing just that at the end of this tutorial
in Parsing with read.
A parser written with MIT Scheme's *parser syntax and
not with Scheme's read is a useful exercise.
The parser is taken further by having it match and return all string literals available in the parser buffer, allowing whitespace characters to exist between each string literal.
To match all string literals, the entire parser need only match one or
more string literals. The plus (+) expression is similar to
the star (*) expression used elsewhere in the parser. Except,
+ must match one or more subexpressions rather than
zero or more.
The + expression is added before the encapsulate
expression. If it is put after, then the encapsulate
expression would concatenate together all the matched string literals
into one string literal rather than keeping them separate. Also, the
matching of whitespace is also outside of the encapsulate
command to also avoid being “encapsulated” into the returned value.
;; parse-all-strings : parser-buffer -> ( vector | #f )
(define parse-all-strings
(*parser
(+
(alt
(noise " ")
(encapsulate vector-string->string
(seq
(noise "\"")
(* (alt (seq (noise "\\")
(alt (match "\\")
(match "\"")))
(match (not-char #\"))))
(noise "\"")))))))
The following tests show how parse-all-strings returns a vector
element for each individual literal string matched.
(parse-all-strings (string->parser-buffer "\"\""))
=> #("")
(parse-all-strings (string->parser-buffer ""))
=> #f
(parse-all-strings (string->parser-buffer "\"foo\"\"foo\""))
=> #("foo" "foo")
(parse-all-strings (string->parser-buffer "\"foo\" \"bar\""))
=> #("foo" "foo")
Really, we don't need to duplicate the entire code of
parse-string in parse-all-strings. The modularity
of the MIT Scheme parser language, as is accomplished with the
Scheme programming language generally, allow us to reuse parsers as
recursive calls.
;; parse-all-strings : parser-buffer -> ( vector | #f )
(define parse-all-strings
(*parser
(+
parse-string)))
The parsing capabilities in MIT Scheme are powerful, yet succinct and they are useful. Below are some of the qualities parsers created with MIT Scheme exhibit.
The following is the entire source code for the final version of the string literal parser written in MIT Scheme.
(load-option '*parser)
;; vector-string->string : string vector -> string
(define (vector-string->string v)
(reduce string-append "" (vector->list v)))
;; parse-string : parser-buffer -> ( vector | #f )
(define parse-string
(*parser
(encapsulate vector-string->string
(seq
(noise "\"")
(* (alt (seq (noise "\\")
(alt (match "\\")
(match "\"")))
(match (not-char #\"))))
(noise "\"")))))
;; parse-all-strings : parser-buffer -> ( vector | #f )
(define parse-all-strings
(*parser
(+
parse-string)))
readHere is how to define a string literal parser by using the read
function provided by Scheme as mention in Comparing the Parser with `read'.
(define (parse-string #!optional port)
(let* ((port (if (input-port? port) port
current-input-port))
(object (read port)))
(if (string? object) object
#f)))
A version of the function parse-all-strings that was introduced
in Parsing All Strings can also be rewritten using read.
It is left to the reader as an exercise.