i've been studying david turner's language SASL, the
ancestor of miranda and other FPLs. for some time now
i've wondered what a lazy vector language would look
like, so i decided to try my hand at an implementation
of K which used combinator reduction techniques as
the inner evaluation engine.
at this point, i've implemented the tokenizer, the
parser, and the compiler, and i'm working on the
evaluation engine. having hacked a fair number of
tokenizers over the last few years, none of which
i found entirely satisfying, i decided to write one
which was fast, small, general, and completely data-
driven.
the script is here:
http://www.nsl.com/k/sasl/t.k
in what follows, i'll explain my approach. there's
nothing conceptually novel here, but the technique
might come in handy for others who are writing their
own languages. at the very least, the problem seems
to recommend itself as a candidate for benchmarking.
i define the fundamental character classes:
A:_ci(97+!26),65+!26 / a-zA-Z
N:_ci 48+!9 / 0-9
Q:"\"" / the quote character "
X:"\\" / the escape character \
S:"`" / the symbol prefix `
C:":" / colon :
D:"." / dot .
K:"~!@#$%^&*-_=+|<,>?'/(){}[];" / K operators
(the K operators include X, C, and D - see O below)
U:(;A;D;N;Q;X;K;C;S) / the character universe
U[0]:_ci(!256)_dvl 0,_ic,/1_ U / U[0] is everything else
U is a list of the character classes. everything in U[0]
will be treated as equivalent to blank (" ").
the strategy is to define a state-transition matrix M.
applying M to the input string s will produce an integer
vector r s.t. r[i] is the index of the state occupied
by M when s[i] was detected.
the names of the states of the SASL-K tokenizing machine
are represented by a character vector (i.e. a string) V:
V:" ab0.9,ocxy`_+:-;"
the states are:
<blank> i.e. anything in U[0]
a initial A (i.e. alpha)
b subsequent A's
0 digit - integer part
. decimal point
9 digit - decimal part
, blank following a digit
o opening quote
c closing quote
x escape within a quotation
y character immediately following an escape
` symbol or non-blank in a symbol-sequence
_ blank immediately following a symbol-sequence
+ initial K operator
: colon immediately following initial K operator
- K operator immediately following an initial K operator
; colon immediately following a non-initial K operator
the result of processing the input string will be an
integer vector r s.t. r[i] is an index into the states.
for example, the result of processing
"12.3 abc"
will be
4 4 5 6 7 2 3 3
i.e. the indices of the states "00.9,abb". so we need a
way of grouping the different substates:
4 4 5 6 7 2 3 3
--------- -----
number name
the grouping can then be used to cut the input string
into tokens:
("12.3 ";"abc")
The endpoints of the substate groups are:
" b,y_:"
i.e. blank, non-initial-alpha, blank-following-numeric,
character-following-escape, blank-following-symbol-
sequence, and colon-following-initial-operator. (the
end-point of the last substate group is implied.)
the endpoints are represented as an integer vector:
I:0,1+V?/:" b,y_:"
I
0 1 3 7 11 13 15
the state-transition matrix O is a states x classes
character matrix. it is constructed in two stages. the
idea is that we want to be able to edit the matrix
in terms of V and U, but we need to convert that to
a states x 256 integer matrix M, where the character
classes are converted to their integer representations
and mapped to multiple columns of M.
O is:
O:"
a+0o+++`
a+0o+++`
b+bo+++`
b+bo+++`
,a.0o+++`
,a+9o+++`
,a+9o+++`
,a+0o+++`
oooocxo+o
a+0o+++`
yyyyyyyyy
ooooooooo
_````````
_a+0o+++`
a-0o--:`
a-0o---`
a+0o++;`
a+Oo+++`"
M starts in state 0. if it sees a blank it goes to
the blank state; if it sees an alpha it goes to state a;
if it sees a . it goes to state +; and so forth.
M is computed this way:
fsm:{{.[x;(;y);:;z]}/[(#x)#,&256;_ic y;+x]}
M:fsm[1+V?/:/:1_'(&O="\n")_ O]U
fsm is a function which takes an integer matrix x and
the character-class universe y, and returns an integer
matrix.
there are 17 states:
#V
17
M is 18 x 256. the extra state is 0, the start state.
there are no transitions back to that state.
the tokenizing code is then quite simple:
tokens:{dbt cut[x]0 M\_ic x}"", / compute transitions
dbt:{x _di&x[;0]_lin*U} / delete blank tokens
cut:{(&(~=)':I _binl y)_ x} / cut by endpoints
a simple example:
tokens"1 2 3+*/4"
("1 2 3"
,"+"
,"*"
,"/"
,"4")
notice that "1 2 3" is a single token. this is why the
machine contains the , state. the target language K
supports vector notation for integers, floats, and symbols,
so it would be wrong to split this into three tokens.
x is the argument to the 'tokens' function, and contains
the input string:
x:"1 2 3+*/4"
_ic is a primitive function which gives the ascii value for
an array of characters of any dimensionality:
_ic x
49 32 50 32 51 43 42 47 52
the heart of the tokenizer is
0 M\_ic x
M\ is two-dimensional pointer-chasing. the expression
k:M[i;j]
where i is the current-state-index and j the ascii index of
the current-character. the expression returns k, the next-
state-index. i.e. in state i, if you see character j, go to
state k.
0 M\_ic x
supplies the initial state-index i = 0, picks off the
first character index j, returns k. i is set to k,
j is set to the next character index, and a new k is
computed. &c.
the result of 0 M\_ic x is a vector of the state-indices
visited in the course of execution:
0 M\_ic x
0 4 7 4 7 4 14 16 14 4
the 'cut' function takes the input string x and the
state-index vector y and cuts the string at the state-
endpoints I:
cut[x]0 M\_ic x
("1 2 3"
,"+"
,"*"
,"/"
,"4")
finally, the 'dbt' function discards all tokens whose
first element is in U[0], the class of characters
equivalent to blank.
on my 3 ghz machine, 'tokens' processes 2 million chars/
sec:
x:2000000#x
\t tokens x
1062 / ms