stevan apter — 2005-05-19 15:45:42

i've been studying david turner's language SASL, the
ancestor of miranda and other FPLs. for some time now
i've wondered what a lazy vector language would look
like, so i decided to try my hand at an implementation
of K which used combinator reduction techniques as
the inner evaluation engine.

at this point, i've implemented the tokenizer, the
parser, and the compiler, and i'm working on the
evaluation engine. having hacked a fair number of
tokenizers over the last few years, none of which
i found entirely satisfying, i decided to write one
which was fast, small, general, and completely data-
driven.

the script is here:

http://www.nsl.com/k/sasl/t.k

in what follows, i'll explain my approach. there's
nothing conceptually novel here, but the technique
might come in handy for others who are writing their
own languages. at the very least, the problem seems
to recommend itself as a candidate for benchmarking.

i define the fundamental character classes:

A:_ci(97+!26),65+!26 / a-zA-Z
N:_ci 48+!10 / 0-9
Q:"\"" / the quote character "
X:"\\" / the escape character \
S:"`" / the symbol prefix `
C:":" / colon :
D:"." / dot .
K:"~!@#$%^&*-_=+|<,>?'/(){}[];" / K operators

(the K operators include X, C, and D - see O below)

U:(;A;D;N;Q;X;K;C;S) / the character universe
U[0]:_ci(!256)_dvl 0,_ic,/1_ U / U[0] is everything else

U is a list of the character classes. everything in U[0]
will be treated as equivalent to blank (" ").
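
for readers who don't know K, here is a rough python sketch of the same class table. the names CLASSES and class_of are made up for illustration and are not part of t.k; the sketch only assumes the classes listed above.

```python
import string

# class order mirrors U: other, A, D, N, Q, X, K, C, S
CLASSES = [
    "",                               # 0: everything not listed below
    string.ascii_letters,             # A
    ".",                              # D
    string.digits,                    # N
    '"',                              # Q
    "\\",                             # X
    "~!@#$%^&*-_=+|<,>?'/(){}[];",    # K
    ":",                              # C
    "`",                              # S
]

# class_of[c] is the class index of the character with code c;
# anything left at 0 is treated as equivalent to blank
class_of = [0] * 256
for index, chars in enumerate(CLASSES):
    for ch in chars:
        class_of[ord(ch)] = index

print(class_of[ord("a")], class_of[ord("7")], class_of[ord(" ")], class_of[ord("+")])
# prints: 1 3 0 6
```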

the strategy is to define a state-transition matrix M.

applying M to the input string s will produce an integer
vector r s.t. r[i] is the index of the state occupied
by M when s[i] was detected.

the names of the states of the SASL-K tokenizing machine
are represented by a character vector (i.e. a string) V:

V:" ab0.9,ocxy`_+:-;"

the states are:

<blank> i.e. anything in U[0]
a initial A (i.e. alpha)
b subsequent A's
0 digit - integer part
. decimal point
9 digit - decimal part
, blank following a digit
o opening quote
c closing quote
x escape within a quotation
y character immediately following an escape
` symbol or non-blank in a symbol-sequence
_ blank immediately following a symbol-sequence
+ initial K operator
: colon immediately following initial K operator
- K operator immediately following an initial K operator
; colon immediately following a non-initial K operator

the result of processing the input string will be an
integer vector r s.t. r[i] is an index into the states.
for example, the result of processing

"12.3 abc"

will be

4 4 5 6 7 2 3 3

i.e. the indices of the states "00.9,abb". so we need a
way of grouping the different substates:

4 4 5 6 7 2 3 3
--------- -----
number    name

the grouping can then be used to cut the input string
into tokens:

("12.3 ";"abc")

the endpoints of the substate groups are:

" b,y_:"

i.e. blank, non-initial-alpha, blank-following-numeric,
character-following-escape, blank-following-symbol-
sequence, and colon-following-initial-operator. (the
end-point of the last substate group is implied.)

the endpoints are represented as an integer vector:

I:0,1+V?/:" b,y_:"

I
0 1 3 7 11 13 15
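
a rough python sketch of how the endpoints turn states into groups and groups into tokens. it assumes _binl behaves like a leftmost binary search (bisect_left here); the helper names group and cut are illustrative only, and this cut handles the first character explicitly instead of comparing against the start state.

```python
from bisect import bisect_left

V = " ab0.9,ocxy`_+:-;"
# state indices are 1-based because index 0 is reserved for the start state
I = [0] + [1 + V.index(c) for c in " b,y_:"]

def group(state):
    # position of the first endpoint >= state; states past the last endpoint
    # all land in one final group, which is why that endpoint can stay implicit
    return bisect_left(I, state)

def cut(text, states):
    # states[i] is the state the machine is in after reading text[i]
    groups = [group(s) for s in states]
    starts = [i for i in range(len(text)) if i == 0 or groups[i] != groups[i - 1]]
    return [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]

print(I)                                           # [0, 1, 3, 7, 11, 13, 15]
print(cut("12.3 abc", [4, 4, 5, 6, 7, 2, 3, 3]))   # ['12.3 ', 'abc']
```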

the state-transition matrix O is a states x classes
character matrix. it is constructed in two stages. the
idea is that we want to be able to edit the matrix
in terms of V and U, but we need to convert that to
a states x 256 integer matrix M, where the character
classes are converted to their integer representations
and mapped to multiple columns of M.

O is:

O:"
 a+0o+++`
 a+0o+++`
 b+bo+++`
 b+bo+++`
,a.0o+++`
,a+9o+++`
,a+9o+++`
,a+0o+++`
oooocxo+o
 a+0o+++`
yyyyyyyyy
ooooooooo
_````````
_a+0o+++`
 a-0o--:`
 a-0o---`
 a+0o++;`
 a+0o+++`"

M starts in state 0. if it sees a blank it goes to
the blank state; if it sees an alpha it goes to state a;
if it sees a . it goes to state +; and so forth.

M is computed this way:

fsm:{{.[x;(;y);:;z]}/[(#x)#,&256;_ic y;+x]}
M:fsm[1+V?/:/:1_'(&O="\n")_ O]U

fsm is a function which takes an integer matrix x and
the character-class universe y, and returns an integer
matrix.
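
the expansion step can be sketched in python as follows. this is not a translation of fsm, just the same idea on a toy two-class table; expand, toy, letters and others are names invented for the example.

```python
import string

def expand(table, state_names, classes):
    # named states are numbered from 1; row 0 of the table is the start state
    index = {name: 1 + i for i, name in enumerate(state_names)}
    matrix = [[0] * 256 for _ in table]
    for state, row in enumerate(table):
        for cls, name in enumerate(row):
            for ch in classes[cls]:        # one class fans out to many columns
                matrix[state][ord(ch)] = index[name]
    return matrix

letters = string.ascii_letters
others = "".join(chr(c) for c in range(256) if chr(c) not in letters)

# toy table: columns are (other, letter); rows are start, blank, a, b
toy = [" a", " a", " b", " b"]
M = expand(toy, " ab", [others, letters])

print(len(M), len(M[0]), M[0][ord("x")], M[2][ord("y")], M[3][ord(" ")])
# prints: 4 256 2 3 1
```

the table stays small and editable (states x classes), while the lookup at run time is a plain index by state and character code.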

there are 17 states:

#V
17

M is 18 x 256. the extra state is 0, the start state.
there are no transitions back to that state.

the tokenizing code is then quite simple:

tokens:{dbt cut[x]0 M\_ic x}"", / compute transitions
dbt:{x _di&x[;0]_lin*U} / delete blank tokens
cut:{(&(~=)':I _binl y)_ x} / cut by endpoints

a simple example:

tokens"1 2 3+*/4"
("1 2 3"
,"+"
,"*"
,"/"
,"4")

notice that "1 2 3" is a single token. this is why the
machine contains the , state. the target language K
supports vector notation for integers, floats, and symbols,
so it would be wrong to split this into three tokens.
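
to see why the blank-after-digit state keeps a numeric vector together, here is a cut-down python machine with three classes and five states. it is a toy written for this explanation, not the SASL-K table; NEXT, GROUP_OF, classify and this tokens function are all hypothetical.

```python
BLANK, DIGIT, OP = 0, 1, 2

# next state, indexed by [state][class]
NEXT = [
    [1, 2, 4],   # 0 start
    [1, 2, 4],   # 1 blank
    [3, 2, 4],   # 2 digit
    [3, 2, 4],   # 3 blank after a digit: still inside the number group
    [1, 2, 5],   # 4 operator
    [1, 2, 4],   # 5 operator right after an operator
]
# states 2 and 3 share a group, so "1 2 3" is never split;
# states 4 and 5 alternate, so adjacent operators always are
GROUP_OF = [0, 1, 2, 2, 3, 4]

def classify(ch):
    if ch.isdigit():
        return DIGIT
    return OP if ch in "+-*/" else BLANK

def tokens(text):
    state, out, previous = 0, [], None
    for ch in text:
        state = NEXT[state][classify(ch)]
        if GROUP_OF[state] != previous:
            out.append("")                 # group changed: start a new token
        out[-1] += ch
        previous = GROUP_OF[state]
    # drop tokens that begin with a blank
    return [t for t in out if classify(t[0]) != BLANK]

print(tokens("1 2 3+*/4"))   # ['1 2 3', '+', '*', '/', '4']
```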

x is the argument to the 'tokens' function, and contains
the input string:

x:"1 2 3+*/4"

_ic is a primitive function which gives the ascii value for
an array of characters of any dimensionality:

_ic x
49 32 50 32 51 43 42 47 52

the heart of the tokenizer is

0 M\_ic x

M\ is two-dimensional pointer-chasing. in the expression

k:M[i;j]

i is the current state index and j is the ascii code of
the current character. the expression returns k, the next
state index. i.e. in state i, if you see character j, go to
state k.

0 M\_ic x

supplies the initial state-index i = 0, picks off the
first character index j, returns k. i is set to k,
j is set to the next character index, and a new k is
computed. &c.

the result of 0 M\_ic x is a vector of the state-indices
visited in the course of execution:

0 M\_ic x
0 4 7 4 7 4 14 16 14 4
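
in python terms the scan is a running fold that keeps every intermediate state. the sketch below uses itertools.accumulate (python 3.8 or later for the initial argument) on a toy four-state matrix; scan and the matrix are assumptions made for the example, not the M built above.

```python
from itertools import accumulate

# toy matrix: 0 start, 1 blank, 2 initial letter, 3 subsequent letter
fresh = [2 if chr(c).isalpha() else 1 for c in range(256)]    # rows for start and blank
inside = [3 if chr(c).isalpha() else 1 for c in range(256)]   # rows for the letter states
M = [fresh, fresh, inside, inside]

def scan(matrix, start, text):
    # character codes must lie in 0..255 to index a row
    codes = [ord(ch) for ch in text]
    # each step looks up matrix[state][code]; the result includes the start state
    return list(accumulate(codes, lambda state, code: matrix[state][code], initial=start))

print(scan(M, 0, "ab c"))   # [0, 2, 3, 1, 2]
```

as with the K result, the output is one element longer than the input because the initial state comes first.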

the 'cut' function takes the input string x and the
state-index vector y and cuts the string at the state-
endpoints I:

cut[x]0 M\_ic x
("1 2 3"
,"+"
,"*"
,"/"
,"4")

finally, the 'dbt' function discards all tokens whose
first element is in U[0], the class of characters
equivalent to blank.

on my 3 ghz machine, 'tokens' processes just under 2 million
chars/sec:

x:2000000#x
\t tokens x
1062 / ms

phimvt@lurac.latrobe.edu.au — 2005-05-20 09:05:17

On Thu, 19 May 2005, stevan apter wrote:

> i've been studying david turner's language SASL, the
> ancestor of miranda and other FPLs. for some time now
> i've wondered what a lazy vector language would look
> like, so i decided to try my hand at an implementation
> of K which used combinator reduction techniques as
> the inner evaluation engine.

I remember SASL, although I never had access to one, or to Miranda. I have
toyed with Hugs (or Gofer?), an early implementation of (a subset of)
Haskell. But I did read Peyton Jones's "The Implementation of Functional
Programming Languages" (or some such) with awe and bewilderment.
Gradually things began to make sense, though. What I remember to be the
crucial thing was that the Schoenfinkel/Curry combinators + Turner's
optimisation relied on the fact that evaluation is based on rewriting a
graph in such a way that it is automatically lazy: an expression might be
used several times over, but it will be evaluated only once. The first
evaluation rewrites the expression, so that any later access to it will
find the value there. It is as if we had definitions of expressions like

DEFINE two+three == 2 3 + .

and later uses (here written over several lines for the comments):

two+three # when evaluated, also changes the definition to 5
10
*
two+three # just 5
+
==> 55
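
The "evaluate once, then overwrite" behaviour can be sketched in Python with a node that replaces its computation by its value on first use. This is only an illustration of the idea, assuming nothing about how SASL or any Joy implementation actually stores its graph; Node, force and add are hypothetical names.

```python
class Node:
    """A graph node holding either an unevaluated computation or its value."""

    def __init__(self, compute):
        self.compute = compute
        self.evaluated = False
        self.value = None

    def force(self):
        if not self.evaluated:
            self.value = self.compute()   # first use: do the work
            self.compute = None           # then overwrite the node with the result
            self.evaluated = True
        return self.value

calls = []

def add():
    calls.append("2 3 +")                 # record each real evaluation
    return 2 + 3

two_plus_three = Node(add)
result = two_plus_three.force() * 10 + two_plus_three.force()
print(result, len(calls))                 # 55 1
```

Every holder of a reference to the same node sees the value after the first evaluation, which is what makes the sharing automatic.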

This example relied on a definition; others do the same thing
when some expression such as 2 3 + occurs just once somewhere
but will be encountered many times (in a loop, or by the map
combinator, for example). But I suppose you know all this already.
Or maybe not. Here is a nice example to sharpen your teeth on:

[2 3 +] dup put     writes [2 3 +]
dup i put           writes 5
put                 writes [5]

This is what a lazy Joy would look like.

If I remember correctly, the abstract SASL/Miranda/Haskell machine
keeps the internal code as a tree (or is it a graph). Are you doing
something similar? Such a tree would be rather different from
anything vector-like, but for your sake I hope that I am wrong.
You are a brave man to attempt this.

- Manfred

[..]

William Tanksley, Jr — 2005-05-20 13:23:45

phimvt@... <phimvt@...> wrote:

> If I remember correctly, the abstract SASL/Miranda/Haskell machine
> keeps the internal code as a tree (or is it a graph). Are you doing
> something similar? Such a tree would be rather different from
> anything vector-like, but for your sake I hope that I am wrong.
> You are a brave man to attempt this.

Interesting. This is what Wippler is doing for his 'vlerq' project,
which is centered around a concatenative, vector-based language -- I
don't know whether he's studied SASL, and I'm pretty sure he didn't
get the idea from it (I was an innocent bystander watching as he
generated the idea during a conversation...).

http://www.vlerq.com

> - Manfred

-Billy

William Tanksley, Jr — 2005-05-20 13:28:06

stevan apter <sa@...> wrote:

> i've wondered what a lazy vector language would look
> like, so i decided to try my hand at an implementation
> of K which used combinator reduction techniques as
> the inner evaluation engine.

Ambitious.

A question: when you say "an implementation of K", do you mean "an
open source reimplementation of K written in some freely distributable
language", or do you mean "an extension of K written in K"? If you
mean the first, there will be great rejoicing.

-Billy

sa@dfa.com — 2005-05-20 14:24:57

concatenative@yahoogroups.com wrote on 05/20/2005 09:28:06 AM:

> stevan apter <sa@...> wrote:
> > i've wondered what a lazy vector language would look
> > like, so i decided to try my hand at an implementation
> > of K which used combinator reduction techniques as
> > the inner evaluation engine.
>
> Ambitious.
>
> A question: when you say "an implementation of K", do you mean "an
> open source reimplementation of K written in some freely distributable
> language", or do you mean "an extension of K written in K"? If you
> mean the first, there will be great rejoicing.

then i'm afraid there must be wailing and lamentation and
gnashing of teeth, since i mean the second.

btw, here's the compiler.

given

x = a list of variables to abstract
e = parse tree for an expression (in SASL/K)
w = list of parse trees for 'where' clauses (local definitions)

the 'compile' function produces the appropriate combinator
tree. i use turner's combinator optimizations, so the target
set is:

K I B* B C* C S* S Y U

the evaluator has to do common subexpression matching and
"structure sharing", but if i'm right, all that should just
fall out of the representation, so (i'm hoping that) the
evaluator itself will only be a few lines of code. but
we'll see.

the language itself is the functional part of K + if-then-else
+ where.

-- compiler --

compile:{[x;e;w]abs/[e;|x]}

abs:{[e;x]:[`~*e;exp;"I"~*e;var;con][e]x}
con:{[c;x](`;`K;c)}
var:{[v;x]:[x=v;`I;(`;`K;v)]}
exp:{[e;x]opt .(1_ e)abs\:x}

opt:{
kx:``K~2#x
ky:``K~2#y
bx::[`=*x;``B~2#x 1;0]
by::[`=*y;``B~2#y 1;0]
iy:`I~y
:[kx&ky ;K
kx&iy ;I
kx&by ;B_
kx ;B
bx&ky ;C_
ky ;C
bx ;S_
S][x;y]}

K:{(`;`K;(`;x 2;y 2))}
I:{y;x 2}
B_:{(`;(`;(`;`B_;x 2);y[1]1);y 2)}
B:{(`;(`;`B;x 2);y)}
C_:{(`;(`;(`;`C_;x[1]2);x 2);y 2)}
C:{(`;(`;`C;x 2);y 2)}
S_:{(`;(`;(`;`S_;x[1]2);x 2);y)}
S:{(`;(`;`S;x);y)}

wh_:{[e;w]:[4:*w;wh1[e]. w;whn[e].+w]}
wh1:{[e;f;E]:[@f;wh0;f[0]_in,//E;why;whx][e;f]E}
wh0:{[e;f;E](`;abs[e]f;E)}
whx:{[e;f;E](`;abs[e]f 0;abs/[E;|1_ f])}
why:{[e;f;E](`;abs[e]f 0;(`;`Y;abs/[E;|f]))}
whn:{[e;f;E]:[(*:'f)_lin,//E;whY;whX][e;f]E}
whX:{[e;f;E](`;(`;`K;e)whU/|f;whL[f]E)}
whL:{[f;E]list@{abs/[x;|1_ y]}'[E;f]}
whU:{[e;f](`;`U;abs[e]f 0)}
whY:{[e;f;E](`;(`;`K;e)whU/|f;(`;`Y;(`;`K;whL[f]E)whU/|f))}

list:{(){(`;(`;:;y);x)}/|x,()}
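
for readers who don't read K, here is a rough python sketch of the core of the abstraction step. it is not a translation of the code above: it keeps only the K, I, B, C and S cases of opt, leaves out the starred combinators, Y, U and 'where', and adds a small tree reducer just to check the output. terms are strings (atoms) or pairs (function, argument); all names are made up for the example.

```python
def is_k(t):
    # a term of the form (K p), written as the pair ("K", p)
    return isinstance(t, tuple) and t[0] == "K"

def optimise(f, a):
    # f and a are the abstracted halves of an application; S f a is the fallback
    if is_k(f) and is_k(a):
        return ("K", (f[1], a[1]))     # S (K p) (K q) = K (p q)
    if is_k(f) and a == "I":
        return f[1]                     # S (K p) I     = p
    if is_k(f):
        return (("B", f[1]), a)        # S (K p) q     = B p q
    if is_k(a):
        return (("C", f), a[1])        # S p (K q)     = C p q
    return (("S", f), a)

def abstract(x, e):
    # returns a term without x that, applied to x, reduces to e
    if e == x:
        return "I"
    if not isinstance(e, tuple):
        return ("K", e)
    return optimise(abstract(x, e[0]), abstract(x, e[1]))

def compile_(variables, e):
    # abstract the last variable first, as abs/[e;|x] does
    for x in reversed(variables):
        e = abstract(x, e)
    return e

ARITY = {"I": 1, "K": 2, "S": 3, "B": 3, "C": 3}

def reduce_term(t):
    # tree reduction along the spine; no sharing, so not yet lazy graph reduction
    args = []
    while True:
        while isinstance(t, tuple):
            args.append(t[1])
            t = t[0]
        n = ARITY.get(t, 0)
        if n == 0 or len(args) < n:
            break
        a = [args.pop() for _ in range(n)]
        if t in ("I", "K"):
            t = a[0]
        elif t == "S":
            t = ((a[0], a[2]), (a[1], a[2]))
        elif t == "B":
            t = (a[0], (a[1], a[2]))
        else:
            t = ((a[0], a[2]), a[1])    # C f g x = f x g
    while args:
        t = (t, reduce_term(args.pop()))
    return t

code = compile_(["x", "y"], (("minus", "y"), "x"))
print(code)                                  # ('C', 'minus')
print(reduce_term(((code, "3"), "10")))      # (('minus', '10'), '3')
```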

