Re: [stack] tokenizing

stevan apter — 2005-05-19 16:47:15

explaining this sort of thing sure does concentrate
the mind - i realized that (i) i wasn't handling the
empty case, and (ii) i can dispense with the start
state, which simplifies the code in certain respects.

a bit more detail on performance:

worst case: one token per character:

x:2000000#"-x+y" / 2m chars -> 2m tokens

\t tokens x / total time
1343

\t a:_ic x / ascii values - no time
0

\t 0 M\a / transitions
296

\t c:t cut[x]b / cut into tokens - ouch!
687

\t d:dbt c / delete blank tokens
187

it's hardly surprising that most of the time goes
into chopping the string into a list of strings.

---------------

U:( / character classes:
_ci,/97 65+\:!26 / alpha
"." / decimal/dmend
_ci 48+!9 / numeric
"\"" / quote
"\\" / escape/scan
"~!@#$%^&*-_=+|<,>?'/(){}[];" / K
":" / cons/monadify
"`") / symbol

U[0]:_ci(!256)_dvl 0,_ic,/1_ U / ~ blank

V:" ab0.9,ocxy`_+:-;" / states
I:V?/:" b,y_:" / endpoints

T:1_"
a+0o+++`
a+0o+++`
b+bo+++`
b+bo+++`
,a.0o+++`
,a+9o+++`
,a+9o+++`
,a+0o+++`
oooocxo+o
a+0o+++`
yyyyyyyyy
ooooooooo
_````````
_a+0o+++`
a-0o--:`
a-0o---`
a+0o++;`
a+Oo+++`" / states x classes

fsm:{{.[x;(;y);:;z]}/[(#x)#,&256;_ic y;+x]} / finite state machine
M:fsm[V?/:/:1_'(&T="\n")_ T]U / transition matrix

tokens:{dbt cut[x]0 M\_ic x}"", / compute transitions
dbt:{:[#x;x _di&x[;0]_lin*U;x]} / delete blank tokens
cut:{(&(~=)':I _binl y)_ x} / cut by endpoints


----- Original Message -----
From: "stevan apter" <sa@...>
To: <concatenative@yahoogroups.com>
Sent: Thursday, May 19, 2005 11:45 AM
Subject: [stack] tokenizing


> i've been studying david turner's language SASL, the
> ancestor of miranda and other FPLs. for some time now
> i've wondered what a lazy vector language would look
> like, so i decided to try my hand at an implementation
> of K which used combinator reduction techniques as
> the inner evaluation engine.
>
> at this point, i've implemented the tokenizer, the
> parser, and the compiler, and i'm working on the
> evaluation engine. having hacked a fair number of
> tokenizers over the last few years, none of which
> i found entirely satisfying, i decided to write one
> which was fast, small, general, and completely data-
> driven.
>
> the script is here:
>
> http://www.nsl.com/k/sasl/t.k
>
> in what follows, i'll explain my approach. there's
> nothing conceptually novel here, but the technique
> might come in handy for others who are writing their
> own languages. at the very least, the problem seems
> to recommend itself as a candidate for benchmarking.
>
> i define the fundamental character classes:
>
> A:_ci(97+!26),65+!26 / a-zA-Z
> N:_ci 48+!9 / 0-9
> Q:"\"" / the quote character "
> X:"\\" / the escape character \
> S:"`" / the symbol prefix `
> C:":" / colon :
> D:"." / dot .
> K:"~!@#$%^&*-_=+|<,>?'/(){}[];" / K operators
>
> (the K operators include X, C, and D - see O below)
>
> U:(;A;D;N;Q;X;K;C;S) / the character universe
> U[0]:_ci(!256)_dvl 0,_ic,/1_ U / U[0] is everything else
>
> U is a list of the character classes. everything in U[0]
> will be treated as equivalent to blank (" ").
>
> the strategy is to define a state-transition matrix M.
>
> applying M to the input string s will produce an integer
> vector r s.t. r[i] is the index of the state occupied
> by M when s[i] was detected.
>
> the names of the states of the SASL-K tokenizing machine
> are represented by a character vector (i.e. a string) V:
>
> V:" ab0.9,ocxy`_+:-;"
>
> the states are:
>
> <blank> i.e. anything in U[0]
> a initial A (i.e. alpha)
> b subsequent A's
> 0 digit - integer part
> . decimal point
> 9 digit - decimal part
> , blank following a digit
> o opening quote
> c closing quote
> x escape within a quotation
> y character immediately following an escape
> ` symbol or non-blank in a symbol-sequence
> _ blank immediately following a symbol-sequence
> + initial K operator
> : colon immediately following initial K operator
> - K operator immediately following an initial K operator
> ; colon immediately following a non-initial K operator
>
> the result of processing the input string will be an
> integer vector r s.t. r[i] is an index into the states.
> for example, the result of processing
>
> "12.3 abc"
>
> will be
>
> 4 4 5 6 7 2 3 3
>
> i.e. the indices of the states "00.9,abb". so we need a
> way of grouping the different substates:
>
> 4 4 5 6 7 2 3 3
> --------- -----
> number name
>
> the grouping can then be used to cut the input string
> into tokens:
>
> ("12.3 ";"abc")
>
> The endpoints of the substate groups are:
>
> " b,y_:"
>
> i.e. blank, non-initial-alpha, blank-following-numeric,
> character-following-escape, blank-following-symbol-
> sequence, and colon-following-initial-operator. (the
> end-point of the last substate group is implied.)
>
> the endpoints are represented as an integer vector:
>
> I:0,1+V?/:" b,y_:"
>
> I
> 0 1 3 7 11 13 15
>
> the state-transition matrix O is a states x classes
> character matrix. it is constructed in two stages. the
> idea is that we want to be able to edit the matrix
> in terms of V and U, but we need to convert that to
> a states x 256 integer matrix M, where the character
> classes are converted to their integer representations
> and mapped to multiple columns of M.
>
> O is:
>
> O:"
> a+0o+++`
> a+0o+++`
> b+bo+++`
> b+bo+++`
> ,a.0o+++`
> ,a+9o+++`
> ,a+9o+++`
> ,a+0o+++`
> oooocxo+o
> a+0o+++`
> yyyyyyyyy
> ooooooooo
> _````````
> _a+0o+++`
> a-0o--:`
> a-0o---`
> a+0o++;`
> a+Oo+++`"
>
> M starts in state 0. if it sees a blank it goes to
> the blank state; if it sees an alpha it goes to state a;
> if it sees a . it goes to state +; and so forth.
>
> M is computed this way:
>
> fsm:{{.[x;(;y);:;z]}/[(#x)#,&256;_ic y;+x]}
> M:fsm[1+V?/:/:1_'(&O="\n")_ O]U
>
> fsm is a function which takes an integer matrix x and
> the character-class universe y, and returns an integer
> matrix.
>
> there are 17 states:
>
> #V
> 17
>
> M is 18 x 256. the extra state is 0, the start state.
> there are no transitions back to that state.
>
> the tokenizing code is then quite simple:
>
> tokens:{dbt cut[x]0 M\_ic x}"", / compute transitions
> dbt:{x _di&x[;0]_lin*U} / delete blank tokens
> cut:{(&(~=)':I _binl y)_ x} / cut by endpoints
>
> a simple example:
>
> tokens"1 2 3+*/4"
> ("1 2 3"
> ,"+"
> ,"*"
> ,"/"
> ,"4")
>
> notice that "1 2 3" is a single token. this is why the
> machine contains the , state. the target language K
> supports vector notation for integers, floats, and symbols,
> so it would be wrong to split this into three tokens.
>
> x is the argument to the 'tokens' function, and contains
> the input string:
>
> x:"1 2 3+*/4"
>
> _ic is a primitive function which gives the ascii value for
> an array of characters of any dimensionality:
>
> _ic x
> 49 32 50 32 51 43 42 47 52
>
> the heart of the tokenizer is
>
> 0 M\_ic x
>
> M\ is two-dimensional pointer-chasing. the expression
>
> k:M[i;j]
>
> where i is the current-state-index and j the ascii index of
> the current-character. the expression returns k, the next-
> state-index. i.e. in state i, if you see character j, go to
> state k.
>
> 0 M\_ic x
>
> supplies the initial state-index i = 0, picks off the
> first character index j, returns k. i is set to k,
> j is set to the next character index, and a new k is
> computed. &c.
>
> the result of 0 M\_ic x is a vector of the state-indices
> visited in the course of execution:
>
> 0 M\_ic x
> 0 4 7 4 7 4 14 16 14 4
>
> the 'cut' function takes the input string x and the
> state-index vector y and cuts the string at the state-
> endpoints I:
>
> cut[x]0 M\_ic x
> ("1 2 3"
> ,"+"
> ,"*"
> ,"/"
> ,"4")
>
> finally, the 'dbt' function discards all tokens whose
> first element is in U[0], the class of characters
> equivalent to blank.
>
> on my 3 ghz machine, 'tokens' processes 2 million chars/
> sec:
>
> x:2000000#x
> \t tokens x
> 1062 / ms
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Yahoo! Groups Links
>
>
>
>
>
>

stevan apter — 2005-05-22 14:02:49

----- Original Message -----
From: <phimvt@...>
>
> If I remember correctly, the abstract SASL/Miranda/Haskell machine
> keeps the internal code as a tree (or is it a graph). Are you doing
> something similar? Such a tree would be rather different from
> anything vector-like, but for your sake I hope that I am wrong.
> You are a brave man to attempt this.

or foolhardy, which i suppose is often mistaken for courage. :)

no doubt you're recalling something like this: the reduction
rule for the S combinator is

S@f@g@x -> f@x@(g@x)

but x must not be evaluated more than once - once evaluated, it
must "remember" its value. so the implementation of the rule must
be something like this:

S@f@g@x -> f@<x>@(g@<x>)

where <x> is a reference to x.

at the moment, i'm mulling over how best to implement this
kind of behavior. the compiled object is indeed a tree -- a
nested list, e.g.

[@ [ @ [@ S f] g] x]

i did manage to locate a copy of peyton-jones' book a few
weeks ago, following your mention of it, but i think i'll
defer study of his (very detailed) discussion until i get
something up and running. conceptually, none of this stuff
is very difficult. most of the work is just visualizing
what's supposed to happen at each step, and then finding the
simplest implementation.

btw, k itself has a certain pseudo-lazy native construct:

a:10
b..d:"a+10"

the variable b is associated with the expression "a+10".
before b is evaluated, it is in the "invalid" state.
reference to invalid b causes b..d to be evaluated, the
result is stored in b, and b is marked "valid":

b+1
21
b
20

if a ever changes, b is remarked "invalid":

b
20
a:3 / b is marked invalid
b
13

this behavior is only defined for global names. as you
point out, in a fully lazy language the behavior exists
implicitly for all expressions.

>
> - Manfred
>
> [..]
>
>
>
>
>
> Yahoo! Groups Links
>
>
>
>
>
>

phimvt@lurac.latrobe.edu.au — 2005-05-30 04:29:12

I'll be away in Europe for five weeks. I won't have easy access to my
mail unless I venture to some internet cafe - but no guarantees.

I had hoped to respond in detail to replies by Stevan and Billy,
however I am running out of time. My apologies to both of you.

See you all in a while ..

- Manfred

sa@dfa.com — 2005-05-31 15:54:09

i suppose this is all a bit off-topic, since we're here to
discuss matters concatenative, but i thought i'd post a status-
report on my project to embed k in a lazy execution framework.
unfortunately, "lazy k" has been usurped by an interloper
(http://homepages.cwi.nl/~tromp/cl/lazy-k.html), so, for now,
i'm calling my project "SLACK" (Stevan's LAzy Compiled K),
partly to credit turner's SASL (Single ASsignment Language),
and partly because i just watched "team america", and nearly
everything seems funny right now.

at the moment, the code, test-files, and a few documents
found elsewhere on the net are here:

http://www.nsl.com/k/lk/

the implementation consists of a top-level script

http://www.nsl.com/k/lk/lk.k

which will eventually house the interactive shell for SLACK.

currently,

k lk

causes the script "test.lk" to be read and evaluated. the
results are returned to the k console in "raw" form.

the tokenizer:

http://www.nsl.com/k/lk/t.k

the parser:

http://www.nsl.com/k/lk/p.k

which uses a pairwise-binding parser to create the parse-
tree for k expressions:

http://www.nsl.com/k/lk/k.k

the compiler:

http://www.nsl.com/k/lk/c.k

which converts the SLACK parse-tree into a variable-free
execution-tree.

the lazy evaluation engine:

http://www.nsl.com/k/lk/e.k

one piece is missing from the engine, which i hope to write
over the next few days, viz. an algorithm to prime the cache
Z with common subexpressions in the execution-tree. currently,
Z is updated dynamically by a few combinators, e.g. S:

S f g x -> f[x]g[x]

x is cached in Z, and computed exactly once.

i was correct in thinking that e would be short, given a
proper representation for the execution-tree. over the last
few days, i've experimented with several different methods
for reducing the e-tree, all of which used some form of
stack to control execution. indeed, this is what the big
boys all use. but i finally settled on a recursive traversal
of the e-tree, which gets dynamically converted into a graph
in the course of execution (e.g. in the case of S above,
x is replaced by <x>, which is a pointer into Z, which effectively
turns the tree into a graph.)

is it lazy? you bet:

!-1

causes a k error, because ! takes only non-negative integers.

*x

takes the first of the list x.

*[10;!-1]

returns 10. i.e. !-1 is not evaluated.

SLACK consists of 19 of the 20 k primitives (following convention,
i use : for cons,) the six k adverbs / \ ' /: \:, and all the
datatypes except "lambda". square brackets are used for lists,
and there is new syntactic category (new for k, that is, but
found in both j and APL), viz. conjunctions. since k expressions
associate to the right (and there is no precedence for functions),
i've added the conjunction {, which is used as follows:

f{g x <-> (f g)x

i.e. f{g binds tighter than application.

SLACK should add little overhead to native k, and might even
be faster given common subexpression matching (to do) and caching
(done).

in any case, it should be fun to experiment with.

finally, i'm thinking that it might not be difficult to replace
the applicative parser with a concatenative one. i.e. map a
concatenative syntax to the SLACK parse-tree.