Language Notes
This page is a new idea of mine. The idea here is that I'll post whenever I have a hard-won piece of information to share. Feel free to email me if you spot an error or see an opportunity for clarification; I'll be happy to fix the page and give you credit.
Table of Contents
C
How to Write a Unicode-Sensitive Parser
So, you're thinking about writing a parser in C. Good! First step: ditch flex. It doesn't do Unicode now and it never will. I'm sure there's a tool out there that will generate a Unicode scanner for you, but flex can't. Fun note: Plan 9, the birthplace of UTF-8 and a pervasive user of Unicode, never managed to make lex(1) Unicode-compliant (see the BUGS section). The alternative proffered by Russ Cox et al.? Hand-code your scanner.
Step 1: Unicode Boilerplate
#include <locale.h>
#include <wchar.h>
…
int main(void)
{
    setlocale(LC_ALL, "");
…
The wchar.h include file gets you "wide characters." These are not guaranteed to be Unicode-compatible, but in practice they are, especially if you compile in C99 mode, which you should do anyway because you don't care about supporting HP-UX or OpenVMS. If you don't call setlocale(), which is declared in locale.h, before doing any I/O, nothing is going to work.
# Makefile
CFLAGS:=$(CFLAGS) -std=c99 -fms-extensions -finput-charset=UTF-8 -Wall
You're going to want the ms-extensions so you can create tagged unions without needing to name the internal union. The rest has to do with Unicode or showing errors.
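To make the tagged-union remark concrete, here is a minimal sketch of a token type whose payload lives in an unnamed inner union. The names token, token_kind, make_number and make_ident are made up for illustration. Unnamed unions like this were later standardized in C11, so this particular form does not strictly depend on -fms-extensions.

```c
#include <wchar.h>

/* Hypothetical token kinds for a small expression language. */
enum token_kind { TOK_NUMBER, TOK_IDENT };

/* A tagged union: `kind` says which member of the unnamed union is valid. */
struct token {
    enum token_kind kind;
    union {                     /* no member name, so fields are reached directly */
        double number;          /* valid when kind == TOK_NUMBER */
        const wchar_t *ident;   /* valid when kind == TOK_IDENT */
    };
};

/* Build a number token. */
struct token make_number(double value)
{
    struct token t;
    t.kind = TOK_NUMBER;
    t.number = value;           /* t.number, not t.u.number */
    return t;
}

/* Build an identifier token; the caller keeps ownership of `name`. */
struct token make_ident(const wchar_t *name)
{
    struct token t;
    t.kind = TOK_IDENT;
    t.ident = name;
    return t;
}
```

The payoff is purely notational: code that inspects a token writes t.number or t.ident after checking t.kind, with no intermediate member name to thread through every access.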
Step 2: Unicode in Your C
Now you can write C with embedded Unicode. Your lexer main function is probably going to look something like this:
int yylex()
{
    wint_t next;
    while ((next = fgetwc(stdin)) != WEOF)
    {
        // simple tokens
        if (next == L'+') return PLUS;
        else if (next == L'-') return MINUS;
        else if (next == L'×' || next == L'*') return TIMES;
        else if (next == L'÷' || next == L'/') return DIVIDES;
        else if (next == L'↑' || next == L'^') return RAISED;
…
Unfortunately, I don't have a lot of great advice for manually writing a lexer that can handle keywords, but if I come to it I'll be sure and share it here.
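For what it's worth, one common approach is to scan the whole word first and only then decide whether it is a keyword, by looking it up in a small table. The following is a sketch of that lookup, not hard-won advice: the token codes (IDENT, IF, ELSE, WHILE), the keyword list and the function name keyword_or_ident are all hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical token codes; a real parser would share these with its grammar. */
enum { IDENT = 256, IF, ELSE, WHILE };

/* Keyword spellings paired with their token codes. */
static const struct {
    const wchar_t *text;
    int token;
} keywords[] = {
    { L"if",    IF    },
    { L"else",  ELSE  },
    { L"while", WHILE },
};

/* Return the keyword token for `word`, or IDENT if it is not a keyword. */
int keyword_or_ident(const wchar_t *word)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (wcscmp(word, keywords[i].text) == 0)
            return keywords[i].token;
    }
    return IDENT;
}
```

A linear search is plenty for a handful of keywords. If the list grows, sorting the table and using bsearch() with a wcscmp-based comparator is a small change.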
Step 3: Use Wide Character Functions
All your favorite <string.h>, <ctype.h> and <stdio.h> functions are gone, at least as far as wide characters are concerned, but there are replacements available to you defined in <wchar.h> and <wctype.h>:
| ASCII | Unicode |
|---|---|
| char* | wchar_t* |
| isalpha() | iswalpha() |
| ungetc() | ungetwc() |
| strcmp() | wcscmp() |
If you dutifully use iswalpha et al. to define your tokens rather than checking for A-Z, your language will support Unicode identifiers and so forth with no additional effort on your part.
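As a rough sketch of how those functions fit together, here is an identifier scanner built on fgetwc(), iswalnum() and ungetwc(). The function name scan_identifier and its buffer-passing interface are assumptions of mine; the idea is that the lexer calls it once iswalpha() has accepted the first character.

```c
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Read an identifier whose first character `first` has already been consumed
 * from `in`. Stores at most cap - 1 characters plus a terminating L'\0' in
 * buf and returns the number of characters stored. Characters that do not
 * fit are consumed but dropped. The character that ends the word is pushed
 * back so the caller sees it on the next fgetwc().
 */
size_t scan_identifier(FILE *in, wint_t first, wchar_t *buf, size_t cap)
{
    size_t len = 0;
    wint_t next = first;

    if (cap == 0)
        return 0;

    /* iswalnum() consults the current locale, so call setlocale() first. */
    while (next != WEOF && iswalnum(next)) {
        if (len + 1 < cap)
            buf[len++] = (wchar_t)next;
        next = fgetwc(in);
    }
    if (next != WEOF)
        ungetwc(next, in);      /* push back the character that ended the word */

    buf[len] = L'\0';
    return len;
}
```

Inside a lexer loop this would be used along the lines of: if iswalpha(next) is true, call scan_identifier(stdin, next, buf, cap) and return an identifier token. Only one character of pushback is guaranteed by the standard, which is all this scanner needs.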
ML
I haven't done much of import with ML, by which I mean Standard ML, mostly because I am terrible at doing things, but also partly because it is basically impossible to get all the pieces together in the right places: there is no community of happy ML users hoping to convert you away from Haskell, and there are six dozen implementations in varying degrees of bit-rot.
It is, in short, heaven.
Choosing an Implementation
Oh boy, so many choices! For starters, try to get SML/NJ. It's practically the standard Standard ML, and it's usually the one people are talking about. While you're starting out, rely on the use function (as in use "file.sml"; at the prompt) and don't ask how you're going to compile things. We'll discuss compilation in a minute.
We are blessed with lots of other options. Fortunately, because Standard ML has formally defined semantics, most of your program is going to port just fine between implementations. There's also a formally defined standard library, called the SML Basis Library, which is mostly supported by most implementations. So here's a quickie breakdown of the different ML implementations and why you may want one of them over the others.
- SML/NJ
- "I don't care" or "I want the popular one"
- MLton
- "Performance is priority 1" or "I hate REPLs"
- Moscow ML
- "I love separate compilation" or "I want a small implementation"
- Poly/ML
- "I love Windows" or "I love image-based persistence"
- MLKit
- "I'm doing real-time programming" or "I hate GC pauses"
There are plenty of non-standard MLs out there; the big ones are Alice and SML# (no relation to .NET). Alice is fairly divergent, but the main new features are futures, constraints and pickling. SML# is a bit more modest, mainly improving polymorphism and adding type-checked database integration.
To mis-quote Hamdy, "we'll not discuss the lousy [O]C[aml]."
Of course, if you want a language like ML but not much like Standard ML, you probably are actually in the market for Haskell, in which case the implementation you want is GHC.
WTF is CM? How do I compile?
So, you're stuck with SML/NJ, because I or someone like me told you to use it, and now you want to compile a program. Well, there's good news and bad news. The bad news is that you may or may not be able to produce a binary executable. The good news is that even if you can't, you can arrange for an ML program to be executed outside the context of the interpreter.
CM
CM is the build system for SML/NJ. I don't think any other ML uses it, though MLton comes with a converter. All you need to know for the basics is this: make a file named sources.cm and put the following in it:
Group is
file1.ml
file2.ml
…
$/basis.cm
$/smlnj-lib.cm
Replace file1.ml and file2.ml with your source file names. Make sure one of your structures has a function that takes a pair, the program name and the list of command-line arguments, and returns an OS.Process.status, to serve as your top-level function.
Now run sml and try out your CM file like so:
$ sml
Standard ML of New Jersey v110.73 [built: Sun May 15 21:34:53 2011]
- CM.make "sources.cm";
[autoloading]
[library $smlnj/cm/cm.cm is stable]
[library $smlnj/internal/cm-sig-lib.cm is stable]
[library $/pgraph.cm is stable]
[library $smlnj/internal/srcpath-lib.cm is stable]
[library $SMLNJ-BASIS/basis.cm is stable]
[autoloading done]
[scanning sources.cm]
…
val it = true : bool
-
If you got false instead of true, examine the output and fix the bug. Otherwise, you can now proceed to build your executable like so:
$ ml-build sources.cm Progname.main progname
Standard ML of New Jersey v110.73 [built: Sun May 15 21:34:53 2011]
[scanning accrete.cm]
[library $SMLNJ-BASIS/basis.cm is stable]
…
[code: 305, data: 33, env: 40 bytes]
$
OK, this says to produce a new heap image named progname from the sources.cm file here, using Progname.main as the main function. (The function doesn't have to be called anything in particular, as long as loading sources.cm makes it reachable.)
Unfortunately, ml-build doesn't produce an executable; it produces something called a heap image, which is a snapshot of the program's heap, compiled code included, that only the SML/NJ runtime can load. If you're lucky and you have the program heap2exec installed, you can run that and produce a real binary:
$ heap2exec progname.x86-linux progname
heap2exec takes a heap image and produces a standalone binary. The heap image name is derived from the name you used above plus an architecture and operating system suffix; x86-linux here assumes 32-bit x86 Linux, so substitute whatever file ml-build wrote on your machine. If you don't have heap2exec, don't despair: you can instead make a wrapper script with code such as:
#!/bin/sh
exec sml @SMLload=progname "$@"
This instructs the interactive compiler to load the heap image we made a moment ago. Less clean and beautiful, but it works fine and gives you what you really care about: a program you can run from the command line.