Over the past six months, I’ve been working on a programming language called Pinecone. It isn’t mature yet, but enough features already work to make it usable, including variables and user-defined structures. So I must be doing something right in making a completely new language.
Parsing
The parser turns a list of tokens into a tree of nodes
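To make the tokens-to-tree step concrete, here is a minimal recursive descent sketch. It assumes a tiny arithmetic grammar, not Pinecone’s actual one, and the names `Token`, `Node`, `Parser` and `show` are all hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token and node types; Pinecone's real ones are not shown in this post.
struct Token {
    enum Kind { Number, Operator, LeftParen, RightParen } kind;
    std::string text;
};

struct Node {
    std::string text;                   // operator symbol or number literal
    std::unique_ptr<Node> left, right;  // both null for a number leaf
};

class Parser {
public:
    explicit Parser(std::vector<Token> tokens) : toks(std::move(tokens)) {}

    // Turns the whole token list into one tree, or throws on bad input.
    std::unique_ptr<Node> parse() {
        auto root = parseSum();
        if (pos != toks.size())
            throw std::runtime_error("unexpected token: " + toks[pos].text);
        return root;
    }

private:
    std::vector<Token> toks;
    std::size_t pos = 0;

    // True if the current token is a one-character operator listed in ops.
    bool atOperator(const std::string& ops) const {
        return pos < toks.size() && toks[pos].kind == Token::Operator &&
               toks[pos].text.size() == 1 &&
               ops.find(toks[pos].text[0]) != std::string::npos;
    }

    static std::unique_ptr<Node> makeNode(std::string text,
                                          std::unique_ptr<Node> l,
                                          std::unique_ptr<Node> r) {
        auto n = std::make_unique<Node>();
        n->text = std::move(text);
        n->left = std::move(l);
        n->right = std::move(r);
        return n;
    }

    // sum := product (('+' | '-') product)*
    std::unique_ptr<Node> parseSum() {
        auto left = parseProduct();
        while (atOperator("+-")) {
            std::string op = toks[pos++].text;
            auto right = parseProduct();
            left = makeNode(op, std::move(left), std::move(right));
        }
        return left;
    }

    // product := atom (('*' | '/') atom)*
    std::unique_ptr<Node> parseProduct() {
        auto left = parseAtom();
        while (atOperator("*/")) {
            std::string op = toks[pos++].text;
            auto right = parseAtom();
            left = makeNode(op, std::move(left), std::move(right));
        }
        return left;
    }

    // atom := Number | '(' sum ')'
    std::unique_ptr<Node> parseAtom() {
        if (pos >= toks.size()) throw std::runtime_error("unexpected end of input");
        const Token& t = toks[pos++];
        if (t.kind == Token::Number) return makeNode(t.text, nullptr, nullptr);
        if (t.kind == Token::LeftParen) {
            auto inner = parseSum();
            if (pos >= toks.size() || toks[pos].kind != Token::RightParen)
                throw std::runtime_error("expected ')'");
            ++pos;
            return inner;
        }
        throw std::runtime_error("unexpected token: " + t.text);
    }
};

// Prints a tree in prefix form, which makes the nesting visible.
std::string show(const Node& n) {
    if (!n.left) return n.text;
    return "(" + n.text + " " + show(*n.left) + " " + show(*n.right) + ")";
}
```

Calling `show(*Parser(tokens).parse())` on the tokens for `1 + 2 * 3` gives `(+ 1 (* 2 3))`: multiplication binds tighter because `parseSum` delegates to `parseProduct`.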
Flex
A program that generates lexers
Decision
In the end, I didn’t see significant benefits to using Flex, at least not enough to justify adding a dependency and complicating the build process.
LLVM
It’s basically a library that will turn your language into a compiled executable binary.
- While not assembly-language hard, it is gigantic-complex-library hard.
- It’s not impossible to use, but you’ll need some practice before you’re ready to fully implement a Pinecone compiler.
Build Your Own Compiler
Because of the number of architectures and operating systems, it is impractical for any individual to write a cross-platform compiler backend.
Compiled vs Interpreted
There are two major types of languages: compiled and interpreted
- Compiled: A compiler figures out everything a program will do, turns it into machine code, and saves that to be executed later
- Interpreted: An interpreter steps through the source code line by line, figuring out what it’s doing as it goes
- Interpreting tends to be more flexible, while compiling tends to give higher performance
- Pinecone uses both
Choosing a Language
If you are writing an interpreted language, it makes a lot of sense to write it in a compiled one (like C, C++ or Swift). Otherwise, the performance lost in your interpreter and in the interpreter that is interpreting your interpreter will compound.
- If you plan to compile, a slower language (like Python or JavaScript) is more acceptable.
Lexing
The first step in most programming languages is lexing, or tokenizing.
Getting Started
“I have absolutely no idea where I would even start” is something I hear a lot when I tell other developers I’m writing a language.
- In case that’s your reaction, here are some initial decisions that are made and steps that are taken when starting any new language.
Transpiling
I wrote a Pinecone-to-C++ transpiler and added the ability to automatically compile the output source with GCC.
- Currently works for almost all Pinecone programs
- It is not a particularly portable or scalable solution, but it works for the time being
Conclusion
If in doubt, go interpreted
- Interpreted languages are generally easier to learn and design
- When it comes to lexers and parsers, do whatever you want
- Learn from the pipeline I ended up with
- Right now Pinecone is in a good enough state that it functions well and can be easily improved
Bison
The predominant parsing library is Bison.
Why Custom Is Better
Minimize context switching in workflow: context switching between C++ and Pinecone is bad enough without throwing in Bison’s grammar grammar
- Keep build simple: every time the grammar changes, Bison has to be run before the build
- A custom parser may not be trivial, but it is completely doable
Task of the Lexer
The lexer takes in a string containing an entire file’s worth of source code and spits out a list containing every token.
- Future stages of the pipeline will not refer back to the original source code, so the lexer must produce all the information needed by them.
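A hand-written lexer along these lines can be quite short. The sketch below is an assumption about the general shape, not Pinecone’s lexer: the `Token` layout and the three token kinds are hypothetical. Each token carries its line number, since later stages cannot look back at the source to work out where an error occurred.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token layout, not Pinecone's actual one.
struct Token {
    enum Kind { Identifier, Number, Operator } kind;
    std::string text;
    int line;  // kept so later stages can report errors without the source
};

// Takes the whole source file as one string and returns every token in order.
std::vector<Token> lex(const std::string& source) {
    std::vector<Token> tokens;
    int line = 1;
    std::size_t i = 0;
    while (i < source.size()) {
        unsigned char c = static_cast<unsigned char>(source[i]);
        if (c == '\n') { ++line; ++i; continue; }
        if (std::isspace(c)) { ++i; continue; }

        std::size_t start = i;
        Token::Kind kind;
        if (std::isalpha(c) || c == '_') {
            kind = Token::Identifier;
            while (i < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[i])) || source[i] == '_'))
                ++i;
        } else if (std::isdigit(c)) {
            kind = Token::Number;
            while (i < source.size() && std::isdigit(static_cast<unsigned char>(source[i])))
                ++i;
        } else {
            kind = Token::Operator;  // any other single character
            ++i;
        }
        tokens.push_back({kind, source.substr(start, i - start), line});
    }
    return tokens;
}
```

For the input `"x = 12\ny + 3"` this yields six tokens, with `y` tagged as being on line 2.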
High Level Design
A programming language is generally structured as a pipeline
- Each stage has data formatted in a specific, well defined way
- It also has functions to transform data from each stage to the next
- The first stage is a string containing the entire input source file, and the final stage is something that can be run
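The pipeline idea can be sketched with one data type per stage and one function per transition. Everything here is hypothetical, and the toy language only understands whitespace-separated integers joined by `+`, but the shape is the point: string in, something runnable out.

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// One data type per stage (all names are hypothetical).
using Source = std::string;               // first stage: the whole input file
using Tokens = std::vector<std::string>;  // after lexing
struct Tree { std::vector<int> terms; };  // after parsing: a flat "sum" node
using Runnable = std::function<int()>;    // final stage: something that can be run

// Source -> Tokens: split on whitespace.
Tokens lex(const Source& src) {
    std::istringstream in(src);
    Tokens out;
    for (std::string word; in >> word;) out.push_back(word);
    return out;
}

// Tokens -> Tree: keep the numbers, skip the "+" separators.
// std::stoi throws if a token is not a number; a real parser would report it.
Tree parse(const Tokens& toks) {
    Tree tree;
    for (const std::string& t : toks)
        if (t != "+") tree.terms.push_back(std::stoi(t));
    return tree;
}

// Tree -> Runnable: capture the tree in a callable that evaluates it.
Runnable build(const Tree& tree) {
    return [tree] {
        int sum = 0;
        for (int v : tree.terms) sum += v;
        return sum;
    };
}

// The whole pipeline is just the stage functions composed in order.
Runnable compilePipeline(const Source& src) { return build(parse(lex(src))); }
```

With these definitions, `compilePipeline("1 + 2 + 39")()` evaluates to 42. Because each stage only sees the previous stage’s output, any one of them can be replaced without touching the others.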
Action Tree
Most akin to LLVM’s IR (intermediate representation).
Action Tree vs AST
The action tree is the AST with context
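One way to read “the AST with context” is sketched below. This is my illustration under assumptions, not Pinecone’s actual data structures: `AstNode`, `Symbol`, `Context`, `Action`, `resolve` and `run` are all hypothetical names. The AST node knows only what the source text said, while the action node has had its name resolved to a type and a storage slot.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// AST node: only what the source text said.
struct AstNode {
    std::string name;  // e.g. a variable name
};

// Context gathered while walking the program (hypothetical layout).
struct Symbol {
    std::string type;
    std::size_t slot;
};
using Context = std::map<std::string, Symbol>;

// Action node: the same reference with its context resolved.
struct Action {
    std::string type;  // the type is now known
    std::size_t slot;  // where the value lives at run time
};

// AST + context -> action. Unknown names are caught here, before anything runs.
Action resolve(const AstNode& node, const Context& ctx) {
    auto it = ctx.find(node.name);
    if (it == ctx.end()) throw std::runtime_error("unknown name: " + node.name);
    return Action{it->second.type, it->second.slot};
}

// Running an action needs no name lookups at all.
int run(const Action& a, const std::vector<int>& memory) { return memory.at(a.slot); }
```

Under this reading, errors such as an undefined variable surface when the action tree is built, and executing the tree afterwards is a matter of following already-resolved references.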
Future
Pinecone will get LLVM compiling support eventually.
- Until then, the interpreter is great for trivial programs and C++ transpiling works for most things that need more performance
- It’s just a matter of when I have time to make some sample projects in LLVM and get the hang of it