Oxiida
oxiida is a DSL for building and controlling tasks running on remote resources.
It is opinionated about running scientific workflows in a high-throughput manner.
- oxiida supports constructing and running tasks/jobs with lifetimes ranging from seconds to months.
- oxiida supports running tasks/jobs on a local machine, in the cloud, or on HPC.
- oxiida records the full provenance of the data flow for future backtracing.
- The oxiida language can be embedded into Python, (coming soon) Julia, and (planned) Lua to power up and standardize your current workflow setup.
- (coming soon) The oxiida runtime has native support for the Workflow Definition Language (WDL), the Common Workflow Language (CWL), and Nextflow workflows (nf).
Teleports
- Jump to Getting started if you are new to Oxiida.
- Jump to Embed in Python if you need to orchestrate your Python tasks into a workflow.
- For questions and discussions, please join the oxiida Zulip chat; I am delighted to hear from you!
Quick start guide
In the quick start guide, you can see two examples:
Tasks are executed as part of a workflow, which can be constructed just as intuitively as building arithmetic expressions or chaining commands in a shell pipeline. Additionally, every workflow automatically records data processing, enabling backtracing and review in the future.
For both examples, the workflow can be embedded into Python, Julia, and Lua.
Before you start, make sure you have Oxiida installed.
- To simply run workflows with Oxiida, download the Oxiida binary here and place it somewhere in your system’s PATH.
- If using Oxiida from Python, Julia, or Lua, install the oxiida library for your language of choice.
For detailed installation instructions, see the installation guide.
Command‑Line Notation
In this guide, lines that you should type in a terminal start with $. You don't need to type the $ character itself; it's the prompt. Lines that don't start with $ show the output of the previous command. PowerShell-specific examples use > rather than $.
Perform an arithmetic operation
Suppose you want to evaluate an arithmetic expression, with the requirement that the full data flow is recorded for future provenance.
(6 + 4) * -5 / 2^2
In oxiida you do exactly the same: write the following code to a file named arithmetic.ox (or name it whatever you like, it is just a text file).
x = (6 + 4) * -5 / 2^2;
print x;
You run it through the oxiida binary
$ oxiida run --storage file arithmetic.ox
You then get the result -12.5 and a provenance_dump.csv file recording how data nodes are connected with operations.
The --storage file option tells Oxiida to write the data flow into a file (by default named provenance_dump.csv) in the same directory where you run the command.
If the option is not provided, the information is dumped to stdout.
NOTE: Persisting the data flow to a CSV file is only for prototype demonstration purposes. It mainly shows that I can easily implement a persistence approach and support multiple writers without using a mutex, because the actor pattern is used here. The planned feature is to support SQLite as the default data storage backend.
Perform a pipeline of shell tasks
Suppose you want to run two shell pipelines, each built from multiple shell commands, with the requirement that the full data flow is recorded for future provenance. Moreover, the two pipelines are independent and may take a bit longer to finish, and you don't want to wait for one to finish before starting the other.
The example shell pipelines look like this:
$ echo -e "apple\nbanana\napple\norange\nbanana\napple" | sort | uniq -c | sort -nr
$ echo
$ sleep 2
$ echo -e "rust\npython\njulia\nrust\nrust\njulia\nlua\nlua" | sort | uniq -c | sort -nr
$ echo
$ sleep 2
You wait for about 4 seconds and then see the output:
3 apple
2 banana
1 orange
3 rust
2 lua
2 julia
1 python
Let's save 2 seconds of your life, while keeping a record of what you are doing:
para {
seq {
out = shellpipe {
"echo" "-e" "apple\nbanana\napple\norange\nbanana\napple" | "sort" | "uniq" "-c" | "sort" "-nr"
};
print out.stdout;
// `shellpipe` is actually syntax sugar for a combination of `shell` calls
shell {"sleep", ["2"]};
}
seq {
out = shellpipe {
"echo" "-e" "rust\npython\njulia\nrust\nrust\njulia\nlua\nlua" | "sort" | "uniq" "-c" | "sort" "-nr"
};
print out.stdout;
shell {"sleep", ["2"]};
}
}
Save it as para_pipe_shell.ox and run it with:
$ oxiida run para_pipe_shell.ox
You get the same output, but depending on which seq (for "sequence") block finishes first, you may get the language ranking before the fruit ranking.
Perform tasks with control flow
Oxiida has native support for control flow (it is a language, so of course it does).
The loop control flow has while and for syntax support, and there is the regular if..else with else if, without sugaring it as elif.
Here is a made-up set of shell-call tasks that uses all of them.
out = shellpipe {"shuf" "-n1" "-e" "good day" "bad day" "soso day" | "tr" "-d" "'\n'"};
print out.stdout;
if (out.stdout == "good day") {
print "I can do some scientific stuff.";
} else if (out.stdout == "soso day") {
print "I need to get a beer.";
} else {
// it is a bad day man
x = 0;
while (x < 5) {
print "I need to do some rust!";
x = x + 1;
}
for uhr in ["23:00", "24:00", "01:00", "02:00"] {
print "I keep on rust at:" + uhr;
}
}
Example of running workflow in Python
Oxiida is published on PyPI; install it with:
$ pip install oxiida
Similar to the arithmetic example, now I need to run a function (let's say I want to do some high school math with math.sin(x), plus a custom function that computes some nonsense stuff), and this function is defined or imported in Python.
I can run it from Oxiida with multithreading support.
Create a Python script wf.py with the following content.
import oxiida
import math
from time import sleep, time, strftime

def super_complex_operation(x: float, y: float, z: float) -> float:
    intermediate1 = math.sin(x) * math.log1p(abs(y))
    intermediate2 = math.exp(-z) + x ** 2
    # Sleep 2 sec to demonstrate that the two calls can run concurrently without writing multithreading control code.
    sleep(2)
    result = (intermediate1 + intermediate2) / (1 + abs(z - y))
    print("time:", strftime("%Y-%m-%d %H:%M:%S"))
    return result
# language = oxiida
workflow = """
require super_complex_operation;
require time, sleep;
para {
print "--anchor--";
seq {
print(super_complex_operation(10, 3, 6));
}
seq {
print(super_complex_operation(5.4, 3, 7));
}
}
"""
if __name__ == '__main__':
    oxiida.run(workflow)
Then run it by
$ python wf.py
The power of running Python functions within Oxiida is that you do not need to write multithreading code or async code to bridge with synchronous Python code.
Using the para block syntax allows both function calls to run concurrently in separate threads (although heavy CPU-load tasks are still dragged down by the GIL).
Once Python removes the GIL limit, Oxiida will leverage no-GIL CPython and automatically gain full power.
Cheatsheet
Builtin Tasks and Operations
Category | Description |
---|---|
Binary operations | + , - , * , / , ^ |
Unary operations | + (number/string), - (numbers only), ! (booleans) |
Comparison operations | > , >= , < , <= , == , != |
Logic operations | and , or (short-circuit) |
shell | Run shell command |
shellpipe | Run shell pipeline (syntax sugar for shell ) |
Builtin Types
Type | Description |
---|---|
Number | internally represented as an f64 (for both integers and floats) |
Boolean | either true or false |
String | Anything inside double quotes |
Array | Same type items in [] , separated by , (e.g., ["monday", "sunday"] ) |
Mapping | Not directly defined or passed to functions, but can be returned. Access attributes using . by key (e.g., object.key ). |
Grammar Rules
These are the detailed grammar rules; only come here when something is really hard to figure out and you want to work it out yourself. Otherwise, join the Oxiida Zulip and ask me.
Program structure
Non-terminal | Production |
---|---|
Program | Statement* |
Statements
Non-terminal | Production |
---|---|
Statement | SimpleStmt ";" \| Block \| IfStmt \| WhileStmt \| ForStmt |
SimpleStmt | Expression \| "print" Expression \| "require" IdentifierList |
Block | "seq" "{" Statement* "}" \| "para" "{" Statement* "}" \| "{" Statement* "}" |
IfStmt | "if" "(" Expression ")" Statement \| "if" "(" Expression ")" Statement "else" Statement |
WhileStmt | "while" "(" Expression ")" Statement |
ForStmt | "for" Expression "in" Array Statement |
IdentifierList | identifier ( "," identifier )* |
Expressions — precedence (low → high)
Level | Non-terminal | Production |
---|---|---|
8 | Assignment | Shell "=" Expression \| Logic |
7 | Logic | Equality { ( "and" \| "or" ) Equality } |
6 | Equality | Compare { ( "==" \| "!=" ) Compare } |
5 | Compare | Term { ( ">" \| ">=" \| "<" \| "<=" ) Term } |
4 | Term | Factor { ( "+" \| "-" ) Factor } |
3 | Factor | Power { ( "*" \| "/" ) Power } |
2 | Power | Unary { "^" Unary } |
1 | Unary | ( "+" \| "-" \| "!" ) Unary \| Attribute |
Primary & postfix
Non-terminal | Production |
---|---|
Attribute | CallPostfix { "." identifier } |
CallPostfix | Primary { "(" [ Expression { "," Expression } ] ")" } |
Primary | "true" \| "false" \| "nil" \| number \| identifier \| string \| "(" Expression ")" \| Array \| Shell \| PipeShell |
Composite literals & shell constructs
Non-terminal | Production |
---|---|
Array | "[" [ Shell { "," Shell } ] "]" |
PipeShell | "shellpipe" "{" PipeShellExpr "}" |
PipeShellExpr | PipeShellExpr "\|" ShellUnit \| ShellUnit |
ShellUnit | string string* |
Shell | "shell" "{" Term [ "," "[" Term { "," Term } "]" ] [ "," Shell ] "}" \| Logic |
Tokens / Lexical categories
Category | Lexemes (examples) |
---|---|
Keywords | if , else , while , for , seq , para , print , require , and , or , in , true , false , nil , shell , shellpipe |
Operators | = == != < <= > >= + - * / ^ ! . |
Delimiters | ( ) { } [ ] |
Literals | number (f64), string |
Identifiers | identifier |
Operator-precedence recap
Precedence | Operators |
---|---|
1 (highest) | + - ! (unary) |
2 | ^ |
3 | * / |
4 | + - (binary) |
5 | < <= > >= |
6 | == != |
7 | and or |
8 (lowest) | = |
Getting Started
Let’s start your Oxiida journey! There’s quite a bit to learn, but every journey starts somewhere. In this chapter, I’ll discuss:
- Installing Oxiida on Linux and macOS.
- Writing a program that prints Hello, world!.
- Writing a program that runs a useful shell pipeline with provenance.
Installation
Before you start, make sure you have Oxiida installed.
Pre-compiled binaries
Download the Oxiida binary from SourceForge and place it somewhere in your system's PATH.
From a programming language
If you intend to call Oxiida from Python, Julia, or Lua, install the corresponding oxiida library:
Language | Package manager command |
---|---|
Python | pip install oxiida |
Julia (not yet available) | import Pkg; Pkg.add("Oxiida") |
Lua (not yet available) | luarocks install oxiida |
Troubleshooting
To check whether you have Oxiida installed correctly, open a shell and enter this line:
$ oxiida --version
You should see the version number.
oxiida x.y.z
If you see this information, you have installed Oxiida successfully! If you don't see this information, check that Oxiida is in your PATH system variable as follows.
On Linux and macOS, use:
$ echo $PATH
If that's all correct and Oxiida still isn't working, please join the Zulip chat and ask questions directly.
Hello, world!
Now that you’ve installed Oxiida, it’s time to write your first Oxiida script.
It’s traditional when learning a new language to write a little program that
prints the text Hello, world!
to the screen, so we’ll do the same here!
Note: This book assumes basic familiarity with the command line. Oxiida makes no specific demands about your editing or tooling or where your code lives, so if you prefer to use an integrated development environment (IDE) instead of the command line, feel free to use your favorite IDE. Unfortunately, I don't have time to implement an LSP for Oxiida yet; it is planned and will become available after the basic language primitive constructs are fixed.
Creating a Project Directory
You’ll start by making a directory to store your Oxiida code. It doesn’t matter to Oxiida where your code lives, but for the exercises and projects in this book, I suggest making a projects directory in your home directory and keeping all your projects there.
Open a terminal and enter the following commands to make a projects directory and a directory for the “Hello, world!” project within the projects directory.
For Linux, macOS, and PowerShell on Windows, enter this:
$ mkdir ~/projects
$ cd ~/projects
$ mkdir hello_world
$ cd hello_world
Writing and Running an Oxiida script
Next, make a new source file and call it main.ox. Oxiida files always end with the .ox extension. If you’re using more than one word in your filename, the convention is to use an underscore to separate them. For example, use hello_world.ox rather than helloworld.ox.
Now open the main.ox file you just created and enter the code in Listing 1-1.
print "Hello, world!";
Save the file and go back to your terminal window in the ~/projects/hello_world directory. On Linux or macOS, enter the following command to run the file:
$ oxiida run main.ox
Hello, world!
Regardless of your operating system, the string Hello, world! should print to the terminal. If you don't see this output, refer back to the "Troubleshooting" part of the Installation section for ways to get help.
If Hello, world! did print, congratulations! You've officially written an Oxiida script. That makes you an Oxiida programmer. Welcome!
The line in listing 1-1 does all the work in the little script: it prints text to the screen. There are three important details to notice.
First, the print keyword leads a print statement, which prints the evaluated value of the expression that follows it.
Second, you see the "Hello, world!" string. It is an expression, and the value returned from its evaluation is printed to the screen by the print keyword.
Third, the statement ends the line with a semicolon (;), which indicates that this expression is over and the next one is ready to begin. Most lines of Oxiida code end with a semicolon.
A shell pipeline
Now, let's move to a real-world case: orchestrating some shell commands into a workflow.
The shell task is one of the basic built-in tasks provided by oxiida. A raw shell call is constructed as:
out = shell {"echo", ["-ne", "Hello, Oxiida!"]};
print out.stdout;
print "--shell pipeline sugar--";
out = shellpipe { "echo" "-e" "apple\nbanana\napple\norange\nbanana\napple" | "sort" | "uniq" "-c" | "sort" "-nr" };
print out.stdout;
Oxiida tutorial
Variables
The scope and shadowing
x = 5;
print x;
para {
seq {
x = x + 1;
print x;
}
seq {
x = x + 1;
print x;
}
}
print x;
Data Types
Comments
Control Flow
Blocks introduced by for and if..else use braces { ... }.
A brace block always starts a new local scope, so any variable you assign inside it becomes a fresh local binding even when a variable with the same name already exists outside.
A typical if..else block looks like this,
x = 11;
if (x > 10) {
x = x + 1;
y = 0;
print "1st if branch";
print x; // output: 12.0
} else if (x < 2) {
seq {
print "else if branch";
print x;
}
} else {
print "else branch";
print x;
}
// error, y not defined. different from python/julia
// print y;
// x is the outer scope x
print x; // output: 11.0
An example for loop looks like this,
x = 0;
for x in [1, 2, 3, 4] {
print x + 1;
y = x + 1;
print y;
}
print "-------";
print x; // output: 0.0
// y is not defined in the outer scope.
// print y;
A while loop, by contrast, is closed with the keyword end and does not create a new scope.
This lets you declare the loop variable before the loop, update it inside the loop body, and keep using it afterward.
The while loop looks like so,
x = 0;
while (x < 10)
x = x + 1;
print x;
end
print "----";
print x; // print 10.0
The language has no separate declaration keywords like let or var.
A simple assignment both declares and assigns if the variable is not found in the scope.
Because of this, the block delimiters themselves signal scope changes: braces mean "new scope", while while..end means "stay in the current one".
With this scheme, a for loop behaves like Julia (the loop variable is local), whereas a while loop behaves like Python (the variable remains shared).
The result is a minimal grammar that makes scope boundaries visually obvious and mutation rules predictable.
Builtin tasks
Parallel execution
Error handling
Data Persistence
Provenance Introspection
Command Line Tool
run
daemon
start the daemon
$ oxiida daemon start
submit workflows to the daemon to run
$ oxiida submit wf.ox
workflow
task
storage
Oxiida embed in Python
Run in python script
Call Python function
Data Types
Oh! GIL! Un-painful parallelism
Multi-threading for IO-bounded tasks
In Python, you can use multithreading to unlock some performance for IO-bound instructions.
For instance, look at the example below: io_work() calls the time.sleep() function, which is an IO-bound operation.
However, if Python runs the calls in sequence, this sleep blocks the other operations until it finishes.
Using multithreading frees the CPU by not blocking on this type of IO task.
import oxiida
from time import time, perf_counter, sleep
def count_duration(t0: float) -> float:
    return perf_counter() - t0

def io_work():
    sleep(1)
# How long do two runs take?
pt0 = perf_counter()
io_work()
io_work()
duration = count_duration(pt0)
print(f"run in sequence: {duration:.2f} s")
It runs in 2 s in total. Not very good: the two calls could be sleeping at the same time.
Let's then use multithreading:
# extra import of threading from python stdlib
import threading
pt0 = perf_counter()
t1 = threading.Thread(target=io_work)
t2 = threading.Thread(target=io_work)
t1.start()
t2.start()
t1.join()
t2.join()
duration = count_duration(pt0)
print(f"two threads: {duration:.2f} s")
Nice, it only takes 1s now, all good.
In Oxiida, if the Python FFI calls run inside a para block, they are spawned on separate threads.
import oxiida
from time import time, perf_counter, sleep
def count_duration(t0: float) -> float:
    return perf_counter() - t0

def io_work():
    sleep(1)
# language = oxiida
workflow = """
require io_work, count_duration, perf_counter;
require time;
print "-- multiple tasks (run in sequence) --";
t0 = perf_counter();
seq {
io_work();
io_work();
}
print count_duration(t0);
print "-- multiple tasks (multithreading with gil) --";
t0 = perf_counter();
para {
io_work();
io_work();
}
print count_duration(t0);
"""
if __name__ == '__main__':
    print("-- running workflow --")
    pt0 = perf_counter()
    oxiida.run(workflow)
    # you can get the time count and print it in python as well
    _ = count_duration(pt0)
If io_work is called from a seq block, the calls run in sequence; if they are called from a para block, they run concurrently in a multithreaded manner.
Pretty neat, no? You can avoid concurrent-programming language primitives such as managing join handles.
This fits a workflow language well.
Unlike a generic programming language, a workflow language aims to run tasks to the end rather than mutate the intermediate state of a task.
That is to say, at the end of the scope the handles are joined automatically to make sure the tasks run to completion.
Multi-processing for CPU-bounded task
However, the elephant in the room is Python’s notorious Global Interpreter Lock (GIL). For CPU-bound code, a single process can run only one thread at a time, so extra threads can’t step in to share the work.
# Heavy CPU call to show that the GIL still holds and cannot be bypassed
def cpu_work():
    total = 0
    for i in range(100_000_000):  # purely CPU, no I/O, no sleep; takes about 3 s on my Intel i7 CPU
        total += i * i
# 1) How long does one run take?
pt0 = perf_counter()
_ = cpu_work()
duration = count_duration(pt0)
print(f"single thread: {duration:.2f} s")
# 2) Now run two of exactly the same job in “parallel” threads
pt0 = perf_counter()
t1 = threading.Thread(target=cpu_work)
t2 = threading.Thread(target=cpu_work)
t1.start()
t2.start()
t1.join()
t2.join()
duration = count_duration(pt0)
print(f"two threads: {duration:.2f} s")
Multithreading not only fails to accelerate the workload, it can actually slow the program down compared with running the tasks sequentially. The frequent context switches that threads introduce add overhead. You can fall back on Python’s built-in multiprocessing, but other workflow engines can’t simply toggle it on—they’re still restricted by the expressiveness of Python’s own language constructs.
By embedding an Oxiida workflow in your Python script, you can build a CPU-bound pipeline that sidesteps the GIL: the Oxiida runtime transparently forks and orchestrates multiple OS processes through its Rust backend.
import oxiida
from time import time, perf_counter, sleep
def count_duration(t0: float) -> float:
    return perf_counter() - t0

def cpu_work(idx: str):
    total = 0
    for i in range(100_000_000):
        total += i * i
    print(f"finish! task #{idx}")
# language = oxiida
workflow = """
require cpu_work, count_duration, perf_counter;
require time;
print "-- multiple tasks (multiprocessing) --";
t0 = perf_counter();
para {
=cpu_work=("1st");
=cpu_work=("2nd");
// or add more if you have more cores, it still finished with same amount of time.
//=cpu_work=("more");
//=cpu_work=("and more");
//=cpu_work=("and mor more");
}
print count_duration(t0);
"""
if __name__ == '__main__':
    print("-- running workflow --")
    pt0 = perf_counter()
    oxiida.run(workflow)
    _ = count_duration(pt0)
The only difference here is the "cat-like" syntax I introduced, which flags that a function is sent to run in a forked process.
If you open top/htop to monitor the CPU/thread usage, the contrast is clear: with threads, the GIL keeps each process at roughly 50% CPU usage.
But with forked processes, the CPU usage is 100% and the whole workload finishes in essentially the same time as a single task.
For developers
Command Line Tool
Designs
Real-world goals
- Quantum ESPRESSO simulations, but easier.
- SSSP verification, but with minimal manual labor.
- Bioinformatics using container technologies.
- Machine learning using hyperqueue.
Milestones
- run cp2k simulations on a local laptop with a native container runtime integrated (bare-machine launch).
- run cp2k simulations on HPC through SSH (let QE go wherever it is).
- run Python-based machine learning on a hybrid local + GPU cluster.
- run a typical bioinformatics RNA data-processing pipeline on the cloud.
- support WDL by parsing the spec into my AST.
- run some examples through hyperqueue.
Roadmap
In the prototype phase, it is nice to show the power of the new self-baked syntax and runtime with the following features.
- process as the inner task driver, constructed as a generic state machine.
- trivial arithmetic task wrapped as a process for consistency.
- shell task that can be paused/resumed/killed asynchronously.
- base syntax tree to evaluate arithmetic expressions.
- base syntax tree representation to evaluate shell commands.
- pipeline shell tasks through the syntax tree.
- tracing log: separate the print statement from logging to tracing.
- Builtin binary expressions.
- std lexing and parsing into my syntax tree.
- control flow: if..else, while loop
- array-like type for holding multiple returns, used as the iterator in a for loop (Python-like syntax).
- para block using the shell example.
- customized provenance to file.
- pipeline shell syntax sugar.
- miette to emit nice syntax errors.
- design doc for the syntax specification, through an mdbook (https://github.com/oxiida/book).
- FFI through pyo3 to embed oxiida-lang into Python scripts.
- (*) workflow reuse
- (*) module import.
- Clear scope and variable management (plan to use a variable stack)
- Performance: work around the GIL with multiprocessing.
- versatile runtime that listens to the tasks and launches them.
- sqlite as default persistence
- Default config folder using .config/oxiida/ and support profile switching via the config. Should also manage the proper persistence target between run and submit-to-daemon.
- query the db and provide a lame version of graph printing.
- (*) the FFI call from Python should be able to return a value to Python; the workflow can return a result as an expression.
- (*) traverse the AST and spot parse-time errors as much as possible.
- (*) a statement should return the value of its last expression; a para block should return an array in lexical order.
- (error-prone, thus disallowed) para while and para for syntax (check https://docs.julialang.org/en/v1/manual/multi-threading/#The-@threads-Macro).
- (*) snapshot of the syntax tree and serialize details to restore. (This requires changing from a recursive interpreter to a flat-loop interpreter, a big refactoring.)
- graceful cancellation with termination signals, use SIGTERM
- static and strict data types.
- type checking for the FFI function declaration.
- separate the bin and lib parts into independent crates.
- pre-runtime type check for array and variable assignment.
- tracing for crucial flow nodes.
- (*) Support container tasks
- Support SSH tasks
- Support container tasks over the SSH wire.
- chores: TODO/XXX
After the prototype, incrementally add more nice-to-have proper-language things:
- reintrospect the design, especially where the reactor pattern is used. Draw figures for clear demonstration.
- error handling of tasks: how to represent it in syntax, the Rust way with a match keyword?
- ergonomic: the py daemon attaches env-mutation info to the reply message to the client.
- when using the =_= syntax, provide a way to control the upper bound on the number of available processes.
- FFI calls from Julia/Lua
- traverse the AST and generate the control-flow graph.
- break and continue keywords.
- let's build a REPL; oxiida is an interpreted language.
- ?? a local keyword for variable shadowing.
- performance: expressions in an array can be evaluated concurrently.
- parallelize expression evaluation and hold the future to join when used.
- separate the CLI and library parts and make the library part a dedicated crate.
- separate the pyo3 part as a Python SDK.
- release a fixed version of the specification with the basic language primitives.
- clean out the dependencies that can be implemented on my own. The list is: the config and directories-rs crates for profile/config settings; rmp-serde for the msgpack codec.
- For the Python binding, clean out the dependencies that can be implemented on my own. The list is: serde-pyobject for converting PyAny back and forth; serde_json, because it should not be bound to JSON as the serialization format.
- Consider native Windows support. Now I use nix, which is an issue for Windows support at the moment, but I should consider supporting it since this is a DSL.
Now I can make the repo public and release 0.1.0 to crates.io.
After 0.1.0
- anonymous functions and workflows.
- parser for WDL
- parser for nextflow
- parser for CWL
Misc
Workflow definition
An analogy can be made by mapping remote HPC to the CPUs and the database to the RAM (the database is optional: if the runtime keeps running and does not need to recover tasks from persistent storage, then just use RAM as "RAM").
It is unavoidable to have a domain-specific language for writing workflows, since workflows can be regarded as programming against remote resources.
Therefore, oxiida has an async runtime based on Rust's tokio and its own syntax (the standard procedure of programming language design with lexing/parsing/interpreting), marking it as a small language for workflow construction.
Syntax
The syntax of the workflow-definition part is borrowed from Golang, since Go provides a great solution for bridging synchronous and asynchronous code. There may be no need for a general programming language that covers lots of nuance, but for defining and executing workflows, two topics are unavoidable:
- is it support control flow?
- can things run in parallel?
If the async or lock concepts from concurrent programming are not introduced, there is no hybrid way to accomplish both.
The syntax should be distinguishable from the platform language. Say, in the Python SDK, the syntax can be Python-like, but developers should easily recognize that they are programming the workflow. The workflow is then passed to a runtime that is not the Python VM and requires different error handling. The way to debug and inspect results can be similar to Python but not the same. Deliberately distinguishing the syntax a bit makes users aware, when a problem happens, where they need to check and whom to ask: Python or the workflow engine. Internal engine errors should be treated as panics and not exposed to the user (when users hit such an error, they are supposed to report it to me as a bug). To make panics nicer for users, consider using https://github.com/rust-cli/human-panic. If the panic happens on the daemon side, a message should be sent back to the client telling the user where the log file is so they can get the log and upload it (or automate this procedure if possible, by asking for permission). Syntax and workflow runtime errors are exposed to the user with extensive help information so they can fix them themselves.
- python-like or non-python-like?
- syntax customized for different base language or single syntax?
{ and } pairs are used to separate the scope of variables passed between processes.
The newline is treated as a statement terminator, since I don't expect multi-line statements when building a workflow.
Coercion to limit the versatility of the syntax
I don't want users who write workflows to write super complex flows that make the DSL look like a generic programming language. Therefore, I'll make some assumptions and simply throw a compile-time unsupported error when unexpected syntax shows up. Parallel inside while is one of them, and I'll come up with more rules that users may think about but that should not be supported.
Basic language primitives should all be supported, such as while, for, if..else if..else and break/continue.
The unsupported list is for things like nested para blocks, or a para block inside para while, etc.
Runtime
Runtime is a huge topic that requires a lot of mental effort to get the design correct.
I need to come back many times to rethink how to implement it well, to get both performance and maintainability.
There are hard concepts such as Pin/Unpin that I need to go through again and again, and I need a good understanding of the native tokio runtime implementation to borrow ideas and keep the pattern consistent.
The inner state machine is called a "Process"; it holds states and returns a handle for the runtime controller to communicate with from outside the runtime. A "Task" is a wrapper around a "Process"; it also contains the input and output of the runnable target. Inside the process, generic running logic is defined, which in the workflow context usually involves resource management. The process running logic is essentially how the input port is used to modify the output port.
Therefore the "Task" interface requires task.new as the constructor (would a builder pattern fit better?) and task.output to access the output later, once the task has completed.
It also requires task.execute to define how the inputs are used to modify the mutable output port.
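To make the shape of that interface concrete, here is a hypothetical Rust sketch; the trait name, associated types, and method signatures are my own illustration of the description above, not the actual Oxiida code.

// Illustrative only: a possible shape for the Task interface described above.
trait Task {
    type Input;
    type Output;

    // `task.new`: the constructor (a builder pattern may fit better)
    fn new(input: Self::Input) -> Self;

    // `task.execute`: use the inputs to modify the mutable output port
    fn execute(&mut self);

    // `task.output`: access the output once the task has completed
    fn output(&self) -> Option<&Self::Output>;
}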
concurrent processes
The reserved keyword para is introduced to decorate a block (should it also apply to expression evaluation?) so that it runs concurrently.
Concurrency in this context means that, inside a workflow, the instructions do not just execute in lexical order: statements can run "in parallel", or more precisely asynchronously, in the tokio runtime.
If workflows or processes are launched separately, they run asynchronously in the runtime anyway.
In oxiida, a DSL is introduced, so there is a lexical order in which the code (which constructs the workflow) is written.
By default, tasks run in sequence mode; only if a task is annotated as parallel will it be spun up to run concurrently.
This is key to avoiding a surprising execution order within a workflow.
But between workflows, since no code is written to make one workflow run after another, it is safe to have them run in any order, each with its own variable environment.
Under the hood, sequential and parallel runs are distinguished by the join handle of the spawned coroutine. For sequential tasks, the join handle is awaited immediately after the task is spawned. This ensures the async runtime will not execute another task until the awaited join handle returns. For parallel tasks, the join handles are awaited later, when leaving the parallel scope, or even later when the workflow finishes.
It is essential that the join handles are all properly awaited, so that no task is aborted unexpectedly when the runtime finishes, which could leak remote resources.
The design here is to have a handle type check at the end of the runtime: the joiner is an enum type that can be either a JoinHandler or a resolved output (is dyn needed?).
If it is still a join handle, which means the output was never explicitly consumed, the runtime will trigger the await and warn the user: "Some output is not used and the runtime is awaiting it to make sure it finishes without resource leaking."
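As a rough illustration of the immediate-await versus deferred-join distinction, here is a minimal tokio sketch; the task and function names are made up for the example, and this is not the Oxiida runtime code.

use std::time::Duration;

// placeholder workload standing in for a real process
async fn run_task(i: usize) {
    tokio::time::sleep(Duration::from_secs(1)).await;
    println!("task {i} finished");
}

async fn seq_block() {
    for i in 0..2 {
        // await right after spawn: the next task starts only after this one ends
        tokio::spawn(run_task(i)).await.unwrap();
    }
}

async fn para_block() {
    let handles: Vec<_> = (0..2).map(|i| tokio::spawn(run_task(i))).collect();
    // all handles are joined when leaving the block, so no task is aborted
    for handle in handles {
        handle.await.unwrap();
    }
}

#[tokio::main]
async fn main() {
    seq_block().await;  // takes about 2 s
    para_block().await; // takes about 1 s
}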
In principle, since I have the whole DSL parser implemented from scratch, I could design syntax for inline parallelism.
Semantically, it means an expression could have parts of a one-line execution run in parallel.
For example, (6 + 4) * (5 + 4) could have the left-hand side and right-hand side of the * operator run at the same time, violating the left associativity of the * operator.
At the moment, I don't see much use for adding such complexity, so I limit parallelism declarations to certain types of statements applied to certain types of task, e.g. ShellTask, SchedulerTask.
But parallel evaluation of array element expressions should not cause any unexpected behavior, since by default they do not rely on each other.
It is thus worth supporting parallel evaluation of the expressions in an array.
This requires some restrictions on the statements that can run in a parallel block.
Every statement has its own variable scope, so the statements can run in any order.
Actually this came naturally: when implementing it in Rust, the borrow checker forced me to handle the variable environment nicely by making it independent for each statement.
Because the environment is passed as &mut var_env, to make it usable from another coroutine I chose to clone it and move ownership into the statement's async block.
This makes it, again, not so clear whether scopes are needed or not.
All the statements in the para block are launched concurrently; there is no hybrid mode yet where a single statement runs synchronously. In principle it is allowed to use the same variable name in different statements, since each has its own local scope, but this usually reveals a potential bug, so it is raised as an error in the AST resolve phase. A statement is not allowed to redeclare or modify a variable in the parent scope (the shared var_env is read-only); this restriction makes it possible to avoid using a mutex. At the end of the para block all the handles are awaited, so nothing goes out of control.
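A tiny sketch of the clone-and-move idea; the variable-environment type and names here are stand-ins for illustration, not the real implementation.

use std::collections::HashMap;

type VarEnv = HashMap<String, f64>;

#[tokio::main]
async fn main() {
    // the parent scope's environment is effectively read-only for the para block
    let mut var_env: VarEnv = HashMap::new();
    var_env.insert("x".to_string(), 5.0);

    let mut handles = Vec::new();
    for stmt in ["x + 1", "x + 2"] {
        // each statement owns its own clone, so no mutex is needed
        let local_env = var_env.clone();
        handles.push(tokio::spawn(async move {
            let x = local_env["x"];
            println!("`{stmt}` runs with its own copy of the environment, x = {x}");
        }));
    }
    // at the end of the para block every handle is awaited
    for handle in handles {
        handle.await.unwrap();
    }
}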
A statement contains expression evaluations, and I do not know in advance whether an expression has side effects or not. A side effect here means the statement mutates a resource; a more detailed categorization of which expressions should be regarded as having side effects needs to be well defined. At the moment, assignment is one example of an expression without side effects, while tasks whose process runs in the runtime are treated as having side effects.
In terms of a para expression, it is not clear whether we need to support inline-decorating an expression with para so that the expression can immediately return control.
I can see use cases where, resource-wise, it is efficient to launch things concurrently and continue with the other instructions.
But this means the expression needs to return a future type (the JoinHandler in my runtime implementation) that has to be resolved manually (or automatically at the end of the scope) to avoid resource leaks.
I am not sure how complex the syntax would get to do this right, so I leave it as a feature for after the prototype phase.
When an expression has no side effects, the AST traversal phase will detect the problem, throw an error warning about a potential bug, and ask to put the statement into a block to form a valid statement.
CLI
The command line interface design of AiiDA was complex because the inner operations were tangled with the interface layer. That is to say, there was no layer or abstraction between what the inner system should do and what the commands sent to the interface ask it to do. A typical example is process control, which needs to communicate with the process queuing system (in AiiDA the queue is managed by RabbitMQ and an event loop). The command directly interacts with the inner API that connects to RabbitMQ. This makes multi-user support and AiiDAlab communication hard to implement, and there is no security guarantee once a user gets CLI access to the port. There were efforts to build a standard REST API to talk to the engine, but it is just hard because the public APIs in AiiDA are a mess.
In Oxiida, the abstraction of the interface is natively under consideration. The CLI does not directly manipulate any inner state but sends messages over a TCP stream to communicate with the running daemon or with the persistence system (storage, or database). This makes it easy to add authentication in the handshake when the TCP connection is created, whether through the CLI or through another interface (I am thinking about protocol buffers and using gRPC to standardize the communication and realize a language-agnostic frontend interface design).
The message is sent to an actor responsible for managing the manipulation of the inner state of the runtime system, including the persistence actor and the FFI tasks actor.
The persistence actor has a single connection to the storage.
There is a caveat here that needs more thinking about the lifetime of the actor.
If one such actor is attached to the daemon and is writing to the storage, should the run command reuse this actor, which requires a mutex, or use a new actor, which results in racing?
It leads to more questions, such as: can multiple daemons run at the same time and connect to the same persistence system?
The catch is always about data racing when the filesystem is shared and accessed by more than one entity.
No idea can be borrowed from AiiDA, since it just ignores this design consideration by shifting the responsibility to PostgreSQL, which supports multiple writers.
In the design of Oxiida, SQLite is targeted as the production database, and it accepts a single writer by default.
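A minimal sketch of the single-writer persistence-actor idea; the message type and function names are assumptions for illustration, not Oxiida's actual types. Every writer sends messages over a channel and only the actor touches the storage, so no mutex is needed.

use tokio::sync::mpsc;

enum PersistMsg {
    Record { node: String },
}

// the actor is the single owner of the storage connection
async fn persistence_actor(mut rx: mpsc::Receiver<PersistMsg>) {
    while let Some(PersistMsg::Record { node }) = rx.recv().await {
        // write `node` to the storage backend here (file, SQLite, ...)
        println!("persisted: {node}");
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(64);
    let actor = tokio::spawn(persistence_actor(rx));
    // senders can be cloned into any number of concurrent writers
    tx.send(PersistMsg::Record { node: "x = 5".into() }).await.unwrap();
    drop(tx); // closing all senders lets the actor loop end
    actor.await.unwrap();
}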
The decision here is: I will have dry_run, which always starts its own task actor and writes to the local DB; submit by default submits to the daemon, and submit --wait will submit and wait for the workflow to finish.
With submit --wait, print will send the message back through the channel and show the printed info in the user space.
If --wait is not passed, the print statement will just log into the tracing output.
The process control actor is per runtime; every actor has its own oxiida language runtime to interpret the source code. I can do this based on the assumption that workflows are independently driven to a terminated status. Every workflow gets its variable environment either as a closure from the FFI or from where it is launched, and can then run independently in whatever oxiida runtime it can run in.
Function and workflow
I define two types of reusable blocks: functions and workflows.
A function is defined with the keyword function, while a workflow is declared with workflow.
The difference is that the function is one of the fundamental building blocks of a workflow; it is itself represented as a data-processing task.
The workflow, on the other hand, can also be reused, but it serves more of an orchestration purpose.
From a usage point of view, running a function produces a task datum with input and output, while the workflow contains the tasks and is represented as the upper layer that encapsulates its callees' tasks.
On the other hand, a workflow can call workflows and functions, while a function can only call functions, and only a function called from a workflow is treated as a task.
It is a lot of overhead if every granular expression is recorded with its data into storage. The function introduces a block whose intermediate steps are not recorded; only the function itself appears as an abstract data-logging node in the persistent storage. On the contrary, the workflow keeps track of all details of the intermediate steps.
This brings up the idea that I may also want to introduce anonymous functions/workflows. But that requires a more detailed design for managing closures, etc.; not planned before v1.
Config and data folder
I don't want to shift the burden to the user of setting up where to store the config and where to store the DB.
The default config uses the standard path from directories-rs (following each OS's standard).
The config will contain folders for multiple profiles; the default profile is created when the application runs for the first time, using LazyCell (non-thread-safe, but fine for Oxiida since the cmd entry is single-threaded) to store the config parameters.
The default DB, when running in daemon mode, should also go into the ProjectDirs.data_dir folder of directories-rs, which is $XDG_DATA_HOME/<project_path> or $HOME/.local/share/<project_path> on Linux.
The data folder should be easy to change through the config.
Storage for data persistence
One of the key features that mark a scientific workflow is the ability to record the dynamic running logic.
The data nodes that are the inputs or outputs of tasks are generated and written into the persistent storage system where the runtime runs.
It can be a file, a database, or even a remote object store.
This requires the input/output data to be serializable.
Powered by the serde crate, combined with the type checking of the DSL I introduced in the syntax, it is clear what the basic types for the tasks' input/output should be.
The basic types are the serializable values defined in: https://docs.rs/serde_json/latest/serde_json/enum.Value.html
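For illustration, this is the kind of value the provenance layer has to handle, assuming serde_json::Value as the common denominator (a sketch, not the actual storage code):

use serde_json::{json, Value};

fn main() {
    // a shell task's output expressed with serializable basic types
    let output: Value = json!({
        "stdout": "Hello, Oxiida!",
        "stderr": "",
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}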
(Now by default everything is recorded; do I need to control the storage at a finer grain?) Whether to store input/output along with the process is controlled at the block-statement level.
By default, nothing is stored, for performance reasons and so that provenance can be an optional feature.
When a block is marked as "proof" by a leading @, as in @{ stmts }, the marking cascades into the statements inside.
I cannot see a use case for unmarking a particular expression inside, because once provenance is needed, it is required to be complete.
For a customized task, it is the task implementation that defines which data should be stored.
When an expression is marked as stored, the terminator data expression is attached with a unique ID, so that when the terminator value output becomes the input of another expression, the ID indicates that it stems from the identical datum.
Meanwhile, the data with that ID are dumped to the defined storage.
The storage backend is selected via the CLI.
I don't yet have an idea of what the command-line API for storage should be.
For prototyping, I implemented two types of storage system.
One is a dummy (named NullPersistent) where nothing actually goes to persistent storage; it just emits the storing operation to stdout.
The other is a file on disk (named FilePersistence) that persists to the file system.
Both of them share the same interface, showing how a persistence system should operate.
After supporting SQLite as the default persistence, I removed NullPersistent since it was unusable. FilePersistence may still be useful for just dumping things, and will probably raise an error if the user tries to query from it. I also keep two implementations to keep the trait, because persistence to PostgreSQL and SurrealDB may be needed when it comes to huge throughput and distributed deployment. The idea is also that the base version provides SQLite, and if it potentially goes commercial, the other database support becomes an additional service.
There are three types of data nodes stored in the persistent storage system for future provenance: terminator value data nodes, task process nodes, and edge nodes. Each of them gets a unique id when stored in the persistence medium, such as a file or database.
Distinguish between value arrays and expression arrays: a value array contains only terminator values that are serializable and will become value nodes, while an expression array contains expressions that are further evaluated toward a value array.
These types map onto the terminator expressions of the syntax tree. Storing the data then just becomes serializing the terminator data syntax-tree node and calling the persistent storage system, which can be passed to the runtime to switch where the provenance is stored. Implementing the different storage systems is just fitting the interface with the dependency injection pattern, neat.
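As a sketch of what "fitting the interface" could look like, here is a hypothetical trait with an invented backend name; it only illustrates the dependency injection idea and is not the real Oxiida trait.

use serde_json::{json, Value};

// hypothetical shared interface every persistence backend implements
trait Persistence {
    fn store_node(&mut self, node: &Value);
    fn store_edge(&mut self, from: u64, to: u64, label: &str);
}

// a backend that only echoes what would be written
struct StdoutPersistence;

impl Persistence for StdoutPersistence {
    fn store_node(&mut self, node: &Value) {
        println!("node: {node}");
    }
    fn store_edge(&mut self, from: u64, to: u64, label: &str) {
        println!("edge: {from} -[{label}]-> {to}");
    }
}

// the runtime only sees the trait, so backends can be swapped freely
fn record_result(storage: &mut dyn Persistence) {
    storage.store_node(&json!({ "value": -12.5 }));
    storage.store_edge(1, 2, "output");
}

fn main() {
    let mut storage = StdoutPersistence;
    record_result(&mut storage);
}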
Storing inputs and outputs is independent from running the process, because the process is where inputs are consumed to generate outputs.
Storing requires input data that already exist before the process runs and output data that only exist after the process completes.
Meanwhile, there are expressions with data connections that require no process to generate data, such as Expression::Attribute.
For the assignment case, the output is deduced from the input by the . operator, which triggers a hashmap lookup rather than running a process.
Variable scope and shadowing
I personally think a proper generic programming language should separate declaring and assigning a variable instead of mixing them like Python. But for a small DSL, explicit declaration might be a bit too verbose. However, not separating the two statements means that when an assignment appears a second time, it is either a valid assignment or wrongly overrides the old value because it was overlooked that the name is already in use. I could reject any reuse of a variable, but that would end up with too many variables just for simple stuff. To mitigate the problem, I apply a pre-runtime check that throws an error only when the variable is assigned a value of a different type.
For variable shadowing, if I go with only a global shared var_env, shadowing will take the Lua way: using a local keyword to declare the variable in the local environment.
- ? how to deal with variable conflicts?
- ? when there is also an FFI var_env, how to manage them together?
- ? when ox module imports are introduced to allow workflow reuse, how to resolve conflicts?
Do I need scope and closure? (TBD)
For simplicity, and for the moment, I assume workflows are small and fit on one page for review, so using a shared global variable environment won't be a large issue:
there are not too many variables that could cause namespace conflicts.
Anyway, it is not hard to implement, so let's see the use cases.
It seems that when I need nested workflows, closures are unavoidable? Meanwhile, in the rlox practice, I still could not solve the memory leak from self-references when implementing closures. If it can be avoided, I'll not add the feature.
Scopes seem unavoidable when it comes to requiring an independent local variable environment for each statement run in parallel.
Thus I may make every seq/para block a new scope; a new workflow gets its own scope, and every statement in a parallel block gets a new scope.
See the para block design below, where I describe why scopes can still be avoided if I just clone the environment and pass ownership to the parallel statements.
I think in the end it is a trade-off on whether cloning the var_env around becomes too expensive.
Using just one global environment has the advantage that the runtime implementation stays simple, without needing reference counting and RefCell to manage the var_env for different scopes.
Or maybe there is a much easier or more proper way to implement the var_env?
Control flow
Should a while loop evaluate to an array containing each element that the body evaluated to, like CoffeeScript? No, this is quite unusual syntax, and while would then be an expression.
I support the for and while loops. The for loop iterates over a fixed-length array, so it always terminates. The while loop carries the risk of never exiting; this happens when the increment expression is forgotten. So, first of all, I recommend the for loop over the while loop. Then, for the while loop, a max-iteration bound is required to avoid infinite loops. This is outlandish syntax, so to introduce it nicely I'll have a resolve-time checker that points to how to use this syntax when it is not provided in the first place.
There is an ambiguity when combining the for loop with passing the loop target as an identifier.
This is because the for statement so far only accepts an explicit array in the grammar.
To support it, the cases of an explicit array versus an identifier (trickier if I also want to support Attribute expressions) have to be handled separately.
For now, I simply do not allow parsing the following syntax:
xs = [1, 2, 3, 4];
print xs;
for x in xs {
print x + 1;
}
Type system
Since the inputs/outputs are in any case serializable basic types (or composites of basic types), the types should be easy to deduce. Type checking at compile time, via an extra traversal of the syntax tree, can also help detect bugs such as accessing a non-existent attribute of an identifier.
Error handling (syntax error)
The workflow definition and the workflow execution are separated. The interpreter should try its best to spot syntax or workflow-definition problems before moving to the runtime. The check is done in the syntax-tree resolve phase to spot errors such as misused tokens and undefined variables.
Error handling (task error)
Performance
The interpreter of oxiida is a tree-walk interpreter; it uses Rust as the transpiler IR. The performance of the oxiida language has an opinionated goal compared with generic programming languages: the bottleneck of running workflows mostly depends on the resources rather than on the language itself. Therefore, when talking about the performance of a workflow language, performance means throughput per CPU. That is, the more remote processes (including local processes not directly run by the engine) a CPU running the engine can handle, the more performance it has. The codegen of the oxiida language targets Rust code that can either run directly as a script or be passed to the running oxiida engine.
The "engine" is a fancy word that can scare future developers away, because it seems hard to touch (or is misused to make some crap design look falsely fancy). It also does not clearly describe how oxiida runs user-defined processes and workflows. The closer jargon from programming language design would be process virtual machine or runtime.
Foreign Function Call (FFC) support
Python
The oxiida workflow language can be embedded into Python and run from Python, calling Python-side functions by passing the local environment.
oxiida is wrapped as a Python library, also named oxiida, with a function run that takes a string of workflow code and runs it in the Rust-side runtime.
Python functions defined in the local scope are used to store function declarations.
When there is a function call from the oxiida side, it first looks into the oxiida scope (the scope of the workflow source definition).
If nothing is found in the oxiida scope, it relays the call to the foreign-function side where the binding is implemented, looks up the function, and calls it.
The function lookup and call is done by passing a message with the function name and parameters.
The parameters come from the ox workflow source code.
Messages are listened to by an actor that dispatches the Python function calls, one thread per message.
The actor is either started with a local_set without the daemon, or started with the daemon.
Every workflow has its own actor spawned in the daemon; the reason is to give each workflow an independent FFI local environment, to avoid mutation of variables across workflows.
A foreign function must be declared with the require keyword before it can be called.
After declaration, it is used with function call syntax.
The design decision of using an actor to spawn a blocking thread for running foreign functions was made because I want to avoid having any foreign-language binding crate as a dependency of core oxiida.
It is a bit like an RPC pattern, but the messages don't go over a TCP stream and no broker middleware is required.
The messages are communicated over Rust tokio channels, and errors live in a runtime I fully control, so there is no debugging burden.
The actual foreign function call goes through the binding, in this case pyo3, to have Rust code call Python in a with_gil block inside a spawn_blocking thread.
Using a dedicated new thread for the foreign function call makes it possible to shift the load of managing IO-bound Python functions to the system scheduler through multithreading.
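The call shape is roughly the following; this is an assumed sketch, the real binding code differs, and the pyo3 API details vary between versions.

use pyo3::prelude::*;

// run a Python callable on a blocking thread, acquiring the GIL only there
async fn call_foreign(py_fn: Py<PyAny>, x: f64, y: f64) -> PyResult<f64> {
    tokio::task::spawn_blocking(move || {
        Python::with_gil(|py| {
            let result = py_fn.call1(py, (x, y))?;
            result.extract::<f64>(py)
        })
    })
    .await
    .expect("the blocking task panicked")
}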
However, Python before 3.13 has the notorious global interpreter lock (GIL), so for CPU-bound functions, even if they are spawned in different threads, running through pyo3 has to hold the GIL.
From Python 3.13, no-GIL is opt-in and can therefore release the power of multithreading when executing CPU-bound tasks.
But it is not yet the default for the Python community, and not many community-maintained libraries are compiled with no-GIL.
Therefore, in the Oxiida Python binding, when the actor gets a message to run a Python function, it can spawn a forked process.
As a result, when N CPU-bound functions are parallelized with multiprocessing, using the =fnname= syntax in a para block, all the CPUs used can reach 100% usage.
In daemon mode, a Python function actor is required to run the Python functions available in the current environment. This requires two FFI hooks: one when the Python-specialized daemon starts and captures the current environment, the other for the closure of the script being run. The Python environment for the daemon is a tricky problem in AiiDA: when a new package is added, or when the source code of an editably installed package changes, the daemon must be restarted, but this is a silent rule that AiiDA users just have to know. In a mature system, the ergonomic way is to notify the user to restart the daemon, otherwise they are responsible for unexpected results when the environment has changed. The solution I have in mind is to start a file-system monitor with the daemon that watches the site-packages folder of the environment. When a file changes, the daemon attaches a warning in front of every client call asking to restart the daemon.
Julia/Lua (TODO)
Look at Python for how it should be done. Do I need to support having the Rust side call all plugins? How to make it possible to have a plugin community around different languages? Can nvim be an exemplar?
Miscellaneous (no categorized)
How many APIs should Process expose?
Implementing a certain type of Process is quite sophisticated; it may result in an infinite loop or functions being put in the wrong mode (async vs. sync).
Thus, the core part should provide all the implemented process types to try to cover the generic cases; any new process type is regarded as a new feature.
Provided process types are:
- SchedulerProcess: similar to AiiDA's CalcJob; bundles the remote file operations through the transport and performs scheduler operations (Slurm first, see what requirements it has, make it pluggable) through the transport.
- LocalProcess: no remote communication required, but uses the resources where the service is running. The process is spawned using spawn_blocking on other threads for performance, but requires tight monitoring to avoid draining the resources and starving the service.
- KubernetesProcess: communicates with a k8s cluster and starts customized pods on demand, i.e. cloud processes.
- HyperqueueProcess: a sub-scheduler using hyperqueue (a Rust crate, therefore callable through native APIs) as a native sub-scheduler once a large abstract resource is available.
See if the categories above are needed; some might be grouped under the same abstraction.
Paused state
The paused state is special. It can be recovered from the middle, meaning it holds the process info so it knows how to advance to the next state without starting from the beginning. In the runtime, the process state stays there waiting for a resume signal. Since it may stay in the runtime for a long time, in the paused state it should carry as few resources as possible. This requires that when transitioning to the paused state, all resources from the previous state are properly closed. In any case, it should be guaranteed that between state transitions no resources are carried over. For example, if the process needs SSH communication over an SSH transport through an actor in the transport pool, this resource should be dropped when pausing and recreated when resuming.
The pausing happens before proc.advance(), which is the function call where the real operation happens.
Therefore, on resuming, the state goes back to the one before the pausing.
Resource actors pool
A workflow consists of processes, which act as the entities that communicate with resources. In order to communicate with a remote resource, some way to transport the information over the wire is required; for example, the SSH protocol is a typical transport method. Accessing the remote too frequently may bring unnecessary overhead from initialization procedures such as handshakes or authentication. Worst case, too-frequent access may be regarded as an attack and banned by the resource provider.
By design, there will be only one actor representing the communication with one resource pool. This means a new communication request from a process first goes to the pool, takes a ready-to-use resource, and uses it. This approach avoids frequently opening and dropping resources. However, the remote resource may have a timeout for open connections and close them forcibly from the server side. To overcome this issue, the requests can poll and keep moving to the next resource in the pool, so that after some time the older resources fade away.
The mechanism requires more design consideration, and it is a key part that influences the reliability of the tool.
run process in a local thread
The Process does not need to be Send + Sync, since I made it run only in a local_set thread.
The reason is that sending it over can be expensive when the process carries a lot of data.
Meanwhile, a process being Send means the types of the inner data it carries need to be Send, for instance the input and output.
This makes it cumbersome to explicitly write Send trait bounds for its data.
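For reference, the pattern looks roughly like this (a sketch, not the runtime itself): a LocalSet keeps non-Send futures on one thread, so the data a process carries needs no Send bound.

use std::rc::Rc;
use tokio::task::LocalSet;

#[tokio::main]
async fn main() {
    let local = LocalSet::new();
    local
        .run_until(async {
            // Rc is !Send, yet it can live inside a spawn_local task
            let data = Rc::new(vec![1.0_f64, 2.0, 3.0]);
            let handle = tokio::task::spawn_local(async move { data.iter().sum::<f64>() });
            let total = handle.await.unwrap();
            println!("sum = {total}");
        })
        .await;
}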
The created state
The created state seems a bit redundant, since the first running state could be the entry for launching. Maybe it is useful when the persistent proc info needs to be written to disk, i.e. the DB.
SSH communication
- when uploading files to the remote, construct the files/folders in a local sandbox. Then create the same folder tree structure. Then upload the files. Finally verify the tree structure.
- mentioned by Alex: how fine-grained should the exposed API be to allow different users to interact with SSH? As an end user, the more abstract the better; as a plugin developer, the more explicit the better. It is an API design problem, so it is hard in general. I need to keep thinking about this.
Mumbling
Relation with AiiDA
The name Oxiida means the oxidized (rusty) AiiDA. Things that get oxidized get stable. AiiDA was never stable because it never had clear boundaries between internal and public APIs.
Have to say, AiiDA has a lot of good design decisions, and the community is one of its biggest treasures and should not be ignored. Among the material simulation community, it is the most unique project that tries to use advanced technology to reach high throughput in running scientific workflows. Using RMQ + PSQL to leverage multiple processes interacting with a large number of remote resources was a nice workaround for the Python GIL back in 2016, when asyncio was not yet in Python's stdlib. Especially six years ago, when I needed a workflow system that supported large database storage, it was the only thing I could find in the field.
The calcjob and workchain have clear specifications and clear type definitions, which makes them well defined and generic. Oxiida should provide native support to run them. The classes should be convertible to a syntax tree, rewritten into the oxiida format, and run through the oxiida interpreter.
The plugin system (more like a dependency injection pattern in AiiDA) is good; it is the foundation for gathering a community.
Some designs are just bad, so ignore them:
- A heavy and overly detailed CLI; it needs to be replaced with a TUI.
- The ORM is the thing I always want to avoid; it should not be an excuse for not learning SQL (core developers should know SQL, users can have another DSL, i.e. a query builder).
- workgraph is crap :< it has no design considerations, it has piles of ai-gen code, and it duplicates the aiida-core engine by copying the plumpy codebase (thanks for this shit that pushed me away from AiiDA).
- The engine part (plumpy/kiwipy) is just complex, and I am not proud of understanding that complexity. It is not needed if multithreading is natively supported (however, before 3.13 it is still not possible to opt in to real multithreading performance; thanks, Python's GIL, f*ck you!).