Introduction
Rmind is an interpreted programming language and an application for privacy-preserving statistical analysis. The syntax, semantics and API of the implemented library functions aim to be similar to those of R.
This document describes the language and the interpreter. The API is documented separately.
Running Rmind
- To get help, run rmind --help.
- The path to the Sharemind client library configuration can be specified with --conf /path/to/client.conf. If this is not specified, Rmind will try to read it from:
  - $HOME/.config/sharemind/client.conf
  - $XDG_CONFIG_DIRS/sharemind/client.conf
  - /etc/sharemind/client.conf
- Running the program without arguments starts the interactive mode (REPL, read-eval-print loop).
- When the program is run with a single argument that is not recognised as a command, the argument is assumed to be a filename. The file is loaded as a script and interpreted.
- After the server-side SecreC program analytics_engine.sb is installed on all servers, rmind init should be run once.
- To delete temporary tables, run rmind init again. This will not delete permanent tables. Do not do this while somebody is using the program.
- To delete a saved table, run rmind delete dsname tablename where dsname is the name of the data source and tablename is the name of the table.
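The configuration lookup order listed above can be sketched in Python (this is not Rmind code). The helper names are hypothetical, and treating $XDG_CONFIG_DIRS as a colon-separated list follows the XDG Base Directory convention; it is an assumption about Rmind, not something this document states.

```python
import os
import posixpath


def candidate_config_paths(env):
    """Return the default client.conf locations in the documented order."""
    paths = []
    home = env.get("HOME")
    if home:
        paths.append(posixpath.join(home, ".config", "sharemind", "client.conf"))
    # Assumption: XDG_CONFIG_DIRS may hold several colon-separated directories.
    for directory in env.get("XDG_CONFIG_DIRS", "").split(":"):
        if directory:
            paths.append(posixpath.join(directory, "sharemind", "client.conf"))
    paths.append("/etc/sharemind/client.conf")
    return paths


def find_config(explicit_path, env, exists=os.path.isfile):
    """Prefer the --conf value; otherwise take the first candidate that exists."""
    if explicit_path is not None:
        return explicit_path
    for path in candidate_config_paths(env):
        if exists(path):
            return path
    return None
```

Passing the environment and the existence check as parameters keeps the sketch easy to try out without touching the real file system.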
The directory containing the manpages of the Rmind library can be specified with the environment variable RMIND_DOCS_PATH. This enables the ?procedurename syntax in the REPL for reading the manpages of the library procedures.
You can specify the Sharemind client library logging level with --logLevel and the log file with --logFile. By default, the log level is “fatal” and the log file will be stored in $HOME/.rmind_client_log. The log levels are:
- fatal
- error
- warning
- normal
- debug
- fulldebug
Using the REPL
The REPL supports line history and basic tab completion.
The history can be scrolled with up/down arrow keys and searched using Ctrl + R.
If you have not typed anything and press tab, a list of possible language keywords, library functions or defined variables is displayed.
The Haskeline library is used for history and line editing. It supports Emacs and Vi style keybindings. The default is Emacs style. The keybindings are listed on the Haskeline wiki. It also contains information about configuring preferences and customising keybindings.
If you enter a question mark followed by the name of a procedure, the manual page section for that procedure is displayed (using the UNIX man program).
The REPL can be exited using q().
Plotting
Plotting procedures in Rmind return a special plot object. Evaluating this object in the REPL will display it in a window. Plots can also be displayed with the show procedure and saved with the save procedure or from the toolbar in the display window. Multiple plots can be combined into one with the multiplot procedure.
For example, if a and b are private vectors, we can create two vertically stacked histograms and save the combined plot to a file like this:
hist1 <- hist(a)
hist2 <- hist(b)
combined <- multiplot(list(hist1, hist2))
save(combined, "plot.png")
Hypothesis testing
To preserve privacy, Rmind does not declassify test statistics. To perform a statistical test, the user specifies the significance level. The value of the test statistic corresponding to this significance level (the critical value) is calculated and sent as a secret value to the Sharemind program performing the testing procedure. This program calculates a secret test statistic from the data, compares it to the received value and publishes only the result of the comparison, which is sent back to the Rmind user application. The convention is that procedures return TRUE when the null hypothesis is rejected and FALSE otherwise.
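The protocol described above can be illustrated with a plain Python sketch (not Rmind code). It assumes a two-sided one-sample z-test with known standard deviation, chosen only for brevity; the function names are hypothetical. In Rmind the statistic and the comparison are computed on secret data, whereas here everything is public.

```python
from math import sqrt
from statistics import NormalDist, mean


def critical_value(alpha):
    """Client side: threshold for a two-sided z-test at significance level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def z_test_rejects(sample, mu0, sigma, alpha):
    """Server side: compute the statistic and publish only the comparison result."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    # Only this boolean leaves the computation; z itself is never revealed.
    return abs(z) > critical_value(alpha)
```

A return value of True corresponds to rejecting the null hypothesis, matching the convention stated above.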
Accessing data
To load the Sharemind database table “foobar” from the data source “DS1” into the variable table, use:
table <- load("DS1", "foobar")
If you now enter just table, information about the table is printed. Note that no data is actually read; this just gives you a reference that you can use to perform operations on the data.
Column vectors of a table can be accessed in two ways. If table has column foo, then it can be indexed with table$foo or table["foo"].
names(table) returns a list with all the column names of table.
columns(table) returns a list with all the column vectors of table.
attach(table) creates variables for all the columns of the table. Using this in scripts is not recommended because it will be confusing to read a program with implicitly defined variables.
Note that the concept of a table in Rmind is not the same as a data frame in R. A table in Rmind refers specifically to a database table stored in Sharemind, which means private vectors cannot be combined into a table without storing them as a table in the Sharemind installation. Tables are also immutable. You cannot add a column to a table unless you store a new table using store.table or cbind.
Creating database tables
The store.table procedure can be used to store a list of private vectors as a database table. This can be used after data processing to permanently store the results for later use. For example, if we have private vectors x and y, we can store them as a table in data source “DS1” using:
store.table("DS1", "my_table_name", list("x", "y"), list(x, y))
An alternative which can be used when the columns are known statically is the cbind procedure which combines a list of columns into a table. The procedure takes pairs of column names and values as keyword arguments. For example, we can create a table with columns “foo” and “bar” from values x and y using:
cbind(foo=x, bar=y)
If we wish this table to have a name and be stored permanently, we can pass the data source and table name as keyword arguments:
cbind(result.ds="DS1", result.table="my_table_name", foo=x,
bar=y)
Aggregating database tables
Rmind has a function called aggregate which is similar to the R function with the same name or the group by functionality of SQL. Let’s say we have a table t with columns “id”, “x” and “y”. The following example groups the rows of t by the column “id” and calculates the mean of “x” and the sum of “y” in each group:
aggregate(t$id, list(t$x, t$y), list("avg", "sum"))
If the rows have to be grouped by multiple attributes, we can pass a list of columns as the first argument.
If the result of the aggregation should be stored as a permanent table with a name, we can use the ds and table arguments:
aggregate(t$id, list(t$x, t$y), list("avg", "sum"),
ds="DS1", table="my_table_name")
The API reference lists the supported aggregation functions and other arguments.
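To make the grouping semantics concrete, here is a small Python model (not Rmind code) of what the aggregate call above computes. It works on public lists and supports only the two aggregation names used in the example, “avg” and “sum”; the function layout is an illustrative assumption.

```python
from collections import defaultdict


def aggregate(keys, columns, functions):
    """Group rows by key and reduce each column with the matching function."""
    reducers = {"avg": lambda xs: sum(xs) / len(xs), "sum": sum}
    groups = defaultdict(list)
    for row, key in enumerate(keys):
        groups[key].append(row)  # remember which rows belong to each key
    result = {}
    for key, rows in groups.items():
        result[key] = [
            reducers[name]([column[r] for r in rows])
            for column, name in zip(columns, functions)
        ]
    return result
```

For example, with keys [1, 1, 2], x = [2, 4, 10] and y = [1, 2, 3], group 1 gets the mean of 2 and 4 and the sum of 1 and 2.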
Merging database tables
Rmind supports joining database tables with the procedure merge. It is modelled after merge in R and supports inner join, left outer join, right outer join and full outer join. The following example merges tables t1 and t2:
merge(t1, t2)
Since no key column has been specified, Rmind will try to use the intersection of column names as the join key and will perform an inner join. The join key can be specified by giving the name of the column or a list of column names as the by argument. The following example joins tables t1 and t2 by the column “a”:
merge(t1, t2, by="a")
If the key column has a different name in each table, we can use arguments by.x and by.y. The following example joins the tables by column “a” of t1 and column “b” of t2:
merge(t1, t2, by.x="a", by.y="b")
We can require a full outer join by setting the argument all=TRUE, a left outer join by setting all.x=TRUE and a right outer join by setting all.y=TRUE.
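The four join types can be modelled with a short Python sketch (not Rmind code). Tables are lists of dictionaries, the parameters all_x and all_y stand in for all.x and all.y, and setting both corresponds to all=TRUE. Filling unmatched cells with None is an assumption about how missing values would appear.

```python
def merge(left, right, by, all_x=False, all_y=False):
    """Join two tables (lists of row dicts) on the key column `by`."""
    left_cols = list(left[0]) if left else []
    right_cols = list(right[0]) if right else []
    result = []
    matched_right = set()
    for l_row in left:
        hits = [i for i, r_row in enumerate(right) if r_row[by] == l_row[by]]
        for i in hits:
            matched_right.add(i)
            result.append({**l_row, **right[i]})
        if not hits and all_x:
            # Left outer join: keep the row, right-hand columns are missing.
            result.append({**dict.fromkeys(right_cols), **l_row})
    if all_y:
        for i, r_row in enumerate(right):
            if i not in matched_right:
                # Right outer join: keep the row, left-hand columns are missing.
                result.append({**dict.fromkeys(left_cols), **r_row})
    return result
```

With the defaults the function performs an inner join, which matches the default behaviour described above.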
File I/O
A file can be opened for writing (in text mode) with the file procedure which returns a special file handle object. Strings can be written to this file using the write procedure. The file can be flushed with flush and closed with close. The following example creates a text file with the contents “Hello, world!”.
fh <- file("my_text_file")
write("Hello, world!", fh)
close(fh)
As an example, you could create a convenience procedure for writing your analysis results as a CSV file:
write.csv <- function(filename, names, cols) {
  ncols <- length(cols);
  f <- file(filename, "w");
  for (i in 1:ncols) {
    write(names[[i]], f);
    if (i != ncols)
      write(",", f)
  };
  write("\n", f);
  nrows <- length(cols[[1]]);
  for (i in 1:nrows) {
    for (j in 1:ncols) {
      write(toString((cols[[j]])[[i]]), f);
      if (j != ncols)
        write(",", f)
    };
    if (i != nrows)
      write("\n", f)
  };
  close(f)
}
cols <- list(c(1, 2, 3), c(10, 20, 30))
names <- list("foo", "bar")
write.csv("results.csv", names, cols)
Language
Introduction
As in R, there are no statements, only expressions. The language has no scalars, only vectors. Scalars are simply one element vectors. Operators work on vectors element-wise. When operating on vectors with different length, the shorter vector is cycled (copies of it are concatenated so that it would be as long as the other operand). There are four primitive types: integers, double precision floating point numbers, booleans and strings. Like R, Rmind is weakly typed when operating with numbers and booleans. For example, 1 + TRUE is two.
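The cycling rule can be modelled in a few lines of Python (not Rmind code). The function name is hypothetical, and returning an empty result for an empty operand is an assumption, since the text does not cover that case.

```python
def elementwise_add(a, b):
    """Add two vectors element-wise, cycling the shorter one."""
    # Assumption: an empty operand yields an empty result.
    if not a or not b:
        return []
    n = max(len(a), len(b))
    return [a[i % len(a)] + b[i % len(b)] for i in range(n)]
```

Python happens to share the weak typing mentioned above for numbers and booleans, so adding the one element vectors [1] and [True] gives [2].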
The important difference from R is that there are also vectors containing private data. Many arithmetic, relational and logical operators also work on private vectors. Statistical tests, modeling and plotting usually only work on private vectors. Sharemind programs are written in a programming language called SecreC. SecreC has more data types than R and so do private vectors in Rmind. Private vectors can contain unsigned or signed integers with 8/16/32/64 bits, single or double precision floating point numbers or booleans. There are also xor_uint unsigned integers with 8/16/32/64 bits. These are unsigned integer types that are faster for some operations (like comparison) but they are only used for specific tasks.
Public vectors cannot have missing values but private vectors can.
The result of a computation with two private values will also be private. When a private value is used in an operation with a public value, the result becomes private. As with public types, operations on private values (or combinations of private and public values) are weakly typed (adding a private uint16 and a public integer vector works). The only exceptions are floating point vectors, which can be converted to integers using the round, floor and ceiling procedures.
Procedures that work on private data do not modify the data but return a modified copy. This ensures that the original data is preserved.
Note that most statistical operations work on int32 or int64 vectors, so in terms of performance it’s best to use one of them for storing data because there will be no implicit data type conversions.
Vectors
Single element vectors can be created with ordinary scalar literals (1 evaluates to an integer vector containing one).
Multiple element vectors can be created with the c function. For example, c(1, 2, 3, 4) evaluates to an integer vector containing numbers one through four. This vector can also be created using the sequence syntax 1:4. The c function also concatenates vectors, strings and lists.
Note that there are no vectors of strings.
See the Lists section for how to index vectors and lists.
Lists
Lists can be created using list. For example, list(1, 2) evaluates to a list containing two single element integer vectors containing one and two. The elements of a list do not have to be of the same type. The c function also creates lists if it’s passed arguments with non-coercible types (for example, strings and integers).
List elements can be named. For example, list(a=1, b=2) creates a list with values 1, 2 and corresponding names a, b. Named list elements can be referenced by name. For example, if x <- list(a=1) then x$a evaluates to 1. The names of a list can be queried with names(x).
Lists can be indexed with brackets. Indices start from 1. As in R, and unlike in most programming languages, indexing returns a list containing the elements at the indices between the brackets. For example, if a is a list, then a[1] is a list containing the first element of a, a[c(1, 3, 5)] is a list containing the first, third and fifth element of a and a[4:6] is a list containing the fourth, fifth and sixth element of a. If the index vector consists of booleans, then each value of a is picked according to the boolean in the same position in the index vector.
The same syntax can be used on private vectors. The length of private vectors is not changed but the values that are not picked are marked as missing. This syntax is useful for filtering data. For example, a[b > 0] is a private vector with elements of a where b is positive. Another option for filtering (that also works on database tables) is the subset procedure.
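The difference between indexing a list and filtering a private vector can be modelled in Python (not Rmind code). Both function names are hypothetical, and None stands in for a missing value.

```python
def index_list(values, index):
    """Return the sub-list picked by 1-based positions or by a boolean mask."""
    if index and all(isinstance(i, bool) for i in index):
        return [v for v, keep in zip(values, index) if keep]
    return [values[i - 1] for i in index]


def filter_private(values, mask):
    """Keep the length; entries that are not picked become missing (None)."""
    return [v if keep else None for v, keep in zip(values, mask)]
```

Keeping the length unchanged in filter_private mirrors the rule above that the length of a private vector is not changed by filtering.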
Factors
Rmind supports private factor types. The CSV importer can create factors from string columns in CSV files. You can also create factors using factor. For example, factor(list("a", "b")).
To get the levels of a factor you can use levels which returns a list of integer codes that the factor levels have been mapped to. The list elements are named according to the factor levels. For example, for the previously created factor, levels would return list(a=1, b=2).
You can use the label of a factor in comparisons. If we have a factor x with label "foo" then x == "foo" is a valid expression.
Factors can be used as categorical data in hypothesis tests. freq and freqplot also support factors.
Matrices
Rmind has limited support for private matrices. There are currently no public matrices. A matrix can be constructed from private vectors a, b, c with matrix(a, b, c). The vectors will become columns of the matrix. The types of the vectors must match.
The binary operators supported on matrices are: addition, subtraction, multiplication, less-than, greater-than, less-than-equal, greater-than-equal, equal, not equal. The types and dimensions of the operands must match; there is no implicit casting or cycling. As with vectors, operations are performed element-wise.
The same operators are supported when one of the operands is public. Because there is no public matrix type, a public operand is represented as a list of column vectors. The number of vectors must match the number of columns of the matrix. If a column vector has fewer elements than there are rows in the matrix, the vector is cycled. The public vectors will be converted to the type of the matrix. This means that if, for example, the matrix contains booleans and the public vectors contain floating point numbers, then the floating point numbers will be converted to booleans, so be careful. For example, to check if all elements of a matrix m with 10 columns are positive, you can use m > rep(0, 10).
A matrix m can be filtered using another matrix f using the m[f] syntax.
The rows of boolean matrices can be reduced with the row.fold function.
Functions
Rmind functions are first class anonymous closures. Functions can be declared with function(argList) body where argList is the list of formal arguments and body is the expression whose value is the value of the function call.
To name a function, assign it to a variable.
Functions can have positional and keyword (optional) arguments. For example, increment <- function(x) x + 1 defines the increment function. If we define f <- function(x=10) x then f() is 10 while f(42) is 42. There’s also a special argument “…”. In the function body, “…” will be a list containing all arguments passed to the function after positional and keyword arguments.
Rmind has lexical scope. Some built-in functions also use dynamic scope. Positional arguments are evaluated in the calling scope, while keyword arguments are evaluated in function scope. This means default values of keyword arguments can refer to each other. For example, you can write a function f <- function(x=10, y=2 * x) x + y where the default value of y is computed from the value of argument x.
As in R, function arguments are evaluated lazily, i.e. when their value is needed. This means that some arguments may never be evaluated. Thus one should be very careful not to use expressions with side effects as arguments.
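The pitfall with side effects can be illustrated in Python (not Rmind code) by passing arguments as explicit thunks, which is a common way to emulate lazy evaluation in a strict language. All names here are hypothetical.

```python
calls = []


def noisy(value):
    """An expression with a side effect: it records that it was evaluated."""
    calls.append(value)
    return value


def first_only(x, y):
    """Arguments are thunks; y is never forced, so its side effect never runs."""
    return x()


result = first_only(lambda: noisy(1), lambda: noisy(2))
```

After the call, only the first argument has been evaluated, so the side effect of the second argument never happened.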
Early return is supported using return(expr) where the value of expr will be the value of the function call.
Blocks
To sequence multiple expressions separate them with ‘;’ and surround the sequence with curly braces. The value of the last expression will be the value of the block. For example, the increment function can be defined like this:
increment <- function(x) {
  x <- x + 1;
  x
}
NULL
NULL is a special value that indicates the lack of a value. For example, some functions can take NULL as the value of an optional argument.
Conditional expressions
Conditional expressions have the form if (condExpr) trueExpr else falseExpr. If condExpr evaluates to TRUE then the value of the conditional expression is the value of trueExpr, otherwise it’s the value of falseExpr. The false branch is optional. If it’s missing and condExpr evaluates to FALSE, the result is NULL. For example, a <- if (2 < 3) 42 assigns 42 to the variable a.
Looping
There are three looping constructs:
- repeat expr will evaluate expr indefinitely.
- for (i in seq) expr will evaluate expr as many times as there are elements in seq, which must be a vector or a list. The variable i will have the values of the elements of the sequence. For example, for (i in 1:10) print(i) will print numbers one through ten. The value of the last evaluated expression is the value of the loop.
- while (expr) body will evaluate body as long as the value of expr is TRUE. The value of the last evaluated expression is the value of the loop.
next continues with the next iteration of the loop.
break breaks out of the loop.
Model formulae
There’s a special built-in syntax for specifying models to modeling procedures like lm (linear regression) and glm (generalised linear model fitting). The operator for defining a model is ~. Currently we only support adding independent variables to the model. That means the left hand side of the model must be a single variable and the right hand side must be a sum of variables. For example, y ~ x1 + x2 defines a model with a dependent variable y and explanatory variables x1 and x2.
Dates
The CSV importer uploads dates as xor_uint32 values containing the components of a date (year, month, day) in succession. Normal comparison operators work on dates as well as minimum and maximum when aggregating data. To extract components of dates, use procedures year, month and day. The date procedure can be used to create a public date value for comparisons. The difftime procedure computes the difference of two dates (in days) and add.days can be used to add or subtract days from a date.
Private strings
Rmind has basic support for bounded length private strings. Bounded length means that each string takes up the same amount of space and the actual length of the string is private. String data can be imported with the CSV importer. Only ASCII encoded strings are supported and a limited set of operations can be applied to strings:
- Strings can be compared with the equal or not-equal operators.
- String columns can be aggregated with the functions “first” and “last”.
- String columns can be used as part of the aggregation key if the key is hashed using the unique.id procedure.
- String vectors can be filtered like numeric and boolean vectors.
- String vectors can be sliced using head and tail.
Modules
The language does not have a module system but it’s possible to split a program into separate code files. The source procedure takes a path of an Rmind program as an argument and interprets the source code of the program in the scope of the call to source. Note that the path passed to source is relative to the current working directory. The import procedure takes a path relative to the path of the file containing the call to import. If import is called in the REPL it behaves like source.
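The two path resolution rules can be contrasted with a small Python sketch (not Rmind code). The function names and the file names used with them are hypothetical; the sketch only shows which directory each relative path is resolved against.

```python
import posixpath


def resolve_source(path, working_directory):
    """source: the path is relative to the current working directory."""
    return posixpath.normpath(posixpath.join(working_directory, path))


def resolve_import(path, calling_file):
    """import: the path is relative to the file containing the call."""
    base = posixpath.dirname(calling_file)
    return posixpath.normpath(posixpath.join(base, path))
```

The practical consequence is that a script using import keeps working when it is started from a different directory, while one using source depends on where Rmind was launched.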
Customising startup
On startup, Rmind will evaluate the .rmind_profile file in the home directory if it exists. This file can be used to define useful functions without having to import a module when working in the REPL.
Operators
The operators supported by Rmind are listed in the following table.
Operator | Description |
---|---|
$ | Used for extracting column vectors from database tables. |
 | Exponentiation. |
 | Negation/subtraction. |
: | Sequence ( |
* | Multiplication. |
 | Division. |
 | Integer division. Does not work with private vectors. |
 | Modulo. Does not work with private vectors. |
+ | Addition. |
< | Less than comparison. |
 | Less than/equal comparison. |
> | Greater than comparison. |
 | Greater than/equal comparison. |
== | Equality comparison. |
!= | Inequality comparison. |
 | Boolean negation. |
 | Conjunction. This uses the first element of each operand. For point-wise conjunction use “&”. Does not work with private vectors. |
& | Point-wise conjunction. |
 | Disjunction. This uses the first element of each operand. For point-wise disjunction use the point-wise operator in the next row. Does not work with private vectors. |
 | Point-wise disjunction. |
~ | Tilde used for model formulae (in linear regression and generalised linear model fitting). |
<- | Assignment. |