Introduction

Rmind is an interpreted programming language and an application for privacy-preserving statistical analysis. The syntax, semantics and API of the implemented library functions aim to be similar to those of R.

This document describes the language and the interpreter. The API is documented separately.

Running Rmind

To get help, run rmind --help.
The path to the Sharemind client library configuration can be specified with --conf /path/to/client.conf. If this is not specified, Rmind will try to read it from:
- $HOME/.config/sharemind/client.conf
- $XDG_CONFIG_DIRS/sharemind/client.conf
- /etc/sharemind/client.conf
Running the program without arguments starts the interactive mode (REPL - read eval print loop).
When the program is run with a single argument that’s not recognised as a command, the argument is assumed to be a filename. The file is loaded as a script and interpreted.
After the server-side SecreC program analytics_engine.sb is installed on all servers, rmind init should be run once.
To delete temporary tables run rmind init again. This will not delete permanent tables. Do not do this while somebody is using the program.
To delete a saved table run rmind delete dsname tablename where dsname is the name of the data source and tablename is the name of the table.

The directory containing the manpages of the Rmind library can be specified with the environment variable RMIND_DOCS_PATH. This enables the ?procedurename syntax in the REPL for reading the manpages of the library procedures.

You can specify the Sharemind client library logging level with --logLevel and the log file with --logFile. By default, the log level is “fatal” and the log file will be stored in $HOME/.rmind_client_log. The log levels are:

- fatal
- error
- warning
- normal
- debug
- fulldebug

Using the REPL

The REPL supports line history and basic tab completion.

The history can be scrolled with up/down arrow keys and searched using Ctrl + R.

If you have not typed anything and press tab, a list of possible language keywords, library functions or defined variables is displayed.

The Haskeline library is used for history and line editing. It supports Emacs and Vi style keybindings. The default is Emacs style. The keybindings are listed on the Haskeline wiki, which also contains information about configuring preferences and customising keybindings.

If you enter a question mark followed by the name of a procedure, a manual page with the procedure’s section is displayed (using the UNIX man program).

The REPL can be exited using q().

Plotting

Plotting procedures in Rmind return a special plot object. Evaluating this object in the REPL will display it in a window. Plots can also be displayed with the show procedure and saved with the save procedure or from the toolbar in the display window. Multiple plots can be combined into one with the multiplot procedure.

For example, if a and b are private vectors, we can create two vertically stacked histograms and save the combined plot to a file like this:

hist1 <- hist(a)
hist2 <- hist(b)
combined <- multiplot(list(hist1, hist2))
save(combined, "plot.png")

Hypothesis testing

To preserve privacy, Rmind does not declassify test statistics. To perform a statistical test, the user specifies the significance level. The test statistic corresponding to this significance level is calculated and sent to the Sharemind program performing the testing procedure as a secret value. This program calculates a secret test statistic from the data, compares it to the received statistic and publishes the result of the comparison which is sent back to the Rmind user application. The convention is that procedures return TRUE when the null hypothesis is rejected and FALSE otherwise.

Accessing data

To load the Sharemind database table “foobar” from the data source “DS1” into the variable table, use:

table <- load("DS1", "foobar")

If you now enter just table, information about the table is printed. Note that no data is actually read; load only gives you a reference that you can use to perform operations on the data.

Column vectors of a table can be accessed in two ways. If table has column foo, then it can be indexed with table$foo or table["foo"].

names(table) returns a list with all the column names of table.

columns(table) returns a list with all the column vectors of table.

attach(table) creates variables for all the columns of the table. Using this in scripts is not recommended because it will be confusing to read a program with implicitly defined variables.
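The access patterns above can be put together in one short session. This is a minimal sketch that assumes the data source “DS1” contains the table “foobar” with a column named foo; the variable names on the left-hand side are illustrative, and the # comments assume R-style comment syntax.

```r
# Get a reference to the table; no data is read at this point.
table <- load("DS1", "foobar")

# Two equivalent ways of extracting the private column vector "foo".
foo1 <- table$foo
foo2 <- table["foo"]

# Column names and column vectors, both returned as lists.
colnames <- names(table)
cols <- columns(table)

# Creates one variable per column, so foo can now be used directly.
attach(table)
foo
```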

Note that the concept of a table in Rmind is not the same as a data frame in R. A table in Rmind refers specifically to a database table stored in Sharemind, which means that private vectors cannot be combined into a table without storing them as a table in the Sharemind installation. Tables are also immutable: you cannot add a column to a table unless you store a new table using store.table or cbind.

Creating database tables

The store.table procedure can be used to store a list of private vectors as a database table. This can be used after data processing to permanently store the results for later use. For example, if we have private vectors x and y, we can store them as a table in data source “DS1” using:

store.table("DS1", "my_table_name", list("x", "y"), list(x, y))

An alternative which can be used when the columns are known statically is the cbind procedure which combines a list of columns into a table. The procedure takes pairs of column names and values as keyword arguments. For example, we can create a table with columns “foo” and “bar” from values x and y using:

cbind(foo=x, bar=y)

If we wish this table to have a name and be stored permanently, we can pass the data source and table name as keyword arguments:

cbind(result.ds="DS1", result.table="my_table_name", foo=x,
      bar=y)

Aggregating database tables

Rmind has a function called aggregate which is similar to the R function with the same name or the group by functionality of SQL. Let’s say we have a table t with columns “id”, “x” and “y”. The following example groups the rows of t by the column “id” and calculates the mean of “x” and the sum of “y” in each group:

aggregate(t$id, list(t$x, t$y), list("avg", "sum"))

If the rows have to be grouped by multiple attributes, we can pass a list of columns as the first argument.
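As a sketch of grouping by several attributes, assume that t also has a column “region” (a hypothetical name used only for illustration). The key columns are then passed as a list; the # comments assume R-style comment syntax.

```r
# Group by both "id" and the assumed column "region", then compute
# the mean of "x" and the sum of "y" in each group.
aggregate(list(t$id, t$region), list(t$x, t$y), list("avg", "sum"))
```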

If the result of the aggregation should be stored as a permanent table with a name, we can use the ds and table arguments:

aggregate(t$id, list(t$x, t$y), list("avg", "sum"),
          ds="DS1", table="my_table_name")

The API reference lists the supported aggregation functions and other arguments.

Merging database tables

Rmind supports joining database tables with the procedure merge. It is modelled after merge in R and supports inner join, left outer join, right outer join and full outer join. The following example merges tables t1 and t2:

merge(t1, t2)

Since no key column has been specified, Rmind will try to use the intersection of column names as the join key and will perform an inner join. The join key can be specified by giving the name of the column or a list of column names as the by argument. The following example joins tables t1 and t2 by the column “a”:

merge(t1, t2, by="a")

If the key column has a different name in each table, we can use arguments by.x and by.y. The following example joins the tables by the columns “a” and “b”:

merge(t1, t2, by.x="a", by.y="b")

We can require a full outer join by setting the argument all=TRUE, a left outer join by setting all.x=TRUE and a right outer join by setting all.y=TRUE.
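The join types could be requested as in the following sketch. It assumes that t1 and t2 are tables as in the examples above, that the all, all.x and all.y arguments can be combined with by, by.x and by.y as they can in R, and that # starts a comment.

```r
# Full outer join on the shared column "a".
full <- merge(t1, t2, by="a", all=TRUE)

# Left outer join: rows of t1 without a match in t2 are kept.
left <- merge(t1, t2, by="a", all.x=TRUE)

# Right outer join with differently named key columns:
# rows of t2 without a match in t1 are kept.
right <- merge(t1, t2, by.x="a", by.y="b", all.y=TRUE)
```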

File I/O

A file can be opened for writing (in text mode) with the file procedure which returns a special file handle object. Strings can be written to this file using the write procedure. The file can be flushed with flush and closed with close. The following example creates a text file with the contents “Hello, world!”.

fh <- file("my_text_file")
write("Hello, world!", fh)
close(fh)

As an example, you could create a convenience procedure for writing your analysis results as a CSV file:

write.csv <- function(filename, names, cols) {
    ncols <- length(cols);
    f <- file(filename, "w");
    for (i in 1:ncols) {
        write(names[[i]], f);
        if (i != ncols)
            write(",", f)
    };
    write("\n", f);

    nrows <- length(cols[[1]]);
    for (i in 1:nrows) {
        for (j in 1:ncols) {
            write(toString((cols[[j]])[[i]]), f);
            if (j != ncols)
                write(",", f)
        };
        if (i != nrows)
            write("\n", f)
    };
    close(f)
}

cols <- list(c(1, 2, 3), c(10, 20, 30))
names <- list("foo", "bar")
write.csv("results.csv", names, cols)

Language

Introduction

As in R, there are no statements, only expressions. The language has no scalars, only vectors. Scalars are simply one-element vectors. Operators work on vectors element-wise. When operating on vectors of different lengths, the shorter vector is cycled (copies of it are concatenated until it is as long as the other operand). There are four primitive types: integers, double precision floating point numbers, booleans and strings. Like R, Rmind is weakly typed when operating with numbers and booleans. For example, 1 + TRUE is two.
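The cycling and coercion rules can be illustrated with public vectors. The variable names are arbitrary, and the # comments assume R-style comment syntax.

```r
# The shorter operand c(10, 20) is cycled to c(10, 20, 10, 20).
x <- c(1, 2, 3, 4) + c(10, 20)   # 11, 22, 13, 24

# A one-element vector is cycled over every element of the other operand.
y <- c(1, 2, 3) * 2              # 2, 4, 6

# Booleans are coerced to numbers, so TRUE counts as 1.
z <- 1 + TRUE                    # 2
```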

The important difference from R is that there are also vectors containing private data. Many arithmetic, relational and logical operators also work on private vectors. Statistical tests, modeling and plotting usually only work on private vectors. Sharemind programs are written in a programming language called SecreC. SecreC has more data types than R and so do private vectors in Rmind. Private vectors can contain unsigned or signed integers with 8/16/32/64 bits, single or double precision floating point numbers or booleans. There are also xor_uint unsigned integers with 8/16/32/64 bits. These are unsigned integer types that are faster for some operations (like comparison) but they are only used for specific tasks.

Public vectors cannot have missing values but private vectors can.

The result of a computation with two private values will also be private. When a private value is used in an operation with a public value, the result becomes private. As with public types, operations on private values (or combinations of private and public values) are weakly typed (adding a private uint16 and a public integer vector works). The only exception is floating point vectors, which must be converted to integers explicitly using the round, floor and ceiling procedures.
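A minimal sketch of how privacy propagates. It assumes that the table “foobar” in data source “DS1” has an integer column foo and a floating point column bar; both column names and their types are assumptions made for illustration, as is the R-style # comment syntax.

```r
table <- load("DS1", "foobar")
a <- table$foo        # private integer vector (assumed type)
w <- table$bar        # private floating point vector (assumed type)

# A private value combined with a public value gives a private result.
b <- a + 1

# Two private values also give a private result.
s <- a + a

# Floating point values are converted to integers explicitly.
r <- round(w)
```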

Procedures that work on private data do not modify the data but return a modified copy. This ensures that the original data is preserved.

Note that most statistical operations work on int32 or int64 vectors so in terms of performance it’s best to use one of them for storing data because there will be no implicit data type conversions.

Vectors

Single element vectors can be created with ordinary scalar literals (1 evaluates to an integer vector containing one).

Multiple element vectors can be created with the c function. For example, c(1, 2, 3, 4) evaluates to an integer vector containing numbers one through four. This vector can also be created using the sequence syntax 1:4. The c function also concatenates vectors, strings and lists.
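The constructions described above, written out as a short sketch. The variable names are arbitrary and the # comments assume R-style comment syntax.

```r
v <- c(1, 2, 3, 4)        # integer vector with the numbers one through four
s <- 1:4                  # the same vector written as a sequence
w <- c(v, c(5, 6))        # concatenation: the numbers one through six

# c also concatenates strings.
greeting <- c("Hello, ", "world!")
```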

Note that there are no vectors of strings.

See the Lists section for how to index vectors and lists.

Lists

Lists can be created using list. For example, list(1, 2) evaluates to a list containing two single element integer vectors containing one and two. The elements of a list do not have to be of the same type. The c function also creates lists if it’s passed arguments with non-coercible types (for example, strings and integers).

List elements can be named. For example list(a=1, b=2) creates a list with values 1, 2 and corresponding names a, b. Named list elements can be referenced by name. For example, if x <- list(a=1) then x$a evaluates to 1. The names of a list can be queried with names(x).

Lists can be indexed with brackets. Indices start from 1. As in R, and unlike in most programming languages, indexing returns a list containing the elements at the indices between the brackets. For example, if a is a list, then a[1] is a list containing the first element of a, a[c(1, 3, 5)] is a list containing the first, third and fifth elements of a and a[4:6] is a list containing the fourth, fifth and sixth elements of a. If the index vector consists of booleans, an element of a is picked when the boolean in the same position of the index vector is TRUE.
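The indexing rules can be tried on a small public list. The values are arbitrary and the # comments assume R-style comment syntax.

```r
a <- list(10, 20, 30, 40, 50, 60)

first <- a[1]             # a list containing 10, not the value 10 itself
odd <- a[c(1, 3, 5)]      # a list containing 10, 30 and 50
upper <- a[4:6]           # a list containing 40, 50 and 60

# Boolean index: the elements at the TRUE positions are picked.
picked <- a[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)]

# Named elements can be referenced by name.
x <- list(a=1, b=2)
one <- x$a                # 1
```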

The same syntax can be used on private vectors. The length of private vectors is not changed but the values that are not picked are marked as missing. This syntax is useful for filtering data. For example, a[b > 0] is a private vector with elements of a where b is positive. Another option for filtering (that also works on database tables) is the subset procedure.
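A sketch of filtering one private column by another. It assumes the table “foobar” in data source “DS1” has numeric columns foo and bar (hypothetical names) and that # starts a comment.

```r
table <- load("DS1", "foobar")
a <- table$foo
b <- table$bar

# Same length as a; the elements where b is not positive are marked as missing.
filtered <- a[b > 0]
```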

Factors

Rmind supports private factor types. The CSV importer can create factors from string columns in CSV files. You can also create factors using factor. For example, factor(list("a", "b")) creates a factor with the levels “a” and “b”.

To get the levels of a factor you can use levels which returns a list of integer codes that the factor levels have been mapped to. The list elements are named according to the factor levels. For example, for the previously created factor, levels would return list(a=1, b=2).

You can use the label of a factor in comparisons. If we have a factor x with label "foo" then x == "foo" is a valid expression.
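Put together, the factor operations described above look like this. The variable names are illustrative and the # comments assume R-style comment syntax.

```r
f <- factor(list("a", "b"))

# Integer codes named by level; list(a=1, b=2) for this factor.
codes <- levels(f)

# A factor can be compared against one of its labels.
matches <- f == "a"
```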

Factors can be used as categorical data in hypothesis tests. freq and freqplot also support factors.

Matrices

Rmind has limited support for private matrices. There are currently no public matrices. A matrix can be constructed from private vectors a, b, c with matrix(a, b, c). The vectors will become columns of the matrix. The types of the vectors must match.

The binary operators supported on matrices are: addition, subtraction, multiplication, less-than, greater-than, less-than-equal, greater-than-equal, equal, not equal. The types and dimensions of the operands must match; there is no implicit casting or cycling. As with vectors, operations are performed element-wise.

The same operators are supported when one of the operands is public. Public matrices are represented as a list of column vectors. The number of vectors must match the number of columns of the matrix. If a column vector has fewer elements than there are rows in the matrix, the vector is cycled. The public vectors will be converted to the type of the matrix. For example, if the matrix contains booleans and the public vectors contain floating point numbers, the floating point numbers will be converted to booleans, so be aware of this. To check if all elements of a matrix m with 10 columns are positive, you can use m > rep(0, 10).

A matrix m can be filtered using another matrix f using the m[f] syntax.
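A sketch of the matrix operations described in this section. It assumes that a, b and c are private vectors of the same type and length that already exist in the session, and that the result of a comparison can be used directly as the filter matrix. The # comments assume R-style comment syntax.

```r
# a, b and c become the three columns of m.
m <- matrix(a, b, c)

# Element-wise operation on two matrices with matching types and dimensions.
doubled <- m + m

# Compare against public values, following the rep example in the text
# with one value per column.
positive <- m > rep(0, 3)

# Filter m with the matrix produced by the comparison.
filtered <- m[positive]
```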

The rows of boolean matrices can be reduced with the row.fold function.

Functions

Rmind functions are first class anonymous closures. Functions can be declared with function(argList) body where argList is the list of formal arguments and body is the expression whose value is the value of the function call.

To name a function, assign it to a variable.

Functions can have positional and keyword (optional) arguments. For example, increment <- function(x) x + 1 defines the increment function. If we define f <- function(x=10) x then f() is 10 while f(42) is 42. There’s also a special argument “...”. In the function body, “...” will be a list containing all arguments passed to the function after positional and keyword arguments.

Rmind has lexical scope. Some built-in functions also use dynamic scope. Positional arguments are evaluated in the calling scope, keyword arguments are evaluated in function scope. This means default values of keyword arguments can refer to each other. For example, you can write a function f <- function(x=10, y=2 * x) x + y where the default value of y is computed from the value of argument x.
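The default value rules can be checked with the function from the paragraph above. This sketch assumes that keyword arguments can be passed by name in a call, as in the cbind examples, and that # starts a comment.

```r
increment <- function(x) x + 1
f <- function(x=10, y=2 * x) x + y

r0 <- increment(41)   # 42
r1 <- f()             # x = 10, y = 2 * 10, result 30
r2 <- f(1)            # x = 1,  y = 2 * 1,  result 3
r3 <- f(y=1)          # x = 10, y = 1,      result 11
```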

As in R, function arguments are evaluated lazily, i.e. only when their value is needed. This means that some arguments may never be evaluated. Thus one should be very careful not to use expressions with side effects as arguments.

Early return is supported using return(expr) where the value of expr will be the value of the function call.
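The following sketch illustrates lazy evaluation and early return. The function names first and clamp are made up for this example, and the # comments assume R-style comment syntax.

```r
# y is never used in the body, so the second argument is never evaluated
# and nothing is printed.
first <- function(x, y) x
v <- first(1, print("never printed"))

# return ends the call early for negative input.
clamp <- function(x) {
    if (x < 0) return(0);
    x
}

lo <- clamp(-5)   # 0
hi <- clamp(3)    # 3
```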

Blocks

To sequence multiple expressions, separate them with ‘;’ and surround the sequence with curly braces. The value of the last expression will be the value of the block. For example, the increment function can be defined like this:

increment <- function(x) {
    x <- x + 1;
    x
}

NULL

NULL is a special value that indicates the lack of a value. For example, some functions can take NULL as the value of an optional argument.

Conditional expressions

Conditional expressions have the form if (condExpr) trueExpr else falseExpr. If condExpr evaluates to TRUE then the value of the conditional expression is the value of trueExpr, otherwise it’s the value of falseExpr. The false branch is optional. If it’s missing and condExpr evaluates to FALSE, the result is NULL. For example, a <- if (2 < 3) 42 assigns 42 to the variable a.

Looping

There are three looping constructs:

- repeat expr evaluates expr indefinitely, until the loop is exited (for example with break).
- for (i in seq) expr evaluates expr once for each element of seq, which must be a vector or a list. The variable i takes the value of each element of the sequence in turn. For example, for (i in 1:10) print(i) will print numbers one through ten. The value of the last evaluated expression is the value of the loop.
- while (expr) body evaluates body as long as the value of expr is TRUE. The value of the last evaluated expression is the value of the loop.

next continues with the next iteration of the loop.

break breaks out of the loop.
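A short sketch of the three constructs together with next and break. It assumes, as in R, that an assignment inside a loop body updates the variable defined before the loop, and that # starts a comment.

```r
# Sum the odd numbers from 1 to 10; even numbers are skipped with next.
total <- 0
for (i in 1:10) {
    if (i %% 2 == 0) next;
    total <- total + i
}

# repeat runs until break is reached.
n <- 0
repeat {
    n <- n + 1;
    if (n >= 5) break
}

# while re-evaluates its condition before every iteration.
k <- 10
while (k > 0) k <- k - 3
```

Under the stated assumption, total ends up as 25, n as 5 and k as -2.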

Model formulae

There’s a special built-in syntax for specifying models to modeling procedures like lm (linear regression) and glm (generalised linear model fitting). The operator for defining a model is ~. Currently we only support adding independent variables to the model. That means the left hand side of the model must be a single variable and the right hand side must be a sum of variables. For example, y ~ x1 + x2 defines a model with a dependent variable y and explanatory variables x1 and x2.

Dates

The CSV importer uploads dates as xor_uint32 values containing the components of a date (year, month, day) in succession. The usual comparison operators work on dates, as do the minimum and maximum functions when aggregating data. To extract components of dates, use the procedures year, month and day. The date procedure can be used to create a public date value for comparisons. The difftime procedure computes the difference of two dates (in days) and add.days can be used to add or subtract days from a date.

Private strings

Rmind has basic support for bounded length private strings. Bounded length means that each string takes up the same amount of space and the actual length of the string is private. String data can be imported with the CSV importer. Only ASCII encoded strings are supported and a limited set of operations can be applied to strings:

- Strings can be compared with the equal or not-equal operators.
- String columns can be aggregated with the functions “first” and “last”.
- String columns can be used as part of the aggregation key if the key is hashed using the unique.id procedure.
- String vectors can be filtered like numeric and boolean vectors.
- String vectors can be sliced using head and tail.

Modules

The language does not have a module system but it’s possible to split a program into separate code files. The source procedure takes a path of an Rmind program as an argument and interprets the source code of the program in the scope of the call to source. Note that the path passed to source is relative to the current working directory. The import procedure takes a path relative to the path of the file containing the call to import. If import is called in the REPL it behaves like source.
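The difference between the two procedures is easiest to see side by side. The file names below are hypothetical, as is the .r extension, and the # comments assume R-style comment syntax.

```r
# Resolved against the current working directory of the interpreter.
source("lib/stats_helpers.r")

# Resolved against the directory of the file that contains this call.
import("stats_helpers.r")
```

Using import in scripts keeps them working regardless of the directory from which rmind was started.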

Customising startup

On startup, Rmind will evaluate the .rmind_profile file in the home directory if it exists. This file can be used to define useful functions without having to import a module when working in the REPL.

Operators

The operators supported by Rmind are listed in the following table.

`$`	Used for extracting column vectors from database tables.
`^`	Exponentiation.
`-`	Negation/subtraction.
`:`	Sequence (`a:b` creates a vector from a to b).
`*`	Multiplication.
`/`	Division.
`%/%`	Integer division. Does not work with private vectors.
`%%`	Modulo. Does not work with private vectors.
`+`	Addition.
`<`	Less than comparison.
`<=`	Less than/equal comparison.
`>`	Greater than comparison.
`>=`	Greater than/equal comparison.
`==`	Equality comparison.
`!=`	Inequality comparison.
`!`	Boolean negation.
`&&`	Conjunction. This uses the first element of each operand. For point-wise conjunction use “&”. Does not work with private vectors.
`&`	Point-wise conjunction.
`||`	Disjunction. This uses the first element of each operand. For point-wise disjunction use “|”. Does not work with private vectors.
`|`	Point-wise disjunction.
`~`	Tilde used for model formulae (in linear regression and generalised linear model fitting).
`<-`	Assignment.
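A few of the operators applied to public vectors, as a quick sketch. The variable names are arbitrary and the # comments assume R-style comment syntax.

```r
p <- 2 ^ 10                                  # 1024
d <- 7 %/% 2                                 # 3
m <- 7 %% 2                                  # 1
s <- 3:6                                     # 3, 4, 5, 6

# Point-wise logical operators work element by element.
both <- c(TRUE, FALSE) & c(TRUE, TRUE)       # TRUE, FALSE
either <- c(TRUE, FALSE) | c(FALSE, FALSE)   # TRUE, FALSE
```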
