CSV Importer
To use CSV Importer to secret share input data, you need two files:
- Database export (in CSV format).
- Data model (in XML format), describing columns to import and their data types.
You also need to know which column separator your data file uses. CSV Importer supports the following separators: tabulator (t), space (s), semicolon (sc) and comma (c).
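As a quick way to confirm which separator a file uses before importing it, the separator codes can be tried out in a few lines of Python. This is only a sketch: the mapping from codes to literal characters, the helper name read_rows, the sample data and the presence of a header row are assumptions for illustration, not part of CSV Importer.

```python
import csv
import io

# Separator codes accepted by CSV Importer, mapped to the characters they
# presumably stand for (the mapping itself is an assumption).
SEPARATOR_CODES = {"t": "\t", "s": " ", "sc": ";", "c": ","}


def read_rows(text, separator_code):
    """Parse CSV text using the character behind a separator code."""
    delimiter = SEPARATOR_CODES[separator_code]
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical semicolon-separated sample with a header row.
sample = "amount;nr_of_employees;date_diff;negative_event\n1200;15;30;0\n"
rows = read_rows(sample, "sc")
print(rows[0])  # header row
print(rows[1])  # first data row
```

If every row comes back as a single-element list, the chosen separator code most likely does not match the file.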
The following is a simple example of a data model description XML file. You only have to list columns that you want to secret share in this file.
<table name="payment-history" dataSource="DS1"
       handler="import-script.sb">
  <column key="true" type="primitive">
    <source name="amount" type="int64"/>
    <target name="amount" type="int64"/>
  </column>
  <column key="false" type="primitive" ignore="true">
    <source name="nr_of_employees" type="uint32"/>
    <target name="nr_of_employees" type="uint32"/>
  </column>
  <column key="true" type="primitive">
    <source name="date_diff" type="uint32"/>
    <target name="date_diff" type="uint32"/>
  </column>
  <column key="true" type="primitive">
    <source name="negative_event" type="uint32"/>
    <target name="negative_event" type="uint32"/>
  </column>
</table>
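To review which columns and types a model file declares, the XML can be read with Python's standard library. The sketch below embeds a shortened copy of the model above. The helper name list_columns is hypothetical, and the script only reads the element and attribute names visible in the example; it makes no claim about how CSV Importer itself interprets them.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example data model.
MODEL_XML = """
<table name="payment-history" dataSource="DS1"
       handler="import-script.sb">
  <column key="true" type="primitive">
    <source name="amount" type="int64"/>
    <target name="amount" type="int64"/>
  </column>
  <column key="false" type="primitive" ignore="true">
    <source name="nr_of_employees" type="uint32"/>
    <target name="nr_of_employees" type="uint32"/>
  </column>
</table>
"""


def list_columns(model_xml):
    """Return (source name, source type, target type) for each column element."""
    table = ET.fromstring(model_xml)
    columns = []
    for column in table.findall("column"):
        source = column.find("source")
        target = column.find("target")
        columns.append((source.get("name"), source.get("type"), target.get("type")))
    return columns


for name, source_type, target_type in list_columns(MODEL_XML):
    print(f"{name}: {source_type} -> {target_type}")
```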
A more detailed data model example with different data types, date formats and transformations is available in /usr/share/doc/sharemind-csv-importer/examples/model-example.xml.
Using CSV Importer
With the data (payment-history.csv), its model (payment-history.xml) and the column separator known, CSV Importer can be invoked with a single command:
$ sharemind-csv-importer --conf client.conf --mode overwrite \
--csv payment-history.csv --model payment-history.xml \
--separator c --log payment-history.log
Before secret sharing and distributing the data, CSV Importer verifies that the CSV file matches the given data model. Upon finding a discrepancy, it either reports an error or issues a warning stating that it could not read a particular value (Couldn’t parse … on line … column …), in which case the value is left empty in the imported database. If this happens, review the console output and the log, and resolve these issues before continuing.
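Many such discrepancies can be caught before running the importer. The following sketch checks that integer values fit the types declared in the model. It is an independent pre-check written for illustration, not CSV Importer's own validation logic; the function name find_unparsable, the sample rows and the line numbering (data rows counted from 1) are assumptions.

```python
# Value ranges of the two integer types used in the example model.
TYPE_RANGES = {
    "int64": (-2**63, 2**63 - 1),
    "uint32": (0, 2**32 - 1),
}


def find_unparsable(rows, column_types):
    """Return (line, column, value) for each cell that does not fit its type.

    rows is a list of dicts keyed by column name (for example from
    csv.DictReader); column_types maps column names to type names.
    """
    problems = []
    for line_number, row in enumerate(rows, start=1):
        for name, type_name in column_types.items():
            low, high = TYPE_RANGES[type_name]
            value = row.get(name, "")
            try:
                ok = low <= int(value) <= high
            except (TypeError, ValueError):
                ok = False  # missing, empty or non-numeric cell
            if not ok:
                problems.append((line_number, name, value))
    return problems


# Hypothetical data: the second row has two values that cannot be imported.
column_types = {"amount": "int64", "date_diff": "uint32"}
rows = [
    {"amount": "1200", "date_diff": "30"},
    {"amount": "n/a", "date_diff": "-5"},
]
for line, column, value in find_unparsable(rows, column_types):
    print(f"line {line}, column {column}: cannot parse {value!r}")
```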
In addition, CSV Importer automatically creates the necessary classifiers described in the data model. For example:
Classifier mapping of column `gender': female - 1 male - 2
This classifier mapping should be shared with analysts using Rmind or other query interfaces, as the numerical classifier values have to be used in analysis scripts.
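An analyst who receives the mapping can keep it next to the analysis scripts and translate between labels and numeric codes. The sketch below uses the gender mapping from the example output; the helper names encode and decode are hypothetical and are not part of Rmind or CSV Importer.

```python
# Classifier mapping as printed by CSV Importer in the example above.
GENDER_CLASSIFIER = {"female": 1, "male": 2}


def encode(label, classifier):
    """Translate a label into the numeric code stored in the database."""
    return classifier[label]


def decode(code, classifier):
    """Translate a numeric code from a query result back into its label."""
    reverse = {value: key for key, value in classifier.items()}
    return reverse[code]


print(encode("male", GENDER_CLASSIFIER))
print(decode(1, GENDER_CLASSIFIER))
```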
Finally, CSV Importer asks whether to continue with secret sharing and uploading the input data. Keep in mind that secret sharing triples the data volume, so uploading it to the computation nodes may take some time.
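The tripling can be illustrated with additive secret sharing, where each value is split into three random-looking shares, one per computation node. This is a sketch under the assumption of an additive three-party scheme over 64-bit integers; the text above does not specify the scheme, and the names share and reconstruct are hypothetical.

```python
import secrets

MODULUS = 2**64  # assumed ring size for 64-bit values


def share(value, parties=3):
    """Split value into additive shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine all shares into the original value."""
    return sum(shares) % MODULUS


shares = share(1200)
print(len(shares))          # one share per node
print(reconstruct(shares))  # original value, recoverable only from all shares
```

Each share is as large as the original value, so storing one share on each of three nodes takes three times the space of the plain data.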