The purpose of the DataFlow Project is to establish research data management services that are easy for a wide range of researchers to use, that meet the data repository needs of their research institutions, and that can be robustly deployed in a virtualized cloud environment, thus potentially offering considerable cost savings compared with deployment through local provision. These services will enable researchers to organise their personal research data more easily, more flexibly and more securely, and will facilitate the repository archiving and publication of research datasets.
The DataFlow architecture is based on a layered approach with an emphasis on simplicity, modularity and abstraction. In practice, this means web services linked via RESTful APIs. The first of these services, DataStage, is based on the local research data management infrastructure developed within the JISC ADMIRAL Project to serve Oxford researchers. It is a secure, personalized 'local' file management environment for use at the research group level, and appears as a mapped drive on the user's PC. This allows users to save files to the cloud simply by clicking "save" in Excel or another application. DataStage also provides additional Web access and DropBox integration, can be configured with private, shared, collaborative, public and communal directories with easy-to-configure access controls, is secured by means of automated daily backup, and can dynamically invoke additional cloud storage as need dictates.
The second service is DataBank, an institutional-level research data repository, which will be a virtualized cloud-deployable version of the data repository of the same name created by Oxford's Bodleian Library. DataBank instances will expose both human- and machine-readable metadata describing their datasets, and will assign Digital Object Identifiers (DOIs) to hosted datasets, obtained automatically using the DataCite API, to aid discovery and citation. Both VMware virtualized services may be deployed locally or on a variety of cloud infrastructures, and both will be SWORD-compliant, using the SWORD 2 protocol to wrap datasets for repository submission. DataStage will use SWORD to submit valuable datasets to any compliant institutional or subject-specific repository, while DataBank will provide a SWORD-compliant ingest service for datasets from DataStage or similar SWORD-compliant clients. Both the SWORD communication protocol and the DataStage data packaging protocol can be used with any data type.
As a result, DataStage and DataBank are generic and domain-agnostic, being designed primarily to serve the 'long tail' of research datasets, with file sizes ranging from about 1 MB to 100 GB (for example, HD video), rather than the very large datasets arising from particle physics, astronomy or high-throughput biological 4D imaging assays, for which provision is already available. Both can be deployed by research groups, departments, universities or research institutes, and can be customized with user-specific text and logos. The strength of this project thus lies in its two-stage data management design, which enhances researchers' ability to manage all their data 'locally' while providing an easy submission path for those datasets worthy of repository preservation and publication.
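The SWORD 2 submission path from a DataStage-like client to a DataBank-like repository can be illustrated with a minimal Python sketch. It is not the DataFlow implementation: the function names, the example collection URL and the choice of a plain zip package are assumptions made for illustration. The header names follow the general pattern of a SWORD 2 binary deposit, and the request is only constructed, never sent.

```python
import hashlib
import io
import urllib.request
import zipfile

# Packaging identifier for a plain zip deposit; assumed here for illustration.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def package_dataset(files):
    """Zip a {filename: bytes} mapping into an in-memory archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in sorted(files.items()):
            archive.writestr(name, content)
    return buffer.getvalue()


def sword_headers(package, filename):
    """Return the HTTP headers that describe a zipped dataset deposit."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Content-MD5": hashlib.md5(package).hexdigest(),  # integrity check
        "Packaging": SIMPLE_ZIP,
        "In-Progress": "false",  # the deposit is complete, not a draft
    }


def build_sword_deposit(collection_url, package, filename):
    """Build (but do not send) a POST request to a repository collection."""
    return urllib.request.Request(
        collection_url,
        data=package,
        headers=sword_headers(package, filename),
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical dataset and collection URL, for illustration only.
    dataset = {"readme.txt": b"Example dataset", "results.csv": b"a,b\n1,2\n"}
    package = package_dataset(dataset)
    request = build_sword_deposit(
        "https://databank.example.org/sword/collection", package, "dataset.zip"
    )
    print(request.get_method(), request.full_url, len(package), "bytes")
```

Because the package is an opaque zip archive, the same request shape applies whatever the files contain, which is the sense in which the submission path is independent of data type.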