The DataFlow Project is building a two-stage, cloud-deployable data management infrastructure for researchers that can be used across national HEIs: (a) DataStage, which lets researchers manage their research data locally, and (b) DataBank, which preserves and publishes valuable research.
For local data management, rather than storing datasets on external hard drives in the lab, researchers can save their work to a DataStage file system that appears as a mapped drive on their computer, with no special software to install. DataStage will let read/write permissions be set separately for Principal Investigators and for individual members of a research group, to ensure appropriate levels of data confidentiality. The system will be lightweight and will adopt best-practice standards to make sure data is secure and easy to retrieve.
Researchers will then be able, through a convenient web interface, to deposit selected datasets from their local DataStage file management system in their institutional or subject-specific data repository, which will be a cloud-based instance of the generic DataBank data repository.
DataBank will allow users to:
- Package datasets with descriptive metadata, including a DOI issued by DataCite;
- Determine who can access data: options include fully public datasets, embargoed material, and “dark” or private data; and
- Search for and retrieve available datasets, enabling data to be reused and/or made available for peer review.
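
The three access options in the list above can be pictured with a small sketch. This is an illustration only and not DataFlow code: the names `AccessLevel`, `Dataset` and `can_read` are hypothetical, and the rules that an embargoed dataset becomes readable on its embargo date and that the depositor can always read their own data are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    """Hypothetical labels for the three access options named in the text."""
    PUBLIC = "public"
    EMBARGOED = "embargoed"
    DARK = "dark"


@dataclass
class Dataset:
    """Hypothetical record for a deposited dataset."""
    identifier: str
    access: AccessLevel
    embargo_until: Optional[date] = None  # only meaningful when EMBARGOED


def can_read(dataset: Dataset, today: date, is_owner: bool = False) -> bool:
    """Decide whether a reader may retrieve the dataset.

    Assumptions (not taken from the project description): the owner
    always has access, and an embargo lifts on the embargo date itself.
    """
    if is_owner:
        return True
    if dataset.access is AccessLevel.PUBLIC:
        return True
    if dataset.access is AccessLevel.EMBARGOED:
        return dataset.embargo_until is not None and today >= dataset.embargo_until
    # DARK (private) data is never readable by anyone but the owner.
    return False
```

Under these assumptions, a dataset embargoed until a given date is hidden from other readers the day before that date and visible from that date onward, while "dark" data stays visible only to its owner.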
We are using a standards-based approach to ensure compatibility with as many systems as possible. For DataStage, researchers will be able to use Linux, Windows or Mac operating systems, while DataBank will be deployable on the Eduserv cloud, on a commercial data storage cloud, or on a local institutional server. The infrastructure will be open and scalable, to meet the needs of individual researchers and their institutions, for both public and confidential data.



