DRT: New toolbox for large-scale EEG
Sep 24th 2022 blog
New toolbox for large-scale EEG Data Structure
Li Dong, Yufan Zhang, Lingling Zhao, Ting Zheng, Weidong Wang, Jianfu Li, Diankun Gong, Tiejun Liu, Dezhong Yao. DRT: A new toolbox for the Standard EEG Data Structure in large-scale EEG applications. SoftwareX, Volume 17 (2022), 100933.
Abstract
The current evolution of “open neuroscience” has led to an increased amount of research on large-scale electroencephalography (EEG) applications, resulting in large quantities of accumulated EEG data. The batch sharing and processing of these massive EEG data play an important role in EEG studies within or across laboratories and result in an increasing requirement for a standard data file structure for existing EEG data. In this work, a new and more flexible data structure, named the Standard EEG Data Structure (SEDS), was proposed to meet the needs of both small-scale EEG data batch processing in single-site studies and large-scale EEG data sharing and analysis in single-/multisite studies (especially on cloud platforms). Furthermore, two versions (MATLAB and Docker versions) of the EEG Datafile Restructuring Toolbox (DRT) were developed to restructure EEG data files according to the SEDS. The DRT GUI (MATLAB version) dramatically reduces the time required for novice researchers, while the DRT (Docker version) is more efficient for experienced researchers. All materials, including SEDS documents, tools, example datasets, etc., are available on the WeBrain website (https://webrain.uestc.edu.cn/) and Wiki (https://github.com/WeCloudHub/DRT). We hope that these two user-friendly toolboxes can make the relatively novel SEDS easier to collaboratively study, especially for applications in large-scale EEG studies.
(The rights and content of the publication are under a Creative Commons license.)
"1. Motivation and significance Since the human scalp electroencephalogram (EEG) was first reported by Berger in 1929 [1], the scalp EEG has been a common technique for noninvasively detecting brain activity, with a high temporal resolution and low cost. The scientific report statistics shows that there is an increasing amount of interest in EEG due to its irreplaceable value in brain science. The recommendations for open science and the rise of cloud neuroscience [2], [3] have further led to more efforts with large-scale EEG applications in the EEG community. In addition, there has been the further accumulation of a number of public and local large EEG datasets, such as the TUH-EEG Corpus dataset with more than 30000 clinical EEGs ( https://www.isip.piconepress.com/projects/tuh_eeg/index.shtml) [4] and the “EEG Motor Movement/Imagery Dataset” (https://archive.physionet.org/pn4/eegmmidb/) with 1500 EEGs [5], [6]. Sharing and processing these massive EEG datasets play important roles in large-scale brain imaging collaborative studies across laboratories and further increase the requirement for a standard data file structure of existing large-scale EEG data. In addition, in the neuroscience field, there is a growing concern about data replication and reproducibility [7], i.e., whether the original data and analysis results can be well replicated by others [8]. However, many potential issues including various data file structures, confusing file organization and a lack of interpretation of the raw, intermediate and final data may largely decrease the efficiency of data sharing and analysis, as well as data reproducibility. Therefore, a tool of standard EEG data file structure for large-scale EEG sharing and processing (especially using cloud high-performance computing (HPC) facilities) is required, which is critical for establishing broader collaborative large-scale EEG studies across laboratories.
Due to the versatile advantages of EEG, including light weight, high compatibility with environments and systems, low cost, wearability, being wireless, etc., the field of EEG applications is broad. Thus, there are far more EEG equipment manufacturers than manufacturers of other noninvasive imaging equipment, and these EEG equipment manufacturers are building different hardware systems with different software and data formats. For example, the EEG data recorded by the Neuroscan EEG system may generate a set of “*.dat, *.dap, and *.rs3” files using Curry 7, while the data recorded by the Brain Products EEG system may generate a set of “*.vhdr, *.vmrk, *.dat” files using the Brain Vision Analyzer. Furthermore, because commercial or free EEG tools often have their own proprietary data formats for processing data, the file formats of the intermediate and final EEG data generated by different tools also differ (e.g., EEGLAB [9] can save data in the “*.SET” format, and Curry 7 can save data in the “*.CNT” format). Such diversity of EEG data files may be an impediment to the reuse of data, as well as to building large-scale EEG databases for sharing across laboratories.
To address the abovementioned issue of EEG data heterogeneity, some efforts have been made in the neuroscience community in recent years. Teeters et al. [10] proposed a common neurophysiology data format based on HDF5 (http://www.hdfgroup.org/HDF5); this format uses “HDF5 groups” for the directories and “HDF5 datasets” corresponding to the files to store arbitrary array-type data. However, this data standardization is mainly used for the data of cellular electrophysiology and optical imaging experiments and not scalp EEGs, and a graphical utility named HDFView must be used to browse files while using HDF5. Bigdely-Shamlo et al. developed a “containerized” approach, named the EEG Study Schema (ESS) [11], to organize EEG data and metadata using a standardized file structure and metadata encapsulation schema. The limitation of this approach is the unmet need for easy manual or semiautomated usage: users have to write MATLAB scripts manually to import the metadata from semistructured formats, which raises the programming skills required of novice users. Recently, as an extension to the Brain Imaging Data Structure (BIDS), BIDS-EEG [12] has been proposed for readily organizing and sharing raw EEG data within and between EEG laboratories. The basic definition of BIDS-EEG assumes that each subject has a directory of raw data containing subdirectories for each modality and session. A number of EEG tools and public datasets are supported or organized using this standard. However, BIDS-EEG does have some limitations, including the need to support more data formats (4 kinds of formats are supported currently), limited compatibility with both small-scale EEG batch processing in single-site studies and large-scale EEG sharing and processing across multiple sites (especially on cloud platforms), as well as relatively low flexibility for “pure” EEGs (e.g., for separated EEG data without other modalities).
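The subject/session/modality layout that BIDS-EEG assumes can be sketched in a few lines of Python. This is only an illustration of the public BIDS naming convention, not part of the DRT or the SEDS; the function name `bids_eeg_path` and its parameters are hypothetical, and the subject, session and task labels are made up for the example.

```python
from pathlib import PurePosixPath
from typing import Optional


def bids_eeg_path(subject: str, task: str,
                  session: Optional[str] = None,
                  ext: str = ".edf") -> PurePosixPath:
    """Build a relative BIDS-EEG style path for one raw recording.

    Hypothetical helper: it assumes the usual BIDS pattern
    sub-<label>/[ses-<label>/]eeg/sub-<label>[_ses-<label>]_task-<label>_eeg<ext>.
    """
    entities = [f"sub-{subject}"]
    folder = PurePosixPath(f"sub-{subject}")
    if session is not None:
        # The session level is optional and sits between subject and modality.
        entities.append(f"ses-{session}")
        folder = folder / f"ses-{session}"
    entities.append(f"task-{task}")
    filename = "_".join(entities) + "_eeg" + ext
    # "eeg" is the modality subdirectory.
    return folder / "eeg" / filename


if __name__ == "__main__":
    print(bids_eeg_path("01", "rest", session="01", ext=".vhdr"))
    print(bids_eeg_path("02", "rest"))
```

Because the file name repeats the subject and session labels, a recording stays identifiable even when it is copied out of its directory, at the cost of a fairly rigid naming scheme.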
In this work, a new and more flexible data structure, named the Standard EEG Data Structure (SEDS), was proposed to meet the needs of both small-scale EEG data batch processing in single-site studies and large-scale EEG data sharing and analysis in single-/multisite studies (especially on cloud platforms). The structure may increase the reproducibility and extensibility of EEG data. Furthermore, two versions of the EEG Datafile Restructuring Toolbox (DRT) were developed to restructure EEG data files according to the SEDS. In addition, because the quality assessment (QA) of the raw EEG data is an important issue for the reproducibility of EEG results, a QA module was integrated into the DRT as an optional function when converting data to the SEDS."