Skip to content

Bundle

URI: https://w3id.org/fga-wg/schema/bundle

Name: Bundle

Classes

Class Description
AccessMethod Description of an access method (i.e. communication protocol) that can be used to fetch a File object (orig: DrsObject). Exact copy of the AccessMethod object of the GA4GH DRS data model (https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.4.0/docs/#tag/AccessMethodModel)
AccessURL The URL and associated HTTP headers to access the File object (orig: DrsObject). Exact copy of AccessURL object of the GA4GH DRS data model (https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.4.0/docs/#tag/AccessURLModel).
Analysis Represents the computational processing applied to data from a sequencing experiment, or from another analysis. This can be described at the level of individual analysis steps in a workflow/pipeline, or more generally for the workflow/pipeline as a whole.
Any The Any allows the range of a slot to be any object (see https://linkml.io/linkml/schemas/advanced.html#linkml-any-type).
AssessmentValue Key-value pair representing a specific value produced by a quality assessment.
Bundle A bundle representing a set of genome annotation files, organised in sub-collections. Metadata has been harmonised in line with the "FAIRification of Genomic Annotations" data model.
BundleMetadata Top-level metadata about a bundle representing a set of genome annotation files, harmonised according to the "FAIRification of Genomic Annotations" data model. This includes self-referential identifiers and versioning of public deposits of the harmonized metadata.
Checksum A checksum of a File object (orig: DrsObject). Exact copy of the Checksum object of the GA4GH DRS data model (https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.4.0/docs/#tag/ChecksumModel).
Contact Contact information for a person or an organisation.
Deposit Information about a public deposit of a document containing metadata about a set of genome annotation files.
Donor Information about the donor or complete organism from which the sample was taken.
Experiment Represents a sequencing experiment that has been carried out within a study, based on biological samples, and providing data files as output. Subsequent analysis of output data is described by the Analysis entity.
File General information about a particular data file. Most fields (marked with an asterix*) are copied from the GA4GH DRS DrsObject model (https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.4.0/docs/#tag/DrsObjectModel), which is the top-level object returned from a DRS server in response to a successful lookup call (i.e. https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.4.0/docs/#tag/Objects).
        GenomicAnnotationFile Information about a genomic annotation / track file. GenomicAnnotationFile is a specification of the File entity and inherits all the fields defined in File, in addition to the fields that are specific to GenomicAnnotationFile, as detailed here.
FileCollection A collection of files, according to some selection criteria. In the context of the "FAIRification of Genomic Annotations" data model, we are mainly interested in "GenomicAnnotationFile" entities, but other types of files can also be contained in a collection, e.g. raw data files such as FASTQ files.
GenomeAssembly Information about of the exact genome assembly used to generate the annotation file, defining the genomic coordinate system for the sequence features.
InputSource General object representing the source of data files, samples, or other entities used as input to a process or a result. An input source refering to a single file or sample object will represent that item only, while an input source referring to a container or process may represent a number of disctinct input items. InputSource also contains information about the type of relationship, replication labelling, versioning and retrieval date.
OntologyVersions Information about an ontology used for the bundle.
QualityAssessment Represents the results of a quality assessment that has been carried out on a data file resulting from an experiment or analysis.
Sample Information about a biospecimen/sample used as raw material for lab experiments.
Study A scientific study, i.e. a unit of research, within which experiments and/or analyses have been carried out.
Term Helper entity to represent an ontology term as a data value.
TrackGeometry Overall geometric properties of the sequence features in the genomic annotation file if considered as an one-dimensional genome browser track, in line with the track type delineations from Gundersen et. al, 2011. While conceptually based on visual characteristics, these properties are also useful to e.g. select relevant annotation files for non-visual analyses.

Slots

Slot Description
access_method Access method used to access the File object (orig: DrsObject).
access_methods The list of access methods that can be used to fetch the data file.
access_url AccessURL object providing URL and associated HTTP headers to access the File object (orig: DrsObject).
accessions Database accession numbers for the genome assembly, if available. Should precisely identify the genome assembly and be omitted if changes have been made to the assembly after retrieval, such as removing the alternate sequences.
aliases Human-readable aliases of the genome assembly. Can be imprecise, as preciseness is enforced in the other fields.
analyses Information about computational processing and analyses that have been carried out to generate the files.
analysis_description Human-readable description of the analysis.
analysis_external_id External, globally unique identifier for the experiment.
analysis_id Internal identifier for the experiment (unique within the metadata deposit).
analysis_input_sources External or internal references to sources for the input data analyzed. Internal references should lead to FileCollection, File, Experiment, or Analysis objects.
analysis_label A human-readable description of the analysis, short enough to be used for listings within software user interfaces, tables, illustration legends, etc.
analysis_main_tool Main software tool used for the analysis.
analysis_main_tool_version Version of the main software tool used for the analysis.
analysis_protocol Document describing the analysis protocol that was followed.
analysis_study_ref Internal reference to the study within which the analysis has been carried out.
analysis_type The type of analysis carried out.
analysis_workflow External reference to the analysis workflow, with availability in at least one machine-operable form (e.g. CWL, Nextflow, ...).
antibody_target The target of the antibody used in the experiment.
assay_type Sequencing technique intended for this library.
assessment_details_url URL to a report containing the detailed output from the quality assessment.
assessment_method Quality assessment method that has been carried out (e.g. BUSCO, OMArk, peak calling statistics, etc.)
assessment_values Main values produced by the quality assessment.
biological_processes Biological processes illuminated by the experiment.
biological_replicate_labels Labels denoting the biological replicates within which the relation is defined, if any.
biospecimen_classification Main type of structural unit to be used for classification of the biospecimen/sample.
bundle_deposit Information about the public deposit of the bundle.
bundle_description Human-readable description of the bundle.
bundle_input_sources References to other input sources from which this entire bundle was derived, or possibly including DOIs of other bundles used as source.
bundle_label A human-readable description of the bundle, short enough to be used for listings within software user interfaces, tables, illustration legends, etc.
bundle_metadata Top-level metadata about the bundle of genomic annotation files.
bundle_ontology_versions Map from the version-agnostic URL to a versioned URL (e.g. "versionIRI" in owl) of each ontology used in the current metadata deposit (corresponding to deposit_versioned_id").
cell_line Cultured cell line used in the biospecimen/sample.
cell_type Cell type of isolated normal cells in the biospecimen/sample.
checksum The hex-string encoded checksum for the data.
checksum_type The digest method used to create the checksum. The value (e.g. sha-256) SHOULD be listed as Hash Name String in the https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg [IANA Named Information Hash Algorithm Registry]. Other values MAY be used, as long as implementors are aware of the issues discussed in https://tools.ietf.org/html/rfc6920#section-9.4 [RFC6920]. GA4GH may provide more explicit guidance for use of non-IANA-registered algorithms in the future. Until then, if implementors do choose such an algorithm (e.g. because it's implemented by their storage provider), they SHOULD use an existing standard type value such as md5, etag, crc32c, trunc512, or sha1.
checksums A list of checksums of the data file. At least one checksum must be provided. For blobs, the checksum is computed over the bytes in the blob.
contact_id Globally unique identifier for a person (e.g. ORCID ID) or organisation (e.g. BioProject accession).
created_time Timestamp of content creation in RFC3339. (This is the creation time of the underlying content, not of the JSON object.).
data_content Classification describing the file's purpose or contents.
database_accessions Accession numbers for database records used as input source. Used in connection with "inputsource_external_ref".
date_of_retrieval Date of retrieval from the input source, typically used to timestamp downloading data from a database or URL.
deposit_first_created The date and time of the creation of the first deposited version of the metadata document.
deposit_id A globally unique and persistent identifier for the public deposit of the metadata document. A DOI or other persistent identifier is recommended.
deposit_last_changed The date and time of the last deposited change of the current metadata document (corresponding to "deposit_versioned_id").
deposit_versioned_id A globally unique, persistent and versioned identifier for the public deposit of the metadata document. A versioned DOI to a deposited document is recommended.
deposit_versioned_ref Reference to versioned id of deposit containing this file collection.
design_description The high-level experiment design including layout, protocol.
donor_age Age of the donor/organism at the time of sampling
donor_clinical_information Clinical information of the donor/organism at the time of sampling.
donor_development_stage Development stage of the donor at the time of sampling.
donor_external_id External, globally unique identifier for the donor/organism.
donor_id Internal identifier for the donor/organism (unique within the metadata deposit).
donor_organism_ref Internal reference to the donor/organism from which the biospecimen/sample was taken.
donors Information about the donors or complete organisms from which the samples were taken.
drs_uri A drs:// hostname-based URI, as defined in the DRS documentation, that tells clients how to access this object. The intent of this field is to make DRS objects self-contained, and therefore easier for clients to store and pass around. For example, if you arrive at this DRS JSON by resolving a compact identifier-based DRS URI, the self_uri presents you with a hostname and properly encoded DRS ID for use in subsequent access endpoint calls.
edge_weight_type The type of values associated with the edges.
edges_are_directed Whether the edges linking sequence features are directed (at least one edge between sequence features is defined with a direction).
edges_denote_parents Whether the edges linking sequence features denote a parent-child relationship (all edges between sequence features denote parent-child relationships such as genes to exons, i.e. where the child is fully covered by the parent).
edges_have_weights Whether the edges linking sequence features are weighted (at least one edge between sequence features has an associated weight).
elements_circular Whether the sequence features have circular coordinates (at least one feature that cross a sequence border).
elements_overlapping Whether the sequence features are overlapping (at least one base pair is simultaneously covered by two sequence features).
email E-mail address of the person or organisation.
experiment_external_id External, globally unique identifier for the experiment.
experiment_id Internal identifier for the experiment (unique within the metadata deposit).
experiment_label A human-readable description of the experiment, short enough to be used for listings within software user interfaces, tables, illustration legends, etc.
experiment_samples External or internal references to samples used in the experiment. Internal references should refer to Sample objects.
experiment_study_ref Internal reference to the study within which the experiment has been carried out.
experiments Information about sequencing experiments that have been carried out to generate the files.
file_collections Information about collections of files contained in this dataset, each collection defined according to some selection criteria.
file_description A human readable description of the data file.
file_external_id External, globally unique identifier for the data file.
file_id Internal identifier for the data file (unique within the metadata deposit).
file_input_sources External or internal references to data sources for the file, typically a data collection or a process that has generated the file. Internal references should lead to FileCollection, File, Experiment, or Analysis objects.
file_label A human-readable description of the data file, short enough to be used for listings within software user interfaces, tables, illustration legends, etc.
file_name A string that can be used to name a data file. This string is made up of uppercase and lowercase letters, decimal digits, hypen, period, and underscore [A-Za-z0-9.-_]. See http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_282 [portable filenames].
file_size The file size in bytes.
file_type The file format of the data file.
file_version A string representing a version. (Some systems may use checksum, a RFC3339 timestamp, or an incrementing version number.).
filecollection_contact Contact point to the creator and/or maintainer of the file collection.
filecollection_description Human-readable description of the file collection.
filecollection_external_id External, globally unique identifier for the file collection (in most cases, this will not exist).
filecollection_id Internal identifier for the file collection (unique within the metadata deposit).
filecollection_input_sources References to other input sources from which this file collection was derived.
filecollection_label A human-readable description of the file collection, short enough to be used for listings within software user interfaces, tables, illustration legends, etc.
filecollection_refs Internal references to the FileCollection objects (within the deposit) that contains the data file, if any.
files Information about particular genome annotation (and other relevant) files.
genome_assembly Information about the genome assembly used to generate the genomic annotation file, consequently defining the genomic coordinate system for the annotation.
genomic_annotation_digest Content-derived digest for distributed identification of genomic annotation files. (This field is currently a placeholder, as an algorithm for generating such a digest is yet to be specified.).
has_edges Whether the sequence features are linked across positions (at least one edge between features exists).
has_gaps Whether there are gaps between the sequence features (there exists at least one gap between two features on the same sequence).
has_lengths Whether the sequence features have lengths (at least one feature spans more than 1 base pair).
has_names Whether the sequence features are named (at least one feature has a name).
has_strands Whether the sequence features are stranded (at least one feature has strand information).
has_values Whether the sequence features have associated values (at least one feature has an associated value).
headers An optional list of headers to include in the HTTP request to url. These headers can be used to provide auth tokens required to fetch the object bytes.
id External, globally unique identifier for the ontology term (in CURIE form).
inputsource_external_ref Reference to an external entity as the input source, using a globally unique identifier or an URL. External references will in most cases refer to a database, data record, data file, website or other data source. One of "inputsource_external_ref" or "inputsource_ref" must be specified.
inputsource_ref Reference to an internal object as the input source using a local identifier. Entities to be used as an internal input source includes FileCollection, Sample, Experiment, Analysis or File as restricted by the description of the field where the input source is used. One of "inputsource_external_ref" or "inputsource_ref" must be specified.
instrument Technology platform used to perform nucleic acid sequencing, including name and/or number associated with a specific sequencing instrument model. It is recommended to be as specific as possible for this property (e.g. if the model/revision are available, providing that instead of just the instrument maker).
key Key/name of the assessment value.
label Human-readable label associated to the term id in the current version of the ontology (as listed in the "ontology_versions" field of the Deposit object).
lengths_constant Whether the sequence lengths are constant (all sequence features have the same length, excluding features at the very end of a sequence).
library_layout Whether the library was built as paired-end, or single-end.
mime_type A string providing the mime-type of the data file.
molecule_type Specifies the type of source material that is being sequenced.
name Name of the person or organisation.
namespace The CURIE namespace (prefix) an ontology (e.g. "GO" for Gene Ontology).
ontology_url The version-agnostic URL of the ontology (e.g. the IRI of the ontology in OWL).
organism_tissue Part of organism (typically tissue or organ) from which the biospecimen/sample was taken, or cell line was derived from.
other_biospecimen Other biospecimen-related terms that can be used to further classify the biospecimen/sample.
phenotype Main phenotype (e.g. disease) connected to the biospecimen/sample.
project_external_ref Reference to a project within which the study was carried out (preferably a BioProject CURIE).
project_name Name of the project within which the study was carried out.
publications List of (relevant) publications containing the results of the study (in the form of DOI CURIEs).
qualified_relation A description of the relationship with the input source.
quality_assessments An array of QualityAssessment objects containing the main quality scores from assessment techniques applied to the data file.
region Name of the region in the cloud service provider that the object belongs to.
run_provenance Document detailing the provenance of the experiment or analysis run which produced the file as one of its outputs. The provenance info should include software versions, parameter settings, etc.
sample_collection_date Date of sample collection.
sample_collection_location Geographical location where the sample was collected.
sample_description Human-readable description of the biospecimen/sample and the sampling process.
sample_external_id External, globally unique identifier for the biospecimen/sample.
sample_id Internal identifier for the biospecimen/sample (unique within the metadata deposit).
sample_label A human-readable description of the sample, short enough to be used for listings within software user interfaces, tables, illustration legends, etc.
samples Information about the biospecimens/samples used as raw material for lab experiments.
sampling_protocol Protocol detailing the collection and treatment of the biospecimen/sample.
seqcol_digest Top-level sequence collection digest according to the GA4GH refget, Sequence Collections standard (v1.0). This a globally unique identifier for the genome assembly, algorithmically derivable from the genome assembly content. Usage is to uniquely identify the exact genome assembly used and allow detailed comparisons across genome assembly variants (say, variants of the GRCh38 assembly).
seqcol_ordered_coord_system Content-derived digest that uniquely identifies the ordered coordinate system of the genome assembly. (Coordinate systems with the same sequence names and lengths, but where the sequences are ordered differently, will have different ordered digests.). Usage is the ordered coordinate system digest can be used to uniquely generate a chromSizes file, useful in a number of analysis tools. Definition is the ordered coordinate system digest is defined as the level 1 digest of the name_length_pairs attribute of the sequence collection generated from the genome assembly.
seqcol_unordered_coord_system Content-derived digest that uniquely identifies the order-invariant coordinate system of the genome assembly. This digest will be shared across all coordinate systems with the same sequence names and lenghts, regardless of the order of the sequences. Usage is the order-invariant coordinate system digest can be used to uniquely describe the coordinate system of a particular genome browser instance and the annotation files that are compatible with it. Definition is the order-invariant coordinate system digest is defined as the level 1 digest of the sorted_name_length_pairs attribute of the sequence collection generated from the genome assembly.
sequence_features List of sequence features described by the genomic annotation file.
sequencing_protocol Set of rules which guides how the sequencing protocol was followed. Change-tracking services such as Protocol.io or GitHub are encouraged instead of dumping free text in this field.
sex Biological sex of the donor/organism.
species_taxon Taxonomical classification of the species of the donor/organism.
studies The scientific studies, i.e. units of research, within which experiments and/or analyses have been carried out.
study_abstract Abstract of the study.
study_contact Contact point for the study.
study_external_id External, globally unique identifier for the study (preferably a BioStudies CURIE).
study_id Internal identifier for the study (unique within the metadata deposit). Namespace: "study".
study_title Title of the study.
technical_replicate_labels Labels denoting the technical replicates within which the relation is defined, if any.
track_geometry Geometric properties of the sequence features in the genomic annotation file if considered as an one-dimensional genome browser track (also relevant for non-visual analyses).
updated_time Timestamp of content update in RFC3339, identical to created_time in systems that do not support updates. (This is the update time of the underlying content, not of the JSON object.).
url A fully resolvable URL that can be used to fetch the actual object bytes.
value Value corresponding to the assessment key.
value_type The type of values associated with the sequence features, if any.
version Version information for the retrieval from the input source.
versioned_ontology_url The versioned URL of the ontology (e.g. the "versionIRI" in OWL).

Enumerations

Enumeration Description
AccessMethods Access methods (i.e. communication protocols), following the vocabulary defined in the GA4GH DRS specification (https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.4.0/docs/#tag/AccessMethodModel/operation/getAccessMethod).
BiospecimenClassification Vocabulary from the ENCODE model describing the general category of boispecimen (see "Properties->classification" in https://www.encodeproject.org/profiles/biosample_type).
DataTypes Types of data values associated with sequence features or edges.
OutputType Vocabulary from the ENCODE model describing the purpose or content of a file (see "Properties->output_type" in https://www.encodeproject.org/profiles/file).

Types

Type Description
Boolean A binary (true or false) value
Curie a compact URI
Date a date (year, month and day) in an idealized calendar
DateOrDatetime Either a date or a datetime
Datetime The combination of a date and time
Decimal A real number with arbitrary precision that conforms to the xsd:decimal specification
Double A real number that conforms to the xsd:double specification
Float A real number that conforms to the xsd:float specification
Integer An integer
Jsonpath A string encoding a JSON Path. The value of the string MUST conform to JSON Point syntax and SHOULD dereference to zero or more valid objects within the current instance document when encoded in tree form.
Jsonpointer A string encoding a JSON Pointer. The value of the string MUST conform to JSON Point syntax and SHOULD dereference to a valid object within the current instance document when encoded in tree form.
Ncname Prefix part of CURIE
Nodeidentifier A URI, CURIE or BNODE that represents a node in a model.
Objectidentifier A URI or CURIE that represents an object in the model.
Sparqlpath A string encoding a SPARQL Property Path. The value of the string MUST conform to SPARQL syntax and SHOULD dereference to zero or more valid objects within the current instance document when encoded as RDF.
String A character string
Time A time object represents a (local) time of day, independent of any particular day
Uri a complete URI
Uriorcurie a URI or a CURIE

Subsets

Subset Description