INSDC Host Organism Sequences

Occurrence
Latest version published by European Nucleotide Archive (EMBL-EBI) on Apr 26, 2025
Publication date:
26 April 2025
License:
CC-BY 4.0

Download the latest version of the resource data as a Darwin Core Archive (DwC-A), or the resource metadata as EML or RTF:

Data as a DwC-A file: download 1,426,130 records in English (74 MB) - Update frequency: unknown
Metadata as an EML file: download in English (6 KB)
Metadata as an RTF file: download in English (7 KB)

Description

This dataset contains INSDC sequences associated with host organisms. It is prepared periodically from the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) using the methods described below.

EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

The data were processed as follows:

1. Human sequences were excluded.

2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or group of organisms of the same species in the same sample). Only one record was kept for each combination of scientific name and sample accession number.

3. Contigs and whole genome shotgun (WGS) records were added individually.

4. Records missing key information were excluded: only records associated with a specimen voucher, or records containing both a location AND a date, were kept.

5. Records associated with the same voucher were aggregated.

6. Many of the remaining records were individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that were not filtered out in step 2 because the sample accession was missing. To identify these potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available), then excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
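As a rough illustration of steps 4 and 6 above, the filtering logic could be sketched in Python as follows. This is not the embl-adapter code: the record layout (plain dictionaries), the `specimen_voucher` key, and the function names are assumptions made for the example, and the voucher aggregation of step 5 is left out.

```python
from collections import defaultdict

# Grouping keys as listed in step 6. Representing records as plain dicts
# and the "specimen_voucher" key are assumptions for illustration only.
GROUP_KEYS = (
    "scientific_name", "collection_date", "location", "country",
    "identified_by", "collected_by", "sample_accession",
)
MAX_GROUP_SIZE = 50  # threshold stated in step 6


def has_required_info(record):
    """Step 4: keep records with a voucher, or with both location and date."""
    if record.get("specimen_voucher"):
        return True
    return bool(record.get("location")) and bool(record.get("collection_date"))


def drop_large_groups(records, max_size=MAX_GROUP_SIZE):
    """Step 6: group by GROUP_KEYS and discard groups larger than max_size."""
    groups = defaultdict(list)
    for record in records:
        # Missing values (e.g. no sample accession) are treated as empty.
        key = tuple(record.get(k) or "" for k in GROUP_KEYS)
        groups[key].append(record)
    kept = []
    for members in groups.values():
        if len(members) <= max_size:
            kept.extend(members)
    return kept


def filter_records(records):
    """Apply the step 4 filter, then the step 6 group-size filter."""
    complete = [r for r in records if has_required_info(r)]
    return drop_large_groups(complete)
```

With this sketch, a group of exactly 50 matching records is kept and a group of 51 is dropped, which follows the "more than 50" wording of step 6.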

More information available here: https://github.com/gbif/embl-adapter#readme

You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md

Data Records

The data in this occurrence resource has been published as a Darwin Core Archive (DwC-A), which is a standardized format for sharing biodiversity data as a set of one or more data tables. The core data table contains 1,426,130 records.
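Since a DwC-A is a zip file holding the data tables, the core table can be inspected with the Python standard library alone. The sketch below assumes the core table is a tab-delimited, UTF-8 file named `occurrence.txt` with a header row; the archive's own `meta.xml` states the actual file name and delimiter, so treat these defaults as assumptions. The function name `count_core_records` is hypothetical.

```python
import csv
import io
import zipfile


def count_core_records(archive, core_name="occurrence.txt"):
    """Return (header, row_count) for the core table of a DwC-A.

    `archive` is a path or a binary file-like object. `core_name` and the
    tab delimiter are assumptions; check meta.xml in the archive.
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(core_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            header = next(reader)  # first line holds the column names
            return header, sum(1 for _ in reader)
```

Streaming the rows through a generator keeps memory use low, which matters for a core table of this size.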

This IPT archives the data and thus serves as the data repository. The data and resource metadata are available for download in the downloads section. The versions table lists other versions of the resource that have been made publicly available and allows tracking changes made to the resource over time.

Versions

The table below shows only published versions of the resource that are publicly accessible.

Rights

Researchers should respect the following rights statement:

The publisher and rights holder of this work is European Nucleotide Archive (EMBL-EBI). This work is licensed under a Creative Commons Attribution (CC-BY 4.0) License.

GBIF Registration

This resource has been registered with GBIF and assigned the following GBIF UUID: 393b8c26-e4e0-4dd0-a218-93fc074ebf4e. European Nucleotide Archive (EMBL-EBI) publishes this resource, and is itself registered in GBIF as a data publisher endorsed by National Biodiversity Network.

Keywords

Metadata

Contacts

European Bioinformatics Institute (EMBL-EBI)
  • Originator
  • Point of Contact
GBIF Helpdesk
  • Metadata Provider

Geographic Coverage

Worldwide

Bounding coordinates: South West [-90, -180], North East [90, 180]

Additional Metadata

Alternative identifiers: 393b8c26-e4e0-4dd0-a218-93fc074ebf4e
https://cloud.gbif.org/eca/resource?r=insdc-host-organism-sequences