INSDC Host Organism Sequences

Occurrence
Latest version published by European Nucleotide Archive (EMBL-EBI) on April 26, 2025
Publication date:
April 26, 2025
License:
CC-BY 4.0

Download the latest version of this resource as a Darwin Core Archive (DwC-A), or the resource metadata as an EML or RTF file.

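Data as a DwC-A file: download 1,426,130 records in English (74 MB) - Update frequency: may be updated, but it is not known when
Metadata as an EML file: download in English (6 KB)
Metadata as an RTF file: download in English (7 KB)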

Description

This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically from the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) using the methods described below.
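As a rough illustration of how a request to the ENA API mentioned above might be assembled, the sketch below builds a URL for the `search` endpoint. It only constructs the URL and does not send a request. The helper name `build_search_url`, the example query string, and the choice of fields are assumptions made for illustration; they are not the query that the dataset pipeline actually uses.

```python
from urllib.parse import urlencode

# Base URL as given in the dataset description.
ENA_PORTAL_API = "https://www.ebi.ac.uk/ena/portal/api/"


def build_search_url(query, fields, result="sequence", limit=10):
    """Build a URL for the ENA Portal API search endpoint.

    `query` and `fields` are supplied by the caller; the values used in
    the example below are placeholders, not the pipeline's real query.
    """
    params = {
        "result": result,            # which ENA data set to search
        "query": query,              # filter expression
        "fields": ",".join(fields),  # columns to return
        "format": "tsv",             # tab-separated output
        "limit": limit,              # cap on the number of rows returned
    }
    return ENA_PORTAL_API + "search?" + urlencode(params)


# Hypothetical example: field names are taken from the processing
# description, the query itself is an assumed placeholder.
url = build_search_url(
    query='country="Norway"',
    fields=["scientific_name", "collection_date", "country", "sample_accession"],
)
print(url)
```

The actual queries and field lists used to build the dataset are documented in the embl-adapter repository.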

EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

The data was processed as follows:

1. Human sequences were excluded.

2. For non-CONTIG records, the sample accession number (when available) and the scientific name were used to identify sequence records corresponding to the same individual (or to a group of organisms of the same species in the same sample). Only one record was kept for each combination of scientific name and sample accession number.

3. Contigs and whole genome shotgun (WGS) records were added individually.

4. The records that were missing some information were excluded. Only records associated with a specimen voucher or records containing both a location AND a date were kept.

5. The records associated with the same voucher were aggregated.

6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that were not filtered out in step 2 because the sample accession number was missing. To identify these potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). We then excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978

7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip
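The filtering logic of steps 2, 4 and 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: records are assumed to be plain dictionaries, the key `specimen_voucher` is a hypothetical field name, and the sketch ignores the CONTIG/WGS distinction (step 3) and the voucher aggregation (step 5).

```python
from collections import defaultdict

MAX_GROUP_SIZE = 50  # threshold stated in step 6

# Grouping fields listed in step 6.
GROUP_KEYS = (
    "scientific_name", "collection_date", "location", "country",
    "identified_by", "collected_by", "sample_accession",
)


def dedupe_by_sample(records):
    """Step 2: keep one record per (scientific name, sample accession).

    Records without a sample accession cannot be matched and are all kept.
    """
    seen = set()
    kept = []
    for rec in records:
        sample = rec.get("sample_accession")
        if sample:
            key = (rec.get("scientific_name"), sample)
            if key in seen:
                continue
            seen.add(key)
        kept.append(rec)
    return kept


def is_complete(rec):
    """Step 4: keep records with a voucher, or with both location and date.

    `specimen_voucher` is an assumed field name for this sketch.
    """
    has_voucher = bool(rec.get("specimen_voucher"))
    has_place_and_date = bool(rec.get("location")) and bool(rec.get("collection_date"))
    return has_voucher or has_place_and_date


def drop_large_groups(records, max_size=MAX_GROUP_SIZE):
    """Step 6: group by the descriptive fields and drop oversized groups."""
    groups = defaultdict(list)
    for rec in records:
        # Missing values are treated as empty strings for grouping.
        key = tuple(rec.get(k) or "" for k in GROUP_KEYS)
        groups[key].append(rec)
    return [rec for grp in groups.values() if len(grp) <= max_size for rec in grp]


def filter_records(records):
    """Apply the three sketched steps in order."""
    deduped = dedupe_by_sample(records)
    complete = [rec for rec in deduped if is_complete(rec)]
    return drop_large_groups(complete)
```

Because records without a sample accession pass through `dedupe_by_sample` untouched, the group-size check is what removes them when they pile up under identical descriptive fields, which mirrors the reasoning given in step 6.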

More information is available here: https://github.com/gbif/embl-adapter#readme

You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md

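Data Records

The data in this occurrence resource has been published as a Darwin Core Archive (DwC-A), a standard format for sharing biodiversity data as a set of one or more data tables. The core data table contains 1,426,130 records.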
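For readers who want to inspect a downloaded archive, the following sketch counts the rows of the core data table using only the Python standard library. It assumes the usual DwC-A layout: a zip file containing a `meta.xml` descriptor that names the core file, with the core file being tab-delimited UTF-8 text with a single header row. The function names are illustrative, and the archive's own `meta.xml` remains the authority on delimiters and header lines.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespace of the Darwin Core text descriptor (meta.xml).
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def core_file_name(archive):
    """Return the core data file location declared in meta.xml."""
    with archive.open("meta.xml") as fh:
        root = ET.parse(fh).getroot()
    location = root.find(
        f"{DWC_TEXT_NS}core/{DWC_TEXT_NS}files/{DWC_TEXT_NS}location"
    )
    return location.text.strip()


def count_core_records(dwca):
    """Count data rows in the core table of a DwC-A (path or file object).

    Assumes a tab-delimited core file with exactly one header line.
    """
    with zipfile.ZipFile(dwca) as archive:
        name = core_file_name(archive)
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            next(reader, None)  # skip the assumed header row
            return sum(1 for _ in reader)
```

Called on the downloaded archive file, `count_core_records` should return a figure comparable to the record count stated above, provided the layout assumptions hold.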

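This IPT archives the data and thus serves as the data repository. The data and resource metadata are available for download in the Downloads section. The Versions table lists other publicly available versions of the resource, which makes it possible to track changes to the resource over time.

Versions

The table below shows only published versions of the resource that are publicly accessible.

Rights

Researchers should respect the following rights statement:

The publisher and rights holder of this work is European Nucleotide Archive (EMBL-EBI). This work is licensed under a Creative Commons Attribution (CC-BY 4.0) License.

GBIF Registration

This resource has been registered with GBIF and assigned the following GBIF UUID: 393b8c26-e4e0-4dd0-a218-93fc074ebf4e. European Nucleotide Archive (EMBL-EBI) publishes this resource and is itself registered in GBIF as a data publisher endorsed by National Biodiversity Network.

Keywords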

Metadata

Contacts

European Bioinformatics Institute (EMBL-EBI)
GBIF Helpdesk
  • Metadata Provider

Geographic Coverage

Worldwide

Bounding Coordinates: South West [-90, -180], North East [90, 180]

Additional Metadata

Alternative Identifiers 393b8c26-e4e0-4dd0-a218-93fc074ebf4e
https://cloud.gbif.org/eca/resource?r=insdc-host-organism-sequences