INSDC Host Organism Sequences

Data as a DwC-A file	Download 1,426,130 records in English (74 MB) - Update frequency: unknown
Metadata as an EML file	Download in English (6 KB)
Metadata as an RTF file	Download in English (7 KB)

Description

This dataset contains INSDC sequences associated with host organisms. The dataset is prepared periodically from the public ENA API (https://www.ebi.ac.uk/ena/portal/api/) using the methods described below.

EMBL-EBI also publishes other records in separate datasets (https://www.gbif.org/publisher/ada9d123-ddb4-467d-8891-806ea8d94230).

The data was processed as follows:

1. Human sequences were excluded.
2. For non-CONTIG records, the sample accession number (when available) along with the scientific name were used to identify sequence records corresponding to the same individuals (or group of organisms of the same species in the same sample). Only one record was kept for each scientific name/sample accession number.
3. Contigs and whole genome shotgun (WGS) records were added individually.
4. Records with missing information were excluded. Only records associated with a specimen voucher, or records containing both a location AND a date, were kept.
5. Records associated with the same vouchers were aggregated together.
6. Many of the remaining records corresponded to individual sequences or reads from the same organisms. In practice, these were "duplicate" occurrence records that were not filtered out in step 2 because the sample accession number was missing. To identify those potential duplicates, we grouped all the remaining records by `scientific_name`, `collection_date`, `location`, `country`, `identified_by`, `collected_by` and `sample_accession` (when available). Then we excluded the groups that contained more than 50 records. The rationale behind the choice of threshold is explained here: https://github.com/gbif/embl-adapter/issues/10#issuecomment-855757978
7. To improve the matching of the EBI scientific name to the GBIF backbone taxonomy, we incorporated the ENA taxonomic information. The kingdom, phylum, class, order, family, and genus were obtained from the ENA taxonomy checklist available here: http://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip

More information is available here: https://github.com/gbif/embl-adapter#readme

You can find the mapping used to format the EMBL data to Darwin Core Archive here: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md
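The filtering logic in the processing steps can be illustrated with a minimal Python sketch. It is not the embl-adapter implementation: it covers only the human exclusion, the per-sample deduplication, the completeness filter and the group-size threshold, and it ignores the CONTIG/WGS distinction and the voucher aggregation. The grouping field names come from the description above; the `specimen_voucher` key, the function names, and the use of the name "Homo sapiens" to detect human records are assumptions made for illustration.

```python
from collections import defaultdict

# Grouping fields listed in the description of the duplicate check.
GROUP_KEYS = ("scientific_name", "collection_date", "location", "country",
              "identified_by", "collected_by", "sample_accession")
MAX_GROUP_SIZE = 50  # groups with more records than this are excluded


def is_human(rec):
    # Assumption: human records are recognised by their scientific name.
    return rec.get("scientific_name") == "Homo sapiens"


def has_required_info(rec):
    # Keep records with a voucher, or with both a location and a date.
    # "specimen_voucher" is a hypothetical key name.
    return bool(rec.get("specimen_voucher")) or (
        bool(rec.get("location")) and bool(rec.get("collection_date")))


def dedupe_by_sample(records):
    # Keep one record per (scientific name, sample accession) pair.
    # Records without a sample accession pass through unchanged.
    seen = set()
    kept = []
    for rec in records:
        accession = rec.get("sample_accession")
        if accession:
            key = (rec.get("scientific_name"), accession)
            if key in seen:
                continue
            seen.add(key)
        kept.append(rec)
    return kept


def drop_large_groups(records, max_size=MAX_GROUP_SIZE):
    # Group by the listed fields and discard groups above the threshold.
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec.get(k) for k in GROUP_KEYS)].append(rec)
    return [rec for group in groups.values()
            if len(group) <= max_size for rec in group]


def process(records):
    records = [r for r in records if not is_human(r)]
    records = dedupe_by_sample(records)
    records = [r for r in records if has_required_info(r)]
    return drop_large_groups(records)
```

In this sketch each record is a plain dictionary, so `process(list_of_dicts)` returns the records that survive all four filters.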

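Data Records

The data in this occurrence (observations and specimens) resource has been published as a Darwin Core Archive (DwC-A), which is a standardized format for sharing biodiversity data as a set of one or more data tables. The core data table contains 1,426,130 records.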
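As a rough illustration of how a Darwin Core Archive can be read, the following Python sketch opens the zip file, looks up the core data file and its column terms in `meta.xml`, and yields each core row as a dictionary. It assumes a tab-delimited core file without field quoting, which is common for IPT-generated archives but should be checked against the `fieldsTerminatedBy` and `fieldsEnclosedBy` attributes of the actual archive. The function name `read_core_rows` is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by the Darwin Core text guidelines for meta.xml.
DWC_TEXT_NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def read_core_rows(archive):
    """Yield core-table rows as dicts keyed by short Darwin Core term names."""
    with zipfile.ZipFile(archive) as zf:
        meta = ET.fromstring(zf.read("meta.xml"))
        core = meta.find("dwc:core", DWC_TEXT_NS)
        location = core.find("dwc:files/dwc:location", DWC_TEXT_NS).text

        # Map column index -> name; terms are URIs, keep the last path segment.
        columns = {}
        id_el = core.find("dwc:id", DWC_TEXT_NS)
        if id_el is not None:
            columns[int(id_el.get("index"))] = "id"
        for field in core.findall("dwc:field", DWC_TEXT_NS):
            if field.get("index") is not None:
                term = field.get("term").rsplit("/", 1)[-1]
                columns[int(field.get("index"))] = term

        skip = int(core.get("ignoreHeaderLines", "0"))
        encoding = core.get("encoding", "UTF-8")
        with zf.open(location) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            # Assumption: tab-delimited, no quoting.
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            for i, row in enumerate(reader):
                if i < skip:
                    continue
                yield {name: row[idx]
                       for idx, name in columns.items() if idx < len(row)}
```

For example, `sum(1 for _ in read_core_rows("dwca.zip"))` would count the core records of a downloaded archive, where `dwca.zip` is a placeholder for the local file name.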

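This IPT archives the data and thus serves as the data repository. The data and resource metadata are available for download in the downloads section. The versions table lists other versions of the resource that have been made publicly available and allows tracking changes made to the resource over time.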

Versions

The table below shows only published versions of the resource that are publicly accessible.

Rights

Researchers should respect the following rights statement:

The publisher and rights holder of this work is European Nucleotide Archive (EMBL-EBI). This work is licensed under a Creative Commons Attribution (CC-BY 4.0) License.

GBIF Registration

This resource has been registered with GBIF and assigned the following GBIF UUID: 393b8c26-e4e0-4dd0-a218-93fc074ebf4e. European Nucleotide Archive (EMBL-EBI) publishes this resource and is itself registered in GBIF as a data publisher endorsed by National Biodiversity Network.

Keywords

Metadata

Contacts

European Bioinformatics Institute (EMBL-EBI)

Originator ●
Point of Contact

datasubs@ebi.ac.uk

http://www.ebi.ac.uk

GBIF Helpdesk

Metadata Provider

helpdesk@gbif.org

Geographic Coverage

Worldwide

Bounding Coordinates (latitude, longitude)	South West [-90, -180], North East [90, 180]

Additional Metadata

Alternative Identifiers	393b8c26-e4e0-4dd0-a218-93fc074ebf4e
	https://cloud.gbif.org/eca/resource?r=insdc-host-organism-sequences