Rmd source

Dane

Eurostat database explained

Source data is available in https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing/

There three directories:

data dict metadata

Approximately 6300 datasets and tables in two formats tsv/sdmx in most files rows are time series first row contains header

Tables: two/three-dimensional ‘tables by themes’. Most important data. secondary data

Database: multidimensional tables ‘database by theme’

Tables/Database organized into (multinomial) taxonomy with 9 top nodes:

General and regional; Economy and finance; Population and social conditions; Industry, trade and services; Agriculture, forestry and fisheries; International trade; Transport; Environment and energy; Science technology, digital society

First cell: coma-separated sequence of codes (data dimensions used for identification) followed by / and seqcode

For each of codes there is a file in the dic directory Seqcode denotes the dimension of the sequence of values (time for time-series, geo for geographicsl series)

Complete list of code files (dic files) can be found in dim1st.dic

Time dimension: YYYmNN, where 00 yearly, SS semi-annual, QQ – quarterly, MM – onthly

Columns 1st line (except 1st): coma-separated sequence of codes

Flags are attached to values separated by a blank (space) If there is no flag the value is followed by a blank (in effect the value cell is composed de facto of two entities: value and flag (usually empty).)

Decimal symbols used: dot

Toc files: https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=table_of_contents_en.pdf

Contains table/database id (called code), title, link to data in tsv/zip format

Example (tourism)

Reference metadata including Euro-SDMX metadata structure https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing/metadata https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/

Table of contents in three languages and three formats (xml/txt/pdf)

Programmatical download is possible with JSON/SDMX directly or through specialized packages (ie R)

+Regional statistics by NUTS classification (reg)
+-+ Regional demographic statistics (reg_dem)
  +-+ Population and area (reg_dempoar)

Population on 1 January by age, sex and NUTS 2 region (demo_r_d2jan) https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_r_d2jan&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/demo_r_d2jan.tsv.gz

Area by NUTS 3 region (demo_r_d3area) https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_r_d3area&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/demo_r_d3area.tsv.gz

Population density by NUTS 3 region (demo_r_d3dens) https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_r_d3dens&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/demo_r_d3dens.tsv.gz

Population on 1 January by age group, sex and NUTS 2 region (demo_r_pjangroup) https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_r_pjangroup&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/demo_r_pjangroup.tsv.gz

Population on 1 January by age group, sex and NUTS 3 region (demo_r_pjangrp3) https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_r_pjanind3&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/demo_r_pjanind3.tsv.gz

+Regional statistics by NUTS classification (reg)
+-+ Regional demographic statistics (reg_dem)
  +-+ Fertility (reg_demfer)

Fertility rates by age and NUTS 2 region[demo_r_frate2] Last update: 28-02-2020 https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=demo_r_frate2&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/demo_r_frate2.tsv.gz

+Regional statistics by NUTS classification (reg)
+-+ Regional tourism statistics (reg_tour)
  +-+ Occupancy in collective accommodation establishments: domestic and inbound tourism (reg_tour_occ)

Nights spent at tourist accommodation establishments by NUTS 2 regions[tour_occ_nin2] Last update: 24-02-2020 https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=tour_occ_nin2&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tour_occ_nin2.tsv.gz

+Regional statistics by NUTS classification (reg)
+-+ Regional labour market statistics (reg_lmk)
  +-+ Regional unemployment - LFS annual series (lfst_r_lfu)

Unemployment by sex, age and NUTS 2 regions (1 000) (lfst_r_lfu3pers) https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=lfst_r_lfu3pers&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/lfst_r_lfu3pers.tsv.gz

+Environment and energy 
+-+ Environment (env)   
  +-+ Emissions of greenhouse gases and air pollutants (env_air)    
    +-+ Air emissions accounts (env_air_aa) 

Air emissions accounts by NACE Rev. 2 activity (env_ac_ainah_r2) https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=env_ac_ainah_r2&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/env_ac_ainah_r2.tsv.gz

Greenhouse gas emissions by source sector (source: EEA) (env_air_gge)
https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=env_air_gge&lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/env_air_gge.tsv.gz

+-+ Energy (t_nrg)  
  +-+Sustainable Development indicators Goal 7 - Affordable and clean energy (t_nrg_sdg_07) 

Primary energy consumption (sdg_07_10)
https://ec.europa.eu/eurostat/databrowser/view/sdg_07_10/default/table?lang=en https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sdg_07_10.tsv.gz

BDL and BDM databases

BDL is an abbreviation from bank danych lokalnych (local data bank)

https://bdl.stat.gov.pl/BDL/start

BDM is bank danych makroekonomicznych (macroeconomic data bank)

https://bdm.stat.gov.pl/

Platforma Analityczna SWAiD – Dziedzinowe Bazy Wiedzy

https://stat.gov.pl/banki-i-bazy-danych/platforma-analityczna-swaid-dziedzinowe-bazy-wiedzy/

WHO database

Punkty startu: https://www.who.int/data/gho/indicator-metadata-registry

https://www.who.int/data/gho/data/themes

API https://www.odata.org/documentation/odata-version-2-0/uri-conventions/

Baza jest nieciekwa. Teoretycznie dużo wskaźników rzadko który kompletny i/lub aktualny. Dodatkowo infrastruktura udostępniająca dane jest nieadekwatna do zadania, czas oczekiwania na realizację zapytań długi wyniki niekompletne to norma.

Dobrze jest oddzielić dane od prezentacji BTW. Prezentacja są zasobożerne a udostępnianie danych nie. Zostawmy oglądanie obrazków pasjonatom ale dane udostępniajmy profesjonalnie a nie jakoś…

DB.nomics metadatabase

https://db.nomics.world/

Kaggle

https://towardsdatascience.com/increasing-kaggle-revenue-analyzing-user-data-to-recommend-the-best-new-product-f93fddbb4e0f https://www.quora.com/How-does-Kaggle-make-money

Two primary revenue streams: hosting competitions and our jobs board

We have three types of competitions:

Featured competitions: companies use these to raise their profile in the data science community, get exposure to new approaches or get difficult problems solved.

Recruiting competitions: companies use these to attract and identify talent that they may not have had exposure to otherwise.

Research competitions: researchers use it as an alternative to collaborating with a data scientist. Most useful for well specified but very challenging problems.

Narzędzia

Arkusz kalkulacyjny

https://pl.libreoffice.org/

google spreadheet

Pakiet statystyczny

Różne: STATA, SPSS, SAS

github