If you use the series_key option in read_dataset(), you can download only a portion of the full dataflow. For large dataflows especially, this reduces the time it takes to download data from EconData. It is mainly worthwhile for automated analyses that will be run again and again. As the analyst, putting in the effort upfront to set up the extraction from EconData’s database will save time down the line when you want your script to run quickly.
The simplest way to get guidance on which argument to use is to go to the EconData web app, select the series you want in the “Select” tab, click “Submit”, then “Export”, and use the R code that the app generates automatically.
To partly address the problem of large datasets, we now tend to split them up into multiple dataflows, as with ASISA and the QB. However, this is a coarse solution: it does not go the last mile of delivering exactly the series you need.
series_key Syntax
The series key is composed of a few dimensions, separated by full stops (.). In the read_dataset(series_key = "...") option, you can list one or more concept codes within each dimension, separated by a + character. The function then downloads the union of all possible combinations of the chosen concept codes. For example:
read_dataset(id = "ELECTRICITY",
             tidy = TRUE, wide = FALSE, compact = FALSE,
             series_key = "ELE001+ELE002..S")
You can get the data structure for this dataset by running
read_registry("data-structure", id="ELECTRICITY")
(Note the dimensions table.)
In the above example, only the two mnemonics ELE001 and ELE002 are downloaded. We left the measure blank, as there are not multiple varieties of the measure within each of these mnemonics. We restricted the seasonal adjustment dimension to seasonally adjusted series (S), which excludes the non-seasonally adjusted “Physical volume of electricity production”, ELE002.I.N.
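The syntax described above can be sketched as a small helper that joins the codes within each dimension with + and then joins the dimensions with a full stop. This is only an illustration: build_series_key() is a hypothetical function of our own, not part of econdatar, and the list names (mnemonic, measure, seasonal_adjustment) are descriptive labels rather than the identifiers used in the data structure.

```r
# Hypothetical helper (not part of econdatar): build a series key from one
# character vector per dimension, given in the order of the data structure.
build_series_key <- function(dimensions) {
  # Join the codes inside each dimension with "+".
  # An empty vector becomes "", which leaves that dimension blank.
  per_dimension <- vapply(dimensions, paste, character(1), collapse = "+")
  # Join the dimensions with a full stop.
  paste(per_dimension, collapse = ".")
}

electricity_key <- build_series_key(list(
  mnemonic            = c("ELE001", "ELE002"),
  measure             = character(0),  # left blank, as in the example above
  seasonal_adjustment = "S"
))

print(electricity_key)
# [1] "ELE001+ELE002..S"
```

The resulting string can be passed as read_dataset(id = "ELECTRICITY", series_key = electricity_key).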
Speed Test
Please run the following code as a speed test.
packages <- c( "econdatar",
"dplyr",
"readr",
"tidyr",
"tibble" )
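# Attach each package by name; character.only = TRUE tells library() to treat the value as a string.
invisible(lapply(packages, library, character.only = TRUE))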
print(system.time(
ELECTRICITY_full <- read_dataset(id = "ELECTRICITY",
tidy = TRUE, wide = FALSE, compact = FALSE)
))
print(system.time(
ELECTRICITY_partial <- read_dataset(id = "ELECTRICITY",
tidy = TRUE, wide = FALSE, compact = FALSE,
series_key = paste0("ELE001+ELE002..S"))
))
print(system.time(
QB_NATLACC_full <- read_dataset(id = "QB_NATLACC",
tidy = TRUE, wide = FALSE, compact = FALSE)
))
print(system.time(
QB_NATLACC_partial <- read_dataset(id = "QB_NATLACC",
tidy = TRUE, wide = FALSE, compact = FALSE,
series_key = paste0(paste(paste0("KBP6",
# Quarterly mnemonics chosen below. Annual: J Y Z
c(paste0("006", c("C", "D", "K", "L", "N", "S" )),
paste0("007", c("C", "D", "K", "L", "N", "S" )),
paste0("009", c("C", "D", "K", "L", "N", "S" )),
paste0("010", c("C", "D", "K", "L" )),
paste0("012", c("C", "D", "K", "L", "N", "S" )),
paste0("019", c("C", "D", "K", "L" )),
paste0("045", c("C", "D", "K", "L" )),
paste0("050", c("C", "D", "K", "L" )),
paste0("055", c("C", "D", "K", "L" )),
paste0("061", c("C", "D", "K", "L" )),
paste0("109", c("C", "D", "K", "L" )),
paste0("110", c("C", "D", "K", "L" )),
paste0("114", c("C", "D", "K", "L" )),
paste0("180", "K"),
paste0("200", "K"),
paste0("203", "K"),
paste0("246", c( "K", "L", "N", "S" )),
paste0("465", "K"),
paste0("634", c( "D", "L" )),
paste0("638", c( "D", "L" ))
)), collapse="+"), ".Q..."))
))
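The nested paste0() calls in the last request are compact but hard to audit. As a sketch of one alternative (our own suggestion, not taken from the econdatar documentation), the same kind of key can be built from a named list that maps each numeric code to its suffixes. Only three of the codes are shown here, and the names suffixes_by_code, mnemonics and qb_key are our own.

```r
# Map each numeric code to the suffixes wanted for it.
# Only three of the codes from the speed test are shown, for brevity.
suffixes_by_code <- list(
  "006" = c("C", "D", "K", "L", "N", "S"),
  "180" = "K",
  "634" = c("D", "L")
)

# Build the full mnemonics, e.g. "KBP6" + "006" + "C" gives "KBP6006C".
mnemonics <- unlist(
  Map(function(code, suffixes) paste0("KBP6", code, suffixes),
      names(suffixes_by_code), suffixes_by_code),
  use.names = FALSE
)

# Join the mnemonics with "+", then append the remaining dimensions
# exactly as in the speed test (".Q...").
qb_key <- paste0(paste(mnemonics, collapse = "+"), ".Q...")

print(qb_key)
```

Adding or removing a series then means editing one entry in the list, and qb_key can be passed as the series_key argument.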
Here are three test results (from three different computers), showing the number of seconds each download took:
Tested download speed | Electricity (Full) | Electricity (Partial) | QB Nat’l Acc (Full) | QB Nat’l Acc (Partial) |
---|---|---|---|---|
70 Mbps | 1.074 | 0.449 | 6.646 | 2.239 |
32 Mbps | 2.273 | 0.571 | 6.579 | 2.453 |
26 Mbps | 6.98 | 4.13 | 13.83 | 7.91 |
In every test, selecting a subset of series reduced the download time: the partial downloads were roughly 1.7 to 4 times faster than the full ones. Downloading fewer series also reduces the amount of working memory (RAM) that your R session takes up.
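If you want to check the memory saving yourself, a minimal sketch along the following lines may help. It uses object.size() from base R's utils package. The function size_summary() is a hypothetical helper of our own, and the two data frames are small stand-ins so that the code runs without a connection to EconData.

```r
# Hypothetical helper: report the number of rows and the in-memory size of an object.
size_summary <- function(x) {
  data.frame(
    rows      = NROW(x),
    megabytes = as.numeric(object.size(x)) / 1024^2
  )
}

# Stand-in data (not real EconData output): a "full" table with four series
# and a "partial" table that keeps only two of them.
full_example <- data.frame(
  series = rep(c("A", "B", "C", "D"), each = 250),
  value  = seq_len(1000)
)
partial_example <- full_example[full_example$series %in% c("A", "B"), ]

print(rbind(
  full    = size_summary(full_example),
  partial = size_summary(partial_example)
))
```

With the objects from the speed test, the equivalent calls would be size_summary(ELECTRICITY_full) and size_summary(ELECTRICITY_partial).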
Compiled by Aidan Horn