How to Downsample Your Data Efficiently
Tired of spending countless minutes downsampling your data? Look no further!
In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million minute-level records in only 41 seconds with DolphinDB.
The basic configuration of the DolphinDB server is:
16 CPU cores
256 GB memory
4 SSDs
A DolphinDB cluster with 4 data nodes is deployed, and each node uses an SSD.
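For reference, such a deployment is typically declared in the controller’s cluster.nodes file. The sketch below assumes all nodes run on one machine; the IP, ports, and aliases are made-up placeholders:
localSite,mode
10.0.0.1:8900:controller1,controller
10.0.0.1:8901:agent1,agent
10.0.0.1:8911:node1,datanode
10.0.0.1:8912:node2,datanode
10.0.0.1:8913:node3,datanode
10.0.0.1:8914:node4,datanode
Each data node can then be pointed at its own SSD in cluster.cfg, e.g. node1.volumes=/ssd1/dolphindb (the path is a placeholder).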
The data we use is:
level 1 quotes from the New York Stock Exchange for August 2007
around 272 GB, with 6.48 billion records
Downsampling can be performed with a SQL statement in DolphinDB.
As the SQL query may involve multiple partitions, DolphinDB breaks down the job into several tasks and assigns the tasks to the corresponding data nodes for parallel execution. When all the tasks are completed, the system merges the intermediate results from the nodes to return the final result.
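Before running the query at scale, the core aggregation can be tried on a small in-memory table. A minimal sketch (the symbol, timestamps, and quote values below are made up for illustration):
// a toy quotes table with second-precision timestamps
t = table(take(`EBAY, 6) as symbol, take(2007.08.01, 6) as date, 09:30:05 09:30:40 09:31:10 09:31:50 09:32:20 09:32:45 as time, 10.1 10.2 10.3 10.4 10.5 10.6 as bid, 10.2 10.3 10.4 10.5 10.6 10.7 as ofr)
// averaging within each minute yields one row per symbol, date, and minute
select avg(bid) as bid, avg(ofr) as ofr from t group by symbol, date, bar(time, 60) as minute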
The full script is as follows:
// load the distributed database and the quotes table
db = database("dfs://TAQ")
quotes = db.loadTable("quotes")
// check the number of raw records in August 2007
select count(*) from quotes where date between 2007.08.01 : 2007.08.31
// take one row as a template for the schema of the result table
model = select top 1 symbol, date, bar(time, 60) as minute, bid, ofr from quotes where date=2007.08.01, symbol=`EBAY
// recreate the result table, partitioned by date and symbol
if(existsTable("dfs://TAQ", "quotes_minute_sql"))
    db.dropTable("quotes_minute_sql")
db.createPartitionedTable(model, "quotes_minute_sql", `date`symbol)
// downsample to 1-minute bars and time the whole job
timer{
    minuteQuotes = select avg(bid) as bid, avg(ofr) as ofr from quotes where date between 2007.08.01 : 2007.08.31 group by symbol, date, bar(time, 60) as minute
    loadTable("dfs://TAQ", "quotes_minute_sql").append!(minuteQuotes)
}
// verify the number of downsampled records
select count(*) from loadTable("dfs://TAQ", "quotes_minute_sql")
The downsampling frequency can be adjusted as needed simply by modifying bar(time, 60). Here 60 means the data is downsampled to 1-minute intervals, as the timestamp values have second precision.
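For example, to downsample to 5-minute intervals instead, set the bar size to 300 seconds (the variable name fiveMinQuotes below is just for illustration):
fiveMinQuotes = select avg(bid) as bid, avg(ofr) as ofr from quotes where date between 2007.08.01 : 2007.08.31 group by symbol, date, bar(time, 300) as minute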
The table “quotes_minute_sql” is created with createPartitionedTable, and the downsampled result is appended to it with append!.
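To spot-check the result, we can query a few rows for a single symbol (EBAY is used here only as an example):
// inspect the first few 1-minute bars of one symbol on one day
select top 5 * from loadTable("dfs://TAQ", "quotes_minute_sql") where date=2007.08.01, symbol=`EBAY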
We can execute the script and open the web-based cluster manager to check resource usage. It shows that all CPU cores participate in the downsampling: on each data node, 15 tasks run concurrently while data is read from disk.
When we come back to VS Code and check the execution status, we find that the downsampling takes only 41 seconds and generates 61 million minute-level records.
DolphinDB exhibits outstanding performance in data downsampling due to the following reasons:
Jobs are executed in a distributed manner, so the resources of multiple nodes are utilized at the same time;
Compression reduces the disk I/O;
Columnar storage and vectorized computation improve the efficiency of aggregation.
To learn the detailed operations of data downsampling, take a look at this demo!