Accelerating Data Analysis: Embracing the Efficient Query and Aggregation Calculations
Efficient query and calculation capabilities play a pivotal role in handling large-scale datasets, and ensuring timely and accurate data analysis. In this article, we will explore how DolphinDB’s query and advanced calculation features significantly enhance data processing and analysis workflows through practical code examples👇
In this case, we choose TSDB engine as the DolphinDB storage engine. We use a server with 16 CPU cores, 512 GB of memory, and 4 SSDs. The read speed of each disk is about 400 MB/s.
The data we use is the US Stock Exchange level 1 quotes of 4 years. The original data size is around 12.75 TB. After compression, the disk space occupied by DolphinDB is 2.12 TB.
All these data are stored in one table in DolphinDB, and there are almost 270 billion records.
login(`admin, `123456)
pnodeRun(clearAllCache)
quotes = loadTable("dfs://TAQ", "quotes")
First, let’s do the query.
We can get the data of Lehman Brothers at 3:59:59 PM, August 21, 2007 through the code below. It turns out that DolphinDB takes only 7.9 ms to extract one second’s data of a stock from 270 billion (12.75 TB) records.
select count(*) from quotes
timer x=select symbol, time, bid, ofr from quotes where symbol='LEH', date=2007.08.21 and time=15:59:59
The query result is assigned to the variable x, and the content of x can be shown in the data browser.
Next, let’s calculate the average bid-ask spread of Lehman Brothers in each minute on Aug 31, 2007.
The bid-ask spread is defined as the ask price minus the bid price, and then divided by the average ask price and the bid price, and we also add some filtering conditions to ensure the validity of the data.
timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.08.31, symbol=`LEH, time between 09:30:00 : 15:59:59, ofr>bid, ofr>0, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
Run the script, and we can see that execution takes 89 milliseconds.
The calculation result is plotted in this graph.
Then we’ll increase the data volume, remove the stock code limitation, and calculate the bid-and-ask price difference per minute.
timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.09.24, time between 09:30:00 : 15:59:59, ofr>bid, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
The data of all stocks in the whole market for a whole day is about 10 GB. Run the script, and it takes 2.1 seconds.
For a well-designed system, the time it takes to finish a query should be linearly related to the amount of days the query consumes.
So this time, we performs the same calculation on 3 days.
timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date between 2007.09.25 : 2007.09.27, time between 09:30:00 : 15:59:59, ofr>bid, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
It takes 5.47 seconds, about 3 times as long as the previous query, which means that the time consumption is directly proportional to the amount of data.
In a word, performance of DolphinDB is pretty good for both querying a small amount of data and aggregating calculations with large amounts of data.
The complete script is displayed as follows 👇👀
login(`admin, `123456)
pnodeRun(clearAllCache)
quotes = loadTable("dfs://TAQ", "quotes")
select count(*) from quotes
timer x=select symbol, time, bid, ofr from quotes where symbol='LEH', date=2007.08.21 and time=15:59:59
timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.08.31, symbol=`LEH, time between 09:30:00 : 15:59:59, ofr>bid, ofr>0, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.09.24, time between 09:30:00 : 15:59:59, ofr>bid, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date between 2007.09.25 : 2007.09.27, time between 09:30:00 : 15:59:59, ofr>bid, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")