•   over 4 years ago

Grouping by date with BigSql

Hi,

We're working with MTA transit data, which is grouped in four hour chunks. I want to write a query that will group the data by day, along with by a particular station. In SQL I would write something like this:

SELECT * FROM data WHERE day > '2014-01-01 00:00:00' AND day < '2014-01-02 00:00:00' AND station='xyz' GROUP BY day, station;

This query is very slow (60 seconds or more). This article specifically mentions difficulties working with dates in Big SQL:

http://www.ibm.com/developerworks/library/bd-gobigsql/

Specifically Listings 7 and 8. However, I cannot figure out how to apply the insights in this article to my query. Can anyone help me out?

Thanks,
Kevin

  • 1 comment

  • Manager   •   over 4 years ago

    Hi Kevin Scott,

    Thanks for your question! I've passed it along to the IBM team and they had a few follow-up questions for you.

    1) Is this Big SQL v1 or Big SQL 3? The article that is mentioned in your question is relevant to Big SQL v1. Also, the query you presented is not doing date manipulation functions like in the article. Lastly, the query as shown below is not really valid SQL since you are grouping with no aggregate on the grouping keys. Can you provide the actual query that you are running?

    SELECT * FROM data WHERE day > '2014-01-01 00:00:00' AND day < '2014-01-02 00:00:00' AND station='xyz' GROUP BY day, station;

    2) How much data are you scanning and how big is your cluster? Unless the data is partitioned by station and/or date, you will always be scanning all data, and if there is a large amount of data, it may just be the nature of the beast that the query takes time. How long does the query take if you change it to:

    SELECT day, station, count(*) FROM data WHERE day > '2014-01-01 00:00:00' AND day < '2014-01-02 00:00:00' AND station='xyz' GROUP BY day, station;

    Thank you!
    Serena

Comments are closed.