> Hi Ravindra,
> Thanks for your quick response. Please see my answers below.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> What if the order by column is not the first column? It needs to scan all
> blocklets to get the data out of it if the order by column is not first
> column of mdk
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Answer: if step 2 doesn't filter out any blocklets, you are right: it
> needs to scan all blocklets to get the data out if the order by column is
> not the first column of the MDK.
> But it only scans the order by column's data; for the other
> columns' data it uses a lazy-load strategy, which can reduce the scan
> according to the limit value.
> Hence you can see the performance is much better
> after my optimization. Currently CarbonData's order by + limit
> performance is very poor since it scans all the data.
> In my test with 20,000,000 rows it takes more than
> 10s; if the data is much larger, I think it is hard for users to tolerate
> such bad performance when they do an order by + limit query.
>
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> We used to have multiple push-down optimizations from Spark to Carbon,
> such as aggregation, limit, topN, etc. But they were later removed because
> they are very hard to maintain from version to version. I feel it is
> better that an execution engine like Spark does these types of operations.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Answer: In my opinion, I don't think "hard to maintain from version to
> version" is a good reason to give up the order by + limit optimization.
> A new class could be created that extends the current one, reducing the
> impact on the existing code; that might make it easy to maintain.
> Maybe I am wrong.
>
>
>
>
> At 2017-03-29 02:21:58, "Ravindra Pesala" <[hidden email]> wrote:
>
>
> Hi Jarck Ma,
>
> It is great to see efforts to optimize CarbonData.
> I think this solution comes with many limitations. What if the order by
> column is not the first column? It needs to scan all blocklets to get the
> data out if the order by column is not the first column of the MDK.
>
> We used to have multiple push-down optimizations from Spark to Carbon,
> such as aggregation, limit, topN, etc. But they were later removed because
> they are very hard to maintain from version to version. I feel it is
> better that an execution engine like Spark does these types of operations.
>
>
> Regards,
> Ravindra.
>
>
>
> On Tue, Mar 28, 2017, 14:28 马云 <[hidden email]> wrote:
>
>
> Hi Carbon Dev,
>
> I have currently implemented an optimization for ordering by one
> dimension.
>
> My local performance test results are below. Please give your suggestions.
>
>
>
>
> | data count | test sql | limit value in sql | optimized code (ms) | original code (ms) |
> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY country limit 1000 | 1000 | 677 | 10906 |
> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY serialname limit 10000 | 10000 | 1897 | 12108 |
> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY serialname limit 50000 | 50000 | 2814 | 14279 |
>
> My optimization solution for order by one dimension + limit is as follows:
>
> mainly, it filters out unnecessary blocklets and leverages the fact that a
> dimension's data is stored in sorted order to get sorted data within each
> partition;
>
> finally, TakeOrderedAndProject is used to merge the sorted data from the
> partitions.
>
> step 1. Change the logical plan to push the order by and limit
> information down to the carbon scan,
>
> and change the Sort physical plan to TakeOrderedAndProject, since the
> data will be fetched and sorted within each partition.
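>
> To make step 1 concrete, here is a minimal, self-contained sketch of the
> rewrite idea. The types are toy stand-ins, not the real Spark or Carbon
> classes (the pushed-down fields on Scan and the TakeOrdered node are
> hypothetical):
>
>     // Toy plan model; the real rule would pattern-match Spark's
>     // Sort/Limit logical plans over a carbon relation.
>     sealed trait Plan
>     case class Scan(table: String,
>                     orderBy: Option[String] = None, // pushed-down sort column
>                     limit: Option[Int] = None) extends Plan
>     case class Sort(col: String, child: Plan) extends Plan
>     case class Limit(n: Int, child: Plan) extends Plan
>     // stands in for Spark's TakeOrderedAndProject physical operator
>     case class TakeOrdered(col: String, n: Int, child: Plan) extends Plan
>
>     def rewrite(plan: Plan): Plan = plan match {
>       // ORDER BY one dimension + LIMIT directly over a scan: push both
>       // down, and merge partitions with TakeOrdered instead of a full Sort
>       case Limit(n, Sort(col, scan: Scan)) =>
>         TakeOrdered(col, n, scan.copy(orderBy = Some(col), limit = Some(n)))
>       case other => other
>     }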
>
> step 2. In each partition, apply the limit value and the blocklets'
> min/max indexes to filter out blocklets.
>
> This reduces the data scanned whenever some blocklets are filtered out.
>
> For example: SELECT name, serialname, country, salary, id, date
> FROM t3 ORDER BY serialname limit 10000
>
> Suppose there are 2 blocklets, each with 32,000 rows; serialname is
> between serialname1 and serialname2 in the first blocklet,
>
> and between serialname2 and serialname3 in the second blocklet. Actually
> we only need to scan the first blocklet,
>
> since 32000 > 10000 (the limit) and the first blocklet's serialname
> values <= the second blocklet's serialname values.
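>
> A rough sketch of this pruning rule under the example's assumptions (the
> Blocklet type carrying the min/max index of the order by column is
> hypothetical, not the actual carbon-core class):
>
>     case class Blocklet(min: String, max: String, rowCount: Int)
>
>     // A blocklet can be skipped when the blocklets that sort entirely
>     // before it (their max <= its min) already cover `limit` rows: the
>     // top-N values can then never reach it.
>     def prune(blocklets: Seq[Blocklet], limit: Int): Seq[Blocklet] =
>       blocklets.filter { b =>
>         val rowsBefore = blocklets
>           .filter(o => !(o eq b) && o.max.compareTo(b.min) <= 0)
>           .map(_.rowCount).sum
>         rowsBefore < limit
>       }
>
> For the example above, prune(Seq(Blocklet("serialname1", "serialname2",
> 32000), Blocklet("serialname2", "serialname3", 32000)), 10000) keeps only
> the first blocklet.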
>
>
>
> step 3. Load the order by dimension's data into scanResults, and put all
> scanResults into a TreeSet for sorting.
>
> The other columns' data will be lazy-loaded in step 4.
>
> step 4. According to the limit value, use an iterator to get the top N
> sorted rows from the TreeSet, loading the other columns' data only where
> needed.
>
> In this step it tries to avoid scanning the non-sort dimensions'
> data.
>
> For example: SELECT name, serialname, country, salary, id, date
> FROM t3 ORDER BY serialname limit 10000
>
> Suppose there are 3 blocklets. In the first 2 blocklets, serialname is
> between serialname1 and serialname100, and each holds 2,500 rows of
> serialname1 and 2,500 rows of serialname2.
>
> In the third blocklet, serialname is between serialname2 and
> serialname100, but there is no serialname1 in it.
>
> Load the serialname data for all 3 blocklets and put it into a TreeSet
> sorted by the minimum serialname.
>
> When the iterator fetches the top 10,000 sorted rows, it clearly only
> needs to touch the first 2 blocklets (5,000 serialname1 + 5,000
> serialname2).
>
> In other words, it loads serialname data for all 3 blocklets, but loads
> the name, country, salary, id and date data only for the first 2
> blocklets.
>
>
>
> step 5. The TakeOrderedAndProject physical plan merges the sorted data
> from all partitions.
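>
> Step 5 is plain top-N merging, which Spark's TakeOrderedAndProject already
> implements (using a bounded priority queue); conceptually it amounts to
> this simplified sketch:
>
>     // each partition yields at most `limit` rows already sorted by the
>     // order-by key; merging and re-taking the first `limit` rows gives
>     // the global result
>     def merge[A: Ordering](partitions: Seq[Seq[A]], limit: Int): Seq[A] =
>       partitions.flatten.sorted.take(limit)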
>
>
>
> The items below could also be optimized in the future:
>
>
>
> • leverage the MDK keys' ordering to optimize SQL that orders by a prefix
> of the MDK's dimension columns
>
> • use the dimension ordering at the blocklet level together with the
> dimensions' inverted indexes to optimize SQL that orders by multiple
> dimensions
>
> Jarck Ma