[GitHub] carbondata pull request #2210: [CARBONDATA-2381] Improve compaction performa...

classic Classic list List threaded Threaded
32 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2210: [CARBONDATA-2381] Improve compaction performa...

qiuchenjian-2
GitHub user manishgupta88 opened a pull request:

    https://github.com/apache/carbondata/pull/2210

    [CARBONDATA-2381] Improve compaction performance by filling batch result in columnar format and performing IO at blocklet level

    Problem: Compaction performance is slow as compared to data load
    Analysis:
    1. During compaction result filling is done in row format. Due to this as the number of columns increases the dimension and measure data filling time increases. This happens because in row filling we are not able to take advantage of OS cacheable buffers as we continuously read data for next column.
    2. As compaction uses a page level reader flow wherein both IO and uncompression is done at page level, the IO and uncompression time increases in this model.
   
    Solution:
    1. Implement a columnar format filling data structure for compaction process for filling dimension and measure data.
    2. Perform IO at blocklet level and uncompression at page level.
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
   
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishgupta88/carbondata compaction_slow_fix_1.4

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2210.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2210
   
----
commit 857e5d2a75eaf16457023f5a04a2e420bb8cbff2
Author: manishgupta88 <tomanishgupta18@...>
Date:   2018-04-23T07:06:39Z

    Problem: Compaction performance is slow as compared to data load
   
    Analysis:
    1. During compaction result filling is done in row format. Due to this as the number of columns increases the dimension and measure data filling time
    increases. This happens because in row filling we are not able to take advantage of OS cacheable buffers as we continuously read data for next column.
    2. As compaction uses a page level reader flow wherein both IO and uncompression is done at page level, the IO and uncompression time
    increases in this model.
   
    Solution:
    1. Implement a columnar format filling data structure for compaction process for filling dimension and measure data.
    2. Perform IO at blocklet level and uncompression at page level.

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4145/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4150/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5329/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4176/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4491/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5348/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2210: [WIP] [CARBONDATA-2381] Improve compaction pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r183822226
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/collector/ResultCollectorFactory.java ---
    @@ -45,31 +46,37 @@
        * @return
        */
       public static AbstractScannedResultCollector getScannedResultCollector(
    -      BlockExecutionInfo blockExecutionInfo) {
    +      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
    --- End diff --
   
    Instead of changing constructor, can we add statistics model to block execution info??


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2210: [WIP] [CARBONDATA-2381] Improve compaction pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r183822969
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/stats/QueryStatisticsConstants.java ---
    @@ -58,6 +58,28 @@
     
       String PAGE_SCANNED = "The number of page scanned";
     
    +  /**
    +   * measure filling time includes time taken for reading all measures data from a given offset
    +   * and adding each column data to an array. Includes total time for 1 query result iterator.
    +   */
    +  String MEASURE_FILLING_TIME = "measure filling time";
    +
    +  /**
    +   * dimension filling time includes time taken for reading all dimensions data from a given offset
    +   * and filling each column data to byte array. Includes total time for 1 query result iterator.
    +   */
    +  String DIMENSION_FILLING_TIME = "dimension filling time";
    --- End diff --
   
    Change this to key column filling time


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    I think we need these changes in case of DictionaryBasedResultCollector also, as when number of column is more than 100 it goes to dictionarybasedresult collector, it will improve the query performance when number of projection columns is more than 100


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2210: [WIP] [CARBONDATA-2381] Improve compaction pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r184326083
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/stats/QueryStatisticsConstants.java ---
    @@ -58,6 +58,28 @@
     
       String PAGE_SCANNED = "The number of page scanned";
     
    +  /**
    +   * measure filling time includes time taken for reading all measures data from a given offset
    +   * and adding each column data to an array. Includes total time for 1 query result iterator.
    +   */
    +  String MEASURE_FILLING_TIME = "measure filling time";
    +
    +  /**
    +   * dimension filling time includes time taken for reading all dimensions data from a given offset
    +   * and filling each column data to byte array. Includes total time for 1 query result iterator.
    +   */
    +  String DIMENSION_FILLING_TIME = "dimension filling time";
    --- End diff --
   
    ok


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2210: [WIP] [CARBONDATA-2381] Improve compaction pe...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r184327009
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/collector/ResultCollectorFactory.java ---
    @@ -45,31 +46,37 @@
        * @return
        */
       public static AbstractScannedResultCollector getScannedResultCollector(
    -      BlockExecutionInfo blockExecutionInfo) {
    +      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
    --- End diff --
   
    I think it is better to pass in constructor instead of making the blockExecutionInfo object more heavier. Also because the QueryStatisticsModel is at the reader level and one reader can have multiple blocks so in my opinion keeping in blockExecutionInfo will not be a good idea.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [WIP] [CARBONDATA-2381] Improve compaction performan...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    @kumarvishal09 ...I agree with you that we need these changes in DictionaryBasedResultCollector also to improve the query performance when number of columns are greater than 100.
    As this PR is specifically for compaction performance fix, I will raise a separate PR for this. I have raised a sub jira task under same jira to track the issue.
    https://issues.apache.org/jira/browse/CARBONDATA-2405



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5420/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4252/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4558/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2210: [CARBONDATA-2381] Improve compaction performa...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user kumarvishal09 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r184449793
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/collector/ResultCollectorFactory.java ---
    @@ -45,31 +46,37 @@
        * @return
        */
       public static AbstractScannedResultCollector getScannedResultCollector(
    -      BlockExecutionInfo blockExecutionInfo) {
    +      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
    --- End diff --
   
    For one task we are creating only one QueryStatisticsModel even one task is handling multiple block (in case of merge small files) . Passing in constructor or keeping in blockexecutioninfo is same cost(same reference). Just to hold the reference we are changing multiple classes constructor. BlockExecutionInfo is also just a holder. If we add in abstract class constructor In future every concrete class needs to maintain querystatisticsmodel even if it doesn't use


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4265/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2210: [CARBONDATA-2381] Improve compaction performance by ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2210
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5432/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2210: [CARBONDATA-2381] Improve compaction performa...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2210#discussion_r184610515
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/scan/collector/ResultCollectorFactory.java ---
    @@ -45,31 +46,37 @@
        * @return
        */
       public static AbstractScannedResultCollector getScannedResultCollector(
    -      BlockExecutionInfo blockExecutionInfo) {
    +      BlockExecutionInfo blockExecutionInfo, QueryStatisticsModel queryStatisticsModel) {
    --- End diff --
   
    ok


---
12