[GitHub] carbondata pull request #1559: [CARBONDATA-1805][Dictionary] Optimize prunin...

qiuchenjian-2
GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/1559

    [CARBONDATA-1805][Dictionary] Optimize pruning for dictionary loading

    Be sure to complete the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [X] Any interfaces changed?
          `NO`
     - [X] Any backward compatibility impacted?
          `NO`
     - [X] Document update required?
          `NO`
     - [X] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            `NO TESTS ADDED, PERFORMANCE ENHANCEMENT DIDN'T AFFECT THE FUNCTIONALITY`
            - How it is tested? Please attach test report.
            `TESTED IN CLUSTER WITH REAL DATA`
            - Is it a performance related change? Please attach the performance test report.
            `PERFORMANCE ENHANCED, DICTIONARY TIME REDUCED FROM 2.9MIN TO 29SEC`
            - Any additional information to help reviewers in testing this change.
            `NO`
     - [X] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
            `NOT RELATED`
   
    COPIED FROM JIRA
    ===
   
    # SCENARIO
   
    Recently I tried the dictionary feature in CarbonData and found that the dictionary generation phase of data loading is quite slow. My scenario is as follows:
   
    + Input data: 35.8GB CSV file with 199 columns and 126 million lines
   
    + Dictionary columns: 3 columns containing 19213, 4, and 9 distinct values respectively
   
    The whole data load takes about 2.9min for dictionary generation and 4.6min for fact data loading -- about 39% of the time is spent on the dictionary.
   
    Looking at the nmon results, I found that CPU usage was quite high during the dictionary generation phase, while disk and network usage were normal.
   
    # ANALYZE
   
    After going through the dictionary generation code, I found that CarbonData already prunes non-dictionary columns before generating the dictionary. The problem is that `the pruning comes after data file reading`, which causes some overhead. We can optimize it by `pruning while reading the data file`.
   
    # RESOLVE
   
    Refactor the `loadDataFrame` method in `GlobalDictionaryUtil` so that the non-dictionary columns are pruned while the data file is being read.
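
A minimal sketch of the difference between pruning after reading and pruning while reading, written in plain Python rather than Scala/Spark. It is not the code in `GlobalDictionaryUtil`; the function names (`prune_after_reading`, `prune_while_reading`, `distinct_values`) and the five-column sample are hypothetical and chosen only for illustration.

```python
import csv
import io


def prune_after_reading(lines, header, wanted):
    """Materialize every column of every row first, then drop the unneeded ones."""
    idx = [header.index(name) for name in wanted]
    full_rows = [row for row in csv.reader(lines)]  # all columns held in memory
    return [[row[i] for i in idx] for row in full_rows]


def prune_while_reading(lines, header, wanted):
    """Keep only the wanted columns as each row is read; nothing else is retained."""
    idx = [header.index(name) for name in wanted]
    for row in csv.reader(lines):
        yield [row[i] for i in idx]


def distinct_values(rows, wanted):
    """Collect the distinct values seen for each dictionary column."""
    distinct = {name: set() for name in wanted}
    for row in rows:
        for name, value in zip(wanted, row):
            distinct[name].add(value)
    return distinct


# Hypothetical sample: 5 columns, of which 2 are dictionary columns.
HEADER = ["id", "city", "amount", "gender", "note"]
SAMPLE = "1,Paris,10.5,F,a\n2,Rome,3.0,M,b\n3,Paris,7.2,F,c\n"
DICT_COLUMNS = ["city", "gender"]

pruned = prune_while_reading(io.StringIO(SAMPLE), HEADER, DICT_COLUMNS)
result = distinct_values(pruned, DICT_COLUMNS)
print({name: sorted(values) for name, values in result.items()})
```

Both functions yield the same pruned rows; the second one simply never holds the full-width rows. How much this saves in a real Spark job depends on the reader and the data, so the sketch only shows the shape of the change, not its cost.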
   
    After implementing this optimization, dictionary generation takes only `29s` -- **`about 6 times faster than before`** (2.9min) -- while fact data loading takes the same as before (4.6min), so about 10% of the total time is now spent on the dictionary.
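
The ratios quoted in this description can be rechecked from the three reported timings. The snippet below is only an arithmetic check; the variable names are illustrative, and the inputs are the rounded figures given above.

```python
# Timings quoted in the description, converted to seconds.
dict_before = 2.9 * 60  # dictionary generation before the change
dict_after = 29.0       # dictionary generation after the change
fact_load = 4.6 * 60    # fact data loading, unchanged by the change

speedup = dict_before / dict_after
share_before = dict_before / (dict_before + fact_load)
share_after = dict_after / (dict_after + fact_load)

print(f"speedup: {speedup:.1f}x")
print(f"dictionary share before: {share_before:.1%}")
print(f"dictionary share after: {share_after:.1%}")
```

With these inputs the speedup is about 6.0x and the dictionary share drops from about 38.7% to about 9.5%, consistent with the rounded "39%" and "10%".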
   
    # NOTE
   
    + Currently only `load data file` will benefit from this optimization, while `load data frame` will not.
   
    + Before implementing this solution, I tried another one: caching the dataframe of the data file. The performance was even worse; dictionary generation took 5.6min.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata opt_dict_load

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1559.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1559
   
----
commit e8e49ed54085700eadde81842af0b0daecaed12a
Author: xuchuanyin <[hidden email]>
Date:   2017-11-24T03:27:02Z

    optimize pruning for dictionary loading

----


---
[GitHub] carbondata issue #1559: [CARBONDATA-1805][Dictionary] Optimize pruning for d...

qiuchenjian-2
Github user ndwangsen commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Nice job, loading performance is obviously improved.


---
qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    retest this please


---
qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1438/



---
qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1543/



---
qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1547/



---
qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/1945/



---
qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/1947/



---
qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    retest this please


---
qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1548/



---
qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    retest this please


---
qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1588/



---
qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    retest this please


---
qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1615/



---
qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    retest this please


---
qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1666/



---
qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    retest this please


---
qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Build Success with Spark 2.2.0, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/437/



---
qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1707/



---
qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/1559
 
    retest this please


---