GitHub user ravipesala opened a pull request:
https://github.com/apache/incubator-carbondata/pull/358

[WIP] Adding bucketing to carbon table loading

Bucketing is a useful feature when a user wants to join big tables. It also helps driver-level partition pruning, which improves query performance. A user can add buckets on any dimension column (except complex types) as follows:

```
CREATE TABLE test(user_id BIGINT, firstname STRING, lastname STRING)
CLUSTERED BY(user_id) INTO 32 BUCKETS;
```

In the above example, column `user_id` is hash partitioned, which creates 32 bucket/partition files in carbondata. When joining with another table on the bucketed column, the same buckets can be selected on both sides and joined without shuffling.

Carbon creates the following folder structure. Since carbon already supports partitioning in its file format, we can make use of it, or we can move the partitionId to the file metadata. But if we move the partitionId to metadata, there would be complications in backward compatibility.

dbName -> tableName -> Fact -> Part<id> -> Segment_id -> carbondatafiles

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata bucket

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/358.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #358

----
commit 76b7da3662f09a8a38544514e44f01b2c17662a1
Author: ravipesala <[hidden email]>
Date: 2016-11-27T11:28:55Z

    Added partitioner

commit a2a42c8edb77ad0bfe4ec503523f00f33105a588
Author: ravipesala <[hidden email]>
Date: 2016-11-27T16:59:39Z

    Added bucketing in load
----

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at [hidden email] or file a JIRA ticket with INFRA.
---
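As a rough illustration of the bucketing idea described in this pull request, the sketch below hash-partitions join keys into a fixed number of buckets and then joins two inputs bucket by bucket. It is not CarbonData code: the class `BucketJoinSketch`, its methods, and the use of `Long.hashCode` as the hash function are assumptions made for the example, since the pull request does not state which hash is used.

```java
import java.util.ArrayList;
import java.util.List;

final class BucketJoinSketch {

    // Maps a key to a bucket id in [0, numBuckets).
    // Assumption: Long.hashCode stands in for whatever hash CarbonData uses.
    static int bucketOf(long key, int numBuckets) {
        return Math.floorMod(Long.hashCode(key), numBuckets);
    }

    // Distributes keys into numBuckets lists, one list per bucket.
    static List<List<Long>> bucketize(long[] keys, int numBuckets) {
        List<List<Long>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (long key : keys) {
            buckets.get(bucketOf(key, numBuckets)).add(key);
        }
        return buckets;
    }

    // Joins two key sets bucket by bucket: bucket b on the left is only
    // compared with bucket b on the right, so no key has to be moved
    // (shuffled) between buckets to find its match.
    static List<Long> bucketJoin(long[] left, long[] right, int numBuckets) {
        List<List<Long>> leftBuckets = bucketize(left, numBuckets);
        List<List<Long>> rightBuckets = bucketize(right, numBuckets);
        List<Long> matches = new ArrayList<>();
        for (int b = 0; b < numBuckets; b++) {
            for (long l : leftBuckets.get(b)) {
                for (long r : rightBuckets.get(b)) {
                    if (l == r) {
                        matches.add(l);
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        long[] users = {1L, 33L, 7L, 64L};
        long[] orders = {33L, 7L, 7L, 100L};
        // Both sides use 32 buckets, as in the CREATE TABLE example.
        System.out.println(bucketJoin(users, orders, 32));
    }
}
```

The property that matters is that both tables use the same hash and the same bucket count on the join column; otherwise equal keys could land in different buckets and the bucket-wise join would miss them.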
Github user ravipesala commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
CI cannot pass right now because there are thrift changes, but I have verified locally that all tests pass.
Github user ravipesala commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
@jackylk Bucketing is now integrated into the Spark 2.0 layer as well, and it is working. Please review.
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/358#discussion_r90791744

--- Diff: core/src/main/java/org/apache/carbondata/core/carbon/path/CarbonTablePath.java ---
    @@ -375,6 +378,24 @@ public static String getTaskNo(String carbonDataFileName) {
       }

       /**
    +   * gets updated timestamp information from given carbon data file name
    +   */
    +  public static String getBucketNo(String carbonDataFileName) {
--- End diff --

Rename the input parameter to `carbonFilePath`; it is a full path.
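To illustrate why the parameter name matters, here is a minimal sketch of pulling a bucket number out of a full file path. The file name layout `part-<partNo>-<taskNo>-<bucketNo>-<timestamp>.carbondata`, the class `BucketPathSketch`, and the fallback to `"0"` for names without a bucket field are all assumptions for this example; the actual body of `getBucketNo` is not shown in the diff.

```java
final class BucketPathSketch {

    // Assumed (hypothetical) file name layout:
    //   part-<partNo>-<taskNo>-<bucketNo>-<timestamp>.carbondata
    static String bucketNoOf(String carbonFilePath) {
        // The argument is a full path, so drop the directories first.
        int lastSeparator = Math.max(carbonFilePath.lastIndexOf('/'),
                carbonFilePath.lastIndexOf('\\'));
        String fileName = carbonFilePath.substring(lastSeparator + 1);

        String[] fields = fileName.split("-");
        // Fewer than five fields: treat the file as written without a bucket
        // field and fall back to bucket "0" (a choice made for this sketch).
        if (fields.length < 5) {
            return "0";
        }
        return fields[3];
    }

    public static void main(String[] args) {
        String path =
                "/store/db/test/Fact/Part0/Segment_0/part-0-3-17-1480000000000.carbondata";
        System.out.println(bucketNoOf(path));
    }
}
```

Because the directory part is stripped before parsing, a dash anywhere in the parent folders cannot shift the field positions, which is the practical reason to be explicit that the input is a path and not a bare file name.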
Github user ravipesala commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
@jackylk Added test cases for all scenarios of bucket join and also addressed the review comments. Please review.
Github user ravipesala commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
Build finished.
Github user jackylk commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
please rebase
Github user CarbonDataQA commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
Build Failed with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/72/
Github user ravipesala commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
Rebased, but compilation will fail because this PR includes format updates.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
Build Failed with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/102/
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/358#discussion_r92124656

--- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/CarbonSource.scala ---
    @@ -131,7 +131,16 @@ class CarbonSource extends CreatableRelationProvider
         }
         val map = scala.collection.mutable.Map[String, String]();
         parameters.foreach { x => map.put(x._1, x._2) }
    -    val cm = TableCreator.prepareTableModel(false, Option(dbName), tableName, fields, Nil, map)
    +    val bucketFields = {
    +      if (parameters.contains("bucketnumber") && parameters.contains("bucketcolumns")) {
--- End diff --

Please add these new parameters to CarbonOption and use it here.
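One way to read this suggestion is to keep the option names and their parsing in one place instead of checking raw map keys at the call site. The sketch below shows that shape in Java. The class `BucketOptionsSketch`, the nested `BucketSpec`, and the comma-separated format for `bucketcolumns` are assumptions for illustration and are not CarbonOption's actual API; only the two keys `bucketnumber` and `bucketcolumns` come from the diff.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class BucketOptionsSketch {

    // Hypothetical holder for the parsed bucketing options.
    static final class BucketSpec {
        final int numberOfBuckets;
        final List<String> columns;

        BucketSpec(int numberOfBuckets, List<String> columns) {
            this.numberOfBuckets = numberOfBuckets;
            this.columns = columns;
        }
    }

    // Returns a spec only when both keys from the diff are present.
    static Optional<BucketSpec> parse(Map<String, String> parameters) {
        String number = parameters.get("bucketnumber");
        String columns = parameters.get("bucketcolumns");
        if (number == null || columns == null) {
            return Optional.empty();
        }
        int numberOfBuckets = Integer.parseInt(number.trim());
        if (numberOfBuckets <= 0) {
            throw new IllegalArgumentException("bucketnumber must be positive");
        }
        // Assumption: columns are given as a comma-separated list.
        List<String> names = new ArrayList<>();
        for (String name : columns.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return Optional.of(new BucketSpec(numberOfBuckets, names));
    }

    public static void main(String[] args) {
        Map<String, String> parameters = new HashMap<>();
        parameters.put("bucketnumber", "32");
        parameters.put("bucketcolumns", "user_id");
        parse(parameters).ifPresent(spec ->
                System.out.println(spec.numberOfBuckets + " buckets on " + spec.columns));
    }
}
```

Centralizing the parsing also gives a single spot for validation, such as rejecting a non-positive bucket count, which the inline `contains` checks in the diff do not cover.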
Github user CarbonDataQA commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
Build Failed with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/159/
Github user jackylk commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
please rebase
Github user ravipesala commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
@jackylk Rebased. Please review.
Github user CarbonDataQA commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
Build Failed with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/246/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
Build Failed with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/258/
Github user CarbonDataQA commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
Build Failed with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/263/
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/358#discussion_r93274074

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/CarbonMultiBlockSplit.java ---
    @@ -44,17 +44,17 @@
       /*
        * The location of all wrapped splits belong to the same node
        */
    -  private String location;
    +  private String[] locations;
--- End diff --

Why change this to an array? Please update the comment to match.
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/incubator-carbondata/pull/358#discussion_r93275523

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala ---
    @@ -93,36 +95,53 @@ class CarbonScanRDD(
         var noOfTasks = 0
         if (!splits.isEmpty) {
    -      // create a list of block based on split
    -      val blockList = splits.asScala.map(_.asInstanceOf[Distributable])
    -
    -      // get the list of executors and map blocks to executors based on locality
    -      val activeNodes = DistributionUtil.ensureExecutorsAndGetNodeList(blockList, sparkContext)
    -
    -      // divide the blocks among the tasks of the nodes as per the data locality
    -      val nodeBlockMapping = CarbonLoaderUtil.nodeBlockTaskMapping(blockList.asJava, -1,
    -        parallelism, activeNodes.toList.asJava)
           statistic.addStatistics(QueryStatisticsConstants.BLOCK_ALLOCATION, System.currentTimeMillis)
           statisticRecorder.recordStatisticsForDriver(statistic, queryId)
           statistic = new QueryStatistic()
    -      var i = 0
    -      // Create Spark Partition for each task and assign blocks
    -      nodeBlockMapping.asScala.foreach { case (node, blockList) =>
    -        blockList.asScala.foreach { blocksPerTask =>
    -          val splits = blocksPerTask.asScala.map(_.asInstanceOf[CarbonInputSplit])
    -          if (blocksPerTask.size() != 0) {
    -            val multiBlockSplit = new CarbonMultiBlockSplit(identifier, splits.asJava, node)
    -            val partition = new CarbonSparkPartition(id, i, multiBlockSplit)
    -            result.add(partition)
    -            i += 1
    +      if (bucketedTable != null) {
    +        var i = 0
    +        val bucketed =
--- End diff --

Incorrect indentation.
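The bucketed branch of this diff is cut off at `val bucketed =`, so its body is not visible here. As a hedged sketch of the general idea only, grouping input splits by bucket id so that each partition holds exactly one bucket could look like the following. `Split`, `bucketId`, and `groupByBucket` are hypothetical names for this example and are not the CarbonData classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BucketPartitionSketch {

    // Hypothetical stand-in for an input split that knows its bucket.
    static final class Split {
        final String path;
        final int bucketId;

        Split(String path, int bucketId) {
            this.path = path;
            this.bucketId = bucketId;
        }
    }

    // Partition i of the result holds every split of bucket i. Empty buckets
    // are kept as empty partitions so that the partition index always equals
    // the bucket id.
    static List<List<Split>> groupByBucket(List<Split> splits, int numBuckets) {
        List<List<Split>> partitions = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Split split : splits) {
            if (split.bucketId < 0 || split.bucketId >= numBuckets) {
                throw new IllegalArgumentException("bucket id out of range: " + split.bucketId);
            }
            partitions.get(split.bucketId).add(split);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
                new Split("segment0/fileA", 1),
                new Split("segment1/fileB", 1),
                new Split("segment0/fileC", 3));
        List<List<Split>> partitions = groupByBucket(splits, 4);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i).size() + " split(s)");
        }
    }
}
```

Keeping partition index and bucket id aligned is what would let two bucketed tables be joined partition by partition; a locality-based assignment like the removed code has no such alignment.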
Github user CarbonDataQA commented on the issue:
https://github.com/apache/incubator-carbondata/pull/358
Build Failed with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/282/