GitHub user ajantha-bhat opened a pull request:
    https://github.com/apache/carbondata/pull/2653

    [CARBONDATA-2874] Support SDK writer as thread safe api

    Problem: Currently CarbonWriter.write() is not thread safe. If multiple threads call .write() on one writer, a data count inconsistency is observed.

    Root cause: All the threads write to the same batch of the blocking queue. This needs to be synchronized; otherwise one thread's data overwrites another thread's data.

    Solution:
    a) DataLoadExecutor uses only one iterator. Take the number of threads as input and internally create that many iterators to loop over the data. This reduces the blocking time of the queue, as each iterator has its own queue.
    b) The InputProcessor step takes only the default 2 cores (2 threads) for data load in the SDK flow. It can use the same number as the number of threads created by the user.
    c) The writer step uses only 2 cores (2 threads). It can use the same number as the number of threads created by the user.

    Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

     - [ ] Any interfaces changed? NA. Added new interface
     - [ ] Any backward compatibility impacted? NA
     - [ ] Document update required? Yes, updated
     - [ ] Testing done. Yes, updated the test case
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajantha-bhat/carbondata master_new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2653.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2653

----
commit ad991851c109a1231e5aea088001f1c80097d3b3
Author: ajantha-bhat <ajanthabhat@...>
Date:   2018-08-21T05:22:35Z

    multi-thread by iterators

----

---
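As a rough illustration of point (a) in the description above, the sketch below gives every producer thread its own blocking queue and its own consumer, so no two threads ever write into the same buffer. This is not CarbonData code: `PerThreadQueueDemo`, `writeAndCount`, the `END` sentinel, and the queue capacity of 64 are all hypothetical names and values chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; it only mimics the "one queue per iterator" idea.
class PerThreadQueueDemo {
  // Sentinel row that tells a consumer its producer has finished.
  private static final Object[] END = new Object[0];

  // Starts numThreads producers and numThreads consumers, one queue per pair,
  // and returns the total number of rows the consumers received.
  static long writeAndCount(int numThreads, int rowsPerThread) {
    // 2 * numThreads so that every producer and consumer can run concurrently.
    ExecutorService pool = Executors.newFixedThreadPool(2 * numThreads);
    List<Future<Long>> consumers = new ArrayList<>();
    try {
      for (int i = 0; i < numThreads; i++) {
        final int producerId = i;
        // Private queue for this producer/consumer pair (assumed capacity).
        final BlockingQueue<Object[]> queue = new ArrayBlockingQueue<>(64);

        // Producer: plays the role of one user thread calling write().
        pool.submit(() -> {
          for (int r = 0; r < rowsPerThread; r++) {
            queue.put(new Object[] {producerId, r});
          }
          queue.put(END);
          return null;
        });

        // Consumer: plays the role of one iterator draining its own queue.
        consumers.add(pool.submit(() -> {
          long count = 0;
          while (queue.take() != END) {
            count++;
          }
          return count;
        }));
      }
      long total = 0;
      for (Future<Long> consumer : consumers) {
        total += consumer.get();
      }
      return total;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("demo failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

Calling `PerThreadQueueDemo.writeAndCount(4, 1000)` runs four producers with 1000 rows each; because each queue has exactly one writer and one reader, no row is lost or overwritten and no extra locking is needed around the write path.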
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2653 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6357/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2653 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8006/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2653 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6729/ --- |
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2653 retest this please --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2653 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8011/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2653 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6734/ --- |
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2653 retest this please --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2653 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8015/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2653 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6738/ --- |
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2653 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6368/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2653 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8020/ --- |
Github user ravipesala commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2653#discussion_r212531097

    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java ---
    @@ -268,6 +268,49 @@ public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext contex
             executorService);
       }
     
    +  public RecordWriter<NullWritable, ObjectArrayWritable> getMultiThreadRecordWriter(
    --- End diff --

    Please add comment

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2653 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6743/ --- |
Github user ravipesala commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2653#discussion_r212531510

    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java ---
    @@ -434,4 +477,61 @@ public CarbonLoadModel getLoadModel() {
           return loadModel;
         }
       }
    +
    +  public static class CarbonMultiRecordWriter
    --- End diff --

    Better to move this to a separate file

---
Github user ravipesala commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2653#discussion_r212532764

    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java ---
    @@ -434,4 +477,61 @@ public CarbonLoadModel getLoadModel() {
           return loadModel;
         }
       }
    +
    +  public static class CarbonMultiRecordWriter
    --- End diff --

    I think this class can extend the CarbonRecordWriter and override the necessary methods

---
Github user ravipesala commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2653#discussion_r212533163

    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java ---
    @@ -268,6 +268,49 @@ public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext contex
             executorService);
       }
     
    +  public RecordWriter<NullWritable, ObjectArrayWritable> getMultiThreadRecordWriter(
    --- End diff --

    I feel there is no need to add a separate method. Take the number of threads from the conf object and generate the record writer accordingly

---
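The review suggestions on this file (let the multi-thread writer extend the existing record writer, and choose the writer from a configured thread count instead of exposing a second factory method) might be sketched as follows. Everything here is a hypothetical stand-in: `SimpleRecordWriter`, `MultiThreadRecordWriter`, `RecordWriterFactory`, and the `example.writer.threads` key are invented for illustration, and `java.util.Properties` is used in place of the Hadoop configuration object.

```java
import java.util.Properties;

// Hypothetical base writer; stands in for the existing single-threaded writer.
class SimpleRecordWriter {
  String describe() {
    return "single-threaded";
  }
}

// Hypothetical subclass that overrides only what differs for multiple threads.
class MultiThreadRecordWriter extends SimpleRecordWriter {
  private final int threads;

  MultiThreadRecordWriter(int threads) {
    this.threads = threads;
  }

  @Override
  String describe() {
    return "multi-threaded(" + threads + ")";
  }
}

// Single factory entry point: the configuration decides which writer is built.
class RecordWriterFactory {
  // Assumed property name, not a real CarbonData or Hadoop key.
  static final String THREADS_KEY = "example.writer.threads";

  static SimpleRecordWriter create(Properties conf) {
    int threads = Integer.parseInt(conf.getProperty(THREADS_KEY, "1"));
    return threads > 1 ? new MultiThreadRecordWriter(threads) : new SimpleRecordWriter();
  }
}
```

With this shape, callers keep using one factory method, and setting the thread-count property is the only thing that changes between the single-threaded and multi-threaded paths.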
Github user ravipesala commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2653#discussion_r212533573

    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java ---
    @@ -659,9 +659,14 @@ public static boolean isRawDataRequired(CarbonDataLoadConfiguration configuratio
        * @return
        */
       public static List<CarbonIterator<Object[]>>[] partitionInputReaderIterators(
    -      CarbonIterator<Object[]>[] inputIterators) {
    +      CarbonIterator<Object[]>[] inputIterators, short sdkUserCores) {
         // Get the number of cores configured in property.
    -    int numberOfCores = CarbonProperties.getInstance().getNumberOfCores();
    +    int numberOfCores;
    +    if (sdkUserCores != 0) {
    --- End diff --

    check `sdkUserCores>0`

---
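To show why the reviewer asks for `> 0` rather than `!= 0`, here is a minimal sketch of the core-count selection. `CoreCountResolver` and `resolve` are hypothetical names, and `configuredCores` stands in for whatever the property lookup in the diff returns.

```java
// Hypothetical helper illustrating the suggested check; not CarbonData code.
class CoreCountResolver {
  // Uses the SDK user's thread count only when it is a positive number;
  // zero or a negative value falls back to the configured core count.
  static int resolve(short sdkUserCores, int configuredCores) {
    return sdkUserCores > 0 ? sdkUserCores : configuredCores;
  }
}
```

With a `!= 0` test, a negative `sdkUserCores` would be accepted as the number of cores; the `> 0` test sends that case to the configured default instead.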
Github user ravipesala commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2653#discussion_r212534215

    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonWriterBuilder.java ---
    @@ -339,6 +339,28 @@ public CarbonWriter buildWriterForCSVInput(Schema schema)
         return new CSVCarbonWriter(loadModel);
       }
     
    +  /**
    +   * Build a {@link CarbonWriter}, which accepts row in CSV format
    +   * @param schema carbon Schema object {org.apache.carbondata.sdk.file.Schema}
    +   * @param numOfThreads number of threads() in which .write will be called.
    +   * @return CSVCarbonWriter
    +   * @throws IOException
    +   * @throws InvalidLoadOptionException
    +   */
    +  public CarbonWriter buildWriterForCSVInput(Schema schema, short numOfThreads)
    --- End diff --

    Better to name this `buildThreadSafeWriterForCSVInput` and update the Javadoc to explain how this writer can be used in a multithreaded environment

---
Github user ravipesala commented on a diff in the pull request:
    https://github.com/apache/carbondata/pull/2653#discussion_r212534244

    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonWriterBuilder.java ---
    @@ -360,6 +382,33 @@ public CarbonWriter buildWriterForAvroInput(org.apache.avro.Schema avroSchema)
         return new AvroCarbonWriter(loadModel);
       }
     
    +  /**
    +   * Build a {@link CarbonWriter}, which accepts Avro object
    +   * @param avroSchema avro Schema object {org.apache.avro.Schema}
    +   * @param numOfThreads number of threads() in which .write will be called.
    +   * @return AvroCarbonWriter
    +   * @throws IOException
    +   * @throws InvalidLoadOptionException
    +   */
    +  public CarbonWriter buildWriterForAvroInput(org.apache.avro.Schema avroSchema, short numOfThreads)
    --- End diff --

    same comment as above

---