[GitHub] carbondata pull request #2653: [CARBONDATA-2874] Support SDK writer as threa...

classic Classic list List threaded Threaded
57 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2653: [CARBONDATA-2874] Support SDK writer as threa...

qiuchenjian-2
GitHub user ajantha-bhat opened a pull request:

    https://github.com/apache/carbondata/pull/2653

    [CARBONDATA-2874] Support SDK writer as thread safe api  

    Problem: Currently CarbonWriter.write() not a thread safe. if multiple threads calls .write() for one writer.
    Data count inconsistency is observed.
   
    root casue: As all the threads are writing to same batch of blocking queue. need to synchronize this. Else one thread data overwrite the other thread data.
   
    Solution:
    a) DataLoadExecutor is using only one iterator, take number of threads as input and internally create that many iterator to loop over the data. This will reduce the blocking time of queue as each iterator has its own queue.
    b) InputProcessor step is taking only default 2 cores (2 thread) for data load in SDK flow, can use the same number as number of threads created by user.
    c) writer step is using only 2 cores  (2 thread). can use the same number as number of threads created by user.
   
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed? NA. Added new interface
     
     - [ ] Any backward compatibility impacted? NA
     
     - [ ] Document update required? yes,  udpated
   
     - [ ] Testing done. Yes, updated the test case
         
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajantha-bhat/carbondata master_new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2653.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2653
   
----
commit ad991851c109a1231e5aea088001f1c80097d3b3
Author: ajantha-bhat <ajanthabhat@...>
Date:   2018-08-21T05:22:35Z

    multi-thread by iterators

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6357/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8006/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6729/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    retest this please


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8011/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6734/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ajantha-bhat commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    retest this please


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8015/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6738/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6368/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/8020/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2653: [CARBONDATA-2874] Support SDK writer as threa...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2653#discussion_r212531097
 
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java ---
    @@ -268,6 +268,49 @@ public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext contex
             executorService);
       }
     
    +  public RecordWriter<NullWritable, ObjectArrayWritable> getMultiThreadRecordWriter(
    --- End diff --
   
    Please add comment


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2653: [CARBONDATA-2874] Support SDK writer as thread safe ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2653
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6743/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2653: [CARBONDATA-2874] Support SDK writer as threa...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2653#discussion_r212531510
 
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java ---
    @@ -434,4 +477,61 @@ public CarbonLoadModel getLoadModel() {
           return loadModel;
         }
       }
    +
    +  public static class CarbonMultiRecordWriter
    --- End diff --
   
    Better move this to separate file


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2653: [CARBONDATA-2874] Support SDK writer as threa...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2653#discussion_r212532764
 
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java ---
    @@ -434,4 +477,61 @@ public CarbonLoadModel getLoadModel() {
           return loadModel;
         }
       }
    +
    +  public static class CarbonMultiRecordWriter
    --- End diff --
   
    I think this class can extend the CarbonRecordWriter and override necessary methods


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2653: [CARBONDATA-2874] Support SDK writer as threa...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2653#discussion_r212533163
 
    --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableOutputFormat.java ---
    @@ -268,6 +268,49 @@ public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext contex
             executorService);
       }
     
    +  public RecordWriter<NullWritable, ObjectArrayWritable> getMultiThreadRecordWriter(
    --- End diff --
   
    I feel no need to add separate method, take configuration of number of threads of conf object and generate record writer


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2653: [CARBONDATA-2874] Support SDK writer as threa...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2653#discussion_r212533573
 
    --- Diff: processing/src/main/java/org/apache/carbondata/processing/util/CarbonDataProcessorUtil.java ---
    @@ -659,9 +659,14 @@ public static boolean isRawDataRequired(CarbonDataLoadConfiguration configuratio
        * @return
        */
       public static List<CarbonIterator<Object[]>>[] partitionInputReaderIterators(
    -      CarbonIterator<Object[]>[] inputIterators) {
    +      CarbonIterator<Object[]>[] inputIterators, short sdkUserCores) {
         // Get the number of cores configured in property.
    -    int numberOfCores = CarbonProperties.getInstance().getNumberOfCores();
    +    int numberOfCores;
    +    if (sdkUserCores != 0) {
    --- End diff --
   
    check `sdkUserCores>0`


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2653: [CARBONDATA-2874] Support SDK writer as threa...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2653#discussion_r212534215
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonWriterBuilder.java ---
    @@ -339,6 +339,28 @@ public CarbonWriter buildWriterForCSVInput(Schema schema)
         return new CSVCarbonWriter(loadModel);
       }
     
    +  /**
    +   * Build a {@link CarbonWriter}, which accepts row in CSV format
    +   * @param schema carbon Schema object {org.apache.carbondata.sdk.file.Schema}
    +   * @param numOfThreads number of threads() in which .write will be called.
    +   * @return CSVCarbonWriter
    +   * @throws IOException
    +   * @throws InvalidLoadOptionException
    +   */
    +  public CarbonWriter buildWriterForCSVInput(Schema schema, short numOfThreads)
    --- End diff --
   
    Better name as `buildThreadSafeWriterForCSVInput` and update java doc like how this writer can be used in multithreaded environment


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2653: [CARBONDATA-2874] Support SDK writer as threa...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2653#discussion_r212534244
 
    --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/file/CarbonWriterBuilder.java ---
    @@ -360,6 +382,33 @@ public CarbonWriter buildWriterForAvroInput(org.apache.avro.Schema avroSchema)
         return new AvroCarbonWriter(loadModel);
       }
     
    +  /**
    +   * Build a {@link CarbonWriter}, which accepts Avro object
    +   * @param avroSchema avro Schema object {org.apache.avro.Schema}
    +   * @param numOfThreads number of threads() in which .write will be called.
    +   * @return AvroCarbonWriter
    +   * @throws IOException
    +   * @throws InvalidLoadOptionException
    +   */
    +  public CarbonWriter buildWriterForAvroInput(org.apache.avro.Schema avroSchema, short numOfThreads)
    --- End diff --
   
    same comment as above


---
123