[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance


[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

GitHub user ravipesala opened a pull request:

    https://github.com/apache/carbondata/pull/2275

    [WIP] Fix lucene datasize and performance

    Be sure to complete all of the following checklist items to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
   
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata lucene-improv

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2275.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2275
   
----
commit 1990603f2ba6351b0059b4dc658aa59ebafbac66
Author: ravipesala <ravi.pesala@...>
Date:   2018-05-06T18:12:09Z

    Fix lucene datasize and performance

----


---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4752/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4515/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5675/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5684/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4524/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4760/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5700/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4540/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4778/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5730/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4569/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4613/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5772/



---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187768992
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
    @@ -107,13 +118,39 @@ public LuceneDataMapFactoryBase(CarbonTable carbonTable, DataMapSchema dataMapSc
         // optimizedOperations.add(ExpressionType.LESSTHAN_EQUALTO);
         // optimizedOperations.add(ExpressionType.NOT);
         optimizedOperations.add(ExpressionType.TEXT_MATCH);
    -    this.dataMapMeta = new DataMapMeta(indexedColumns, optimizedOperations);
    -
    +    this.dataMapMeta = new DataMapMeta(indexedCarbonColumns, optimizedOperations);
         // get analyzer
         // TODO: how to get analyzer ?
         analyzer = new StandardAnalyzer();
       }
     
    +  public static int validateAndGetWriteCacheSize(DataMapSchema schema) {
    +    String cacheStr = schema.getProperties().get(FLUSH_CACHE);
    +    if (cacheStr == null) {
    +      cacheStr = "-1";
    --- End diff --
   
    It would be better to define the property's default value right below the property name.
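
    A minimal sketch of that idea (the companion constant name is an assumption for illustration, not something from this PR):

        static final String FLUSH_CACHE = "flush_cache";
        // hypothetical default, declared right below the property name;
        // a value <= 0 disables the write cache in this PR's code
        static final String FLUSH_CACHE_DEFAULT = "-1";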


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769012
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
    @@ -61,6 +61,9 @@
     @InterfaceAudience.Internal
     abstract class LuceneDataMapFactoryBase<T extends DataMap> extends DataMapFactory<T> {
     
    +  static final String FLUSH_CACHE = "flush_cache";
    --- End diff --
   
    Could you explain the intention of these two properties, as you did in the PR description?
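
    For illustration, their semantics as inferred from the code in this PR could be documented like this (the wording, and the "split_blocklet" value string, are assumptions):

        // Number of rows buffered in memory before being flushed to the
        // Lucene IndexWriter; a value <= 0 writes each row directly.
        static final String FLUSH_CACHE = "flush_cache";

        // When "true", index data is stored blocklet-wise instead of
        // block-wise, changing the granularity of index lookups.
        static final String SPLIT_BLOCKLET = "split_blocklet";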


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769000
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
    @@ -107,13 +118,39 @@ public LuceneDataMapFactoryBase(CarbonTable carbonTable, DataMapSchema dataMapSc
         // optimizedOperations.add(ExpressionType.LESSTHAN_EQUALTO);
         // optimizedOperations.add(ExpressionType.NOT);
         optimizedOperations.add(ExpressionType.TEXT_MATCH);
    -    this.dataMapMeta = new DataMapMeta(indexedColumns, optimizedOperations);
    -
    +    this.dataMapMeta = new DataMapMeta(indexedCarbonColumns, optimizedOperations);
         // get analyzer
         // TODO: how to get analyzer ?
         analyzer = new StandardAnalyzer();
       }
     
    +  public static int validateAndGetWriteCacheSize(DataMapSchema schema) {
    +    String cacheStr = schema.getProperties().get(FLUSH_CACHE);
    +    if (cacheStr == null) {
    +      cacheStr = "-1";
    +    }
    +    int cacheSize;
    +    try {
    +      cacheSize = Integer.parseInt(cacheStr);
    +    } catch (NumberFormatException e) {
    +      cacheSize = -1;
    +    }
    +    return cacheSize;
    +  }
    +
    +  public static boolean validateAndGetStoreBlockletWise(DataMapSchema schema) {
    +    String splitBlockletStr = schema.getProperties().get(SPLIT_BLOCKLET);
    +    if (splitBlockletStr == null) {
    +      splitBlockletStr = "false";
    --- End diff --
   
    Same as above
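
    Applying the same pattern here would give, for example (the default constant name is assumed):

        static final String SPLIT_BLOCKLET = "split_blocklet";
        // hypothetical default, kept next to the property name
        static final String SPLIT_BLOCKLET_DEFAULT = "false";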


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769074
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapRefresher.java ---
    @@ -114,108 +118,36 @@ public void initialize() throws IOException {
         indexWriter = new IndexWriter(indexDir, new IndexWriterConfig(analyzer));
       }
     
    -  private IndexWriter createPageIndexWriter() throws IOException {
    -    // save index data into ram, write into disk after one page finished
    -    RAMDirectory ramDir = new RAMDirectory();
    -    return new IndexWriter(ramDir, new IndexWriterConfig(analyzer));
    -  }
    -
    -  private void addPageIndex(IndexWriter pageIndexWriter) throws IOException {
    -
    -    Directory directory = pageIndexWriter.getDirectory();
    -
    -    // close ram writer
    -    pageIndexWriter.close();
    -
    -    // add ram index data into disk
    -    indexWriter.addIndexes(directory);
    -
    -    // delete this ram data
    -    directory.close();
    -  }
    -
    -  @Override
    -  public void addRow(int blockletId, int pageId, int rowId, Object[] values) throws IOException {
    -    if (rowId == 0) {
    -      if (pageIndexWriter != null) {
    -        addPageIndex(pageIndexWriter);
    -      }
    -      pageIndexWriter = createPageIndexWriter();
    -    }
    -
    -    // create a new document
    -    Document doc = new Document();
    -
    -    // add blocklet Id
    -    doc.add(new IntPoint(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
    -    doc.add(new StoredField(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
    -
    -    // add page id
    -    doc.add(new IntPoint(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
    -    doc.add(new StoredField(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
    -
    -    // add row id
    -    doc.add(new IntPoint(LuceneDataMapWriter.ROWID_NAME, rowId));
    -    doc.add(new StoredField(LuceneDataMapWriter.ROWID_NAME, rowId));
    +  @Override public void addRow(int blockletId, int pageId, int rowId, Object[] values)
    +      throws IOException {
     
         // add other fields
    +    LuceneDataMapWriter.LuceneColumnKeys columns =
    +        new LuceneDataMapWriter.LuceneColumnKeys(columnsCount);
         for (int colIdx = 0; colIdx < columnsCount; colIdx++) {
    -      CarbonColumn column = indexColumns.get(colIdx);
    -      addField(doc, column.getColName(), column.getDataType(), values[colIdx]);
    +      columns.getColValues()[colIdx] = values[colIdx];
    +    }
    +    if (writeCacheSize > 0) {
    +      addToCache(columns, rowId, pageId, blockletId, cache, intBuffer, storeBlockletWise);
    +      flushCacheIfCan();
    +    } else {
    +      addData(columns, rowId, pageId, blockletId, intBuffer, indexWriter, indexColumns,
    +          storeBlockletWise);
         }
     
    -    pageIndexWriter.addDocument(doc);
       }
     
    -  private boolean addField(Document doc, String fieldName, DataType type, Object value) {
    -    if (type == DataTypes.STRING) {
    -      doc.add(new TextField(fieldName, (String) value, Field.Store.NO));
    -    } else if (type == DataTypes.BYTE) {
    -      // byte type , use int range to deal with byte, lucene has no byte type
    -      IntRangeField field =
    -          new IntRangeField(fieldName, new int[] { Byte.MIN_VALUE }, new int[] { Byte.MAX_VALUE });
    -      field.setIntValue((int) value);
    -      doc.add(field);
    -    } else if (type == DataTypes.SHORT) {
    -      // short type , use int range to deal with short type, lucene has no short type
    -      IntRangeField field = new IntRangeField(fieldName, new int[] { Short.MIN_VALUE },
    -          new int[] { Short.MAX_VALUE });
    -      field.setShortValue((short) value);
    -      doc.add(field);
    -    } else if (type == DataTypes.INT) {
    -      // int type , use int point to deal with int type
    -      doc.add(new IntPoint(fieldName, (int) value));
    -    } else if (type == DataTypes.LONG) {
    -      // long type , use long point to deal with long type
    -      doc.add(new LongPoint(fieldName, (long) value));
    -    } else if (type == DataTypes.FLOAT) {
    -      doc.add(new FloatPoint(fieldName, (float) value));
    -    } else if (type == DataTypes.DOUBLE) {
    -      doc.add(new DoublePoint(fieldName, (double) value));
    -    } else if (type == DataTypes.DATE) {
    -      // TODO: how to get data value
    -    } else if (type == DataTypes.TIMESTAMP) {
    -      // TODO: how to get
    -    } else if (type == DataTypes.BOOLEAN) {
    -      IntRangeField field = new IntRangeField(fieldName, new int[] { 0 }, new int[] { 1 });
    -      field.setIntValue((boolean) value ? 1 : 0);
    -      doc.add(field);
    -    } else {
    -      LOGGER.error("unsupport data type " + type);
    -      throw new RuntimeException("unsupported data type " + type);
    +  private void flushCacheIfCan() throws IOException {
    --- End diff --
   
    It would be better to name it `flushCacheIfPossible`.
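
    For reference, the renamed method would read as follows (its flush body is elided in this diff, so it is only indicated by a comment):

        private void flushCacheIfPossible() throws IOException {
          if (cache.size() > writeCacheSize) {
            // ... flush cached rows to indexWriter, as in the original body ...
          }
        }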


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769271
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
    @@ -61,6 +61,9 @@
     @InterfaceAudience.Internal
     abstract class LuceneDataMapFactoryBase<T extends DataMap> extends DataMapFactory<T> {
     
    +  static final String FLUSH_CACHE = "flush_cache";
    --- End diff --
   
    Besides, what is the unit of this value? It would be better to add a description for it.
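
    Judging from its later use in this PR (`cache.size() > writeCacheSize`), the value appears to count cached row entries rather than bytes, though that is only an inference from the code; a short description next to the constant would make it explicit.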


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769080
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapRefresher.java ---
    @@ -114,108 +118,36 @@ public void initialize() throws IOException {
         indexWriter = new IndexWriter(indexDir, new IndexWriterConfig(analyzer));
       }
     
    -  private IndexWriter createPageIndexWriter() throws IOException {
    -    // save index data into ram, write into disk after one page finished
    -    RAMDirectory ramDir = new RAMDirectory();
    -    return new IndexWriter(ramDir, new IndexWriterConfig(analyzer));
    -  }
    -
    -  private void addPageIndex(IndexWriter pageIndexWriter) throws IOException {
    -
    -    Directory directory = pageIndexWriter.getDirectory();
    -
    -    // close ram writer
    -    pageIndexWriter.close();
    -
    -    // add ram index data into disk
    -    indexWriter.addIndexes(directory);
    -
    -    // delete this ram data
    -    directory.close();
    -  }
    -
    -  @Override
    -  public void addRow(int blockletId, int pageId, int rowId, Object[] values) throws IOException {
    -    if (rowId == 0) {
    -      if (pageIndexWriter != null) {
    -        addPageIndex(pageIndexWriter);
    -      }
    -      pageIndexWriter = createPageIndexWriter();
    -    }
    -
    -    // create a new document
    -    Document doc = new Document();
    -
    -    // add blocklet Id
    -    doc.add(new IntPoint(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
    -    doc.add(new StoredField(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
    -
    -    // add page id
    -    doc.add(new IntPoint(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
    -    doc.add(new StoredField(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
    -
    -    // add row id
    -    doc.add(new IntPoint(LuceneDataMapWriter.ROWID_NAME, rowId));
    -    doc.add(new StoredField(LuceneDataMapWriter.ROWID_NAME, rowId));
    +  @Override public void addRow(int blockletId, int pageId, int rowId, Object[] values)
    +      throws IOException {
     
         // add other fields
    +    LuceneDataMapWriter.LuceneColumnKeys columns =
    +        new LuceneDataMapWriter.LuceneColumnKeys(columnsCount);
         for (int colIdx = 0; colIdx < columnsCount; colIdx++) {
    -      CarbonColumn column = indexColumns.get(colIdx);
    -      addField(doc, column.getColName(), column.getDataType(), values[colIdx]);
    +      columns.getColValues()[colIdx] = values[colIdx];
    +    }
    +    if (writeCacheSize > 0) {
    +      addToCache(columns, rowId, pageId, blockletId, cache, intBuffer, storeBlockletWise);
    +      flushCacheIfCan();
    +    } else {
    +      addData(columns, rowId, pageId, blockletId, intBuffer, indexWriter, indexColumns,
    +          storeBlockletWise);
         }
     
    -    pageIndexWriter.addDocument(doc);
       }
     
    -  private boolean addField(Document doc, String fieldName, DataType type, Object value) {
    -    if (type == DataTypes.STRING) {
    -      doc.add(new TextField(fieldName, (String) value, Field.Store.NO));
    -    } else if (type == DataTypes.BYTE) {
    -      // byte type , use int range to deal with byte, lucene has no byte type
    -      IntRangeField field =
    -          new IntRangeField(fieldName, new int[] { Byte.MIN_VALUE }, new int[] { Byte.MAX_VALUE });
    -      field.setIntValue((int) value);
    -      doc.add(field);
    -    } else if (type == DataTypes.SHORT) {
    -      // short type , use int range to deal with short type, lucene has no short type
    -      IntRangeField field = new IntRangeField(fieldName, new int[] { Short.MIN_VALUE },
    -          new int[] { Short.MAX_VALUE });
    -      field.setShortValue((short) value);
    -      doc.add(field);
    -    } else if (type == DataTypes.INT) {
    -      // int type , use int point to deal with int type
    -      doc.add(new IntPoint(fieldName, (int) value));
    -    } else if (type == DataTypes.LONG) {
    -      // long type , use long point to deal with long type
    -      doc.add(new LongPoint(fieldName, (long) value));
    -    } else if (type == DataTypes.FLOAT) {
    -      doc.add(new FloatPoint(fieldName, (float) value));
    -    } else if (type == DataTypes.DOUBLE) {
    -      doc.add(new DoublePoint(fieldName, (double) value));
    -    } else if (type == DataTypes.DATE) {
    -      // TODO: how to get data value
    -    } else if (type == DataTypes.TIMESTAMP) {
    -      // TODO: how to get
    -    } else if (type == DataTypes.BOOLEAN) {
    -      IntRangeField field = new IntRangeField(fieldName, new int[] { 0 }, new int[] { 1 });
    -      field.setIntValue((boolean) value ? 1 : 0);
    -      doc.add(field);
    -    } else {
    -      LOGGER.error("unsupport data type " + type);
    -      throw new RuntimeException("unsupported data type " + type);
    +  private void flushCacheIfCan() throws IOException {
    +    if (cache.size() > writeCacheSize) {
    --- End diff --
   
    Is it `>` or `>=`?
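
    As written, the cache flushes only after it exceeds the limit: with `flush_cache` set to 100, `>` lets the cache grow to 101 entries before a flush, whereas `>=` would flush as soon as the 100th entry arrives.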


---