[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance


[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

GitHub user ravipesala opened a pull request:

    https://github.com/apache/carbondata/pull/2275

    [WIP] Fix lucene datasize and performance

    Be sure to complete all of the following checklist items to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
   
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata lucene-improv

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2275.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2275
   
----
commit 1990603f2ba6351b0059b4dc658aa59ebafbac66
Author: ravipesala <ravi.pesala@...>
Date:   2018-05-06T18:12:09Z

    Fix lucene datasize and performance

----


---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4752/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4515/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5675/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5684/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4524/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4760/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5700/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4540/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4778/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5730/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4569/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4613/



---

[GitHub] carbondata issue #2275: [WIP] Fix lucene datasize and performance

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2275
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5772/



---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187768992
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
    @@ -107,13 +118,39 @@ public LuceneDataMapFactoryBase(CarbonTable carbonTable, DataMapSchema dataMapSc
         // optimizedOperations.add(ExpressionType.LESSTHAN_EQUALTO);
         // optimizedOperations.add(ExpressionType.NOT);
         optimizedOperations.add(ExpressionType.TEXT_MATCH);
    -    this.dataMapMeta = new DataMapMeta(indexedColumns, optimizedOperations);
    -
    +    this.dataMapMeta = new DataMapMeta(indexedCarbonColumns, optimizedOperations);
         // get analyzer
         // TODO: how to get analyzer ?
         analyzer = new StandardAnalyzer();
       }
     
    +  public static int validateAndGetWriteCacheSize(DataMapSchema schema) {
    +    String cacheStr = schema.getProperties().get(FLUSH_CACHE);
    +    if (cacheStr == null) {
    +      cacheStr = "-1";
    --- End diff --
   
    It would be better to define the property's default value right below the property name.
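
    A minimal sketch of that idea (the companion constant name is an assumption for illustration, not something from this PR):

        static final String FLUSH_CACHE = "flush_cache";
        // hypothetical default, declared right below the property name;
        // a value <= 0 disables the write cache in this PR's code
        static final String FLUSH_CACHE_DEFAULT = "-1";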


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769012
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
    @@ -61,6 +61,9 @@
     @InterfaceAudience.Internal
     abstract class LuceneDataMapFactoryBase<T extends DataMap> extends DataMapFactory<T> {
     
    +  static final String FLUSH_CACHE = "flush_cache";
    --- End diff --
   
    Could you explain the intention of these two properties, as you did in the PR description?
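
    For illustration, their semantics as inferred from the code in this PR could be documented like this (the wording, and the "split_blocklet" value string, are assumptions):

        // Number of rows buffered in memory before being flushed to the
        // Lucene IndexWriter; a value <= 0 writes each row directly.
        static final String FLUSH_CACHE = "flush_cache";

        // When "true", index data is stored blocklet-wise instead of
        // block-wise, changing the granularity of index lookups.
        static final String SPLIT_BLOCKLET = "split_blocklet";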


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769000
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
    @@ -107,13 +118,39 @@ public LuceneDataMapFactoryBase(CarbonTable carbonTable, DataMapSchema dataMapSc
         // optimizedOperations.add(ExpressionType.LESSTHAN_EQUALTO);
         // optimizedOperations.add(ExpressionType.NOT);
         optimizedOperations.add(ExpressionType.TEXT_MATCH);
    -    this.dataMapMeta = new DataMapMeta(indexedColumns, optimizedOperations);
    -
    +    this.dataMapMeta = new DataMapMeta(indexedCarbonColumns, optimizedOperations);
         // get analyzer
         // TODO: how to get analyzer ?
         analyzer = new StandardAnalyzer();
       }
     
    +  public static int validateAndGetWriteCacheSize(DataMapSchema schema) {
    +    String cacheStr = schema.getProperties().get(FLUSH_CACHE);
    +    if (cacheStr == null) {
    +      cacheStr = "-1";
    +    }
    +    int cacheSize;
    +    try {
    +      cacheSize = Integer.parseInt(cacheStr);
    +    } catch (NumberFormatException e) {
    +      cacheSize = -1;
    +    }
    +    return cacheSize;
    +  }
    +
    +  public static boolean validateAndGetStoreBlockletWise(DataMapSchema schema) {
    +    String splitBlockletStr = schema.getProperties().get(SPLIT_BLOCKLET);
    +    if (splitBlockletStr == null) {
    +      splitBlockletStr = "false";
    --- End diff --
   
    Same as above
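
    Applying the same pattern here would give, for example (the default constant name is assumed):

        static final String SPLIT_BLOCKLET = "split_blocklet";
        // hypothetical default, kept next to the property name
        static final String SPLIT_BLOCKLET_DEFAULT = "false";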


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769074
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapRefresher.java ---
    @@ -114,108 +118,36 @@ public void initialize() throws IOException {
         indexWriter = new IndexWriter(indexDir, new IndexWriterConfig(analyzer));
       }
     
    -  private IndexWriter createPageIndexWriter() throws IOException {
    -    // save index data into ram, write into disk after one page finished
    -    RAMDirectory ramDir = new RAMDirectory();
    -    return new IndexWriter(ramDir, new IndexWriterConfig(analyzer));
    -  }
    -
    -  private void addPageIndex(IndexWriter pageIndexWriter) throws IOException {
    -
    -    Directory directory = pageIndexWriter.getDirectory();
    -
    -    // close ram writer
    -    pageIndexWriter.close();
    -
    -    // add ram index data into disk
    -    indexWriter.addIndexes(directory);
    -
    -    // delete this ram data
    -    directory.close();
    -  }
    -
    -  @Override
    -  public void addRow(int blockletId, int pageId, int rowId, Object[] values) throws IOException {
    -    if (rowId == 0) {
    -      if (pageIndexWriter != null) {
    -        addPageIndex(pageIndexWriter);
    -      }
    -      pageIndexWriter = createPageIndexWriter();
    -    }
    -
    -    // create a new document
    -    Document doc = new Document();
    -
    -    // add blocklet Id
    -    doc.add(new IntPoint(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
    -    doc.add(new StoredField(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
    -
    -    // add page id
    -    doc.add(new IntPoint(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
    -    doc.add(new StoredField(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
    -
    -    // add row id
    -    doc.add(new IntPoint(LuceneDataMapWriter.ROWID_NAME, rowId));
    -    doc.add(new StoredField(LuceneDataMapWriter.ROWID_NAME, rowId));
    +  @Override public void addRow(int blockletId, int pageId, int rowId, Object[] values)
    +      throws IOException {
     
         // add other fields
    +    LuceneDataMapWriter.LuceneColumnKeys columns =
    +        new LuceneDataMapWriter.LuceneColumnKeys(columnsCount);
         for (int colIdx = 0; colIdx < columnsCount; colIdx++) {
    -      CarbonColumn column = indexColumns.get(colIdx);
    -      addField(doc, column.getColName(), column.getDataType(), values[colIdx]);
    +      columns.getColValues()[colIdx] = values[colIdx];
    +    }
    +    if (writeCacheSize > 0) {
    +      addToCache(columns, rowId, pageId, blockletId, cache, intBuffer, storeBlockletWise);
    +      flushCacheIfCan();
    +    } else {
    +      addData(columns, rowId, pageId, blockletId, intBuffer, indexWriter, indexColumns,
    +          storeBlockletWise);
         }
     
    -    pageIndexWriter.addDocument(doc);
       }
     
    -  private boolean addField(Document doc, String fieldName, DataType type, Object value) {
    -    if (type == DataTypes.STRING) {
    -      doc.add(new TextField(fieldName, (String) value, Field.Store.NO));
    -    } else if (type == DataTypes.BYTE) {
    -      // byte type , use int range to deal with byte, lucene has no byte type
    -      IntRangeField field =
    -          new IntRangeField(fieldName, new int[] { Byte.MIN_VALUE }, new int[] { Byte.MAX_VALUE });
    -      field.setIntValue((int) value);
    -      doc.add(field);
    -    } else if (type == DataTypes.SHORT) {
    -      // short type , use int range to deal with short type, lucene has no short type
    -      IntRangeField field = new IntRangeField(fieldName, new int[] { Short.MIN_VALUE },
    -          new int[] { Short.MAX_VALUE });
    -      field.setShortValue((short) value);
    -      doc.add(field);
    -    } else if (type == DataTypes.INT) {
    -      // int type , use int point to deal with int type
    -      doc.add(new IntPoint(fieldName, (int) value));
    -    } else if (type == DataTypes.LONG) {
    -      // long type , use long point to deal with long type
    -      doc.add(new LongPoint(fieldName, (long) value));
    -    } else if (type == DataTypes.FLOAT) {
    -      doc.add(new FloatPoint(fieldName, (float) value));
    -    } else if (type == DataTypes.DOUBLE) {
    -      doc.add(new DoublePoint(fieldName, (double) value));
    -    } else if (type == DataTypes.DATE) {
    -      // TODO: how to get data value
    -    } else if (type == DataTypes.TIMESTAMP) {
    -      // TODO: how to get
    -    } else if (type == DataTypes.BOOLEAN) {
    -      IntRangeField field = new IntRangeField(fieldName, new int[] { 0 }, new int[] { 1 });
    -      field.setIntValue((boolean) value ? 1 : 0);
    -      doc.add(field);
    -    } else {
    -      LOGGER.error("unsupport data type " + type);
    -      throw new RuntimeException("unsupported data type " + type);
    +  private void flushCacheIfCan() throws IOException {
    --- End diff --
   
    It would be better to name it `flushCacheIfPossible`.
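
    For reference, the renamed method would read as follows (its flush body is elided in this diff, so it is only indicated by a comment):

        private void flushCacheIfPossible() throws IOException {
          if (cache.size() > writeCacheSize) {
            // ... flush cached rows to indexWriter, as in the original body ...
          }
        }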


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769271
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
    @@ -61,6 +61,9 @@
     @InterfaceAudience.Internal
     abstract class LuceneDataMapFactoryBase<T extends DataMap> extends DataMapFactory<T> {
     
    +  static final String FLUSH_CACHE = "flush_cache";
    --- End diff --
   
    Besides, what is the unit of this value? It would be better to add a description for it.
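
    Judging from its later use in this PR (`cache.size() > writeCacheSize`), the value appears to count cached row entries rather than bytes, though that is only an inference from the code; a short description next to the constant would make it explicit.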


---

[GitHub] carbondata pull request #2275: [WIP] Fix lucene datasize and performance

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2275#discussion_r187769080
 
    --- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapRefresher.java ---
    @@ -114,108 +118,36 @@ public void initialize() throws IOException {
         indexWriter = new IndexWriter(indexDir, new IndexWriterConfig(analyzer));
       }
     
    -  private IndexWriter createPageIndexWriter() throws IOException {
    -    // save index data into ram, write into disk after one page finished
    -    RAMDirectory ramDir = new RAMDirectory();
    -    return new IndexWriter(ramDir, new IndexWriterConfig(analyzer));
    -  }
    -
    -  private void addPageIndex(IndexWriter pageIndexWriter) throws IOException {
    -
    -    Directory directory = pageIndexWriter.getDirectory();
    -
    -    // close ram writer
    -    pageIndexWriter.close();
    -
    -    // add ram index data into disk
    -    indexWriter.addIndexes(directory);
    -
    -    // delete this ram data
    -    directory.close();
    -  }
    -
    -  @Override
    -  public void addRow(int blockletId, int pageId, int rowId, Object[] values) throws IOException {
    -    if (rowId == 0) {
    -      if (pageIndexWriter != null) {
    -        addPageIndex(pageIndexWriter);
    -      }
    -      pageIndexWriter = createPageIndexWriter();
    -    }
    -
    -    // create a new document
    -    Document doc = new Document();
    -
    -    // add blocklet Id
    -    doc.add(new IntPoint(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
    -    doc.add(new StoredField(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
    -
    -    // add page id
    -    doc.add(new IntPoint(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
    -    doc.add(new StoredField(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
    -
    -    // add row id
    -    doc.add(new IntPoint(LuceneDataMapWriter.ROWID_NAME, rowId));
    -    doc.add(new StoredField(LuceneDataMapWriter.ROWID_NAME, rowId));
    +  @Override public void addRow(int blockletId, int pageId, int rowId, Object[] values)
    +      throws IOException {
     
         // add other fields
    +    LuceneDataMapWriter.LuceneColumnKeys columns =
    +        new LuceneDataMapWriter.LuceneColumnKeys(columnsCount);
         for (int colIdx = 0; colIdx < columnsCount; colIdx++) {
    -      CarbonColumn column = indexColumns.get(colIdx);
    -      addField(doc, column.getColName(), column.getDataType(), values[colIdx]);
    +      columns.getColValues()[colIdx] = values[colIdx];
    +    }
    +    if (writeCacheSize > 0) {
    +      addToCache(columns, rowId, pageId, blockletId, cache, intBuffer, storeBlockletWise);
    +      flushCacheIfCan();
    +    } else {
    +      addData(columns, rowId, pageId, blockletId, intBuffer, indexWriter, indexColumns,
    +          storeBlockletWise);
         }
     
    -    pageIndexWriter.addDocument(doc);
       }
     
    -  private boolean addField(Document doc, String fieldName, DataType type, Object value) {
    -    if (type == DataTypes.STRING) {
    -      doc.add(new TextField(fieldName, (String) value, Field.Store.NO));
    -    } else if (type == DataTypes.BYTE) {
    -      // byte type , use int range to deal with byte, lucene has no byte type
    -      IntRangeField field =
    -          new IntRangeField(fieldName, new int[] { Byte.MIN_VALUE }, new int[] { Byte.MAX_VALUE });
    -      field.setIntValue((int) value);
    -      doc.add(field);
    -    } else if (type == DataTypes.SHORT) {
    -      // short type , use int range to deal with short type, lucene has no short type
    -      IntRangeField field = new IntRangeField(fieldName, new int[] { Short.MIN_VALUE },
    -          new int[] { Short.MAX_VALUE });
    -      field.setShortValue((short) value);
    -      doc.add(field);
    -    } else if (type == DataTypes.INT) {
    -      // int type , use int point to deal with int type
    -      doc.add(new IntPoint(fieldName, (int) value));
    -    } else if (type == DataTypes.LONG) {
    -      // long type , use long point to deal with long type
    -      doc.add(new LongPoint(fieldName, (long) value));
    -    } else if (type == DataTypes.FLOAT) {
    -      doc.add(new FloatPoint(fieldName, (float) value));
    -    } else if (type == DataTypes.DOUBLE) {
    -      doc.add(new DoublePoint(fieldName, (double) value));
    -    } else if (type == DataTypes.DATE) {
    -      // TODO: how to get data value
    -    } else if (type == DataTypes.TIMESTAMP) {
    -      // TODO: how to get
    -    } else if (type == DataTypes.BOOLEAN) {
    -      IntRangeField field = new IntRangeField(fieldName, new int[] { 0 }, new int[] { 1 });
    -      field.setIntValue((boolean) value ? 1 : 0);
    -      doc.add(field);
    -    } else {
    -      LOGGER.error("unsupport data type " + type);
    -      throw new RuntimeException("unsupported data type " + type);
    +  private void flushCacheIfCan() throws IOException {
    +    if (cache.size() > writeCacheSize) {
    --- End diff --
   
    Is it `>` or `>=`?
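
    As written, the cache flushes only after it exceeds the limit: with `flush_cache` set to 100, `>` lets the cache grow to 101 entries before a flush, whereas `>=` would flush as soon as the 100th entry arrives.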


---