GitHub user ravipesala opened a pull request:
https://github.com/apache/carbondata/pull/2275

[WIP] Fix lucene datasize and performance

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

 - [ ] Any interfaces changed?
 - [ ] Any backward compatibility impacted?
 - [ ] Document update required?
 - [ ] Testing done
       Please provide details on
       - Whether new unit test cases have been added or why no new tests are required?
       - How it is tested? Please attach test report.
       - Is it a performance related change? Please attach the performance test report.
       - Any additional information to help reviewers in testing this change.
 - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata lucene-improv

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2275.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2275

----

commit 1990603f2ba6351b0059b4dc658aa59ebafbac66
Author: ravipesala <ravi.pesala@...>
Date:   2018-05-06T18:12:09Z

    Fix lucene datasize and performance

----
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2275 SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4752/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4515/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5675/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5684/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4524/

---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2275 SDV Build Success, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4760/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5700/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4540/

---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2275 SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4778/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5730/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4569/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4613/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2275 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5772/

---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2275#discussion_r187768992

--- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
@@ -107,13 +118,39 @@ public LuceneDataMapFactoryBase(CarbonTable carbonTable, DataMapSchema dataMapSc
     // optimizedOperations.add(ExpressionType.LESSTHAN_EQUALTO);
     // optimizedOperations.add(ExpressionType.NOT);
     optimizedOperations.add(ExpressionType.TEXT_MATCH);
-    this.dataMapMeta = new DataMapMeta(indexedColumns, optimizedOperations);
-
+    this.dataMapMeta = new DataMapMeta(indexedCarbonColumns, optimizedOperations);
     // get analyzer
     // TODO: how to get analyzer ?
     analyzer = new StandardAnalyzer();
   }

+  public static int validateAndGetWriteCacheSize(DataMapSchema schema) {
+    String cacheStr = schema.getProperties().get(FLUSH_CACHE);
+    if (cacheStr == null) {
+      cacheStr = "-1";
--- End diff --

Better to provide the property default value right below the property name.

---
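The layout the reviewer asks for would look roughly like the sketch below: a companion default constant declared directly under the property name, so the name and its default never drift apart. The constant name `FLUSH_CACHE_DEFAULT` and its reuse in the catch block are illustrative, not taken from the PR:

    static final String FLUSH_CACHE = "flush_cache";
    // hypothetical companion constant, kept right below the property name
    static final String FLUSH_CACHE_DEFAULT = "-1";

    public static int validateAndGetWriteCacheSize(DataMapSchema schema) {
      String cacheStr = schema.getProperties().get(FLUSH_CACHE);
      if (cacheStr == null) {
        cacheStr = FLUSH_CACHE_DEFAULT;
      }
      int cacheSize;
      try {
        cacheSize = Integer.parseInt(cacheStr);
      } catch (NumberFormatException e) {
        // fall back to the same default on a malformed value
        cacheSize = Integer.parseInt(FLUSH_CACHE_DEFAULT);
      }
      return cacheSize;
    }

---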
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2275#discussion_r187769012

--- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
@@ -61,6 +61,9 @@
 @InterfaceAudience.Internal
 abstract class LuceneDataMapFactoryBase<T extends DataMap> extends DataMapFactory<T> {

+  static final String FLUSH_CACHE = "flush_cache";
--- End diff --

Can you explain the intention of these two properties, just like you wrote in the PR description?

---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2275#discussion_r187769000

--- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
@@ -107,13 +118,39 @@ public LuceneDataMapFactoryBase(CarbonTable carbonTable, DataMapSchema dataMapSc
     // optimizedOperations.add(ExpressionType.LESSTHAN_EQUALTO);
     // optimizedOperations.add(ExpressionType.NOT);
     optimizedOperations.add(ExpressionType.TEXT_MATCH);
-    this.dataMapMeta = new DataMapMeta(indexedColumns, optimizedOperations);
-
+    this.dataMapMeta = new DataMapMeta(indexedCarbonColumns, optimizedOperations);
     // get analyzer
     // TODO: how to get analyzer ?
     analyzer = new StandardAnalyzer();
   }

+  public static int validateAndGetWriteCacheSize(DataMapSchema schema) {
+    String cacheStr = schema.getProperties().get(FLUSH_CACHE);
+    if (cacheStr == null) {
+      cacheStr = "-1";
+    }
+    int cacheSize;
+    try {
+      cacheSize = Integer.parseInt(cacheStr);
+    } catch (NumberFormatException e) {
+      cacheSize = -1;
+    }
+    return cacheSize;
+  }
+
+  public static boolean validateAndGetStoreBlockletWise(DataMapSchema schema) {
+    String splitBlockletStr = schema.getProperties().get(SPLIT_BLOCKLET);
+    if (splitBlockletStr == null) {
+      splitBlockletStr = "false";
--- End diff --

Same as above.

---
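The same pattern applied to the boolean property. `SPLIT_BLOCKLET_DEFAULT` is again an illustrative name, and the literal "split_blocklet" is assumed from the constant name, since the quoted hunks never show its value. `Boolean.parseBoolean` never throws, so no try/catch is needed here:

    static final String SPLIT_BLOCKLET = "split_blocklet";
    // hypothetical companion constant, as suggested above
    static final String SPLIT_BLOCKLET_DEFAULT = "false";

    public static boolean validateAndGetStoreBlockletWise(DataMapSchema schema) {
      String splitBlockletStr = schema.getProperties().get(SPLIT_BLOCKLET);
      if (splitBlockletStr == null) {
        splitBlockletStr = SPLIT_BLOCKLET_DEFAULT;
      }
      // parseBoolean treats anything other than "true" (case-insensitive) as false
      return Boolean.parseBoolean(splitBlockletStr);
    }

---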
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2275#discussion_r187769074

--- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapRefresher.java ---
@@ -114,108 +118,36 @@ public void initialize() throws IOException {
     indexWriter = new IndexWriter(indexDir, new IndexWriterConfig(analyzer));
   }

-  private IndexWriter createPageIndexWriter() throws IOException {
-    // save index data into ram, write into disk after one page finished
-    RAMDirectory ramDir = new RAMDirectory();
-    return new IndexWriter(ramDir, new IndexWriterConfig(analyzer));
-  }
-
-  private void addPageIndex(IndexWriter pageIndexWriter) throws IOException {
-
-    Directory directory = pageIndexWriter.getDirectory();
-
-    // close ram writer
-    pageIndexWriter.close();
-
-    // add ram index data into disk
-    indexWriter.addIndexes(directory);
-
-    // delete this ram data
-    directory.close();
-  }
-
-  @Override
-  public void addRow(int blockletId, int pageId, int rowId, Object[] values) throws IOException {
-    if (rowId == 0) {
-      if (pageIndexWriter != null) {
-        addPageIndex(pageIndexWriter);
-      }
-      pageIndexWriter = createPageIndexWriter();
-    }
-
-    // create a new document
-    Document doc = new Document();
-
-    // add blocklet Id
-    doc.add(new IntPoint(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
-    doc.add(new StoredField(LuceneDataMapWriter.BLOCKLETID_NAME, (int) values[columnsCount]));
-
-    // add page id
-    doc.add(new IntPoint(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
-    doc.add(new StoredField(LuceneDataMapWriter.PAGEID_NAME, (int) values[columnsCount + 1]));
-
-    // add row id
-    doc.add(new IntPoint(LuceneDataMapWriter.ROWID_NAME, rowId));
-    doc.add(new StoredField(LuceneDataMapWriter.ROWID_NAME, rowId));
+  @Override public void addRow(int blockletId, int pageId, int rowId, Object[] values)
+      throws IOException {

     // add other fields
+    LuceneDataMapWriter.LuceneColumnKeys columns =
+        new LuceneDataMapWriter.LuceneColumnKeys(columnsCount);
     for (int colIdx = 0; colIdx < columnsCount; colIdx++) {
-      CarbonColumn column = indexColumns.get(colIdx);
-      addField(doc, column.getColName(), column.getDataType(), values[colIdx]);
+      columns.getColValues()[colIdx] = values[colIdx];
+    }
+    if (writeCacheSize > 0) {
+      addToCache(columns, rowId, pageId, blockletId, cache, intBuffer, storeBlockletWise);
+      flushCacheIfCan();
+    } else {
+      addData(columns, rowId, pageId, blockletId, intBuffer, indexWriter, indexColumns,
+          storeBlockletWise);
     }
-    pageIndexWriter.addDocument(doc);
   }

-  private boolean addField(Document doc, String fieldName, DataType type, Object value) {
-    if (type == DataTypes.STRING) {
-      doc.add(new TextField(fieldName, (String) value, Field.Store.NO));
-    } else if (type == DataTypes.BYTE) {
-      // byte type , use int range to deal with byte, lucene has no byte type
-      IntRangeField field =
-          new IntRangeField(fieldName, new int[] { Byte.MIN_VALUE }, new int[] { Byte.MAX_VALUE });
-      field.setIntValue((int) value);
-      doc.add(field);
-    } else if (type == DataTypes.SHORT) {
-      // short type , use int range to deal with short type, lucene has no short type
-      IntRangeField field = new IntRangeField(fieldName, new int[] { Short.MIN_VALUE },
-          new int[] { Short.MAX_VALUE });
-      field.setShortValue((short) value);
-      doc.add(field);
-    } else if (type == DataTypes.INT) {
-      // int type , use int point to deal with int type
-      doc.add(new IntPoint(fieldName, (int) value));
-    } else if (type == DataTypes.LONG) {
-      // long type , use long point to deal with long type
-      doc.add(new LongPoint(fieldName, (long) value));
-    } else if (type == DataTypes.FLOAT) {
-      doc.add(new FloatPoint(fieldName, (float) value));
-    } else if (type == DataTypes.DOUBLE) {
-      doc.add(new DoublePoint(fieldName, (double) value));
-    } else if (type == DataTypes.DATE) {
-      // TODO: how to get data value
-    } else if (type == DataTypes.TIMESTAMP) {
-      // TODO: how to get
-    } else if (type == DataTypes.BOOLEAN) {
-      IntRangeField field = new IntRangeField(fieldName, new int[] { 0 }, new int[] { 1 });
-      field.setIntValue((boolean) value ? 1 : 0);
-      doc.add(field);
-    } else {
-      LOGGER.error("unsupport data type " + type);
-      throw new RuntimeException("unsupported data type " + type);
+  private void flushCacheIfCan() throws IOException {
--- End diff --

better to name it `flushCacheIfPossible`

---
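The rename itself is mechanical. A minimal sketch, assuming the `cache` and `writeCacheSize` members visible in this diff; the quoted hunk is cut off right after the size check, so `flushCache()` below is a hypothetical stand-in for whatever the PR actually calls inside the branch:

    // renamed from flushCacheIfCan(), as suggested
    private void flushCacheIfPossible() throws IOException {
      if (cache.size() > writeCacheSize) {
        // hypothetical stand-in: the real flush call is truncated in the hunk
        flushCache();
      }
    }

---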
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2275#discussion_r187769271

--- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapFactoryBase.java ---
@@ -61,6 +61,9 @@
 @InterfaceAudience.Internal
 abstract class LuceneDataMapFactoryBase<T extends DataMap> extends DataMapFactory<T> {

+  static final String FLUSH_CACHE = "flush_cache";
--- End diff --

Besides, what's the unit of this value? Better to add a description for it.

---
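One way to settle both questions (intent and unit) is to document the constants where they are declared. The Javadoc below is an assumption about the intended semantics, inferred from how the values are used elsewhere in this PR, not text taken from it; the "split_blocklet" literal is likewise inferred from the constant name:

    /**
     * DataMap property: number of rows to buffer in memory before they are
     * flushed to the Lucene index in one batch. Unit: row count (assumed from
     * the cache.size() threshold check). A non-positive or malformed value
     * disables the write cache and writes each row directly.
     */
    static final String FLUSH_CACHE = "flush_cache";

    /**
     * DataMap property: when "true", store the Lucene index blocklet-wise
     * rather than block-wise (assumed from validateAndGetStoreBlockletWise;
     * defaults to "false").
     */
    static final String SPLIT_BLOCKLET = "split_blocklet";

---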
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2275#discussion_r187769080

--- Diff: datamap/lucene/src/main/java/org/apache/carbondata/datamap/lucene/LuceneDataMapRefresher.java ---
@@ -114,108 +118,36 @@ public void initialize() throws IOException {
[same hunk as quoted in the previous comment, ending two lines further in:]
+  private void flushCacheIfCan() throws IOException {
+    if (cache.size() > writeCacheSize) {
--- End diff --

Is it `>` or `>=`?

---
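The behavioral difference behind this question: with `>`, the cache is flushed only once it already holds `writeCacheSize + 1` entries, while `>=` flushes as soon as the configured size is reached. A small illustration, reusing the hypothetical `flushCache()` stand-in from the note above:

    // with writeCacheSize = 1000:
    //   cache.size() >  writeCacheSize  -> flush on the 1001st cached row
    //   cache.size() >= writeCacheSize  -> flush on the 1000th cached row
    if (cache.size() >= writeCacheSize) {
      flushCache();
    }

---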