GitHub user jackylk opened a pull request:
GitHub user jackylk opened a pull request:

    https://github.com/apache/carbondata/pull/2254

[CARBONDATA-2415] Support Refresh DataMap command for all Index datamaps

1. Refactor the DataMapWriter interface to accept rows instead of column pages when adding data
2. Add REFRESH DATAMAP support for all index datamaps, including Lucene and Bloom
3. Make IndexDataMapRefreshRDD generic for all index datamaps

- [X] Any interfaces changed? The DataMapFactory developer interface has a new function
- [X] Any backward compatibility impacted? No
- [X] Document update required? Yes
- [X] Testing done? Test case added
- [X] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jackylk/incubator-carbondata refresh-datamap

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2254.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2254

----

commit edd16b7fd53309afa4b2053e43ecba2fb97891c6
Author: Jacky Li <jacky.likun@...>
Date: 2018-05-01T06:51:08Z

    support refresh datamap for all index datamap

----
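A minimal sketch of the row-based writer contract introduced in item 1, using plain Java stand-ins (`Object[]` instead of the real `CarbonRow`, and no I/O); the actual interface is `org.apache.carbondata.core.datamap.dev.DataMapWriter`:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the row-based writer contract: index implementations
// now receive one row per call instead of a ColumnPage[] batch, with fields
// arriving in the order of the indexed columns.
abstract class SimpleDataMapWriter {
    abstract void onBlockletStart(int blockletId);
    abstract void addRow(int blockletId, int pageId, int rowId, Object[] row);
    abstract void onBlockletEnd(int blockletId);
}

// Toy writer that just records the rows it sees, for illustration only.
class RecordingWriter extends SimpleDataMapWriter {
    final List<Object[]> rows = new ArrayList<>();
    void onBlockletStart(int blockletId) { }
    void addRow(int blockletId, int pageId, int rowId, Object[] row) {
        // copy the row: the source buffer may be freed after this call
        // when unsafe memory is in use
        rows.add(row.clone());
    }
    void onBlockletEnd(int blockletId) { }
}
```

An index implementation such as Lucene or Bloom would build its index inside `addRow` instead of recording the row.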
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2254 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5553/ ---
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2254 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4390/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2254 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5554/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2254 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4391/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2254 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4392/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2254 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5555/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2254 SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4649/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2254 SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4650/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2254 SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/4651/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2254 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/4399/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2254 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/5562/ ---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2254#discussion_r185227763 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/DataMapChooser.java --- @@ -291,8 +295,8 @@ private boolean contains(DataMapMeta mapMeta, List<ColumnExpression> columnExpre } boolean contains = true; for (ColumnExpression expression : columnExpressions) { - if (!mapMeta.getIndexedColumns().contains(expression.getColumnName()) || !mapMeta - .getOptimizedOperation().containsAll(expressionTypes)) { + if (!mapMeta.getIndexedColumnNames().contains(expression.getColumnName()) || + !mapMeta.getOptimizedOperation().containsAll(expressionTypes)) { --- End diff -- What does `expressionTypes` mean here? For a single expression, we should just check whether the column in ***this*** expression is in indexedColumns, and whether the expression type of ***this*** expression is supported. But it seems that the `expressionTypes` here does not belong to ***this*** expression. ---
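The check under discussion can be illustrated with a self-contained simplification (plain strings instead of `ColumnExpression`/`DataMapMeta`; all names here are hypothetical stand-ins): each referenced column must be indexed, and every expression type must be among the datamap's optimized operations.

```java
import java.util.List;
import java.util.Set;

// Simplified version of DataMapChooser's containment check: a datamap can
// serve a filter only if every referenced column is indexed and every
// expression type is among the optimized operations.
class DataMapMatch {
    static boolean contains(Set<String> indexedColumns, Set<String> optimizedOps,
                            List<String> filterColumns, Set<String> expressionTypes) {
        for (String col : filterColumns) {
            if (!indexedColumns.contains(col)) {
                return false;  // a filter column is not covered by the index
            }
        }
        // note: expressionTypes is checked once for the whole filter,
        // which is exactly the scoping the reviewer is questioning
        return optimizedOps.containsAll(expressionTypes);
    }
}
```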
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2254#discussion_r185231346 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapFactory.java --- @@ -27,74 +29,124 @@ import org.apache.carbondata.core.features.TableOperation; import org.apache.carbondata.core.metadata.schema.table.CarbonTable; import org.apache.carbondata.core.metadata.schema.table.DataMapSchema; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn; import org.apache.carbondata.events.Event; +import org.apache.commons.lang.StringUtils; + /** - * Interface for datamap factory, it is responsible for creating the datamap. + * Interface for datamap of index type, it is responsible for creating the datamap. */ -public interface DataMapFactory<T extends DataMap> { +public abstract class DataMapFactory<T extends DataMap> { + + public static final String INDEX_COLUMNS = "INDEX_COLUMNS"; + protected CarbonTable carbonTable; + + public DataMapFactory(CarbonTable carbonTable) { + this.carbonTable = carbonTable; + } /** * Initialization of Datamap factory with the carbonTable and datamap name */ - void init(CarbonTable carbonTable, DataMapSchema dataMapSchema) + public abstract void init(DataMapSchema dataMapSchema) throws IOException, MalformedDataMapCommandException; /** * Return a new write for this datamap */ - DataMapWriter createWriter(Segment segment, String writeDirectoryPath); + public abstract DataMapWriter createWriter(Segment segment, String shardName) + throws IOException; + + public abstract DataMapRefresher createRefresher(Segment segment, String shardName) + throws IOException; /** * Get the datamap for segmentid */ - List<T> getDataMaps(Segment segment) throws IOException; + public abstract List<T> getDataMaps(Segment segment) throws IOException; /** * Get datamaps for distributable object. 
*/ - List<T> getDataMaps(DataMapDistributable distributable) + public abstract List<T> getDataMaps(DataMapDistributable distributable) throws IOException; /** * Get all distributable objects of a segmentid * @return */ - List<DataMapDistributable> toDistributable(Segment segment); + public abstract List<DataMapDistributable> toDistributable(Segment segment); /** * * @param event */ - void fireEvent(Event event); + public abstract void fireEvent(Event event); /** * Clears datamap of the segment */ - void clear(Segment segment); + public abstract void clear(Segment segment); /** * Clear all datamaps from memory */ - void clear(); + public abstract void clear(); /** * Return metadata of this datamap */ - DataMapMeta getMeta(); + public abstract DataMapMeta getMeta(); /** * Type of datamap whether it is FG or CG */ - DataMapLevel getDataMapType(); + public abstract DataMapLevel getDataMapLevel(); /** * delete datamap data if any */ - void deleteDatamapData(); + public abstract void deleteDatamapData(); /** * This function should return true is the input operation enum will make the datamap become stale */ - boolean willBecomeStale(TableOperation operation); + public abstract boolean willBecomeStale(TableOperation operation); + + /** + * Validate INDEX_COLUMNS property and return a array containing index column name + * Following will be validated + * 1. require INDEX_COLUMNS property + * 2. INDEX_COLUMNS can't contains illegal argument(empty, blank) + * 3. INDEX_COLUMNS can't contains duplicate same columns + * 4. 
INDEX_COLUMNS should be exists in table columns + */ + public void validateIndexedColumns(DataMapSchema dataMapSchema, + CarbonTable carbonTable) throws MalformedDataMapCommandException { + List<CarbonColumn> indexColumns = carbonTable.getIndexedColumns(dataMapSchema); + Set<String> unique = new HashSet<>(); + for (CarbonColumn indexColumn : indexColumns) { + unique.add(indexColumn.getColName()); + } + if (unique.size() != indexColumns.size()) { + throw new MalformedDataMapCommandException(INDEX_COLUMNS + " has duplicate column"); + } + } + + public static String[] getIndexColumns(DataMapSchema dataMapSchema) + throws MalformedDataMapCommandException { + String columns = dataMapSchema.getProperties().get(INDEX_COLUMNS); + if (columns == null) { + columns = dataMapSchema.getProperties().get(INDEX_COLUMNS.toLowerCase()); + } + if (columns == null) { + throw new MalformedDataMapCommandException(INDEX_COLUMNS + " DMPROPERTY is required"); + } else if (StringUtils.isBlank(columns)) { --- End diff -- So, if we have a datamap which will create index for all the columns, we must specify the index_columns, right? --- |
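For illustration, the four validations listed in the javadoc above can be sketched against a plain properties map; this is a hypothetical simplification, not the actual CarbonData code (which works on `DataMapSchema`/`CarbonTable` and throws `MalformedDataMapCommandException`):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the INDEX_COLUMNS validations: property present, not blank,
// no duplicates, and every column exists in the table.
class IndexColumnsValidator {
    static String[] validate(Map<String, String> dmProperties, Set<String> tableColumns) {
        String columns = dmProperties.get("INDEX_COLUMNS");
        if (columns == null) {
            columns = dmProperties.get("index_columns");  // property lookup is case-tolerant
        }
        if (columns == null) {
            throw new IllegalArgumentException("INDEX_COLUMNS DMPROPERTY is required");
        }
        if (columns.trim().isEmpty()) {
            throw new IllegalArgumentException("INDEX_COLUMNS is blank");
        }
        String[] names = columns.split(",");
        Set<String> unique = new HashSet<>();
        for (int i = 0; i < names.length; i++) {
            names[i] = names[i].trim().toLowerCase();
            if (names[i].isEmpty()) {
                throw new IllegalArgumentException("INDEX_COLUMNS contains invalid column name");
            }
            if (!unique.add(names[i])) {
                throw new IllegalArgumentException("INDEX_COLUMNS has duplicate column");
            }
            if (!tableColumns.contains(names[i])) {
                throw new IllegalArgumentException(
                    "column '" + names[i] + "' does not exist in table");
            }
        }
        return names;
    }
}
```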
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2254#discussion_r185231977 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapFactory.java --- @@ -27,74 +29,124 @@ import org.apache.carbondata.core.features.TableOperation; import org.apache.carbondata.core.metadata.schema.table.CarbonTable; import org.apache.carbondata.core.metadata.schema.table.DataMapSchema; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn; import org.apache.carbondata.events.Event; +import org.apache.commons.lang.StringUtils; + /** - * Interface for datamap factory, it is responsible for creating the datamap. + * Interface for datamap of index type, it is responsible for creating the datamap. */ -public interface DataMapFactory<T extends DataMap> { +public abstract class DataMapFactory<T extends DataMap> { + + public static final String INDEX_COLUMNS = "INDEX_COLUMNS"; + protected CarbonTable carbonTable; + + public DataMapFactory(CarbonTable carbonTable) { + this.carbonTable = carbonTable; + } /** * Initialization of Datamap factory with the carbonTable and datamap name */ - void init(CarbonTable carbonTable, DataMapSchema dataMapSchema) + public abstract void init(DataMapSchema dataMapSchema) throws IOException, MalformedDataMapCommandException; /** * Return a new write for this datamap */ - DataMapWriter createWriter(Segment segment, String writeDirectoryPath); + public abstract DataMapWriter createWriter(Segment segment, String shardName) + throws IOException; + + public abstract DataMapRefresher createRefresher(Segment segment, String shardName) + throws IOException; /** * Get the datamap for segmentid */ - List<T> getDataMaps(Segment segment) throws IOException; + public abstract List<T> getDataMaps(Segment segment) throws IOException; /** * Get datamaps for distributable object. 
*/ - List<T> getDataMaps(DataMapDistributable distributable) + public abstract List<T> getDataMaps(DataMapDistributable distributable) throws IOException; /** * Get all distributable objects of a segmentid * @return */ - List<DataMapDistributable> toDistributable(Segment segment); + public abstract List<DataMapDistributable> toDistributable(Segment segment); /** * * @param event */ - void fireEvent(Event event); + public abstract void fireEvent(Event event); /** * Clears datamap of the segment */ - void clear(Segment segment); + public abstract void clear(Segment segment); /** * Clear all datamaps from memory */ - void clear(); + public abstract void clear(); /** * Return metadata of this datamap */ - DataMapMeta getMeta(); + public abstract DataMapMeta getMeta(); /** * Type of datamap whether it is FG or CG */ - DataMapLevel getDataMapType(); + public abstract DataMapLevel getDataMapLevel(); /** * delete datamap data if any */ - void deleteDatamapData(); + public abstract void deleteDatamapData(); /** * This function should return true is the input operation enum will make the datamap become stale */ - boolean willBecomeStale(TableOperation operation); + public abstract boolean willBecomeStale(TableOperation operation); + + /** + * Validate INDEX_COLUMNS property and return a array containing index column name + * Following will be validated + * 1. require INDEX_COLUMNS property + * 2. INDEX_COLUMNS can't contains illegal argument(empty, blank) + * 3. INDEX_COLUMNS can't contains duplicate same columns + * 4. INDEX_COLUMNS should be exists in table columns + */ + public void validateIndexedColumns(DataMapSchema dataMapSchema, --- End diff -- Is it required to be `public`? Maybe `protected` is enough for it. ---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2254#discussion_r185229106 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/DataMapProvider.java --- @@ -42,67 +43,80 @@ * * <br>Currently CarbonData supports following provider: * <ol> - * <li> preaggregate: one type of MVDataMap that do pre-aggregate of single table </li> - * <li> timeseries: one type of MVDataMap that do pre-aggregate based on time dimension - * of the table </li> - * <li> class name of {@link org.apache.carbondata.core.datamap.dev.DataMapFactory} - * implementation: Developer can implement new type of DataMap by extending - * {@link org.apache.carbondata.core.datamap.dev.DataMapFactory} </li> + * <li> preaggregate: pre-aggregate table of single table </li> + * <li> timeseries: pre-aggregate table based on time dimension of the table </li> + * <li> lucene: index backed by Apache Lucene </li> + * <li> bloomfilter: index backed by Bloom Filter </li> * </ol> * * @since 1.4.0 */ @InterfaceAudience.Internal -public interface DataMapProvider { +public abstract class DataMapProvider { + + private CarbonTable mainTable; + private DataMapSchema dataMapSchema; + + public DataMapProvider(CarbonTable mainTable, DataMapSchema dataMapSchema) { + this.mainTable = mainTable; + this.dataMapSchema = dataMapSchema; + } + + protected final CarbonTable getMainTable() { + return mainTable; + } + + protected final DataMapSchema getDataMapSchema() { + return dataMapSchema; + } /** * Initialize a datamap's metadata. * This is called when user creates datamap, for example "CREATE DATAMAP dm ON TABLE mainTable" * Implementation should initialize metadata for datamap, like creating table */ - void initMeta(CarbonTable mainTable, DataMapSchema dataMapSchema, String ctasSqlStatement) - throws MalformedDataMapCommandException, IOException; + public abstract void initMeta(String ctasSqlStatement) throws MalformedDataMapCommandException, + IOException; /** * Initialize a datamap's data. 
* This is called when user creates datamap, for example "CREATE DATAMAP dm ON TABLE mainTable" * Implementation should initialize data for datamap, like creating data folders */ - void initData(CarbonTable mainTable); + public abstract void initData(); /** - * Opposite operation of {@link #initMeta(CarbonTable, DataMapSchema, String)}. + * Opposite operation of {@link #initMeta(String)}. * This is called when user drops datamap, for example "DROP DATAMAP dm ON TABLE mainTable" * Implementation should clean all meta for the datamap */ - void freeMeta(CarbonTable mainTable, DataMapSchema dataMapSchema) throws IOException; + public abstract void freeMeta() throws IOException; --- End diff -- What about the name 'cleanUpMeta' and 'cleanUpData'? --- |
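The lifecycle being renamed here can be sketched as a pair of template methods; the names and types below are simplified stand-ins for `DataMapProvider`, and `onCreateDataMap`/`onDropDataMap` are hypothetical driver methods added only for this illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the provider lifecycle: CREATE DATAMAP drives
// initMeta then initData; DROP DATAMAP drives the opposite pair
// (freeMeta/freeData, or cleanUpMeta/cleanUpData as suggested).
abstract class SimpleProvider {
    abstract void initMeta(String ctasSql);
    abstract void initData();
    abstract void freeMeta();
    abstract void freeData();

    final void onCreateDataMap(String ctasSql) { initMeta(ctasSql); initData(); }
    final void onDropDataMap() { freeData(); freeMeta(); }
}

// Records call order, for illustration only.
class RecordingProvider extends SimpleProvider {
    final List<String> calls = new ArrayList<>();
    void initMeta(String s) { calls.add("initMeta"); }
    void initData() { calls.add("initData"); }
    void freeMeta() { calls.add("freeMeta"); }
    void freeData() { calls.add("freeData"); }
}
```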
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2254#discussion_r185234523 --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/CarbonTable.java --- @@ -986,4 +990,26 @@ public boolean canAllow(CarbonTable carbonTable, TableOperation operation) { return true; } + /** + * Get all index columns specified by dataMapSchema + */ + public List<CarbonColumn> getIndexedColumns(DataMapSchema dataMapSchema) + throws MalformedDataMapCommandException { + String[] columns = DataMapFactory.getIndexColumns(dataMapSchema); + List<CarbonColumn> indexColumn = new ArrayList<>(columns.length); + for (String column : columns) { + CarbonColumn carbonColumn = getColumnByName(getTableName(), column.trim().toLowerCase()); + if (carbonColumn == null) { + throw new MalformedDataMapCommandException(String.format( + "column '%s' does not exist in table. Please check create DataMap statement.", + column)); + } + if (carbonColumn.getColName().isEmpty()) { + throw new MalformedDataMapCommandException( + DataMapFactory.INDEX_COLUMNS + " contains invalid column name"); --- End diff -- I think the `INDEX_COLUMNS` can be moved to `CarbonCommonConstants`. Many other properties (create table/create datamap) are defined there. --- |
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2254#discussion_r185236611 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapWriter.java --- @@ -66,22 +85,22 @@ public DataMapWriter(AbsoluteTableIdentifier identifier, Segment segment, * * @param blockletId sequence number of blocklet in the block */ - public abstract void onBlockletStart(int blockletId); + public abstract void onBlockletStart(int blockletId) throws IOException; /** * End of blocklet notification * * @param blockletId sequence number of blocklet in the block */ - public abstract void onBlockletEnd(int blockletId); + public abstract void onBlockletEnd(int blockletId) throws IOException; /** - * Add the column pages row to the datamap, order of pages is same as `indexColumns` in + * Add row data to the datamap, order of field is same as `indexColumns` in * DataMapMeta returned in DataMapFactory. - * Implementation should copy the content of `pages` as needed, because `pages` memory - * may be freed after this method returns, if using unsafe column page. + * Implementation should copy the content of `row` as needed, because its memory + * may be freed after this method returns, in case of unsafe memory */ - public abstract void onPageAdded(int blockletId, int pageId, ColumnPage[] pages) + public abstract void addRow(int blockletId, int pageId, int rowId, CarbonRow row) --- End diff -- Block-Blocklet-Page-Row are the four levels used to locate a row, so why remove the `Page` level? Besides, why not call it `onRowAdded`? ---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2254#discussion_r185232242 --- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapFactory.java --- @@ -27,74 +29,124 @@ import org.apache.carbondata.core.features.TableOperation; import org.apache.carbondata.core.metadata.schema.table.CarbonTable; import org.apache.carbondata.core.metadata.schema.table.DataMapSchema; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn; import org.apache.carbondata.events.Event; +import org.apache.commons.lang.StringUtils; + /** - * Interface for datamap factory, it is responsible for creating the datamap. + * Interface for datamap of index type, it is responsible for creating the datamap. */ -public interface DataMapFactory<T extends DataMap> { +public abstract class DataMapFactory<T extends DataMap> { + + public static final String INDEX_COLUMNS = "INDEX_COLUMNS"; + protected CarbonTable carbonTable; + + public DataMapFactory(CarbonTable carbonTable) { + this.carbonTable = carbonTable; + } /** * Initialization of Datamap factory with the carbonTable and datamap name */ - void init(CarbonTable carbonTable, DataMapSchema dataMapSchema) + public abstract void init(DataMapSchema dataMapSchema) throws IOException, MalformedDataMapCommandException; /** * Return a new write for this datamap */ - DataMapWriter createWriter(Segment segment, String writeDirectoryPath); + public abstract DataMapWriter createWriter(Segment segment, String shardName) + throws IOException; + + public abstract DataMapRefresher createRefresher(Segment segment, String shardName) + throws IOException; /** * Get the datamap for segmentid */ - List<T> getDataMaps(Segment segment) throws IOException; + public abstract List<T> getDataMaps(Segment segment) throws IOException; /** * Get datamaps for distributable object. 
*/ - List<T> getDataMaps(DataMapDistributable distributable) + public abstract List<T> getDataMaps(DataMapDistributable distributable) throws IOException; /** * Get all distributable objects of a segmentid * @return */ - List<DataMapDistributable> toDistributable(Segment segment); + public abstract List<DataMapDistributable> toDistributable(Segment segment); /** * * @param event */ - void fireEvent(Event event); + public abstract void fireEvent(Event event); /** * Clears datamap of the segment */ - void clear(Segment segment); + public abstract void clear(Segment segment); /** * Clear all datamaps from memory */ - void clear(); + public abstract void clear(); /** * Return metadata of this datamap */ - DataMapMeta getMeta(); + public abstract DataMapMeta getMeta(); /** * Type of datamap whether it is FG or CG */ - DataMapLevel getDataMapType(); + public abstract DataMapLevel getDataMapLevel(); /** * delete datamap data if any */ - void deleteDatamapData(); + public abstract void deleteDatamapData(); /** * This function should return true is the input operation enum will make the datamap become stale */ - boolean willBecomeStale(TableOperation operation); + public abstract boolean willBecomeStale(TableOperation operation); + + /** + * Validate INDEX_COLUMNS property and return a array containing index column name + * Following will be validated + * 1. require INDEX_COLUMNS property + * 2. INDEX_COLUMNS can't contains illegal argument(empty, blank) + * 3. INDEX_COLUMNS can't contains duplicate same columns + * 4. 
INDEX_COLUMNS should be exists in table columns + */ + public void validateIndexedColumns(DataMapSchema dataMapSchema, + CarbonTable carbonTable) throws MalformedDataMapCommandException { + List<CarbonColumn> indexColumns = carbonTable.getIndexedColumns(dataMapSchema); + Set<String> unique = new HashSet<>(); + for (CarbonColumn indexColumn : indexColumns) { + unique.add(indexColumn.getColName()); + } + if (unique.size() != indexColumns.size()) { + throw new MalformedDataMapCommandException(INDEX_COLUMNS + " has duplicate column"); + } + } + + public static String[] getIndexColumns(DataMapSchema dataMapSchema) --- End diff -- Same as above, maybe `protected` scope is enough. ---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2254#discussion_r185233625 --- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/CarbonTable.java --- @@ -986,4 +990,26 @@ public boolean canAllow(CarbonTable carbonTable, TableOperation operation) { return true; } + /** + * Get all index columns specified by dataMapSchema + */ + public List<CarbonColumn> getIndexedColumns(DataMapSchema dataMapSchema) --- End diff -- I think it's better to use `columnNames` as the input parameter rather than `dataMapSchema`. The method in CarbonTable class does not need to know more information than `columnNames` from `dataMapSchema`, so better to pass less information. --- |