GitHub user sounakr opened a pull request:
https://github.com/apache/carbondata/pull/1359

[CARBONDATA-1480] Min Max DataMap Example. Implementation of a Min Max index through the DataMap interface, and use of the index while pruning.

---

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sounakr/incubator-carbondata minmax

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1359.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1359

----

commit a46e3b7c609e070f052017edabef9355668cf00a
Author: sounakr <[hidden email]>
Date:   2017-09-13T11:57:23Z

    Min Max DataMap

----

---
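For readers new to the feature, the following minimal sketch illustrates the pruning idea behind a min/max index: a block is scanned only if the filter value can fall inside that block's recorded [min, max] range. The class and method names are illustrative simplifications, not code from this PR.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of min/max pruning; BlockMinMax and the long-typed column
// values are hypothetical simplifications, not classes from this PR.
public class MinMaxPruneSketch {

  static class BlockMinMax {
    final String blockId;
    final long min;
    final long max;

    BlockMinMax(String blockId, long min, long max) {
      this.blockId = blockId;
      this.min = min;
      this.max = max;
    }
  }

  // Keep only the blocks whose [min, max] range can contain the filter value;
  // every other block is skipped (pruned) without being read.
  static List<String> prune(List<BlockMinMax> index, long filterValue) {
    List<String> hits = new ArrayList<>();
    for (BlockMinMax entry : index) {
      if (filterValue >= entry.min && filterValue <= entry.max) {
        hits.add(entry.blockId);
      }
    }
    return hits;
  }
}

---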
Github user QACarbonData commented on the issue:
https://github.com/apache/carbondata/pull/1359 Build Success with Spark 1.6, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/32/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1359 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/153/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1359 SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/782/ ---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139058897

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapWriter.java ---

@@ -32,7 +32,12 @@
   /**
    * End of block notification
    */
-  void onBlockEnd(String blockId);
+  void onBlockEnd(String blockId, String directoryPath);
+
+  /**
+   * End of block notification when index got created.
+   */
+  void onBlockEndWithIndex(String blockId, String directoryPath);
--- End diff --

Why is this method required? Isn't `onBlockEnd` enough?

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139059342

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMap.java ---

@@ -31,7 +31,8 @@
   /**
    * It is called to load the data map to memory or to initialize it.
    */
-  void init(String filePath) throws MemoryException, IOException;
+  void init(String blockletIndexPath, String customIndexPath, String segmentId)
--- End diff --

The `filePath` is supposed to be either the index folder name or the index file name, so I don't think this extra information is required here. Also, `blockletIndexPath` should not be passed, since the carbonindex already exists in the other datamap and we are supposed to use that.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139059518

--- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/MinMaxDataMapFactory.java ---

@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.examples;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.carbondata.core.cache.Cache;
+import org.apache.carbondata.core.cache.CacheProvider;
+import org.apache.carbondata.core.cache.CacheType;
+import org.apache.carbondata.core.datamap.DataMapDistributable;
+import org.apache.carbondata.core.datamap.DataMapMeta;
+import org.apache.carbondata.core.datamap.TableDataMap;
+import org.apache.carbondata.core.datamap.dev.DataMap;
+import org.apache.carbondata.core.datamap.dev.DataMapFactory;
+import org.apache.carbondata.core.datamap.dev.DataMapWriter;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFileFilter;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.events.ChangeEvent;
+import org.apache.carbondata.core.indexstore.TableBlockIndexUniqueIdentifier;
+import org.apache.carbondata.core.indexstore.blockletindex.BlockletDataMap;
+import org.apache.carbondata.core.indexstore.schema.FilterType;
+import org.apache.carbondata.core.memory.MemoryException;
+import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier;
+
+
+/**
+ * Table map for blocklet
+ */
+public class MinMaxDataMapFactory implements DataMapFactory {
+
+  private AbsoluteTableIdentifier identifier;
+
+  // segmentId -> list of index file
+  private Map<String, List<TableBlockIndexUniqueIdentifier>> segmentMap = new HashMap<>();
+
+  private Cache<TableBlockIndexUniqueIdentifier, DataMap> cache;
+
+  @Override
+  public void init(AbsoluteTableIdentifier identifier, String dataMapName) {
+    this.identifier = identifier;
+    cache = CacheProvider.getInstance()
--- End diff --

What is the use of this cache when it is not used anywhere?

---
Github user sounakr commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139068734

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapWriter.java ---

@@ -32,7 +32,12 @@
   /**
    * End of block notification
    */
-  void onBlockEnd(String blockId);
+  void onBlockEnd(String blockId, String directoryPath);
+
+  /**
+   * End of block notification when index got created.
+   */
+  void onBlockEndWithIndex(String blockId, String directoryPath);
--- End diff --

The onBlockEnd method is called once the block is written. onBlockEndWithIndex is called once the index has also been written, after the carbondata file is written out.

---
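If the interface stays as proposed, a writer might split its work across the two hooks roughly as in the sketch below. This is only a sketch of the flow described above: writeMinMaxFile is a hypothetical helper, and the remaining DataMapWriter callbacks are omitted.

// Rough sketch of the two hooks discussed above (hypothetical helper names;
// other DataMapWriter callbacks omitted).
public class MinMaxWriterSketch {

  // Called once the carbondata block is written; the carbonindex file for
  // this block does not exist yet, so only in-memory state is updated here.
  public void onBlockEnd(String blockId, String directoryPath) {
    // e.g. close the per-block min/max accumulators
  }

  // Called after the carbonindex file has also been written, so the writer
  // can read it here and then persist the custom min/max index next to it.
  public void onBlockEndWithIndex(String blockId, String directoryPath) {
    writeMinMaxFile(blockId, directoryPath);
  }

  private void writeMinMaxFile(String blockId, String directoryPath) {
    // Hypothetical: serialize the accumulated min/max values for this block
    // into a custom index file under directoryPath.
  }
}

---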
Github user sounakr commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139092331

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMap.java ---

@@ -31,7 +31,8 @@
   /**
    * It is called to load the data map to memory or to initialize it.
    */
-  void init(String filePath) throws MemoryException, IOException;
+  void init(String blockletIndexPath, String customIndexPath, String segmentId)
--- End diff --

For Min Max index creation I also take input, such as segment properties and other information, from the regular carbonindex file. So, by design, one parameter can be the primitive index path and the other can be the path of the new custom index file.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139122030

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapWriter.java ---

@@ -32,7 +32,12 @@
   /**
    * End of block notification
    */
-  void onBlockEnd(String blockId);
+  void onBlockEnd(String blockId, String directoryPath);
+
+  /**
+   * End of block notification when index got created.
+   */
+  void onBlockEndWithIndex(String blockId, String directoryPath);
--- End diff --

I did not get the meaning of "index" here; it is supposed to be independent of other indexes. I think the onBlockEnd event is enough for writing the index file.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139122092

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMap.java ---

@@ -31,7 +31,8 @@
   /**
    * It is called to load the data map to memory or to initialize it.
    */
-  void init(String filePath) throws MemoryException, IOException;
+  void init(String blockletIndexPath, String customIndexPath, String segmentId)
--- End diff --

It should be independent of other indexes.

---
Github user sounakr commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139123481

--- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/MinMaxDataMapFactory.java ---

@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.examples;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.carbondata.core.cache.Cache;
+import org.apache.carbondata.core.cache.CacheProvider;
+import org.apache.carbondata.core.cache.CacheType;
+import org.apache.carbondata.core.datamap.DataMapDistributable;
+import org.apache.carbondata.core.datamap.DataMapMeta;
+import org.apache.carbondata.core.datamap.TableDataMap;
+import org.apache.carbondata.core.datamap.dev.DataMap;
+import org.apache.carbondata.core.datamap.dev.DataMapFactory;
+import org.apache.carbondata.core.datamap.dev.DataMapWriter;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
+import org.apache.carbondata.core.datastore.filesystem.CarbonFileFilter;
+import org.apache.carbondata.core.datastore.impl.FileFactory;
+import org.apache.carbondata.core.events.ChangeEvent;
+import org.apache.carbondata.core.indexstore.TableBlockIndexUniqueIdentifier;
+import org.apache.carbondata.core.indexstore.blockletindex.BlockletDataMap;
+import org.apache.carbondata.core.indexstore.schema.FilterType;
+import org.apache.carbondata.core.memory.MemoryException;
+import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier;
+
+
+/**
+ * Table map for blocklet
+ */
+public class MinMaxDataMapFactory implements DataMapFactory {
+
+  private AbsoluteTableIdentifier identifier;
+
+  // segmentId -> list of index file
+  private Map<String, List<TableBlockIndexUniqueIdentifier>> segmentMap = new HashMap<>();
+
+  private Cache<TableBlockIndexUniqueIdentifier, DataMap> cache;
+
+  @Override
+  public void init(AbsoluteTableIdentifier identifier, String dataMapName) {
+    this.identifier = identifier;
+    cache = CacheProvider.getInstance()
--- End diff --

Removed.

---
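With the unused cache removed, the factory's init presumably reduces to something like the sketch below. The field names follow the quoted example, but this is not the final code from the PR, and the other DataMapFactory methods are omitted.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.carbondata.core.indexstore.TableBlockIndexUniqueIdentifier;
import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier;

// Sketch only: DataMapFactory methods other than init are omitted.
public class MinMaxDataMapFactorySketch {

  private AbsoluteTableIdentifier identifier;

  // segmentId -> list of index files for that segment
  private Map<String, List<TableBlockIndexUniqueIdentifier>> segmentMap = new HashMap<>();

  public void init(AbsoluteTableIdentifier identifier, String dataMapName) {
    // Only the table identifier is kept; the cache is no longer created here.
    this.identifier = identifier;
  }
}

---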
Github user sounakr commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139123880

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMapWriter.java ---

@@ -32,7 +32,12 @@
   /**
    * End of block notification
    */
-  void onBlockEnd(String blockId);
+  void onBlockEnd(String blockId, String directoryPath);
+
+  /**
+   * End of block notification when index got created.
+   */
+  void onBlockEndWithIndex(String blockId, String directoryPath);
--- End diff --

But during onBlockEnd the carbonindex is not yet written, so we won't be able to access the carbonindex files. In the example I gather information from the carbonindex files too, so it is better to also keep a hook after the index files are written. In future we may need more hooks at different points.

---
Github user sounakr commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1359#discussion_r139124564

--- Diff: core/src/main/java/org/apache/carbondata/core/datamap/dev/DataMap.java ---

@@ -31,7 +31,8 @@
   /**
    * It is called to load the data map to memory or to initialize it.
   */
-  void init(String filePath) throws MemoryException, IOException;
+  void init(String blockletIndexPath, String customIndexPath, String segmentId)
--- End diff --

In this example, along with the min and max information, I keep some additional information for building the blocklet. Both indexes are independent, but in the current example implementation I read the min/max index and then also read the carbonindex in order to get the column cardinality and segment properties. These values are used to form the blocklet used for pruning.

---
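To make the described flow concrete, here is a rough sketch of what such an init could look like under the proposed three-argument signature. The helper methods loadMinMaxIndex and loadSegmentProperties are hypothetical, not the PR's actual code.

import java.io.IOException;

import org.apache.carbondata.core.memory.MemoryException;

// Sketch only: the other DataMap methods (e.g. prune, clear) are omitted.
public class MinMaxDataMapSketch {

  public void init(String blockletIndexPath, String customIndexPath, String segmentId)
      throws MemoryException, IOException {
    // 1. Load the custom min/max entries written by the min/max DataMapWriter.
    loadMinMaxIndex(customIndexPath);
    // 2. Read the regular carbonindex file to obtain column cardinality and
    //    segment properties, which are needed to build the blocklets used for pruning.
    loadSegmentProperties(blockletIndexPath, segmentId);
  }

  private void loadMinMaxIndex(String customIndexPath) throws IOException {
    // Hypothetical: deserialize per-blocklet min/max values from the custom index file.
  }

  private void loadSegmentProperties(String blockletIndexPath, String segmentId)
      throws IOException {
    // Hypothetical: read the carbonindex file to derive column cardinality
    // and segment properties for this segment.
  }
}

---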
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1359 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/171/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1359 Build Success with Spark 1.6, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/47/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1359 SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/801/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1359 Build Success with Spark 1.6, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/105/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1359 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/229/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1359 SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/860/ ---