Hi all,
In general, a database may contain geographical location data. For instance, telecom operators need to perform analytics for a particular region, for cell tower IDs within a region, and possibly over a particular period of time. At present, Carbon has no native support for storing geographical locations/coordinates or for filter queries based on them. Longitude and latitude can of course be treated as independent columns, sorted hierarchically, and stored.

But when longitude and latitude are treated independently, 2D space is linearized, i.e., points in the two-dimensional domain are ordered by sorting first on longitude and then on latitude. Thus, data is not ordered by geospatial proximity. Hence range queries require a lot of IO operations and query performance is degraded.

To alleviate this, we can use a z-order curve to store geospatial data points. This ensures that geographically nearby points are present in the same block/blocklet, which reduces the IO operations for range queries and improves query performance. It also lets us support polygon queries for geodata.

I have raised JIRA https://issues.apache.org/jira/browse/CARBONDATA-3548 and attached a design document to it. Request you to please have a look. Your opinions and suggestions are welcome.

Thanks,
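For illustration, a z-order (Morton) value is computed by quantizing longitude/latitude onto a grid and interleaving the bits of the two quantized values. A minimal Python sketch follows; the 16-bit grid and whole-world coordinate ranges are illustrative assumptions, not the proposal's actual parameters:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Morton (z-order) code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x occupies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y occupies the odd bit positions
    return z

def z_value(lon: float, lat: float, bits: int = 16,
            min_lon: float = -180.0, max_lon: float = 180.0,
            min_lat: float = -90.0, max_lat: float = 90.0) -> int:
    """Quantize (lon, lat) onto a 2^bits x 2^bits grid and return its z-order code.

    Sorting rows by this value keeps geographically nearby points close
    together in the sorted order, so they land in the same block/blocklet.
    """
    scale = (1 << bits) - 1
    x = int((lon - min_lon) / (max_lon - min_lon) * scale)
    y = int((lat - min_lat) / (max_lat - min_lat) * scale)
    return interleave_bits(x, y, bits)
```

Sorting on `z_value(lon, lat)` rather than on `(lon, lat)` hierarchically is what keeps range scans local: a 2D rectangle maps to a small number of contiguous z-value ranges instead of many scattered ones.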
Definitely +1.

Before going through the design doc, I have two questions:

1. In this domain, there are some open-source solutions with SQL extensions or DSLs designed for geospatial analytics, such as GeoMesa (it also works with Spark). Has integration with GeoMesa been considered? Could GeoMesa users benefit from a CarbonData spatial index?
2. Besides the Z-order curve, other curves may be useful in some use cases, such as the Hilbert curve. To maximize extensibility for CarbonData, is it possible to have a framework that supports different curve implementations?

Regards,
Jacky

On 2019/10/16 11:31:35, Venu Reddy <[hidden email]> wrote:
In reply to this post by VenuReddy
Hi Venu,
1. Would a table with a geospatial location column be allowed to be updated with non-geospatial data, and vice versa? Or would it follow the existing behavior, where any unsupported data in the type/column is treated as bad records?
2. Would there be any limitations on using the targetColumn column configured as a local dictionary, inverted index, cache column, or range column in table properties?
3. Would only measure data types be supported for the targetDataType parameter? The supported types can be mentioned in the design doc.

Regards,
Chetan

On 2019/10/16 11:31:35, Venu Reddy <[hidden email]> wrote:
In reply to this post by VenuReddy
Hi Venu,
I have some questions regarding this feature.

1. Does the geospatial index support streaming tables? If so, will there be any impact on generating the geoIndex on streaming data?
2. Does it have any restrictions on sort_scope?
3. Apart from point and polygon queries, will the geospatial index also support aggregation queries on geographical location data?

Thanks & Regards,
Indhumathi

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
In reply to this post by chetdb
1. Would a table with a geospatial location column be allowed to be updated with non-geospatial data, and vice versa? Or would it follow the existing behavior, where any unsupported data in the type/column is treated as bad records?
=> Location columns cannot be updated with invalid datatypes; unsupported data in the type/column can be treated as bad records.

2. Would there be any limitations on using the targetColumn column configured as a local dictionary, inverted index, cache column, or range column in table properties?
=> I think there shouldn't be any such restriction. targetColumn is just an additional column generated internally when the INDEX property is specified.

3. Would only measure data types be supported for the targetDataType parameter? The supported types can be mentioned in the design doc.
=> We can treat the generated geohash column as a dimension column, as it should be part of the sort columns.
In reply to this post by Indhumathi
Hi Jacky, we have checked GeoMesa:

a. GeoMesa is tightly coupled with key-value databases such as Accumulo, HBase, Google Bigtable, and Cassandra, and is used for OLTP queries.
b. GeoMesa's current Spark integration is only in the query flow; loading from Spark is not supported. Spark can be used for analytics on a GeoMesa store. They override Spark Catalyst optimizer code to intercept filters from the logical relation and push them down to the GeoMesa server. The query logic, such as spatio-temporal curve building (z-curve, quadtree), doesn't happen at the Spark layer; it happens in the GeoServer layer, which is coupled with the key-value databases.
https://www.geomesa.org/documentation/user/architecture.html
https://www.geomesa.org/documentation/user/spark/architecture.html
https://www.youtube.com/watch?v=Otf2jwdNaUY
c. GeoMesa is for spatio-temporal data, not just spatial data.

So we cannot integrate Carbon with GeoMesa directly, but we can reuse some of the logic present in it, such as quadtree formation and lookup.

I also found another alternative, GeoSpark; this project is not coupled with any store.
https://datasystemslab.github.io/GeoSpark/
https://www.public.asu.edu/~jiayu2/presentation/jia-icde19-tutorial.pdf
We will check further about integrating Carbon with GeoSpark or reusing some of its code.

Also, regarding the second point: yes, we can have the Carbon implementation as a generic framework where different logic can be plugged in.

Thanks,
Ajantha

On Mon, Oct 21, 2019 at 6:34 PM Indhumathi <[hidden email]> wrote:
Thanks for the analysis. Please be careful about reusing code from other open-source repos, especially regarding the license.
Regards,
Jacky

On 2019/10/24 06:25:40, Ajantha Bhat <[hidden email]> wrote:
Thanks for proposing. I would suggest exploring and considering integration with an already available library such as Apache Spatial Information System rather than developing from scratch: https://sis.apache.org/
I am not familiar with Apache SIS. Is it already integrated with other storage systems? Is there any pointer to learn about this?

In my opinion, this thread is discussing the indexing part of CarbonData, to accelerate geospatial-related queries. If Apache SIS offers an integration framework and can provide more APIs for applications, I'd like to explore more possibilities to enlarge CarbonData's usage.

Regards,
Jacky
In reply to this post by VenuReddy
Sorry that I cannot access the document in jira.
In my opinion, both SORT_COLUMNS in the current implementation and LOCATION_COLUMNS in the proposal are ways for CarbonData to organize the data in some order. So the core of the proposal is that, for SORT_COLUMNS, we can specify an algorithm for the sort order. By default it is a MIN_MAX-INCREASE sort, and more sorts can be introduced, such as an INVERSE_MIN_MAX-INCREASE sort or a Z-ORDER sort.

For the implementation, we can define a sort factory, implement some orders, and keep it open for users' customization, just like the column_compressor we implemented before.
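To make that extension point concrete, here is a hypothetical Python sketch of such a sort factory. All class and registry names are illustrative only; they just mirror the column_compressor-style pluggability described above:

```python
class SortOrder:
    """Pluggable sort-order strategy (the proposed extension point)."""
    def key(self, row):
        raise NotImplementedError

class LexicographicSort(SortOrder):
    """Default: hierarchical sort, first column then second column."""
    def key(self, row):
        return tuple(row)

class ZOrderSort(SortOrder):
    """Sort rows by the Morton code of their first two integer columns."""
    def key(self, row):
        x, y = row[0], row[1]
        z = 0
        for i in range(16):
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z

# The factory: named strategies, open for user registration.
SORT_REGISTRY = {"min_max_increase": LexicographicSort, "z_order": ZOrderSort}

def sort_rows(rows, order_name="min_max_increase"):
    order = SORT_REGISTRY[order_name]()
    return sorted(rows, key=order.key)
```

A user-defined curve (e.g. Hilbert) would then just be another `SortOrder` subclass registered under a new name, with no change to the sort flow itself.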
In reply to this post by Indhumathi
@Indhumathi Please find my reply inline
1. Does the geospatial index support streaming tables? If so, will there be any impact on generating the geoIndex on streaming data?
=> Yes, we can support streaming tables as well. But we shall restrict it for now and enhance it in the future.

2. Does it have any restrictions on sort_scope?
=> There is no restriction on sort_scope; the same existing sort_scope applies to it.

3. Apart from point and polygon queries, will the geospatial index also support aggregation queries on geographical location data?
=> For now, we shall restrict it to polygon queries. But IMO, it can be extended to multiple types of queries.
In reply to this post by xuchuanyin
@xuchuanyin
With this new property, we can internally create a non-schema column and generate a customized value for it on each row add, computed from the values of existing schema columns (i.e., from the source columns specified with the property). Since the intent of this column is also to use it as a sort column, we can implicitly append it to the configured sort-column list during table creation. And if we want to change the order of sort columns, we can use the existing ALTER TABLE SET table properties for sort columns. The way data is sorted is the same as at present, except that the sort also takes this non-schema column into account.
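A rough sketch of that flow, with hypothetical names (`geo_index`, `concat_handler`) standing in for the real generated-column name and index handler:

```python
def add_generated_index_column(schema, sort_columns, rows, source_columns, handler):
    """Append an internally generated, non-schema index column.

    The column's value is computed per row from the configured source
    columns, and the new column is implicitly appended to the existing
    sort-column list, as described above.
    """
    positions = [schema.index(c) for c in source_columns]
    rows_with_index = [row + (handler(*(row[p] for p in positions)),) for row in rows]
    return schema + ["geo_index"], sort_columns + ["geo_index"], rows_with_index

# Stand-in for the real geohash/z-order generator: pack the two source
# values into one integer (illustrative only).
def concat_handler(lon, lat):
    return (int(lon) << 16) | int(lat)
```

The existing sort step then runs unchanged; it simply sees one more (trailing) sort column.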
In reply to this post by VenuReddy
Hi all,
I've refreshed the design document in JIRA, incorporating changes to the table properties and fixing review comments. Please find the latest design doc at https://issues.apache.org/jira/browse/CARBONDATA-3548, review it, and let me know your opinion.

Thanks,
Venu Reddy
Hi Venu,
1. Please keep the default implementation independent of grid size and other parameters. I mean the parameters below:
'INDEX_HANDLER.xxx.gridSize',
'INDEX_HANDLER.xxx.minLongitude',
'INDEX_HANDLER.xxx.maxLongitude',
'INDEX_HANDLER.xxx.minLatitude',
'INDEX_HANDLER.xxx.maxLatitude'
It should work on just longitude and latitude, with the index type and float data type as the default for longitude and latitude. The quadtree logic can be generic: it takes a geohash ID and returns ranges, so it can work for all implementations. A custom implementation for grid size and the other parameters can be added if required.

2. In the DESCRIBE FORMATTED table output, instead of non-schema columns, show it as Custom Index Information, and it would be better to also show the custom index handler name and the source columns used:
# Custom Index Information
custom index handler class:
custom index handler type:
custom index column name:
custom index column data type:
custom index source columns:
We can skip the display entirely if the property is not configured.

Thanks,
Ajantha

On Tue, Nov 26, 2019 at 8:38 PM VenuReddy <[hidden email]> wrote:
@Ajantha
Agreed. I have updated the design doc as suggested.
@venu: ok. +1
On Fri, Nov 29, 2019 at 3:48 PM VenuReddy <[hidden email]> wrote:
+1
Regards,
Kumar Vishal

On Fri, Nov 29, 2019 at 4:36 PM Ajantha Bhat <[hidden email]> wrote: