[DISCUSSION]Support for Geospatial indexing


VenuReddy
Hi all,

Databases often contain geographical location data. For instance, telecom
operators need to run analytics on a particular region, on cell tower IDs
within a region, and/or on geographical locations over a particular period
of time. At present, Carbon does not have native support for storing
geographical locations/coordinates or for filter queries based on them.
As a workaround, the longitude and latitude of each coordinate can be
stored as independent columns and sorted hierarchically.

But when longitude and latitude are treated independently, the 2D space is
linearized, i.e., points in the two-dimensional domain are ordered by
sorting first on longitude and then on latitude. Thus, data is not ordered
by geospatial proximity. Hence range queries require a lot of IO operations
and query performance is degraded.

To alleviate this, we can use a z-order curve to store geospatial data
points. This tends to place geographically nearby points in the same
block/blocklet, which reduces the IO operations for range queries and
improves query performance. It can also support polygon queries on geodata.
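
As an illustration of the z-order idea, here is a minimal sketch in Python. It is not taken from the design document: the function names, the 16-bit grid resolution, and the choice of putting longitude bits in the even positions are all assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code


def to_cell(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a coordinate in [lo, hi] to an integer grid cell in [0, 2**bits - 1]."""
    cells = 1 << bits
    cell = int((value - lo) / (hi - lo) * cells)
    return min(max(cell, 0), cells - 1)  # clamp so that value == hi stays in range


def z_value(lon: float, lat: float, bits: int = 16) -> int:
    """Hypothetical helper: z-order value of a (longitude, latitude) pair."""
    x = to_cell(lon, -180.0, 180.0, bits)
    y = to_cell(lat, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits)
```

Sorting rows with `sorted(points, key=lambda p: z_value(*p))` would then order them along the curve instead of by longitude first and latitude second.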

I have raised a JIRA, https://issues.apache.org/jira/browse/CARBONDATA-3548,
and attached the design document to it. Please have a look. Your opinions
and suggestions are welcome.

Thanks,

Re: [DISCUSSION]Support for Geospatial indexing

Jacky Li-3
definitely +1.

Before going through the design doc, I have two questions:
1. In this domain, there are some open-source solutions with SQL extensions or DSLs designed for geographical analytics, such as GeoMesa (it also works with Spark). Are there any plans to integrate with GeoMesa as well? Can GeoMesa users benefit from the CarbonData spatial index?

2. Besides the Z-order curve, other curves may be useful in some use cases, like the Hilbert curve. To maximize the extensibility of CarbonData, is it possible to have a framework that supports different curve implementations?

Regards,
Jacky

On 2019/10/16 11:31:35, Venu Reddy <[hidden email]> wrote:

> [...]

Re: [DISCUSSION]Support for Geospatial indexing

chetdb
In reply to this post by VenuReddy
Hi Venu,

1. Would a table with a geospatial location column be allowed to be updated with non-geospatial data, and vice versa? Or would it follow the existing behavior, with any unsupported data in the type/column treated as bad records?
2. Would there be any limitations on using the targetColumn column configured as local dictionary, inverted index, cache column or range column in the table properties?
3. Would only measure data types be supported for the targetDataType parameter? The supported types can be mentioned in the design doc.

Regards
Chetan


Re: [DISCUSSION]Support for Geospatial indexing

Indhumathi
In reply to this post by VenuReddy
Hi Venu,

I have some questions regarding this feature.

1. Is the geospatial index supported on streaming tables? If so, will there be
any impact on generating the geoIndex on streaming data?
2. Does it have any restrictions on sort_scope?
3. Apart from point and polygon queries, will the geospatial index also support
aggregation queries on geographical location data?

Thanks & Regards,
Indhumathi




--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION]Support for Geospatial indexing

VenuReddy
In reply to this post by chetdb
1. Would a table with a geospatial location column be allowed to be updated
with non-geospatial data, and vice versa? Or would it follow the existing
behavior, with any unsupported data in the type/column treated as bad
records?
=> Location columns cannot accept invalid data types. Unsupported data in
the type/column can be treated as bad records.

2. Would there be any limitations on using the targetColumn column
configured as local dictionary, inverted index, cache column or range column
in the table properties?
=> I think there shouldn't be any such restriction. TargetColumn is just an
additional column generated internally when the INDEX property is specified.

3. Would only measure data types be supported for the targetDataType
parameter? The supported types can be mentioned in the design doc.
=> We can treat the generated geohash column as a dimension column, as it
should be part of the sort columns.




Re: [DISCUSSION]Support for Geospatial indexing

Ajantha Bhat
In reply to this post by Indhumathi
Hi Jacky,

We have looked into GeoMesa.

a. GeoMesa is tightly coupled with key-value databases like Accumulo, HBase, Google Bigtable and Cassandra, and is used for OLTP queries.
b. GeoMesa's current Spark integration covers only the query flow; loading from Spark is not supported. Spark can be used for analytics on a GeoMesa store.
They override the Spark Catalyst optimizer code to intercept filters from the logical relation and push them down to the GeoMesa server.
None of the query logic, such as spatio-temporal curve building (z-curve, quadtree), happens in the Spark layer. It happens in the geoserver layer, which is coupled with the key-value databases.
https://www.geomesa.org/documentation/user/architecture.html
https://www.geomesa.org/documentation/user/spark/architecture.html
https://www.youtube.com/watch?v=Otf2jwdNaUY

c. GeoMesa is for spatio-temporal data, not just spatial data.
So we cannot integrate Carbon with GeoMesa directly, but we can reuse some of its logic, like quadtree formation and lookup.

I also found an alternative, "GeoSpark"; this project is not coupled with any store.
https://datasystemslab.github.io/GeoSpark/
https://www.public.asu.edu/~jiayu2/presentation/jia-icde19-tutorial.pdf
So we will look further into integrating Carbon with GeoSpark or reusing some of its code.

Regarding the second point: yes, we can make the Carbon implementation a generic framework into which different logic can be plugged.

Thanks,
Ajantha





On Mon, Oct 21, 2019 at 6:34 PM Indhumathi <[hidden email]> wrote:
> [...]

Re: [DISCUSSION]Support for Geospatial indexing

Jacky Li-3
Thanks for the analysis. Please be careful about reusing code from other open-source repos, especially with regard to licenses.

Regards,
Jacky

On 2019/10/24 06:25:40, Ajantha Bhat <[hidden email]> wrote:

> Hi Jacky,
>
> We have looked into GeoMesa.
>
> [image: Screenshot from 2019-10-23 16-25-23.png]
>
> a. GeoMesa is tightly coupled with key-value databases like Accumulo,
> HBase, Google Bigtable and Cassandra, and is used for OLTP queries.
> b. GeoMesa's current Spark integration covers only the query flow; loading
> from Spark is not supported. Spark can be used for analytics on a GeoMesa
> store.
> They override the Spark Catalyst optimizer code to intercept filters from
> the logical relation and push them down to the GeoMesa server.
> None of the query logic, such as spatio-temporal curve building (z-curve,
> quadtree), happens in the Spark layer. It happens in the geoserver layer,
> which is coupled with the key-value databases.
> https://www.geomesa.org/documentation/user/architecture.html
>
> https://www.geomesa.org/documentation/user/spark/architecture.html
>
> https://www.youtube.com/watch?v=Otf2jwdNaUY
>
> c. GeoMesa is for spatio-temporal data, not just spatial data.
> So we cannot integrate Carbon with GeoMesa directly, but we can reuse
> some of its logic, like quadtree formation and lookup.
>
> I also found an alternative, "GeoSpark"; this project is not coupled
> with any store.
> https://datasystemslab.github.io/GeoSpark/
>
> https://www.public.asu.edu/~jiayu2/presentation/jia-icde19-tutorial.pdf
> So we will look further into integrating Carbon with GeoSpark or reusing
> some of its code.
>
> Regarding the second point: yes, we can make the Carbon implementation
> a generic framework into which different logic can be plugged.
>
> Thanks,
> Ajantha

Re: [DISCUSSION]Support for Geospatial indexing

brijoobopanna
Thanks for proposing this. I would suggest exploring integration with an
already available library like Apache Spatial Information System (SIS)
rather than developing from scratch: https://sis.apache.org/




Re: [DISCUSSION]Support for Geospatial indexing

Jacky Li
I am not familiar with Apache SIS. Is it already integrated with other
storage systems? Is there any pointer to learn more about it?

As I understand it, this thread is discussing the indexing part of
CarbonData, to accelerate geospatial queries. If Apache SIS offers an
integration framework and can provide more APIs for applications, I'd like
to explore it further as a way to broaden CarbonData's usage.

Regards,
Jacky




Re: [DISCUSSION]Support for Geospatial indexing

xuchuanyin
In reply to this post by VenuReddy
Sorry, I cannot access the document in JIRA.

In my opinion, both SORT_COLUMNS in the current implementation and
LOCATION_COLUMNS in the proposal are ways for CarbonData to organize the
data in some order.

So the kernel of the proposal is that, for SORT_COLUMNS, we can specify
an algorithm for the sort order. By default it is the MIN_MAX-INCREASE sort,
and more sorts can be introduced, such as an INVERSE_MIN_MAX-INCREASE sort
or a Z-ORDER sort.

For the implementation, we can define a sort factory and implement some
sorts, while keeping it open for user customization, just like the
column_compressor we implemented before.
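
A sort factory along these lines could look roughly like the Python sketch below. It only illustrates the registry idea; `register_sort`, `get_sort`, `sort_rows` and the algorithm names are hypothetical and are not CarbonData APIs. Rows are assumed to be tuples of non-negative integers.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[int, ...]
SortKey = Callable[[Row], object]

# Hypothetical registry mapping an algorithm name to a sort-key function.
_SORT_ALGORITHMS: Dict[str, SortKey] = {}


def register_sort(name: str, key: SortKey) -> None:
    """Register a sort algorithm; users could add their own the same way."""
    _SORT_ALGORITHMS[name.upper()] = key


def get_sort(name: str) -> SortKey:
    """Look up a registered sort algorithm by (case-insensitive) name."""
    try:
        return _SORT_ALGORITHMS[name.upper()]
    except KeyError:
        raise ValueError(f"unknown sort algorithm: {name}") from None


def _z_order_key(row: Row, bits: int = 16) -> int:
    """Interleave the low `bits` bits of every column into one integer."""
    code = 0
    for i in range(bits):
        for d, value in enumerate(row):
            code |= ((value >> i) & 1) << (len(row) * i + d)
    return code


# Default: plain ascending order on the sort columns, column by column.
register_sort("MIN_MAX_INCREASE", lambda row: row)
register_sort("Z_ORDER", _z_order_key)


def sort_rows(rows: Sequence[Row], algorithm: str) -> List[Row]:
    """Sort rows with the algorithm registered under `algorithm`."""
    return sorted(rows, key=get_sort(algorithm))
```

With this shape, the table-level setting would only need to carry the algorithm name, and the load path would resolve it through the factory.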




Re: [DISCUSSION]Support for Geospatial indexing

VenuReddy
In reply to this post by Indhumathi
@Indhumathi Please find my replies inline.

1. Is the geospatial index supported on streaming tables? If so, will there be
any impact on generating the geoIndex on streaming data?
=> Yes, we can support streaming tables as well. But we shall restrict it
for now and enhance it in the future.
2. Does it have any restrictions on sort_scope?
=> There is no restriction on sort_scope. The existing sort_scope applies
to it unchanged.
3. Apart from point and polygon queries, will the geospatial index also support
aggregation queries on geographical location data?
=> For now, we shall restrict it to polygon queries. But IMO, it can be
extended to multiple types of queries.




Re: [DISCUSSION]Support for Geospatial indexing

VenuReddy
In reply to this post by xuchuanyin
@xuchuanyin
With this new property, we can create a non-schema column internally and
generate a customized value for it, on each row added, from the values of
existing schema columns (i.e., from the source column values). Note that the
source columns are specified with the property.
Since the intent of creating this column is to use it as a sort column too,
we can implicitly add it to the configured sort column list: during table
creation, we append it to the existing sort column list. If we want to
change the order of the sort columns, we can use the existing ALTER TABLE
SET TBLPROPERTIES command for sort columns.

The way data is sorted remains the same as at present, except that the sort
also takes this additional non-schema column into account.
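
To make the flow concrete, here is a small Python model of it. This is only a sketch under assumptions: `TableSketch`, `index_column`, `source_columns` and `generator` are invented names for illustration and do not correspond to CarbonData classes or table properties.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TableSketch:
    """Toy model of a table with one internally generated, non-schema column."""
    schema_columns: List[str]
    sort_columns: List[str]
    index_column: str              # hypothetical name of the generated column
    source_columns: List[str]      # schema columns the value is derived from
    generator: Callable[..., int]  # computes the generated value per row
    rows: List[Dict[str, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # At table creation, implicitly append the generated column
        # to the configured sort column list.
        if self.index_column not in self.sort_columns:
            self.sort_columns = [*self.sort_columns, self.index_column]

    def add_row(self, row: Dict[str, int]) -> None:
        # On each row add, derive the generated value from the source columns.
        stored = dict(row)
        stored[self.index_column] = self.generator(
            *(row[c] for c in self.source_columns)
        )
        self.rows.append(stored)

    def sorted_rows(self) -> List[Dict[str, int]]:
        # Sorting is unchanged: it just sees one more column in the sort key.
        return sorted(
            self.rows, key=lambda r: tuple(r[c] for c in self.sort_columns)
        )
```

Reordering the sort columns afterwards corresponds to assigning a new list to `sort_columns`, which stands in for the alter-table step mentioned above.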




Re: [DISCUSSION]Support for Geospatial indexing

VenuReddy
In reply to this post by VenuReddy
Hi all,

I've refreshed the design document in JIRA. I have incorporated the changes
to the table properties and addressed the review comments.
Please find the latest design doc at
https://issues.apache.org/jira/browse/CARBONDATA-3548
Please review it and let me know your opinion.

Thanks,
Venu Reddy




Re: [DISCUSSION]Support for Geospatial indexing

Ajantha Bhat
Hi Venu,

1. Please keep the default implementation independent of grid size and
other parameters.
I mean the parameters below:
'INDEX_HANDLER.xxx.gridSize',
'INDEX_HANDLER.xxx.minLongitude',
'INDEX_HANDLER.xxx.maxLongitude',
'INDEX_HANDLER.xxx.minLatitude',
'INDEX_HANDLER.xxx.maxLatitude',

It should work with just longitude, latitude and index type, with float as
the default data type for longitude and latitude.
The quadtree logic can be generic: it takes a geohash ID and returns ranges,
so it can work for all implementations.

A custom implementation for grid size and other parameters can be added if
required.
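
One possible reading of the generic quadtree step, sketched in Python: decompose a query rectangle into quadtree cells and return the matching ranges of z-order values. The function name and the rectangle-based signature are assumptions for illustration (the note above describes the input as a geohash ID), and x bits are assumed to occupy the even positions of the z-order value.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive [start, end] range of z-order values


def quadtree_ranges(x0: int, y0: int, x1: int, y1: int, bits: int) -> List[Range]:
    """Cover the inclusive cell rectangle [x0..x1] x [y0..y1] on a
    2**bits x 2**bits grid with sorted, merged ranges of z-order values."""
    ranges: List[Range] = []

    def visit(cx: int, cy: int, size: int, z_start: int) -> None:
        # This quadrant spans cells [cx, cx+size) x [cy, cy+size) and the
        # contiguous z-order values [z_start, z_start + size*size).
        if cx > x1 or cy > y1 or cx + size - 1 < x0 or cy + size - 1 < y0:
            return  # quadrant lies entirely outside the query rectangle
        if x0 <= cx and cx + size - 1 <= x1 and y0 <= cy and cy + size - 1 <= y1:
            start, end = z_start, z_start + size * size - 1
            if ranges and ranges[-1][1] + 1 == start:
                ranges[-1] = (ranges[-1][0], end)  # merge with previous range
            else:
                ranges.append((start, end))
            return
        half = size // 2
        quarter = half * half
        # Visit children in z-order: x bit is the low bit, y bit the high bit.
        visit(cx, cy, half, z_start)
        visit(cx + half, cy, half, z_start + quarter)
        visit(cx, cy + half, half, z_start + 2 * quarter)
        visit(cx + half, cy + half, half, z_start + 3 * quarter)

    visit(0, 0, 1 << bits, 0)
    return ranges
```

A filter on the generated column can then be evaluated as a union of range predicates, and the same routine works regardless of how longitude and latitude were mapped to grid cells.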

2. In describe formatted table, instead of listing them as non-schema
columns, we can show them under Custom Index Information.
It would also be better to show the custom index handler name and the source
columns used in the describe output.

# Custom Index Information
custom index handler class :
custom index handler type :
custom index column name :
custom index column data type :
custom index source columns :

We can skip this section entirely if the property is not configured.

Thanks,
Ajantha



On Tue, Nov 26, 2019 at 8:38 PM VenuReddy <[hidden email]> wrote:

> [...]

Re: [DISCUSSION]Support for Geospatial indexing

VenuReddy
@Ajantha
Agreed. I have updated the design doc as suggested.




Re: [DISCUSSION]Support for Geospatial indexing

Ajantha Bhat
@venu: ok. +1

On Fri, Nov 29, 2019 at 3:48 PM VenuReddy <[hidden email]> wrote:

> [...]

Re: [DISCUSSION]Support for Geospatial indexing

kumarvishal09
+1
Regards
Kumar Vishal


On Fri, Nov 29, 2019 at 4:36 PM Ajantha Bhat <[hidden email]> wrote:

> [...]