Apache CarbonData Dev Mailing List archive

[Discussion]SDK support to load data from parquet, ORC, CSV, Avro and JSON file.

Nihal

[Discussion]SDK support to load data from parquet, ORC, CSV, Avro and JSON file.

*Background*: This feature will enable the CarbonData SDK to load data from
Parquet, ORC, CSV, Avro and JSON files.

Details of solution and implementation are mentioned in the document
attached to JIRA.
https://issues.apache.org/jira/browse/CARBONDATA-3855

NOTE: This design handles loading data from Parquet, ORC, CSV, and Avro files.
JSON files will be handled later.

Thanks,
Nihal Ojha

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

xubo245

Re: [Discussion]SDK support to load data from parquet, ORC, CSV, Avro and JSON file.

+1,

If the CarbonData SDK can support loading data from Parquet, ORC, CSV, Avro
and JSON files, it will be more convenient for users to use CarbonData. It
saves every user from having to parse the different file formats and convert
them to the carbondata format in their own code.

The CarbonData SDK can refer to the spark-sql implementation, but it should
not depend on Spark to implement this function, because the CarbonData SDK is
usually used in non-Spark environments. CarbonData can integrate the Parquet,
JSON and CSV parser SDKs to implement it.

For API design:

1. For loading parquet, avro, json and csv files:
(1) It's better to support loading data multiple times after the CarbonWriter
is created.
(2) It's better to support a schema map. For example, carbondata and parquet
may have different schemas, such as different column names: a column is named
col1 in carbon but col2 in parquet. It's better for the CarbonData SDK to
support a schema map that maps parquet col2 to carbon col1.

2. API name:

public CarbonWriterBuilder withParquetFile(String filePath)

=> suggestion:
A folder may contain many parquet files, and withParquetFile is easily
misunderstood as meaning a single parquet file, so we can use
withParquet(String filePath) or withParquetPath.

The same applies to withOrcFile and withCsvFile.

It is also better to support loading a list of files. For example, the
CarbonData SDK could load three of the five files in the same folder, as
selected by the user.

3. public void write() throws IOException;
=> We should support loading data after the CarbonWriter is created, so that
the data is loaded into memory and converted to carbon when the load method
is called.

4. Currently we write data to disk/OBS/HDFS by calling the close method; it's
better to support flush as well.
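
The schema map suggested in point 1 (2) above could look roughly like the
following. This is only a minimal standalone sketch, not part of the CarbonData
SDK: the class SchemaMapSketch and the method remapColumns are hypothetical
names, and a row is modelled as a plain map from column name to value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaMapSketch {

    // Hypothetical helper: renames the columns of one source row according to
    // a source-to-carbon column mapping. Columns without a mapping entry keep
    // their original name.
    public static Map<String, Object> remapColumns(
            Map<String, Object> sourceRow, Map<String, String> sourceToCarbon) {
        Map<String, Object> carbonRow = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : sourceRow.entrySet()) {
            String carbonName =
                sourceToCarbon.getOrDefault(entry.getKey(), entry.getKey());
            carbonRow.put(carbonName, entry.getValue());
        }
        return carbonRow;
    }

    public static void main(String[] args) {
        // Map parquet column col2 to carbon column col1, as in the example above.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("col2", "col1");

        Map<String, Object> parquetRow = new LinkedHashMap<>();
        parquetRow.put("col2", 42);
        parquetRow.put("name", "a");

        System.out.println(remapColumns(parquetRow, mapping));
    }
}
```

In a real implementation the mapping would more likely be applied once to the
schema rather than to every row; the per-row form is used here only to keep
the example short.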

Nihal

Re: [Discussion]SDK support to load data from parquet, ORC, CSV, Avro and JSON file.

I have updated the design document in Jira after addressing the latest
comments.
1. Changed the API name as suggested.
2. The user can now load a single file, all files in a given directory, or
selected files under the directory.
3. First we create the carbon writer, and then we call the write() method. In
this method, we read the data from the file iteratively and write it in
carbon format.
4. We are using the current interface for the close() method. We will support
flush later.
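
As an illustration of point 2 above, the three input modes (a single file, all
files in a directory, or selected files under it) could be resolved roughly as
follows. This is a sketch under assumptions and not the code from the design
document: the class InputFileResolverSketch, the method resolveInputFiles and
its parameters are hypothetical, and only the JDK's java.nio.file API is used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class InputFileResolverSketch {

    // Hypothetical helper: if path is a regular file, return just that file.
    // If it is a directory, return the regular files in it that end with the
    // given extension; when selectedNames is non-empty, keep only those names.
    public static List<Path> resolveInputFiles(
            Path path, String extension, Set<String> selectedNames) throws IOException {
        if (Files.isRegularFile(path)) {
            return Collections.singletonList(path);
        }
        try (Stream<Path> entries = Files.list(path)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(extension))
                .filter(p -> selectedNames.isEmpty()
                    || selectedNames.contains(p.getFileName().toString()))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a temporary folder with a few empty placeholder files.
        Path dir = Files.createTempDirectory("carbon-sdk-sketch");
        for (String name : new String[] {"a.parquet", "b.parquet", "c.parquet", "notes.txt"}) {
            Files.createFile(dir.resolve(name));
        }

        // Select two of the three parquet files in the folder.
        Set<String> selected = new HashSet<>(Arrays.asList("a.parquet", "c.parquet"));
        for (Path file : resolveInputFiles(dir, ".parquet", selected)) {
            // A real loader would read rows from each file here, one at a
            // time, and write them in carbon format.
            System.out.println(file.getFileName());
        }
    }
}
```

Passing an empty set of names selects every matching file in the directory,
which corresponds to the "all files" case.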

Thanks and Regards,
Nihal Ojha
