Data Science: Uniform Python APIs for Arrays and Data Frames planned

The newly founded Consortium for Python Data API Standards wants to create uniform interfaces for machine learning frameworks and Python libraries. The focus is on arrays or tensors on the one hand and on dataframes on the other.

The initiator of the consortium is Quansight Labs, and the initial sponsors are Intel, Microsoft, the D. E. Shaw Group, Google Research, and Quansight. In the long term, the consortium is meant to grow into an umbrella organization for projects and the wider ecosystem that addresses and standardizes APIs and mechanisms for data exchange.

Fragmented libraries

Many functions appear in numerous libraries in similar form but differ in the details. As an example, the blog post marking the launch of the consortium shows how the individual libraries implement the function for computing the arithmetic mean over an array:

numpy: mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
dask.array: mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
cupy: mean(a, axis=None, dtype=None, out=None, keepdims=False)
jax.numpy: mean(a, axis=None, dtype=None, out=None, keepdims=False)
mxnet.np: mean(a, axis=None, dtype=None, out=None, keepdims=False)
sparse: s.mean(axis=None, keepdims=False, dtype=None, out=None)
torch: mean(input, dim, keepdim=False, out=None)
tensorflow: reduce_mean(input_tensor, axis=None, keepdims=None, name=None, reduction_indices=None, keep_dims=None)

Apart from the default value for keepdims, the first six functions have largely identical signatures. Harder to spot, however, are cases where identical signatures hide different semantics. MXNet even documents its deviations from NumPy explicitly: only an ndarray is allowed as the array parameter, and the default data type for numbers is float32.
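The fragmentation is easy to reproduce. The following minimal sketch computes the same column-wise mean in NumPy, PyTorch, and TensorFlow; the differing parameter names (axis/keepdims versus dim/keepdim) and the different TensorFlow function name come from the signatures listed above, while the sample data is made up for illustration:

import numpy as np
import tensorflow as tf
import torch

data = [[1.0, 2.0], [3.0, 4.0]]

# NumPy: axis/keepdims
np.mean(np.array(data), axis=0, keepdims=True)            # array([[2., 3.]])

# PyTorch: the same reduction, but with dim/keepdim and a mandatory dim
torch.mean(torch.tensor(data), dim=0, keepdim=True)       # tensor([[2., 3.]])

# TensorFlow: a different function name on top of different defaults
tf.reduce_mean(tf.constant(data), axis=0, keepdims=True)  # [[2., 3.]]

The results agree; only the spelling of the call differs, which is exactly the kind of divergence the consortium wants to standardize away.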

Conservative decisions

The consortium first wants to examine what needs standardizing and is starting with a requirements engineering phase. The goal is to identify which areas call for standardization and how the standards should be implemented.

To avoid sprawl from too many special cases, the standardization process is to focus on functions that exist in most libraries in some form. In addition, the consortium wants to evaluate which functions data scientists actually use in practice. For the latter, the GitHub repository for the data API standards contains a tool that determines which Python modules another module uses.
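The article does not describe how that tool works internally. As a rough illustration of the general idea only, not the consortium's code, the following sketch statically collects the top-level modules that a piece of Python source imports:

import ast

def imported_modules(source: str) -> set:
    """Collect the top-level module names a piece of Python source imports."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules

print(imported_modules("import numpy as np\nfrom pandas import DataFrame"))
# prints {'numpy', 'pandas'} (set order may vary)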

API comparison and cooperation

Another tool in the repository reads and processes the publicly available HTML documentation of the array libraries, compares the available functions and their signatures, and presents the result in an HTML table.
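As a hedged illustration of what such a comparison involves (the repository's tool parses the published HTML documentation instead, and this sketch assumes NumPy and Dask are installed), parameter lists can also be read directly from live functions with the standard inspect module:

import inspect

import dask.array as da
import numpy as np

# Sketch: list the parameter names of mean() in two array libraries.
for lib, fn in [("numpy", np.mean), ("dask.array", da.mean)]:
    try:
        params = list(inspect.signature(fn).parameters)
    except ValueError:  # C-implemented callables may expose no signature
        params = ["<signature unavailable>"]
    print(lib, params)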

Output of the APIs determined with the API comparison tool, using the intersection view.

In addition, the initiators of the standardization process hope for participation from the developers of the major Python libraries and frameworks as well as from the community. By September 15, the consortium wants to publish the RFC (Request for Comments) for the array APIs and submit it to the community review process. The RFC for dataframe APIs is to follow on November 15. Further details can be found in the blog post marking the launch of the consortium.
