lleaves.data_processing module

lleaves.data_processing.data_to_ndarray(data, pd_traintime_categories: List[List] | None = None)

Convert the given data to a numpy ndarray.

For pandas DataFrames, categorical values are mapped to floats. This mapping must be the same as it was during model training, which is ensured by passing the category lists stored under the model's pandas_categorical field as pd_traintime_categories.

Example for two columns with two categories each: pd_traintime_categories = [["a", "b"], ["b", "a"]]. These are two different columns and result in different mappings: “a” -> 0.0, “b” -> 1.0, vs “b” -> 0.0, “a” -> 1.0.
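The two mappings above can be sketched in plain Python. This is only an illustration of the idea, not lleaves' own code: the helper names build_category_mappings and encode_row are hypothetical, and mapping an unseen category to NaN is an assumption made for this sketch.

```python
import math

def build_category_mappings(pd_traintime_categories):
    """Hypothetical helper: one {category: float index} dict per categorical column."""
    return [
        {category: float(index) for index, category in enumerate(categories)}
        for categories in pd_traintime_categories
    ]

def encode_row(row, mappings):
    """Hypothetical helper: encode one row of categorical values.

    Assumption for this sketch: categories not seen in training become NaN.
    """
    return [mapping.get(value, math.nan) for value, mapping in zip(row, mappings)]

pd_traintime_categories = [["a", "b"], ["b", "a"]]
mappings = build_category_mappings(pd_traintime_categories)

# The same value "a" is encoded differently depending on the column.
encoded = encode_row(["a", "a"], mappings)
print(encoded)
```

The position of a category within its list determines the float it is mapped to, which is why both the column order and the category order have to match the training data.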

LightGBM generates this list of lists at traintime like so:

pd_traintime_categories = [
  list(df[col].cat.categories)
  for col in df.select_dtypes(include=['category']).columns
]

The result is appended via json.dump to the model.txt under the ‘pandas_categorical’ key. You can extract it from there using lleaves.data_processing.extract_pandas_traintime_categories().

Parameters:

data – Pandas DataFrame, numpy array or Python list. No dimension checking occurs. If a DataFrame is passed, the number of its categorical columns must equal len(pd_traintime_categories).
pd_traintime_categories – For each categorical column in dataframe, a list of its categories. The ordering of columns and of categories within each column should match the training dataset. Ignored if data is not a pandas DataFrame.

Returns:

numpy ndarray
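A minimal sketch of what such a DataFrame conversion could look like, assuming pandas and numpy are installed. It does not call lleaves; the function name dataframe_to_ndarray_sketch is hypothetical, and encoding unseen categories as NaN is an assumption of this sketch rather than documented behavior.

```python
import numpy as np
import pandas as pd

def dataframe_to_ndarray_sketch(df, pd_traintime_categories):
    """Hypothetical stand-in: map categorical columns to floats using train-time order."""
    cat_cols = list(df.select_dtypes(include=["category"]).columns)
    if len(cat_cols) != len(pd_traintime_categories):
        raise ValueError("Number of categorical columns differs from training.")
    out = df.copy()
    for col, categories in zip(cat_cols, pd_traintime_categories):
        # Re-index the column with the train-time category order, then take the codes.
        codes = out[col].cat.set_categories(categories).cat.codes.astype(np.float64)
        # pandas uses code -1 for values outside the category list (assumed -> NaN here).
        out[col] = codes.where(codes >= 0, np.nan)
    return out.to_numpy(dtype=np.float64)

df = pd.DataFrame(
    {
        "x": [1.5, 2.5, 3.5],
        "c1": pd.Categorical(["a", "b", "a"]),
        "c2": pd.Categorical(["a", "b", "z"]),
    }
)
result = dataframe_to_ndarray_sketch(df, [["a", "b"], ["b", "a"]])
print(result)
```

The check on the number of categorical columns mirrors the requirement stated for the data parameter above.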

lleaves.data_processing.extract_model_global_features(file_path)

Extract the number of features, number of classes and number of trees of this model.

Parameters: file_path – path to model.txt
Returns: dict with “n_args”, “n_classes”, “n_trees”
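As a rough illustration, the same three values could be read from a model header as sketched below. This is not lleaves' implementation: the function name read_global_features_sketch is hypothetical, and the header keys used (max_feature_idx, num_class, tree_sizes) as well as the way each value is derived from them are assumptions about the LightGBM text format.

```python
import os
import tempfile

def read_global_features_sketch(file_path):
    """Hypothetical stand-in: parse key=value header lines up to the first tree."""
    header = {}
    with open(file_path, "r") as f:
        for line in f:
            line = line.strip()
            if line.startswith("Tree="):
                break  # header ends where the first tree definition starts
            key, sep, value = line.partition("=")
            if sep:
                header[key] = value
    return {
        # Assumption: feature count is the highest feature index plus one.
        "n_args": int(header["max_feature_idx"]) + 1,
        "n_classes": int(header["num_class"]),
        # Assumption: tree_sizes holds one entry per tree.
        "n_trees": len(header["tree_sizes"].split()),
    }

# Made-up header for demonstration only.
sample = (
    "tree\nversion=v3\nnum_class=1\nnum_tree_per_iteration=1\n"
    "label_index=0\nmax_feature_idx=3\nobjective=regression\n"
    "tree_sizes=340 352 351\n\nTree=0\nnum_leaves=2\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
features = read_global_features_sketch(tmp.name)
os.remove(tmp.name)
print(features)
```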

lleaves.data_processing.extract_pandas_traintime_categories(file_path)

Scan the model.txt from the back to extract the ‘pandas_categorical’ field.

This is a list of lists that stores the ordering of categories from the pd.DataFrame used for training. Storing this list is necessary as LightGBM encodes categories as integer indices and we need to guarantee that the mapping (<category string> -> <integer idx>) is the same during inference as it was during training.

Example (pandas categoricals were present in training): pandas_categorical:[["a", "b", "c"], ["b", "c", "d"], ["w", "x", "y", "z"]]
Example (no pandas categoricals during training): pandas_categorical:[] or pandas_categorical=null

Parameters: file_path – path to model.txt
Returns: list of lists. For each pd.Categorical column encountered during training, a list of its categories.
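The extraction described above can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: the function name read_pandas_categorical_sketch is hypothetical, it reads the whole file instead of seeking from the end, and it returns an empty list when the field is null or missing.

```python
import json
import os
import tempfile

def read_pandas_categorical_sketch(file_path):
    """Hypothetical stand-in: find the last 'pandas_categorical' line and decode it."""
    prefixes = ("pandas_categorical:", "pandas_categorical=")
    with open(file_path, "r") as f:
        lines = f.read().splitlines()
    # Walk the lines from the back, since the field sits at the end of the file.
    for line in reversed(lines):
        for prefix in prefixes:
            if line.startswith(prefix):
                # json.loads("null") gives None, which is normalized to [].
                return json.loads(line[len(prefix):]) or []
    return []

def write_temp_model(text):
    """Write made-up model text to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(text)
    return tmp.name

with_cats = write_temp_model(
    'tree\nversion=v3\nend of trees\n\npandas_categorical:[["a", "b"], ["b", "a"]]\n'
)
without_cats = write_temp_model("tree\nversion=v3\nend of trees\n\npandas_categorical:null\n")

categories = read_pandas_categorical_sketch(with_cats)
no_categories = read_pandas_categorical_sketch(without_cats)
os.remove(with_cats)
os.remove(without_cats)
print(categories, no_categories)
```

The returned list has the shape expected by the pd_traintime_categories parameter of data_to_ndarray.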

lleaves.data_processing.ndarray_to_ptr(data: ndarray, use_fp64: bool = True)

Takes a 2D numpy array, converts it to either float64 or float32 depending on the use_fp64 flag, and returns a pointer to the data.

Parameters:

data – 2D numpy array. Copying is avoided if possible.
use_fp64 – Bool. Casts to float64 if True, otherwise to float32.

Returns:

pointer to 1D array of type float64 if use_fp64 is True, otherwise float32.
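A minimal sketch of this kind of conversion using numpy and ctypes, assuming numpy is installed. It is not the library's implementation: the function name ndarray_to_ptr_sketch is hypothetical, and returning the array alongside the pointer is a choice made here so the caller can keep the underlying buffer alive explicitly.

```python
import ctypes
import numpy as np

def ndarray_to_ptr_sketch(data, use_fp64=True):
    """Hypothetical stand-in: cast to float64/float32 and expose a flat C pointer."""
    dtype = np.float64 if use_fp64 else np.float32
    c_type = ctypes.c_double if use_fp64 else ctypes.c_float
    # No copy is made if data is already C-contiguous with the requested dtype.
    arr = np.ascontiguousarray(data, dtype=dtype)
    ptr = arr.ctypes.data_as(ctypes.POINTER(c_type))
    return ptr, arr

ints = np.array([[1, 2, 3], [4, 5, 6]])
ptr64, arr64 = ndarray_to_ptr_sketch(ints)  # integer input, so a cast (copy) is needed
ptr32, arr32 = ndarray_to_ptr_sketch(ints, use_fp64=False)

floats = np.array([[0.5, 1.5], [2.5, 3.5]], dtype=np.float64)
ptr_same, arr_same = ndarray_to_ptr_sketch(floats)  # already float64 and contiguous

# The pointer addresses the data as a flat, row-major 1D array:
# element [1, 1] of a 2x3 array sits at flat index 1 * 3 + 1 = 4.
print(ptr64[4], ptr32[4], ptr_same[3])
```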