neptoon.io.read

data_ingest¶

Classes:

FileCollectionConfig
ManageFileCollection
ParseFilesIntoDataFrame
InputColumnDataType
NeutronCountUnits
PressureUnits
MergeMethod
InputColumnMetaData
InputDataFrameFormattingConfig
FormatDataForCRNSDataHub
CollectAndParseRawData

Functions:

path_to_config
path_to_config
data_location
data_location
data_source
separator
separator
remove_prefix
remove_prefix
decimal
decimal
dump_tar
dump_zip
build_from_config
get_list_of_files
filter_files
create_file_list
make_dataframe
add_column_meta_data
import_config
build_from_config
assign_merge_methods
add_meteo_columns
add_date_time_column_info
data_frame
config
data_frame
extract_date_time_column
convert_time_zone
date_time_as_index
data_frame_to_numeric
get_conversion_factor_to_cph
standardise_units_of_pressure
merge_multiple_meteo_columns
prepare_key_columns
prepare_neutron_count_columns
clean_raw_dataframe
calc_neutron_uncertainty
snip_data_frame
format_data_and_return_data_frame
path_to_config
path_to_config
create_data_frame

FileCollectionConfig ¶

FileCollectionConfig(path_to_config=None, data_location=None, column_names=None, prefix='', suffix='', encoding='cp850', skip_lines=0, separator=',', decimal='.', skip_initial_space=True, parser_kw_strip_left=True, parser_kw_digit_first=True, starts_with='', multi_header=False, strip_names=True, remove_prefix='//')

Configuration class for file collection and parsing settings.

This class holds all the necessary parameters for locating, reading, and parsing data files, providing a centralized configuration for the data ingestion process.

Initial parameters for data collection and merging

Parameters:

Name	Type	Description	Default
`path_to_config`	`Union[str, Path]`	The location of the sensor configuration file. Can be either a string or Path object	`None`
`data_location`	`Union[str, Path]`	The location of the data files. Can be either a string or Path object	`None`
`column_names`	`list`	List of column names for the data, by default None	`None`
`prefix`	`str`	Start of file name, used for file filtering, by default ""	`''`
`suffix`	`str`	End of file name, used for file filtering, by default ""	`''`
`encoding`	`str`	Encoding used to read the files, by default "cp850"	`'cp850'`
`skip_lines`	`int`	Number of lines to skip when parsing files, by default 0	`0`
`separator`	`str`	Column separator in the files, by default ","	`','`
`decimal`	`str`	The decimal character for floating point numbers, by default "."	`'.'`
`skip_initial_space`	`bool`	Whether to skip initial space when creating the DataFrame, by default True	`True`
`parser_kw`	`dict`	Dictionary with parser values to use when parsing data, by default dict(strip_left=True, digit_first=True)	required
`starts_with`	`any`	String that headers must start with, by default ""	`''`
`multi_header`	`bool`	Whether to look for multiple header lines, by default False	`False`
`strip_names`	`bool`	Whether to strip whitespace from column names, by default True	`True`
`remove_prefix`	`str`	Prefix to remove from column names, by default "//"	`'//'`

dump_tar ¶

dump_tar()

Creates a temporary directory and extracts the tar archive into it.

dump_zip ¶

dump_zip()

Creates a temporary directory and extracts the zip archive into it.
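
As a rough illustration of the pattern that dump_tar and dump_zip describe (create a temporary directory, then extract the archive into it), here is a standard-library sketch. The helper name `extract_to_temp_dir` is hypothetical and not part of neptoon; the actual methods may work differently.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_to_temp_dir(archive_path):
    """Extract a .zip or .tar archive into a fresh temporary directory.

    Hypothetical helper for illustration; not neptoon's implementation.
    """
    temp_dir = Path(tempfile.mkdtemp())
    # shutil.unpack_archive picks the archive format from the file extension
    shutil.unpack_archive(str(archive_path), str(temp_dir))
    return temp_dir


# Build a tiny zip archive so the sketch can be run end to end
work_dir = Path(tempfile.mkdtemp())
archive = work_dir / "raw_data.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("site_001.csv", "date,counts\n2024/04/22 00:00:00,812\n")

extracted_dir = extract_to_temp_dir(archive)
extracted_names = sorted(p.name for p in extracted_dir.iterdir())
```

The temporary directory is not cleaned up automatically in this sketch; the caller would need to remove it once the files have been parsed.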

build_from_config ¶

build_from_config(path_to_config=None)

Imports the attributes for the instance of FileCollectionConfig from a pre-configured YAML file

Parameters:

Name	Type	Description	Default
`path_to_config`	`Union[Path, str]`	Path to the sensor configuration file, by default None	`None`

Raises:

Type	Description
`ValueError`	If no suitable path given

ManageFileCollection ¶

ManageFileCollection(config, files=None)

Manages the collection of files in preparation for parsing them into a DataFrame for the CRNSDataHub.

Example:

    config = FileCollectionConfig(data_location="/path/to/folder")
    file_manager = ManageFileCollection(config)
    file_manager.get_list_of_files()
    file_manager.filter_files()

Initial parameters

Parameters:

Name	Type	Description	Default
`config`	`FileCollectionConfig`	The config object holding key information for collection	required
`files`	`List`	Placeholder for files	`None`

get_list_of_files ¶

get_list_of_files()

Lists the files found at the data_location and assigns these to the files attribute.

filter_files ¶

filter_files()

Filters the files found in the data location using the prefix or suffix given during initialisation. Both of these default to an empty string.

This method updates the files attribute of the class with the filtered list.
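
A minimal sketch of prefix and suffix filtering on a list of file names is shown here. The function name is hypothetical, and applying both conditions together is an assumption; neptoon's own filter may combine them differently.

```python
def filter_by_prefix_and_suffix(files, prefix="", suffix=""):
    """Keep file names that start with `prefix` and end with `suffix`.

    Hypothetical helper; an empty string matches every name.
    """
    return [
        name
        for name in files
        if name.startswith(prefix) and name.endswith(suffix)
    ]


files = ["CRNS_2024_01.txt", "CRNS_2024_02.txt", "notes.md", "CRNS_log.csv"]
filtered = filter_by_prefix_and_suffix(files, prefix="CRNS", suffix=".txt")
```

With the empty-string defaults, every file passes the filter, so calling it without arguments leaves the list unchanged.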

TODO maybe add regexp or * functionality

create_file_list ¶

create_file_list()

Create clean file list

ParseFilesIntoDataFrame ¶

ParseFilesIntoDataFrame(file_manager, config)

Parses raw files into a single pandas DataFrame.

This class takes instances of ManageFileCollection and FileCollectionConfig to process and combine multiple data files into a single DataFrame, handling various file formats and parsing configurations.

Example

    config = FileCollectionConfig(data_location='/path/to/data/folder/')
    file_manager = ManageFileCollection(config=config)
    file_parser = ParseFilesIntoDataFrame(file_manager, config)
    df = file_parser.make_dataframe()

Initialisation files.

Parameters:

Name	Type	Description	Default
`file_manager`	`ManageFileCollection`	An instance of the ManageFileCollection class	required
`config`	`FileCollectionConfig`	The config file containing information to support processing.	required

make_dataframe ¶

make_dataframe(column_names=None)

Merges, parses and converts the data into a single DataFrame.

Parameters:

Name	Type	Description	Default
`column_names`	`list`	Can supply custom column_names for saving file, by default None	`None`

Returns:

Type	Description
`DataFrame`	DataFrame with all data

InputDataFrameFormattingConfig ¶

InputDataFrameFormattingConfig(path_to_config=None, pressure_merge_method=PRIORITY, pressure_units=HECTOPASCALS, temperature_merge_method=PRIORITY, relative_humidity_merge_method=PRIORITY, neutron_count_units=ABSOLUTE_COUNT, date_time_columns=None, date_time_format='%Y/%m/%d %H:%M:%S', initial_time_zone='utc', convert_time_zone_to='utc', is_timestamp=False, decimal='.', start_date_of_data=None)

Configuration class storing necessary attributes to format a DataFrame using the FormatDataForCRNSDataHub.

A class storing information that supports automated processing of raw input CRNS data files into a neptoon-ready DataFrame (for use in FormatDataForCRNSDataHub).

Parameters:

Name	Type	Description	Default
`path_to_config`	`Union[str, Path]`	Path to the sensor configuration file, by default None	`None`
`output_resolution`	`str`	The desired time resolution to aggregate the dataframe to. If None, no time aggregation is done. Otherwise given as a positive integer followed by a time unit (see Notes), by default None	required
`aggregate_method`	`Literal['fagg', 'bagg', 'nagg']`	Specifies which interval is aggregated for a given timestamp (preceding, succeeding or “surrounding” interval).	required
`aggregate_func`	`str`	Aggregation function, by default mean.	required
`aggregate_maxna_fraction`	`float`	Maximum fraction of values in the aggregation period that can be NaN. If set to 0.3, at most 30% of the values can be NaN, by default 0.5	required
`align_timestamps`		Whether to align the time stamps to a regular time. E.g., if time_resolution is 1hour, 13:10 becomes 13:00, by default False.	required
`align_method`		The alignment method to use, by default "time"; see https://rdm-software.pages.ufz.de/saqc/_api/saqc.SaQC.html#saqc.SaQC.align	required
`pressure_merge_method`	`MergeMethod`	Method used to merge multiple pressure columns, by default MergeMethod.PRIORITY	`PRIORITY`
`pressure_units`	`PressureUnits`	States the units of pressure for input data, will be converted to HECTOPASCALS	`HECTOPASCALS`
`temperature_merge_method`	`MergeMethod`	Method used to merge multiple temperature columns, by default MergeMethod.PRIORITY	`PRIORITY`
`relative_humidity_merge_method`	`MergeMethod`	Method used to merge multiple relative humidity columns, by default MergeMethod.PRIORITY	`PRIORITY`
`neutron_count_units`	`NeutronCountUnits`	The units of neutron counts, by default NeutronCountUnits.ABSOLUTE_COUNT	`ABSOLUTE_COUNT`
`date_time_columns`	`List[str]`	Names of the date time columns. If more than one is given, DATE + TIME is expected, by default None	`None`
`date_time_format`	`str`	Format of the date time column, by default "%Y/%m/%d %H:%M:%S"	`'%Y/%m/%d %H:%M:%S'`
`initial_time_zone`	`str`	Initial time zone, by default "utc"	`'utc'`
`convert_time_zone_to`	`str`	Desired time zone, by default "utc"	`'utc'`
`is_timestamp`	`bool`	Whether the date time values are given as a timestamp, by default False	`False`
`decimal`	`str`	Decimal divider, by default "."	`'.'`
`start_date_of_data`	`str | DateTime`	The beginning date from which data should be processed. All data before this date is removed during parsing. Should always be in format: "%Y-%m-%d" e.g., 2024-04-22	`None`

Notes

For time_resolution, the value is a positive integer followed by a unit, where the unit is one of:

- For minutes: "min", "minute", "minutes"
- For hours: "hour", "hours", "hr", "hrs"
- For days: "day", "days"

The parsing is case-insensitive.

For *_merge_method parameters:

- MergeMethod.MEAN: Average of all columns with the same data type.
- MergeMethod.PRIORITY: Select one column from available columns based on predefined priority.
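
To make the resolution format in these Notes concrete, the following sketch parses a string such as "15min" into a `timedelta`. The function `parse_resolution` and the lookup table are hypothetical, and tolerating whitespace between the number and the unit is an assumption; this is not neptoon's own parser.

```python
import re
from datetime import timedelta

# Unit spellings listed in the Notes, mapped to timedelta keyword arguments
UNIT_TO_KWARG = {
    "min": "minutes", "minute": "minutes", "minutes": "minutes",
    "hour": "hours", "hours": "hours", "hr": "hours", "hrs": "hours",
    "day": "days", "days": "days",
}


def parse_resolution(text):
    """Parse strings such as '15min' or '1 Hour' into a timedelta.

    Hypothetical helper; not neptoon's parser.
    """
    # Lower-casing first makes the match case-insensitive
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", text.lower())
    if match is None:
        raise ValueError(f"Cannot parse resolution: {text!r}")
    number, unit = int(match.group(1)), match.group(2)
    if number <= 0 or unit not in UNIT_TO_KWARG:
        raise ValueError(f"Cannot parse resolution: {text!r}")
    return timedelta(**{UNIT_TO_KWARG[unit]: number})
```

For instance, `parse_resolution("1HOUR")` and `parse_resolution("60 minutes")` would describe the same interval.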

add_column_meta_data ¶

add_column_meta_data(initial_name, variable_type, unit, priority)

Adds an InputColumnMetaData class to the column_data attribute.

Parameters:

Name	Type	Description	Default
`initial_name`	`str`	The name of the column from the original raw data	required
`variable_type`	`InputColumnDataType`	Enum of the column data type: see InputColumnDataType	required
`unit`	`str`	The units of the column e.g., "hectopascals"	required
`priority`	`int`	The priority of the column - 1 being highest. Needed when multiple columns are present and the user wants to use the priority merge method (i.e., choose the best column for a data type).	required

import_config ¶

import_config(path_to_config=None)

Automatically assigns the internal attributes using a provided YAML file.

Parameters:

Name	Type	Description	Default
`path_to_config`	`str`	Location of the YAML file, if not supplied here it expects that the self.path_to_config attribute is already set, by default None	`None`

Raises:

Type	Description
`ValueError`	When no path is given but the method is called.

build_from_config ¶

build_from_config()

Assign attributes using the YAML information.

assign_merge_methods ¶

assign_merge_methods(column_data_type, merge_method)

Assigns the merge method for each of the input columns.

Parameters:

Name	Type	Description	Default
`column_data_type`	`InputColumnDataType`	The variable being assigned (as an InputColumnDataType)	required
`merge_method`	`str`	The selected merge method	required

add_meteo_columns ¶

add_meteo_columns(meteo_columns, meteo_type, unit)

Adds column meta data to the class instance. Intended for use when importing attributes with the YAML file.

There can be more than one column recording the same variable. These are recorded in the YAML in priority order e.g.,:

pressure_columns:
    - P4_mb # first priority goes first
    - P3_mb
    - P1_mb

This method will go through the list in priority order, create an InputColumnMetaData class for each column, assign the appropriate values, and add it to self.column_data using the method self.add_column_meta_data.

Parameters:

Name	Type	Description	Default
`meteo_columns`	`List`	A list of column names	required
`meteo_type`	`InputColumnDataType`	The type of column being attributed	required
`unit`	`str`	The units associated with the column	required
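
The priority-ordered loop described for this method could look roughly like the sketch below. `ColumnMeta` is a hypothetical stand-in for InputColumnMetaData, and the string "PRESSURE" is an illustrative label rather than a confirmed InputColumnDataType member.

```python
from dataclasses import dataclass


@dataclass
class ColumnMeta:
    """Hypothetical stand-in for InputColumnMetaData."""
    initial_name: str
    variable_type: str
    unit: str
    priority: int


def build_column_meta(meteo_columns, meteo_type, unit):
    """Create one metadata record per column; list position sets priority."""
    return [
        ColumnMeta(name, meteo_type, unit, priority)
        # start=1 so the first listed column gets priority 1 (highest)
        for priority, name in enumerate(meteo_columns, start=1)
    ]


column_data = build_column_meta(
    ["P4_mb", "P3_mb", "P1_mb"], "PRESSURE", "hectopascals"
)
```

Because the YAML list is already ordered, the position in the list is the only information needed to assign priorities.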

add_date_time_column_info ¶

add_date_time_column_info(date_time_columns, date_time_format, initial_time_zone, convert_time_zone_to='UTC')

Adds datetime column information. Intended for use when importing attributes with the YAML file.

Parameters:

Name	Type	Description	Default
`date_time_columns`	`List`	Names of date time columns	required
`date_time_format`	`str`	The expected format of the date time values.	required
`initial_time_zone`	`str`	The initial time zone of the data	required
`convert_time_zone_to`	`str`	The desired time zone, by default "UTC"	`'UTC'`

FormatDataForCRNSDataHub ¶

FormatDataForCRNSDataHub(data_frame, config)

Formats a DataFrame into the required format to work in neptoon.

Key features:

- Combines multiple datetime columns (e.g., DATE + TIME) into a single date_time column
- Converts time zone (default UTC)
- Ensures date time index
- Ensures columns are numeric
- Organises columns when multiple are present

Attributes:

Name	Type	Description
`data_frame`	`DataFrame`	The time series dataframe
`config`	`InputDataFrameFormattingConfig`	Config object with information about the dataframe, which supports formatting

Attributes of class

Parameters:

Name	Type	Description	Default
`data_frame`	`DataFrame`	The un-formatted dataframe	required
`config`	`InputDataFrameFormattingConfig`	Config object which sets the options for formatting	required

extract_date_time_column ¶

extract_date_time_column()

Create a Datetime column, merge columns if necessary (e.g., when columns are split into date and time)

Returns: pd.Series: the Datetime column.
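
As a sketch of merging split date and time columns, the example below joins the two strings row by row and parses them with the documented default format. The helper name is hypothetical, and joining with a single space is an assumption about how the columns are combined.

```python
from datetime import datetime

DATE_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"  # the documented default format


def combine_date_and_time(dates, times, fmt=DATE_TIME_FORMAT):
    """Join DATE and TIME strings row by row and parse them.

    Hypothetical helper; assumes the parts are joined by one space.
    """
    return [
        datetime.strptime(f"{d.strip()} {t.strip()}", fmt)
        for d, t in zip(dates, times)
    ]


date_column = ["2024/04/22", "2024/04/22"]
time_column = ["00:00:00", "01:00:00"]
date_times = combine_date_and_time(date_column, time_column)
```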

convert_time_zone ¶

convert_time_zone(date_time_series)

Converts the time zone of a date time series. Uses the attributes initial_time_zone (the time zone the data is currently in) and convert_time_zone_to (the desired time zone), which defaults to UTC.

Parameters:

Name	Type	Description	Default
`date_time_series`	`Series`	The date_time_series that is converted	required

Returns:

Type	Description
`Series`	The converted date_time series in the correct time zone
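
The conversion can be pictured with the standard library as follows: attach the initial time zone to naive values, then convert to the target. The function name and the fixed UTC+01:00 offset are assumptions chosen for the example; neptoon operates on a pandas Series and may handle named time zones.

```python
from datetime import datetime, timedelta, timezone


def to_target_time_zone(date_times, initial_tz, target_tz=timezone.utc):
    """Attach `initial_tz` to naive datetimes, then convert to `target_tz`.

    Hypothetical helper for illustration only.
    """
    return [
        dt.replace(tzinfo=initial_tz).astimezone(target_tz)
        for dt in date_times
    ]


# Assumed example: data logged at a fixed UTC+01:00 offset
local_tz = timezone(timedelta(hours=1))
naive = [datetime(2024, 4, 22, 13, 0, 0)]
converted = to_target_time_zone(naive, local_tz)
```

A record stamped 13:00 at UTC+01:00 corresponds to 12:00 UTC, which is what the converted value holds.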

date_time_as_index ¶

date_time_as_index()

Sets a date_time column as the index of the contained DataFrame

Returns: pd.DataFrame: data with a DatetimeIndex

data_frame_to_numeric ¶

data_frame_to_numeric()

Convert DataFrame columns to numeric values.

get_conversion_factor_to_cph ¶

get_conversion_factor_to_cph(timestep_seconds)

Calculates the factor by which a count rate must be multiplied to convert it to counts per hour. Uses the time_resolution attribute for this calculation.

Returns:

Type	Description
`float`	The factor to convert to counts per hour
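
A minimal sketch of such a factor, assuming counts are accumulated over a fixed timestep given in seconds, is the ratio of one hour to that timestep. The function name is hypothetical and the exact calculation in neptoon may differ.

```python
SECONDS_PER_HOUR = 3600


def conversion_factor_to_cph(timestep_seconds):
    """Factor that scales counts per timestep to counts per hour.

    Hypothetical helper; assumes a fixed, positive timestep.
    """
    if timestep_seconds <= 0:
        raise ValueError("timestep_seconds must be positive")
    return SECONDS_PER_HOUR / timestep_seconds


factor_15_min = conversion_factor_to_cph(900)  # 15-minute records
counts_per_hour = 205 * factor_15_min  # 205 counts in 15 minutes
```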

standardise_units_of_pressure ¶

standardise_units_of_pressure()

Standardises units of pressure to hectopascals
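
For orientation, converting common pressure units to hectopascals is a single multiplication. The dictionary keys below are illustrative labels and are not confirmed members of the PressureUnits enum; only the conversion factors themselves are standard.

```python
# Multipliers that convert a value in the given unit to hectopascals.
# The keys are illustrative labels, not neptoon's PressureUnits members.
TO_HECTOPASCALS = {
    "hectopascals": 1.0,
    "millibars": 1.0,     # 1 mbar = 1 hPa
    "pascals": 0.01,      # 100 Pa = 1 hPa
    "kilopascals": 10.0,  # 1 kPa = 10 hPa
}


def to_hectopascals(values, unit):
    """Scale a list of pressure values to hectopascals."""
    factor = TO_HECTOPASCALS[unit]
    return [v * factor for v in values]


pressure_hpa = to_hectopascals([101.3, 99.8], "kilopascals")
```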

merge_multiple_meteo_columns ¶

merge_multiple_meteo_columns(column_data_type)

Merges columns when multiple are available. Many CRNS have multiple sensors available in the input dataset (e.g., 2 or more pressure sensors at the site). We need only one value for each of these variables. This method uses the settings in the DataFrameConfig class to produce a single sensor value for the selected sensor.

Current Options (set in the Config file):

- mean - create an average of all the pressure sensors
- priority - use one sensor selected as priority

Future Options:

- priority_filled - use one sensor as priority and fill values from alternative sensors

Parameters:

Name	Type	Description	Default
`column_data_type`	`InputColumnDataType`	One of the possible InputColumnDataTypes that can be used here.	required

Raises:

Type	Description
`ValueError`	If an incompatible InputColumnDataType is given
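
The difference between the mean and priority options can be sketched on plain lists. Both helpers are hypothetical; in particular, skipping missing values in the mean is an assumption, and the priority variant deliberately leaves gaps unfilled.

```python
def merge_mean(columns):
    """Row-wise mean over all columns, ignoring missing (None) values."""
    merged = []
    for row in zip(*columns):
        present = [v for v in row if v is not None]
        merged.append(sum(present) / len(present) if present else None)
    return merged


def merge_priority(columns_by_priority):
    """Use the single highest-priority column (priority 1) as it is."""
    best = min(columns_by_priority)  # smallest key = highest priority
    return list(columns_by_priority[best])


p4 = [1010.0, None, 1008.0]
p3 = [1012.0, 1011.0, 1006.0]

mean_pressure = merge_mean([p4, p3])
priority_pressure = merge_priority({1: p4, 2: p3})
```

In this sketch the priority result keeps the gap in the second row, whereas the mean falls back on the one sensor that reported a value.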

prepare_key_columns ¶

prepare_key_columns()

Prepares the key columns if all the information has been supplied.

prepare_neutron_count_columns ¶

prepare_neutron_count_columns(neutron_column_type)

Prepares the neutron columns for usage in neptoon. Performs several steps:

- Finds the columns labeled with neutron_column_type
- If more than one is found, sums them into a new column
- Checks the units and converts to counts per hour.

Parameters:

Name	Type	Description	Default
`neutron_column_type`	`Literal[InputColumnDataType.EPI_NEUTRON_COUNT, InputColumnDataType.THERM_NEUTRON_COUNT]`	The type of neutron data being processed	required
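
The sum-then-convert steps listed for this method can be sketched as below, assuming the inputs are absolute counts per timestep from several detector tubes. The function and variable names are hypothetical.

```python
def sum_and_convert_to_cph(count_columns, timestep_seconds):
    """Sum several count columns row by row, then scale to counts per hour.

    Hypothetical helper; assumes absolute counts per fixed timestep.
    """
    factor = 3600 / timestep_seconds
    return [sum(row) * factor for row in zip(*count_columns)]


tube_1 = [100, 110, 95]
tube_2 = [90, 105, 100]
epithermal_cph = sum_and_convert_to_cph([tube_1, tube_2], timestep_seconds=1800)
```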

clean_raw_dataframe ¶

clean_raw_dataframe()

Cleans raw DataFrame by removing NaT values and duplicated rows.

calc_neutron_uncertainty ¶

calc_neutron_uncertainty()

Creates a column with the statistical uncertainty of the neutron column and converts this value to counts per hour.
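
One common way to estimate such an uncertainty, assumed here rather than taken from neptoon's source, is Poisson counting statistics: the standard deviation of N raw counts is the square root of N, which is then scaled to counts per hour.

```python
import math


def neutron_uncertainty_cph(raw_counts, timestep_seconds):
    """Poisson standard deviation of raw counts, scaled to counts per hour.

    Hypothetical helper; assumes sigma = sqrt(N) for N raw counts.
    """
    factor = 3600 / timestep_seconds
    return [math.sqrt(n) * factor for n in raw_counts]


uncertainty = neutron_uncertainty_cph([400, 900], timestep_seconds=1800)
```

Note that the square root is taken on the raw counts before scaling; taking it after converting to counts per hour would give a different value.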

snip_data_frame ¶

snip_data_frame()

Removes data from before the defined install date.
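
A small sketch of this kind of trimming is given below, assuming the cut-off is supplied in the "%Y-%m-%d" format documented for start_date_of_data. The helper and the (timestamp, value) row layout are hypothetical.

```python
from datetime import datetime


def snip_before(rows, start_date):
    """Drop (timestamp, value) rows recorded before `start_date`.

    Hypothetical helper; `start_date` uses the "%Y-%m-%d" format.
    """
    cutoff = datetime.strptime(start_date, "%Y-%m-%d")
    return [(ts, value) for ts, value in rows if ts >= cutoff]


rows = [
    (datetime(2024, 4, 21, 23, 0), 801),
    (datetime(2024, 4, 22, 0, 0), 812),
    (datetime(2024, 4, 22, 1, 0), 807),
]
kept = snip_before(rows, "2024-04-22")
```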

format_data_and_return_data_frame ¶

format_data_and_return_data_frame()

Completes the whole process of formatting the dataframe. Expects the settings to be fully implemented.

Returns:

Type	Description
`DataFrame`	The formatted DataFrame

CollectAndParseRawData ¶

CollectAndParseRawData(path_to_config, file_collection_config=None, input_formatter_config=None)

Central class which allows us to do the entire ingest and formatting routine. Designed to work with a YAML file.

create_data_frame ¶

create_data_frame()

Creates the data frame by parsing raw data files into a DataFrame. It expects to use a YAML file.

Returns:

Type	Description
`DataFrame`	The DataFrame created from the parsed raw data files