neptoon.io.read

data_ingest

FileCollectionConfig

FileCollectionConfig(path_to_config=None, data_location=None, column_names=None, prefix='', suffix='', encoding='cp850', skip_lines=0, separator=',', decimal='.', skip_initial_space=True, parser_kw_strip_left=True, parser_kw_digit_first=True, starts_with='', multi_header=False, strip_names=True, remove_prefix='//')

Configuration class for file collection and parsing settings.

This class holds all the necessary parameters for locating, reading, and parsing data files, providing a centralized configuration for the data ingestion process.

Initial parameters for data collection and merging

Parameters:

Name Type Description Default
path_to_config Union[str, Path]

The location of the sensor configuration file. Can be either a string or Path object

None
data_location Union[str, Path]

The location of the data files. Can be either a string or Path object

None
column_names list

List of column names for the data, by default None

None
prefix str

Start of file name, used for file filtering, by default ""

''
suffix str

End of file name, used for file filtering, by default ""

''
encoding str

Character encoding used to read the files, by default "cp850"

'cp850'
skip_lines int

Number of lines to skip when parsing files, by default 0

0
separator str

Column separator in the files, by default ","

','
decimal str

The decimal character for floating point numbers, by default "."

'.'
skip_initial_space bool

Whether to skip initial space when creating the dataframe, by default True

True
parser_kw dict

Dictionary with parser values to use when parsing data, by default dict(strip_left=True, digit_first=True)

required
starts_with any

String that headers must start with, by default ""

''
multi_header bool

Whether to look for multiple header lines, by default False

False
strip_names bool

Whether to strip whitespace from column names, by default True

True
remove_prefix str

Prefix to remove from column names, by default "//"

'//'
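
As an illustration of how several of these settings interact, the following standard-library sketch parses a small text sample. It is not neptoon's parser: the helper name `parse_text` and the sample data are hypothetical, and the handling of each option is an assumption based only on the parameter descriptions above.

```python
import csv


def parse_text(text, skip_lines=0, separator=",", decimal=".",
               skip_initial_space=True, strip_names=True, remove_prefix="//"):
    """Hypothetical helper: parse delimited text into (header, rows)."""
    # Drop the leading lines that hold no data (skip_lines)
    lines = text.splitlines()[skip_lines:]
    reader = csv.reader(lines, delimiter=separator,
                        skipinitialspace=skip_initial_space)
    header = next(reader)
    if strip_names:
        header = [name.strip() for name in header]
    if remove_prefix:
        # str.removeprefix needs Python 3.9 or newer
        header = [name.removeprefix(remove_prefix) for name in header]
    rows = []
    for record in reader:
        # Swap the configured decimal character for "." before converting
        rows.append([float(value.replace(decimal, ".")) for value in record])
    return header, rows


# Sample with one logger line to skip, ";" separators and "," decimals
raw = "logger v1\n//P1_mb; N1Cts\n1013,2; 520\n1012,9; 498\n"
header, rows = parse_text(raw, skip_lines=1, separator=";", decimal=",")
```

Here `header` holds the cleaned column names and `rows` the numeric values, which mirrors the roles of `skip_lines`, `separator`, `decimal`, `strip_names` and `remove_prefix` in the configuration.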

dump_tar

dump_tar()

Create temporary directory and extract

dump_zip

dump_zip()

Create temporary directory and extract
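
The extract-to-temporary-directory step can be pictured with the standard library alone. The sketch below is an assumption about the general pattern, not neptoon's implementation; `extract_zip_to_temp` and the file names are hypothetical. The same idea applies to tar archives through the `tarfile` module.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip_to_temp(zip_path):
    """Hypothetical helper: unpack an archive into a fresh temporary folder."""
    temp_dir = Path(tempfile.mkdtemp())
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(temp_dir)
    return temp_dir


# Build a small archive so the sketch can run on its own
work_dir = Path(tempfile.mkdtemp())
archive_path = work_dir / "raw_data.zip"
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("site_001.csv", "DATE,TIME,N1Cts\n")

extracted = extract_zip_to_temp(archive_path)
extracted_names = sorted(p.name for p in extracted.iterdir())
```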

build_from_config

build_from_config(path_to_config=None)

Imports the attributes for the instance of FileCollectionConfig from a pre-configured YAML file

Parameters:

Name Type Description Default
path_to_config Union[Path, str]

Path to the sensor configuration file, by default None

None

Raises:

Type Description
ValueError

If no suitable path given

ManageFileCollection

ManageFileCollection(config, files=None)

Manages the collection of files in preparation for parsing them into a DataFrame for the CRNSDataHub.

Example:

config = FileCollectionConfig(data_location="/path/to/folder")
file_manager = ManageFileCollection(config)
file_manager.get_list_of_files()
file_manager.filter_files()

Initial parameters

Parameters:

Name Type Description Default
config FileCollectionConfig

The config object holding key information for collection

required
files List

Placeholder for files

None

get_list_of_files

get_list_of_files()

Lists the files found at the data_location and assigns these to the files attribute.

filter_files

filter_files()

Filters the files found in the data location using the prefix or suffix given during initialisation. Both of these default to an empty string.

This method updates the files attribute of the class with the filtered list.

TODO maybe add regexp or * functionality
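
The filtering described above can be sketched in plain Python. This is a minimal illustration, assuming that a file must match both the prefix and the suffix and that an empty string matches every name; the helper `filter_by_prefix_and_suffix` and the file names are hypothetical.

```python
def filter_by_prefix_and_suffix(file_names, prefix="", suffix=""):
    """Hypothetical helper: keep names that match both prefix and suffix."""
    return [
        name for name in file_names
        # An empty prefix or suffix matches every name
        if name.startswith(prefix) and name.endswith(suffix)
    ]


files = ["CRNS_2024_01.csv", "CRNS_2024_02.csv", "notes.txt", "CRNS_readme.txt"]
kept = filter_by_prefix_and_suffix(files, prefix="CRNS", suffix=".csv")
unfiltered = filter_by_prefix_and_suffix(files)
```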

create_file_list

create_file_list()

Create clean file list

ParseFilesIntoDataFrame

ParseFilesIntoDataFrame(file_manager, config)

Parses raw files into a single pandas DataFrame.

This class takes instances of ManageFileCollection and FileCollectionConfig to process and combine multiple data files into a single DataFrame, handling various file formats and parsing configurations.

Example

config = FileCollectionConfig(data_location='/path/to/data/folder/')
file_manager = ManageFileCollection(config=config)
file_parser = ParseFilesIntoDataFrame(file_manager, config)
df = file_parser.make_dataframe()

Initialisation files.

Parameters:

Name Type Description Default
file_manager ManageFileCollection

An instance of the ManageFileCollection class

required
config FileCollectionConfig

The config file containing information to support processing.

required

make_dataframe

make_dataframe(column_names=None)

Merges, parses and converts the data into a single DataFrame.

Parameters:

Name Type Description Default
column_names list

Can supply custom column_names for saving file, by default None

None

Returns:

Type Description
DataFrame

DataFrame with all data

InputDataFrameFormattingConfig

InputDataFrameFormattingConfig(path_to_config=None, pressure_merge_method=PRIORITY, pressure_units=HECTOPASCALS, temperature_merge_method=PRIORITY, relative_humidity_merge_method=PRIORITY, neutron_count_units=ABSOLUTE_COUNT, date_time_columns=None, date_time_format='%Y/%m/%d %H:%M:%S', initial_time_zone='utc', convert_time_zone_to='utc', is_timestamp=False, decimal='.', start_date_of_data=None)

Configuration class storing necessary attributes to format a DataFrame using the FormatDataForCRNSDataHub.

A class storing information that supports automated processing of raw input CRNS data files into a neptoon-ready dataframe (for use in FormatDataForCRNSDataHub)

Parameters:

Name Type Description Default
path_to_config Union[str, Path]

Path to the sensor configuration file, by default None

None
output_resolution str

The desired time resolution to aggregate the dataframe to. If None, no time aggregation is done. Otherwise a positive integer followed by a unit (see Notes), by default None

required
aggregate_method Literal['fagg', 'bagg', 'nagg']

Specifies which intervals to be aggregated for a certain timestamp. (preceding, succeeding or “surrounding” interval).

required
aggregate_func str

Aggregation function. By default mean.

required
aggregate_maxna_fraction float

Maximum fraction of values in the aggregation period that can be NaN. If set to 0.3, at most 30% of the values can be NaN, by default 0.5

required
align_timestamps

Whether to align the time stamps to a regular time. E.g., if time_resolution is 1hour, 13:10 becomes 13:00, by default False.

required
align_method

The alignment method to use, by default "time". See https://rdm-software.pages.ufz.de/saqc/_api/saqc.SaQC.html#saqc.SaQC.align

required
pressure_merge_method MergeMethod

Method used to merge multiple pressure columns, by default MergeMethod.PRIORITY

PRIORITY
pressure_units PressureUnits

States the units of pressure for input data, will be converted to HECTOPASCALS

HECTOPASCALS
temperature_merge_method MergeMethod

Method used to merge multiple temperature columns, by default MergeMethod.PRIORITY

PRIORITY
relative_humidity_merge_method MergeMethod

Method used to merge multiple relative humidity columns, by default MergeMethod.PRIORITY

PRIORITY
neutron_count_units NeutronCountUnits

The units of neutron counts, by default NeutronCountUnits.ABSOLUTE_COUNT

ABSOLUTE_COUNT
date_time_columns List[str]

Names of the date time columns. If more than one is given, DATE + TIME is expected, by default None

None
date_time_format str

Format of the date time column, by default "%Y/%m/%d %H:%M:%S"

'%Y/%m/%d %H:%M:%S'
initial_time_zone str

Initial time zone, by default "utc"

'utc'
convert_time_zone_to str

Desired time zone, by default "utc"

'utc'
is_timestamp bool

Whether time stamp, by default False

False
decimal str

Decimal divider, by default "."

'.'
start_date_of_data str | DateTime

The beginning date from which data should be processed. All data before this date is removed during parsing. Should always be in format: "%Y-%m-%d" e.g., 2024-04-22

None
Notes

For time_resolution, the value is a positive integer followed by a unit, where the unit is one of:
- For minutes: "min", "minute", "minutes"
- For hours: "hour", "hours", "hr", "hrs"
- For days: "day", "days"

The parsing is case-insensitive.

For *_merge_method parameters:
- MergeMethod.MEAN: Average of all columns with the same data type.
- MergeMethod.PRIORITY: Select one column from available columns based on predefined priority.
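
The time-resolution format in the notes above can be made concrete with a small parser. This is only a sketch under stated assumptions: the helper `resolution_to_minutes` is hypothetical, the unit table is copied from the notes, and allowing optional whitespace between the number and the unit is an assumption rather than documented behaviour.

```python
import re

# Units listed in the notes, mapped to a length in minutes
UNIT_TO_MINUTES = {
    "min": 1, "minute": 1, "minutes": 1,
    "hour": 60, "hours": 60, "hr": 60, "hrs": 60,
    "day": 1440, "days": 1440,
}


def resolution_to_minutes(resolution):
    """Hypothetical helper: turn a string such as "1hour" into minutes."""
    # Lower-casing first makes the parsing case-insensitive
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", resolution.lower())
    if match is None or match.group(2) not in UNIT_TO_MINUTES:
        raise ValueError(f"Cannot parse time resolution: {resolution!r}")
    number = int(match.group(1))
    if number <= 0:
        raise ValueError("The number must be a positive integer")
    return number * UNIT_TO_MINUTES[match.group(2)]
```

For example, `resolution_to_minutes("1hour")` gives 60 and `resolution_to_minutes("15 MIN")` gives 15, while a string such as `"1week"` raises a ValueError because the unit is not in the list.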

add_column_meta_data

add_column_meta_data(initial_name, variable_type, unit, priority)

Adds an InputColumnMetaData class to the column_data attribute.

Parameters:

Name Type Description Default
initial_name str

The name of the column from the original raw data

required
variable_type InputColumnDataType

Enum of the column data type: see InputColumnDataType

required
unit str

The units of the column e.g., "hectopascals"

required
priority int

The priority of the column - 1 being highest. Needed when multiple columns are present and the user wants to use the priority merge method (i.e., choose the best column for a data type).

required

import_config

import_config(path_to_config=None)

Automatically assigns the internal attributes using a provided YAML file.

Parameters:

Name Type Description Default
path_to_config str

Location of the YAML file, if not supplied here it expects that the self.path_to_config attribute is already set, by default None

None

Raises:

Type Description
ValueError

When no path is given but the method is called.

build_from_config

build_from_config()

Assign attributes using the YAML information.

assign_merge_methods

assign_merge_methods(column_data_type, merge_method)

Assigns the merge method for each of the input columns.

Parameters:

Name Type Description Default
column_data_type InputColumnDataType

The variable being assigned (as an InputColumnDataType)

required
merge_method str

The selected merge method

required

add_meteo_columns

add_meteo_columns(meteo_columns, meteo_type, unit)

Adds column meta data to the class instance. Intended for use when importing attributes with the YAML file.

There can be more than one column recording the same variable. These are recorded in the YAML in priority order e.g.,:

pressure_columns:
    - P4_mb # first priority goes first
    - P3_mb
    - P1_mb

This method will go through the list in priority order, create an InputColumnMetaData class for each column, assign the appropriate values, and add it to self.column_data using the method self.add_column_meta_data.

Parameters:

Name Type Description Default
meteo_columns List

A list of column names

required
meteo_type InputColumnDataType

The type of column being attributed

required
unit str

The units associated with the column

required
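
How the priority order in the YAML list could translate into per-column metadata is sketched below. `ColumnInfo` is a hypothetical stand-in for InputColumnMetaData, and `build_column_info` is not part of neptoon; the only behaviour taken from the description above is that the first listed column receives priority 1.

```python
from dataclasses import dataclass


@dataclass
class ColumnInfo:
    """Hypothetical stand-in for InputColumnMetaData."""
    initial_name: str
    variable_type: str
    unit: str
    priority: int


def build_column_info(meteo_columns, meteo_type, unit):
    """Assign priority 1 to the first listed column, 2 to the next, and so on."""
    return [
        ColumnInfo(name, meteo_type, unit, priority)
        for priority, name in enumerate(meteo_columns, start=1)
    ]


pressure_info = build_column_info(
    ["P4_mb", "P3_mb", "P1_mb"], "pressure", "hectopascals"
)
```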

add_date_time_column_info

add_date_time_column_info(date_time_columns, date_time_format, initial_time_zone, convert_time_zone_to='UTC')

Adds datetime column information. Intended for use when importing attributes with the YAML file.

Parameters:

Name Type Description Default
date_time_columns List

Names of date time columns

required
date_time_format str

The expected format of the date time values.

required
initial_time_zone str

The initial time zone of the data

required
convert_time_zone_to str

The desired time zone, by default "UTC"

'UTC'

FormatDataForCRNSDataHub

FormatDataForCRNSDataHub(data_frame, config)

Formats a DataFrame into the required format to work in neptoon.

Key features:
- Combines multiple datetime columns (e.g., DATE + TIME) into a single date_time column
- Converts time zone (default UTC)
- Ensures date time index
- Ensures columns are numeric
- Organises columns when multiple are present

Attributes:

Name Type Description
data_frame DataFrame

The time series dataframe

config InputDataFrameFormattingConfig

Config object with information about the dataframe, which supports formatting

Attributes of class

Parameters:

Name Type Description Default
data_frame DataFrame

The un-formatted dataframe

required
config InputDataFrameFormattingConfig

Config object which sets the options for formatting

required

extract_date_time_column

extract_date_time_column()

Create a Datetime column, merge columns if necessary (e.g., when columns are split into date and time)

Returns: pd.Series: the Datetime column.
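
Merging split date and time columns can be pictured as joining the two strings and parsing the result with the configured format. The sketch below uses plain lists instead of a DataFrame; `combine_date_and_time` is a hypothetical helper, and joining with a single space is an assumption that matches the default date_time_format.

```python
from datetime import datetime


def combine_date_and_time(dates, times, date_time_format="%Y/%m/%d %H:%M:%S"):
    """Hypothetical helper: join DATE and TIME strings, then parse them."""
    return [
        datetime.strptime(f"{date} {time}", date_time_format)
        for date, time in zip(dates, times)
    ]


combined = combine_date_and_time(
    ["2024/04/22", "2024/04/22"], ["13:00:00", "14:00:00"]
)
```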

convert_time_zone

convert_time_zone(date_time_series)

Converts the time zone of a date time series. Uses the attributes initial_time_zone (the time zone the data is currently in) and convert_time_zone_to (the desired time zone), which defaults to UTC.

Parameters:

Name Type Description Default
date_time_series Series

The date_time_series that is converted

required

Returns:

Type Description
Series

The converted date_time series in the correct time zone
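
The two-step idea (label the values with their current zone, then convert) is shown below with the standard library. This is a sketch, not the method itself: it works on a list rather than a Series, and it uses a fixed UTC+2 offset as the initial zone purely so the example is self-contained, whereas the configuration takes zone names such as "utc".

```python
from datetime import datetime, timedelta, timezone


def convert_time_zone(naive_times, initial_time_zone,
                      convert_time_zone_to=timezone.utc):
    """Hypothetical helper: attach the initial zone, then convert."""
    return [
        t.replace(tzinfo=initial_time_zone).astimezone(convert_time_zone_to)
        for t in naive_times
    ]


# Assumed example zone: a fixed offset two hours ahead of UTC
utc_plus_two = timezone(timedelta(hours=2))
local_times = [datetime(2024, 4, 22, 13, 0), datetime(2024, 4, 22, 23, 30)]
in_utc = convert_time_zone(local_times, utc_plus_two)
```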

date_time_as_index

date_time_as_index()

Sets a date_time column as the index of the contained DataFrame

Returns: pd.DataFrame: data with a DatetimeIndex

data_frame_to_numeric

data_frame_to_numeric()

Convert DataFrame columns to numeric values.

get_conversion_factor_to_cph

get_conversion_factor_to_cph(timestep_seconds)

Figures out the factor needed to multiply a count rate by to convert it to counts per hour. Uses the time_resolution attribute for this calculation.

Returns:

Type Description
float

The factor to convert to counts per hour
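
As a worked example of the arithmetic, a count recorded over a timestep of `timestep_seconds` scales to an hourly value by the number of such timesteps in an hour. The function below is a sketch of that reasoning under the assumption that the input is a count per timestep; it is not the library's code.

```python
def conversion_factor_to_cph(timestep_seconds):
    """Sketch: number of timesteps that fit into one hour (3600 s)."""
    if timestep_seconds <= 0:
        raise ValueError("timestep_seconds must be positive")
    return 3600 / timestep_seconds


# 130 counts in a 15-minute (900 s) interval
counts_per_hour = 130 * conversion_factor_to_cph(900)
```

With a 900-second timestep the factor is 4.0, so 130 counts per interval correspond to 520 counts per hour; a two-hour timestep gives a factor of 0.5.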

standardise_units_of_pressure

standardise_units_of_pressure()

Standardises units of pressure to hectopascals

merge_multiple_meteo_columns

merge_multiple_meteo_columns(column_data_type)

Merges columns when multiple are available. Many CRNS have multiple sensors available in the input dataset (e.g., 2 or more pressure sensors at the site). We need only one value for each of these variables. This method uses the settings in the DataFrameConfig class to produce a single sensor value for the selected sensor.

Current Options (set in the Config file):
- mean - create an average of all the pressure sensors
- priority - use one sensor selected as priority

Future Options:
- priority_filled - use one sensor as priority and fill values from alternative sensors

Parameters:

Name Type Description Default
column_data_type InputColumnDataType

One of the possible InputColumnDataTypes that can be used here.

required

Raises:

Type Description
ValueError

If an incompatible InputColumnDataType is given
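
The difference between the two current options can be seen on a toy example. The sketch below uses plain lists with None for missing readings; both helper functions are hypothetical, and skipping missing values when averaging is an assumption rather than documented behaviour.

```python
from statistics import mean


def merge_mean(columns):
    """Row-wise average of the values that are present."""
    merged = []
    for values in zip(*columns):
        present = [v for v in values if v is not None]
        merged.append(mean(present) if present else None)
    return merged


def merge_priority(columns):
    """Use only the first column, assumed to be the highest priority."""
    return list(columns[0])


# Two pressure sensors, listed in priority order
p4_mb = [1010.0, None, 1012.0]
p3_mb = [1012.0, 1011.0, 1014.0]

by_mean = merge_mean([p4_mb, p3_mb])
by_priority = merge_priority([p4_mb, p3_mb])
```

Note that the priority result keeps the gap in the first sensor, which is the case the future priority_filled option is meant to address.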

prepare_key_columns

prepare_key_columns()

Prepares the key columns if all the information has been supplied.

prepare_neutron_count_columns

prepare_neutron_count_columns(neutron_column_type)

Prepares the neutron columns for usage in neptoon. Performs several steps:

- Finds the columns labeled with neutron_column_type
- If more than one it will sum them into a new column
- Check the units and convert to counts per hour.

Parameters:

Name Type Description Default
neutron_column_type Literal[InputColumnDataType.EPI_NEUTRON_COUNT, InputColumnDataType.THERM_NEUTRON_COUNT]

The type of neutron data being processed

required
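
The steps listed above (sum the matching columns, then convert to counts per hour) can be sketched with plain lists. The helper `prepare_counts` is hypothetical, and the example assumes the raw values are absolute counts per timestep.

```python
def prepare_counts(count_columns, timestep_seconds):
    """Sketch: sum the columns row by row, then scale to counts per hour."""
    factor = 3600 / timestep_seconds
    return [sum(row) * factor for row in zip(*count_columns)]


# Two detector tubes recording counts every 15 minutes (900 s)
tube_1 = [120, 130]
tube_2 = [115, 125]
counts_per_hour = prepare_counts([tube_1, tube_2], timestep_seconds=900)
```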

clean_raw_dataframe

clean_raw_dataframe()

Cleans raw DataFrame by removing NaT values and duplicated rows.
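
A minimal sketch of this kind of cleaning on plain tuples is given below. It assumes each row starts with its timestamp, uses None as a stand-in for NaT, and keeps the first occurrence of a duplicated row; `clean_rows` is a hypothetical helper and the real method operates on a DataFrame.

```python
def clean_rows(rows):
    """Sketch: drop rows without a timestamp and exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        timestamp = row[0]
        if timestamp is None or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


raw_rows = [
    ("2024-04-22 13:00", 520),
    (None, 498),                # missing timestamp
    ("2024-04-22 13:00", 520),  # exact duplicate of the first row
    ("2024-04-22 14:00", 505),
]
cleaned_rows = clean_rows(raw_rows)
```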

calc_neutron_uncertainty

calc_neutron_uncertainty()

Creates a column with the statistical uncertainty of the neutron column and converts this value to counts per hour.
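
The description does not state which uncertainty formula is used. A common choice for counting data is the Poisson estimate, the square root of the raw count, and the sketch below assumes that choice purely for illustration before scaling the result to counts per hour. The helper name is hypothetical.

```python
import math


def count_uncertainty_cph(raw_counts, timestep_seconds):
    """Sketch: Poisson uncertainty sqrt(N), scaled to counts per hour."""
    factor = 3600 / timestep_seconds
    return [math.sqrt(n) * factor for n in raw_counts]


# Raw counts collected over 30-minute (1800 s) intervals
uncertainty = count_uncertainty_cph([400, 900], timestep_seconds=1800)
```

Under this assumption, 400 raw counts in half an hour correspond to 800 counts per hour with an uncertainty of 40 counts per hour.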

snip_data_frame

snip_data_frame()

Removes data from before the defined install date.
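
Snipping can be illustrated on a list of (timestamp, value) pairs. The sketch assumes the cut-off is the start_date_of_data setting in its documented "%Y-%m-%d" format and that records on the start date itself are kept; `snip_before` is a hypothetical helper.

```python
from datetime import datetime


def snip_before(records, start_date_of_data):
    """Sketch: keep only records at or after the start date."""
    start = datetime.strptime(start_date_of_data, "%Y-%m-%d")
    return [(stamp, value) for stamp, value in records if stamp >= start]


records = [
    (datetime(2024, 4, 21, 23, 0), 500),
    (datetime(2024, 4, 22, 0, 0), 510),
    (datetime(2024, 4, 23, 12, 0), 495),
]
snipped = snip_before(records, "2024-04-22")
```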

format_data_and_return_data_frame

format_data_and_return_data_frame()

Completes the whole process of formatting the dataframe. Expects the settings to be fully implemented.

Returns:

Type Description
DataFrame

The formatted DataFrame

CollectAndParseRawData

CollectAndParseRawData(path_to_config, file_collection_config=None, input_formatter_config=None)

Central class which allows us to do the entire ingest and formatting routine. Designed to work with a YAML file.

create_data_frame

create_data_frame()

Creates the data frame by parsing raw data files into a DataFrame. It expects to use a YAML file.

Returns:

Type Description
DataFrame

The DataFrame created from the raw data files