neptoon.io.read
data_ingest¶
Classes:
- FileCollectionConfig
- ManageFileCollection
- ParseFilesIntoDataFrame
- InputColumnDataType
- NeutronCountUnits
- PressureUnits
- MergeMethod
- InputColumnMetaData
- InputDataFrameFormattingConfig
- FormatDataForCRNSDataHub
- CollectAndParseRawData
Functions:
- path_to_config
- data_location
- data_source
- separator
- remove_prefix
- decimal
- dump_tar
- dump_zip
- build_from_config
- get_list_of_files
- filter_files
- create_file_list
- make_dataframe
- add_column_meta_data
- import_config
- build_from_config
- assign_merge_methods
- add_meteo_columns
- add_date_time_column_info
- data_frame
- config
- data_frame
- extract_date_time_column
- convert_time_zone
- date_time_as_index
- data_frame_to_numeric
- get_conversion_factor_to_cph
- standardise_units_of_pressure
- merge_multiple_meteo_columns
- prepare_key_columns
- prepare_neutron_count_columns
- clean_raw_dataframe
- calc_neutron_uncertainty
- snip_data_frame
- format_data_and_return_data_frame
- path_to_config
- create_data_frame
FileCollectionConfig ¶
FileCollectionConfig(path_to_config=None, data_location=None, column_names=None, prefix='', suffix='', encoding='cp850', skip_lines=0, separator=',', decimal='.', skip_initial_space=True, parser_kw_strip_left=True, parser_kw_digit_first=True, starts_with='', multi_header=False, strip_names=True, remove_prefix='//')
Configuration class for file collection and parsing settings.
This class holds all the necessary parameters for locating, reading, and parsing data files, providing a centralized configuration for the data ingestion process.
Initial parameters for data collection and merging
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_to_config | Union[str, Path] | The location of the sensor configuration file. Can be either a string or a Path object. | None |
| data_location | Union[str, Path] | The location of the data files. Can be either a string or a Path object. | None |
| column_names | list | List of column names for the data, by default None | None |
| prefix | str | Start of the file name, used for file filtering, by default "" | '' |
| suffix | str | End of the file name, used for file filtering, by default "" | '' |
| encoding | str | Encoding used to read the files, by default "cp850" | 'cp850' |
| skip_lines | int | Number of lines to skip when parsing files, by default 0 | 0 |
| separator | str | Column separator in the files, by default "," | ',' |
| decimal | str | The decimal character for floating point numbers, by default "." | '.' |
| skip_initial_space | bool | Whether to skip initial space when creating the dataframe, by default True | True |
| parser_kw | dict | Dictionary with parser values to use when parsing data, by default dict(strip_left=True, digit_first=True). The constructor signature exposes these as parser_kw_strip_left and parser_kw_digit_first. | dict(strip_left=True, digit_first=True) |
| starts_with | any | String that headers must start with, by default "" | '' |
| multi_header | bool | Whether to look for multiple header lines, by default False | False |
| strip_names | bool | Whether to strip whitespace from column names, by default True | True |
| remove_prefix | str | Prefix to remove from column names, by default "//" | '//' |
build_from_config ¶
Imports the attributes for the instance of FileCollectionConfig from a pre-configured YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_to_config | Union[Path, str] | Path to the sensor configuration file, by default None | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If no suitable path is given |
ManageFileCollection ¶
Manages the collection of files in preparation for parsing them into a DataFrame for the CRNSDataHub.
Example:
    from neptoon.io.read.data_ingest import FileCollectionConfig, ManageFileCollection

    config = FileCollectionConfig(data_location="/path/to/folder")
    file_manager = ManageFileCollection(config)
    file_manager.get_list_of_files()
    file_manager.filter_files()
Initial parameters
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | FileCollectionConfig | The config object holding key information for collection | required |
| files | List | Placeholder for files | None |
get_list_of_files ¶
Lists the files found at the data_location and assigns these to the files attribute.
filter_files ¶
Filters the files found in the data location using the prefix or suffix given during initialisation. Both of these default to an empty string ('').
This method updates the files attribute of the class with the filtered list.
TODO: maybe add regexp or * functionality
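The filtering described above can be pictured with a small standard-library sketch. It is not neptoon's implementation: the function name `filter_file_names` is hypothetical, and it assumes that prefix and suffix must both match and that an empty string matches every name.

```python
from pathlib import Path


def filter_file_names(files, prefix="", suffix=""):
    """Keep names that start with `prefix` and end with `suffix`.

    Hypothetical helper. An empty string matches every name, so the
    defaults return the list unchanged.
    """
    kept = []
    for name in files:
        base = Path(name).name  # compare against the file name only, not the folder
        if base.startswith(prefix) and base.endswith(suffix):
            kept.append(name)
    return kept


files = ["CRNS_2024_01.csv", "CRNS_2024_02.csv", "notes.txt", "old_CRNS.csv"]
filtered = filter_file_names(files, prefix="CRNS", suffix=".csv")
print(filtered)  # ['CRNS_2024_01.csv', 'CRNS_2024_02.csv']
```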
ParseFilesIntoDataFrame ¶
Parses raw files into a single pandas DataFrame.
This class takes instances of ManageFileCollection and FileCollectionConfig to process and combine multiple data files into a single DataFrame, handling various file formats and parsing configurations.
Example:
    from neptoon.io.read.data_ingest import (
        FileCollectionConfig,
        ManageFileCollection,
        ParseFilesIntoDataFrame,
    )

    config = FileCollectionConfig(data_location='/path/to/data/folder/')
    file_manager = ManageFileCollection(config=config)
    file_parser = ParseFilesIntoDataFrame(file_manager, config)
    df = file_parser.make_dataframe()
Initialisation parameters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_manager | ManageFileCollection | An instance of the ManageFileCollection class | required |
| config | FileCollectionConfig | The config object containing information to support processing. | required |
make_dataframe ¶
Merges, parses, and converts the data into a single DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column_names | list | Can supply custom column_names for saving file, by default None | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with all data |
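To illustrate how settings such as skip_lines, separator, decimal, and remove_prefix interact during parsing, here is a minimal standard-library sketch. It is an assumption-laden illustration rather than the parser neptoon uses: `parse_text` is a hypothetical helper, it returns a list of dictionaries instead of a DataFrame, and the sample file content is invented.

```python
import csv


def parse_text(raw_text, separator=",", decimal=".", skip_lines=0, remove_prefix="//"):
    """Parse delimited text into a list of row dictionaries (hypothetical helper)."""
    lines = raw_text.splitlines()[skip_lines:]  # drop leading non-data lines
    header_line = lines[0]
    if remove_prefix and header_line.startswith(remove_prefix):
        header_line = header_line[len(remove_prefix):]  # e.g. "//Date" -> "Date"
    header_reader = csv.reader([header_line], delimiter=separator, skipinitialspace=True)
    names = [name.strip() for name in next(header_reader)]
    rows = []
    for record in csv.reader(lines[1:], delimiter=separator, skipinitialspace=True):
        if not record:  # skip blank lines
            continue
        # Normalise the decimal character so the values can be cast to float later
        values = [value.strip().replace(decimal, ".") for value in record]
        rows.append(dict(zip(names, values)))
    return rows


raw = "logger export v1\n//Date;P1_mb;N1\n2024-04-22;1013,2;512\n2024-04-23;1009,8;498\n"
rows = parse_text(raw, separator=";", decimal=",", skip_lines=1)
print(rows[0])  # {'Date': '2024-04-22', 'P1_mb': '1013.2', 'N1': '512'}
```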
InputDataFrameFormattingConfig ¶
InputDataFrameFormattingConfig(path_to_config=None, pressure_merge_method=PRIORITY, pressure_units=HECTOPASCALS, temperature_merge_method=PRIORITY, relative_humidity_merge_method=PRIORITY, neutron_count_units=ABSOLUTE_COUNT, date_time_columns=None, date_time_format='%Y/%m/%d %H:%M:%S', initial_time_zone='utc', convert_time_zone_to='utc', is_timestamp=False, decimal='.', start_date_of_data=None)
Configuration class storing necessary attributes to format a DataFrame using the FormatDataForCRNSDataHub.
A class storing information that supports automated processing of raw input CRNS data files into a neptoon-ready DataFrame (for use in FormatDataForCRNSDataHub).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_to_config | Union[str, Path] | Path to the sensor configuration file, by default None | None |
| output_resolution | str | The desired time resolution of the dataframe to aggregate to. If None, no time aggregation is done. Otherwise in format | required |
| aggregate_method | Literal['fagg', 'bagg', 'nagg'] | Specifies which intervals are aggregated for a certain timestamp (preceding, succeeding or “surrounding” interval). | required |
| aggregate_func | str | Aggregation function, by default mean. | required |
| aggregate_maxna_fraction | float | Maximum fraction of values in the aggregation period that can be NaN. If set to 0.3, at most 30% of the values can be NaN; by default 0.5 | required |
| align_timestamps | | Whether to align the time stamps to a regular time, e.g., if time_resolution is 1 hour, 13:10 becomes 13:00; by default False. | required |
| align_method | | The alignment method to use, by default "time"; see https://rdm-software.pages.ufz.de/saqc/_api/saqc.SaQC.html#saqc.SaQC.align | required |
| pressure_merge_method | MergeMethod | Method used to merge multiple pressure columns, by default MergeMethod.PRIORITY | PRIORITY |
| pressure_units | PressureUnits | States the units of pressure for the input data, which will be converted to HECTOPASCALS | HECTOPASCALS |
| temperature_merge_method | MergeMethod | Method used to merge multiple temperature columns, by default MergeMethod.PRIORITY | PRIORITY |
| relative_humidity_merge_method | MergeMethod | Method used to merge multiple relative humidity columns, by default MergeMethod.PRIORITY | PRIORITY |
| neutron_count_units | NeutronCountUnits | The units of neutron counts, by default NeutronCountUnits.ABSOLUTE_COUNT | ABSOLUTE_COUNT |
| date_time_columns | List[str] | Names of date time columns; if more than one is given, expects DATE + TIME, by default None | None |
| date_time_format | str | Format of the date time column, by default "%Y/%m/%d %H:%M:%S" | '%Y/%m/%d %H:%M:%S' |
| initial_time_zone | str | Initial time zone, by default "utc" | 'utc' |
| convert_time_zone_to | str | Desired time zone, by default "utc" | 'utc' |
| is_timestamp | bool | Whether the date time column is a timestamp, by default False | False |
| decimal | str | Decimal divider, by default "." | '.' |
| start_date_of_data | str \| DateTime | The beginning date from which data should be processed. All data before this date is removed during parsing. Should always be in format "%Y-%m-%d", e.g., 2024-04-22 | None |
Notes
For time_resolution,
For *_merge_method parameters:
- MergeMethod.MEAN: Average of all columns with the same data type.
- MergeMethod.PRIORITY: Select one column from the available columns based on predefined priority.
add_column_meta_data ¶
Adds an InputColumnMetaData class to the column_data attribute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| initial_name | str | The name of the column from the original raw data | required |
| variable_type | InputColumnDataType | Enum of the column data type: see InputColumnDataType | required |
| unit | str | The units of the column e.g., "hectopascals" | required |
| priority | int | The priority of the column - 1 being highest. Needed when multiple columns are present and the user wants to use the priority merge method (i.e., choose the best column for a data type). | required |
import_config ¶
Automatically assigns the internal attributes using a provided YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path_to_config | str | Location of the YAML file. If not supplied here, it expects that the self.path_to_config attribute is already set, by default None | None |
Raises:
| Type | Description |
|---|---|
| ValueError | When no path is given but the method is called. |
assign_merge_methods ¶
Assigns the merge method for each of the input columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column_data_type | InputColumnDataType | The variable being assigned (as an InputColumnDataType) | required |
| merge_method | str | The selected merge method | required |
add_meteo_columns ¶
Adds column meta data to the class instance. Intended for use when importing attributes with the YAML file.
There can be more than one column recording the same variable. These are recorded in the YAML in priority order e.g.,:
pressure_columns:
- P4_mb # first priority goes first
- P3_mb
- P1_mb
This method will go through the list in priority order, create an InputColumnMetaData class for each column, assign the appropriate values, and add it to self.column_data using the method self.add_column_meta_data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| meteo_columns | List | A list of column names | required |
| meteo_type | InputColumnDataType | The type of column being attributed | required |
| unit | str | The units associated with the column | required |
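The priority-order behaviour described for this method can be sketched as follows. This is a simplified illustration, not the library's code: `ColumnMeta` is a hypothetical stand-in for InputColumnMetaData, `build_column_meta` is a hypothetical helper, and plain strings replace the InputColumnDataType enum.

```python
from dataclasses import dataclass


@dataclass
class ColumnMeta:
    """Hypothetical stand-in for InputColumnMetaData."""

    initial_name: str
    variable_type: str
    unit: str
    priority: int


def build_column_meta(meteo_columns, meteo_type, unit):
    """Assign priority 1 to the first listed column, 2 to the next, and so on."""
    return [
        ColumnMeta(initial_name=name, variable_type=meteo_type, unit=unit, priority=priority)
        for priority, name in enumerate(meteo_columns, start=1)
    ]


# Column order taken from the YAML example above: first entry has highest priority
pressure_meta = build_column_meta(["P4_mb", "P3_mb", "P1_mb"], "PRESSURE", "hectopascals")
for meta in pressure_meta:
    print(meta.initial_name, meta.priority)
```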
add_date_time_column_info ¶
add_date_time_column_info(date_time_columns, date_time_format, initial_time_zone, convert_time_zone_to='UTC')
Adds datetime column information. Intended for use when importing attributes with the YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| date_time_columns | List | Names of date time columns | required |
| date_time_format | str | The expected format of the date time values. | required |
| initial_time_zone | str | The initial time zone of the data | required |
| convert_time_zone_to | str | The desired time zone, by default "UTC" | 'UTC' |
FormatDataForCRNSDataHub ¶
Formats a DataFrame into the required format to work in neptoon.
Key features:
- Combines multiple datetime columns (e.g., DATE + TIME) into a single date_time column
- Converts time zone (default UTC)
- Ensures date time index
- Ensures columns are numeric
- Organises columns when multiple are present
Attributes:
| Name | Type | Description |
|---|---|---|
| data_frame | DataFrame | The time series dataframe |
| config | InputDataFrameFormattingConfig | Config object with information about the dataframe, which supports formatting |
Attributes of class
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_frame | DataFrame | The un-formatted dataframe | required |
| config | InputDataFrameFormattingConfig | Config object which sets the options for formatting, by default None | required |
extract_date_time_column ¶
Create a Datetime column, merge columns if necessary (e.g., when columns are split into date and time)
Returns: pd.Series: the Datetime column.
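As an illustration of merging split date and time columns, the sketch below joins the listed columns with a space and parses the result with the configured format. It uses only the standard library and lists of dictionaries instead of a DataFrame; `combine_date_time` and the sample rows are hypothetical, and joining with a single space is an assumption.

```python
from datetime import datetime


def combine_date_time(rows, date_time_columns, date_time_format):
    """Join the date time columns with a space and parse them (hypothetical helper)."""
    parsed = []
    for row in rows:
        text = " ".join(row[column] for column in date_time_columns)
        parsed.append(datetime.strptime(text, date_time_format))
    return parsed


rows = [
    {"DATE": "2024/04/22", "TIME": "13:10:00"},
    {"DATE": "2024/04/22", "TIME": "14:10:00"},
]
stamps = combine_date_time(rows, ["DATE", "TIME"], "%Y/%m/%d %H:%M:%S")
print(stamps[0])  # 2024-04-22 13:10:00
```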
convert_time_zone ¶
Convert the time zone of a date time series. Uses the attributes initial_time_zone (the time zone the data is currently in) and convert_time_zone_to (the desired time zone). By default the data is converted to UTC.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| date_time_series | Series | The date_time_series that is converted | required |
Returns:
| Type | Description |
|---|---|
| Series | The converted date_time series in the correct time zone |
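The two-step idea (label naive timestamps with their initial time zone, then convert to the target zone) can be sketched with the standard library. This is not the library's implementation: the function below is a hypothetical list-based helper, and a fixed UTC+1 offset stands in for a named time zone.

```python
from datetime import datetime, timedelta, timezone


def convert_time_zone(stamps, initial_tz, target_tz=timezone.utc):
    """Attach `initial_tz` to naive timestamps, then convert to `target_tz`."""
    return [stamp.replace(tzinfo=initial_tz).astimezone(target_tz) for stamp in stamps]


utc_plus_one = timezone(timedelta(hours=1))  # fixed offset, assumed for illustration
local_stamps = [datetime(2024, 1, 15, 13, 0)]
converted = convert_time_zone(local_stamps, utc_plus_one)
print(converted[0])  # 2024-01-15 12:00:00+00:00
```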
date_time_as_index ¶
Sets a date_time column as the index of the contained DataFrame
Returns: pd.DataFrame: data with a DatetimeIndex
get_conversion_factor_to_cph ¶
Figures out the factor needed to multiply a count rate by to convert it to counts per hour. Uses the time_resolution attribute for this calculation.
Returns:
| Type | Description |
|---|---|
| float | The factor to convert to counts per hour |
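A rough sketch of how such a factor might be derived is given below. The unit strings are assumptions loosely modelled on NeutronCountUnits (only ABSOLUTE_COUNT appears in this reference), and `conversion_factor_to_cph` is a hypothetical function, not the method itself.

```python
from datetime import timedelta


def conversion_factor_to_cph(units, time_resolution=None):
    """Return the multiplier that turns a count value into counts per hour.

    `units` is a plain string here; the names are assumed for illustration.
    """
    if units == "counts_per_hour":
        return 1.0
    if units == "counts_per_second":
        return 3600.0
    if units == "absolute_count":
        if time_resolution is None:
            raise ValueError("time_resolution is needed for absolute counts")
        # Number of recording intervals that fit into one hour
        return timedelta(hours=1) / time_resolution
    raise ValueError(f"Unknown units: {units}")


print(conversion_factor_to_cph("absolute_count", timedelta(minutes=15)))  # 4.0
```

With 15-minute absolute counts, each value is multiplied by 4 to express it per hour.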
standardise_units_of_pressure ¶
Standardises units of pressure to hectopascals
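A minimal sketch of such a conversion is shown below. The unit names other than hectopascals are assumptions (this reference does not list the members of PressureUnits), and `to_hectopascals` is a hypothetical helper working on plain lists.

```python
# Multipliers to hectopascals; unit names are assumed for illustration
TO_HECTOPASCALS = {
    "hectopascals": 1.0,
    "pascals": 0.01,      # 1 Pa = 0.01 hPa
    "kilopascals": 10.0,  # 1 kPa = 10 hPa
}


def to_hectopascals(values, units):
    """Convert a list of pressure values to hectopascals."""
    try:
        factor = TO_HECTOPASCALS[units]
    except KeyError:
        raise ValueError(f"Unsupported pressure units: {units}") from None
    return [value * factor for value in values]


pressure_hpa = to_hectopascals([101325.0, 100000.0], "pascals")
print(pressure_hpa)  # roughly 1013.25 and 1000.0 hPa
```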
merge_multiple_meteo_columns ¶
Merges columns when multiple are available. Many CRNS have multiple sensors available in the input dataset (e.g., 2 or more pressure sensors at the site). We need only one value for each of these variables. This method uses the settings in the InputDataFrameFormattingConfig class to produce a single value for the selected variable.
Current options (set in the config file):
- mean: create an average of all the sensors of that type
- priority: use one sensor selected as priority
Future options:
- priority_filled: use one sensor as priority and fill values from alternative sensors
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column_data_type | InputColumnDataType | One of the possible InputColumnDataTypes that can be used here. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If an incompatible InputColumnDataType is given |
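The difference between the two current merge options can be sketched in plain Python. This is an illustration under assumptions, not the library's code: `merge_columns` is hypothetical, rows are dictionaries, and None stands in for a missing value.

```python
from statistics import mean


def merge_columns(rows, columns_by_priority, method="priority"):
    """Reduce several sensor columns to one value per row (hypothetical helper)."""
    merged = []
    for row in rows:
        if method == "priority":
            # Take the highest-priority column only; gaps are not filled
            merged.append(row[columns_by_priority[0]])
        elif method == "mean":
            values = [row[c] for c in columns_by_priority if row[c] is not None]
            merged.append(mean(values) if values else None)
        else:
            raise ValueError(f"Unknown merge method: {method}")
    return merged


rows = [
    {"P4_mb": 1000.0, "P3_mb": 1002.0},
    {"P4_mb": None, "P3_mb": 998.0},
]
print(merge_columns(rows, ["P4_mb", "P3_mb"], method="priority"))  # [1000.0, None]
print(merge_columns(rows, ["P4_mb", "P3_mb"], method="mean"))      # [1001.0, 998.0]
```

The second row shows why a gap-filling variant is listed as a future option: with the priority method, a gap in the preferred sensor stays a gap.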
prepare_key_columns ¶
Prepares the key columns if all the information has been supplied.
prepare_neutron_count_columns ¶
Prepares the neutron columns for usage in neptoon. Performs several steps:
- Finds the columns labeled with neutron_column_type
- If more than one it will sum them into a new column
- Check the units and convert to counts per hour.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| neutron_column_type | Literal[EPI_NEUTRON_COUNT, THERM_NEUTRON_COUNT] | The type of neutron data being processed | required |
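The steps listed for this method (find the columns, sum them, convert to counts per hour) can be sketched as below. The helper name, column names, and sample counts are hypothetical, and the factor of 4.0 assumes absolute counts recorded every 15 minutes.

```python
def prepare_neutron_counts(rows, neutron_columns, factor_to_cph):
    """Sum the neutron columns per row and convert to counts per hour."""
    return [
        sum(row[column] for column in neutron_columns) * factor_to_cph
        for row in rows
    ]


rows = [
    {"N1": 120, "N2": 130},
    {"N1": 110, "N2": 115},
]
# 15-minute absolute counts: four intervals per hour
counts_per_hour = prepare_neutron_counts(rows, ["N1", "N2"], factor_to_cph=4.0)
print(counts_per_hour)  # [1000.0, 900.0]
```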
clean_raw_dataframe ¶
Cleans raw DataFrame by removing NaT values and duplicated rows.
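A list-based sketch of this kind of cleaning is shown below. It is only an illustration: `clean_rows` is hypothetical, None stands in for pandas NaT, and a row counts as duplicated when both its timestamp and value repeat, which is an assumption about the duplicate criterion.

```python
from datetime import datetime


def clean_rows(rows):
    """Drop rows without a timestamp and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for stamp, value in rows:
        if stamp is None:  # stands in for NaT
            continue
        if (stamp, value) in seen:
            continue
        seen.add((stamp, value))
        cleaned.append((stamp, value))
    return cleaned


raw_rows = [
    (datetime(2024, 4, 22, 13), 512),
    (None, 500),
    (datetime(2024, 4, 22, 13), 512),
    (datetime(2024, 4, 22, 14), 498),
]
cleaned_rows = clean_rows(raw_rows)
print(len(cleaned_rows))  # 2
```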
calc_neutron_uncertainty ¶
Creates a column with the statistical uncertainty of the neutron column and converts this value to counts per hour.
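One common way to estimate the statistical uncertainty of a count is the Poisson standard deviation, the square root of the raw count. The sketch below assumes that approach; this reference does not state which formula the method uses, and `neutron_uncertainty_cph` is a hypothetical helper.

```python
import math


def neutron_uncertainty_cph(raw_counts, factor_to_cph):
    """Poisson uncertainty sqrt(N) of each raw count, scaled to counts per hour."""
    return [math.sqrt(count) * factor_to_cph for count in raw_counts]


# 15-minute absolute counts, so the factor to counts per hour is 4
uncertainty = neutron_uncertainty_cph([400, 900], factor_to_cph=4.0)
print(uncertainty)  # [80.0, 120.0]
```

Taking the square root of the raw count before scaling matters: applying it to the already converted rate would give a different result.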
format_data_and_return_data_frame ¶
Completes the whole process of formatting the dataframe. Expects the settings to be fully implemented.
Returns:
| Type | Description |
|---|---|
| DataFrame | The formatted DataFrame |
CollectAndParseRawData ¶
Central class which allows us to do the entire ingest and formatting routine. Designed to work with a YAML file.
create_data_frame ¶
Creates the data frame by parsing raw data files into a DataFrame. It expects to use a YAML file.
Returns:
| Type | Description |
|---|---|
| DataFrame | The DataFrame created by parsing the raw data files |