graphdoc.data package

Submodules

graphdoc.data.helper module

graphdoc.data.helper.check_directory_path(directory_path: str | Path) None[source]

Check if the provided path resolves to a valid directory.

Parameters:

directory_path (Union[str, Path]) – The path to check.

Raises:

ValueError – If the path does not resolve to a valid directory.

Returns:

None

Return type:

None

graphdoc.data.helper.check_file_path(file_path: str | Path) None[source]

Check if the provided path resolves to a valid file.

Parameters:

file_path (Union[str, Path]) – The path to check.

Raises:

ValueError – If the path does not resolve to a valid file.

Returns:

None

Return type:

None
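The two path checks above can be approximated with pathlib alone. The following is a minimal sketch, not the package's implementation: the `_sketch` function names and the error messages are assumptions, and only the documented contract (return None, raise ValueError on an invalid path) is taken from the reference.

```python
from pathlib import Path
from typing import Union


def check_directory_path_sketch(directory_path: Union[str, Path]) -> None:
    """Raise ValueError unless the path resolves to an existing directory."""
    resolved = Path(directory_path).resolve()
    if not resolved.is_dir():
        raise ValueError(f"{resolved} is not a valid directory")


def check_file_path_sketch(file_path: Union[str, Path]) -> None:
    """Raise ValueError unless the path resolves to an existing file."""
    resolved = Path(file_path).resolve()
    if not resolved.is_file():
        raise ValueError(f"{resolved} is not a valid file")


# The current working directory always exists, so this call returns None.
check_directory_path_sketch(Path.cwd())
```

Both helpers return nothing on success, so callers typically invoke them for the side effect of raising early, before any file is opened.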

graphdoc.data.helper._env_constructor(loader: SafeLoader, node: ScalarNode) str[source]

Custom constructor for environment variables.

Parameters:
  • loader (yaml.SafeLoader) – The YAML loader.

  • node (yaml.nodes.ScalarNode) – The node to construct.

Returns:

The environment variable value.

Return type:

str

Raises:

ValueError – If the environment variable is not set.

graphdoc.data.helper.load_yaml_config(file_path: str | Path, use_env: bool = True) dict[source]

Load a YAML configuration file.

Parameters:
  • file_path (Union[str, Path]) – The path to the YAML file.

  • use_env (bool) – Whether to use environment variables.

Returns:

The YAML configuration.

Return type:

dict

Raises:

ValueError – If the path does not resolve to a valid file or the environment variable is not set.

graphdoc.data.helper.load_yaml_config_redacted(file_path: str | Path, replace_value: str = 'redacted') dict[source]

Load a YAML configuration file with environment variables redacted.

Parameters:
  • file_path (Union[str, Path]) – The path to the YAML file.

  • replace_value (str) – The value to replace the environment variables with.

Returns:

The YAML configuration with environment variable values replaced by replace_value ('redacted' by default).

Return type:

dict

Raises:

ValueError – If the path does not resolve to a valid file.
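The difference between the plain and the redacted loader can be illustrated without a YAML parser. The sketch below covers only the per-variable lookup step, under the assumption that each environment reference in the file is resolved by name; `resolve_env` and the variable names are hypothetical, and the YAML tag wiring is left out on purpose.

```python
import os
from typing import Optional


def resolve_env(name: str, replace_value: Optional[str] = None) -> str:
    """Resolve one environment variable reference.

    With replace_value set, the real value is never read (redacted mode).
    Otherwise an unset variable raises ValueError, as documented above.
    """
    if replace_value is not None:
        return replace_value
    value = os.environ.get(name)
    if value is None:
        raise ValueError(f"Environment variable {name!r} is not set")
    return value


# Hypothetical variable name, set here only so the example is self-contained.
os.environ["GRAPHDOC_SKETCH_TOKEN"] = "s3cret"
plain = resolve_env("GRAPHDOC_SKETCH_TOKEN")
redacted = resolve_env("GRAPHDOC_SKETCH_TOKEN", replace_value="redacted")
```

Under this reading, the redacted variant has no reason to fail on an unset variable, which is consistent with its Raises entry listing only the invalid-file case.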

graphdoc.data.helper.setup_logging(log_level: Literal['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'])[source]

Set up logging for the application.

Parameters:

log_level (Literal["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]) – The log level.
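A function with this signature is commonly a thin wrapper around the standard logging module. The sketch below shows one plausible shape; the format string and the use of `force=True` are assumptions, not details taken from graphdoc.

```python
import logging
from typing import Literal

LogLevel = Literal["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def setup_logging_sketch(log_level: LogLevel) -> None:
    """Configure the root logger at the requested level."""
    logging.basicConfig(
        level=getattr(logging, log_level),  # e.g. "INFO" -> logging.INFO
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers left over from an earlier call
    )


setup_logging_sketch("WARNING")
logging.getLogger("graphdoc.sketch").warning("visible at WARNING level")
```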

graphdoc.data.local module

class graphdoc.data.local.LocalDataHelper(schema_directory_path: str | Path | None = None, categories: Type[Enum] = SchemaCategory, ratings: Type[Enum] = SchemaRating, categories_ratings: Callable = SchemaCategoryRatingMapping.get_rating)[source]

Bases: object

A helper class for loading data from a directory.

Parameters:
  • schema_directory_path (Optional[Union[str, Path]]) – The path to the directory containing the schemas. Defaults to the path to the schemas in the graphdoc package.

  • categories (Type[Enum]) – The categories of the schemas. Defaults to SchemaCategory.

  • ratings (Type[Enum]) – The ratings of the schemas. Defaults to SchemaRating.

  • categories_ratings – A callable that maps categories to ratings. Defaults to SchemaCategoryRatingMapping.get_rating.

schema_objects_from_folder(category: str, rating: int, folder_path: str | Path) dict[str, SchemaObject][source]

Load schemas from a folder, keeping the difficulty tag.

Parameters:
  • category (str) – The category of the schemas

  • rating (int) – The rating of the schemas

  • folder_path (Union[str, Path]) – The path to the folder containing the schemas

Returns:

A dictionary of schemas

Return type:

dict[str, SchemaObject]

schema_objects_from_folder_of_folders(folder_paths: Type[Enum] | None = SchemaCategoryPath) Dict[str, SchemaObject] | None[source]

Load a folder of folders containing schemas, keeping the difficulty tag.

Parameters:

folder_paths (Optional[Type[Enum]]) – Enum class defining folder paths, defaults to SchemaCategoryPath. Must have a get_path method.

Returns:

Dictionary of loaded schemas

Return type:

Union[Dict[str, SchemaObject], None]

folder_to_dataset(category: str, folder_path: str | Path, parse_objects: bool = True, type_mapping: dict[type, str] | None = None) Dataset[source]

Load a folder of schemas, keeping the difficulty tag.

Parameters:
  • category (str) – The category of the schemas

  • folder_path (Union[str, Path]) – The path to the folder containing the schemas

  • parse_objects (bool) – Whether to parse the objects from the schemas

  • type_mapping (Optional[dict[type, str]]) – A dictionary mapping GraphQL AST node types to strings

Returns:

A dataset containing the schemas

Return type:

Dataset

folder_of_folders_to_dataset(folder_paths: Type[Enum] = SchemaCategoryPath, parse_objects: bool = True, type_mapping: dict[type, str] | None = None) Dataset[source]

Load a folder of folders containing schemas, keeping the difficulty tag.

Parameters:
  • folder_paths (Type[Enum]) – Enum class defining folder paths, defaults to SchemaCategoryPath. Must have a get_path method.

  • parse_objects (bool) – Whether to parse the objects from the schemas

  • type_mapping (Optional[dict[type, str]]) – A dictionary mapping GraphQL AST node types to strings

Returns:

A dataset containing the schemas

Return type:

Dataset

graphdoc.data.parser module

class graphdoc.data.parser.Parser(type_mapping: dict[type, str] | None = None)[source]

Bases: object

A class for parsing and handling of GraphQL objects.

DEFAULT_NODE_TYPES = {<class 'graphql.language.ast.DocumentNode'>: 'full schema', <class 'graphql.language.ast.EnumTypeDefinitionNode'>: 'enum schema', <class 'graphql.language.ast.EnumValueDefinitionNode'>: 'enum value', <class 'graphql.language.ast.ObjectTypeDefinitionNode'>: 'table schema'}

static _check_node_type(node: Node, type_mapping: dict[type, str] | None = None) str[source]

Check the type of a schema node.

Parameters:
  • node (Node) – The schema node to check

  • type_mapping (Optional[dict[type, str]]) – Custom mapping of node types to strings. Defaults to DEFAULT_NODE_TYPES

Returns:

The type of the schema node

Return type:

str

static parse_schema_from_file(schema_file: str | Path, schema_directory_path: str | Path | None = None) DocumentNode[source]

Parse a schema from a file.

Parameters:
  • schema_file (Union[str, Path]) – The name of the schema file

  • schema_directory_path (Optional[Union[str, Path]]) – A path to a directory containing schemas

Returns:

The parsed schema

Return type:

DocumentNode

Raises:

Exception – If the schema cannot be parsed

static update_node_descriptions(node: Node, new_value: str | None = None) Node[source]

Given a GraphQL node, recursively traverse the node and its children, updating all descriptions with the new value. Can also be used to remove descriptions by passing None as the new value.

Parameters:
  • node (Node) – The GraphQL node to update

  • new_value (Optional[str]) – The new description value. If None, the description will be removed.

Returns:

The updated node

Return type:

Node

static count_description_pattern_matching(node: Node, pattern: str) dict[str, int][source]

Counts the number of times a pattern matches a description in a node and its children.

Parameters:
  • node (Node) – The GraphQL node to count the pattern matches in

  • pattern (str) – The pattern to count the matches of

Returns:

A dictionary with the counts of matches

Return type:

dict[str, int]

static fill_empty_descriptions(node: Node, new_column_value: str = 'Description for column: {}', new_table_value: str = 'Description for table: {}', use_value_name: bool = True, value_name: str | None = None)[source]

Recursively traverse the node and its children, filling in empty descriptions with the new column or table value. Do not update descriptions that already have a value. Default values are provided for the new column and table descriptions.

Parameters:
  • node (Node) – The GraphQL node to update

  • new_column_value (str) – The new column description value

  • new_table_value (str) – The new table description value

  • use_value_name (bool) – Whether to use the value name in the description

  • value_name (Optional[str]) – The name of the value

Returns:

The updated node

Return type:

Node
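The fill-in behaviour described above (recurse, leave existing descriptions alone, format the template with the element name) can be sketched on a small stand-in tree. `FakeNode` and `fill_empty` are hypothetical and replace the real graphql AST types so the example stays self-contained; only the two default templates come from the signature above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeNode:
    """Stand-in for a GraphQL AST node (assumption, not the real type)."""
    name: str
    kind: str  # "table" or "column"
    description: Optional[str] = None
    children: List["FakeNode"] = field(default_factory=list)


def fill_empty(
    node: FakeNode,
    new_column_value: str = "Description for column: {}",
    new_table_value: str = "Description for table: {}",
) -> FakeNode:
    """Fill missing descriptions in place; keep existing ones untouched."""
    if not node.description:
        template = new_table_value if node.kind == "table" else new_column_value
        node.description = template.format(node.name)
    for child in node.children:
        fill_empty(child, new_column_value, new_table_value)
    return node


table = FakeNode(
    "users",
    "table",
    children=[
        FakeNode("id", "column"),
        FakeNode("email", "column", description="Primary email"),
    ],
)
fill_empty(table)
```

After the call, `users` and `id` carry generated descriptions, while `email` keeps the text it already had.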

static schema_equality_check(gold_node: Node, check_node: Node) bool[source]

Check whether two schema nodes are equal. Only the schema structures are compared, not the descriptions.

Parameters:
  • gold_node (Node) – The gold standard schema node

  • check_node (Node) – The schema node to check

Returns:

Whether the schemas are equal

Return type:

bool
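One way to compare structure while ignoring descriptions is to strip every description before testing equality. The sketch below applies that idea to nested dicts and lists rather than real AST nodes; the dict layout and both function names are assumptions made for illustration.

```python
from typing import Any


def strip_descriptions(node: Any) -> Any:
    """Return a copy of a nested dict/list structure without 'description' keys."""
    if isinstance(node, dict):
        return {
            key: strip_descriptions(value)
            for key, value in node.items()
            if key != "description"
        }
    if isinstance(node, list):
        return [strip_descriptions(item) for item in node]
    return node


def structures_equal(gold: Any, check: Any) -> bool:
    """Compare two schema-like structures, ignoring descriptions."""
    return strip_descriptions(gold) == strip_descriptions(check)


gold = {"name": "users", "description": "All users",
        "fields": [{"name": "id", "type": "ID", "description": "Key"}]}
redocumented = {"name": "users", "description": "Registered accounts",
                "fields": [{"name": "id", "type": "ID", "description": None}]}
altered = {"name": "users", "fields": [{"name": "id", "type": "Int"}]}
```

Here `gold` and `redocumented` differ only in their descriptions, whereas `altered` changes a field type and therefore counts as a structural change.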

static schema_object_from_file(schema_file: str | Path, category: str | None = None, rating: int | None = None) SchemaObject[source]

Parse a schema object from a file.

static parse_objects_from_full_schema_object(schema: SchemaObject, type_mapping: dict[type, str] | None = None) dict[str, SchemaObject] | None[source]

Parse out all available tables from a full schema object.

Parameters:
  • schema (SchemaObject) – The full schema object to parse

  • type_mapping (Optional[dict[type, str]]) – Custom mapping of node types to strings. Defaults to DEFAULT_NODE_TYPES

Returns:

The parsed objects (tables and enums)

Return type:

Union[dict, None]

graphdoc.data.schema module

class graphdoc.data.schema.SchemaCategory(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Schema quality categories enumeration.

PERFECT = 'perfect'
ALMOST_PERFECT = 'almost perfect'
POOR_BUT_CORRECT = 'poor but correct'
INCORRECT = 'incorrect'
BLANK = 'blank'

classmethod from_str(value: str) SchemaCategory | None[source]

class graphdoc.data.schema.SchemaRating(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Schema quality ratings enumeration.

FOUR = '4'
THREE = '3'
TWO = '2'
ONE = '1'
ZERO = '0'

classmethod from_value(value: str | int) SchemaRating | None[source]

class graphdoc.data.schema.SchemaCategoryRatingMapping[source]

Bases: object

Mapping between schema categories and ratings.

static get_rating(category: SchemaCategory) SchemaRating[source]

Get the corresponding rating for a given schema category.

Parameters:

category – The schema category

Returns:

The corresponding rating

static get_category(rating: SchemaRating) SchemaCategory[source]

Get the corresponding category for a given schema rating.

Parameters:

rating – The schema rating

Returns:

The corresponding category
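The relationship between the two enumerations and the mapping class can be sketched as follows. The enum members and values are copied from the listings above. Two things are assumptions: that categories and ratings pair up in declaration order (perfect with 4, down to blank with 0), and that the lookup helpers return None for unknown input, as their `| None` return types suggest.

```python
from enum import Enum
from typing import Optional, Union


class SchemaCategory(str, Enum):
    PERFECT = "perfect"
    ALMOST_PERFECT = "almost perfect"
    POOR_BUT_CORRECT = "poor but correct"
    INCORRECT = "incorrect"
    BLANK = "blank"

    @classmethod
    def from_str(cls, value: str) -> Optional["SchemaCategory"]:
        try:
            return cls(value)
        except ValueError:
            return None  # assumed behaviour for unknown values


class SchemaRating(str, Enum):
    FOUR = "4"
    THREE = "3"
    TWO = "2"
    ONE = "1"
    ZERO = "0"

    @classmethod
    def from_value(cls, value: Union[str, int]) -> Optional["SchemaRating"]:
        try:
            return cls(str(value))
        except ValueError:
            return None  # assumed behaviour for unknown values


# ASSUMPTION: categories and ratings are paired in declaration order.
_CATEGORY_TO_RATING = dict(zip(SchemaCategory, SchemaRating))
_RATING_TO_CATEGORY = {r: c for c, r in _CATEGORY_TO_RATING.items()}


def get_rating(category: SchemaCategory) -> SchemaRating:
    return _CATEGORY_TO_RATING[category]


def get_category(rating: SchemaRating) -> SchemaCategory:
    return _RATING_TO_CATEGORY[rating]
```

Because both enums mix in `str`, their members compare equal to plain strings, which keeps them easy to serialize into dataset columns.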

class graphdoc.data.schema.SchemaType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Schema type enumeration.

FULL_SCHEMA = 'full schema'
TABLE_SCHEMA = 'table schema'
ENUM_SCHEMA = 'enum schema'

classmethod from_str(value: str) SchemaType | None[source]

class graphdoc.data.schema.SchemaCategoryPath(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Maps schema categories to their folder names.

PERFECT = 'perfect'
ALMOST_PERFECT = 'almost_perfect'
POOR_BUT_CORRECT = 'poor_but_correct'
INCORRECT = 'incorrect'
BLANK = 'blank'

classmethod get_path(category: SchemaCategory, folder_path: str | Path) Path | None[source]

Get the folder path for a given schema category and folder path.

Parameters:

category – The schema category

Returns:

The corresponding folder path
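A plausible reading of get_path is that it joins the base folder with the folder name registered for the category. The sketch below follows that reading; looking the category up by member name and returning None for an unknown category are both assumptions, and `get_path_sketch` is a hypothetical stand-in for the classmethod.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Union


class SchemaCategoryPath(str, Enum):
    PERFECT = "perfect"
    ALMOST_PERFECT = "almost_perfect"
    POOR_BUT_CORRECT = "poor_but_correct"
    INCORRECT = "incorrect"
    BLANK = "blank"


def get_path_sketch(
    category_name: str, folder_path: Union[str, Path]
) -> Optional[Path]:
    """Join folder_path with the folder name for the named category."""
    member = SchemaCategoryPath.__members__.get(category_name)
    if member is None:
        return None  # assumed behaviour for an unknown category
    return Path(folder_path) / member.value


almost_perfect_dir = get_path_sketch("ALMOST_PERFECT", "schemas")
```

Note that the folder names use underscores where the category values in SchemaCategory use spaces.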

class graphdoc.data.schema.SchemaObject(key: str, category: Enum | None = None, rating: Enum | None = None, schema_name: str | None = None, schema_type: Enum | None = None, schema_str: str | None = None, schema_ast: Node | None = None)[source]

Bases: object

Schema object containing schema data and metadata.

key: str
category: Enum | None = None
rating: Enum | None = None
schema_name: str | None = None
schema_type: Enum | None = None
schema_str: str | None = None
schema_ast: Node | None = None

classmethod from_dict(data: dict, category_enum: Type[Enum] = SchemaCategory, rating_enum: Type[Enum] = SchemaRating, type_enum: Type[Enum] = SchemaType) SchemaObject[source]

Create SchemaObject from dictionary with validation.

Parameters:
  • data – The data dictionary

  • category_enum – Custom Enum class for categories

  • rating_enum – Custom Enum class for ratings

  • type_enum – Custom Enum class for schema types

to_dict() dict[source]

Convert the SchemaObject to a dictionary, excluding the key field.

Returns:

Dictionary representation of the SchemaObject without the key

Return type:

dict
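The "dictionary without the key" behaviour of to_dict can be sketched with a dataclass. `SchemaObjectSketch` is a hypothetical stand-in: it assumes the real class is dataclass-like, uses plain strings instead of enums, and omits schema_ast to stay free of third-party imports.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SchemaObjectSketch:
    key: str
    category: Optional[str] = None
    rating: Optional[str] = None
    schema_name: Optional[str] = None
    schema_type: Optional[str] = None
    schema_str: Optional[str] = None

    def to_dict(self) -> dict:
        """Return all fields except the identifying key."""
        data = asdict(self)
        data.pop("key")
        return data


obj = SchemaObjectSketch(
    key="users.graphql",
    category="perfect",
    rating="4",
    schema_name="users",
    schema_type="table schema",
    schema_str="type User { id: ID }",
)
row = obj.to_dict()
```

Dropping the key suits a layout in which the key indexes a mapping of schema objects and the remaining fields form the row stored under it.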

static _hf_schema_object_columns() Features[source]

Return the columns for the graph_doc dataset, based on the SchemaObject fields.

Returns:

The columns for the graph_doc dataset

Return type:

Features

to_dataset() Dataset[source]

Convert the SchemaObject to a Hugging Face Dataset.

Returns:

The Hugging Face Dataset

Return type:

Dataset

graphdoc.data.schema.schema_objects_to_dataset(schema_objects: List[SchemaObject]) Dataset[source]

Convert a list of SchemaObjects to a Hugging Face Dataset.

Parameters:

schema_objects – The list of SchemaObjects

Returns:

The Hugging Face Dataset

Module contents

class graphdoc.data.DspyDataHelper[source]

Bases: ABC

Abstract class for creating data objects related to a given dspy.Signature.

prompt_signature() Signature | SignatureMeta[source]

Given a prompt, return a dspy.Signature object.

Parameters:

prompt (Any) – A prompt.

static _(prompt: Predict) Signature | SignatureMeta[source]

Given a dspy.Predict object, return a dspy.Signature object.

static formatted_signature(signature: Signature | SignatureMeta, example: Example) str[source]

Given a dspy.Signature and a dspy.Example, return a formatted signature as a string.

Parameters:
  • signature (dspy.Signature) – A dspy.Signature object.

  • example (dspy.Example) – A dspy.Example object.

Returns:

A formatted signature as a string.

Return type:

str

abstract static example(inputs: dict[str, Any]) Example[source]

Given a dictionary of inputs, return a dspy.Example object.

Parameters:

inputs (dict[str, Any]) – A dictionary of inputs.

Returns:

A dspy.Example object.

Return type:

dspy.Example

abstract static example_example() Example[source]

Return an example dspy.Example object with the inputs set to the example values.

Returns:

A dspy.Example object.

Return type:

dspy.Example

abstract static model_signature() ModelSignature[source]

Return a mlflow.models.ModelSignature object. Based on the example object, removes the output fields and utilizes the remaining fields to infer the model signature.

Returns:

A mlflow.models.ModelSignature object.

Return type:

mlflow.models.ModelSignature

abstract static prediction(inputs: dict[str, Any]) Prediction[source]

Given a dictionary of inputs, return a dspy.Prediction object.

Parameters:

inputs (dict[str, Any]) – A dictionary of inputs.

Returns:

A dspy.Prediction object.

Return type:

dspy.Prediction

abstract static prediction_example() Prediction[source]

Return an example dspy.Prediction object with the inputs set to the example values.

Returns:

A dspy.Prediction object.

Return type:

dspy.Prediction

abstract static trainset(inputs: dict[str, Any] | Dataset, filter_args: dict[str, Any] | None = None) list[Example][source]

Given a dictionary of inputs or a datasets.Dataset object, return a list of dspy.Example objects.

Parameters:
  • inputs (Union[dict[str, Any], datasets.Dataset]) – A dictionary of inputs or a datasets.Dataset object.

  • filter_args (Optional[dict[str, Any]]) – A dictionary of filter arguments. These are instructions for how we will filter and / or transform the inputs.

Returns:

A list of dspy.Example objects.

Return type:

list[dspy.Example]

class graphdoc.data.GenerationDataHelper[source]

Bases: DspyDataHelper

A helper class for creating data objects related to our Documentation Generation dspy.Signature.

The example signature is defined as:

database_schema: str = dspy.InputField()
documented_schema: str = dspy.OutputField()

static example(inputs: dict[str, Any]) Example[source]

Given a dictionary of inputs, return a dspy.Example object.

Parameters:

inputs (dict[str, Any]) – A dictionary of inputs.

Returns:

A dspy.Example object.

Return type:

dspy.Example

static example_example() Example[source]

Return an example dspy.Example object with the inputs set to the example values.

Returns:

A dspy.Example object.

Return type:

dspy.Example

static model_signature() ModelSignature[source]

Return a mlflow.models.ModelSignature object. Based on the example object, removes the output fields and utilizes the remaining fields to infer the model signature.

Returns:

A mlflow.models.ModelSignature object.

Return type:

mlflow.models.ModelSignature

static prediction(inputs: dict[str, Any]) Prediction[source]

Given a dictionary of inputs, return a dspy.Prediction object.

Parameters:

inputs (dict[str, Any]) – A dictionary of inputs.

Returns:

A dspy.Prediction object.

Return type:

dspy.Prediction

static prediction_example() Prediction[source]

Return an example dspy.Prediction object with the inputs set to the example values.

Returns:

A dspy.Prediction object.

Return type:

dspy.Prediction

static trainset(inputs: dict[str, Any] | Dataset, filter_args: dict[str, Any] | None = None) list[Example][source]

Given a dictionary of inputs or a datasets.Dataset object, return a list of dspy.Example objects.

Parameters:
  • inputs (Union[dict[str, Any], datasets.Dataset]) – A dictionary of inputs or a datasets.Dataset object.

  • filter_args (Optional[dict[str, Any]]) – A dictionary of filter arguments. These are instructions for how we will filter and / or transform the inputs.

Returns:

A list of dspy.Example objects.

Return type:

list[dspy.Example]
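The split between the abstract DspyDataHelper and a concrete helper such as GenerationDataHelper follows the usual abstract-static-method pattern. The sketch below reproduces that pattern with the standard library only. Plain dicts stand in for dspy.Example, the class names are hypothetical, and the empty-string default for a missing documented_schema is an assumption; the two field names come from the example signature above.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class DataHelperSketch(ABC):
    """Abstract contract: build one example, or a list of them."""

    @staticmethod
    @abstractmethod
    def example(inputs: Dict[str, Any]) -> Dict[str, Any]:
        ...

    @staticmethod
    @abstractmethod
    def trainset(inputs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        ...


class GenerationSketch(DataHelperSketch):
    """Concrete helper for a database_schema -> documented_schema task."""

    @staticmethod
    def example(inputs: Dict[str, Any]) -> Dict[str, Any]:
        return {
            "database_schema": inputs["database_schema"],
            "documented_schema": inputs.get("documented_schema", ""),
        }

    @staticmethod
    def trainset(inputs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return [GenerationSketch.example(row) for row in inputs]


rows = GenerationSketch.trainset([{"database_schema": "type User { id: ID }"}])
```

Because every abstract method is static, the concrete helpers can be called without creating an instance, while the abstract base itself still cannot be instantiated.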

class graphdoc.data.QualityDataHelper[source]

Bases: DspyDataHelper

A helper class for creating data objects related to our Documentation Quality dspy.Signature.

The example signature is defined as:

database_schema: str = dspy.InputField()
category: Literal["perfect", "almost perfect", "poor but correct", "incorrect"] = (
    dspy.OutputField()
)
rating: Literal[4, 3, 2, 1] = dspy.OutputField()

static example(inputs: dict[str, Any]) Example[source]

Given a dictionary of inputs, return a dspy.Example object.

Parameters:

inputs (dict[str, Any]) – A dictionary of inputs.

Returns:

A dspy.Example object.

Return type:

dspy.Example

static example_example() Example[source]

Return an example dspy.Example object with the inputs set to the example values.

Returns:

A dspy.Example object.

Return type:

dspy.Example

static model_signature() ModelSignature[source]

Return a mlflow.models.ModelSignature object. Based on the example object, removes the output fields and utilizes the remaining fields to infer the model signature.

Returns:

A mlflow.models.ModelSignature object.

Return type:

mlflow.models.ModelSignature

static prediction(inputs: dict[str, Any]) Prediction[source]

Given a dictionary of inputs, return a dspy.Prediction object.

Parameters:

inputs (dict[str, Any]) – A dictionary of inputs.

Returns:

A dspy.Prediction object.

Return type:

dspy.Prediction

static prediction_example() Prediction[source]

Return an example dspy.Prediction object with the inputs set to the example values.

Returns:

A dspy.Prediction object.

Return type:

dspy.Prediction

static trainset(inputs: dict[str, Any] | Dataset, filter_args: dict[str, Any] | None = None) list[Example][source]

Given a dictionary of inputs or a datasets.Dataset object, return a list of dspy.Example objects.

Parameters:
  • inputs (Union[dict[str, Any], datasets.Dataset]) – A dictionary of inputs or a datasets.Dataset object.

  • filter_args (Optional[dict[str, Any]]) – A dictionary of filter arguments. These are instructions for how we will filter and / or transform the inputs.

Returns:

A list of dspy.Example objects.

Return type:

list[dspy.Example]

class graphdoc.data.LocalDataHelper(schema_directory_path: str | Path | None = None, categories: Type[Enum] = SchemaCategory, ratings: Type[Enum] = SchemaRating, categories_ratings: Callable = SchemaCategoryRatingMapping.get_rating)[source]

Bases: object

A helper class for loading data from a directory.

Parameters:
  • schema_directory_path (Optional[Union[str, Path]]) – The path to the directory containing the schemas. Defaults to the path to the schemas in the graphdoc package.

  • categories (Type[Enum]) – The categories of the schemas. Defaults to SchemaCategory.

  • ratings (Type[Enum]) – The ratings of the schemas. Defaults to SchemaRating.

  • categories_ratings – A callable that maps categories to ratings. Defaults to SchemaCategoryRatingMapping.get_rating.

schema_objects_from_folder(category: str, rating: int, folder_path: str | Path) dict[str, SchemaObject][source]

Load schemas from a folder, keeping the difficulty tag.

Parameters:
  • category (str) – The category of the schemas

  • rating (int) – The rating of the schemas

  • folder_path (Union[str, Path]) – The path to the folder containing the schemas

Returns:

A dictionary of schemas

Return type:

dict[str, SchemaObject]

schema_objects_from_folder_of_folders(folder_paths: Type[Enum] | None = SchemaCategoryPath) Dict[str, SchemaObject] | None[source]

Load a folder of folders containing schemas, keeping the difficulty tag.

Parameters:

folder_paths (Optional[Type[Enum]]) – Enum class defining folder paths, defaults to SchemaCategoryPath. Must have a get_path method.

Returns:

Dictionary of loaded schemas

Return type:

Union[Dict[str, SchemaObject], None]

folder_to_dataset(category: str, folder_path: str | Path, parse_objects: bool = True, type_mapping: dict[type, str] | None = None) Dataset[source]

Load a folder of schemas, keeping the difficulty tag.

Parameters:
  • category (str) – The category of the schemas

  • folder_path (Union[str, Path]) – The path to the folder containing the schemas

  • parse_objects (bool) – Whether to parse the objects from the schemas

  • type_mapping (Optional[dict[type, str]]) – A dictionary mapping GraphQL AST node types to strings

Returns:

A dataset containing the schemas

Return type:

Dataset

folder_of_folders_to_dataset(folder_paths: Type[Enum] = SchemaCategoryPath, parse_objects: bool = True, type_mapping: dict[type, str] | None = None) Dataset[source]

Load a folder of folders containing schemas, keeping the difficulty tag.

Parameters:
  • folder_paths (Type[Enum]) – Enum class defining folder paths, defaults to SchemaCategoryPath. Must have a get_path method.

  • parse_objects (bool) – Whether to parse the objects from the schemas

  • type_mapping (Optional[dict[type, str]]) – A dictionary mapping GraphQL AST node types to strings

Returns:

A dataset containing the schemas

Return type:

Dataset

class graphdoc.data.MlflowDataHelper(mlflow_tracking_uri: str | Path, mlflow_tracking_username: str | None = None, mlflow_tracking_password: str | None = None)[source]

Bases: object

__init__(mlflow_tracking_uri: str | Path, mlflow_tracking_username: str | None = None, mlflow_tracking_password: str | None = None)[source]

A helper class for loading and saving models and metadata from mlflow.

Parameters:
  • mlflow_tracking_uri (Union[str, Path]) – The uri of the mlflow tracking server.

  • mlflow_tracking_username (Optional[str]) – The username for the mlflow tracking server.

  • mlflow_tracking_password (Optional[str]) – The password for the mlflow tracking server.

update_auth_env_vars(mlflow_tracking_username: str, mlflow_tracking_password: str)[source]

Update the authentication environment variables.

Parameters:
  • mlflow_tracking_username (str) – The username for the mlflow tracking server.

  • mlflow_tracking_password (str) – The password for the mlflow tracking server.

set_auth_env_vars()[source]

Set the authentication environment variables.

latest_model_version(model_name: str)[source]

Load the latest version of a model from mlflow.

Parameters:

model_name (str) – The name of the model to load.

Returns:

The loaded model.

model_by_name_and_version(model_name: str, model_version: str)[source]

Load a model from mlflow by name and version.

Parameters:
  • model_name (str) – The name of the model to load.

  • model_version (str) – The version of the model to load.

Returns:

The loaded model.

model_by_uri(model_uri: str)[source]

Load a model from mlflow by uri.

Parameters:

model_uri (str) – The uri of the model to load.

Returns:

The loaded model.

model_by_args(load_model_args: Dict[str, str])[source]

Given a dictionary of arguments, load a model from mlflow. Ordering is model_by_uri, model_by_name_and_version, latest_model_version.

Parameters:

load_model_args (Dict[str, str]) – A dictionary of arguments.

Returns:

The loaded model.
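The lookup order stated for model_by_args can be made concrete with a small dispatcher. The sketch below only decides which loader would be used and does not contact mlflow. The dictionary keys are assumptions borrowed from the parameter names of the three loader methods, and `pick_loader` is a hypothetical helper.

```python
from typing import Dict


def pick_loader(load_model_args: Dict[str, str]) -> str:
    """Return the name of the loader that the documented ordering selects."""
    if load_model_args.get("model_uri"):
        return "model_by_uri"
    if load_model_args.get("model_name") and load_model_args.get("model_version"):
        return "model_by_name_and_version"
    if load_model_args.get("model_name"):
        return "latest_model_version"
    raise ValueError("load_model_args contains no usable model reference")


choice = pick_loader({"model_name": "doc_generator", "model_version": "3"})
```

Checking the most specific reference first means a URI wins over a name even when both are supplied.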

save_model(model: Signature, model_signature: ModelSignature, model_name: str)[source]

Save a model to mlflow.

Parameters:
  • model (dspy.Signature) – The model to save.

  • model_signature (ModelSignature) – The signature of the model.

  • model_name (str) – The name of the model to save.

run_parameters(run_id: str) dict[str, Any][source]

Load the parameters of a run from mlflow.

Parameters:

run_id (str) – The id of the run to load the parameters from.

Returns:

The parameters of the run.

graphdoc.data._env_constructor(loader: SafeLoader, node: ScalarNode) str[source]

Custom constructor for environment variables.

Parameters:
  • loader (yaml.SafeLoader) – The YAML loader.

  • node (yaml.nodes.ScalarNode) – The node to construct.

Returns:

The environment variable value.

Return type:

str

Raises:

ValueError – If the environment variable is not set.

graphdoc.data.check_directory_path(directory_path: str | Path) None[source]

Check if the provided path resolves to a valid directory.

Parameters:

directory_path (Union[str, Path]) – The path to check.

Raises:

ValueError – If the path does not resolve to a valid directory.

Returns:

None

Return type:

None

graphdoc.data.check_file_path(file_path: str | Path) None[source]

Check if the provided path resolves to a valid file.

Parameters:

file_path (Union[str, Path]) – The path to check.

Raises:

ValueError – If the path does not resolve to a valid file.

Returns:

None

Return type:

None

graphdoc.data.load_yaml_config(file_path: str | Path, use_env: bool = True) dict[source]

Load a YAML configuration file.

Parameters:
  • file_path (Union[str, Path]) – The path to the YAML file.

  • use_env (bool) – Whether to use environment variables.

Returns:

The YAML configuration.

Return type:

dict

Raises:

ValueError – If the path does not resolve to a valid file or the environment variable is not set.

graphdoc.data.setup_logging(log_level: Literal['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'])[source]

Set up logging for the application.

Parameters:

log_level (Literal["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]) – The log level.

class graphdoc.data.Parser(type_mapping: dict[type, str] | None = None)[source]

Bases: object

A class for parsing and handling of GraphQL objects.

DEFAULT_NODE_TYPES = {<class 'graphql.language.ast.DocumentNode'>: 'full schema', <class 'graphql.language.ast.EnumTypeDefinitionNode'>: 'enum schema', <class 'graphql.language.ast.EnumValueDefinitionNode'>: 'enum value', <class 'graphql.language.ast.ObjectTypeDefinitionNode'>: 'table schema'}

static _check_node_type(node: Node, type_mapping: dict[type, str] | None = None) str[source]

Check the type of a schema node.

Parameters:
  • node (Node) – The schema node to check

  • type_mapping (Optional[dict[type, str]]) – Custom mapping of node types to strings. Defaults to DEFAULT_NODE_TYPES

Returns:

The type of the schema node

Return type:

str

static parse_schema_from_file(schema_file: str | Path, schema_directory_path: str | Path | None = None) DocumentNode[source]

Parse a schema from a file.

Parameters:
  • schema_file (Union[str, Path]) – The name of the schema file

  • schema_directory_path (Optional[Union[str, Path]]) – A path to a directory containing schemas

Returns:

The parsed schema

Return type:

DocumentNode

Raises:

Exception – If the schema cannot be parsed

static update_node_descriptions(node: Node, new_value: str | None = None) Node[source]

Given a GraphQL node, recursively traverse the node and its children, updating all descriptions with the new value. Can also be used to remove descriptions by passing None as the new value.

Parameters:
  • node (Node) – The GraphQL node to update

  • new_value (Optional[str]) – The new description value. If None, the description will be removed.

Returns:

The updated node

Return type:

Node

static count_description_pattern_matching(node: Node, pattern: str) dict[str, int][source]

Counts the number of times a pattern matches a description in a node and its children.

Parameters:
  • node (Node) – The GraphQL node to count the pattern matches in

  • pattern (str) – The pattern to count the matches of

Returns:

A dictionary with the counts of matches

Return type:

dict[str, int]

static fill_empty_descriptions(node: Node, new_column_value: str = 'Description for column: {}', new_table_value: str = 'Description for table: {}', use_value_name: bool = True, value_name: str | None = None)[source]

Recursively traverse the node and its children, filling in empty descriptions with the new column or table value. Do not update descriptions that already have a value. Default values are provided for the new column and table descriptions.

Parameters:
  • node (Node) – The GraphQL node to update

  • new_column_value (str) – The new column description value

  • new_table_value (str) – The new table description value

  • use_value_name (bool) – Whether to use the value name in the description

  • value_name (Optional[str]) – The name of the value

Returns:

The updated node

Return type:

Node

static schema_equality_check(gold_node: Node, check_node: Node) bool[source]

Check whether two schema nodes are equal. Only the schema structures are compared, not the descriptions.

Parameters:
  • gold_node (Node) – The gold standard schema node

  • check_node (Node) – The schema node to check

Returns:

Whether the schemas are equal

Return type:

bool

static schema_object_from_file(schema_file: str | Path, category: str | None = None, rating: int | None = None) SchemaObject[source]

Parse a schema object from a file.

static parse_objects_from_full_schema_object(schema: SchemaObject, type_mapping: dict[type, str] | None = None) dict[str, SchemaObject] | None[source]

Parse out all available tables from a full schema object.

Parameters:
  • schema (SchemaObject) – The full schema object to parse

  • type_mapping (Optional[dict[type, str]]) – Custom mapping of node types to strings. Defaults to DEFAULT_NODE_TYPES

Returns:

The parsed objects (tables and enums)

Return type:

Union[dict, None]

class graphdoc.data.SchemaCategory(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Schema quality categories enumeration.

PERFECT = 'perfect'
ALMOST_PERFECT = 'almost perfect'
POOR_BUT_CORRECT = 'poor but correct'
INCORRECT = 'incorrect'
BLANK = 'blank'

classmethod from_str(value: str) SchemaCategory | None[source]

class graphdoc.data.SchemaCategoryPath(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Maps schema categories to their folder names.

PERFECT = 'perfect'
ALMOST_PERFECT = 'almost_perfect'
POOR_BUT_CORRECT = 'poor_but_correct'
INCORRECT = 'incorrect'
BLANK = 'blank'

classmethod get_path(category: SchemaCategory, folder_path: str | Path) Path | None[source]

Get the folder path for a given schema category and folder path.

Parameters:

category – The schema category

Returns:

The corresponding folder path

class graphdoc.data.SchemaCategoryRatingMapping[source]

Bases: object

Mapping between schema categories and ratings.

static get_rating(category: SchemaCategory) SchemaRating[source]

Get the corresponding rating for a given schema category.

Parameters:

category – The schema category

Returns:

The corresponding rating

static get_category(rating: SchemaRating) SchemaCategory[source]

Get the corresponding category for a given schema rating.

Parameters:

rating – The schema rating

Returns:

The corresponding category

class graphdoc.data.SchemaObject(key: str, category: Enum | None = None, rating: Enum | None = None, schema_name: str | None = None, schema_type: Enum | None = None, schema_str: str | None = None, schema_ast: Node | None = None)[source]

Bases: object

Schema object containing schema data and metadata.

key: str
category: Enum | None = None
rating: Enum | None = None
schema_name: str | None = None
schema_type: Enum | None = None
schema_str: str | None = None
schema_ast: Node | None = None

classmethod from_dict(data: dict, category_enum: Type[Enum] = SchemaCategory, rating_enum: Type[Enum] = SchemaRating, type_enum: Type[Enum] = SchemaType) SchemaObject[source]

Create SchemaObject from dictionary with validation.

Parameters:
  • data – The data dictionary

  • category_enum – Custom Enum class for categories

  • rating_enum – Custom Enum class for ratings

  • type_enum – Custom Enum class for schema types

to_dict() dict[source]

Convert the SchemaObject to a dictionary, excluding the key field.

Returns:

Dictionary representation of the SchemaObject without the key

Return type:

dict

static _hf_schema_object_columns() Features[source]

Return the columns for the graph_doc dataset, based on the SchemaObject fields.

Returns:

The columns for the graph_doc dataset

Return type:

Features

to_dataset() Dataset[source]

Convert the SchemaObject to a Hugging Face Dataset.

Returns:

The Hugging Face Dataset

Return type:

Dataset

class graphdoc.data.SchemaRating(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Schema quality ratings enumeration.

FOUR = '4'
THREE = '3'
TWO = '2'
ONE = '1'
ZERO = '0'

classmethod from_value(value: str | int) SchemaRating | None[source]

class graphdoc.data.SchemaType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: str, Enum

Schema type enumeration.

FULL_SCHEMA = 'full schema'
TABLE_SCHEMA = 'table schema'
ENUM_SCHEMA = 'enum schema'

classmethod from_str(value: str) SchemaType | None[source]

graphdoc.data.schema_objects_to_dataset(schema_objects: List[SchemaObject]) Dataset[source]

Convert a list of SchemaObjects to a Hugging Face Dataset.

Parameters:

schema_objects – The list of SchemaObjects

Returns:

The Hugging Face Dataset

graphdoc.data.load_yaml_config_redacted(file_path: str | Path, replace_value: str = 'redacted') dict[source]

Load a YAML configuration file with environment variables redacted.

Parameters:
  • file_path (Union[str, Path]) – The path to the YAML file.

  • replace_value (str) – The value to replace the environment variables with.

Returns:

The YAML configuration with environment variable values replaced by replace_value ('redacted' by default).

Return type:

dict

Raises:

ValueError – If the path does not resolve to a valid file.