graphdoc.data package
Submodules
graphdoc.data.helper module
- graphdoc.data.helper.check_directory_path(directory_path: str | Path) None [source]
Check if the provided path resolves to a valid directory.
- Parameters:
directory_path (Union[str, Path]) – The path to check.
- Raises:
ValueError – If the path does not resolve to a valid directory.
- Returns:
None
- Return type:
None
- graphdoc.data.helper.check_file_path(file_path: str | Path) None [source]
Check if the provided path resolves to a valid file.
- Parameters:
file_path (Union[str, Path]) – The path to check.
- Raises:
ValueError – If the path does not resolve to a valid file.
- Returns:
None
- Return type:
None
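The two path checks above share one contract: return nothing on success, raise ValueError otherwise. A minimal sketch of that contract follows, assuming the checks are built on pathlib. It is an illustrative re-implementation, not graphdoc's source, and the error messages are made up.

```python
from pathlib import Path
from typing import Union


def check_directory_path(directory_path: Union[str, Path]) -> None:
    """Illustrative only: raise ValueError unless the path is an existing directory."""
    resolved = Path(directory_path).resolve()
    if not resolved.is_dir():
        raise ValueError(f"Not a valid directory: {resolved}")


def check_file_path(file_path: Union[str, Path]) -> None:
    """Illustrative only: raise ValueError unless the path is an existing file."""
    resolved = Path(file_path).resolve()
    if not resolved.is_file():
        raise ValueError(f"Not a valid file: {resolved}")
```

Because both functions return None, callers would normally invoke them for the side effect alone and let the ValueError propagate.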
- graphdoc.data.helper._env_constructor(loader: SafeLoader, node: ScalarNode) str [source]
Custom constructor for environment variables.
- Parameters:
loader (yaml.SafeLoader) – The YAML loader.
node (yaml.nodes.ScalarNode) – The node to construct.
- Returns:
The environment variable value.
- Return type:
str
- Raises:
ValueError – If the environment variable is not set.
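The lookup step of such a constructor can be sketched with the standard library alone. The helper name `env_value` is hypothetical. With PyYAML, a function like this would typically read the variable name through `loader.construct_scalar(node)` and be registered for a custom tag via `yaml.SafeLoader.add_constructor`; the tag graphdoc uses is not stated on this page.

```python
import os


def env_value(name: str) -> str:
    """Hypothetical helper: return the value of an environment variable.

    Raises ValueError when the variable is not set, mirroring the
    documented behaviour of the constructor.
    """
    value = os.environ.get(name)
    if value is None:
        raise ValueError(f"Environment variable {name} is not set")
    return value
```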
- graphdoc.data.helper.load_yaml_config(file_path: str | Path, use_env: bool = True) dict [source]
Load a YAML configuration file.
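- Parameters:
file_path (Union[str, Path]) – The path to the YAML file.
use_env (bool) – Whether to resolve environment variables referenced in the file. Defaults to True.
- Returns:
The YAML configuration.
- Return type:
dict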
- Raises:
ValueError – If the path does not resolve to a valid file or the environment variable is not set.
- graphdoc.data.helper.load_yaml_config_redacted(file_path: str | Path, replace_value: str = 'redacted') dict [source]
Load a YAML configuration file with environment variables redacted.
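- Parameters:
file_path (Union[str, Path]) – The path to the YAML file.
replace_value (str) – The value substituted for environment variables. Defaults to 'redacted'.
- Returns:
The YAML configuration with env vars replaced by replace_value (“redacted” by default).
- Return type:
dict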
- Raises:
ValueError – If the path does not resolve to a valid file.
graphdoc.data.local module
- class graphdoc.data.local.LocalDataHelper(schema_directory_path: str | Path | None = None, categories: Type[Enum] = <enum 'SchemaCategory'>, ratings: Type[Enum] = <enum 'SchemaRating'>, categories_ratings: Callable = <function SchemaCategoryRatingMapping.get_rating>)[source]
Bases:
object
A helper class for loading data from a directory.
- Parameters:
schema_directory_path (Union[str, Path]) – The path to the directory containing the schemas. Defaults to the path to the schemas in the graphdoc package.
categories (Type[Enum]) – The categories of the schemas. Defaults to SchemaCategory.
ratings (Type[Enum]) – The ratings of the schemas. Defaults to SchemaRating.
categories_ratings – A callable that maps categories to ratings. Defaults to SchemaCategoryRatingMapping.get_rating.
- schema_objects_from_folder(category: str, rating: int, folder_path: str | Path) dict[str, SchemaObject] [source]
Load schemas from a folder, keeping the difficulty tag.
- schema_objects_from_folder_of_folders(folder_paths: Type[Enum] | None = <enum 'SchemaCategoryPath'>) Dict[str, SchemaObject] | None [source]
Load a folder of folders containing schemas, keeping the difficulty tag.
- Parameters:
folder_paths (Optional[Type[Enum]]) – Enum class defining folder paths, defaults to SchemaCategoryPath. Must have a get_path method.
- Returns:
Dictionary of loaded schemas
- Return type:
Union[Dict[str, SchemaObject], None]
- folder_to_dataset(category: str, folder_path: str | Path, parse_objects: bool = True, type_mapping: dict[type, str] | None = None) Dataset [source]
Load a folder of schemas, keeping the difficulty tag.
- Parameters:
- Returns:
A dataset containing the schemas
- Return type:
Dataset
- folder_of_folders_to_dataset(folder_paths: Type[Enum] = <enum 'SchemaCategoryPath'>, parse_objects: bool = True, type_mapping: dict[type, str] | None = None) Dataset [source]
Load a folder of folders containing schemas, keeping the difficulty tag.
- Parameters:
- Returns:
A dataset containing the schemas
- Return type:
Dataset
graphdoc.data.parser module
- class graphdoc.data.parser.Parser(type_mapping: dict[type, str] | None = None)[source]
Bases:
object
A class for parsing and handling GraphQL objects.
- DEFAULT_NODE_TYPES = {<class 'graphql.language.ast.DocumentNode'>: 'full schema', <class 'graphql.language.ast.EnumTypeDefinitionNode'>: 'enum schema', <class 'graphql.language.ast.EnumValueDefinitionNode'>: 'enum value', <class 'graphql.language.ast.ObjectTypeDefinitionNode'>: 'table schema'}
- static _check_node_type(node: Node, type_mapping: dict[type, str] | None = None) str [source]
Check the type of a schema node.
- static parse_schema_from_file(schema_file: str | Path, schema_directory_path: str | Path | None = None) DocumentNode [source]
Parse a schema from a file.
- static update_node_descriptions(node: Node, new_value: str | None = None) Node [source]
Given a GraphQL node, recursively traverse the node and its children, updating all descriptions with the new value. Can also be used to remove descriptions by passing None as the new value.
- Parameters:
node (Node) – The GraphQL node to update
new_value (Optional[str]) – The new description value. If None, the description will be removed.
- Returns:
The updated node
- Return type:
Node
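The recursive update described above can be illustrated on a stand-in tree. `SketchNode` is a hypothetical dataclass, not graphql-core's Node, and the traversal is only an assumption about how such an update might walk the children.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchNode:
    """Hypothetical stand-in for a GraphQL AST node."""
    name: str
    description: Optional[str] = None
    children: List["SketchNode"] = field(default_factory=list)


def update_descriptions(node: SketchNode, new_value: Optional[str] = None) -> SketchNode:
    """Set every description in the subtree to new_value (None removes it)."""
    node.description = new_value
    for child in node.children:
        update_descriptions(child, new_value)
    return node
```

Calling `update_descriptions(tree)` with no second argument clears every description, which matches the documented use of None as the removal value.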
- static count_description_pattern_matching(node: Node, pattern: str) dict[str, int] [source]
Counts the number of times a pattern matches a description in a node and its children.
- static fill_empty_descriptions(node: Node, new_column_value: str = 'Description for column: {}', new_table_value: str = 'Description for table: {}', use_value_name: bool = True, value_name: str | None = None)[source]
Recursively traverse the node and its children, filling in empty descriptions with the new column or table value. Do not update descriptions that already have a value. Default values are provided for the new column and table descriptions.
- Parameters:
- Returns:
The updated node
- Return type:
Node
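The default templates in the signature each contain a `{}` placeholder. The sketch below assumes the placeholder is filled with the name of the column or table and that an existing description is left alone; the helper `fill_description` and its `is_table` flag are hypothetical.

```python
from typing import Optional

# Defaults copied from the fill_empty_descriptions signature.
COLUMN_TEMPLATE = "Description for column: {}"
TABLE_TEMPLATE = "Description for table: {}"


def fill_description(current: Optional[str], name: str, is_table: bool) -> str:
    """Keep a non-empty description; otherwise build one from a template."""
    if current:  # already documented: leave untouched
        return current
    template = TABLE_TEMPLATE if is_table else COLUMN_TEMPLATE
    return template.format(name)
```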
- static schema_equality_check(gold_node: Node, check_node: Node) bool [source]
Check whether two schema nodes are equal. Only the schema structures are compared, not the descriptions.
- Parameters:
gold_node (Node) – The gold standard schema node
check_node (Node) – The schema node to check
- Returns:
Whether the schemas are equal
- Return type:
- static schema_object_from_file(schema_file: str | Path, category: str | None = None, rating: int | None = None) SchemaObject [source]
Parse a schema object from a file.
graphdoc.data.schema module
- class graphdoc.data.schema.SchemaCategory(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Schema quality categories enumeration.
- PERFECT = 'perfect'
- ALMOST_PERFECT = 'almost perfect'
- POOR_BUT_CORRECT = 'poor but correct'
- INCORRECT = 'incorrect'
- BLANK = 'blank'
- class graphdoc.data.schema.SchemaRating(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Schema quality ratings enumeration.
- FOUR = '4'
- THREE = '3'
- TWO = '2'
- ONE = '1'
- ZERO = '0'
- class graphdoc.data.schema.SchemaCategoryRatingMapping[source]
Bases:
object
Mapping between schema categories and ratings.
- static get_rating(category: SchemaCategory) SchemaRating [source]
Get the corresponding rating for a given schema category.
- Parameters:
category – The schema category
- Returns:
The corresponding rating
- static get_category(rating: SchemaRating) SchemaCategory [source]
Get the corresponding category for a given schema rating.
- Parameters:
rating – The schema rating
- Returns:
The corresponding category
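A two-way mapping of this kind can be sketched with a pair of dictionaries. The enum members are copied from this page, but the pairing itself (PERFECT with FOUR down to BLANK with ZERO) is an assumption inferred from the ordering of the values; the page does not state it.

```python
from enum import Enum


class SchemaCategory(Enum):
    PERFECT = "perfect"
    ALMOST_PERFECT = "almost perfect"
    POOR_BUT_CORRECT = "poor but correct"
    INCORRECT = "incorrect"
    BLANK = "blank"


class SchemaRating(Enum):
    FOUR = "4"
    THREE = "3"
    TWO = "2"
    ONE = "1"
    ZERO = "0"


# ASSUMPTION: categories and ratings pair up in declaration order.
_CATEGORY_TO_RATING = dict(zip(SchemaCategory, SchemaRating))
_RATING_TO_CATEGORY = {rating: category for category, rating in _CATEGORY_TO_RATING.items()}


def get_rating(category: SchemaCategory) -> SchemaRating:
    """Look up the rating paired with a category."""
    return _CATEGORY_TO_RATING[category]


def get_category(rating: SchemaRating) -> SchemaCategory:
    """Look up the category paired with a rating."""
    return _RATING_TO_CATEGORY[rating]
```

Deriving the reverse dictionary from the forward one keeps the two lookups consistent by construction.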
- class graphdoc.data.schema.SchemaType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Schema type enumeration.
- FULL_SCHEMA = 'full schema'
- TABLE_SCHEMA = 'table schema'
- ENUM_SCHEMA = 'enum schema'
- class graphdoc.data.schema.SchemaCategoryPath(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Maps schema categories to their folder names.
- PERFECT = 'perfect'
- ALMOST_PERFECT = 'almost_perfect'
- POOR_BUT_CORRECT = 'poor_but_correct'
- INCORRECT = 'incorrect'
- BLANK = 'blank'
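The folder names differ from the SchemaCategory values only in using underscores instead of spaces. The sketch below shows how a member could be turned into a directory under a base path. The helper `category_folder` is hypothetical and is not the `get_path` method that the loaders expect, whose signature is not shown on this page.

```python
from enum import Enum
from pathlib import Path
from typing import Union


class SchemaCategoryPath(Enum):
    PERFECT = "perfect"
    ALMOST_PERFECT = "almost_perfect"
    POOR_BUT_CORRECT = "poor_but_correct"
    INCORRECT = "incorrect"
    BLANK = "blank"


def category_folder(base: Union[str, Path], category_path: SchemaCategoryPath) -> Path:
    """Hypothetical helper: join a base directory with a category folder name."""
    return Path(base) / category_path.value
```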
- class graphdoc.data.schema.SchemaObject(key: str, category: Enum | None = None, rating: Enum | None = None, schema_name: str | None = None, schema_type: Enum | None = None, schema_str: str | None = None, schema_ast: Node | None = None)[source]
Bases:
object
Schema object containing schema data and metadata.
- key: str
- schema_ast: Node | None = None
- classmethod from_dict(data: dict, category_enum: Type[Enum] = <enum 'SchemaCategory'>, rating_enum: Type[Enum] = <enum 'SchemaRating'>, type_enum: Type[Enum] = <enum 'SchemaType'>) SchemaObject [source]
Create SchemaObject from dictionary with validation.
- Parameters:
data – The data dictionary
category_enum – Custom Enum class for categories
rating_enum – Custom Enum class for ratings
type_enum – Custom Enum class for schema types
- to_dict() dict [source]
Convert the SchemaObject to a dictionary, excluding the key field.
- Returns:
Dictionary representation of the SchemaObject without the key
- Return type:
dict
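Excluding the key from the dictionary form can be sketched with a small dataclass. `SketchSchemaObject` is a hypothetical, cut-down stand-in; whether SchemaObject is itself a dataclass is an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SketchSchemaObject:
    """Hypothetical stand-in with a subset of SchemaObject's fields."""
    key: str
    schema_name: Optional[str] = None
    schema_str: Optional[str] = None

    def to_dict(self) -> dict:
        """Return all fields except the key."""
        data = asdict(self)
        data.pop("key")
        return data
```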
- static _hf_schema_object_columns() Features [source]
Return the columns for the graph_doc dataset, based on the SchemaObject fields.
- Returns:
The columns for the graph_doc dataset
- Return type:
Features
- to_dataset() Dataset [source]
Convert the SchemaObject to a Hugging Face Dataset.
- Returns:
The Hugging Face Dataset
- Return type:
Dataset
Module contents
- class graphdoc.data.DspyDataHelper[source]
Bases:
ABC
Abstract class for creating data objects related to a given dspy.Signature.
- prompt_signature() Signature | SignatureMeta [source]
Given a prompt, return a dspy.Signature object.
- Parameters:
prompt (Any) – A prompt.
- static _(prompt: Predict) Signature | SignatureMeta [source]
Given a dspy.Predict object, return a dspy.Signature object.
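The underscore-named overload for dspy.Predict suggests type-based dispatch on the prompt argument. The sketch below shows that pattern with functools.singledispatch on stand-in classes. `SketchSignature` and `SketchPredict` are hypothetical and do not reproduce the dspy API, and the use of singledispatch in graphdoc is itself an assumption.

```python
from functools import singledispatch


class SketchSignature:
    """Hypothetical stand-in for dspy.Signature."""


class SketchPredict:
    """Hypothetical stand-in for dspy.Predict: wraps a signature."""

    def __init__(self, signature: SketchSignature) -> None:
        self.signature = signature


@singledispatch
def prompt_signature(prompt):
    """Fallback for prompt types with no registered handler."""
    raise NotImplementedError(f"Unsupported prompt type: {type(prompt).__name__}")


@prompt_signature.register
def _(prompt: SketchPredict) -> SketchSignature:
    """Handler chosen when the prompt is a SketchPredict."""
    return prompt.signature
```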
- static formatted_signature(signature: Signature | SignatureMeta, example: Example) str [source]
Given a dspy.Signature and a dspy.Example, return a formatted signature as a string.
- Parameters:
signature (dspy.Signature) – A dspy.Signature object.
example (dspy.Example) – A dspy.Example object.
- Returns:
A formatted signature as a string.
- Return type:
str
- abstract static example(inputs: dict[str, Any]) Example [source]
Given a dictionary of inputs, return a dspy.Example object.
- abstract static example_example() Example [source]
Return an example dspy.Example object with the inputs set to the example values.
- Returns:
A dspy.Example object.
- Return type:
dspy.Example
- abstract static model_signature() ModelSignature [source]
Return a mlflow.models.ModelSignature object. Based on the example object, removes the output fields and utilizes the remaining fields to infer the model signature.
- Returns:
A mlflow.models.ModelSignature object.
- Return type:
mlflow.models.ModelSignature
- abstract static prediction(inputs: dict[str, Any]) Prediction [source]
Given a dictionary of inputs, return a dspy.Prediction object.
- abstract static prediction_example() Prediction [source]
Return an example dspy.Prediction object with the inputs set to the example values.
- Returns:
A dspy.Prediction object.
- Return type:
dspy.Prediction
- abstract static trainset(inputs: dict[str, Any] | Dataset, filter_args: dict[str, Any] | None = None) list[Example] [source]
Given a dictionary of inputs or a datasets.Dataset object, return a list of dspy.Example objects.
- Parameters:
- Returns:
A list of dspy.Example objects.
- Return type:
list[dspy.Example]
- class graphdoc.data.GenerationDataHelper[source]
Bases:
DspyDataHelper
A helper class for creating data objects related to our Documentation Generation dspy.Signature.
The example signature is defined as:
database_schema: str = dspy.InputField()
documented_schema: str = dspy.OutputField()
- static example(inputs: dict[str, Any]) Example [source]
Given a dictionary of inputs, return a dspy.Example object.
- static example_example() Example [source]
Return an example dspy.Example object with the inputs set to the example values.
- Returns:
A dspy.Example object.
- Return type:
dspy.Example
- static model_signature() ModelSignature [source]
Return a mlflow.models.ModelSignature object. Based on the example object, removes the output fields and utilizes the remaining fields to infer the model signature.
- Returns:
A mlflow.models.ModelSignature object.
- Return type:
mlflow.models.ModelSignature
- static prediction(inputs: dict[str, Any]) Prediction [source]
Given a dictionary of inputs, return a dspy.Prediction object.
- static prediction_example() Prediction [source]
Return an example dspy.Prediction object with the inputs set to the example values.
- Returns:
A dspy.Prediction object.
- Return type:
dspy.Prediction
- class graphdoc.data.QualityDataHelper[source]
Bases:
DspyDataHelper
A helper class for creating data objects related to our Documentation Quality dspy.Signature.
The example signature is defined as:
database_schema: str = dspy.InputField()
category: Literal["perfect", "almost perfect", "poor but correct", "incorrect"] = dspy.OutputField()
rating: Literal[4, 3, 2, 1] = dspy.OutputField()
- static example(inputs: dict[str, Any]) Example [source]
Given a dictionary of inputs, return a dspy.Example object.
- static example_example() Example [source]
Return an example dspy.Example object with the inputs set to the example values.
- Returns:
A dspy.Example object.
- Return type:
dspy.Example
- static model_signature() ModelSignature [source]
Return a mlflow.models.ModelSignature object. Based on the example object, removes the output fields and utilizes the remaining fields to infer the model signature.
- Returns:
A mlflow.models.ModelSignature object.
- Return type:
mlflow.models.ModelSignature
- static prediction(inputs: dict[str, Any]) Prediction [source]
Given a dictionary of inputs, return a dspy.Prediction object.
- static prediction_example() Prediction [source]
Return an example dspy.Prediction object with the inputs set to the example values.
- Returns:
A dspy.Prediction object.
- Return type:
dspy.Prediction
- class graphdoc.data.LocalDataHelper(schema_directory_path: str | Path | None = None, categories: Type[Enum] = <enum 'SchemaCategory'>, ratings: Type[Enum] = <enum 'SchemaRating'>, categories_ratings: Callable = <function SchemaCategoryRatingMapping.get_rating>)[source]
Bases:
object
A helper class for loading data from a directory.
- Parameters:
schema_directory_path (Union[str, Path]) – The path to the directory containing the schemas. Defaults to the path to the schemas in the graphdoc package.
categories (Type[Enum]) – The categories of the schemas. Defaults to SchemaCategory.
ratings (Type[Enum]) – The ratings of the schemas. Defaults to SchemaRating.
categories_ratings – A callable that maps categories to ratings. Defaults to SchemaCategoryRatingMapping.get_rating.
- schema_objects_from_folder(category: str, rating: int, folder_path: str | Path) dict[str, SchemaObject] [source]
Load schemas from a folder, keeping the difficulty tag.
- schema_objects_from_folder_of_folders(folder_paths: Type[Enum] | None = <enum 'SchemaCategoryPath'>) Dict[str, SchemaObject] | None [source]
Load a folder of folders containing schemas, keeping the difficulty tag.
- Parameters:
folder_paths (Optional[Type[Enum]]) – Enum class defining folder paths, defaults to SchemaCategoryPath. Must have a get_path method.
- Returns:
Dictionary of loaded schemas
- Return type:
Union[Dict[str, SchemaObject], None]
- folder_to_dataset(category: str, folder_path: str | Path, parse_objects: bool = True, type_mapping: dict[type, str] | None = None) Dataset [source]
Load a folder of schemas, keeping the difficulty tag.
- Parameters:
- Returns:
A dataset containing the schemas
- Return type:
Dataset
- folder_of_folders_to_dataset(folder_paths: Type[Enum] = <enum 'SchemaCategoryPath'>, parse_objects: bool = True, type_mapping: dict[type, str] | None = None) Dataset [source]
Load a folder of folders containing schemas, keeping the difficulty tag.
- Parameters:
- Returns:
A dataset containing the schemas
- Return type:
Dataset
- class graphdoc.data.MlflowDataHelper(mlflow_tracking_uri: str | Path, mlflow_tracking_username: str | None = None, mlflow_tracking_password: str | None = None)[source]
Bases:
object
- __init__(mlflow_tracking_uri: str | Path, mlflow_tracking_username: str | None = None, mlflow_tracking_password: str | None = None)[source]
A helper class for loading and saving models and metadata from mlflow.
- update_auth_env_vars(mlflow_tracking_username: str, mlflow_tracking_password: str)[source]
Update the authentication environment variables.
- set_auth_env_vars()[source]
Set the authentication environment variables.
- latest_model_version(model_name: str)[source]
Load the latest version of a model from mlflow.
- Parameters:
model_name (str) – The name of the model to load.
- Returns:
The loaded model.
- model_by_name_and_version(model_name: str, model_version: str)[source]
Load a model from mlflow by name and version.
- model_by_uri(model_uri: str)[source]
Load a model from mlflow by uri.
- Parameters:
model_uri (str) – The uri of the model to load.
- Returns:
The loaded model.
- model_by_args(load_model_args: Dict[str, str])[source]
Given a dictionary of arguments, load a model from mlflow. The loaders are tried in this order: model_by_uri, model_by_name_and_version, latest_model_version.
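The precedence among the three loaders can be sketched as a simple dispatch on which arguments are present. The dictionary keys used here (`model_uri`, `model_name`, `model_version`) are assumptions borrowed from the parameter names of the individual loaders, and the function only reports which loader would be chosen rather than calling mlflow.

```python
from typing import Dict


def choose_loader(load_model_args: Dict[str, str]) -> str:
    """Return the name of the loader that would handle these arguments.

    ASSUMPTION: the argument keys mirror the loaders' parameter names.
    """
    if "model_uri" in load_model_args:
        return "model_by_uri"
    if "model_name" in load_model_args and "model_version" in load_model_args:
        return "model_by_name_and_version"
    if "model_name" in load_model_args:
        return "latest_model_version"
    raise ValueError("No usable model arguments provided")
```

Checking the most specific argument first means a URI wins even when a name and version are also supplied.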
- graphdoc.data._env_constructor(loader: SafeLoader, node: ScalarNode) str [source]
Custom constructor for environment variables.
- Parameters:
loader (yaml.SafeLoader) – The YAML loader.
node (yaml.nodes.ScalarNode) – The node to construct.
- Returns:
The environment variable value.
- Return type:
str
- Raises:
ValueError – If the environment variable is not set.
- graphdoc.data.check_directory_path(directory_path: str | Path) None [source]
Check if the provided path resolves to a valid directory.
- Parameters:
directory_path (Union[str, Path]) – The path to check.
- Raises:
ValueError – If the path does not resolve to a valid directory.
- Returns:
None
- Return type:
None
- graphdoc.data.check_file_path(file_path: str | Path) None [source]
Check if the provided path resolves to a valid file.
- Parameters:
file_path (Union[str, Path]) – The path to check.
- Raises:
ValueError – If the path does not resolve to a valid file.
- Returns:
None
- Return type:
None
- graphdoc.data.load_yaml_config(file_path: str | Path, use_env: bool = True) dict [source]
Load a YAML configuration file.
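- Parameters:
file_path (Union[str, Path]) – The path to the YAML file.
use_env (bool) – Whether to resolve environment variables referenced in the file. Defaults to True.
- Returns:
The YAML configuration.
- Return type:
dict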
- Raises:
ValueError – If the path does not resolve to a valid file or the environment variable is not set.
- graphdoc.data.setup_logging(log_level: Literal['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'])[source]
Set up logging for the application.
- Parameters:
log_level (Literal["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]) – The log level.
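A level name restricted to these five literals maps directly onto the constants in the standard logging module. The sketch below is a minimal illustration assuming the root logger is configured through `logging.basicConfig`; the name `setup_logging_sketch` is hypothetical and the real function may configure handlers or formats differently.

```python
import logging
from typing import Literal

LogLevel = Literal["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def setup_logging_sketch(log_level: LogLevel) -> None:
    """Configure the root logger at the named level.

    force=True (Python 3.8+) replaces any handlers configured earlier,
    so repeated calls take effect.
    """
    logging.basicConfig(level=getattr(logging, log_level), force=True)
```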
- class graphdoc.data.Parser(type_mapping: dict[type, str] | None = None)[source]
Bases:
object
A class for parsing and handling GraphQL objects.
- DEFAULT_NODE_TYPES = {<class 'graphql.language.ast.DocumentNode'>: 'full schema', <class 'graphql.language.ast.EnumTypeDefinitionNode'>: 'enum schema', <class 'graphql.language.ast.EnumValueDefinitionNode'>: 'enum value', <class 'graphql.language.ast.ObjectTypeDefinitionNode'>: 'table schema'}
- static _check_node_type(node: Node, type_mapping: dict[type, str] | None = None) str [source]
Check the type of a schema node.
- static parse_schema_from_file(schema_file: str | Path, schema_directory_path: str | Path | None = None) DocumentNode [source]
Parse a schema from a file.
- static update_node_descriptions(node: Node, new_value: str | None = None) Node [source]
Given a GraphQL node, recursively traverse the node and its children, updating all descriptions with the new value. Can also be used to remove descriptions by passing None as the new value.
- Parameters:
node (Node) – The GraphQL node to update
new_value (Optional[str]) – The new description value. If None, the description will be removed.
- Returns:
The updated node
- Return type:
Node
- static count_description_pattern_matching(node: Node, pattern: str) dict[str, int] [source]
Counts the number of times a pattern matches a description in a node and its children.
- static fill_empty_descriptions(node: Node, new_column_value: str = 'Description for column: {}', new_table_value: str = 'Description for table: {}', use_value_name: bool = True, value_name: str | None = None)[source]
Recursively traverse the node and its children, filling in empty descriptions with the new column or table value. Do not update descriptions that already have a value. Default values are provided for the new column and table descriptions.
- Parameters:
- Returns:
The updated node
- Return type:
Node
- static schema_equality_check(gold_node: Node, check_node: Node) bool [source]
Check whether two schema nodes are equal. Only the schema structures are compared, not the descriptions.
- Parameters:
gold_node (Node) – The gold standard schema node
check_node (Node) – The schema node to check
- Returns:
Whether the schemas are equal
- Return type:
bool
- static schema_object_from_file(schema_file: str | Path, category: str | None = None, rating: int | None = None) SchemaObject [source]
Parse a schema object from a file.
- class graphdoc.data.SchemaCategory(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Schema quality categories enumeration.
- PERFECT = 'perfect'
- ALMOST_PERFECT = 'almost perfect'
- POOR_BUT_CORRECT = 'poor but correct'
- INCORRECT = 'incorrect'
- BLANK = 'blank'
- class graphdoc.data.SchemaCategoryPath(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Maps schema categories to their folder names.
- PERFECT = 'perfect'
- ALMOST_PERFECT = 'almost_perfect'
- POOR_BUT_CORRECT = 'poor_but_correct'
- INCORRECT = 'incorrect'
- BLANK = 'blank'
- class graphdoc.data.SchemaCategoryRatingMapping[source]
Bases:
object
Mapping between schema categories and ratings.
- static get_rating(category: SchemaCategory) SchemaRating [source]
Get the corresponding rating for a given schema category.
- Parameters:
category – The schema category
- Returns:
The corresponding rating
- static get_category(rating: SchemaRating) SchemaCategory [source]
Get the corresponding category for a given schema rating.
- Parameters:
rating – The schema rating
- Returns:
The corresponding category
- class graphdoc.data.SchemaObject(key: str, category: Enum | None = None, rating: Enum | None = None, schema_name: str | None = None, schema_type: Enum | None = None, schema_str: str | None = None, schema_ast: Node | None = None)[source]
Bases:
object
Schema object containing schema data and metadata.
- key: str
- schema_ast: Node | None = None
- classmethod from_dict(data: dict, category_enum: Type[Enum] = <enum 'SchemaCategory'>, rating_enum: Type[Enum] = <enum 'SchemaRating'>, type_enum: Type[Enum] = <enum 'SchemaType'>) SchemaObject [source]
Create SchemaObject from dictionary with validation.
- Parameters:
data – The data dictionary
category_enum – Custom Enum class for categories
rating_enum – Custom Enum class for ratings
type_enum – Custom Enum class for schema types
- to_dict() dict [source]
Convert the SchemaObject to a dictionary, excluding the key field.
- Returns:
Dictionary representation of the SchemaObject without the key
- Return type:
dict
- static _hf_schema_object_columns() Features [source]
Return the columns for the graph_doc dataset, based on the SchemaObject fields.
- Returns:
The columns for the graph_doc dataset
- Return type:
Features
- to_dataset() Dataset [source]
Convert the SchemaObject to a Hugging Face Dataset.
- Returns:
The Hugging Face Dataset
- Return type:
Dataset
- class graphdoc.data.SchemaRating(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Schema quality ratings enumeration.
- FOUR = '4'
- THREE = '3'
- TWO = '2'
- ONE = '1'
- ZERO = '0'
- class graphdoc.data.SchemaType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Schema type enumeration.
- FULL_SCHEMA = 'full schema'
- TABLE_SCHEMA = 'table schema'
- ENUM_SCHEMA = 'enum schema'
- graphdoc.data.schema_objects_to_dataset(schema_objects: List[SchemaObject]) Dataset [source]
Convert a list of SchemaObjects to a Hugging Face Dataset.
- Parameters:
schema_objects – The list of SchemaObjects
- Returns:
The Hugging Face Dataset
- graphdoc.data.load_yaml_config_redacted(file_path: str | Path, replace_value: str = 'redacted') dict [source]
Load a YAML configuration file with environment variables redacted.
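- Parameters:
file_path (Union[str, Path]) – The path to the YAML file.
replace_value (str) – The value substituted for environment variables. Defaults to 'redacted'.
- Returns:
The YAML configuration with env vars replaced by replace_value (“redacted” by default).
- Return type:
dict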
- Raises:
ValueError – If the path does not resolve to a valid file.