regparser.tree.xml_parser package¶

Submodules¶

regparser.tree.xml_parser.appendices module¶

regparser.tree.xml_parser.extended_preprocessors module¶

regparser.tree.xml_parser.flatsubtree_processor module¶

class regparser.tree.xml_parser.flatsubtree_processor.FlatParagraphProcessor[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.ParagraphProcessor

Paragraph Processor which does not try to derive paragraph markers

MATCHERS = [<regparser.tree.xml_parser.paragraph_processor.StarsMatcher object>, <regparser.tree.xml_parser.paragraph_processor.TableMatcher object>, <regparser.tree.xml_parser.simple_hierarchy_processor.SimpleHierarchyMatcher object>, <regparser.tree.xml_parser.paragraph_processor.HeaderMatcher object>, <regparser.tree.xml_parser.paragraph_processor.SimpleTagMatcher object>, <regparser.tree.xml_parser.us_code.USCodeMatcher object>, <regparser.tree.xml_parser.paragraph_processor.GraphicsMatcher object>, <regparser.tree.xml_parser.paragraph_processor.IgnoreTagMatcher object>]¶

class regparser.tree.xml_parser.flatsubtree_processor.FlatsubtreeMatcher(tags, node_type=u'regtext')[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

Detects tags passed to it on init and processes them with the FlatParagraphProcessor. Also optionally sets node_type.

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

regparser.tree.xml_parser.import_category module¶

class regparser.tree.xml_parser.import_category.ImportCategoryMatcher[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

The IMPORTCATEGORY gets converted into a subtree with an appropriate title and unique paragraph marker

CATEGORY_RE = <_sre.SRE_Pattern object>¶

derive_nodes(xml, processor=None)[source]¶: Finds and deletes the category header before recursing. Adds this header as a title.

classmethod marker(header_text)[source]¶: Derive a unique, repeatable identifier for this subtree. This allows the same category to be reordered (e.g. if a note has been added), or a header with multiple reserved categories to be split (which would also re-order the categories that followed)

matches(xml)[source]¶

regparser.tree.xml_parser.interpretations module¶

regparser.tree.xml_parser.paragraph_processor module¶

class regparser.tree.xml_parser.paragraph_processor.BaseMatcher[source]¶

Bases: object

Base class defining the interface of various XML node matchers

derive_nodes(xml, processor=None)[source]¶: Given an xml node which this matcher applies against, convert it into a list of Node structures. processor is the paragraph processor which we are being executed in. May be useful when determining how to create the Nodes

matches(xml)[source]¶: Test the xml element – does this matcher apply?
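The two-method interface above can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: `Node` is a simplified stand-in for regparser's node structure, `NoteMatcher` and `first_match` are hypothetical names, and the standard library's `xml.etree.ElementTree` is used in place of whatever XML library the package relies on.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Hypothetical, simplified stand-in for regparser's Node structure.
Node = namedtuple("Node", ["text", "label", "node_type"])


class BaseMatcher:
    """Stand-in mirroring the documented two-method interface."""

    def matches(self, xml):
        raise NotImplementedError

    def derive_nodes(self, xml, processor=None):
        raise NotImplementedError


class NoteMatcher(BaseMatcher):
    """Hypothetical matcher that claims <NOTE> elements."""

    def matches(self, xml):
        return xml.tag == "NOTE"

    def derive_nodes(self, xml, processor=None):
        # Flatten the element (and any inline children) into one text node.
        text = "".join(xml.itertext()).strip()
        return [Node(text=text, label=["note"], node_type="regtext")]


def first_match(matchers, xml):
    """Return the nodes from the first matcher that applies, else []."""
    for matcher in matchers:
        if matcher.matches(xml):
            return matcher.derive_nodes(xml)
    return []


element = ET.fromstring("<NOTE>See <E>also</E> part 2.</NOTE>")
nodes = first_match([NoteMatcher()], element)
```

A processor would presumably try each entry of its MATCHERS list in a similar way, but the actual dispatch logic is not shown in this reference.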

class regparser.tree.xml_parser.paragraph_processor.FencedMatcher[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

Use github-like fencing to indicate this is code

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

class regparser.tree.xml_parser.paragraph_processor.GraphicsMatcher[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

Convert Graphics tags into a markdown-esque format

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

class regparser.tree.xml_parser.paragraph_processor.HeaderMatcher[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

class regparser.tree.xml_parser.paragraph_processor.IgnoreTagMatcher(*tags)[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.SimpleTagMatcher

As we log warnings when we don’t know how to process a tag, this matcher allows us to positively acknowledge that we’re ignoring some matches

derive_nodes(xml, processor=None)[source]¶

class regparser.tree.xml_parser.paragraph_processor.ParagraphProcessor[source]¶

Bases: object

Processing paragraphs in a generic manner requires a lot of state to be carried in between xml nodes. Use a class to wrap that state so we can compartmentalize processing with various tags. This is an abstract class; regtext, interpretations, appendices, etc. should inherit and override where needed

DEPTH_HEURISTICS = OrderedDict([(<function prefer_diff_types_diff_levels>, 0.8), (<function prefer_multiple_children>, 0.4), (<function prefer_shallow_depths>, 0.2), (<function prefer_no_markerless_sandwich>, 0.2)])¶

MATCHERS = []¶

static additional_constraints()[source]¶: Hook for subtypes to add additional constraints

build_hierarchy(root, nodes, depths)[source]¶: Given a root node, a flat list of child nodes, and a list of depths, build a node hierarchy around the root

carry_label_to_children(node)[source]¶: Takes a node and recursively processes its children to add the appropriate label prefix to them.

parse_nodes(xml)[source]¶: Derive a flat list of nodes from this xml chunk. This does nothing to determine node depth

process(xml, root)[source]¶

static relaxed_constraints()[source]¶: Hook for subtypes to add relaxed constraints for retry logic

static replace_markerless(stack, node, depth)[source]¶: Assign a unique index to all of the MARKERLESS paragraphs

select_depth(depths)[source]¶: There might be multiple solutions to our depth processing problem. Use heuristics to select one.

static separate_intro(nodes)[source]¶: In many situations the first unlabeled paragraph is the “intro” text for a section. We separate that out here

class regparser.tree.xml_parser.paragraph_processor.SimpleTagMatcher(*tags)[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

Simple example tag matcher – it listens for specific tags and derives a single node with the associated body

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

class regparser.tree.xml_parser.paragraph_processor.StarsMatcher[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

<STARS> indicates a chunk of text which is being skipped over

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

class regparser.tree.xml_parser.paragraph_processor.TableMatcher[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

Matches the GPOTABLE tag

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

regparser.tree.xml_parser.preprocessors module¶

Set of transforms we run on notice XML to account for common inaccuracies in the XML

class regparser.tree.xml_parser.preprocessors.ApprovalsFP[source]¶

Bases: regparser.tree.xml_parser.preprocessors.PreProcessorBase

We expect certain text to be in an APPRO tag, but it is often mistakenly found inside FP tags. We use REGEX to determine which nodes need to be fixed.

REGEX = <_sre.SRE_Pattern object>¶

static strip_extracts(xml)[source]¶: APPROs should not be alone in an EXTRACT

transform(xml)[source]¶

class regparser.tree.xml_parser.preprocessors.ExtractTags[source]¶

Bases: regparser.tree.xml_parser.preprocessors.PreProcessorBase

Often, what should be a single EXTRACT tag is broken up by incorrectly positioned subtags. Try to find any such EXTRACT sandwiches and merge.

FILLING = (u'FTNT', u'GPOTABLE')¶

combine_with_following(extract, include_tag)[source]¶: We need to merge an extract with the following tag. Rather than iterating over the node, text, tail text, etc. we’re taking a more naive solution: convert to a string, reparse

extract_pair(extract)[source]¶: Checks for and merges two EXTRACT tags in sequence

sandwich(extract)[source]¶: Checks for this pattern: EXTRACT FILLING EXTRACT, and, if present, combines the first two tags. The two EXTRACTs would get merged in a later pass
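The EXTRACT FILLING EXTRACT pattern can be sketched as below. This is an illustrative approximation only: it uses `xml.etree.ElementTree`, the helper names `find_sandwiches` and `absorb_filling` are hypothetical, and it moves the filling element directly instead of using the string-and-reparse approach described for combine_with_following.

```python
import xml.etree.ElementTree as ET

FILLING = ("FTNT", "GPOTABLE")


def find_sandwiches(parent):
    """Return each EXTRACT that is followed by a FILLING tag and another EXTRACT."""
    children = list(parent)
    hits = []
    for first, middle, last in zip(children, children[1:], children[2:]):
        if (first.tag == "EXTRACT" and middle.tag in FILLING
                and last.tag == "EXTRACT"):
            hits.append(first)
    return hits


def absorb_filling(parent, extract):
    """Move the element right after `extract` inside it (first two tags merged)."""
    children = list(parent)
    filling = children[children.index(extract) + 1]
    parent.remove(filling)
    extract.append(filling)


root = ET.fromstring(
    "<ROOT><EXTRACT><P>a</P></EXTRACT><GPOTABLE/>"
    "<EXTRACT><P>b</P></EXTRACT><P>c</P></ROOT>"
)
for extract in find_sandwiches(root):
    absorb_filling(root, extract)
```

After this pass the two EXTRACT elements are adjacent, which is the situation extract_pair is documented to handle.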

static strip_root_tag(string)[source]¶

transform(xml)[source]¶

class regparser.tree.xml_parser.preprocessors.Footnotes[source]¶

Bases: regparser.tree.xml_parser.preprocessors.PreProcessorBase

The XML separates the content of footnotes and where they are referenced. To make it more semantic (and easier to process), we find the relevant footnote and attach its text to the references. We also need to split references apart if multiple footnotes apply to the same <SU>

IS_REF_PREDICATE = u'not(ancestor::TNOTE) and not(ancestor::FTNT)'¶

XPATH_FIND_NOTE_TPL = u"./following::SU[(ancestor::TNOTE or ancestor::FTNT) and text()='{0}']"¶

XPATH_IS_REF = u'.//SU[not(ancestor::TNOTE) and not(ancestor::FTNT)]'¶

add_ref_attributes(xml)[source]¶: Modify each footnote reference so that it has an attribute containing its footnote content

static is_reasonably_close(referencing, referenced)[source]¶: We want to make sure that _potential_ footnotes are truly related, as SU might also indicate generic superscript. To match a footnote with its content, we’ll try to find a common SECTION ancestor. We’ll also consider the two SUs related if neither has a SECTION ancestor, though we might want to restrict this further in the future.

split_comma_footnotes(xml)[source]¶: Convert XML such as <SU>1, 2, 3</SU> into distinct SU elements: <SU>1</SU> <SU>2</SU> <SU>3</SU> for easier reference
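A minimal sketch of the comma-splitting step, assuming `xml.etree.ElementTree` and a hypothetical function name `split_comma_su`. Whether the real method leaves whitespace between the new elements is not stated here, so this version emits none.

```python
import xml.etree.ElementTree as ET


def split_comma_su(root):
    """Split <SU>1, 2, 3</SU> into one <SU> per number, in place."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for su in list(root.iter("SU")):
        parts = [p.strip() for p in (su.text or "").split(",") if p.strip()]
        if len(parts) < 2 or su not in parent_of:
            continue
        parent = parent_of[su]
        index = list(parent).index(su)
        parent.remove(su)
        for offset, part in enumerate(parts):
            new_su = ET.Element("SU")
            new_su.text = part
            parent.insert(index + offset, new_su)
        # Keep the text that followed the original element.
        new_su.tail = su.tail


root = ET.fromstring("<P>Text<SU>1, 2, 3</SU> more</P>")
split_comma_su(root)
result = ET.tostring(root, encoding="unicode")
```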

transform(xml)[source]¶

class regparser.tree.xml_parser.preprocessors.ImportCategories[source]¶

Bases: regparser.tree.xml_parser.preprocessors.PreProcessorBase

447.21 contains an import list, but the XML doesn’t delineate the various categories well. We’ve created IMPORTCATEGORY tags to handle the hierarchy correctly, but we need to modify the XML to insert them in appropriate locations

CATEGORY_HD = u".//HD[contains(., 'categor')]"¶

SECTION_HD = u"//SECTNO[contains(., '447.21')]"¶

static remove_extract(section)[source]¶: The XML currently (though this may change) contains a semantically meaningless EXTRACT. Remove it

static split_categories(category_headers)[source]¶: We now have a big chunk of flat XML with headers and paragraphs. We’ll make it semantic by converting these into bundles and wrapping them in IMPORTCATEGORY tags

transform(xml)[source]¶

class regparser.tree.xml_parser.preprocessors.PreProcessorBase[source]¶

Bases: object

Base class for all the preprocessors. Defines the interface they must implement

transform(xml)[source]¶: Transform the input xml. Mutates that xml, so be sure to make a copy if needed
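Because transform mutates its argument, callers that still need the original tree should copy it first. The sketch below shows that pattern with a hypothetical `UppercaseHeaders` preprocessor and a stand-in base class; it uses `xml.etree.ElementTree` and is not taken from the package.

```python
import copy
import xml.etree.ElementTree as ET


class PreProcessorBase:
    """Stand-in mirroring the documented interface."""

    def transform(self, xml):
        raise NotImplementedError


class UppercaseHeaders(PreProcessorBase):
    """Hypothetical preprocessor: upper-cases the text of every HD tag."""

    def transform(self, xml):
        for header in xml.iter("HD"):
            if header.text:
                header.text = header.text.upper()


original = ET.fromstring("<SECTION><HD>Scope</HD><P>Text</P></SECTION>")
working = copy.deepcopy(original)  # transform() mutates, so work on a copy
UppercaseHeaders().transform(working)
```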

regparser.tree.xml_parser.preprocessors.atf_i50031(xml)[source]¶: 478.103 also contains a shorter form, which appears in a smaller poster. Unfortunately, the XML didn’t include the appropriate NOTE inside the corresponding EXTRACT

regparser.tree.xml_parser.preprocessors.atf_i50032(xml)[source]¶: 478.103 contains a chunk of text which is meant to appear in a poster and be easily copy-paste-able. Unfortunately, the post-2003 XML isn’t structured to contain all of the appropriate elements within the EXTRACT associated with the poster. This PreProcessor moves these additional elements back into the appropriate EXTRACT.

regparser.tree.xml_parser.preprocessors.move_adjoining_chars(xml)[source]¶: If an e tag has an emdash or period after it, put the char inside the e tag
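The adjustment described above amounts to moving one character from an element's tail into the element. A rough sketch, assuming `xml.etree.ElementTree`, an upper-case `E` tag, and a hypothetical function name `move_adjoining`:

```python
import xml.etree.ElementTree as ET

EM_DASH = "\u2014"


def move_adjoining(root, chars=(EM_DASH, ".")):
    """If an E tag is directly followed by one of `chars`, pull it inside."""
    for e_tag in root.iter("E"):
        tail = e_tag.tail or ""
        if tail[:1] not in chars:
            continue
        if len(e_tag):
            # The E tag has children: the char belongs after the last child.
            e_tag[-1].tail = (e_tag[-1].tail or "") + tail[0]
        else:
            e_tag.text = (e_tag.text or "") + tail[0]
        e_tag.tail = tail[1:]


root = ET.fromstring(
    '<P>(a) <E T="03">Scope</E>\u2014This part applies.</P>'
)
move_adjoining(root)
```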

regparser.tree.xml_parser.preprocessors.move_last_amdpar(xml)[source]¶: If the last element in a section is an AMDPAR, odds are the authors intended it to be associated with the following section

regparser.tree.xml_parser.preprocessors.move_subpart_into_contents(xml)[source]¶: Account for SUBPART tags being outside their intended CONTENTS

regparser.tree.xml_parser.preprocessors.parentheses_cleanup(xml)[source]¶: Clean up where parentheses exist between paragraph and emphasis tags

regparser.tree.xml_parser.preprocessors.preprocess_amdpars(xml)[source]¶: Modify the AMDPAR tag to contain an <EREGS_INSTRUCTIONS> element. This element contains an interpretation of the AMDPAR, as viewed as a sequence of actions for how to modify the CFR. Do _not_ modify any existing EREGS_INSTRUCTIONS (they’ve been manually created)

regparser.tree.xml_parser.preprocessors.promote_nested_tags(tag, xml)[source]¶: We don’t currently support certain tags nested inside subparts, so promote each up one level

regparser.tree.xml_parser.preprocessors.replace_html_entities(xml_bin_str)[source]¶: XML does not contain entity references for many HTML entities, yet the Federal Register XML sometimes contains the HTML entities. Replace them here, lest we throw off XML parsing
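One way to make such input parseable is to rewrite named HTML entities as numeric character references before parsing. The sketch below is an assumption about the general technique, not the package's implementation; `replace_named_entities` is a hypothetical name, and the entity table comes from the standard library's `html.entities`.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML itself predefines must be left untouched.
XML_ENTITIES = {"amp", "lt", "gt", "quot", "apos"}


def replace_named_entities(xml_bin_str):
    """Rewrite HTML-only named entities in a byte string as &#NNN; references."""

    def swap(match):
        name = match.group(1).decode("ascii")
        if name in XML_ENTITIES or name not in name2codepoint:
            return match.group(0)  # leave XML and unknown entities alone
        return "&#{};".format(name2codepoint[name]).encode("ascii")

    return re.sub(rb"&([A-Za-z][A-Za-z0-9]*);", swap, xml_bin_str)


raw = b"<P>5&nbsp;U.S.C. &sect; 552 &amp; more</P>"
cleaned = replace_named_entities(raw)
parsed = ET.fromstring(cleaned)  # would raise ParseError on the raw bytes
```

Numeric references keep the byte string independent of the document's encoding, which is why they are used here instead of inserting the characters directly.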

regparser.tree.xml_parser.reg_text module¶

regparser.tree.xml_parser.simple_hierarchy_processor module¶

class regparser.tree.xml_parser.simple_hierarchy_processor.DepthParagraphMatcher[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

Convert a paragraph with an optional prefixing paragraph marker into an appropriate node. Does not know about collapsed markers nor most types of nodes.

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

class regparser.tree.xml_parser.simple_hierarchy_processor.SimpleHierarchyMatcher(tags, node_type)[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

Detects tags passed to it on init and converts the contents of any matches into a hierarchy based on the SimpleHierarchyProcessor. Sets the node_type of the subtree’s root

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

class regparser.tree.xml_parser.simple_hierarchy_processor.SimpleHierarchyProcessor[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.ParagraphProcessor

ParagraphProcessor which attempts to pull out whatever paragraph marker is available and derive a hierarchy from that.

MATCHERS = [<regparser.tree.xml_parser.simple_hierarchy_processor.DepthParagraphMatcher object>]¶

additional_constraints()[source]¶

regparser.tree.xml_parser.tree_utils module¶

class regparser.tree.xml_parser.tree_utils.NodeStack[source]¶

Bases: regparser.tree.priority_stack.PriorityStack

The NodeStack aids our construction of a struct.Node tree. We process xml one paragraph at a time; using a priority stack allows us to insert items at their proper depth and unwind the stack (collecting children) as necessary

collapse()[source]¶: After all of the nodes have been inserted at their proper levels, collapse them into a single root node

unwind()[source]¶: Unwind the stack, collapsing sub-paragraphs that are on the stack into the children of the previous level.
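The insert-at-depth, unwind, and collapse cycle can be sketched with a plain list of levels. This is a standalone illustration: the `Node` and `DepthStack` classes and the `add` method are hypothetical, and the real class builds on PriorityStack, whose API is not shown in this reference. The sketch assumes the root is added first at depth 0 and every later node has a greater depth.

```python
class Node:
    """Hypothetical minimal node: a label plus a list of children."""

    def __init__(self, label):
        self.label = label
        self.children = []


class DepthStack:
    """Each level is a (depth, [nodes]) pair; deeper levels sit on top."""

    def __init__(self):
        self.levels = []

    def add(self, depth, node):
        # Anything deeper than the new node is finished: fold it into its parent.
        while self.levels and self.levels[-1][0] > depth:
            self.unwind()
        if self.levels and self.levels[-1][0] == depth:
            self.levels[-1][1].append(node)
        else:
            self.levels.append((depth, [node]))

    def unwind(self):
        """Attach the top level's nodes to the last node of the level below."""
        _, children = self.levels.pop()
        self.levels[-1][1][-1].children.extend(children)

    def collapse(self):
        """Unwind everything and return the single root node."""
        while len(self.levels) > 1:
            self.unwind()
        return self.levels[0][1][0]


stack = DepthStack()
stack.add(0, Node("root"))
for depth, label in [(1, "a"), (2, "a1"), (2, "a2"), (1, "b")]:
    stack.add(depth, Node(label))
tree = stack.collapse()
```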

regparser.tree.xml_parser.tree_utils.footnotes_to_plaintext(node, add_spaces)[source]¶

regparser.tree.xml_parser.tree_utils.get_node_text(node, add_spaces=False)[source]¶: Extract all the text from an XML node (including the text of its children).
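Collecting the text of a node and its children can be sketched with `itertext()` from `xml.etree.ElementTree`. The meaning of add_spaces is not documented here, so the behavior below (inserting a space wherever two pieces would otherwise run together) is only one plausible reading, and `node_text` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def node_text(node, add_spaces=False):
    """Join all text inside `node`, optionally separating adjacent pieces."""
    pieces = [piece for piece in node.itertext() if piece]
    if not add_spaces:
        return "".join(pieces)
    result = ""
    for piece in pieces:
        # Assumed behavior: add a space only where none exists at the seam.
        if result and not result[-1].isspace() and not piece[0].isspace():
            result += " "
        result += piece
    return result


node = ET.fromstring("<P>(a)<E>Scope.</E>This part applies.</P>")
tight = node_text(node)
spaced = node_text(node, add_spaces=True)
```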

regparser.tree.xml_parser.tree_utils.get_node_text_tags_preserved(xml_node)[source]¶: Get the body of an XML node as a string, avoiding a specific blacklist of bad tags.

regparser.tree.xml_parser.tree_utils.prepend_parts(parts_prefix, n)[source]¶: Recursively prepend parts_prefix to the parts of the node n. Parts is a list of markers that indicates where you are in the regulation text.

regparser.tree.xml_parser.tree_utils.replace_xml_node_with_text(node, text)[source]¶: There are some complications w/ lxml when determining where to add the replacement text. Account for all of that here.

regparser.tree.xml_parser.tree_utils.replace_xpath(xpath)[source]¶: Decorator to convert all elements matching the provided xpath into plain text. It converts the wrapped function into a new function which searches for the provided xpath and replaces all matches

regparser.tree.xml_parser.tree_utils.split_text(text, tokens)[source]¶: Given a body of text that contains tokens, splice the text along those tokens.
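The exact return shape of split_text is not documented here. As a rough illustration of splicing text along tokens, the hypothetical `split_at_tokens` below starts a new chunk at each token, searching for the tokens in the order given, and keeps any leading text as its own chunk.

```python
def split_at_tokens(text, tokens):
    """Cut `text` so that each chunk begins at one of `tokens` (in order)."""
    starts = []
    cursor = 0
    for token in tokens:
        position = text.find(token, cursor)
        if position == -1:
            continue  # token not present after the previous one; skip it
        starts.append(position)
        cursor = position + len(token)
    # Text before the first token (if any) becomes the first chunk.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]


chunks = split_at_tokens("(a) Foo (b) Bar", ["(a)", "(b)"])
with_intro = split_at_tokens("Intro (a) Foo", ["(a)"])
```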

regparser.tree.xml_parser.tree_utils.subscript_to_plaintext(node, add_spaces)[source]¶

regparser.tree.xml_parser.tree_utils.superscript_to_plaintext(node, add_spaces)[source]¶

regparser.tree.xml_parser.us_code module¶

class regparser.tree.xml_parser.us_code.USCodeMatcher[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

Matches a custom USCODE tag and parses its contents with the USCodeProcessor. Does not use a custom node type at the moment

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

class regparser.tree.xml_parser.us_code.USCodeParagraphMatcher[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.BaseMatcher

Convert a paragraph found in the US Code into appropriate Nodes

derive_nodes(xml, processor=None)[source]¶

matches(xml)[source]¶

paragraph_markers(text)[source]¶: We can’t use tree_utils.get_paragraph_markers as that makes assumptions about the order of paragraph markers (specifically that the markers will match the order found in regulations). This is simpler, looking only at multiple markers at the beginning of the paragraph
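Looking only at markers stacked at the start of a paragraph can be approximated with a regular expression. The pattern and the name `leading_markers` below are assumptions for illustration; the actual set of marker characters the method accepts is not given in this reference.

```python
import re

# One or more parenthesized markers at the very start, e.g. "(a)(1) ...".
LEADING_MARKERS_RE = re.compile(r"^\s*((?:\([A-Za-z0-9]+\)\s*)+)")


def leading_markers(text):
    """Return the markers that open the paragraph, ignoring later ones."""
    match = LEADING_MARKERS_RE.match(text)
    if not match:
        return []
    return re.findall(r"\(([A-Za-z0-9]+)\)", match.group(1))


markers = leading_markers("(a)(1) In general. See paragraph (b).")
```

No ordering assumptions are made, so a sequence such as "(1)(A)" is accepted just as readily as "(a)(1)".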

class regparser.tree.xml_parser.us_code.USCodeProcessor[source]¶

Bases: regparser.tree.xml_parser.paragraph_processor.ParagraphProcessor

ParagraphProcessor which converts a chunk of XML into Nodes. Only processes P nodes and limits the type of paragraph markers to those found in US Code

MATCHERS = [<regparser.tree.xml_parser.us_code.USCodeParagraphMatcher object>]¶

additional_constraints()[source]¶

regparser.tree.xml_parser.xml_wrapper module¶

class regparser.tree.xml_parser.xml_wrapper.XMLWrapper(xml, source=None)[source]¶

Bases: object

Wrapper around XML which provides a consistent interface shared by both Notices and Annual editions of XML

preprocess()[source]¶: Unfortunately, the notice xml is often inaccurate. This function attempts to fix some of those (general) flaws. For specific issues, we tend to instead use the files in settings.LOCAL_XML_PATHS

xml_str()[source]¶

xpath(*args, **kwargs)[source]¶

Module contents¶