regparser.tree.xml_parser package¶
Submodules¶
regparser.tree.xml_parser.appendices module¶
regparser.tree.xml_parser.extended_preprocessors module¶
regparser.tree.xml_parser.flatsubtree_processor module¶
-
class
regparser.tree.xml_parser.flatsubtree_processor.
FlatParagraphProcessor
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.ParagraphProcessor
Paragraph Processor which does not try to derive paragraph markers
-
MATCHERS
= [<regparser.tree.xml_parser.paragraph_processor.StarsMatcher object>, <regparser.tree.xml_parser.paragraph_processor.TableMatcher object>, <regparser.tree.xml_parser.simple_hierarchy_processor.SimpleHierarchyMatcher object>, <regparser.tree.xml_parser.paragraph_processor.HeaderMatcher object>, <regparser.tree.xml_parser.paragraph_processor.SimpleTagMatcher object>, <regparser.tree.xml_parser.us_code.USCodeMatcher object>, <regparser.tree.xml_parser.paragraph_processor.GraphicsMatcher object>, <regparser.tree.xml_parser.paragraph_processor.IgnoreTagMatcher object>]¶
-
-
class
regparser.tree.xml_parser.flatsubtree_processor.
FlatsubtreeMatcher
(tags, node_type=u'regtext')[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
Detects tags passed to it on init and processes them with the FlatParagraphProcessor. Also optionally sets node_type.
regparser.tree.xml_parser.import_category module¶
-
class
regparser.tree.xml_parser.import_category.
ImportCategoryMatcher
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
The IMPORTCATEGORY gets converted into a subtree with an appropriate title and unique paragraph marker
-
CATEGORY_RE
= <_sre.SRE_Pattern object>¶
-
derive_nodes
(xml, processor=None)[source]¶ Finds and deletes the category header before recursing. Adds this header as a title.
-
regparser.tree.xml_parser.interpretations module¶
regparser.tree.xml_parser.paragraph_processor module¶
-
class
regparser.tree.xml_parser.paragraph_processor.
BaseMatcher
[source]¶ Bases:
object
Base class defining the interface of various XML node matchers
-
class
regparser.tree.xml_parser.paragraph_processor.
FencedMatcher
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
Use GitHub-like fencing to indicate this is code
-
class
regparser.tree.xml_parser.paragraph_processor.
GraphicsMatcher
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
Convert Graphics tags into a markdown-esque format
-
class
regparser.tree.xml_parser.paragraph_processor.
HeaderMatcher
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
-
class
regparser.tree.xml_parser.paragraph_processor.
IgnoreTagMatcher
(*tags)[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.SimpleTagMatcher
As we log warnings when we don’t know how to process a tag, this matcher allows us to positively acknowledge that we’re ignoring some matches
-
class
regparser.tree.xml_parser.paragraph_processor.
ParagraphProcessor
[source]¶ Bases:
object
Processing paragraphs in a generic manner requires a lot of state to be carried in between xml nodes. Use a class to wrap that state so we can compartmentalize processing with various tags. This is an abstract class; regtext, interpretations, appendices, etc. should inherit and override where needed
-
DEPTH_HEURISTICS
= OrderedDict([(<function prefer_diff_types_diff_levels>, 0.8), (<function prefer_multiple_children>, 0.4), (<function prefer_shallow_depths>, 0.2), (<function prefer_no_markerless_sandwich>, 0.2)])¶
-
MATCHERS
= []¶
-
build_hierarchy
(root, nodes, depths)[source]¶ Given a root node, a flat list of child nodes, and a list of depths, build a node hierarchy around the root
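A minimal sketch of the idea, not the library's implementation. The Node class here is a hypothetical stand-in, and the sketch assumes depth 0 means a direct child of the root and that depths never increase by more than one between consecutive nodes.

```python
class Node:
    """Hypothetical stand-in for a tree node (not regparser's struct.Node)."""
    def __init__(self, text):
        self.text = text
        self.children = []


def build_hierarchy(root, nodes, depths):
    """Attach each node under the most recent node one level shallower."""
    # stack[0] is the root; stack[d + 1] is the latest node seen at depth d
    stack = [root]
    for node, depth in zip(nodes, depths):
        # Forget everything at this depth or deeper, then attach to the parent
        del stack[depth + 1:]
        stack[depth].children.append(node)
        stack.append(node)
    return root


root = Node("root")
a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
build_hierarchy(root, [a, b, c, d], [0, 1, 1, 0])
```

After the call, a and d are children of the root, while b and c are children of a.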
-
carry_label_to_children
(node)[source]¶ Takes a node and recursively processes its children to add the appropriate label prefix to them.
-
parse_nodes
(xml)[source]¶ Derive a flat list of nodes from this xml chunk. This does nothing to determine node depth
-
static
replace_markerless
(stack, node, depth)[source]¶ Assign a unique index to all of the MARKERLESS paragraphs
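As an illustration only: the sketch below works on a flat list of labels rather than on the (stack, node, depth) arguments documented above. The placeholder value "MARKERLESS" and the "p1", "p2" naming scheme are assumptions made for this example.

```python
MARKERLESS = "MARKERLESS"  # assumed placeholder for a paragraph with no marker


def replace_markerless(labels):
    """Give every MARKERLESS entry a unique, sequential label."""
    counter = 0
    result = []
    for label in labels:
        if label == MARKERLESS:
            counter += 1
            result.append("p{}".format(counter))  # hypothetical naming scheme
        else:
            result.append(label)
    return result
```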
-
-
class
regparser.tree.xml_parser.paragraph_processor.
SimpleTagMatcher
(*tags)[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
Simple example tag matcher – it listens for specific tags and derives a single node with the associated body
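A rough sketch of how a tag matcher and a processor loop might fit together. It uses the standard library's ElementTree instead of the library's own XML handling. The matches method, the TagMatcher and IgnoreMatcher classes, and the standalone parse_nodes function are assumptions made for illustration; only the derive_nodes(xml, processor=None) signature appears in this reference.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger(__name__)


class TagMatcher:
    """Hypothetical matcher: listens for specific tags, derives one node."""
    def __init__(self, *tags):
        self.tags = tags

    def matches(self, xml):  # assumed interface method
        return xml.tag in self.tags

    def derive_nodes(self, xml, processor=None):
        # A "node" is just the element's text in this sketch
        return ["".join(xml.itertext()).strip()]


class IgnoreMatcher(TagMatcher):
    """Acknowledge a tag without producing nodes (and without a warning)."""
    def derive_nodes(self, xml, processor=None):
        return []


def parse_nodes(xml, matchers):
    """Derive a flat list of nodes; the first matcher that matches wins."""
    nodes = []
    for child in xml:
        for matcher in matchers:
            if matcher.matches(child):
                nodes.extend(matcher.derive_nodes(child))
                break
        else:
            logger.warning("No matcher for tag %s", child.tag)
    return nodes
```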
-
class
regparser.tree.xml_parser.paragraph_processor.
StarsMatcher
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
<STARS> indicates a chunk of text which is being skipped over
regparser.tree.xml_parser.preprocessors module¶
Set of transforms we run on notice XML to account for common inaccuracies in the XML
-
class
regparser.tree.xml_parser.preprocessors.
ApprovalsFP
[source]¶ Bases:
regparser.tree.xml_parser.preprocessors.PreProcessorBase
We expect certain text to appear in an APPRO tag, but it is often mistakenly found inside FP tags. We use REGEX to determine which nodes need to be fixed.
-
REGEX
= <_sre.SRE_Pattern object>¶
-
-
class
regparser.tree.xml_parser.preprocessors.
ExtractTags
[source]¶ Bases:
regparser.tree.xml_parser.preprocessors.PreProcessorBase
Often, what should be a single EXTRACT tag is broken up by incorrectly positioned subtags. Try to find any such EXTRACT sandwiches and merge.
-
FILLING
= (u'FTNT', u'GPOTABLE')¶
-
combine_with_following
(extract, include_tag)[source]¶ We need to merge an extract with the following tag. Rather than iterating over the node, text, tail text, etc. we’re taking a more naive solution: convert to a string, reparse
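The "convert to a string, reparse" approach might look roughly like the sketch below. It uses the standard library's ElementTree and takes the parent element explicitly, so its signature differs from the one documented above. It also assumes the two elements are adjacent siblings and drops any text sitting between them.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def combine_with_following(parent, extract, following):
    """Merge `following` into `extract` by serializing and reparsing."""
    # Text after `following` must stay outside the merged element
    tail, following.tail = following.tail, None
    # ET.tostring includes each child's tail, which belongs inside the extract
    inner = escape(extract.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in extract)
    source = "<{0}>{1}{2}</{0}>".format(
        extract.tag, inner, ET.tostring(following, encoding="unicode"))
    merged = ET.fromstring(source)
    merged.attrib.update(extract.attrib)
    merged.tail = tail
    # Swap the two original elements for the merged one
    index = list(parent).index(extract)
    parent.remove(following)
    parent.remove(extract)
    parent.insert(index, merged)
    return merged
```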
-
-
class
regparser.tree.xml_parser.preprocessors.
Footnotes
[source]¶ Bases:
regparser.tree.xml_parser.preprocessors.PreProcessorBase
The XML separates the content of footnotes and where they are referenced. To make it more semantic (and easier to process), we find the relevant footnote and attach its text to the references. We also need to split references apart if multiple footnotes apply to the same <SU>.
-
IS_REF_PREDICATE
= u'not(ancestor::TNOTE) and not(ancestor::FTNT)'¶
-
XPATH_FIND_NOTE_TPL
= u"./following::SU[(ancestor::TNOTE or ancestor::FTNT) and text()='{0}']"¶
-
XPATH_IS_REF
= u'.//SU[not(ancestor::TNOTE) and not(ancestor::FTNT)]'¶
-
add_ref_attributes
(xml)[source]¶ Modify each footnote reference so that it has an attribute containing its footnote content
-
static
is_reasonably_close
(referencing, referenced)[source]¶ We want to make sure that _potential_ footnotes are truly related, as SU might also indicate generic superscript. To match a footnote with its content, we’ll try to find a common SECTION ancestor. We’ll also consider the two SUs related if neither has a SECTION ancestor, though we might want to restrict this further in the future.
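The common-ancestor check could be sketched as follows. This version uses the standard library's ElementTree, which has no parent pointers, so a parent map is built first; the build_parent_map and section_ancestor helpers are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element to its parent (ElementTree lacks getparent)."""
    return {child: parent for parent in root.iter() for child in parent}


def section_ancestor(elem, parents):
    """Return the nearest SECTION ancestor of elem, or None."""
    elem = parents.get(elem)
    while elem is not None and elem.tag != "SECTION":
        elem = parents.get(elem)
    return elem


def is_reasonably_close(referencing, referenced, parents):
    # Two SUs are related if they share a SECTION ancestor. When neither has
    # one, both lookups return None and the SUs are also treated as related.
    return (section_ancestor(referencing, parents)
            is section_ancestor(referenced, parents))
```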
-
-
class
regparser.tree.xml_parser.preprocessors.
ImportCategories
[source]¶ Bases:
regparser.tree.xml_parser.preprocessors.PreProcessorBase
447.21 contains an import list, but the XML doesn’t delineate the various categories well. We’ve created IMPORTCATEGORY tags to handle the hierarchy correctly, but we need to modify the XML to insert them in appropriate locations
-
CATEGORY_HD
= u".//HD[contains(., 'categor')]"¶
-
SECTION_HD
= u"//SECTNO[contains(., '447.21')]"¶
-
static
remove_extract
(section)[source]¶ The XML currently (though this may change) contains a semantically meaningless EXTRACT. Remove it
-
-
class
regparser.tree.xml_parser.preprocessors.
PreProcessorBase
[source]¶ Bases:
object
Base class for all the preprocessors. Defines the interface they must implement
-
regparser.tree.xml_parser.preprocessors.
atf_i50031
(xml)[source]¶ 478.103 also contains a shorter form, which appears in a smaller poster. Unfortunately, the XML didn’t include the appropriate NOTE inside the corresponding EXTRACT
-
regparser.tree.xml_parser.preprocessors.
atf_i50032
(xml)[source]¶ 478.103 contains a chunk of text which is meant to appear in a poster and be easily copy-paste-able. Unfortunately, the post-2003 XML isn’t structured to contain all of the appropriate elements within the EXTRACT associated with the poster. This PreProcessor moves these additional elements back into the appropriate EXTRACT.
-
regparser.tree.xml_parser.preprocessors.
move_adjoining_chars
(xml)[source]¶ If an e tag has an emdash or period after it, put the char inside the e tag
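A sketch of the transformation using the standard library's ElementTree. In ElementTree (as in lxml), text that follows an element is stored in that element's tail, so the character moves from tail to text. The uppercase tag name E and the assumption that the tag contains only text are choices made for this example.

```python
import xml.etree.ElementTree as ET

ADJOINING_CHARS = (u"\u2014", u".")  # em dash, period


def move_adjoining_chars(root):
    """Pull a leading em dash or period from an E tag's tail into the tag."""
    for e_tag in root.iter("E"):
        tail = e_tag.tail or ""
        if tail[:1] in ADJOINING_CHARS:
            e_tag.text = (e_tag.text or "") + tail[0]
            e_tag.tail = tail[1:]
    return root
```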
-
regparser.tree.xml_parser.preprocessors.
move_last_amdpar
(xml)[source]¶ If the last element in a section is an AMDPAR, odds are the authors intended it to be associated with the following section
-
regparser.tree.xml_parser.preprocessors.
move_subpart_into_contents
(xml)[source]¶ Account for SUBPART tags being outside their intended CONTENTS
-
regparser.tree.xml_parser.preprocessors.
parentheses_cleanup
(xml)[source]¶ Clean up where parentheses exist between paragraph and emphasis tags
-
regparser.tree.xml_parser.preprocessors.
preprocess_amdpars
(xml)[source]¶ Modify the AMDPAR tag to contain an <EREGS_INSTRUCTIONS> element. This element contains an interpretation of the AMDPAR, as viewed as a sequence of actions for how to modify the CFR. Do _not_ modify any existing EREGS_INSTRUCTIONS (they’ve been manually created)
We don’t currently support certain tags nested inside subparts, so promote each up one level
regparser.tree.xml_parser.reg_text module¶
regparser.tree.xml_parser.simple_hierarchy_processor module¶
-
class
regparser.tree.xml_parser.simple_hierarchy_processor.
DepthParagraphMatcher
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
Convert a paragraph with an optional prefixing paragraph marker into an appropriate node. Does not know about collapsed markers nor most types of nodes.
-
class
regparser.tree.xml_parser.simple_hierarchy_processor.
SimpleHierarchyMatcher
(tags, node_type)[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
Detects tags passed to it on init and converts the contents of any matches into a hierarchy based on the SimpleHierarchyProcessor. Sets the node_type of the subtree’s root
-
class
regparser.tree.xml_parser.simple_hierarchy_processor.
SimpleHierarchyProcessor
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.ParagraphProcessor
ParagraphProcessor which attempts to pull out whatever paragraph marker is available and derive a hierarchy from that.
-
MATCHERS
= [<regparser.tree.xml_parser.simple_hierarchy_processor.DepthParagraphMatcher object>]¶
-
regparser.tree.xml_parser.tree_utils module¶
-
class
regparser.tree.xml_parser.tree_utils.
NodeStack
[source]¶ Bases:
regparser.tree.priority_stack.PriorityStack
The NodeStack aids our construction of a struct.Node tree. We process xml one paragraph at a time; using a priority stack allows us to insert items at their proper depth and unwind the stack (collecting children) as necessary
-
regparser.tree.xml_parser.tree_utils.
get_node_text
(node, add_spaces=False)[source]¶ Extract all the text from an XML node (including the text of its children).
Get the body of an XML node as a string, avoiding a specific blacklist of bad tags.
-
regparser.tree.xml_parser.tree_utils.
prepend_parts
(parts_prefix, n)[source]¶ Recursively prepend parts_prefix to the parts of the node n. Parts is a list of markers that indicates where you are in the regulation text.
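The recursion can be sketched with a hypothetical Node class that stores its parts in a label list. The class and attribute names are assumptions for illustration, not the library's struct.Node.

```python
class Node:
    """Hypothetical node: `label` is the list of parts, e.g. ['a', '1']."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def prepend_parts(parts_prefix, n):
    """Recursively put parts_prefix in front of every label in the subtree."""
    n.label = list(parts_prefix) + n.label
    for child in n.children:
        prepend_parts(parts_prefix, child)
    return n


child = Node(["a", "1"])
parent = Node(["a"], [child])
prepend_parts(["1005", "2"], parent)
```

After the call, parent.label is ['1005', '2', 'a'] and child.label is ['1005', '2', 'a', '1'].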
-
regparser.tree.xml_parser.tree_utils.
replace_xml_node_with_text
(node, text)[source]¶ There are some complications w/ lxml when determining where to add the replacement text. Account for all of that here.
-
regparser.tree.xml_parser.tree_utils.
replace_xpath
(xpath)[source]¶ Decorator to convert all elements matching the provided xpath into plain text. This converts the wrapped function into a new function which will search for the provided xpath and replace all matches
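A sketch of such a decorator, written against the standard library's ElementTree rather than lxml. It assumes the wrapped function receives a matching element and returns its replacement text, and that matches are not nested inside one another. The inline replacement step shows the kind of bookkeeping replace_xml_node_with_text has to do: the text goes into the parent's text when the node is the first child, otherwise into the previous sibling's tail, and the node's own tail must be kept.

```python
import functools
import xml.etree.ElementTree as ET


def replace_xpath(xpath):
    """Build a decorator that replaces every match of xpath with text."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapped(root):
            # ElementTree has no getparent(), so build a parent lookup
            parents = {c: p for p in root.iter() for c in p}
            for node in root.findall(xpath):
                parent = parents[node]
                text = fn(node) + (node.tail or "")
                siblings = list(parent)
                index = siblings.index(node)
                if index == 0:
                    parent.text = (parent.text or "") + text
                else:
                    previous = siblings[index - 1]
                    previous.tail = (previous.tail or "") + text
                parent.remove(node)
            return root
        return wrapped
    return decorator


@replace_xpath(".//SU")
def superscript_to_text(node):
    """Hypothetical usage: render superscripts as ^(n)."""
    return "^({})".format(node.text or "")
```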
regparser.tree.xml_parser.us_code module¶
-
class
regparser.tree.xml_parser.us_code.
USCodeMatcher
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
Matches a custom USCODE tag and parses its contents with the USCodeProcessor. Does not use a custom node type at the moment
-
class
regparser.tree.xml_parser.us_code.
USCodeParagraphMatcher
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.BaseMatcher
Convert a paragraph found in the US Code into appropriate Nodes
-
paragraph_markers
(text)[source]¶ We can’t use tree_utils.get_paragraph_markers as that makes assumptions about the order of paragraph markers (specifically that the markers will match the order found in regulations). This is simpler, looking only at multiple markers at the beginning of the paragraph
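One way to look only at markers at the beginning of a paragraph is a pair of regular expressions, as in the sketch below. The patterns, including the choice to accept any alphanumeric run in parentheses, are assumptions for illustration and not the library's own expressions.

```python
import re

# One or more parenthesized markers at the very start of the text
LEADING_MARKERS_RE = re.compile(r"^\s*((?:\([A-Za-z0-9]+\)\s*)+)")
SINGLE_MARKER_RE = re.compile(r"\(([A-Za-z0-9]+)\)")


def paragraph_markers(text):
    """Return the markers that open the paragraph, ignoring later ones."""
    match = LEADING_MARKERS_RE.match(text)
    if not match:
        return []
    return SINGLE_MARKER_RE.findall(match.group(1))
```

Markers that appear later in the paragraph, such as a cross-reference to "(b)", are not picked up because only the leading run is searched.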
-
-
class
regparser.tree.xml_parser.us_code.
USCodeProcessor
[source]¶ Bases:
regparser.tree.xml_parser.paragraph_processor.ParagraphProcessor
ParagraphProcessor which converts a chunk of XML into Nodes. Only processes P nodes and limits the type of paragraph markers to those found in US Code
-
MATCHERS
= [<regparser.tree.xml_parser.us_code.USCodeParagraphMatcher object>]¶
-
regparser.tree.xml_parser.xml_wrapper module¶
-
class
regparser.tree.xml_parser.xml_wrapper.
XMLWrapper
(xml, source=None)[source]¶ Bases:
object
Wrapper around XML which provides a consistent interface shared by both Notices and Annual editions of XML