Class TrecDocParser
- java.lang.Object
-
- org.apache.lucene.benchmark.byTask.feeds.TrecDocParser
-
- Direct Known Subclasses:
TrecFBISParser, TrecFR94Parser, TrecFTParser, TrecGov2Parser, TrecLATimesParser, TrecParserByPath
public abstract class TrecDocParser extends Object
Parser for trec doc content, invoked on doc text excluding the <DOC> and <DOCNO> tags, which are handled in TrecContentSource. Implementations are required to be stateless and hence thread safe.
-
-
Nested Class Summary
static class TrecDocParser.ParsePathType
Types of trec parse paths.
-
Field Summary
static TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
trec parser type used for unknown extensions
-
Constructor Summary
TrecDocParser()
-
Method Summary
static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
Extract from buf the text of interest within specified tags.
abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType)
Parse the text prepared in docBuf into a result DocData. No synchronization is required.
static TrecDocParser.ParsePathType pathType(File f)
Compute the path type of a file by inspecting the names of the file and its parents.
static String stripTags(StringBuilder buf, int start)
Strip tags from buf: each tag is replaced by a single blank.
static String stripTags(String buf, int start)
Strip tags from input.
-
-
-
Field Detail
-
DEFAULT_PATH_TYPE
public static final TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
trec parser type used for unknown extensions
-
-
Method Detail
-
pathType
public static TrecDocParser.ParsePathType pathType(File f)
Compute the path type of a file by inspecting the names of the file and its parents.
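As an illustration only, the lookup described here could be sketched as a walk up the file's ancestors. The enum `PathType`, its constant names, the default value, and the case-insensitive name match are all assumptions made for this sketch; this page does not list the constants of TrecDocParser.ParsePathType or state how many parent levels are inspected.

```java
import java.io.File;
import java.util.Locale;

public class PathTypeSketch {
    // Hypothetical stand-in for TrecDocParser.ParsePathType; constant names are assumed.
    enum PathType { GOV2, FBIS, FT, FR94, LATIMES }

    // Assumed default, standing in for DEFAULT_PATH_TYPE.
    static final PathType DEFAULT = PathType.GOV2;

    // Walk from the file up through its parents and return the first type
    // whose name equals a path component (case-insensitive, an assumption).
    static PathType pathType(File f) {
        for (File cur = f; cur != null; cur = cur.getParentFile()) {
            String name = cur.getName().toUpperCase(Locale.ROOT);
            for (PathType t : PathType.values()) {
                if (name.equals(t.name())) {
                    return t;
                }
            }
        }
        return DEFAULT; // no component matched a known type
    }

    public static void main(String[] args) {
        System.out.println(pathType(new File("trec/fbis/fb396001")));
        System.out.println(pathType(new File("data/misc/doc.txt")));
    }
}
```

Falling back to a default rather than returning null keeps callers simple, at the cost of silently treating unknown layouts as the default type.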
-
parse
public abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType) throws IOException, InterruptedException
Parse the text prepared in docBuf into a result DocData. No synchronization is required.
Parameters:
docData - reusable result
name - name that should be set to the result
trecSrc - calling trec content source
docBuf - text to parse
pathType - type of parsed file, or null if unknown; may be used by parsers to alter their behavior according to the file path type
Throws:
IOException
InterruptedException
-
stripTags
public static String stripTags(StringBuilder buf, int start)
Strip tags from buf: each tag is replaced by a single blank.
Returns:
text obtained when stripping all tags from buf (the input StringBuilder is unmodified)
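The documented behavior (each tag becomes one blank, the input buffer is left untouched) can be reproduced with a small standalone sketch. This is not the library's code: the class name `StripTagsSketch` is hypothetical, and treating anything between `<` and the next `>` as a tag is an assumption of the sketch.

```java
public class StripTagsSketch {
    // Copy the tail of the buffer into a String, so the StringBuilder is never modified.
    static String stripTags(StringBuilder buf, int start) {
        return stripTags(buf.substring(start), 0);
    }

    // Replace every "<...>" run with a single blank, starting at offset 'start'.
    static String stripTags(String buf, int start) {
        if (start > 0) {
            buf = buf.substring(start);
        }
        return buf.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("<p>Hello</p><b>world</b>");
        System.out.println("[" + stripTags(sb, 0) + "]"); // prints "[ Hello  world ]"
        System.out.println("[" + stripTags(sb, 3) + "]"); // prints "[Hello  world ]"
    }
}
```

Because adjacent tags each yield their own blank, the result may contain runs of spaces; callers that need normalized whitespace would have to collapse them separately.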
-
stripTags
public static String stripTags(String buf, int start)
Strip tags from input.
See Also:
stripTags(StringBuilder, int)
-
extract
public static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
Extract from buf the text of interest within specified tags.
Parameters:
buf - entire input text
startTag - tag marking start of text of interest
endTag - tag marking end of text of interest
maxPos - if ≥ 0 sets a limit on start of text of interest
Returns:
text of interest or null if not found
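A minimal standalone sketch of the extraction contract above follows. It is an illustration under stated assumptions, not the library's implementation: the class name `ExtractSketch` is hypothetical, maxPos is applied only to the position of the start tag, and noisePrefixes (undocumented on this page) is assumed to list prefixes that are skipped when they occur inside the tagged span. Trimming the result is also an assumption.

```java
public class ExtractSketch {
    static String extract(StringBuilder buf, String startTag, String endTag,
                          int maxPos, String[] noisePrefixes) {
        int start = buf.indexOf(startTag);
        // Not found, or start tag lies at or beyond the limit (limit ignored when negative).
        if (start < 0 || (maxPos >= 0 && start >= maxPos)) {
            return null;
        }
        start += startTag.length();
        int end = buf.indexOf(endTag, start);
        if (end < 0) {
            return null; // no closing tag
        }
        // Assumed meaning of noisePrefixes: skip past any listed prefix found before the end tag.
        if (noisePrefixes != null) {
            for (String noise : noisePrefixes) {
                int n = buf.indexOf(noise, start);
                if (n >= 0 && n < end) {
                    start = n + noise.length();
                }
            }
        }
        return buf.substring(start, end).trim();
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder(
            "<DOC><DATE>Date: 1994-01-01</DATE><TEXT>body</TEXT></DOC>");
        // prints "1994-01-01"
        System.out.println(extract(doc, "<DATE>", "</DATE>", -1, new String[] {"Date:"}));
        // prints "null" because the <TEXT> tag starts beyond position 10
        System.out.println(extract(doc, "<TEXT>", "</TEXT>", 10, null));
    }
}
```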
-
-