
:page_facing_up: MARCspec - A common MARC record path language

Home Page: http://marcspec.github.io/MARCspec/marc-spec.html

Makefile 10.21% HTML 89.79%

marc-records marcspec marc bibliographic specification path

marcspec's Introduction

#MARCspec - A common MARC record path language

MARCspec is the specification of a reference, encoded as string, to a set of data from within a MARC record.

See http://marcspec.github.io/MARCspec for the latest specification.

marcspec's People

nichtich edsu danmichaelo

marcspec's Issues

MarcSpec for Python pymarc

@cKlee, are you still maintaining this organization and the several implementations?

Repeatability

of fields, e.g. 400
of subfields within fields, e.g. multiple $w

And furthermore:

Since MARC is an implementation of ISO 2709, there should be no additional constraints; e.g. alphabetic characters in field tags should be allowed.

Typically one is interested in the last characters of the subfield preceding 245$e. A field may be viewed as a data field, but treating subfields as subordinate data fields is the wrong concept: they are simply marks inserted at certain positions of the field data...

I very much doubt that one can invent a practically useful accessor syntax for MARC that is considerably "simpler" than full XPath.

Question on subfield branch of marcSpec rule

The subfield branch of the marcSpec rule confuses me as to the purpose of the subSpec clauses:

MARCspec          = fieldSpec *subSpec / (subfieldSpec *subSpec *(abrSubfieldSpec *subSpec)) / indicatorSpec *subSpec

Because one or more subSpecs can occur after the subfieldSpec, and after the abrSubfieldSpec, it seems like the following would be quite valid (it is valid with the parser I am building in python):

"880$a{?$f}$b$c$e{$f=\q}"

But I thought the function of the abrSubfieldSpec and the subSpecs after it is to allow multiple subfields to be specified. Would the subSpec {$f=\q} be evaluated against all of the subfields? What is the sense of the first subSpec in this?
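One literal reading of the rule can be sketched in Python. This is an illustrative splitter, not the parser mentioned above; the function name `split_subfield_branch` is hypothetical, subfield codes are restricted to `[a-z0-9]` for brevity, and subSpecs are assumed not to contain nested or escaped braces.

```python
import re

# One subfield reference followed by zero or more {...} subSpecs.
# Assumption: codes are [a-z0-9] only and braces are never nested or escaped.
PAIR = re.compile(r"(\$[a-z0-9])((?:\{[^{}]*\})*)")

def split_subfield_branch(spec):
    """Return (field_tag, [(subfield_code, [subspec, ...]), ...])."""
    tag, rest = spec[:3], spec[3:]
    pairs = []
    pos = 0
    while pos < len(rest):
        m = PAIR.match(rest, pos)
        if m is None:
            raise ValueError(f"unexpected input: {rest[pos:]!r}")
        # Collect the contents of each {...} group attached to this subfield.
        subspecs = re.findall(r"\{([^{}]*)\}", m.group(2))
        pairs.append((m.group(1), subspecs))
        pos = m.end()
    return tag, pairs

tag, pairs = split_subfield_branch(r"880$a{?$f}$b$c$e{$f=\q}")
for code, subspecs in pairs:
    print(tag, code, subspecs)
```

Under this reading each subSpec attaches only to the subfield reference immediately before it, so `{?$f}` would constrain `$a` alone and `{$f=\q}` would constrain `$e` alone. Whether the specification intends that is exactly the open question.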

FieldTag definition

With respect to http://www.loc.gov/marc/specifications/specrecstruc.html#varitags field tags in ANSI Z39.2 and ISO 2709 could consist of both alphabetic and numeric characters, although MARC 21 formats use only numeric tags.

The current Spec embraces this possibility by:

fieldTag  = 3(alphalower / DIGIT / ".") / 3(alphaupper / DIGIT / ".")

@pkiraly suggests to disallow alphabetic characters and make LDR or LEADER an explicit field tag:

fieldTag  = 3(DIGIT / ".") /  "LDR" / "LEADER"

But if MARCspec as a whole is meant to cope with ANSI Z39.2 and ISO 2709, we should support alphabetic tags, provided this does not cause problems.

About the leader: LDR is already covered by the 3(alphaupper / DIGIT / ".") alternative, so why make this explicit? And does LEADER actually appear in data? Supporting it might require additional parsing effort. Is it necessary?
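To see which tags the two rules actually disagree on, both can be translated into regular expressions. This is a rough sketch; the names `CURRENT`, `SUGGESTED` and `accepted_by` are made up for the illustration.

```python
import re

# Current rule: 3(alphalower / DIGIT / ".") / 3(alphaupper / DIGIT / ".")
CURRENT = re.compile(r"[a-z0-9.]{3}|[A-Z0-9.]{3}")
# Suggested rule: 3(DIGIT / ".") / "LDR" / "LEADER"
SUGGESTED = re.compile(r"[0-9.]{3}|LDR|LEADER")

def accepted_by(tag):
    """Report which of the two fieldTag rules accepts the given tag."""
    return {
        "current": CURRENT.fullmatch(tag) is not None,
        "suggested": SUGGESTED.fullmatch(tag) is not None,
    }

for tag in ["245", "24.", "LDR", "LEADER", "A01", "aB1"]:
    print(tag, accepted_by(tag))
```

Numeric tags and `LDR` pass both rules. The rules differ on `LEADER` (only the suggested rule accepts it) and on alphabetic tags such as `A01` (only the current rule accepts them). Mixed-case tags such as `aB1` are rejected by both.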

Is some fallback syntax necessary?

If a subfield does not exist, allow specifying a fallback subfield.

245$a|k

Support for pointing to the subfields that follow a specific character

E.g. in titles I would like to point to everything after the "/" in a 245 field.

Is the scope of MARCspec MARC 21 or ISO 2709?

Right now MARCspec refers to MARC 21. Does this limitation hold, or is it also suitable for other applications of ISO 2709?

tests

For interoperability would it be useful/possible to assemble a language-agnostic collection of specs, and their expected stringified output given a test MARC record? The purpose would be to make sure that the MARCspec was parsed and interpreted correctly.

I was thinking of something in Markdown or JSON like:

{
  "245$a": "Finnegan's Wake /",
  "245$a$c": "Finnegan's Wake / James Joyce.",
  ...
}

If I were to put something together would this be of interest? I guess an initial set could be derived fairly easily from TestMarcSpecTest?
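A runner for such a fixture could be quite small. The sketch below assumes the JSON shape shown above (spec string mapped to expected stringified output) and a hypothetical `evaluate(spec)` callable supplied by whichever implementation is under test; here a canned lookup table stands in for a real implementation.

```python
import json

def run_fixture(fixture_json, evaluate):
    """Return (spec, expected, actual) for every spec whose output differs."""
    expected_by_spec = json.loads(fixture_json)
    failures = []
    for spec, expected in expected_by_spec.items():
        actual = evaluate(spec)
        if actual != expected:
            failures.append((spec, expected, actual))
    return failures

fixture = """{
  "245$a": "Finnegan's Wake /",
  "245$a$c": "Finnegan's Wake / James Joyce."
}"""

# Stand-in for a real implementation: canned answers, one deliberately wrong.
canned = {"245$a": "Finnegan's Wake /", "245$a$c": "Finnegan's Wake"}
failures = run_fixture(fixture, canned.get)
for spec, expected, actual in failures:
    print(f"{spec}: expected {expected!r}, got {actual!r}")
```

Because the fixture is plain JSON, the same file could be consumed by implementations in any language.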

Interpret ComparisonString as a regular expression

Reference subfield a of field 306 if character at position 0 of field 007 is either "m", "s" or "v".

306$a{007/0=~m|s|v}
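The intended check can be sketched as follows. The helper `char_matches` and the sample 007 values are assumptions for illustration only, and the sketch ignores how "|" inside a comparison string would interact with any other use of "|" in the grammar.

```python
import re

def char_matches(field_data, position, pattern):
    """True if the character at `position` fully matches the regular expression."""
    if position >= len(field_data):
        return False
    return re.fullmatch(pattern, field_data[position]) is not None

# Made-up 007 values: the first starts with "v", the second with "c".
print(char_matches("vd cvaizq", 0, "m|s|v"))
print(char_matches("cr |||", 0, "m|s|v"))
```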

Use "/" instead of "~" for character position and range

/ seems more common than ~.

008/0-3 instead of 008~0-3
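For a parser the change is small, since only the separator differs. The sketch below accepts either form; it assumes that positions are zero-based and that a range includes its end position, and the helper names are hypothetical.

```python
import re

def parse_char_spec(spec):
    """Split e.g. '008/0-3' or '008~0-3' into (tag, start, end)."""
    tag, rng = re.split(r"[/~]", spec, maxsplit=1)
    start, _, end = rng.partition("-")
    start = int(start)
    # A single position such as '008/0' is treated as a one-character range.
    end = int(end) if end else start
    return tag, start, end

def substring(data, start, end):
    """Characters from start to end, both inclusive (an assumption of this sketch)."""
    return data[start:end + 1]

tag, start, end = parse_char_spec("008/0-3")
print(tag, substring("950215s1995", start, end))
```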

Use "*" instead of X for wildcards

Locally defined fields might contain the character "X" in the tag. This might lead to interpretation problems.

rename

Having two words (MARC spec) as name is problematic, how about MARCspec?

make error: openFile: does not exist

I installed pandoc (v1.12.3) and made sure I had a clone of makespec in the directory above, and I see this when I run make:

pandoc: 1: openFile: does not exist (No such file or directory)

I'm new to this pandoc/makespec toolchain, so my apologies if this is a very basic question.

Automatic build not working?

This was merged quite some time ago: 2b1d491

But I still see 020$s{?020$a} at http://marcspec.github.io/MARCspec/marc-spec.html

Question / request for clarification

Hi. Given example data

020$cLorem$aIpsum
020$cDolor

I expected 020$c{$a} from example in 4.7.2 to return just ['Lorem'] (thinking
xpath-like 020[a]/c), but instead I got ['Lorem', 'Dolor'] (from File_MARC_Reference). Of course I'm guilty of not having read the spec thoroughly enough, but if I understand it right, this is a result of point 2 in 2.3? And that 020$c{$a} is just a shorthand for 020$c{020$a}?

To avoid confusion, perhaps

Reference data content of subfield “c” of field “020”, if subfield “a” of field “020” exists.

could be clarified as

Reference data content of subfield “c” of any field “020”, if subfield “a” of any field “020” exists (not necessarily the same field).

or something along those lines?
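The two readings can be compared directly on the example data. The record structure and function names below are made up for the sketch and do not reflect how File_MARC_Reference represents records.

```python
# Each field is (tag, [(subfield_code, value), ...]); data mirrors the example above.
record = [
    ("020", [("c", "Lorem"), ("a", "Ipsum")]),
    ("020", [("c", "Dolor")]),
]

def values(fields, tag, code):
    """All values of subfield `code` in fields with the given tag."""
    return [v for t, subs in fields if t == tag for c, v in subs if c == code]

def record_scoped(fields):
    """020$c{020$a}: keep every 020$c if any 020 in the record has $a."""
    return values(fields, "020", "c") if values(fields, "020", "a") else []

def field_scoped(fields):
    """XPath-like 020[a]/c: keep $c only from 020 fields that carry $a themselves."""
    return [v for f in fields if values([f], "020", "a")
            for v in values([f], "020", "c")]

print(record_scoped(record))  # ['Lorem', 'Dolor']
print(field_scoped(record))   # ['Lorem']
```

The first function reproduces the result reported above; the second reproduces the result that was expected.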

references to values of indicators

see discussion pkiraly/qa-catalogue#23

subfieldChar includes punctuation as well as alphanumeric?

According to the MARC 21 bibliographic standard as well as [UNIMARC 2008], a subfield code can only be alphabetic or numeric (MARC 21 specifies lower-case alphabetic). However, the grammar defines subfieldChar and subfieldCode as:

subfieldChar      = %x21-3F / %x5B-7B / %x7D-7E
                    ; ! " # $ % & ' ( ) * + , - . / 0-9 : ; < = > ? [ \ ] ^ _ ` a-z { } ~
subfieldCode      = "$" subfieldChar

Is this intentional? Do non-bibliographic MARC use cases include punctuation as subfield codes?
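For reference, the ranges in the rule can be expanded mechanically to see exactly which characters are admitted. This is a small sketch that only enumerates the three hexadecimal ranges quoted above.

```python
# The three ranges from the subfieldChar rule, as inclusive code point pairs.
ranges = [(0x21, 0x3F), (0x5B, 0x7B), (0x7D, 0x7E)]

allowed = [chr(c) for lo, hi in ranges for c in range(lo, hi + 1)]
punctuation = [ch for ch in allowed if not ch.isalnum()]

print(len(allowed), "characters allowed in total")
print(len(punctuation), "of them are not alphanumeric:")
print(" ".join(punctuation))
```

The expansion yields 66 characters: the ten digits, the 26 lower-case letters and 30 punctuation characters. Upper-case letters, "@" and "|" fall outside the ranges.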

allow alphabetic characters in field tag

Locally defined field tags may contain alphabetic characters. Thus the field tag rule should allow these:

alphaupper = %x41-5A ; A-Z
alphalower = %x61-7A; a-z
fieldTag = 3*3(((alphalower / alphaupper) / DIGIT)) / "LDR"

Problem: how should "X" be interpreted when it is part of a locally defined field tag and not meant as a wildcard? Should another character, such as "*", be used for the wildcard?

Support pointer to a subfield of a given field

Given a MARC field, one could further select parts of it. An example:

titles = getMARCspec(record, "245")
foreach titleField in titles
    title = getMARCspec(titleField, "$a")
    remainder = getMARCspec(titleField, "$b")
    if title.endsWith(":") then
     ...
    end
done

I'd propose to change the core syntax to:

MARCspec = fieldSpec / characterSpec / subfieldSpec

; refer to a (set of) fields
fieldSpec    = fieldTag ["_" indicators]

; refer to a character position or range
characterSpec = [ fieldTag ] "/" characterPositionOrRange

; refer to a (set of) subfields of specified or given fields
subfieldSpec = fieldTag [ "$" ] subfieldTags ["_" indicators]
             / fieldTag "_" indicators "$" subfieldTags
             / "$" subfieldTags

This would also allow selecting subfields of a given field, such as "$a". The preceding "$" is necessary so that "123" (the field) is not confused with "$123" (three subfields). I'd also make it optional, so that "100a" == "100$a", and support giving indicators before subfields ("245a_1" == "245$a_1" == "245_1$a").

Note that "$" is also a valid subfield tag, so "$" should be mandatory to refer to this subfield:

100a ; valid (subfield "a" of field 100)
100$ ; invalid
100$a ; valid (subfield "a" of field 100)
100$$ ; valid (subfield "$" of field 100)

Referring to multiple specs for a given mapping/output

Hi there, is there a recommended way to define a combination of MARCspecs to indicate multiple applicable matches? Solrmarc and Traject both use a colon to delimit multiple specs.

Examples:

If I wanted to refer to both fields 506 and 540, it would be nice to be able to do something like 506:540.
A slightly more complex example, using subSpecs: 650$z:650$a:034{LDR/6=\e}:255{LDR/6=\e}
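As a sketch of how a consumer might handle this today, outside of MARCspec itself, the combined string can be split on colons that are not inside a subSpec. The function name `split_specs` is hypothetical. One caveat: ":" falls inside the %x21-3F range that the current grammar allows for subfield codes, so a spec such as `100$:` would be ambiguous under this convention.

```python
def split_specs(combined):
    """Split on ':' at brace depth 0, leaving subSpecs such as {LDR/6=\\e} intact."""
    specs, current, depth = [], [], 0
    for ch in combined:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == ":" and depth == 0:
            specs.append("".join(current))
            current = []
        else:
            current.append(ch)
    specs.append("".join(current))
    return specs

print(split_specs("506:540"))
print(split_specs(r"650$z:650$a:034{LDR/6=\e}:255{LDR/6=\e}"))
```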

pointing to the first item

Pointing to the first item, e.g. first author. For repeatable fields, point to the first in the list.

Possible solution: Prefix field tag with a character, which does not get encoded in URI. E.g. use "-". Thus the first field of all 100 fields is referenced by -100. Other possible characters are "~", "_", "/", "+" and "*".

Suggestion of renaming the specification elements, and make it more clear

While learning the specification and working on an implementation, I came to several conclusions that I would like to share with you.

Comments on existing features:

There are two main parts of the standard: one specifies a given part of a MARC record, the other provides a condition. For the first one I suggest using "path" or "address".
The "spec" suffix is used in the specification several times. Because of XPath and JSONPath, I suggest using "path" instead, or "address", or even an empty suffix (no suffix at all), which I promote in my suggestion below.
"subSpec" is not a very expressive name; I suggest "conditions" and "conditionSet".
There are two kinds of conditions: existential (? and !) and comparisons. In conditions based on comparison, leftSide and rightSide are not very expressive; I suggest "marcPath" (or whichever name is used instead) and "value".
The value can be a reference or a literal. To denote literal values I suggest the traditional single or double quotation marks rather than the unusual backslash (\) character. Use the backslash for escaping only.

Here is my formalized suggestion for renaming the specification

alphaupper         = %x41-5A
                     ; A-Z
alphalower         = %x61-7A
                     ; a-z
DIGIT              =  %x30-39
                     ; 0-9
VCHAR              =  %x21-7E
                     ; visible (printing) characters
positiveDigit      = %x31-39
                     ;  "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"
positiveInteger    = "0" / positiveDigit [1*DIGIT]

; field
fieldTag           = 3(DIGIT / ".")
                      / "LDR"
                      / "LEADER"
position           = positiveInteger / "#"
range              = position "-" position
positionOrRange    = range
                      / position
characterSpec      = "/" positionOrRange
index              = "[" positionOrRange "]"
shortField         = index [characterSpec]
                      / characterSpec
field              = fieldTag [index] [characterSpec]

; subfield
subfieldChar       = alphaupper
                      / alphalower
                      / DIGIT
subfieldCode       = "$" subfieldChar
subfieldCodeRange  = "$" ( (alphaupper "-" alphaupper)
                      / (alphalower "-" alphalower)
                      / (DIGIT "-" DIGIT) )
                      ; [a-z]-[a-z] / [0-9]-[0-9]
shortSubfield      = (subfieldCode / subfieldCodeRange) [index] [characterSpec]
subfield           = fieldTag [index] shortSubfield

; indicator
shortIndicator     = [index] "^" ("1" / "2")
indicator          = fieldTag shortIndicator

; condition
comparisonString   = ("'" *VCHAR "'")
                      / ('"' *VCHAR '"')
operator           = "=" / "!=" / "~" / "!~" / "!" / "?"
                      ; equal / unequal / includes / not includes / not exists / exists
abbreviation       = shortField
                      / shortSubfield
                      / shortIndicator
conditionTerm      = field
                      / subfield
                      / indicator
                      / comparisonString
                      / abbreviation
condition          = [ [conditionTerm] operator ] conditionTerm
conditionSet       = "{" condition *( "|" condition ) "}"

; the whole together
marcPath           = field *conditionSet
                     / (subfield *conditionSet *(shortSubfield *conditionSet))
                     / indicator *conditionSet

Besides that, the relationship between the "path" and the "condition" is not clear to me. There can be two interpretations of how the conditions apply, and for both there are valid use cases:

the condition should be true somewhere in the record
the condition should be true inside the context the path specifies

008/18{LDR/6=\t}

Here the situation is clear: 008 and LDR are two different fields, so we should follow the first interpretation.

880$a{100$6~880$6/3-5}
020$c{020$a}

Suppose we have two 880 fields. Should we take both if the condition is true for either of them, or should we take only the 880 for which the condition is true? The same question applies to 020 (which is a repeatable field).

I would like to see a constraint in which the context is defined explicitly. We can use the following notation for the leftHandSide (or path) part:

self or . means the current context
- 020$c{.="something"} - get 020$c if its value is "something"
parent or .. means the parent
- 020$c{..?$a} - get 020$c if the same 020 field has subfield $a
implicit path or any other explicit path: the context is the record
- 020$c{020$a} - get 020$c if there is 020$a anywhere in the record

I admit that "make it more clear" is a very subjective statement, as we don't have an absolute scale for semantic clarity. So this comment is meant to open a discussion rather than to be a final suggestion.

http://marcspec.github.io/ root URL doesn't show anything useful

...and it'd be nice if it did. :-)