rusdracor's Introduction

RusDraCor

Corpus Description

We are building a Russian Drama Corpus with files encoded in TEI-P5. Our corpus comprises 212 plays to date, originating from ilibrary, Wikisource, РВБ, lib.ru, ФЕБ, СовЛит and Wikilivres, converted to TEI and corrected and enhanced by us. There will be more.

If you want to cite the corpus, please use this publication:

  • Fischer, Frank, et al. (2019). Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama. In Proceedings of DH2019: "Complexities", Utrecht University, doi:10.5281/zenodo.4284002.

RusDraCor was first presented on June 29, 2017, at the Corpora 2017 conference in St. Petersburg (our slides here), on July 11, 2017, at the "Digitizing the stage" conference in Oxford and on November 14, 2017, at the TEI 2017 conference in Victoria. The social network data we extract from plays may also be explored on our website dracor.org/rus or via our Shinyapp.

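If you just want to download the corpus in its current state in XML-TEI, a shallow clone is enough (GitHub no longer serves repositories over Subversion; the TEI files are in the tei directory):

git clone --depth 1 https://github.com/dracor-org/rusdracor.git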

API

An easy way to download the network data (instead of the actual TEI files) is to use our API (documentation here). If you have jq installed, it works like this:

# List all play names in the corpus, then fetch each play's network data as CSV.
for play in $(curl -s 'https://dracor.org/api/corpora/rus' | jq -r '.dramas[].name'); do
    wget -O "$play.csv" "https://dracor.org/api/corpora/rus/play/$play/networkdata/csv"
done

The API info page is at https://dracor.org/api/info.

Simple Visualisation with R

To have a first look at the distribution of the number of speakers per play over time, you could feed the metadata table into R:

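library(data.table)
library(ggplot2)
# Read the corpus metadata table straight from the API.
rusdracor <- fread("https://dracor.org/api/corpora/rus/metadata.csv")
# One point per play: normalized year vs. number of speakers.
ggplot(rusdracor, aes(x = yearNormalized, y = numOfSpeakers)) + geom_point()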

Result:

[Figure: number of speakers per play over time]

Here is a barplot showing the number of plays per decade:

[Figure: number of plays per decade]
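
The binning behind such a barplot can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used to produce the plot above; the `plays_per_decade` helper and the sample years are hypothetical stand-ins for the `yearNormalized` column.

```python
from collections import Counter


def plays_per_decade(years):
    """Count plays per decade (1836 falls into 1830); keys are sorted."""
    counts = Counter((year // 10) * 10 for year in years)
    return dict(sorted(counts.items()))


# Hypothetical values standing in for the yearNormalized column.
years = [1782, 1787, 1836, 1842, 1871]
per_decade = plays_per_decade(years)
print(per_decade)
```

The resulting mapping can be handed to any plotting library as bar positions and heights.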

(README last updated on July 26, 2021.)

rusdracor's People

Contributors

alexdyul, annaoskina, cmil, danilsko, evgeniashlosman, ingoboerner, lehkost, mathias-goebel, nevmenandr, rita7798, yaskevich

rusdracor's Issues

dot at the end of speaker's name

Cases like this

<speaker>Г. Тоисиоков.</speaker>

I guess the dot appears in the printed publication, but it is only part of the formatting. If a play is printed with the speaker's name separated from the speech by an indent, there is no dot (e.g. krylov-filomela.xml).
Since it is neither part of the content nor needed for presentation, it is not really needed in the XML.

This is a very low-priority question :-)
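
If the dots were to be removed, a cleanup along the following lines might do. This is a minimal sketch using Python's standard library, assuming the files use the TEI namespace and that speaker names are plain text without child elements; the `strip_speaker_dots` helper is hypothetical and not part of the corpus tooling.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def strip_speaker_dots(root):
    """Remove one trailing dot from every <speaker>; return how many changed."""
    changed = 0
    for speaker in root.iter(f"{{{TEI_NS}}}speaker"):
        text = speaker.text or ""
        # Only the final dot goes; dots after initials ("Г. ") stay.
        cleaned = re.sub(r"\.\s*$", "", text)
        if cleaned != text:
            speaker.text = cleaned
            changed += 1
    return changed


sample = f'<sp xmlns="{TEI_NS}"><speaker>Г. Тоисиоков.</speaker><p>...</p></sp>'
root = ET.fromstring(sample)
count = strip_speaker_dots(root)
speaker_text = root.find(f"{{{TEI_NS}}}speaker").text
print(count, speaker_text)
```

Anchoring the pattern to the end of the string keeps abbreviations inside the name intact; names that end in an ellipsis would need a separate decision.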

invalid usage of data.pointer

When using an IDREF (usually introduced by "#"), the ID it names must point to an @xml:id in the same document, just as @who points to the person element. This is not the case for your revision description. A good approach is to provide the names within tei:fileDesc using tei:editor or just tei:person, together with @resp and a very short description of each person's role in the project.
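
A check for such dangling pointers could look roughly like this. It is a sketch under assumptions: only the `who` and `resp` attributes are inspected, and the `dangling_pointers` helper and the `#editor1` value in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
POINTER_ATTRS = ("who", "resp")  # assumption: the attributes worth checking


def dangling_pointers(root):
    """Return sorted local pointers (#foo) that match no xml:id in the document."""
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    missing = set()
    for el in root.iter():
        for attr in POINTER_ATTRS:
            # An attribute may hold several whitespace-separated pointers.
            for pointer in (el.get(attr) or "").split():
                if pointer.startswith("#") and pointer[1:] not in ids:
                    missing.add(pointer)
    return sorted(missing)


# Hypothetical fragment: "#editor1" has no matching xml:id.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="neschastlivtsev"/>
  <change resp="#editor1">hypothetical revision note</change>
  <sp who="#neschastlivtsev"/>
</TEI>"""
missing = dangling_pointers(ET.fromstring(sample))
print(missing)
```

Schema validation against the RNG would catch the same problem; this script is only a quick way to list the offending values.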

teiHeader consistencies

  • embed schema https://dracor.org/rus/schema.rng
  • lowercase Wikidata to wikidata in <author key="Wikidata:Qxxxx">Xxxxx, Xxxx Xxxx</author>
  • transform IDs: <idno type="RusDraCor">119</idno> becomes <idno type="dracor" xml:base="https://dracor.org/id/">rus000119</idno>
  • change licence to CC0: <ab>CC0</ab><ref target="https://creativecommons.org/publicdomain/zero/1.0/">Licence</ref>
  • delete file name from <publicationStmt> (<idno type="URL">https://dracor.org/rus/xxxxx-xxxxx</idno>)
  • in xml:base="https://www.wikidata.org/wiki/", change /wiki/ to /entity/
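
Three of the items above are plain string transformations and can be sketched as small Python helpers. The function names are hypothetical, and the six-digit zero padding is inferred from the single example rus000119 rather than from a documented rule.

```python
def dracor_id(number, corpus="rus"):
    """Turn an old numeric ID such as '119' into 'rus000119' (padding assumed)."""
    return f"{corpus}{int(number):06d}"


def lowercase_wikidata_prefix(key):
    """'Wikidata:Q42' -> 'wikidata:Q42'; any other key is returned unchanged."""
    prefix, sep, qid = key.partition(":")
    return f"{prefix.lower()}{sep}{qid}" if prefix == "Wikidata" else key


def entity_base(url):
    """Swap the /wiki/ base for /entity/ in a Wikidata xml:base value."""
    return url.replace(
        "https://www.wikidata.org/wiki/", "https://www.wikidata.org/entity/"
    )


print(dracor_id("119"))
print(lowercase_wikidata_prefix("Wikidata:Q42"))
print(entity_base("https://www.wikidata.org/wiki/"))
```

Applying them to the actual headers would still require locating the idno and author elements in each file.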

new lines inside text

For some reason, some files contain newline characters inside text content, like this:

<title type="main" xml:lang="ru">Театральный разъезд после представления новой
          комедии</title>

Source: gogol-teatralnyi-razezd.xml

Of course, this does not matter for XML, but it becomes a problem if one wants to quickly process a bunch of XML files with regular expressions or similar tools.

As for the XML itself: since the line break carries neither semantics nor presentation, it is not needed.
I can fix this. The issue was created only to discuss the policy.
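
Whatever the policy turns out to be, the normalization itself is small. The sketch below collapses every run of whitespace in a title to a single space; the `collapse_whitespace` helper is hypothetical, and it is only safe for elements with plain text, since mixed content would also need its child tails handled.

```python
import xml.etree.ElementTree as ET


def collapse_whitespace(text):
    """Replace each run of whitespace (including newlines) with one space."""
    return " ".join(text.split())


sample = (
    '<title type="main" xml:lang="ru">Театральный разъезд после представления новой\n'
    "          комедии</title>"
)
title = ET.fromstring(sample)
title.text = collapse_whitespace(title.text)
print(title.text)
```

Note that `split()` also drops leading and trailing whitespace, which is usually what is wanted for titles.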

Stage inside line

Currently, several files have stage inside p:

<sp who="#neschastlivtsev">
<speaker>Несчастливцев.</speaker>
<p>О, люди, люди! <stage>(Идет в угол, надевает котомку.)</stage></p>
<stage>Аксюша помогает ему и целует его. Берет в руки палку.</stage>
<p>Ну, Аркадий, мы с тобой попировали, пошумели, братец; теперь опять за работу!
<stage>(Выходит на середину сцены, подзывает Карпа и говорит ему с расстановкой и
внушительно.)</stage> Послушай, Карп! Если приедет тройка, ты вороти ее, братец, в
город; скажи, что господа пешком пошли. Руку, товарищ! <stage>(Подает руку
Счастливцеву и медленно удаляется.)</stage></p>
</sp>

Maybe it should be:

<sp who="#neschastlivtsev">
	<speaker>Несчастливцев.</speaker>
	<p>О, люди, люди!</p>
	<stage>(Идет в угол, надевает котомку.)</stage>			
	<stage>Аксюша помогает ему и целует его. Берет в руки палку.</stage>
	<p>Ну, Аркадий, мы с тобой попировали, пошумели, братец; теперь опять за работу!</p>
	<stage>(Выходит на середину сцены, подзывает Карпа и говорит ему с расстановкой и внушительно.)</stage>
	<p>Послушай, Карп! Если приедет тройка, ты вороти ее, братец, в город; скажи, что господа пешком пошли. Руку, товарищ!</p>
	<stage>(Подает руку Счастливцеву и медленно удаляется.)</stage>
</sp>

To me, the second variant seems better. Or are there different types of stage?
(The quotations are from ostrovsky-les.xml.)
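
If the second variant were adopted, the conversion could be automated roughly as follows. This is a sketch, not an agreed procedure: the `hoist_stages` helper is hypothetical, the sample is shortened and written without the TEI namespace (pass `ns="{http://www.tei-c.org/ns/1.0}"` for real files), and paragraphs containing any inline element other than stage are left untouched.

```python
import xml.etree.ElementTree as ET


def hoist_stages(sp, ns=""):
    """Rebuild <sp> so each <stage> inside a <p> becomes a sibling of <p>."""
    p_tag, stage_tag = f"{ns}p", f"{ns}stage"
    rebuilt = []
    for child in list(sp):
        only_stages = len(child) > 0 and all(c.tag == stage_tag for c in child)
        if child.tag != p_tag or not only_stages:
            rebuilt.append(child)
            continue
        # Text runs and stages alternate: p.text, stage, stage.tail, ...
        runs = [child.text] + [part for st in child for part in (st, st.tail)]
        for run in runs:
            if run is None:
                continue
            if isinstance(run, str):
                text = " ".join(run.split())
                if text:  # skip whitespace-only runs
                    p = ET.Element(p_tag)
                    p.text = text
                    rebuilt.append(p)
            else:
                run.tail = None  # the tail now lives in its own <p>
                rebuilt.append(run)
    for child in list(sp):
        sp.remove(child)
    sp.extend(rebuilt)
    return sp


sample = (
    "<sp><speaker>Несчастливцев.</speaker>"
    "<p>О, люди, люди! <stage>(Идет в угол, надевает котомку.)</stage></p>"
    "<stage>Аксюша помогает ему и целует его. Берет в руки палку.</stage>"
    "<p>Руку, товарищ! <stage>(Подает руку Счастливцеву и медленно удаляется.)</stage></p>"
    "</sp>"
)
sp = hoist_stages(ET.fromstring(sample))
tags = [child.tag for child in sp]
print(tags)
```

Splitting one paragraph into several changes the paragraph count of a speech, so the transformation is worth running only once the encoding policy is settled.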
