
Help parsing large file · rdf4h (open, 7 comments)

h4ck3rm1k3 commented on August 18, 2024
Help parsing large file


Comments (7)

robstewart57 commented on August 18, 2024

Hi @h4ck3rm1k3,

Thanks for the report!

  1. will attoparsec help?

If you use the git repository for this library, then you can try experimental attoparsec support provided by @axman6 in November.

Try something like:

parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"

Does that improve the memory performance?
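
For example, a complete program might look like this (a minimal sketch: it assumes the git version, where TurtleParserCustom and Attoparsec are available, and picks the TList graph representation):

import Data.RDF

main :: IO ()
main = do
  -- parse with the experimental attoparsec backend
  result <- parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"
              :: IO (Either ParseFailure (RDF TList))
  case result of
    Left err  -> print err
    Right rdf -> print (length (triplesOf rdf))  -- force the whole parse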

  2. can we convert the RDF into an equivalent Haskell source program that would be compiled and strongly typed?

Interesting idea. What exactly would you want to convert to Haskell types? You might mean:

  1. The schema for each ontology used in a Turtle file? E.g. if the friend of a friend (FOAF) ontology is used, then the foaf:homepage predicate would be turned into a Haskell type? For this, have you looked at type providers? It's that sort of thing, i.e. turning a closed-world schema into types. F# has them; Haskell doesn't. (See the sketch after this list.)

  2. Turning Turtle data into types? I'm not sure how that'd work, or why turning ontological instances (data as triples) into Haskell types would be a useful thing to do, or what it'd look like.
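
For illustration, this is the kind of code a FOAF type provider might generate, if one existed for Haskell (purely hypothetical):

-- Hypothetical output of a FOAF type provider: each predicate in the
-- closed-world schema becomes a typed record field. Illustrative only.
import Data.Text (Text)

newtype IRI = IRI Text

data FoafPerson = FoafPerson
  { foafName     :: Maybe Text  -- foaf:name
  , foafHomepage :: Maybe IRI   -- foaf:homepage
  , foafKnows    :: [IRI]       -- foaf:knows
  }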

I'm interested to know if attoparsec (above) gives you better results.



h4ck3rm1k3 commented on August 18, 2024

Thinking about this, what I would really like is some mechanism to create a function that is applied to each statement as it is read, before the file is finished, like the SAX model in XML parsing. Then I could do my processing before the file is completely parsed.
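
For N-Triples, where each statement is one line, a rough sketch of that shape is a per-line callback over a lazily read file (forEachStatement is illustrative, not part of rdf4h):

import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

-- Lazy IO hands each line to the callback as the file is read,
-- rather than after the whole file has been consumed.
forEachStatement :: FilePath -> (TL.Text -> IO ()) -> IO ()
forEachStatement path k = TLIO.readFile path >>= mapM_ k . TL.lines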


h4ck3rm1k3 commented on August 18, 2024

Testing normal vs attoparsec on 30k lines: we are still hovering around 0.5 seconds per 1k lines.
The memory usage has gone down, but that is still not very fast. I think next I want to look into some callback function.
Both runs use NTriplesParserCustom.

    Thu Sep 21 06:51 2017 Time and Allocation Profiling Report  (Final)

           gcc-haskell-exe +RTS -N -p -h -RTS

        total time  =       14.89 secs   (14886 ticks @ 1000 us, 1 processor)
        total alloc = 28,746,934,240 bytes  (excludes profiling overheads)

COST CENTRE      MODULE                         SRC                                                    %time %alloc

satisfy          Text.Parsec.Char               Text/Parsec/Char.hs:(140,1)-(142,71)                    13.1   21.4
>>=              Text.Parsec.Prim               Text/Parsec/Prim.hs:202:5-29                            11.4   13.5
mplus            Text.Parsec.Prim               Text/Parsec/Prim.hs:289:5-34                             6.5    9.7
parsecMap.\      Text.Parsec.Prim               Text/Parsec/Prim.hs:190:7-48                             6.5   11.4
isSubDelims      Network.URI                    Network/URI.hs:355:1-38                                  4.4    0.0
fmap.\           Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(171,7)-(172,42)       4.1    3.1
isGenDelims      Network.URI                    Network/URI.hs:352:1-34                                  3.7    0.0
>>=.\.succ'      Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76              3.5    1.1
encodeChar       Codec.Binary.UTF8.String       Codec/Binary/UTF8/String.hs:(50,1)-(67,25)               3.1    4.6
encodeString     Codec.Binary.UTF8.String       Codec/Binary/UTF8/String.hs:37:1-53                      2.3    4.0
concat.ts'       Data.Text                      Data/Text.hs:902:5-34                                    2.0    2.6

Testing with the latest version of rdf4h:

        Thu Sep 21 06:34 2017 Time and Allocation Profiling Report  (Final)

           gcc-haskell-exe +RTS -N -p -h -RTS

        total time  =       15.28 secs   (15282 ticks @ 1000 us, 1 processor)
        total alloc = 33,815,423,648 bytes  (excludes profiling overheads)

COST CENTRE      MODULE                        SRC                                                    %time %alloc

satisfy          Text.Parsec.Char              Text/Parsec/Char.hs:(140,1)-(142,71)                    17.2   27.6
>>=              Text.Parsec.Prim              Text/Parsec/Prim.hs:202:5-29                            16.5   22.8
parsecMap.\      Text.Parsec.Prim              Text/Parsec/Prim.hs:190:7-48                             9.2    8.4
mplus            Text.Parsec.Prim              Text/Parsec/Prim.hs:289:5-34                             7.7    9.5
isSubDelims      Network.URI                   Network/URI.hs:355:1-38                                  3.9    0.0
isGenDelims      Network.URI                   Network/URI.hs:352:1-34                                  3.4    0.0
encodeChar       Codec.Binary.UTF8.String      Codec/Binary/UTF8/String.hs:(50,1)-(67,25)               2.9    3.9
encodeString     Codec.Binary.UTF8.String      Codec/Binary/UTF8/String.hs:37:1-53                      2.2    3.4
parserReturn.\   Text.Parsec.Prim              Text/Parsec/Prim.hs:234:7-30                             2.0    3.1


robstewart57 commented on August 18, 2024

Thinking about this, what I would really like is some mechanism to create a function that is applied to each statement as it is read, before the file is finished

Agreed, that would be a good feature, moving towards generating on-the-fly streams of RDF triples whilst parsing, rather than parsing a file/string in its entirety.

For example, looking at the API in the io-streams library, I can imagine that to read an RDF source we'd have a new type class:

class RdfParserStream p where
  parseStringStream
      :: (Rdf a)
      => p
      -> Text
      -> Either ParseFailure (InputStream (RDF a))
  parseFileStream
      :: (Rdf a)
      => p
      -> String
      -> IO (Either ParseFailure (InputStream (RDF a)))
  parseURLStream
      :: (Rdf a)
      => p
      -> String
      -> IO (Either ParseFailure (InputStream (RDF a)))

Then these triple streams could be connected to an output stream, e.g. a file output stream, using the io-streams API:

connect :: InputStream a -> OutputStream a -> IO () 
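
Hypothetical usage tying the two together (parseFileStream is the sketched method above, not an existing rdf4h function; connect and makeOutputStream are the real io-streams API):

{-# LANGUAGE ScopedTypeVariables #-}
import Data.RDF
import qualified System.IO.Streams as Streams

main :: IO ()
main = do
  e <- parseFileStream NTriplesParser "in.nt"
  case e of
    Left err -> print err
    Right (ins :: Streams.InputStream (RDF TList)) -> do
      -- a sink that simply discards each graph as it arrives
      out <- Streams.makeOutputStream (\_ -> pure ())
      Streams.connect ins out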


h4ck3rm1k3 commented on August 18, 2024

The big question I have for RDF and Haskell is how to create instances of types from RDF data. Is there any easy way to map RDF data via some ontology into Haskell types?


robstewart57 commented on August 18, 2024

@h4ck3rm1k3 sadly not, although that would be very cool.

There is some work in this area for other languages, including F# and Idris.

And also in Scala, where they have support for type providers from RDF data: https://github.com/travisbrown/type-provider-examples
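
In the meantime, the mapping can be written by hand. A minimal sketch using rdf4h's query API (the Person type and the FOAF IRIs are our own; it assumes plain literal or IRI objects):

{-# LANGUAGE OverloadedStrings #-}
-- Writing by hand what a type provider would generate automatically.
import Data.RDF
import Data.Text (Text)

data Person = Person { name :: Maybe Text, homepage :: Maybe Text }
  deriving Show

-- Look up FOAF properties of one subject in an already-parsed graph.
personFrom :: Rdf a => RDF a -> Text -> Person
personFrom g subj =
  Person (prop "http://xmlns.com/foaf/0.1/name")
         (prop "http://xmlns.com/foaf/0.1/homepage")
  where
    prop p = case query g (Just (unode subj)) (Just (unode p)) Nothing of
               (t:_) -> Just (nodeText (objectOf t))
               []    -> Nothing
    nodeText (UNode u)          = u
    nodeText (LNode (PlainL l)) = l
    nodeText _                  = ""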

