Is this method implemented only with data parallelism? Is there any pipeline parallelism, as in model parallelism? (datatrove, CLOSED)

WenhaoZhang-Git commented on July 30, 2024
Is this method implemented only with data parallelism? Is there any pipeline parallelism, as in model parallelism?

from datatrove.

Comments (9)

hynky1999 commented on July 30, 2024

Hey, I am having a bit of a hard time understanding the issue. Could you elaborate?
What method do you have in mind?

WenhaoZhang-Git commented on July 30, 2024

Hey, I am having a bit of a hard time understanding the issue. Could you elaborate? What method do you have in mind?

Thank you for the answer. I mean the execute method, the method of the PipelineExecutor class.

guipenedo commented on July 30, 2024

Hi, to clarify, datatrove is a data processing library, not a distributed training framework. If you want a distributed training framework, I recommend you look into nanotron.
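
To make the distinction concrete, here is a minimal sketch of what data parallelism usually means for a data processing pipeline: the input files are split into shards, and each task runs every pipeline step, in order, on its own shard. The names `shard`, `run_task`, and `run_all` are hypothetical and are not datatrove's API; the `tasks` and `workers` arguments only mirror the general idea of "how many shards" and "how many run at once".

```python
from concurrent.futures import ThreadPoolExecutor


def shard(files, rank, world_size):
    """Return the files assigned to task `rank` out of `world_size` tasks."""
    return files[rank::world_size]


def run_task(files, steps, rank, world_size):
    """Run every pipeline step, in order, on this task's shard only."""
    outputs = []
    for name in shard(files, rank, world_size):
        doc = name
        for step in steps:
            doc = step(doc)
        outputs.append(doc)
    return outputs


def run_all(files, steps, tasks, workers):
    """Launch `tasks` independent tasks, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_task, files, steps, rank, tasks)
            for rank in range(tasks)
        ]
        # Collect results in task order, not completion order.
        return [future.result() for future in futures]


# Toy inputs: file names stand in for documents, string functions for steps.
files = [f"part-{i}.txt" for i in range(5)]
steps = [str.upper, lambda s: s.replace(".TXT", "")]
results = run_all(files, steps, tasks=2, workers=2)
print(results)
```

No task ever needs the output of another task, so there is nothing analogous to the pipeline or model parallelism used in training: the steps are cheap to replicate, and only the data is split.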

WenhaoZhang-Git commented on July 30, 2024

Hi, to clarify, datatrove is a data processing library, not a distributed training framework. If you want a distributed training framework, I recommend you look into nanotron.

Thank you for the reply.

WenhaoZhang-Git commented on July 30, 2024

Hi, to clarify, datatrove is a data processing library, not a distributed training framework. If you want a distributed training framework, I recommend you look into nanotron.

Thank you again for the reply. Is there any data processing library or method for handling structured data, such as the tables in a book?

guipenedo commented on July 30, 2024

What book are you referring to?

WenhaoZhang-Git commented on July 30, 2024

What book are you referring to?

[Attached images: table_in_book, textified_table]

I have some books in PDF format that contain tables. When I convert book.pdf to book.txt, the table in the book ('table_in_book') comes out as a 'textified_table'. I only want the plain text of the book; the unstructured 'textified_table' is noisy data. Maybe I should just convert the PDF to Markdown instead. Do you know how this kind of problem is handled in the current public book datasets used for LLM pretraining? Thanks a lot, I really appreciate your reply.

guipenedo commented on July 30, 2024

We don't currently have any fix for this; our text extraction is intended for HTML documents.
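
Outside of datatrove, one rough workaround is a line-level heuristic that drops lines that look like table rows from the extracted text. The sketch below is only an illustration and not a datatrove feature: the function names, the regular expression, and the thresholds are all assumptions, and it will misclassify some lines (for example, prose lines that are mostly numbers).

```python
import re

# A token counts as "numeric" if it contains only digits and number punctuation.
NUMERIC = re.compile(r"^[\d.,%$()+-]+$")


def looks_like_table_row(line, min_cells=3, numeric_ratio=0.5):
    """Guess whether a line of extracted text came from a table row."""
    # Wide gaps, tabs, or pipes between chunks suggest column boundaries.
    cells = [c for c in re.split(r"\s{2,}|\t|\|", line.strip()) if c]
    if len(cells) >= min_cells:
        return True
    tokens = line.split()
    if len(tokens) < min_cells:
        return False
    # Otherwise, treat lines made up mostly of numbers as table rows.
    numeric = sum(1 for token in tokens if NUMERIC.match(token))
    return numeric / len(tokens) >= numeric_ratio


def drop_table_rows(text):
    """Remove lines that look like table rows, keeping everything else."""
    kept = [line for line in text.splitlines() if not looks_like_table_row(line)]
    return "\n".join(kept)


sample = (
    "The harvest improved over the decade.\n"
    "Year  Wheat  Corn\n"
    "1990 12.5 30.1\n"
    "1991 13.0 31.4\n"
    "Prices rose in 1992 as well."
)
print(drop_table_rows(sample))
```

A filter like this would need tuning on real book text before use, since column spacing and number formats vary between PDF converters.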

WenhaoZhang-Git commented on July 30, 2024

We don't currently have any fix for this; our text extraction is intended for HTML documents.

I appreciate your candor.
