github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!

License: MIT License

Ruby 67.90% Shell 1.22% C 22.57% Lex 1.61% Go 6.41% Dockerfile 0.29%

syntax-highlighting language-grammars language-statistics linguistic

linguist's People

toothrot cpatni ananthrk carlosgaldino sen lethaldose bakkdoor rkh dals earl epictetus bratish kkaefer jettero tautologico sarahhodne softprops zlender cybershadow devinus mrorbita visgean txdywy lkuper bkerley flaviusb leto nibalizer gosu-tools kprevas pombredanne beckje01 kevinsawicki grosser simonoff mgdm robsimmons jwilkins ruelbargo maieul eeue56 gitlabhq kepstin btd open-turing-project sunaku stof sbisbee kennknowles dom96 sj26 mcobrien ahrokib timurb brynary utkarshkukreti alokmenghrajani robnewman leafo sylvestre abevoelker meirkriheli rankida valscion mleinart zhensydow geekontheway ab5tract jaxzin jpcs frostman pmoura derekv jstrachan keikubo abarrachina caerbannog lparenteau pao svenefftinge skoon igrigorik nolta pablof7z doubleotoo pwaller ealliaume dineshkummarc jongalloway burningtyger johan wildmichael ngarneau rlsosborne chiraag andyklock chochos sdressler protohub borgified

linguist's Issues

Nimrod .nim files are no longer recognized after some change to linguist

ugh, anyone with Ruby experience want to figure out why github's linguist does not consider .nim files to be the Nimrod language anymore?

I'm quite sure it fails on my comp because I have the latest Ruby version and it doesn't support it.

I don't know what I need, all I want is to get linguist to run.

I also noticed that linguist fails with an error:

custom_require.rb:36:in `require': cannot load such file -- pygments (LoadError)

But I can't find any "gem install pygments"

Do people really HAVE to use bundler in order to try linguist? I don't like bundler
at all, it messes up things in ways I don't want to. :(

All we have to find out is why linguist no longer recognizes .nim files

    Nimrod:
      type: programming
      color: "#37775b"
      primary_extension: .nim
      extensions:
      - .nimrod

It should work but it does not.

(.nim are default extensions for nimrod files)
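
For context, here is a minimal sketch of how an entry like the one quoted above could be turned into an extension lookup. It is not Linguist's actual loader: `build_extension_index`, `EXTENSION_INDEX` and `language_for` are hypothetical names, and the YAML is only the entry from this issue.

```ruby
require 'yaml'

# The languages.yml entry quoted in the issue, with its indentation restored.
LANGUAGES_YAML = <<~YAML
  Nimrod:
    type: programming
    color: "#37775b"
    primary_extension: .nim
    extensions:
    - .nimrod
YAML

# Hypothetical helper: map every extension (primary and extra) to its language name.
def build_extension_index(yaml_text)
  index = {}
  YAML.safe_load(yaml_text).each do |name, attrs|
    exts = [attrs['primary_extension'], *attrs['extensions']].compact
    exts.each { |ext| index[ext] = name }
  end
  index
end

EXTENSION_INDEX = build_extension_index(LANGUAGES_YAML)

# Hypothetical lookup by file name; returns nil for unknown extensions.
def language_for(filename)
  EXTENSION_INDEX[File.extname(filename)]
end
```

With an index like this, both `language_for("hello.nim")` and `language_for("hello.nimrod")` would resolve to Nimrod, which is the behaviour the report expects.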

Git commit

It would be nice to have a highlighter for git commits, so I could paste the output of git show around "```commit" and it would look nice.

(p.s., I know about the diff highlighter, I'm mainly talking about making the message and the metadata before it look nice)

Ruby library marked as Javascript

I started a Ruby project that's a Rails generator. In GitHub search, it's categorized as a JavaScript project. GitHub support said linguist is behind the project categorization, so I thought I'd file an issue.

Project: https://github.com/joshcrews/flexible_admin

https://github.com/search?type=Repositories&language=&q=flexible+admin

Binary *.n files are Neko (haXe) applications, not Nemerle code

Linguist is flagging any file with a *.n extension as Nemerle, but the extension is used by Neko binary code.

Since this is compiled code, I don't think it should be counted towards any source code total -- but it should not be flagged as Nemerle!

For example, I have a project which includes haXe source code, that compiles to a Neko application for processing Javascript, building JS projects, etc. 68% of the file total is the compiled *.n application, while the rest is the haXe source code.

Shipped Classifier cannot be trained

It's not clear whether this limitation is intentional or if this is a side effect of the YAML loading, but it's not possible to update the Classifier instance with a new language.

I'm trying to teach new languages to the existing classifier at the smallest possible cost, using the following work plan:

Add new languages to the classifier and train it with an "adequate" volume of data
Reduce the number of tokens for the new languages so that the number of classified tokens stays low and performance is preserved (according to the rdoc, #gc should be the method to call, but according to the source it does not do anything. Is this something you plan to implement?)

Do you think this is an acceptable use of your library?

Right now, I'm duck typing Language to feed Classifier#train, which seems to be enough for it to work. Because the Classifier does not depend on Language at all, maybe #train could simply take a String as its parameter (and #classify could return Strings too). This would greatly simplify interop with your lib :-)

Following is a simple test case and a patch that allows the test case to pass.

Cheers,
Pierre.

Move shebang script detection to classifier

The Classifier should be able to pick up on shebang scripts and detect them correctly.
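
As a rough illustration of what shebang detection involves, the sketch below pulls the interpreter name out of the first line of a script. The `INTERPRETERS` table and both method names are assumptions made for this example, not Linguist's classifier code.

```ruby
# Hypothetical mapping from interpreter name to language; not Linguist's table.
INTERPRETERS = {
  'ruby' => 'Ruby', 'python' => 'Python', 'perl' => 'Perl',
  'sh' => 'Shell', 'bash' => 'Shell', 'node' => 'JavaScript'
}.freeze

# Return the interpreter named on the first line, or nil if there is no shebang.
def shebang_interpreter(source)
  first_line = source.lines.first.to_s
  return nil unless first_line.start_with?('#!')

  tokens = first_line[2..-1].strip.split(/\s+/)
  return nil if tokens.empty?

  script = File.basename(tokens.first)
  # "#!/usr/bin/env ruby" names the interpreter in the next token.
  script = tokens[1] if script == 'env' && tokens[1]
  # Drop version suffixes such as "python2.7" or "ruby1.9".
  script.sub(/[\d.]+\z/, '')
end

# Map the interpreter to a language name, or nil when it is unknown.
def language_from_shebang(source)
  INTERPRETERS[shebang_interpreter(source)]
end
```

The sketch deliberately ignores `env` flags and wrapper scripts; a real implementation would need to decide how to handle those.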

Drop mime-types

Try to get our current mime-type extensions pushed upstream to the mime-types lib. Then try to decouple integration from Linguist. Language detection shouldn't be dependent on any sort of mime type.

Ship public gem

There's already a linguist gem, so we'll take github-linguist.

.t far too generic for perl

We need to check these files' contents. See this repo's tests. They're not perl.

Deep content inspection tweaking

I found the place where #! files are analyzed for the right language, but I don't see anywhere a way to extend it. In our case, the simplest way to identify a Racket file would be to look for a #lang line (see example here). A less precise but possibly more broadly useful heuristic is to look for an exec foo line near the top of the file.

Either way, it's not clear whether this is intended to be customizable, and if so, how to do it.
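
The `#lang` heuristic described above could look something like the following sketch. `lang_line` and the five-line window are hypothetical choices for illustration; nothing here reflects an existing Linguist extension point.

```ruby
# Look for a "#lang <name>" line within the first few lines of a file.
# Returns the name that follows "#lang", or nil when no such line is found.
def lang_line(source, max_lines = 5)
  source.each_line.first(max_lines).each do |line|
    match = line.match(/\A#lang\s+(\S+)/)
    return match[1] if match
  end
  nil
end
```

Scanning a few lines rather than only the first one lets the check still work when a `#!` line comes before the `#lang` line.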

foundation detected as PHP

foundation detected as ~75% php.

But php files in foundation use a lot of html and only one to three php instructions.

It should be detected as ~70% html and ~5% php

language detection doesn't seem to update

create a repo like mine: https://github.com/borgified/linguist-test

populate with a couple perl scripts
commit and push to github
observe that linguist detects it as "perl"
delete all the perl scripts, replace with a whole bunch of php scripts
commit and push to github
observe that linguist still detects repo as "perl"

C code detected as Objective C

Hello,

I have a repository in Github, the Refu Library, which is a pure C project. For some reason the majority of the source files are identified as Objective C and so the project itself is tagged as Objective C. Here is the repository:
http://github.com/LefterisJP/Refu/

I have no knowledge of Ruby so I can't understand how the Linguist project works to find the problem. Any assistance with this matter will be appreciated.

Prolog files misclassified as Perl files

Prolog files are once again misclassified as Perl files. The disambiguation code seems to have been removed. The current spec for Prolog defines "primary_extension" as ".prolog", which nobody in the Prolog programming community uses or has ever used. The default extension for Prolog is ".pl" (and was long before Perl existed). How can we get the disambiguation functionality back?
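
To make the idea of content-based disambiguation concrete, here is a small sketch that scores a `.pl` file for Prolog-like and Perl-like markers. The patterns and the `guess_pl_language` name are assumptions for illustration and are not the disambiguation code that was removed.

```ruby
# Guess whether the contents of a .pl file look like Prolog or Perl.
# These patterns are illustrative assumptions, not Linguist's rules.
def guess_pl_language(source)
  # Prolog rules and directives use the ":-" operator.
  prolog_score = source.scan(/:-/).size
  # Common Perl markers: "use Module", "my $var" declarations, "sub name".
  perl_score = source.scan(/^\s*use\s+\w+|\bmy\s+[$@%]\w+|^\s*sub\s+\w+/).size

  return nil if prolog_score == perl_score
  prolog_score > perl_score ? 'Prolog' : 'Perl'
end
```

A tie returns `nil` so that the caller can fall back to another signal instead of guessing.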

Add Lasso programming language

We've submitted our pull request to Pygments to add Lasso as a programming language, and it's been accepted! Lasso now has a lexer:

https://bitbucket.org/birkenfeld/pygments-main/pull-request/95/new-lexer-for-the-lasso-language

What do I need to do next to get Lasso added into Linguist? I need to know which files I should edit for my pull request. Thank you!

Blog Post

Draft Blog Post.

README

Write up a more complete README.

Add .elf extension

I think that all *.elf files should be marked as binary automatically (without reading the file)

add .pl as Prolog extension

At the moment it is recognized as Perl.

edit: Spelling. Both English and Perl are not my native language ;-)

Coq / Verilog Misdetections

Linguist is getting Verilog and Coq confused (see Verilog projects
included in https://github.com/languages/Coq and Coq projects included
in https://github.com/languages/Verilog). Both use .v files. I've gone
through the commit history and the first place that I can get it to
fail is at 4484011, however it may be
failing one commit before that at
c114d71. I can't tell for the latter
commit as that fails the Matlab / obj-c case first. Everything passes
if you go one commit earlier.

I'm using some of my Verilog files to test it, specifically, the files
sitting in https://github.com/seldridge/verilog, and linguist just
isn't having it. Linguist continues to pass for the one test file
(sha-256-functions.v) currently in use. I'm no Ruby guy, so I haven't
attempted to look into this in any significant depth beyond the regex
in blob_helper.rb. This doesn't seem to be the issue as it's picking
up the important matches in my testcases, namely comment structure and
the "module" keyword.

Lazy load repository blobs

https://github.com/github/linguist/blob/master/lib/linguist/repository.rb#L27

Repository requires all the repo blobs be allocated at once. We need to defer this for larger repos.
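
One possible shape for deferring the work is sketched below: blobs are produced through a lazy enumerator and only read their contents on first access. `LazyBlob`, `lazy_blobs` and the `loader` callable are hypothetical stand-ins, not the classes in repository.rb.

```ruby
# Hypothetical stand-in for a blob whose contents are only read on demand.
class LazyBlob
  attr_reader :path

  def initialize(path, loader)
    @path = path
    @loader = loader
  end

  # Contents are loaded on first access and memoized afterwards.
  def data
    @data ||= @loader.call(@path)
  end
end

# Produce blobs one at a time instead of building the whole array up front.
def lazy_blobs(paths, loader)
  paths.lazy.map { |path| LazyBlob.new(path, loader) }
end
```

A caller that only needs the first few blobs, or that stops early, never pays for loading the rest.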

syntax highlighting for Coq .v files

Pygments now supports Coq .v files. See https://bitbucket.org/birkenfeld/pygments-main/issue/734/support-for-coq

Would it be possible to get this into Github?

Thanks.

Upgrade Pygments to 1.5

http://pygments.org/download/ -- Release 1.5 "Zeitdilatation" is out!

Upgrade to Pygments 1.5

Depends on pygments/pygments.rb#15, unless linguist is still using github/albino in production.
#129 depends on this issue.

Allow specifying an ignore file for language statistics

Some repositories (like SignalR) have samples that include common JavaScript libraries like jQuery, and GitHub ends up classifying the project as JavaScript instead of C# (in this particular case). Nothing is wrong with this at a high level, since jQuery is JavaScript, but project maintainers who want more control over statistics need a way to opt out of this behavior.

I see 2 options:

Short term hack: Exclude commonly used js files. This will handle some scenarios, but you'll have to exclude multiple versions of each library (unless you had wildcard support).
Longer term solution: Allow a repository to have a .lignore or equivalent (I suck at naming) that uses glob syntax to exclude files from being processed for language statistics.
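
The glob-based option could be sketched roughly as follows, using Ruby's built-in `File.fnmatch?`. The ignore-file contents, the comment syntax and both method names are assumptions for illustration; no such file format exists in the issue beyond the idea itself.

```ruby
# Hypothetical ignore-file contents, one glob per line; "#" starts a comment.
IGNORE_TEXT = <<~IGNORE
  # vendored JavaScript in the samples
  samples/**/*.js
  vendor/*
IGNORE

# Turn the file text into a list of glob patterns, skipping blanks and comments.
def parse_ignore_patterns(text)
  text.each_line.map(&:strip).reject { |line| line.empty? || line.start_with?('#') }
end

# True when any pattern matches the repository-relative path.
# FNM_PATHNAME keeps "*" from crossing "/" and enables "**/" for nested directories.
def ignored?(path, patterns)
  patterns.any? { |pattern| File.fnmatch?(pattern, path, File::FNM_PATHNAME) }
end
```

Files for which `ignored?` returns true would simply be left out before the language statistics are computed.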

JS suppression false positives

https://github.com/mishoo/UglifyJS/pull/172/files

Some JS files with just a couple long lines are getting marked as minified.
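
A heuristic based on average line length would produce exactly this kind of false positive. The sketch below assumes such a heuristic with a threshold of 100 characters; both the approach and the number are assumptions for illustration and may not match Linguist's actual check.

```ruby
# Average line length above which a file is treated as minified.
# The threshold is an assumption for illustration.
MINIFIED_AVG_LINE_LENGTH = 100

# Mean number of characters per line, ignoring line terminators.
def average_line_length(source)
  lines = source.lines
  return 0.0 if lines.empty?

  lines.sum { |line| line.chomp.length }.to_f / lines.size
end

# True when the average line length exceeds the threshold.
def looks_minified?(source, threshold = MINIFIED_AVG_LINE_LENGTH)
  average_line_length(source) > threshold
end
```

A hand-written file consisting of two very long lines, such as a pair of long string literals, is flagged by this check even though it was never minified.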

Invalid gemspec (missing authors)

I'm receiving the following error when I try to install linguist via bundle:

linguist at /usr/lib64/ruby/gems/1.9.1/bundler/gems/linguist-d8903afc12b1 did not have a valid gemspec.
This prevents bundler from installing bins or native extensions, but that may not affect its functionality.
The validation message from Rubygems was:
authors may not be empty

If I clone linguist locally and add an authors line to the .gemspec file, it works fine.

I'm on ruby 1.9.1

do not process files in .linguist-ignore

It would be nice if linguist could read a .linguist-ignore file (or any other name) at the root of the project and skip the files it lists. These files (which can be either auto-generated or imported) are usually not in the same language as the project itself and may eventually become quite big, making the statistics completely wrong.

If you think this feature is useful, I'm happy to propose a patch.

Linguist::Blob does not exist

In the Readme, the example is:

Linguist::Blob.new("linguist.rb")

But that class does not exist.

Erlang escript bundle is treated as JavaScript

An escript bundle is a compressed Erlang script. Linguist incorrectly detects it as JavaScript:

$ file ./rebar
./rebar: a escript script text executable
$ linguist ./rebar
./rebar: 0 lines (0 sloc)
  type:      Binary
  mime type: text/plain
  language:  JavaScript
$

...so many Erlang projects that ship with the rebar build tool script may be detected as JavaScript projects although they are pure Erlang!

MaxMSP files still not recognized

Hello,

A few weeks ago (remember #208?) we added MaxMSP samples to the JSON folder, but now the files are detected as JavaScript. A MaxMSP patcher is a graph of objects, dynamically loaded at runtime; it is saved as JSON but has nothing to do with JavaScript.

IMHO the only solution is to add extensions to "languages.yml": ".mxt" is the old format (Max 4); since Max 5 the extensions are ".maxpat" and ".maxhelp".

Modelica everywhere!

Since the new language breakdown bar was introduced, I keep seeing the Modelica language in most of my repositories, even if I didn't even know such a language existed.

Example: http://i.imgur.com/akW7P.png

https://github.com/scribu/wp-pagenavi

Pull Request Failure

Travisbot failed this request: #216

To be honest, I'm fairly new to GitHub, and while it looked like contributing to linguist would prove straightforward, something has clearly gone awry. Any idea what?

Description of test-suite running

The last part of the README file talks about using some bundle thing, which I guess is some ruby utility. Maybe add some more exact description for the uninitiated masses?

pygments updated with improved version of autohotkey lexer

Could you please update your pygments? There is an updated version of the autohotkey lexer in it that is much better.
https://bitbucket.org/birkenfeld/pygments-main/changeset/1c549d7cb1db
Thanks

Support highlighting Twig templates

The syntax of Twig templates is equivalent to that of Jinja (but for PHP projects instead of Python ones), so it could probably be done by reusing the Jinja lexer.
Twig is the default templating engine for Symfony2 (which uses Github) so it would help a lot to have proper highlighting for .twig files.

Matlab extension .m

I see that you treat .matlab as Matlab's extension; however, .m is more popular (and is one of the standard extensions).

I know this conflicts with Objective-C's .m files, but it would be interesting to have an option that inspects the syntax to guess the language in ambiguous cases.

This is confusing to me, as I have both Objective-C and Matlab repositories.

Ruby 1.9.2: file content encoding causes file blobs to fail

The creation of file blobs can fail because the file contents might be in a different encoding. This issue should only be present in Ruby 1.9+, as Ruby 1.8 was not encoding-aware.

A temporary solution is to do this in file_blob.rb:

    # Public: Read file contents.
    #
    # Returns a String.
    def data
      File.read(@path).encoding.to_s
    end

Only thing is the test cases fail now.

Note: If this project was intended to work only with Ruby 1.8, then disregard this
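
Note that `File.read(@path).encoding.to_s` returns the name of the encoding (for example "UTF-8") rather than the file contents, which may be why the test cases fail. An alternative is sketched below: read the raw bytes and then label and repair the encoding explicitly. `read_as_utf8` is a hypothetical helper, and `String#scrub` requires Ruby 2.1 or later, so this would not run on the 1.9.2 mentioned in the title.

```ruby
# Read a file as raw bytes, then label and repair the encoding explicitly,
# so the result is always a valid UTF-8 String.
def read_as_utf8(path)
  data = File.binread(path)              # binary (ASCII-8BIT) string, no transcoding
  data.force_encoding(Encoding::UTF_8)   # relabel the bytes without changing them
  # Replace byte sequences that are not valid UTF-8 instead of raising later.
  data.valid_encoding? ? data : data.scrub('?')
end
```

Treating every file as UTF-8 is itself an assumption; files in other encodings would need detection or transcoding rather than scrubbing.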

Crucial invalid detection on Play!

Play framework is a Java framework and I believe it has an sloc of ≥90% Java. However, it shows that 76% of it is Python. What could possibly be wrong?

python wsgi

There should be support for .wsgi files. They contain Python code, so it's just another Python file extension.
links: http://en.wikipedia.org/wiki/Wsgi , http://www.python.org/dev/peps/pep-3333/

Binary detection issues on extensionless files

Check out the md, txt, and zip files in this repo. They all contain the same content, but the zip file is presented as a binary would be. That's not right!

Add .psd1 extension

Add .psd1 (module manifest) into the PowerShell syntax group

Scores sent back by the lib are curious

Hello,

The documentation states that it should return floats. On my installation, it returns negative numbers:

[[#<Linguist::Language name=PHP>, -66.98989614319586],
 [#<Linguist::Language name=JavaScript>, -68.77510897386178],
 [#<Linguist::Language name=Ruby>, -70.7837674453772],
 [#<Linguist::Language name=Perl>, -71.16156437444059],
 [#<Linguist::Language name=Gosu>, -72.90117504252562],
 [#<Linguist::Language name=Python>, -73.0532406574862],
 [#<Linguist::Language name=Objective-C>, -74.10993364147689],
 [#<Linguist::Language name=TeX>, -77.81775680913668],
 [#<Linguist::Language name=Java>, -78.66295010514327],
 [#<Linguist::Language name=Kotlin>, -79.19112391377584],
 [#<Linguist::Language name=Scala>, -79.596874273976],
 [#<Linguist::Language name=C++>, -80.16597822216151],
 [#<Linguist::Language name=CoffeeScript>, -83.44077180874064],
 [#<Linguist::Language name=Apex>, -83.80881093343098],
 [#<Linguist::Language name=C>, -85.47097078986161],
 [#<Linguist::Language name=AppleScript>, -85.68956917025051],
 [#<Linguist::Language name=SCSS>, -86.60214237229394],
 [#<Linguist::Language name=Groovy>, -86.89541966825266],
 [#<Linguist::Language name=Shell>, -87.43588353355483],
 [#<Linguist::Language name=Dart>, -87.459050333217],
 [#<Linguist::Language name=Coq>, -88.6740351917743],
 [#<Linguist::Language name=Rust>, -93.09294395196528],
 [#<Linguist::Language name=Nemerle>, -93.21419319559817],
 [#<Linguist::Language name=PowerShell>, -93.51902834727619],
 [#<Linguist::Language name=Arduino>, -93.5392310545937],
 [#<Linguist::Language name=Opa>, -93.78609113252523],
 [#<Linguist::Language name=XQuery>, -93.83645881136175],
 [#<Linguist::Language name=R>, -94.21217552783614],
 [#<Linguist::Language name=Delphi>, -94.35016127081002],
 [#<Linguist::Language name=SuperCollider>, -94.40855958019455],
 [#<Linguist::Language name=Verilog>, -94.8229388269385],
 [#<Linguist::Language name=OpenCL>, -96.50244013644215],
 [#<Linguist::Language name=Groovy Server Pages>, -96.56948552051941],
 [#<Linguist::Language name=Racket>, -97.8652823987905],
 [#<Linguist::Language name=OCaml>, -99.6352432871025],
 [#<Linguist::Language name=Matlab>, -101.76930665936734],
 [#<Linguist::Language name=XML>, -101.8170795450655],
 [#<Linguist::Language name=Haml>, -102.25666430330622],
 [#<Linguist::Language name=Scilab>, -102.64814316943966],
 [#<Linguist::Language name=INI>, -102.66212941141441],
 [#<Linguist::Language name=Logtalk>, -103.5329577692118],
 [#<Linguist::Language name=GAS>, -103.96895960118005],
 [#<Linguist::Language name=Sass>, -104.20257445236155],
 [#<Linguist::Language name=Turing>, -104.82161366076778],
 [#<Linguist::Language name=OpenEdge ABL>, -105.1428606897919],
 [#<Linguist::Language name=VimL>, -112.11353183520714],
 [#<Linguist::Language name=Standard ML>, -112.11353183520714],
 [#<Linguist::Language name=Nu>, -112.80667901576709],
 [#<Linguist::Language name=Parrot Assembly>, -112.80667901576709],
 [#<Linguist::Language name=Scheme>, -112.80667901576709],
 [#<Linguist::Language name=Julia>, -112.80667901576709],
 [#<Linguist::Language name=Ioke>, -112.80667901576709],
 [#<Linguist::Language name=Rebol>, -112.80667901576709],
 [#<Linguist::Language name=Parrot Internal Representation>, -112.80667901576709],
 [#<Linguist::Language name=Emacs Lisp>, -112.80667901576709],
 [#<Linguist::Language name=Tea>, -112.80667901576709],
 [#<Linguist::Language name=Nimrod>, -112.80667901576709],
 [#<Linguist::Language name=VHDL>, -112.80667901576709],
 [#<Linguist::Language name=Diff>, -112.80667901576709],
 [#<Linguist::Language name=Markdown>, -112.80667901576709],
 [#<Linguist::Language name=Visual Basic>, -112.80667901576709],
 [#<Linguist::Language name=Prolog>, -112.80667901576709],
 [#<Linguist::Language name=AutoHotkey>, -112.80667901576709],
 [#<Linguist::Language name=XSLT>, -112.80667901576709],
 [#<Linguist::Language name=YAML>, -112.80667901576709]]

Still the results are in the correct order...

ruby --version
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin11.4.0]

The same behavior on x86_64 linux.
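
Negative values are still floats. If the scores are sums of log-probabilities, which is an assumption here but a common choice for naive Bayes classifiers, every score is at most zero and the one closest to zero wins. That would be consistent with the results being in the correct order. A toy sketch with made-up probabilities:

```ruby
# Toy naive Bayes scoring: sum the logs of per-token probabilities.
# The probabilities below are made up for illustration.
TOKEN_PROBABILITIES = {
  'PHP'  => { '<?php' => 0.05,   'echo' => 0.02,  'def' => 0.0001 },
  'Ruby' => { '<?php' => 0.0001, 'echo' => 0.001, 'def' => 0.04 }
}.freeze

# Log-probabilities are <= 0, so the total is negative for any real input.
def log_score(language, tokens)
  probs = TOKEN_PROBABILITIES.fetch(language)
  tokens.sum { |token| Math.log(probs.fetch(token)) }
end

# Rank languages from most to least likely (score closest to zero first).
def rank(tokens)
  TOKEN_PROBABILITIES.keys
    .map { |language| [language, log_score(language, tokens)] }
    .sort_by { |_, score| -score }
end
```

Working in log space avoids the floating-point underflow that multiplying many small probabilities would cause, at the cost of scores that look odd at first glance.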

Classifier#to_yaml fails with shipped Classifier

Hi,

I'm trying to train the Classifier and hence to serialize it to disk. I run into an issue while trying to serialize the default Classifier:


irb(main):006:0>  Linguist::Classifier.instance.to_yaml($STDOUT)
ArgumentError: comparison of Array with Array failed
        from /home/oct/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/github-linguist-2.0.1/lib/linguist/classifier.rb:172:in `sort'
        from /home/oct/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/github-linguist-2.0.1/lib/linguist/classifier.rb:172:in `block in to_yaml'
        from /home/oct/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/github-linguist-2.0.1/lib/linguist/classifier.rb:170:in `each'
        from /home/oct/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/github-linguist-2.0.1/lib/linguist/classifier.rb:170:in `to_yaml'
        from (irb):6
        from /home/oct/.rbenv/versions/1.9.3-p194/bin/irb:12:in `<main>'
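
The same error message can be reproduced by sorting pairs whose first elements cannot be compared with each other, for example a String and a Symbol. Whether that is what happens inside the shipped Classifier is not confirmed by the trace above; the sketch only shows one way the message arises and one way to sidestep it.

```ruby
# Sorting pairs fails when their first elements cannot be compared with each other.
mixed = [['Ruby', 1], [:ruby, 2]]

begin
  mixed.sort
  sort_error = nil
rescue ArgumentError => e
  sort_error = e.message # "comparison of Array with Array failed"
end

# Workaround sketch: sort on a key that is always comparable.
sorted = mixed.sort_by { |key, _| key.to_s }
```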

Nimrods, the whole lotta them.

Shall we add highlighting for it? https://github.assistly.com/agent/case/2839

Binary files detected as Perl

On Linux, compile this simple assembly program (using "as exit.s -o exit.o;ld exit.o -o exit;rm exit.o"):

    .section .data
    .section .text
    .globl _start
    _start:
    movq $111, %rdi
    movq $60, %rax
    syscall

Then run "bundle exec linguist folder" and you will see this:

    88% Perl
    12% Assembly
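
The linked executable produced by `ld` is an ELF file, which starts with the four bytes 0x7F 'E' 'L' 'F'. A content check along the lines of the sketch below would let such files be treated as binary before any language guessing happens. `elf?`, `looks_binary?` and the 8000-byte sample size are hypothetical choices, not Linguist's implementation.

```ruby
# The four-byte magic number at the start of every ELF file.
ELF_MAGIC = "\x7FELF".b

# True when the data starts with the ELF magic number.
def elf?(data)
  data.b.start_with?(ELF_MAGIC)
end

# Crude binary sniff: a NUL byte in the first chunk is rare in source text.
def looks_binary?(data, sample_size = 8000)
  data.b[0, sample_size].include?("\x00".b)
end
```

Either check only needs the first bytes of a file, so it stays cheap even for large binaries.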

startinline
If given and True the lexer starts highlighting with php code (i.e.: no starting <?php required).
The default is False.

Ideally, this sample snippet of PHP code from the Symfony2 project would be highlighted with ```php without having to include <?php:

/**
 * Client simulates a browser and makes requests to a Kernel object.
 *
 * @author Fabien Potencier <[email protected]>
 *
 * @api
 */
class Client extends BaseClient
{
    protected $kernel;

    /**
     * Constructor.
     *
     * @param HttpKernelInterface $kernel    An HttpKernel instance
     * @param array               $server    The server parameters (equivalent of $_SERVER)
     * @param History             $history   A History instance to store the browser history
     * @param CookieJar           $cookieJar A CookieJar instance to store the cookies
     */
    public function __construct(HttpKernelInterface $kernel, array $server = array(), History $history = null, CookieJar $cookieJar = null)
    {
        $this->kernel = $kernel;

        parent::__construct($server, $history, $cookieJar);

        $this->followRedirects = false;
    }
}
