zq99 / pgn2data

A library that converts a chess pgn file into a tabulated CSV data set.

License: GNU General Public License v3.0

Python 100.00%
chess pgn fen csv dataset data library chess-analysis

pgn2data's Introduction

pgn2data

This library converts chess pgn files into CSV tabulated data sets.

A pgn file can contain one or multiple chess games. The library parses the pgn file and creates two csv files:

  • Games file: contains high-level information about each game (e.g. date, site, event, score, players)

  • Moves file: contains the moves for each game (e.g. notation, squares, FEN position, whether the move gives check)

The two files can be mapped together using a GUID which the process inserts into both files.
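As a minimal sketch of that mapping, the snippet below joins rows from the two files on the shared ID using only the standard library. It assumes the ID column is named game_id in both files; the in-memory CSV text and the helper name join_moves_to_games are made up for illustration and are not part of the library.

```python
import csv
import io

# Hypothetical miniature versions of the two output files.
GAMES_CSV = "game_id,white,black,result\nabc-1,Tal,Bronstein,1-0\n"
MOVES_CSV = "game_id,move_no,notation\nabc-1,1,e4\nabc-1,2,e5\n"


def join_moves_to_games(games_file, moves_file):
    """Attach each move row to the header data of its game via game_id."""
    games = {row["game_id"]: row for row in csv.DictReader(games_file)}
    joined = []
    for move in csv.DictReader(moves_file):
        game = games.get(move["game_id"])
        if game is not None:  # skip moves whose game is missing
            joined.append({**game, **move})
    return joined


rows = join_moves_to_games(io.StringIO(GAMES_CSV), io.StringIO(MOVES_CSV))
print(rows[0]["white"], rows[0]["notation"])
```

With real output, the two StringIO objects would be replaced by the opened games and moves files.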

Installation

The library requires Python 3.7 or later.

To install, run the following command in a terminal:

pip install pgn2data

Implementation

Here is a basic example of how to convert a PGN file:

from converter.pgn_data import PGNData

pgn_data = PGNData("tal_bronstein_1982.pgn")
pgn_data.export()

The following is an example of grouping multiple files into the same output file ("output.csv").

pgn_data = PGNData(["file1.pgn","file2.pgn"],"output")
pgn_data.export()

The export function returns an object that lets you quickly check the size and location of the files created:

pgn_data = PGNData("tal_bronstein_1982.pgn")
result = pgn_data.export()
result.print_summary()

To check that the files were created before doing further processing, test the is_complete attribute of the result:

pgn_data = PGNData("tal_bronstein_1982.pgn")
result = pgn_data.export()
if result.is_complete:
    print("Files created!")
else:
    print("Files not created!")

Pandas

The result object also provides methods to import the created files into pandas dataframes:

pgn_data = PGNData("tal_bronstein_1982.pgn")
result = pgn_data.export()
if result.is_complete:
    
    # read the games file
    games_df = result.get_games_df()
    print(games_df.head())
    
    # read the moves file
    moves_df = result.get_moves_df()
    print(moves_df.head())
    
    # read both files joined together
    combined_df = result.get_combined_df()
    print(combined_df.head())

Optimization

To output the game information only, you can do the following:

from converter.pgn_data import PGNData

pgn_data = PGNData("tal_bronstein_1982.pgn")
pgn_data.export(moves_required=False)

Examples

The 'samples' folder in this repository has some examples of the output from the library.

There is also a Kaggle project that used the library to convert all of Magnus Carlsen's online Bullet games into CSV format.

Columns

This is a full list of the columns in each output file:

Games File

Field                  Description
game_id                ID of game generated by process
game_order             Order of game in PGN file
event                  Event
site                   Site
date_played            Date played
round                  Round
white                  White player
black                  Black player
result                 Result
white_elo              White player rating
white_rating_diff      White rating difference from Black
black_elo              Black player rating
black_rating_diff      Black rating difference from White
white_title            White player title
black_title            Black player title
winner                 Winning player
winner_elo             Winning player rating
loser                  Losing player
loser_elo              Losing player rating
winner_loser_elo_diff  Rating difference between winner and loser
eco                    Opening
termination            How game ended
time_control           Time control
utc_date               Date played
utc_time               Time played
variant                Game type
ply_count              Ply count
date_created           Extract date
file_name              PGN source file

Moves File

Field                           Description
game_id                         ID of game that maps to games file
move_no                         Order of moves
move_no_pair                    Chess move number
player                          Player name
notation                        Standard notation of move
move                            Before and after piece location
from_square                     Piece location before
to_square                       Piece location after
piece                           Initial of piece name
color                           Piece color
fen                             Fen position
is_check                        Is check on board
is_check_mate                   Is checkmate on board
is_fifty_moves                  Is 50 move complete
is_fivefold_repetition          Is 5 fold repetition on board
is_game_over                    Is game over
is_insufficient_material        Is game over from lack of mating material
white_count                     Count of white pieces
black_count                     Count of black pieces
white_{piece}_count             Count of white specified piece
black_{piece}_count             Count of black specified piece
captured_score_for_white        Total of black pieces captured
captured_score_for_black        Total of white pieces captured
fen_row{number}_{colour}_count  Number of pieces for the specified colour on this row of the board
fen_row{number}_{colour}_value  Total value of pieces for the specified colour on this row of the board
move_sequence                   Sequence of moves up to current position
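
To make the per-row columns concrete, here is a rough sketch of how such counts could be derived from the piece-placement part of a FEN string. The function name and the row numbering (row 1 being the first row written in the FEN, i.e. rank 8) are assumptions for illustration; the library's own numbering may differ.

```python
def row_piece_counts(fen):
    """Return {(row, colour): count} for the placement field of a FEN.

    Assumption: row 1 is the first row written in the FEN string
    (rank 8 on the board); the library's own numbering may differ.
    """
    placement = fen.split(" ")[0]  # ignore side-to-move and other fields
    counts = {}
    for row_number, row in enumerate(placement.split("/"), start=1):
        # Upper-case letters are white pieces, lower-case are black;
        # digits (empty squares) are neither.
        counts[(row_number, "white")] = sum(ch.isupper() for ch in row)
        counts[(row_number, "black")] = sum(ch.islower() for ch in row)
    return counts


after_e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR"
counts = row_piece_counts(after_e4)
print(counts[(5, "white")], counts[(7, "white")], counts[(1, "black")])
```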

Contributions

Contributions are welcome. All modifications should come with appropriate tests demonstrating that an issue has been resolved or that new functionality works as intended. Pull requests without tests will not be merged.

The library can be tested by doing the following:

from testing.tests import run_all_tests
run_all_tests()

New tests should be added to the above method.

Acknowledgements

This project makes use of the python-chess library.

pgn2data's Issues

Clock data stripped

Using this converter on lichess PGN files strips all of the clock data from the PGN lines
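
One possible workaround, sketched under assumptions: lichess exports record clock times as comments of the form { [%clk 0:03:00] } after each move, so they can be pulled out of the raw movetext separately from the converter. The function name and the sample movetext below are made up for illustration.

```python
import re

# Matches PGN clock annotations such as [%clk 0:03:00] or [%clk 0:02:58.5].
CLOCK_RE = re.compile(r"\[%clk\s+(\d+):(\d{2}):(\d{2}(?:\.\d+)?)\]")


def clock_seconds(movetext):
    """Return the remaining time in seconds for each [%clk ...] annotation."""
    return [
        int(h) * 3600 + int(m) * 60 + float(s)
        for h, m, s in CLOCK_RE.findall(movetext)
    ]


sample = (
    "1. e4 { [%clk 0:03:00] } 1... e5 { [%clk 0:02:58] } "
    "2. Nf3 { [%clk 0:02:57] }"
)
clocks = clock_seconds(sample)
print(clocks)
```

The resulting list is in move order, so it could be lined up with the rows of the moves file for the same game.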

Illegal moves exit execution

When a game within a pgn has an illegal move or a character the codec can't encode, it breaks the whole execution. It seems python-chess raised the error. I handled this issue while working with chess.pgn using:

try:
    game = chess.pgn.read_game(pgn_file)
except:
    continue

Btw, great work, you saved me a ton of effort :)

Only export game_info

I am reading in quite large files and would only require the game_info (not the moves). It would be great if there was an option in the export method to only read the game_info and not the moves.

This would save a lot of disk space for larger files and possibly also speed up the routine.

work with this pgn structure?

I thought I'd give this a try with a pgn file I have, but I keep getting this error. Any help / pointers appreciated.

Traceback (most recent call last):
  File "chess_export.py", line 6, in <module>
    result = pgn_data.export()
  File "/Users/james/.pyenv/versions/3.7.8/lib/python3.7/site-packages/converter/pgn_data.py", line 59, in export
    result = self.__process_pgn_list(pgn_list, file)
  File "/Users/james/.pyenv/versions/3.7.8/lib/python3.7/site-packages/converter/pgn_data.py", line 91, in __process_pgn_list
    process.parse_file(add_headers)
  File "/Users/james/.pyenv/versions/3.7.8/lib/python3.7/site-packages/converter/process.py", line 96, in parse_file
    game_writer.writerow(self.__get_game_row_data(game, game_id, order, self.pgn_file))
  File "/Users/james/.pyenv/versions/3.7.8/lib/python3.7/site-packages/converter/process.py", line 164, in __get_game_row_data
    game.headers["BlackElo"] if winner == game.headers["Black"] else "")
  File "/Users/james/.pyenv/versions/3.7.8/lib/python3.7/site-packages/chess/pgn.py", line 947, in __getitem__
    return self._others[key]
KeyError: 'BlackElo'

The pgn input file is this

[Date "2022.6.16"]
[Result "0-1"]
[White "Dr. Wolf - Intermediate"]
[Black "James"]

1. c4 1. d5 2. d4 2. e6 3. Nf3 3. c5 4. e3 4. Nc6 5. Nc3 5. Bd6 6. dxc5 6. Bxc5 7. g3 7. dxc4 8. Qa4 8. f6 9. Qb5 9. b6 10. Qxc4 10. Nge7 11. Ne4 11. Qd5 12. Qxd5 12. Nxd5 13. a3 13. f5 14. Bb5 14. Bd7 15. Bxc6 15. Bxc6 16. Ne5 16. Nxe3 17. Nxc6 17. Nc2+ 18. Ke2 18. Nxa1 19. Ng5 19. Nb3 20. Rd1 20. O-O 21. Be3 21. Rfe8 22. Bxc5 22. bxc5 23. Rd7 23. Nc1+ 24. Kd2 24. Nb3+ 25. Kc2 25. Nd4+ 26. Nxd4 26. cxd4 27. Kd3 27. e5 28. b3 28. Rab8 29. Kc4 29. Rec8+ 30. Kd3 30. Rxb3+ 31. Kd2 31. a5 32. Ne6 32. h6 33. Rxg7+ 33. Kh8 34. Rg6 34. Kh7 35. Rg7+ 35. Kh8 36. Rc7 36. Rc3 37. Rxc8+ 37. Rxc8 38. f4 38. e4 39. g4 39. Re8 40. Nc7 40. Re7 41. Nd5 41. e3+ 42. Kd3 42. Re4 43. Nxe3 43. dxe3 44. Ke2 44. fxg4 45. a4 45. Rxa4 46. Kxe3 46. h5 47. f5 47. Ra3+ 48. Kd2 48. Rf3 49. Ke2 49. Rxf5 50. Kd2 50. Rf2+ 51. Kd3 51. Rxh2 52. Ke3 52. a4 53. Kd4 53. g3 54. Kc4 54. a3 55. Kb3 55. a2 56. Kc4 56. a1=Q 57. Kd5 57. Qa2+ 58. Ke4 58. Qc2+ 59. Kf3 59. Qd3+ 60. Kf4 60. Rf2+ 61. Ke5 61. Rf5+ 62. Ke6 62. Qd5+ 63. Ke7 63. Rf7+ 64. Ke8 64. Qd7 0-1
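
The traceback points at a direct lookup of the BlackElo tag, which this PGN does not contain. A common way to tolerate sparse headers is to look tags up with a default, as in the sketch below. A plain dict stands in for the game headers, and header_or_default is a hypothetical helper, not the library's actual code.

```python
def header_or_default(headers, key, default=""):
    """Return a PGN header value, or a default when the tag is absent."""
    value = headers.get(key, default)
    # PGN uses "?" for unknown values; treat it like a missing tag.
    return default if value in ("", "?") else value


# Stand-in for the tag section of the PGN above (no Elo tags present).
headers = {
    "Date": "2022.6.16",
    "Result": "0-1",
    "White": "Dr. Wolf - Intermediate",
    "Black": "James",
}

black_elo = header_or_default(headers, "BlackElo")
black_name = header_or_default(headers, "Black")
print(repr(black_name), repr(black_elo))
```

Until something like this is handled in the converter, adding the missing tags to the PGN file by hand may also avoid the error.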

Memory leak

I am interested in using your library to parse PGN files of chess games.
I am using the lichess database to get my data.
Unfortunately, when I try to parse it using pgn2data, my RAM usage constantly increases!
I think the reading/writing process may not be optimized; perhaps the data read from the file is kept in memory.
As a result, I can't process the full content of a 5GB pgn file.
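
A possible workaround, not verified against the library: stream the large file one game at a time and write the games out in smaller batches, then convert each batch separately. The generator below is a sketch that assumes a line starting with "[" after movetext marks the start of the next game, which holds for typical lichess exports but not for every PGN file. The name iter_game_texts is made up for illustration.

```python
import io


def iter_game_texts(lines):
    """Yield the text of one game at a time from an iterable of PGN lines.

    Assumption: a line starting with "[" that follows movetext begins the
    tag section of the next game.
    """
    buffer, seen_moves = [], False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("[") and seen_moves:
            yield "".join(buffer)
            buffer, seen_moves = [], False
        if stripped and not stripped.startswith("["):
            seen_moves = True
        buffer.append(line)
    if any(part.strip() for part in buffer):
        yield "".join(buffer)


sample = io.StringIO(
    '[Event "A"]\n[Result "1-0"]\n\n1. e4 e5 1-0\n\n'
    '[Event "B"]\n[Result "0-1"]\n\n1. d4 d5 0-1\n'
)
games = list(iter_game_texts(sample))
print(len(games))
```

With a real file, an open file handle would be passed instead of the StringIO object, so only one game is held in memory at a time.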

Unicode Error

Has anyone seen this error before? The file has hundreds of thousands of games, but I'm getting a Unicode error when running pgn2data. What I've tried so far is listed below. Before I manually look at the pgn file with Scid, are there any other ideas about what could be causing this?

iconv: a Linux tool to change the encoding, but it fails.

pgn-extract: a pgn command-line tool to clean pgn files. Still getting the Unicode error.

I thought about writing a Python script to change the encoding, but the solutions I found were for the read_csv function in pandas, so I thought that would be incorrect.

OS is Debian 11 Bullseye

Example of error. Had multiple position #'s.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 5478: invalid start byte

Example of iconv error. Sequence position has varied.

iconv: illegal input sequence at position 2635110

Example of the code I was using.


from converter.pgn_data import PGNData

pgn_data = PGNData("multiplegames.pgn")
pgn_data.export()
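
One thing worth trying, offered as a guess: byte 0x94 is a closing curly quote in Windows-1252, so the file may have been saved in that encoding rather than UTF-8. The sketch below rewrites a file as UTF-8 before conversion, replacing any bytes that cannot be decoded. The function name and the default source encoding are assumptions; the small demo file is made up.

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, source_encoding="cp1252"):
    """Rewrite a text file as UTF-8, decoding it with source_encoding."""
    with open(src_path, "r", encoding=source_encoding, errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:  # line by line, so large files stay manageable
            dst.write(line)


# Demo with a tiny file containing the Windows-1252 bytes 0x93 and 0x94.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "multiplegames.pgn")
    dst = os.path.join(tmp, "multiplegames_utf8.pgn")
    with open(src, "wb") as f:
        f.write(b'[Event "\x93Open\x94"]\n')
    reencode_to_utf8(src, dst)
    with open(dst, encoding="utf-8") as f:
        converted = f.read()

print(converted)
```

The re-encoded file can then be passed to PGNData in place of the original.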

Incorrect FENs

Hi zq99, really happy to have found your work. I am having an issue with the following game: https://www.chess.com/game/live/76681276597?username=hikaru

which has the following PGN:

[Event "Live Chess"]
[Site "Chess.com"]
[Date "2023.05.01"]
[Round "?"]
[White "Hikaru"]
[Black "Blackmind96"]
[Result "1-0"]
[ECO "B20"]
[WhiteElo "3158"]
[BlackElo "2878"]
[TimeControl "180"]
[EndTime "5:02:13 PDT"]
[Termination "Hikaru won by resignation"]

1. e4 e6 2. d3 d5 3. Nd2 Nf6 4. Ngf3 c5 5. g3 Nc6 6. Bg2 g6 7. O-O Bg7 8. Re1
O-O 9. Qe2 a5 10. Nf1 b6 11. e5 Nd7 12. Bf4 a4 13. h4 a3 14. b3 b5 15. c3 Ba6
16. Rad1 d4 17. c4 Rb8 18. h5 bxc4 19. dxc4 Nb6 20. h6 Bh8 21. N1h2 Nd5 22. Bg5
Nce7 23. Bd2 Nb4 24. Bxb4 Rxb4 25. Ng4 Nf5 26. Nd2 d3 27. Qf3 Bb7 28. Ne4 Bxe4
29. Rxe4 Qg5 30. Qf4 Qxf4 31. Rxf4 Rd8 32. Be4 d2 33. Rf3 Rbb8 34. Rd3 Rxd3 35.
Bxd3 Nd4 36. Kf1 Nc6 37. f4 Nb4 38. Rxd2 Rd8 39. Ke2 Kf8 40. Bb1 Rxd2+ 41. Kxd2
Ke7 42. Nf2 f5 43. exf6+ Bxf6 44. Ne4 Bd4 45. Ng5 Bf2 46. g4 Bd4 47. Nxh7 Bf6
48. Nxf6 Kxf6 49. Ke3 Kf7 50. Ke4 Nc6 51. f5 exf5+ 52. gxf5 g5 53. h7 Kg7 54.
f6+ 1-0

After I get the two csv files with the details and moves, the "fen" column looks like this:

1. rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR
2. rnbqkbnr/pppp1ppp/4p3/8/4P3/8/PPPP1PPP/RNBQKBNR
3. ...
4. ...
...
Last. 8/6kP/2n2P2/2p3p1/2P1K3/pP6/P7/1B6

Now, this last one is impossible to analyze with Stockfish in Python. It also fails if I paste the FEN into lichess.org directly.

The correct FEN that you can obtain from the original link says it should be: 8/6kP/2n2P2/2p3p1/2P1K3/pP6/P7/1B6 b - - 0 54

So, how can I correct this? I cannot do this manually: I am planning to analyze 30k+ games and, of course, 100k+ FENs.
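
The fen column appears to hold only the piece-placement field, while a full FEN also needs side to move, castling rights, en passant square, halfmove clock and fullmove number. As a rough sketch, the side to move and fullmove number can be derived from the half-move count, assuming move_no is a 1-based count of half-moves and the game starts from the standard position. The other three fields cannot be recovered from the placement alone, so the placeholders below may be wrong for positions where castling or en passant is still possible; replaying the moves with python-chess would be the accurate route. The function name complete_fen is made up for illustration.

```python
def complete_fen(placement, ply):
    """Append placeholder FEN fields to a piece-placement string.

    ply is the 1-based half-move count that produced the position
    (assumed here to match the move_no column).
    """
    side_to_move = "b" if ply % 2 == 1 else "w"
    fullmove = ply // 2 + 1
    # Castling rights, en passant square and halfmove clock cannot be
    # derived from the placement alone, so "-", "-" and 0 are placeholders.
    return f"{placement} {side_to_move} - - 0 {fullmove}"


# Final position of the game above: 54 white moves + 53 black moves = ply 107.
last = complete_fen("8/6kP/2n2P2/2p3p1/2P1K3/pP6/P7/1B6", 107)
print(last)  # 8/6kP/2n2P2/2p3p1/2P1K3/pP6/P7/1B6 b - - 0 54
```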
