hmedal / lans
LArge LAbeled Netflow graph Simulator
License: GNU General Public License v3.0
Lastly, they are attempting to build/run the code on a Cray XC40 machine, and the version of mpi4py that LANS requires needs to be rebuilt for that machine. If your group now has access to Cray systems, could you try to provide a release that builds and installs cleanly on an XC40 machine (only if you have access to one from Cray)?
/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/indexing.py:115: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
Traceback (most recent call last):
  File "create_3D_edge_attribute_histograms.py", line 90, in <module>
    merged_df = pd.read_csv(temp_folder + 'merged_dataframe' + ctu_files[w].split('.', 1)[0] + '.csv')
  File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 250, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 566, in __init__
    self._make_engine(self.engine)
  File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 705, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/python/lib/python2.7/site-packages/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1072, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 350, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3187)
  File "pandas/parser.pyx", line 594, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5930)
IOError: File /work/sharun/1000_bin/temp/merged_dataframe_6.csv does not exist
From Mandy Sack:
Issue:
When I run with only 1 data set (5.binetflow) I receive the following errors:
('Number of Processors: ', 16)
Traceback (most recent call last):
  File "Enterprise_Connection_With_Graph_Simulation.py", line 103, in <module>
    main()
  File "Enterprise_Connection_With_Graph_Simulation.py", line 50, in main
    graphList.append(choice(org_graphList[1:len(org_graphList)]))
  File "/opt/gd/lang/python-2.7.11/lib/python2.7/random.py", line 275, in choice
    return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
IndexError: list index out of range
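The slice in main explains the failure: with only one input data set, org_graphList holds a single graph, so org_graphList[1:len(org_graphList)] is empty and random.choice raises on the empty sequence. A minimal reproduction (the list contents are hypothetical stand-ins):

from random import choice

org_graphList = ["graph_from_5.binetflow"]   # only one data set was supplied
tail = org_graphList[1:len(org_graphList)]   # [] -- everything after index 0
choice(tail)                                 # IndexError: list index out of range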
We should remove mention of an IDE
In the generate_edge function, the globals "innodes" and "outnodes" seem to reset between function calls.
These variables are assigned as empty global lists inside the create_graph function (they were previously declared outside of all functions so that they would always be initialized at import time, but changing their location has had no noticeable effect).
The nodeCreation function removes roles from innodes immediately after initializing it, clearing out every node with an in-degree of less than 1, and each time a node's in-degree is decremented, a check removes the node from the list if its new in-degree is less than 1. Even so, the code still chooses nodes that should have been removed from the innodes list as destination roles.
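A hypothetical sketch of the kind of stale-reference bug that would produce this behavior: rebinding a global list inside a function creates a brand-new object, so any candidate pool captured earlier keeps offering nodes that were since removed from the rebound list. The names follow the issue; the bodies are illustrative only.

innodes = ["a", "b", "c"]        # module-level declaration, as originally written

saved_pool = innodes             # e.g. a destination-role pool captured early

def create_graph():
    global innodes
    innodes = ["a", "b", "c"]    # rebind: replaces the old list with a new object

def remove_low_indegree(node):
    innodes.remove(node)         # mutates only the currently bound global object

create_graph()                   # saved_pool and innodes now name different lists
remove_low_indegree("b")
print("b" in innodes)            # False: removed from the rebound list
print("b" in saved_pool)         # True: the stale pool still offers "b" as a destination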
When generating nodes, the predefined in-degree is higher than the predefined out-degree; we need to check whether this is random variation or a systematic bias.
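One way to tell the two apart, sketched under the assumption of a pluggable degree sampler (sample_degrees below is a hypothetical stand-in for the generator's actual routine): in any realized directed graph the in-degree and out-degree totals must match edge-for-edge, so a gap in the predefined totals that persists across many seeds points to sampler bias rather than random noise.

import random

def sample_degrees(n, rng):      # hypothetical stand-in for the real sampler
    return ([rng.randint(0, 5) for _ in range(n)],
            [rng.randint(0, 5) for _ in range(n)])

gaps = []
for seed in range(100):
    rng = random.Random(seed)
    indeg, outdeg = sample_degrees(1000, rng)
    gaps.append(sum(indeg) - sum(outdeg))

mean_gap = sum(gaps) / float(len(gaps))
print(mean_gap)  # near 0 -> random variation; consistently one-sided -> bias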
Exception in thread "main" org.apache.hadoop.fs.ParentNotDirectoryException: Parent path is not a directory: file:/work/sharun/LargeScenarios/input_files/1.binetflow
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:523)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:504)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:531)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:504)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:531)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:504)
at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:694)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:313)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:118)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:124)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at Properties$.main(Properties.scala:71)
at Properties.main(Properties.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Read 4 items
Error in if (substr(files[k], 1, 4) == "part") { :
missing value where TRUE/FALSE needed
Execution halted
However, I then ran into another error in Enterprise_Connection_With_Graph_Simulation.py at the line:
create_graph(temp_folder, graphList[rank], seed = seedlist[rank], startpoint = startIndex[rank])
I was not able to get past that error quickly, so pandas was rolled back to 0.19.1.
The second issue I ran into was with networkx, and only in LANS version 6. Networkx version 2 was available in the sponsor's environment. The following changes to Property.py (lines 62-66) make it compatible with both version 1 and 2 of networkx:

def getInDegree(self):
    return sorted(dict(self.G.in_degree()).values())

def getOutDegree(self):
    return sorted(dict(self.G.out_degree()).values())

I found this site helpful for the migration: https://networkx.github.io/documentation/stable/release/migration_guide_from_1.x_to_2.0.html
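For context on why the dict() wrapper works on both major versions (a small check, safe to run under either): networkx 1.x returns a plain dict from in_degree(), while 2.x returns a DegreeView that iterates (node, degree) pairs; dict() normalizes both, so .values() behaves the same.

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([(1, 2), (3, 2)])
# 1.x: G.in_degree() -> {1: 0, 2: 2, 3: 0}
# 2.x: G.in_degree() -> InDegreeView iterating (node, degree) pairs
print(sorted(dict(G.in_degree()).values()))  # [0, 0, 2] on either version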
Simulated nodes do not always fit the histograms; we need to switch histogram use to exact values rather than probabilities.
This issue needs to be fixed in versions 5 and 6.1.
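A minimal sketch of the proposed switch, assuming the histograms are stored as value-to-count mappings (the storage format here is an assumption): instead of sampling each attribute independently with probability count/total, which only matches the histogram in expectation, emit the exact counts and shuffle.

import random

def sample_exact(histogram, rng=random):
    # Reproduce the histogram counts exactly, in random order.
    values = [v for v, count in histogram.items() for _ in range(count)]
    rng.shuffle(values)
    return values

hist = {"tcp": 3, "udp": 2, "icmp": 1}   # hypothetical attribute histogram
print(sample_exact(hist))                # always exactly 3 tcp, 2 udp, 1 icmp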
The first issue was noticed with pandas (the same issues appear on both versions of LANS). The version of pandas available on the system was 0.21.1; none of these issues are seen when using pandas version 0.19.1.
In the file role_mining.py, an error occurs at the lines:
feature_data = pd.read_csv(feature_file, delimiter=',', usecols=[0,1,2,3,4,5,6])
features = feature_data[[1,2,3,4,5,6]].as_matrix()
What I did to work around it before rolling back to pandas version 0.19.1 was to specify those columns explicitly, which got past that error.
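A sketch of that workaround in code (the file name is hypothetical; the exact column names in feature_file are not shown in the issue): newer pandas interprets feature_data[[1, 2, ...]] as label-based selection, which fails for named columns, so position-based selection with .iloc sidesteps it, and .values replaces the since-deprecated .as_matrix().

import pandas as pd

feature_data = pd.read_csv("features.csv", delimiter=",", usecols=[0, 1, 2, 3, 4, 5, 6])
# Select columns 1-6 by position rather than by the labels 1..6:
features = feature_data.iloc[:, 1:7].values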
Error with the line
str = each[1].split(",", 2)
in the function get_histograms; caused by an incorrect version of create_attribute_histograms.py, or by incorrect attribute files being used as inputs.
Creating the 3D histograms threw an error in which the merged dataframe would be created and then the code could not find the completed dataframe. This turned out to be a hardcoding issue: the code looked for a .csv file, while the actual result could be either .csv or .binetflow.
In the event of a .binetflow file, graph_gen5.py looked for a file with only the scenario name; e.g., if the input was 5.binetflow, the code would look for just 5.
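A sketch of an extension-agnostic lookup that would avoid both problems (the helper name is hypothetical; the merged_dataframe prefix and temp folder follow the traceback earlier on this page):

import os

def find_merged_dataframe(temp_folder, input_file):
    # Derive the scenario name whatever the input extension was: "5.binetflow" -> "5".
    scenario = os.path.splitext(os.path.basename(input_file))[0]
    for ext in (".csv", ".binetflow"):
        candidate = os.path.join(temp_folder, "merged_dataframe" + scenario + ext)
        if os.path.exists(candidate):
            return candidate
    raise IOError("No merged dataframe found for scenario " + scenario)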
Readme needs a few edits.
Hi Hugh, Mandy is traveling, so I'll do my best to describe the issues; she can correct me when she gets back.
The main issue was that if either the sport or dport field was empty, it made things very unhappy. The way Mandy worked around it was to replace the empty field with a dummy value of 0.
She also removed the spaces in the hex values for those fields and converted them to decimal.
All the CTU data set manipulation only needed to be done once per dataset, so it was easy to forget they were tweaked.
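A sketch of that one-time preprocessing (the Sport/Dport column names follow the .binetflow convention; the exact cleanup steps are inferred from the description above):

import pandas as pd

def clean_ports(df):
    for col in ("Sport", "Dport"):
        # Empty fields get the dummy value 0; spaces inside hex values are dropped.
        df[col] = df[col].fillna("0").astype(str).str.replace(" ", "")
        # Hex-encoded ports (e.g. "0x00a1") are converted to decimal.
        df[col] = df[col].apply(
            lambda v: str(int(v, 16)) if v.lower().startswith("0x") else v)
    return df

df = clean_ports(pd.read_csv("5.binetflow"))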