cmsdaq / daqaggregator


Aggregate monitoring data from the CMS DAQ system

Java 100.00%

daqaggregator's People

Contributors

Watchers

Forkers

andreh12 gladky phil2812

daqaggregator's Issues

Sleep time in case of error

Sometimes there is an error in the main loop, e.g. No active session found for toppro and PublicGlobal

If the error lasts for a few hours, there will be a lot of stack traces in the logs. We could use a dynamic sleep period when an error occurs: start with 2 s, then increase it each time the error repeats until it reaches some limit, e.g. 5 minutes.
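
A minimal sketch of such a dynamic sleep period. The class name `ErrorBackoff` and the doubling policy are assumptions for illustration; the issue only asks for a delay that grows from 2 s up to a limit.

```java
/** Sketch of an error-dependent sleep time (hypothetical helper, not part of DAQAggregator). */
final class ErrorBackoff {
    private final long initialMillis;
    private final long maxMillis;
    private long currentMillis;

    ErrorBackoff(long initialMillis, long maxMillis) {
        this.initialMillis = initialMillis;
        this.maxMillis = maxMillis;
        this.currentMillis = initialMillis;
    }

    /** Returns the time to sleep after a failed iteration, then doubles the next delay up to the limit. */
    long nextDelayAfterError() {
        long delay = currentMillis;
        currentMillis = Math.min(currentMillis * 2, maxMillis);
        return delay;
    }

    /** Goes back to the initial delay once an iteration succeeds. */
    void reset() {
        currentMillis = initialMillis;
    }
}
```

With `new ErrorBackoff(2000, 300000)` the main loop would sleep 2 s, 4 s, 8 s and so on after consecutive errors, never more than 5 minutes, and would call `reset()` after the first successful iteration.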

filter all flashlists by DAQ session id

all flashlists (not only LEVEL_ZERO_FM_SUBSYS as discussed in #44) should be filtered by the session id taken from the DAQ subsystem.

Filtering on the LAS with a URL parameter is no longer supported, so the DAQAggregator needs to filter in its own code.
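
A rough sketch of filtering in the aggregator's own code, assuming flashlist rows are available as string maps and that the session id column name (for example "SID") is passed in by the caller. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: keeps only the flashlist rows that belong to one session. */
final class FlashlistSessionFilter {
    private FlashlistSessionFilter() {}

    static List<Map<String, String>> keepSession(
            List<Map<String, String>> rows, String sidColumn, int sessionId) {
        String wanted = String.valueOf(sessionId);
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> row : rows) {
            // Rows without the session column are dropped as well (an assumption of this sketch).
            if (wanted.equals(row.get(sidColumn))) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```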

Natural ids

Natural ids will be stored in the same manner across all objects: in the '@id' field.
To avoid id conflicts, we add a prefix with the type of object to each id (e.g. SCAL exists both as a subsystem and as a ttcpartition name).
The FRL id is a combination of 2 elements: FRLPc.hostname + FRL.geoSlot
The FMM id is a combination of 2 elements: FMMApplication.hostname + FMM.geoslot
The SubFEDBuilder id is a combination of 2 elements: TTCPartition.name + FEDBuilder.name
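
As an illustration of the scheme, a small sketch of building such ids. The helper name `NaturalIds`, the prefix strings and the separators `_` and `+` are assumptions; the issue does not specify the exact format.

```java
/** Hypothetical helper that builds type-prefixed natural ids. */
final class NaturalIds {
    private NaturalIds() {}

    /** Joins the id parts and prefixes them with the object type to avoid cross-type clashes. */
    static String of(String typePrefix, String... parts) {
        return typePrefix + "_" + String.join("+", parts);
    }
}
```

For example, `NaturalIds.of("SUBSYS", "SCAL")` and `NaturalIds.of("TTCP", "SCAL")` yield different ids although the names are equal, and an FRL id could be built as `NaturalIds.of("FRL", hostname, geoSlot)`.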

class RefPrefix

Looking at this class https://github.com/cmsdaq/DAQAggregator/blob/e68e57a7a80b30259bb0427ba49701ffb4541d03/src/test/java/rcms/utilities/daqaggregator/RefPrefix.java I see that we do not have permission to read the test snapshots or that they do not exist anymore.

I would propose to remove this class.

add FED type to snapshot

distinguish the following types:

SLINK
SLINKExpress6G
SLINKExpress10G
FEROL40_10G
FEROL40_6G

Items to add to the data model

Dead time, dead time beam active, dead time from other sources
FROM: tcds_cpm_deadtimes (updated only once per LS ?)
Trigger rate from physics, randoms, calibration; suppressed trigger rate
FROM: tcds_cpm_rates (updated only once per LS ?)
num triggers / num suppressed triggers
FROM: tcds_cpm_counts
Current prescale column
FROM: probably in a trigger flash list?
Number of resyncs
FROM: tcds_pm_action_count ?
Fill number
FROM: tcds_cpm_deadtimes
TCDS action counts
FROM: tcds_pm_action_counts
Instantaneous luminosity
FROM: ?
CMS Magnet current
FROM: ?

Changes to EvB flashlists

Hi,

as mentioned this morning, EvB version 4.5.0 will bring a couple of changes to the flashlist online. The deprecated parameters have been kept for backwards compatibility, but are no longer filled. Mostly the parameters counting messages since the start of the run have been replaced by corresponding rate parameters. Rates are always in Hertz.

Remi

component	old	new	type	comment
RU/EVM	requestCount	requestRate	uint32	total rate of event request messages from BUs
	fragmentCount	fragmentRate	uint32	total rate of event fragment messages going to the BUs
	payloadPerBU	throughputPerBU	vector of uint64	Bytes/s going to each BU
	n/a	buTids	vector of uint32	List of BU TIDs, order corresponds to other per-BU vectors
	n/a	fragmentRatePerBU	vector of uint32	rate of event fragment messages for each BU
	n/a	retryRatePerBU	vector of double	rate of retries to send I2O message for each BU
	requestCountPerBU	removed
	dataReadyCount	removed
EVM	n/a	allocateRate	uint32	rate of messages going to the RUs
	n/a	allocateRetryRate	double	rate of retries to send messages to the Bus
BU	bandwidth	throughput	uint64	Bytes/s flowing through the BU
	requestCount	requestRate	uint32	rate of event request messages going to the EVM
	n/a	requestRetryRate	double	rate of retries to send messages to the EVM
	fragmentCount	fragmentRate	uint32	rate of event fragment messages received from the RUs
	n/a	slowestRUtid	uint32	the RU TID which delivers the fragments last

Extra FEDs that should not make it into the model

The two (pseudo)FEDs with srcId 103 and 388 make their way into the Aggregator's model, while they are not actually part of the configuration. The reason is that hardware feds are populated over the equipment set without any checks or exceptions. The Java DAQView provides an alternative way to get the hardware feds that should actually make their way into the model (indeed, these two feds are not included in the current daqview, but they are included when visualising the model using the new daqview application).

add ACTION_MSG of LEVEL_ZERO_FM_SUBSYS flashlist to snapshot

this is needed for item 5 in cmsdaq/DAQExpert#232 but probably also has applications elsewhere

add DAQAggregator version to snapshot

We should add the version of the DAQAggregator which created the snapshot to the snapshot itself such that the DAQ expert can display the (historical) DAQ aggregator version in the timeline.

RU fedIdsWithoutFragments should only be considered when rate is 0

Hi,

I noticed in the react DAQView that the RU displays spurious FED IDs which have not sent any data. These FED IDs are listed in "fedIdsWithoutFragments" in the RU/EVM flashlist. Checking in the old DAQView code, this field is only displayed if the RU is stuck:

if ( fields.containsKey("fedIdsWithoutFragments") && Integer.parseInt( fields.get("eventRate") ) == 0 && Integer.parseInt( fields.get("incompleteSuperFragmentCount") ) > 0 )

I guess this should go into the aggregator, too.

Cheers,
Remi

Writing tmp files

DAQSnapshotService is accessing a file that DAQAggregator has not yet closed (DAQAggregator is still writing to it). As a result, the file is not de-serialized in that iteration of SnapshotService (unexpected end of file, with a stack trace in the logs). However, it is discovered again and successfully de-serialized a moment later when the file is ready. This effect is not visible in DAQView, as all snapshots are delivered sooner or later. It could easily be fixed by writing to a tmp file that DAQSnapshotService filters out, and then renaming it. For example:

writing to snapshot.json.gzip~ (with suffix)
renaming to snapshot.json.gzip
reading only files without suffix '~'
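
The three steps above can be sketched with `java.nio.file`. The class name `AtomicSnapshotWriter` is hypothetical, and the sketch assumes the file system supports atomic moves (otherwise `Files.move` throws `AtomicMoveNotSupportedException`).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical writer: a snapshot becomes visible under its final name only when complete. */
final class AtomicSnapshotWriter {
    static final String TMP_SUFFIX = "~";

    private AtomicSnapshotWriter() {}

    /** Writes to "<name>~" first, then renames to the final name in one step. */
    static void write(Path target, byte[] content) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + TMP_SUFFIX);
        Files.write(tmp, content);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Reader-side filter: files that still carry the suffix are skipped. */
    static boolean isComplete(Path file) {
        return !file.getFileName().toString().endsWith(TMP_SUFFIX);
    }
}
```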

DAQAggregator for all setups

There should be the possibility of running parallel Aggregators to produce snapshots for all daq setups upon request, not only cdaq. This would allow (among other benefits) the new DAQView to achieve the information completeness of the old one, for all use cases. Things to take into account:

Persistence of snapshots (allocate new storage area-directory for each supported setup)
Accessing the different configurations (that is: creating different configuration files and start parallel aggregators with them)
Serving of snapshots. We could use a single API (with an extra parameter for setup). This API could be a standalone app and should be notified whenever a new setup is added (e.g. with a properties file that is checked every few seconds and contains entries of the type: setup_name->storage_path).
Eliminate hardcoded bits of the DAQAggregator that are specific to cdaq. I report the following:

At FlashlistDispatcher, we use the hardcoded service name "cpm-pri" and the hardcoded type names "tts_ici" and "tts_apve". These are used to extract the correct rows from the TCDS_PM_TTS_CHANNEL flashlist, and at the time of hardcoding them there was apparently no way (or need) to know this information dynamically.
For the TTS P and A state hyperlinks at the DAQView, we use a hardcoded host:port. This is, of course, a (DAQ) View problem, but it should be solved at model level, because it is a function of the setup as well.

Application homepage port for frlpcs,rus,bus

DAQView needs to know the hostname (which it does know), as well as the port on which the frlpc, ru and bu applications run, so that the hyperlinks become functional and point to the XDAQ homepage of each host.
While in cdaq that port was 11100 for all hosts, in daqval it seems to be 11600.
This information should dynamically be available for the DAQView, therefore I suggest adding it to the model. If you agree, I would take it from the "context" flashlist variable.

Support arbitrary number of LASes and auto-discover flash list to LAS mapping

DAQAggregator needs to be able to support any number of LASes in the configuration.
At the startup of the application, DAQAggregator should query all LASes using the retrieveCatalog service in order to determine on which LAS each flash list is served.

hash codes of objects change after insertion into sets

We noticed that comparing a snapshot with its serialized and again deserialized counterpart gives us the result that the snapshots are not equal. However, the data contained in them appears to be the same.

While debugging, it became clear that comparing sets of data, hash sets in particular, led to this result. For example, the set of TTC partitions of two subsystems were considered to be unequal, even when the subsystems and partitions were the same.

Comparing the data manually made obvious that the hash codes of the affected objects are in fact the same and the objects are also considered equal by their implemented equals()-method. However, when the objects are inserted into the set while the HWCfg-DB is read, their hash codes are different from their final hash codes, as the data that is used by their hashCode()-method changes. Therefore, the sets contain invalid hash-object relations once the objects are populated with data from LAS. This leads to contains() and containsAll(), which are used to check if two hash sets are equal, returning false, despite the hash codes of the objects being equal at the point of comparison, as the backing hash map uses the outdated hash codes to identify its contained objects.

All occurrences of this problem should be fixed, either by removing and re-adding the objects to the sets whenever the data used to generate the hash code changes, or by not using hash-dependent data structures to store the data in the Java model. Feel free to suggest alternatives.
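
The effect can be reproduced in isolation. In the sketch below, `Partition` is a hypothetical stand-in for a model object whose `hashCode()` depends on data filled in later, and `rehash` illustrates the first proposed fix (removing and re-adding the objects).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.Set;

final class MutableKeyDemo {
    private MutableKeyDemo() {}

    /** Hypothetical model object: hashCode() uses a field that changes after insertion. */
    static final class Partition {
        final String name;
        String ttsState; // filled later, e.g. from a flashlist

        Partition(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Partition)) {
                return false;
            }
            Partition p = (Partition) o;
            return Objects.equals(name, p.name) && Objects.equals(ttsState, p.ttsState);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, ttsState);
        }
    }

    /** Re-inserts all elements so that the set's buckets match the current hash codes. */
    static <T> void rehash(Set<T> set) {
        List<T> elements = new ArrayList<>(set);
        set.clear();
        set.addAll(elements);
    }
}
```

After `set.add(p)` followed by `p.ttsState = "Ready"`, `set.contains(p)` returns false although `p` is still in the set; calling `MutableKeyDemo.rehash(set)` makes it findable again. Basing `hashCode()` only on immutable identity fields (such as the name) would avoid the need for rehashing.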

Negative throughputs from BUs

Hi,

there is a problem with mapping the unsigned long values from the BUs for the throughput in the aggregator. The values in the flashlist show e.g. '3286449865' Bytes/s, while the aggregator snapshot contains numbers like '-1132031326'.

Remi
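
Negative values of this size are what one gets when an unsigned value above 2^31 - 1 is stored in a signed 32-bit int. The sketch below assumes that this is the cause and shows two ways to keep the value intact; the class and method names are hypothetical.

```java
/** Hypothetical helpers for reading unsigned flashlist values without sign overflow. */
final class UnsignedValues {
    private UnsignedValues() {}

    /** Reinterprets a value that was squeezed into a signed int as an unsigned 32-bit number. */
    static long fromOverflownInt(int raw) {
        return Integer.toUnsignedLong(raw);
    }

    /** Parses the flashlist text directly into a long, which holds any unsigned 32-bit value. */
    static long parse(String text) {
        return Long.parseLong(text);
    }
}
```

For instance, `(int) 3286449865L` is `-1008517431`, and `UnsignedValues.fromOverflownInt(-1008517431)` gives back `3286449865`. Values that use the full uint64 range would need `Long.parseUnsignedLong` or `BigInteger` instead.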

Snapshot storage on NFS dedicated area

DAQAggregator has been running almost without interruptions since the beginning of this month. When there exists a matching L0 static flashlist row, there is one new snapshot every three seconds, on average. Each snapshot file is slightly less than 300KB.

For the cdaq that means that DAQAggregator consumes almost 8GB of space per day. Given that there is currently 147GB left, there will be space shortage in less than three weeks from now, unless we ask for extension or we delete redundant data (dev or prod-2016).

Missing field in flashlist definition?

While probably not a purely DAQAggregator issue, this was discovered in its context, so it is probably interesting to record here, for reference.

In the usage notes for the flashlist types FMM_PARTITION_DEAD_TIME and FMM_FED_DEAD_TIME it is mentioned that these should be downloaded in a session context. This is implemented by using a sessionId GET parameter when downloading, whose value is matched against a corresponding column in LAS (usually labeled "sessionid" or "SID").
However, the two flashlists mentioned above do not seem to contain any column relevant to the session id. This has not been exposed yet, as neither of these flashlists is actually downloaded in the current development configuration of the aggregator. But if they are downloaded with session context enabled, they will contain zero tuples.

Are these flashlists actually supposed to be downloaded in a session context? If so, is there a column missing at LAS?

FED percent backpressure calculation problem

The FED.percentBackpressure field should be calculated as delta (Accumulated Backpressure) / delta (Time). Have a look at how it is done in DAQView.
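
A minimal sketch of this calculation, assuming the accumulated backpressure is given in seconds and the timestamps in milliseconds. The class name, the parameter names and the choice to return 0 when no time has passed are assumptions, not the DAQView behaviour.

```java
/** Hypothetical helper: percent backpressure from two consecutive measurements. */
final class BackpressureCalc {
    private BackpressureCalc() {}

    /** percent = 100 * delta(accumulated backpressure seconds) / delta(time in seconds). */
    static double percentBackpressure(double accPrevSeconds, double accNowSeconds,
                                      long timePrevMillis, long timeNowMillis) {
        long deltaMillis = timeNowMillis - timePrevMillis;
        if (deltaMillis <= 0) {
            return 0.0; // assumption: without a new measurement no backpressure is reported
        }
        double deltaSeconds = deltaMillis / 1000.0;
        return 100.0 * (accNowSeconds - accPrevSeconds) / deltaSeconds;
    }
}
```

For example, an increase of the accumulated value from 10.0 s to 10.5 s over 2 s corresponds to 25 % backpressure.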

Including flashlist information for the Ferol40

For the Ferol40, we will need to include new flashlists in addition to the current frl monitoring ones and we will need to choose which ones to use to update FRL and FED objects. Particularly, the choice is based on the FRL type for the FRL and the connected FRL type for the FED. The following actions are needed (feel free to add):

Make sure the new flashlists are available on the dedicated LAS URLs that are used by the DAQAggregator (when? who is to be asked?)
Devise a way to distinguish FRL objects according to their type (maybe through looking at their FEDs? add extra flag?)
Update the flashlist type enumeration, download and dispatch the new flashlists to the correct objects
Add/take into account the new flashlists from a persistence point of view

Some helpful info for the Ferol40 flashlists reported by their implementations' author:

a) No frlMonitoring flashlist for the Ferol40. For the backpressure at FEDs, there will be similar columns in the ferol40InputStream flashlist.

b) Renaming of flashlists for Ferol40:
-ferol40Configuration
-ferol40StreamConfiguration
-ferol40InputStream
-ferol40TcpStream
-ferol40Status

c) All equivalent monitorables that were in the ferolMonitoring flashlist will be in ferol40InputStream and ferol40TcpStream for the Ferol40.

FED srcIdExpected and srcIdReceived

In the FEDs, while the srcIdExpected is always set to the value returned by the hardware FED's getSrcId() method at the initial object mapping, the srcIdReceived is probably not correctly set in the FED's updateFromFlashlist(...), as revealed by comparing Java DAQView to React DAQView (there are red notifications whenever the two srcIds differ in a given FED).

I fixed this by introducing the following conditional value setting, when updating FEDs from the FEROL_INPUT_STREAM flashlist:

if (flashlistRow.get("WrongFEDIdDetected").asInt() == 0) {
    this.srcIdReceived = this.srcIdExpected;
} else {
    this.srcIdReceived = flashlistRow.get("WrongFEDId").asInt();
}

This works for most cases (although in some cases it does not...as Philipp noticed "The still affected FEDs seem to have some value masked"), but I would rather confirm that I correctly understand the semantics of WrongFEDIdDetected and WrongFEDId columns in the FEROL_INPUT_STREAM flashlist.

For info, there is also a flashlist field called "expectedFedId", but I am not currently using it, because the expected srcId is already filled from the hardware DB which, I assume, is correct. I also noticed that the HWDB-filled expected srcId is, in many cases, different from the value of this flashlist field for a given FED. This affects about 220 out of ~640 FEDs for which updateFromFlashlist was called during a test. The total number of FEDs produced in the snapshot was 699.

Thank you in advance for your comments on this one.

SLINK and TTS masks when inactive

According to the "Finite State Machine Model for Level 1 Function Managers" internal note, there are three cases when a FED has SLINK or TTS output: active, inactive, masked.
In the old daqview I have seen FEDs which were grayed out when they were inactive, not masked. Therefore in the Aggregator's FED masks, "true" may mean either inactive or masked.
We can keep this, if it does not violate important semantics. Otherwise we will need a new variable to let new daqview know if a FED has inactive SLINK/TTS output, so that rendering also takes this into account.

failing to build with maven

when checking out the current head (b688861) and building it with

   mvn install

I get the following error:

Project ID: com.google.code.gson:gson-parent

Reason: Invalid JDK version in profile 'doclint-java8-disable': Unbounded range: [1.8, for project com.google.code.gson:gson-parent

Searching a bit, it looks like I'm affected by issue 819 in https://github.com/google/gson . Changing the gson version from 2.6.2 to 2.6.1 in pom.xml makes the problem go away for me.

Unless we really need 2.6.2, I'd suggest to move back to 2.6.1 . I can prepare a pull request if needed.

When to reload the structure from hardware database?

To give an answer to the discussion we had yesterday, concerning whether the structure is reloaded at each new session and whether this is correct, I took a look and it seems that the structure is actually reloaded and is used to rebuild DAQ model each time a new session is detected OR a new hwcfg_key is detected. Hence it looks like the structure is rebuilt more often than needed, rather than less often.
For reference, see method detectNewSession() at SessionDetector class, where the relevant flag is set. Of course the naming can be changed to something more intuitive.

FRL Masking at Ferol40

Hi,

In the current use of the system by HCAL partitions, I see that a number of FEROL40 FEDs are being used (based on the convention of srcID over 1024).

However, I do not see any FEROL40 flashlists in the LAS for cdaq. Among others, this leads to missing FRL masks at these FEDs, as I had been asked to include FRL masking from the new FEROL40 flashlists, not from the Level 0 static flashlist, as done with legacy FEDs.

At this moment, the masking information already exists in the Level 0 static flashlist for both legacy and new FEDs (again, based on the srcID>1024 convention).

Is there still a reason not to include the masks from there?

Cheers,
Michail

SubFedBuilder mapping

In MappingManager, subfedbuilders (Aggregator data type) are created and mapped as follows:

SubFEDBuilder subFedBuilder = new SubFEDBuilder();
subFedBuilders.put(subFedBuilder.hashCode(), subFedBuilder);

The hashCode of an empty subFedBuilder is always the same, so each newly added subfedbuilder overwrites the previous one at the map position pointed to by the same key. A consequence is that all fedbuilders are then linked to a single subfedbuilder (the last one added).

Fix proposed:
Replace the hashCode of the empty SubFEDBuilder object with a new id for mapping, that identifies uniquely each newly added subfedbuilder. This id can be HWCfg DB's TTCP hashCode together with that of the FRLPc. Their combination should not be a simple operation (e.g. addition), in order to keep the collision probability as low as possible.
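
One possible variation on the proposed fix, sketched under assumptions: instead of combining two hash codes into an int, the map could be keyed by a small value class holding both identifiers, so that hash collisions no longer overwrite entries. The class name `SubFedBuilderKey` and the use of string ids are hypothetical.

```java
import java.util.Objects;

/** Hypothetical composite map key identifying a SubFEDBuilder by TTCP and FRLPc. */
final class SubFedBuilderKey {
    private final String ttcpId;
    private final String frlPcId;

    SubFedBuilderKey(String ttcpId, String frlPcId) {
        this.ttcpId = ttcpId;
        this.frlPcId = frlPcId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SubFedBuilderKey)) {
            return false;
        }
        SubFedBuilderKey other = (SubFedBuilderKey) o;
        return ttcpId.equals(other.ttcpId) && frlPcId.equals(other.frlPcId);
    }

    @Override
    public int hashCode() {
        // Objects.hash combines the parts with a multiplier of 31 rather than plain addition.
        return Objects.hash(ttcpId, frlPcId);
    }
}
```

With a `Map<SubFedBuilderKey, SubFEDBuilder>`, two keys that happen to share a hash code are still told apart by `equals()`, which an `Integer` key built from hash codes cannot guarantee.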

Flashlist persistence

There are 21 flashlists to be persisted. Test was performed with 2 s sleep time between data retrieval.

Disk space consumption

binary json format (smile) - 0.9TB/month - (1 flashlist snapshot: 2.1MB)
json format - 3.9TB/month - (1 flashlist snapshot: 9.1MB)

Flashlist sizes

Flashlist list (ordered by size):

FMMInputDetail (19.0%)
ferolTcpStream (11.0%)
FMMInput (10.2%)
ferolConfiguration (9.5%)
diskInfo (7.7%)
jobcontrol (7.3%)
levelZeroFM_static (5.7%)
ferolMonitoring (5.4%)
RU (4.0%)
ferolStatus (2.8%)
BU (2.8%)
tcds_pm_tts_channel (1.5%)
FMMStatus (1.1%)
levelZeroFM_subsys (<1%)
FMMFEDDeadTime (<1%)
EVM (<1%)
levelZeroFM_dynamic (<1%)
hostInfo (<1%)
FMMPartitionDeadTime (<1%)

Measured Aug 25.

PartitionDeadtime, FEDDeadtime should be displayed combined (as part of Deadtime)

An example case of a FED causing FEDDeadtime which causes PartitionDeadtime which in turn causes Deadtime is here: http://daq-expert.cms/DAQExpert/?start=2017-10-26T05:31:18+02:00&end=2017-10-26T05:35:18+02:00

Add support for other networks besides .cms

The DAQAggregator assumes hostnames to end with .cms in some places:

https://github.com/cmsdaq/DAQAggregator/search?l=Java&q=%22.cms%22&type=&utf8=%E2%9C%93

This caused problems in the setup for 904 and a solution for supporting at least .cms and .cms904 should be found.
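
One way this could look, as a sketch only: the known network suffixes are kept in a single list instead of being hardcoded at each use. Whether the aggregator needs to strip or to append the suffix in a given place is not stated here; the helper below (hypothetical name `HostnameUtil`) shows the stripping direction.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that handles more than one network suffix. */
final class HostnameUtil {
    /** Assumed list of supported networks; it could also be read from configuration. */
    static final List<String> KNOWN_SUFFIXES = Arrays.asList(".cms904", ".cms");

    private HostnameUtil() {}

    /** Removes a known network suffix, or returns the hostname unchanged if none matches. */
    static String stripNetworkSuffix(String hostname) {
        for (String suffix : KNOWN_SUFFIXES) {
            if (hostname.endsWith(suffix)) {
                return hostname.substring(0, hostname.length() - suffix.length());
            }
        }
        return hostname;
    }
}
```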

BackpressureConverter retains backpressure when FED is masked

The backpressure converter caches the result of the last calculation [1], leading to FEDs that are masked having constant backpressure in the produced snapshots [2], as their time delta appears to remain zero (due to the flashlist not being updated). What is the reasoning behind retaining and returning the previous result?

[1]

DAQAggregator/src/main/java/rcms/utilities/daqaggregator/data/helper/BackpressureConverter.java

Line 60 in d79714c

return lastResult;

[2] cmsdaq/daqview-react#33

DAQ state in Error when BU is Blocked

Hi,

while debugging the new EvB version in daqval, I noticed that the DAQ state is reported in Error when a BU is in blocked state:
http://daq-expert.cms/daq2view-react/index.html?setup=daqvaldev&time=2017-05-16-11:19:51

The EvB is running fine at 200 Hz and the DAQ FM is in Running state. Once I unblocked the BU, the DAQ state went back to Running:
http://daq-expert.cms/daq2view-react/index.html?setup=daqvaldev&time=2017-05-16-11:28:46

Remi

Back-porting features from Ferol40 to Ferol

Jonni announced 6 changes in flashlists. Summarized impact on DAQAggregator:

Deprecate: BackpressureCounter remove when no longer used by monitoring code… daqView etc.

Currently in DAQAggregator this value is not used at all.

rename AccBackpressureSecond to AccBackPressureSeconds - measure at the input fifo not the BIFI

Currently in DAQAggregator:

value from column AccBackpressureSecond of ferolInputStream flashlist is used to calculate the field percentBackpressure of FED object.
value from column AccBackpressureSeconds of ferol40InputStream flashlist is used to calculate the field percentBackpressure of FED object.

add AccSlinkFullSeconds for core back pressure

Currently in DAQAggregator:

value from column AccSlinkFullSeconds of ferol40InputStream flashlist is used to fill the field frl_AccSlinkFullSec of FED object.

add AccBIFIBackpressureSeconds

Currently in DAQAggregator:

value from column AccBIFIBackpressureSeconds of ferol40InputStream flashlist is used to fill the field frl_AccBIFIBackpressureSeconds of FED object.

add latchedFerol40ClockSeconds (calculate the latched time in seconds of the clock latch)

Currently in DAQAggregator:

value from column LatchedFerol40ClockSeconds of ferol40InputStream flashlist is used to fill the field frl_AccLatchedFerol40ClockSeconds of FED object.

add latchedSlinkSenderClockSeconds

Currently in DAQAggregator this value is not used at all.

storing DAQPartition objects in test data directory

We could persist the DAQPartition objects for the hardware configuration keys needed in FlashlistDispatcherIT instead of reading them from the hardware database. This would greatly speed up the test (but it may be against the spirit of an integration test).

use session id of DAQ subsystem for identifying state of other subsystems

the LEVEL_ZERO_FM_SUBSYS flashlist can have multiple instances for each subsystem. For the DAQ subsystem the instance is correctly identified as the one belonging to the function manager URL of toppro (according to

DAQAggregator/src/main/java/rcms/utilities/daqaggregator/datasource/FlashlistDispatcher.java

Line 140 in 725b5fa

case LEVEL_ZERO_FM_SUBSYS: // TODO: SID column

) and the SessionRetriever applies filters to the Level0 function manager URL to get the session id (

DAQAggregator/src/main/java/rcms/utilities/daqaggregator/datasource/SessionRetriever.java

Line 72 in 2866d69

if (fmUrl.contains(filter1) && fmUrl.contains(filter2)) {

)

However for other subsystems this check needs to be added by requiring that the value of the SID column of a given row in the flashlist matches the sessionId of the top level DAQ object (which seems to be valid and up to date by the time FlashlistDispatcher.dispatch(..) is executed).

support for TCDS_CPM_RATES_1HZ flashlists

TCDSGlobalInfo should also support reading values from the TCDS_CPM_RATES_1HZ in addition to TCDS_CPM_RATES which it already reads.

As @gladky mentioned, adding values for TCDS_CPM_DEADTIMES_1HZ (while keeping TCDS_CPM_DEADTIMES) was straightforward because content taken from TCDS_CPM_DEADTIMES was grouped in the field deadTimes of class TCDSGlobalInfo. Adding the _1HZ flashlist translated into adding another field deadTimesInstant.

However, the information obtained from TCDS_CPM_RATES is spread across several fields in class TCDSGlobalInfo.

Several possible solutions are:

duplicate the existing fields (trg_rate_total etc.) and add a prefix/suffix instant. Pro: schema stays backwards compatible, Con: not very elegant
group the existing fields into a new class and have two fields, one for the 'classic' rates updated every lumi section and one for the instantaneous ones. Pro: preferred solution, Con: a priori breaks the possibility to read the snapshot archives (DAQExpert browser and DAQView).

We should check whether Jackson (the framework used for serializing the DAQ snapshot objects) supports schema evolution.

Variable for "FRL input for FED is enabled"

In the current daqview, at some points, for instance at DAQ state=Initialize, all FEDs are greyed out. According to the daqview table help and particularly the geoslot:srcId explanation, it holds:

"FED source ID: The (expected) FED source ID. The id displayed in black if the FRL input for this FED is enabled. Otherwise it is displayed in grey."

By checking the entirety of greyed out FEDs in the snapshot, it seems that they all had "frlMasked" : false at the time when they were gray at the current daqview.

Please let me know if this indicates a bug in the frlMasked flag. If not, then I take it that the "FRL input for this FED is enabled" value cannot be deduced from the frlMasked flag.

Thanks,
Michail

Trigger rate in FEDBuilderSummary calculation problem

Trigger rate is wrong. For example in 280296 we were running with 106 kHz. Expert shows below 100 (bottom chart)

See it here: http://daq-expert.cms:8080/DAQExpert-1.3.3/?start=2016-09-07T15:00:00+02:00&end=2016-09-07T16:00:00+02:00

Masked RUs should not take part in calculation
Also check EVB

support for uTCA FED TTS states

these are only available through the TCDS system. The code should do the following:

find (C/L)PM url from tcdsFM
look up PM in HWCfgDB
find attached PIs
match PI flash list context + service with PI attached to PM
a masked channel has flag 0x98 = 152

Inadequate identification of subFEDbuilders

Looking at the current DAQView during yesterday's code review with Philipp, it seems that identifying subFEDbuilders uniquely (both in model object mapping and in persistence) with an id composed of TTCP and FRLPc probably cannot cover some cases which exist (quite often) in production. Currently, two subFEDbuilders under two different FED builders that are nevertheless defined by the same pair of TTCPartition and FRLPc will be mapped to the same subFEDbuilder object of the model.

An example (there are multiple cases) from the current DAQView:
TTCP=TEC+:26 - FRLPc=s1d06-25-01 > Used by two different FEDbuilders, TEC+3a and TEC+3b

Given that FEDbuilder names are unique, they should probably also participate in the composite natural ID of the subFEDbuilders, in order to fully differentiate between different subFEDbuilders.

When determining an FRLPc's crashed flag, only take into account jobs that are running on the correct port

When determining the crashed state of an FRLPc, the aggregator looks at all jobs in the jobTable of the FRLPc host's context [1]. However, it should only take into account the jobs with a jid that includes the FRLPc's hostname and port, not all the jobs running in the host's context.

Also, if multiple jobs with the same jid (same host, same port) exist, for example because the job crashed and remains in the table in Z-state [2], only the more recent job (based on startTime) should be looked at to determine the status.

(consider doing the same for FMMApplication, RU, BU)
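
The selection rule described above can be sketched as follows. The `Job` class, its field names and the assumption that a jid contains "hostname:port" are hypothetical; the real jobTable layout may differ.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class JobSelector {
    private JobSelector() {}

    /** Hypothetical view of one jobTable row. */
    static final class Job {
        final String jid;
        final long startTime;
        final String status;

        Job(String jid, long startTime, String status) {
            this.jid = jid;
            this.startTime = startTime;
            this.status = status;
        }
    }

    /** Returns the most recently started job whose jid matches the given host and port. */
    static Optional<Job> mostRecentJob(List<Job> jobTable, String hostname, int port) {
        String hostAndPort = hostname + ":" + port; // assumed jid format
        return jobTable.stream()
                .filter((Job j) -> j.jid.contains(hostAndPort))
                .max(Comparator.comparingLong((Job j) -> j.startTime));
    }
}
```

The crashed flag would then be derived from the status of the returned job only, ignoring older entries with the same jid and jobs on other ports.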

[1]

DAQAggregator/src/main/java/rcms/utilities/daqaggregator/data/FRLPc.java

Line 81 in 06d2fc6

    
public void updateFromFlashlist(FlashlistType flashlistType, JsonNode flashlistRow) {

[2]

Only one FRL per SubFEDBuilder

In current snapshots there seems to be only one FRL in each SubFEDBuilder. This bug is most likely caused by these lines:

DAQAggregator/src/main/java/rcms/utilities/daqaggregator/mappers/MappingManager.java

Lines 136 to 140 in 125c24d

    
/* SubFEDBuilder - FRL */
if (!subFedBuilderToFrl.containsKey(fb.hashCode())) {
	subFedBuilderToFrl.put(sfbId, new HashSet<Integer>());
}
subFedBuilderToFrl.get(sfbId).add(frl.hashCode());

Line 137 in particular, as the FEDBuilder's hash code is never added to the map's key set, therefore a new Set is created on every iteration, discarding the previously added FRLs.

This:

if (!subFedBuilderToFrl.containsKey(fb.hashCode())) {

should probably be:

if (!subFedBuilderToFrl.containsKey(sfbId)) {
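
As a side note, the check-then-put pattern can be avoided altogether with `Map.computeIfAbsent` (Java 8+), which leaves no room for using two different keys. The wrapper class below is a hypothetical sketch, not the MappingManager code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical holder for the SubFEDBuilder to FRL relation. */
final class SubFedBuilderFrlMap {
    private final Map<Integer, Set<Integer>> subFedBuilderToFrl = new HashMap<>();

    /** Creates the set on first use of the key and adds the FRL to it. */
    void link(int sfbId, int frlHash) {
        subFedBuilderToFrl.computeIfAbsent(sfbId, key -> new HashSet<>()).add(frlHash);
    }

    /** Returns the FRLs linked to the given SubFEDBuilder id (empty if none). */
    Set<Integer> frlsOf(int sfbId) {
        return subFedBuilderToFrl.getOrDefault(sfbId, Collections.<Integer>emptySet());
    }
}
```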

One naming convention in snapshot serializer: camel case

I suggest we stick to one naming convention in serializers: camel case. Here is part of snapshot requested in Expert API with block_retri key.

"globalTtsStates": {
  "block_retri": {
    "state": "R_8",
    "percentWarning": 0,
    "percentBusy": 0
  }
}

Natural ids: unnecessary id field

I'm switching to use mixins from DAQAggregator in DAQExpert API. Current natural ids are stored in '@id' field but 'id' field is also generated. You can see it below.

I understand that @id is generated by id generator you have implemented, and 'id' fields are unnecessary. You are not using that field whatsoever.

Please confirm that.

I will then remove the fields id from all classes, e.g.:
private final String id = "busummary";

{
  "@id": "DAQ",
  "sessionId": 283305,
  "runNumber": 279019,
  "lhcMachineMode": "PROTON PHYSICS",
  "lhcBeamMode": "INJECTION PHYSICS BEAM",
  "daqState": "Running",
  "levelZeroState": "Running",
  "dpsetPath": "/daq2/eq_160715/fb_all/dp_bl329_75BU",
  "lastUpdate": 1471535815270,
  "buSummary": {
    "@id": "BUS",
    "id": "busummary",

(..)
  "id": "daq"
}

Add run start timestamp to snapshots

As Hannes told us, the current DAQView detects state changes to "Running" and calculates the current run duration itself. This has the disadvantage of being wrong when the DAQView is restarted during a run, but on the other hand it works without having to add an additional column to the flashlist and populating it in the LV0.

We might also want to consider adding a more general timestamp for the last state entry independent of the state, not just for Running.

FEROL40 masking not working correctly

I noticed in DAQview that after BPIX has been removed, some FEDs are still active, even hours later:

It looks like all FED IDs from the last stream within a slot/FED builder still have "frlMasked" : false (fmmMasked is true).

RU 'isEVM' attribute is renamed to 'evm' when persisting

In the JSON snapshot, the RU attribute is called "evm" instead of "isEVM". I suppose this is done by Jackson, as the model's getter is named "isEVM" and of type boolean, just like the Java attribute itself.

From the model point of view, the attribute should be called isEVM in the snapshot as well (unless we change it to just evm in the Java model), yet this would require us to rename the getter to isIsEVM or getIsEVM. Otherwise, adding a Jackson annotation to the RU MixIn might also work (untested):

@JsonProperty("isEVM")
abstract boolean isEVM();

Summary of the model's object issues

Observations below are based on examining the most recent version of snapshot (new persistence, natural ids). Some of them, I suppose, are not bugs but expected behaviour (please confirm below when needed):

At FMMs
-Relations to fmmApp and ttcp are sometimes null
-Fields serviceName (from hwcfgdb) and url (from flashlist) are sometimes null
At FRLs
-Relation to subfedbuilder often null
At RUs
-Message fields from flashlists are sometimes null
-Field status is always null and is in fact never set. Where does this should come from? There is an intuitively related column in the relevant flashlist but this is already mapped to RU.stateName. There is no apparently related field in the hwru.
At SubFEDBuilders
-Field id is always null and is in fact never set. What is the meaning of this field for this type? There is no equivalent hardware DB object, so I suppose it cannot be a DB id
At SubSystems
-Field status (from flashlist) is sometimes null
At TTCPartitions
-Relation to fmm is sometimes null
-Field ttsState (loaded from flashlist FMM_STATUS) is sometimes null

Within each type, I have not observed very obvious correlations, e.g. a field being null whenever a relation is missing.
If you think some of the above should not really happen (probably case 2, the FRLs, at least), please comment. Also, please share any suggestions about the never-set fields of cases 3 and 4 (RUs and SubFEDBuilders).

None of the above issues affects the model's persistence and its parsing (e.g. by DAQView).

Handle SubFEDBuilders containing only pseudo-FEDs

As is evident from a non-production DAQ View such as http://cmsdaqweb.cms/local/daqview/daqdev/DAQ.html, there might be SubFEDBuilders containing only pseudo-FEDs.

Obviously, these SubFEDBuilders (rows in the DAQView) do not have a FRLPc, meaning that the current SubFEDBuilder JSON ID generation is unable to handle this case properly.

Besides that, the current mapping creates SubFEDBuilders by looping over the FRLs, which does not create SubFEDBuilders containing only pseudo-FEDs, as these FEDs are not connected to FRLs but to other FEDs instead.

I don't see a way of representing a pseudo-FED-only SubFEDBuilder in the current model, as there is no direct link between the SubFEDBuilder and its FEDs. It is not possible to find them using the FRLs as discussed before and the TTCP linked to the SubFEDBuilder might reference additional FEDs that are not related to the SubFEDBuilder.

These additional SubFEDBuilders might be "invented" by the application processing the snapshots (by processing the dependent FEDs, creating SubFEDBuilders for them if applicable and adding some sort of link between the SubFEDBuilders and their pseudo-FEDs). However, since the current model is supposed to represent the DAQ View's components, these SubFEDBuilders would then be (and currently are) missing from the model.

Duplicate flashlist names across different LAS urls

The recently included LAS url with various tcds flashlists also contains the four flashlists diskInfo, eventing-statistics, hostInfo and jobcontrol. Flashlists with these names, but containing different information, are already retrieved (and some, such as jobcontrol, downloaded) from the other two LAS urls we have been using.

Since the types of flashlists are inferred by name only, without regard to the source URL, this leads to conflicts in storing and dispatching the available flashlists. The way information from a flashlist is handled and set is a function of its inferred type, and all available flashlists are stored in a map with the flashlist type as key. Currently, this is handled by ignoring any flashlist from the new LAS which does not include 'tcds_' in its name, so we never end up with two flashlists of the same inferred type. This is possible because the four problematic flashlists from the new LAS have, for the time being, no use for the DAQAggregator.

However, in case we need to include these flashlists in the future (e.g. the two jobcontrol ones from both LAS urls that contain them), it might help to make all flashlist types aware of their LAS url source. In this case we would have two distinct types and it would be easy to distinguish between them at all stages. It would also allow accessing any LAS url in the same way (list*-all-flashlists-and-download-whichever-you-need) rather than introducing url-specific handling.
If such name conflicts could in the future also occur between different flashlist sets on the same LAS url, the source awareness of a flashlist might even need to be extended to include the lid...

*Listing an available flashlist requires having declared its type in the FlashlistType enum.
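One way to picture the source-aware typing suggested above is a composite map key made of the LAS url and the flashlist name. This is only a sketch: `FlashlistKey`, its fields and the example urls are hypothetical and not part of the DAQAggregator.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Sketch: FlashlistKey and the sample urls are hypothetical, not DAQAggregator code. */
final class FlashlistKey {
    final String lasUrl; // LAS the flashlist was retrieved from
    final String name;   // flashlist name as listed by that LAS

    FlashlistKey(String lasUrl, String name) {
        this.lasUrl = Objects.requireNonNull(lasUrl);
        this.name = Objects.requireNonNull(name);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FlashlistKey)) return false;
        FlashlistKey other = (FlashlistKey) o;
        return lasUrl.equals(other.lasUrl) && name.equals(other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(lasUrl, name);
    }

    public static void main(String[] args) {
        Map<FlashlistKey, String> flashlists = new HashMap<>();
        flashlists.put(new FlashlistKey("http://las-a.example", "jobcontrol"), "payload A");
        flashlists.put(new FlashlistKey("http://las-b.example", "jobcontrol"), "payload B");
        // Same name, different source: both entries are kept.
        System.out.println(flashlists.size()); // prints 2
    }
}
```

If the lid ever needed to be part of the identity, it could be added as a third field in the same way.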

Thank you,
Michail

	/* SubFEDBuilder - FRL */
	if (!subFedBuilderToFrl.containsKey(fb.hashCode())) {
		subFedBuilderToFrl.put(sfbId, new HashSet<Integer>());
	}
	subFedBuilderToFrl.get(sfbId).add(frl.hashCode());
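Note that the check in the fragment above uses `fb.hashCode()` as key while the insertion uses `sfbId`. If the two are not always equal, an already populated set could be replaced by an empty one. A possible way to avoid using two keys is `Map.computeIfAbsent`, sketched below with a hypothetical helper class; the map and key types are assumed from the fragment.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: MultiMapSketch is a hypothetical helper, not DAQAggregator code. */
final class MultiMapSketch {

    /** Adds value to the set stored under key, creating the set on first use. */
    static void link(Map<Integer, Set<Integer>> map, int key, int value) {
        // The same key is used for the lookup and for the insertion.
        map.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> subFedBuilderToFrl = new HashMap<>();
        link(subFedBuilderToFrl, 1, 100);
        link(subFedBuilderToFrl, 1, 101);
        link(subFedBuilderToFrl, 2, 200);
        System.out.println(subFedBuilderToFrl.get(1).size()); // prints 2
    }
}
```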


Disk space consumption

Flashlist sizes
