Second Shift Troubleshooting
Contents:
- Section 1: Introduction
- Section 2: Antenna Troubles
- Section 3: Correlator Problems
- Section 4: Computer Rebooting
- Section 5: Gunn Troubles
- Section 6: Script issues and suggestions
- Section 7: Tsys troubles
- Section 8: Other Issues
This document was created by Taiwan second shift observers as a troubleshooting guide for second shift operators. Along with solutions to common observing issues, most topics also state how the issue should be logged (as a time lost event) and link to the corresponding procedure page.
- If an antenna has been offline:
"...an antenna has been offline" can mean a number of things. Is the antenna not listed on monitor "a" page because it is in the hangar? Is it in the field, but not included in the project for a particular reason? Is it in the project, but has been set offline by the first shift observers? First, determine what is meant by "offline".
1. If the antenna is not showing up on monitor (all the fields on monitor "a" page are lines): this means the antenna is not in the field. Once the day crew moves the antenna back into the field, they will edit the deadAntennas file appropriately, which will report the antenna on monitor again.
2. If the antenna is not in the project: The first shift observers will inform you of any reasons a particular antenna is not in the project (it has been warmed up, something is not working with a receiver, etc). If an antenna is unavailable during first shift, it will more than likely be unavailable during second shift. First shifters can give you the relevant information.
3. The antenna has been set offline temporarily: this happens when an antenna is in the project and something is causing its scans to be flagged bad, but we want that flagging ignored so the script can continue. For example, if halfway through the track an antenna develops a drive problem and has to be stowed for the remainder of the track, we don't want to remove the antenna from the project, restart the correlator, and issue a newFile. But if the antenna isn't moving (or is stowed), the scans will be flagged bad and the script won't advance. setAntOffline -a # allows the script to continue despite any bad scans for that one antenna. If the cause is temporary (an acc reboot), the antenna can be set back online when the condition clears. If it is more involved (drive issue, antenna warming, etc.), it may need to remain "offline" until day crew can check it in the morning. Either way, first shift observers will give you the necessary information and instructions.
No action is needed by second shift observers in scenarios 1 or 2. In scenario 3, the antenna may be available later in the track. In any case, ask the first shifters if you have any questions about the status of an antenna.
- If an antenna stops:
- Case 1: You receive an Operator error
message stating that either the El or Az drive has stopped. Check the
"a" page of monitor, and the drive line will read "off". For this, you
should issue the following commands:
- hal9000> standby -a # (# is the antenna number)
hal9000> resume -a #
You may have to repeat this a couple of times.
LOG THIS: as an "Antenna Stops" time lost event.
- Case 2: Drive line reads SrvS### (### is a counting number; if it is stale for too long it will change to SrvS***), but all other lines look fine.
1. hal9000> killdaemon acc# servo restart
This restarts the Servo program. Usually this won't work, but it is worth trying.
2. If that doesn't work, you will need to reboot the antenna control computer (acc).
- a) If you are at the summit, hard reboot the acc# by pushing the yellow button.
- b) If observing remotely: hal9000> accRemoteReset -a #
Once the antenna is back, you must issue:
- hal9000> observe -s source -a #
- hal9000> resume -a #
to put the antenna back on the proper source.
LOG THIS: as a "Servo stale" time lost event.
- Case 3: If the antenna reads TrkStle, SrvStl, CalVane and tsys reads stale:
This is a spontaneous acc reboot. You can wait, and it should come up properly, but to be safe, you can reboot the acc. Either:
1. hard reboot by pushing the yellow button (if at the summit) or
2. issue accRemoteReset -a #
Again, you will need to issue observe, then resume, to put the antenna back on source.
LOG THIS: as a "Spontaneous Antenna reboot" time lost event.
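As a rough illustration only (this is not an SMA tool), the three cases above can be sketched in Python as a lookup on the drive-line text from monitor "a" page. The function name and the exact string formats it matches are assumptions based on the descriptions above; check them against what monitor really displays.

```python
import re


def classify_antenna_stop(drive_line: str) -> str:
    """Map the drive-line text to one of the three cases described above.

    Hypothetical helper: the matching rules are assumptions drawn from the
    wording of this guide, not from the monitor source code.
    """
    text = drive_line.strip()
    if text.lower() == "off":
        return "case 1: drive stopped; issue standby, then resume"
    # "SrvS" followed by a counter or by asterisks, e.g. SrvS042 or SrvS***
    if re.fullmatch(r"SrvS(\d+|\*+)", text):
        return "case 2: servo stale; restart servo, reboot the acc if needed"
    if text in ("TrkStle", "SrvStl"):
        return "case 3: spontaneous acc reboot; wait, or reboot the acc"
    return "unrecognized: check the other monitor pages"


if __name__ == "__main__":
    for sample in ("off", "SrvS042", "SrvS***", "SrvStl"):
        print(sample, "->", classify_antenna_stop(sample))
```

The point of the sketch is only that case 2 (a counter or asterisks after SrvS) and case 3 (SrvStl together with other stale fields) look similar at a glance but call for different time lost log entries.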
- Correlator Crates get out of sync
This can show up in two ways. First, the error can appear on monitor "a" page, in the bottom line of the second section (the line that shows the scan count, the scan time, and the number of scans remaining). If this appears, check the "c" page and confirm that the "I Time" is not synchronized for all crates.
Second, the correlator crates sometimes get out of sync just enough to not record data, but the message won't show up on the "a" page. If you see that data isn't coming in on corrPlotter mir_mode (that is, time isn't increasing), check the correlator "c" page. This usually occurs at the end of an ipoint, when one crate takes a little longer to reset back to the original scan length.
In both of these instances, you should get an Operator Error Message stating that a crate is out of sync, but that message will only appear once.
To fix:
- If the crates are out of sync:
- hal9000> correlatorPause (then wait a couple of seconds so that all crates pause)
- hal9000> correlatorResume
Verify that all crates are in sync after, and that data is once again coming in on corrPlotter mir_mode. If it isn't, you may be required to issue the correlatorPause and correlatorResume sequence again.
This is why it is important to monitor corrPlotter during the track, especially mir_mode, to make sure that time keeps increasing, and that when the antennas change sources that corrPlotter changes too.
LOG THIS: as "correlator crates out of sync" time lost event.
PLEASE NOTE: The integration time does not synchronize perfectly on the "c" page, so as long as the crates are all counting pretty close to each other, there are no errors on the monitor "a" page, and there are fringes, there should be no problem.
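The "pretty close to each other" judgement in the note above can be sketched as a simple spread check. This is a minimal Python illustration, assuming you have read the "I Time" values off the "c" page by hand; the function names and the tolerance value are hypothetical, since this guide does not give a numeric threshold.

```python
def crates_in_sync(i_times, tolerance=1.0):
    """Return True if all crate "I Time" readings lie within `tolerance`.

    `i_times` maps crate number -> reading, in whatever units the "c" page
    shows. The default tolerance is an assumption, not a documented limit.
    """
    values = list(i_times.values())
    if not values:
        raise ValueError("no crate readings supplied")
    return max(values) - min(values) <= tolerance


def lagging_crates(i_times, tolerance=1.0):
    """List crates whose reading trails the most advanced crate by more than `tolerance`."""
    leader = max(i_times.values())
    return sorted(crate for crate, t in i_times.items() if leader - t > tolerance)


if __name__ == "__main__":
    readings = {1: 10.0, 2: 3.0, 3: 9.5}
    print(crates_in_sync(readings), lagging_crates(readings))
```

A small spread alone is not proof that all is well; as noted above, also confirm that there are no errors on the "a" page and that fringes are present.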
- If "I time" on monitor "c" page does not count:
If this happens:
- a) you should have received an operator error message stating that "crate # may have stopped"
- b) check monitor "a" page, on the correlator line. It should read "correlator crates are out of sync"
- 1. restart the crate that isn't counting:
hal9000> restartCorrelator -sxxx -c #
where xxx is the correlator resolution, the number of channels per chunk. # is the crate number.
- 2. If that doesn't work, restart the entire correlator:
hal9000> restartCorrelator -sxxx
Remember to use the same command that was used during priming. It can be found on the observing plan page for the track you are running. It can also be found at the beginning of the observing script.
- 3. If the crate still won't start, try rebooting the crate. This can be done by either:
- a) logging into crate# and issuing the /common/bin/Reboot command
(if that doesn't work, log into the crate and issue > reboot )
- b) by creating the file /global/killcrate#
- 4. If the crate still won't start, the correlator trouble shooting document suggests rebooting newdds, waiting until the process dDSServer is running on newdds, and then restarting the correlator software in the normal fashion.
- 5. If it still doesn't start, you will need to contact someone with more experience with the correlator, such as Taco.
LOG THIS: as a "correlator crates stopped counting" time lost event.
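The escalation steps above can be written out as an ordered list, which may help when filling in the time lost log afterwards. The Python sketch below only builds text; it does not run anything on hal9000. The function name is hypothetical, and the value 128 used in the example is only a placeholder for the channels-per-chunk setting, which must come from the priming command for your track.

```python
def stalled_crate_steps(crate, channels_per_chunk):
    """Return the escalation ladder for a crate whose "I time" is not counting.

    Hypothetical helper: it only formats the steps listed in this guide.
    """
    res = f"-s{channels_per_chunk}"  # same "-sxxx" form as in the guide
    return [
        f"restartCorrelator {res} -c {crate}",
        f"restartCorrelator {res}",
        f"reboot crate{crate}: /common/bin/Reboot on the crate, "
        f"or touch /global/killcrate{crate}",
        "reboot newdds, wait for dDSServer to run, then restart the correlator",
        "contact someone with more correlator experience",
    ]


if __name__ == "__main__":
    for number, step in enumerate(stalled_crate_steps(4, 128), start=1):
        print(f"{number}. {step}")
```

Stop at the first step that gets the crate counting again, and use the exact restartCorrelator options from priming rather than the placeholder shown here.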
- If corrPlotter shows "red boxes"/bad sampler statistics.
- hal9000> setIFLevels -v (-a #) (# is the antenna number)
Make sure to issue this command on a quasar, like the gain calibrator. Also, make SURE you are not on a planet.
- If some crates are missing on corrPlotter
If crates are missing on corrPlotter that could mean one of two things:
- 1. They were intentionally left out of the project. Sometimes there is an issue with a crate, and it will sometimes be left out of a project. This is rare, but does happen. There is no problem if this is the case.
- 2. The crates weren't specified properly when the
project was started. Check the "p" page to see what correlator crates
are active. Check the project command to see if the crates were
specified properly. If they weren't you will need to add them into the
project and restart the correlator.
- > project -r -C 1..12
This revises the project (-r) by activating the correlator crates (-C) 1 through 12 (1..12). Once that is done, make sure all crates show up on the "p" page and the "c" page, and restart the correlator.
- If some antennas are missing on corrPlotter (or you see weird baselines like 0-0)
This probably means that the number of antennas was changed after the project was started, but the correlator was not restarted, nor was a newFile issued.
Following the "Changing Project Antennas - a check list" document page, if antennas were taken away from or added to the project, you must restart the correlator:
- hal9000> restartCorrelator -sxxx
and then start a new data file:
- hal9000> newFile -D "directory"
where "directory" is science, pointing, priming, etc. Run newFile --help for more information about specifying directories.
- If you don't see fringes
First, determine which antenna is not showing fringes. Once you've determined that:
- Are the mirror doors open? If not:
- hal9000> openM3 (-a #)
- Is the antenna on the proper source? If not:
- hal9000> observe -s source -a #
- hal9000> resume -a #
- Are there large Az or El offsets? Is Track running on all antennas? Check monitor # page (# is the number of the antenna)
- Check to make sure you have the correct feed offsets. Either "R" or "#" page, or on the "M3 Doors" line of monitor "a" page
- Check that you are in radio mode, on "R" or # page. If not:
- hal9000> radio -a #
- Is the ambient load in the beam? Check CalVane line on monitor "a" page. If so:
- hal9000> tune -a # -c "unheatedload out" or
- hal9000> tsys -a #
- Are the choppers out of position? To determine this, check the "a" page, and see if the choppers are reading anything other than "OK-FC". If so:
- hal9000> initChopper -a #
- Are you looking at an extended source like a planet? Planets will be resolved on the longer baselines, especially during EXT and VEX configurations. Make sure you check both corrPlotter scan mode (phase vs bandwidth) and mir_mode (phase vs time) of corrPlotter to check for fringes. For planets, you can check the "Planetary visibility function plotter" on the SMA tool page.
- Is the antenna being flagged bad (a highlighted "Flagged" appears above the ant/pad line)? Go to the "F" page to see why the flagging occurred. This can give you some indication of why you aren't seeing fringes.
If everything looks normal but you still aren't seeing fringes, the antenna may have a false lock. That is, it is locked, but not at the proper position. To fix this you may need to retune that antenna. On hal9000:
- hal9000> tune -a # -c "zt"
- zeros the tuner
- hal9000> tune -a # -c "zb"
- zeros the backshort
- hal9000> tune -a # -c "tune xx.xx" or:
- hal9000> tune -a # -c "interpolate -f xx.xx -g"
- where xx.xx is the gunn frequency, found on monitor "i" page.
If that doesn't lock it up, you'll need to do the normal steps to lock it:
- hal9000> tune -a # -c "rfpower 20"
- hal9000> tune -a # -c "t joy"
- then hal9000> tune -a # -c "opt curr -t" if necessary.
You can monitor the receiver positions (gunn tuner, backshort, etc) on the r++ page of monitor.
- If chunk-to-chunk phase offsets are seen
Observe a source where you see good scan-mode fringes to all baselines to the lowest numbered antenna. (i.e. If the project has antennas 1,2,3,4,6,7,8, make sure all 1-* baselines have good fringes. If the project has antennas 2..8, make sure you see fringes on all baselines involving antenna 2).
Once you have good fringes to the lowest numbered antenna, the command is:
- hal9000> setChunkPhaseOffsets -u
This will take a little bit of time, so wait at least two scans AFTER you have issued the command before changing sources, to make sure the offsets were corrected.
This command can be rather dangerous. It should only be used during priming, and should never be used once the science track has started.
Sometimes there will not be a source available, or the weather could be too poor, for you to see good enough fringes to set the chunk-to-chunk offsets before starting the script. If you don't get them set before starting the track THAT IS OK. These offsets can be calibrated out during data reduction.
NEVER ISSUE A setChunkPhaseOffsets COMMAND DURING A SCIENCE TRACK!
Rebooting PowerPCs
- Rebooting acc#
- 1. If at the summit, hard reboot using the yellow buttons
2. If observing remotely: hal9000> accRemoteReset -a #
- Rebooting colossus
- From hal9000, acc1..8, ono
hal9000> colossusRemoteReset
- For hal9000 only:
- log into colossus and run: colossus> /application/bin/halRemoteReset
- For any of the other PowerPC computers:
- To reboot any of the crate computers, newdds, or m5, log into the relevant machine and use:
machine> /common/bin/Reboot
if that doesn't work, log into the machine and issue
machine> reboot (or /application/bin/reboot, if /application/bin is not in your path)
- Please note: if you reboot m5 or hal9000 during a science
observation, you will need to re-prime the antennas. The basic commands
you will need:
- hal9000> standby
hal9000> observe -s science source as specified in priming information
hal9000> dopplerTrack as specified in priming info
hal9000> restartCorrelator as specified in priming info
you will then need to reconfirm fringes.
- If you cannot log into the machine:
- Case 1: If someone is at the summit, ask him/her to do a hard reboot (pushing the button) on the particular machine.
- Case 2: If not, then from hal9000, acc#, crate#, or newdds, create a file named killmachine_name in the /global directory (for example, killcrate4 if you need to reboot crate4):
> touch /global/killmachine_name
- If hal9000 won't boot: In this case the best thing to do is to contact someone with experience dealing with this machine, such as Taco, Charlie Katz, or Mac Cooper.
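The kill-file naming rule from Case 2 above ("kill" followed by the machine name, under /global) is easy to mistype, so here is a small Python sketch of it. The helper names are hypothetical and the code only formats strings; it does not create any file.

```python
from pathlib import PurePosixPath


def kill_file_for(machine: str) -> str:
    """Return the /global kill-file path for `machine` (e.g. crate4, newdds)."""
    name = machine.strip()
    if not name or "/" in name:
        raise ValueError(f"not a plain machine name: {machine!r}")
    return str(PurePosixPath("/global") / f"kill{name}")


def touch_command(machine: str) -> str:
    """Return the shell command that would create the kill file."""
    return f"touch {kill_file_for(machine)}"


if __name__ == "__main__":
    print(touch_command("crate4"))
```

For crate4 this gives the same file name as the killcrate4 example in Case 2.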
- The gunn flashes in and out of lock, causing scans to be flagged bad.
Check the "r" page. If the receiver looks like it should be locked (high IF power, low Nz Power), but the PLL ratio is low (<10), it is flashing in and out of lock because the PLL ratio is close to or below the threshold (usually 10). To fix this, you should only need to adjust the rfpower slightly to achieve a better ratio.
- hal9000> tune -a # -c "rfpower xx.x" where xx.x is a number between 0 and 20.
If this is the problem, you should only have to adjust the rfpower by a little bit, 0.3, 0.5, something like that. If the rfpower was 12.4, you could try 12.7, or 12.0, to see if the PLL ratio gets better or worse.
LOG THIS: as "Gunn PLL pops out of lock" time lost event.
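The small rfpower adjustments described above (a few tenths up or down, staying between 0 and 20) can be sketched as follows. This is a hedged Python illustration: the function names are hypothetical, the step sizes 0.3 and 0.5 are the ones mentioned in the text, and the code only prints the tune commands rather than issuing them.

```python
def rfpower_trials(current, steps=(0.3, 0.5), low=0.0, high=20.0):
    """Return nearby rfpower values to try, smallest changes first.

    Values are clamped to the 0 to 20 range given in this guide and
    rounded to one decimal place.
    """
    trials = []
    for step in steps:
        for candidate in (current + step, current - step):
            value = round(min(high, max(low, candidate)), 1)
            if value != round(current, 1) and value not in trials:
                trials.append(value)
    return trials


def rfpower_command(antenna, value):
    """Format the tune command for one trial value."""
    return f'tune -a {antenna} -c "rfpower {value}"'


if __name__ == "__main__":
    for value in rfpower_trials(12.4):
        print(rfpower_command(3, value))
```

Try one value at a time and watch the PLL ratio on the "r" page after each change to see whether it gets better or worse.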
- If gunn is out of lock:
- 1. Check that the YIG is locked. If it is not locked:
- hal9000> lockYIG -a #
If that does not lock the YIG, you may need to restart the antIFServer:
- hal9000> killdaemon acc# antIFServer restart
then:
- hal9000> lockYIG -a #
- 2. If YIG is locked but gunn is not:
- hal9000> tune -a # -c "opt tun"
- 3. If "opt tun" doesn't re-lock the gunn:
- hal9000> tune -a # -c "t joy"
This moves the Gunn tuner using the Up and Down arrow keys.
- 4. If "t joy" alone doesn't re-lock it, adjust the rfpower to max (20) first, then t joy:
- hal9000> tune -a # -c "rfpower 20"
- hal9000> tune -a # -c "t joy"
- 5. If "t joy" still doesn't re-lock the gunn:
- hal9000> tune -a # -c "tune xxx"
where xxx is the gunn frequency, which can be found on the "i" page.
LOG THIS: as "Gunn PLL pops out of lock" time lost event.
- How to pause the script:
> Ctrl - z in the window where the script is running.
- How to restart the script after pausing it:
> fg in that same window.
Please note: when you restart a script using fg, the script will go to the next command. For example: you want to check pointing via ipoint, so after the script goes to a calibrator, you want to pause it using Ctrl - z. The script will issue the commands:
- observe -s calibrator
- tsys
- integrate -s 10 -w
Once it gives the "integrate" command, Ctrl - z the script, which brings back the prompt:
- hal9000>
When you later restart the script with fg, it goes straight on to its next commands:
- observe -s source
- tsys
- integrate -s 30
So remember, if you pause the script, especially for pointing: PAUSE IT AFTER A CALIBRATOR. Then, when you are ready to restart it, MANUALLY OBSERVE THE SAME CALIBRATOR BEFORE RESTARTING THE SCRIPT WITH FG.
This is so you have calibrator data on either side of the pause, whether it be for pointing or anything else.
- How to kill the script:
hal9000> kill -9 xxx
xxx is the process ID (PID) number. It can be found in a number of places:
- a) On the monitor "p" page. Please note that 0 (zero) is not a valid PID, so if that is being reported, it is incorrect.
- b) In the window running the script. The script should print out a line every so often saying "PID of this script is xxx". You may have to scroll up to see it.
- c) On hal9000, issue
hal9000> ps -ax -u username
username is the username of the person running the script. One of the commands you are running should show up as /perl, which will be the script.
- How to skip the current source in the script:
hal9000> integrate -s 0
This puts the "number of scans remaining" counter (located on monitor "a" page) to 0, and the script will go to the next command after the current scan finishes.
- If Tsys shows wacko or unreasonably high value on monitor:
Sometimes tsys measurements just fail. Therefore issue tsys again to confirm:
hal9000> tsys -a #
- If Tsys measurement comes back saying "Tsys is corrupted" and tsys has an * :
Usually this means that the continuum detector (line IFCnt1D1 on monitor "r" page) is getting saturated (9999.99) during a tsys, causing a bad value. This can be fixed by lowering the current (SIS-0 I) for the antenna.
hal9000> tune -a # -c "opt curr -c xx"
where xx is a current value slightly lower than where it is presently set. For example, if the current was reading 36, first try:
hal9000> tune -a # -c "opt curr -c 34"
hal9000> tsys -a #
If tsys is still corrupted, drop the current a little bit more.
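The "lower the current a little, re-measure, repeat" loop above can be sketched in Python. The step of 2 comes from the 36 to 34 example in the text; the number of attempts, the function names, and treating any reading of 9999.99 or more as saturated are assumptions made for illustration. The code only formats commands.

```python
SATURATED_READING = 9999.99  # value the guide quotes for a saturated detector


def is_saturated(detector_reading):
    """True when the continuum detector reading is pinned at the saturated value."""
    return detector_reading >= SATURATED_READING


def lowered_currents(present, step=2, attempts=3):
    """Return successively lower current settings to try, in order."""
    candidates = (present - step * n for n in range(1, attempts + 1))
    return [value for value in candidates if value > 0]


def current_command(antenna, value):
    """Format the tune command that sets the current to `value`."""
    return f'tune -a {antenna} -c "opt curr -c {value}"'


if __name__ == "__main__":
    if is_saturated(9999.99):
        for value in lowered_currents(36):
            print(current_command(5, value))
```

Issue tsys -a # after each change and stop lowering the current as soon as the measurement is no longer flagged as corrupted.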
- If the Dewar slowly warms up to > 5K:
Issue the commands:
hal9000> tune -a # -c "bfield1 0"
This sets the bfield for the High Freq Rx to 0. Do this even if you are not using the High Freq Rx.
hal9000> tune -a # -c "bfield0 0"
This sets the bfield for the Low Freq Rx to 0. This may cause the tsys to become a bit higher, because the Low Freq Rx will not be properly optimized, but that is ok for now.
After setting BOTH the bfields to 0, the antenna should slowly start to cool back down. You can monitor it on the monitor "y" page.
If the dewar still doesn't start cooling, you will need to deactivate the receivers. Issue:
hal9000> tune -a # -c "activate -l 1 -h 5"
This will result in the loss of fringes for that antenna. You should set the antenna offline so the script can continue.
hal9000> setAntOffline -a #
When the dewar has cooled back to normal operating temperature, you can reactivate the receivers and set the antenna back online.
hal9000> tune -a # -c "activate -l # -h 5"
with either -l 0 for the 200Rx or -l 2 for the 300Rx.
hal9000> setAntOnline -a #