Better error handling is more reliable.
Options:
-L, follow location
--retry, number of retries
--retry-all-errors, does not fail on ALL errors, that's why there is -f
-f, fail fast with no output at all on server errors
--retry-delay, make curl sleep this amount of time before each retry
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20788>
The Mesa CI LAVA job submitter was suffering from a bug in the LAVA
software that made their timeouts related to sub-actions unreliable,
such as waiting for the user login prompt automatic response.
The following MR
https://git.lavasoftware.org/lava/lava/-/merge_requests/1900 fixed this
issue. So we can now better control job timeouts granularity, failing
the job faster when there is something weird hanging the boot stage.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20596>
Currently, LAVA jobs only show metadata when successful, let's show this
info in all retries to make it easier to debug or report issues.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20051>
To debug the LAVA jobs locally, we have an option in the
lava_job_submitter script to ignore the JWT token to make it possible to
retry jobs without the need to get an unexpired token.
But this trick needs to modify the overlayed directory so that we would
need to download and extract it earlier in the run.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20051>
LAVA splits the DUT log lines with `\r` in them. Unfortunately, it
breaks the Gitlab section line syntax when the HWCI script calls it
since it is a oneliner.
This commit changes the` fix_lava_gitlab_section_log` function to a
stateful generator that can merge lines that respects two consecutive
patterns.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7703
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20051>
LAVA uses XMLRPC to send jobs information and control, more specifically
it sends device logs via YAML dumps encoded in UTF-8 bytes.
In Python, we have xmlrpc.client.Binary class as the serializer
protocol, we get the logs wrapped by this class, which encodes the data
as UTF-8 bytes data.
We were converting the encoded data to a string via the `str` function,
but this led the loaded YAML data to use single quotes instead of double
quotes for string values that made special characters, such as `\x1b` to
be escaped as `\\x1b`.
With this fix, we can now drop one of the hacks that fixed the bash
colors.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20051>
We don't need to login anymore, but we can't use plain minio commands
now. `ci-fairy` got a helper as `s3cp` to keep an almost identical
API.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19076>
Let's be consistent.
Also needed for parsing device and traces yml file from logs.
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18422>
Visible size reduction and minor performance improvement at no cost.
rootfs measure:
```
2022-07-27 11:15:16.235268: 475MB downloaded in 111.59s (4.26MB/s)
2022-07-27 15:07:40.984857: 425MB downloaded in 85.57s (4.97MB/s)
```
So let say approx. 95s vs 85s if we assume 5MB/s, which can bring us
10s speedup...
Acked-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Acked-by: Daniel Stone <daniels@collabora.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17776>
Refer to environment variables before falling back to the default
timeouts for each Gitlab section.
This makes more explicit in the job definition that there is a
particular case where the job may obey different timeouts.
Closes: #6908
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17703>
When we don't want to communicate with minio, e.g. running
lava_job_submitter script locally, MINIO_RESULTS_UPLOAD should be unset.
But this variable is already set by generate-env script, so we need to
remove it from the /set-job-env-vars.sh to avoid declaring it in
unexpected scenarios.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17645>
Empirically, a successful LAVA boot time should take less than 3
minutes.
LAVA itself is configured to attempt thrice to boot the device,
summing up to 9 minutes.
It is better to retry the boot than cancel the job and re-submit to
avoid the enqueue delay.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17646>
Sleep a bit before executing `lava-test-case` to give time for bash to dump shell xtrace messages into
the console, which may cause interleaving with `LAVA_SIGNAL_STARTTC` in some
devices like a618.
The same approach worked for `LAVA_SIGNAL_STARTRUN` since
3b8d10d270 (deafdd86b8d9d0108bc692f479c3b31c4c7d5635_161_164)
was merged.
Closes: #6867
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Reviewed-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17533>
If you accidentally re-included your test job core definition after your
driver-specific ruleset, you'd end up running the driver job on every
source code change. This had happened with a630_gles_asan: it included
.baremetal-test-arm64-asan (and thus .baremetal-test) after including
.a630-test, to override .baremetal-test-arm64's depednencies to use asan
artifacts instead.
Acked-by: Michel Dänzer <mdaenzer@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17386>
... and explain what they're doing, compared to the test rules in
test-source-dep.yml.
Unfortunately, we can't really pull them into test-source-dep.yml with
other source deps, because of various '&'-'*' references.
Acked-by: Michel Dänzer <mdaenzer@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17386>
We should be explicit that we are cancelling jobs once the script finds
some log messages that are linked with known issues. That means the
script preemptively retried the job without giving chances to recover.
Adds magenta color to cancelled jobs.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17389>
Mark test_full_yaml_log with this new marker to be easily run by the
developers.
Make `debian-testing` skip this test with `not slow` marker hint.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17389>
Implement a log-based retry hint for R8152 issue described in #6681,
which is based on detecting these two consecutive lines:
```
r8152 <USB> eth0: Tx status -71
nfs: server <IP> not responding, still trying
```
Where <IP> and <USB> could be any IP and USB addresses, respectfully.
This commit is a temporary fix since it requires a section-aware log
follower, implemented in !16323. When the cited MR is merged, one will
make a proper fix on top of that.
Closes: #6681
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17389>
In some jobs, such as
https://gitlab.freedesktop.org/gallo/mesa/-/jobs/24904100, the kmsg is
interleaved with stderr/stdout in serial console, making it difficult to
confidently find the log messages to detect when the DUT is booting,
when the DUT is running etc.
Luckily, LAVA sends redundant messages about their signals. We can use
them to mitigate the chance of missing an interleaved message by being
more open to different messages, using the regex on both `debug` and
`target` LAVA log levels.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16323>
There are some leftovers in the jobs logs after the result log line.
Only print until the init-stage2.sh output, to raise the chance to check
for the test script results at the first glance in the Gitlab logs.
Extra changes:
- Add `hung` status for jobs considered hanging in the Gitlab
- print them after the retry loop
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16323>
This will serve to warn the user that those messages are processed
differently, e.g. the kmsgs does not trigger heartbeats and maybe
eventual targets of hint to retry the job immediately.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16323>
Currently, the submitter consider that every new log that comes from the
DUT console is a signal that the device is healthy, but maybe that is
not the case, since in some kernel hangs/failures, no output is
presented except from some kernel messages.
This commit bypass the heartbeat when the LogFollower detect a kernel
message. Any log line that does follow the kmsg pattern will make the
job labeled as healthy again.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16323>
Now LogFollower is used to deal with the LAVA logs.
Moreover, this commit adds timeouts per Gitlab section, if a section
takes longer than expected, cancel the job and retry again.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16323>
- Create LogFollower to capture LAVA log and process it adding some
- GitlabSection and color treatment to it
- Break logs further, make new gitlab sections between testcases
- Implement LogFollower as ContextManager to deal with incomplete LAVA
jobs.
- Use template method to simplify gitlab log sections management
- Fix sections timestamps
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16323>
Gitlab has support for collapsible sections, so it would be good to
create collapsed log sections for the LAVA setup logs. This way, the
Mesa developers to see only the execution of the scripts, instead of
LAVA messages clutter.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16323>
LAVA job logs have an ongoing problem of message interleaving with kmsg.
So any kernel dumps and LAVA signals (which are being printed in kmsg)
will have a chance to clutter the pattern matching for `hwci: mesa:
(pass|fail)` line.
v2:
- Add an 1 second sleep before exiting the test script, to give enough
time to print the result message without conflicting with LAVA ENDTC
signal from kmsg
Closes: #6714
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17175>
Some LAVA jobs emit lots of messages "Listened to connection for
namespace 'common' for up to 1s" in a row at the end of the logs, making
difficult to see the result of the test script.
This commit removes those lines until a proper solution is deployed on
the LAVA side.
Closes: #6116
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Acked-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17151>
Currently, the LAVA job submitter is employing a temporary solution for
the bash escape code mangling in the LAVA jobs. Until the issue is not
fixed on the LAVA side, the submitter will replace the wrong characters
with the fixed ones.
This commit improves the regex pattern to comprehend the scenarios of
color codes with font formatting and background color information, such
as: `echo -e "\e[1;41;39mRed background with white bold text color\e[0m"`
Fixes: #5503
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Reviewed-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17046>
LAVA is mangling the escape codes from ANSI during log fetching from the
target device, making the gitlab section markers from deqp, for example,
to not work, inputting noise into the log.
This commit makes the simplest fix which is to replace the mangled
characters to the fixed ones.
This approach is error-prone, since it may unwittingly replace a genuine
log that resembles the mangled escape code. But this solution should
suffice until we get a proper fix from LAVA team itself.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16520>
LAVA is mangling the escape codes from ANSI during log fetching from the
target device, making the colored lines from deqp, for example, to not
work, inputting noise into the log.
This commit makes the most straightforward fix which is to replace the
mangled characters to the fixed ones.
This approach is error-prone since it may unwittingly replace a genuine
log that resembles the mangled escape code. But this solution should
suffice until we get a proper fix from LAVA developers itself.
Fixes: #5503
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16520>
Currently, the LAVA job submitter fetches the job results from the LAVA
XMLRPC call, but that is not necessary, as the job result is easily
found in the logs. E.g. the bare-metal and poe jobs uses that log to set
the final job status of their runs.
Another reason for the change is that the LAVA signals are not reliable
in some devices with one serial port, causing some troubles in a618
recently. So, if one signal fails to be sent/received, the job will
ultimately fail even when the hwci script has been successful.
Fixes: #6435
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16425>
Rarely the jobs.logs RPC call can return corrupted data, such as
mal-formed YAML data. As this is expected and very rare to occur, let's
retry this RPC call several times to give it a chance to fix itself.
Retrying would not swallow the log lines since we keep track of how many
log lines each job has.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15938>
Move exceptions to its own file.
Create MesaCITimeoutError and MesaCIRetryError with specific exception
data for better exception classification.
Avoid the use of `fatal_err` in favor of raising a proper exception.
Make _call_proxy exception handling exhaustive, add missing
ResponseError treatment.
Also, detect JobError during job result parsing. So when a LAVA timeout error
happens, it is probably cause by some boot/network issues with a
specific device, we can retry the same job in other device with the same
device_type.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15938>
During development, we may want to test lava_job_submitter.py locally,
sometimes one can find what one is been looking for before the LAV job
is done. Let's respond to SIGINT signal and cancel the LAVA job, as we
can't follow what is happening anymore via script.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15938>
A normal boot on LAVA device would take less than 1 minute. Sometimes
the depthcharge freezes and LAVA waits until the current timeout (25
minutes) to kick in and fail the job.
With this lower timeout value, the job could fail faster and eventually
LAVA will be able to retry it.
Furthermore, set a default timeout for all actions to fix an undesired
behavior with some LAVA job subactions, such as depthcharge-action.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15938>