forked from k8snetworkplumbingwg/sriov-network-operator
-
Notifications
You must be signed in to change notification settings - Fork 0
Rebase with upstream master #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
clarkzinzow
wants to merge
239
commits into
master
Choose a base branch
from
clark/rebase-upstream
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
PrometheusRules allow recording pre-defined queries. Use `sriov_kubepoddevice` metric to add `pod|namespace` pair to the sriov metrics. Feature is enabled via the `METRICS_EXPORTER_PROMETHEUS_DEPLOY_RULE` environment variable. Signed-off-by: Andrea Panattoni <[email protected]>
When the `metricsExporter` feature is turned off, deployed resources should be removed. These changes fix the error: ``` │ 2024-08-28T14:07:57.699760017Z ERROR controller/controller.go:266 Reconciler error {"controller": "sriovoperatorconfig", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovOperatorConfig", "SriovOperatorConfig": {"name":"default","namespace":"openshift-sriov-network-operator"}, │ │ "namespace": "openshift-sriov-network-operator", "name": "default", "reconcileID": "fa841c50-dbb8-4c4c-9ddd-b98624fd2a24", "error": "failed to delete object &{map[apiVersion:monitoring.coreos.com/v1 kind:ServiceMonitor metadata:map[name:sriov-network-metrics-exporter namespace:openshift-sriov-network-operator] │ │ spec:map[endpoints:[map[bearerTokenFile:/var/run/secrets/kubernetes.io/serviceaccount/token honorLabels:true interval:30s port:sriov-network-metrics scheme:https tlsConfig:map[caFile:/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt insecureSkipVerify:false serverName:sriov-network-metrics-expor │ │ ter-service.openshift-sriov-network-operator.svc]]] namespaceSelector:map[matchNames:[openshift-sriov-network-operator]] selector:map[matchLabels:map[name:sriov-network-metrics-exporter-service]]]]} with err: could not delete object (monitoring.coreos.com/v1, Kind=ServiceMonitor) openshift-sriov-network-operato │ │ r/sriov-network-metrics-exporter: servicemonitors.monitoring.coreos.com \"sriov-network-metrics-exporter\" is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot delete resource \"servicemonitors\" in API group \"monitoring.coreos.com\" in the namespace \"ope │ │ nshift-sriov-network-operator\""} ``` Signed-off-by: Andrea Panattoni <[email protected]>
…er-rules [metrics 4/x] Metrics exporter rules
…devices Refactor some conformance tests to use `SRIOV_NODE_AND_DEVICE_NAME_FILTER`
if the current obj as annotation and the updated doesn't we still want to add the ones from the current object Signed-off-by: Sebastian Sch <[email protected]>
When a user deletes the default SriovOperatorConfig resource and tries to recreate it afterwards, the operator webhooks returns the error: ``` Error from server (InternalError): error when creating "/tmp/opconfig.yml": Internal error occurred: failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/validating-custom-resource?timeout=10s": service "operator-webhook-service" not found ``` as the webhook configuration is still present, while the Service and the DaemonSet has been deleted. Delete all the webhook configurations when the user deletes the default SriovOperatorConfig Signed-off-by: Andrea Panattoni <[email protected]>
Fix merge annotation function
Signed-off-by: Andrea Panattoni <[email protected]>
The bash syntax was incorrect and yielded: hack/env.sh: line 35: ${$RDMA_CNI_IMAGE:-}: bad substitution
…IMAGE Fix syntax for RDMA_CNI_IMAGE var substitution
It might happen that two SR-IOV pods, deployed on different node, are using devices with the same PCI address. In such cases, the query suggested [1] by the sriov-network-metrics-exporter produces the error: ``` Error loading values found duplicate series for the match group {pciAddr="0000:3b:02.4"} on the right hand-side of the operation: [ { __name__="sriov_kubepoddevice", container="test", dev_type="openshift.io/intelnetdevice", endpoint="sriov-network-metrics", instance="10.1.98.60:9110", job="sriov-network-metrics-exporter-service", namespace="cnf-4916", pciAddr="0000:3b:02.4", pod="pod-cnfdr22.telco5g.eng.rdu2.redhat.com", prometheus="openshift-monitoring/k8s", service="sriov-network-metrics-exporter-service" }, { __name__="sriov_kubepoddevice", container="test", dev_type="openshift.io/intelnetdevice", endpoint="sriov-network-metrics", instance="10.1.98.230:9110", job="sriov-network-metrics-exporter-service", namespace="cnf-4916", pciAddr="0000:3b:02.4", pod="pod-dhcp-98-230.telco5g.eng.rdu2.redhat.com", prometheus="openshift-monitoring/k8s", service="sriov-network-metrics-exporter-service" } ];many-to-many matching not allowed: matching labels must be unique on one side ``` Configure the ServiceMonitor resource to add a `node` label to all metrics. The right query to get metrics, as updated in the PrometheusRule, will be: ``` sriov_vf_tx_packets * on (pciAddr,node) group_left(pod,namespace,dev_type) sriov_kubepoddevice ``` Also remove `pod`, `namespace` and `container` label from the `sriov_vf_*` metrics, as they were wrongly set to `sriov-network-metrics-exporter-zj2n9`, `openshift-sriov-network-operator`, `kube-rbac-proxy` [1] https://github.com/k8snetworkplumbingwg/sriov-network-metrics-exporter/blob/0f6a784f377ede87b95f31e569116ceb9775b5b9/README.md?plain=1#L38 Signed-off-by: Andrea Panattoni <[email protected]>
When we want to use config-drive in immutable systems, very often the config-drive is only used at boot and then umounted (e.g. ignition does this). Later when we want to fetch Metadata from the config drive, we actually have to mount it. In this PR, I'm adding similar code than coreos/ignition where we dynamically mount the config-drive is the device was found with the right label (config-2 or CONFIG-2 as documented in OpenStack). If the device is found, we mount it, fetch the data and umount it.
Fixes the following shellcheck error: SC2068 (error): Double quote array expansions to avoid re-splitting elements. https://www.shellcheck.net/wiki/SC2068
Fixes the following shellcheck error: SC2148 (error): Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive. https://www.shellcheck.net/wiki/SC2148
Fixes the following shellcheck errors: SC2145 (error): Argument mixes string and array. Use * or separate argument. SC2199 (error): Arrays implicitly concatenate in [[ ]]. Use a loop (or explicit * instead of @). https://www.shellcheck.net/wiki/SC2145 https://www.shellcheck.net/wiki/SC2199 Also fixes a typo in SUPPORTED_INTERFACE_SWITCHER_MODES.
Fixes the following shellcheck error: SC2045 (error): Iterating over ls output is fragile. Use globs. https://www.shellcheck.net/wiki/SC2045
On some kernels GetDevlinkDeviceParam may return empty values for some kernel parameters. The netlink library is able to handle this, but the code in GetDevlinkDeviceParam function may panic if unexpected value received. Add extra checks to avoid panics
Delete webhooks when SriovOperatorConfig is deleted
…getdevlinkdeviceparam Fix: GetDevlinkDeviceParam to handle edge-cases correctly
…er-drop-labels [metrics 5/x] Add node label to sriov_* metrics
`sriov_kubepoddevice` metric might end up in the Prometheus database after a while, as the default scrape interval is 30s. This leads to failures in the end-to-end lane like: ``` [sriov] Metrics Exporter When Prometheus operator is available [It] Metrics should have the correct labels /root/opr-ocp2-1/data/sriov-network-operator/sriov-network-operator/test/conformance/tests/test_exporter_metrics.go:132 [FAILED] no value for metric sriov_kubepoddevice ``` Put the metric assertion in an `Eventually` statement Signed-off-by: Andrea Panattoni <[email protected]>
Signed-off-by: Sebastian Sch <[email protected]>
Fixes the following shellcheck error: SC2081 (error): [ .. ] can't match globs. Use a case statement. https://www.shellcheck.net/wiki/SC2081
Warns about shellcheck issues with severity `error`.
…-fix metrics: Fix `Metrics should have the correct labels` test
CI: Add a bash linter to pre-submits
openstack: dynamically mount the config-drive
When the operator changes the device-plugin Spec (e.g. .Spec.NodeSelector), it may happen that there are two device plugin pods for a given node, one that is terminating, the other that is initializing. If the config-daemon executes `restartDevicePluginPod()` at the same time, it may kill the terminating pod, while the initializing one will run with the old dp configuration. This may cause one or more resources to not being advertised, until a manual device plugin restart occurs. Make the config-daemon restart all the device-plugin instances it founds for its own node. Signed-off-by: Andrea Panattoni <[email protected]>
…r_election_test remove leader election functional test
Add Intel E830 Network Adapters
`NetworkPolicy.networking.k8s.io` objects [1] allow defining which network traffic can reach a pod or can go leave a pod, in a kubernetes cluster. In order to reduce the surface attack all the operands that are not hostNetworked, should be covered by NetworkPolicies, explicitly setting the connections they require to work. [1] https://kubernetes.io/docs/concepts/services-networking/network-policies/ Signed-off-by: Andrea Panattoni <[email protected]>
e2e: Filter devices in `FindOneMellanoxSriovDevice`
Folder `/config` contains the information to build an operator-sdk bundle, which logic has been removed by k8snetworkplumbingwg#502. They progressively diverged from the openshift/sriov-network-operator fork, and the files in these repository not being used became obsolete, with less information or wrong data at all. As a first step to clean, remove all the unused files in the folder. `config/crd/bases` is still a good source of truth, as it is used by the `manifests` Makefile rule to update the Helm charts. In the future, if we want to support OLM deployment in this repository, we can renovate the configuration starting from the `openshift` fork, maybe adjusting the CI to validate its content at every PR. Refs - k8snetworkplumbingwg#502 Signed-off-by: Andrea Panattoni <[email protected]>
we are using --skip-driver option with mlxfwreset to speed up the process. However, if there are still VFs created for the PF, the command will fail. Let's reset the VFs to mitigate the error. Signed-off-by: Alexander Maslennikov <[email protected]>
…etwork-policies security: Add `NetworkPolicies` to operands
Signed-off-by: Fred Rolland <[email protected]>
fixing: ``` [FAILED] Unexpected error: <*errors.StatusError | 0xc000788780>: Operation cannot be fulfilled on nodes "worker-0": the object has been modified; please apply your changes to the latest version and try again { ErrStatus: { TypeMeta: {Kind: "", APIVersion: ""}, ListMeta: { SelfLink: "", ResourceVersion: "", Continue: "", RemainingItemCount: nil, }, Status: "Failure", Message: "Operation cannot be fulfilled on nodes \"worker-0\": the object has been modified; please apply your changes to the latest version and try again", Reason: "Conflict", Details: {Name: "worker-0", Group: "", Kind: "nodes", UID: "", Causes: nil, RetryAfterSeconds: 0}, Code: 409, }, } ``` Signed-off-by: Sebastian Sch <[email protected]>
…eset-fix fix: set SR-IOV NUM_VFS to 0 for mellanox NICs before fw reset
…resource-condition e2e: enable `resourceInjectorMatchCondition` featureGate
Bump golang.org/x/oauth2 from 0.25.0 to 0.27.0
Signed-off-by: Sebastian Sch <[email protected]>
…g-folder Remove unused files from `/config`
Add a retry to the func-test
This allow us to select if we want to use a stable OCP release or CI for example Signed-off-by: Sebastian Sch <[email protected]>
The following PR remove the kustomize k8snetworkplumbingwg#913 Signed-off-by: Sebastian Sch <[email protected]>
Allow to configure OCP version type
add retry to update node label
SriovNetwork,IbSriovNetwork,OVSNetwork object (XNetwork resources from now on) must be created in the operator's namespace and the `.Spec.NetworkNamespace` field defines where the controller should create the NetworkAttachmentDefinition resource. This constraint can be a problem in clusters where applications are managed by non cluster administrators. These changes makes the `genericNetworkReconciler` to accept XNetwork resources in namespaces different than the operator's one. In such cases, the field `.Spec.NetworkNamespace` must be empty, and a validating webhook ensure this constraint. Signed-off-by: Andrea Panattoni <[email protected]>
Signed-off-by: Andrea Panattoni <[email protected]>
to avoid panic: ``` panic: /tmp/go-build1586768671/b001/exe/webhook flag redefined: alsologtostderr goroutine 1 [running]: flag.(*FlagSet).Var(0xc0000781c0, {0x3daf2d0, 0x5566969}, {0x394a710, 0xf}, {0x39e87ee, 0x49}) /usr/lib/golang/src/flag/flag.go:1029 +0x3b9 k8s.io/klog/v2.InitFlags.func1(0xc000234990?) /home/apanatto/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:447 +0x31 flag.(*FlagSet).VisitAll(0xc0003f4c40?, 0xc000785de0) /usr/lib/golang/src/flag/flag.go:458 +0x42 k8s.io/klog/v2.InitFlags(0x3dc1aa0?) /home/apanatto/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:446 +0x3c main.init.0() /home/apanatto/dev/github.com/k8snetworkplumbingwg/sriov-network-operator/cmd/webhook/main.go:27 +0x15 exit status 2 ``` Signed-off-by: Andrea Panattoni <[email protected]>
Move `[sriov] operator No SriovNetworkNodePolicy ...` test cases to its own test file. Implement test case to verify controller and webhook logic Signed-off-by: Andrea Panattoni <[email protected]>
…dabot Add dependabot config
…networks Support namespaced {Sriov,SriovIB,OVS}Networks
Thanks for your PR,
To skip the vendors CIs, Maintainers can use one of:
|
Upgrade golangci-lint to work with Go 1.23 Add platform build arg. Comment out Mellanox plugin's draining + rebooting for totalVfs + SRIOV_EN configs, which is buggy. Scan GUIDs (#7) * Squash commits into one * Squash commits into one * Cherry-picked types (build fix) * merge issues fix * read GUID from sysfs * node -> port * args mismatch :( * NAD config fix to include guid * port -> node back * rollback partially * rollback partially * rollback partially * rollback partially * rollback partially * rollback partially * rollback partially * fix * bring back pKey to netAttDef definition * pkey proper location * removed excessive log lines * quotes fix ENG-21048 - KernelArgIommuOn instead of KernelArgIommuPt (to enable ATS & ACS) (#9) KernelArgIommuOn instead of KernelArgIommuPt ENG-19808 mlxfwreset before reboot in SR-IOV operator (#8) * bring reboots back * bring reboots back GUIDSavedInUFM config parameter added camelcase -> snake case
7bd4007
to
963fed6
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.