Bug 641 :Fix for Nic flutation and TX hangs.

Review Request #241 — Created May 7, 2024 and updated

satyendra
APV10
apv_10_7_0
641
prajesh, roland, tanya, timlai, wli

As the packets are processed in dpdk as burst , if the packet queued in TX ring is more than 40 , It will go in
TX hang state as the H/W descriptor has MAX 40 TX desc limit .

When TX ring goes in hang state , The dpdk try to reset it , and NIC should recovered after reset .
But due to dpdk has multiple threads and if NIC reset happened same time with different threads ,
It trigger race condition and reset has failed , Initiating the reboot .

Unit testing and regression testing is done

Description From Last Updated

Can you add details on what unit testing has been done?

prajeshprajesh

Change the target branch to 10.7

prajeshprajesh

from where this is getting called?

prajeshprajesh

code can be cleared up insted of cluterring it up with #ifdef 0.

prajeshprajesh

Where is this function defined. I dont see it in 10.4.x codebased. May be I am missing something.

prajeshprajesh

state should be decouple from clear queues functionality. satate can be set in the caller. This is breaking SRP principle.

prajeshprajesh

My point is clearing the queue and updating the state should not be coupled together. The function clear queue gives …

prajeshprajesh

same as above comment

prajeshprajesh
prajesh
  1. 
      
    1. I missed to put correct routine name of vf reset check , will upload changes again.
      Also did changes in pf driver in AVX to send correct state for rx and tx queue before sending link status up to VF driver .

    1. this is called from i40evf_init_vf and i40evf_dev_close , and these routines will come into pic at boot time or when port flaps.

  2. state should be decouple from clear queues functionality. satate can be set in the caller. This is breaking SRP principle.

    1. in the I40E vf driver , The state of rx and tx queue is always set after i40e_reset_rx_queue/i40e_reset_tx_queue , marking the state of queue . I am not sure about SRP .
      for the indidival VF reset , this is already in place . I put the check if the full device goes through reset

    2. Check this - https://en.wikipedia.org/wiki/Single-responsibility_principle

    1. in the I40E vf driver , The state of rx and tx queue is always set after i40e_reset_rx_queue/i40e_reset_tx_queue , marking the state of queue . I am not sure about SRP .
      for the indidival VF reset , this is already in place . I put the check if the full device goes through reset

  3. 
      
prajesh
  1. 
      
  2. My point is clearing the queue and updating the state should not be coupled together. The function clear queue gives the impression that it is only responsible for clearing queues and not updating the state. But it in our case, it does both which is overloading two different functionalities into one. Ideally, We should keep thenm separate.

    1. i40e_dev_tx_queue_stop(struct rte_eth_dev dev, uint16_t tx_queue_id)
      {
      struct i40e_tx_queue
      txq;
      int err;
      struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private);

          if (tx_queue_id < dev->data->nb_tx_queues) {
                  txq = dev->data->tx_queues[tx_queue_id];
      
                  /*
                  * tx_queue_id is queue id aplication refers to, while
                  * txq->reg_idx is the real queue index.
                  */
                  err = i40e_switch_tx_queue(hw, txq->reg_idx, FALSE);
      
                  if (err) {
                          PMD_DRV_LOG(ERR, "Failed to switch TX queue %u of",
                                      tx_queue_id);
                          return err;
                  }
      
                  i40e_tx_queue_release_mbufs(txq);
                  i40e_reset_tx_queue(txq);
                  dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;  <<.  i added here the same thing 
          }
      
          return 0;
      

      }

      void
      i40e_dev_clear_queues(struct rte_eth_dev *dev)
      {
      uint16_t i;

          PMD_INIT_FUNC_TRACE();
      
          for (i = 0; i < dev->data->nb_tx_queues; i++) {
                  if (!dev->data->tx_queues[i])
                          continue;
                  i40e_tx_queue_release_mbufs(dev->data->tx_queues[i]);
                  i40e_reset_tx_queue(dev->data->tx_queues[i]);
                  dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;  // in this routine , state is set . as per the desing /code , it is always setting the state
                                                                               // if you go through this file  , it can also be seen , that when the driver start the rx and tx queue ,after starting the queue  It will again set queue state .
      

      // I put the same check when when all VF are reset , state marking can be done inside routine
      or after routing .
      if you think it is not good to mark state , Then i will remove it .

      }

          for (i = 0; i < dev->data->nb_rx_queues; i++) {
                  if (!dev->data->rx_queues[i])
                          continue;
                  i40e_rx_queue_release_mbufs(dev->data->rx_queues[i]);
                  i40e_reset_rx_queue(dev->data->rx_queues[i]);
                  dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
          }
      

      }

      i40e_driver_clear_queues is clearing all the queues and i40e_dev_tx_queue_stop is used to clear one specific queue .

    2. i40e_dev_tx_queue_stop(struct rte_eth_dev dev, uint16_t tx_queue_id) { struct i40e_tx_queue txq;
      int err;
      struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private);

      if (tx_queue_id < dev->data->nb_tx_queues) {
              txq = dev->data->tx_queues[tx_queue_id];
      
              /*
              * tx_queue_id is queue id aplication refers to, while
              * txq->reg_idx is the real queue index.
              */
              err = i40e_switch_tx_queue(hw, txq->reg_idx, FALSE);
      
              if (err) {
                      PMD_DRV_LOG(ERR, "Failed to switch TX queue %u of",
                                  tx_queue_id);
                      return err;
              }
      
              i40e_tx_queue_release_mbufs(txq);
              i40e_reset_tx_queue(txq);
              dev->data->tx_queue_state[tx_queue_id] = RTE_ETH_QUEUE_STATE_STOPPED;
      
              // in this routine , state is set . as per the desing /code , it is always setting the state
                                                                           // if you go through this file  , it can also be seen , that when the driver start the rx and tx queue ,after starting the queue  It will again set queue state .
      

      // I put the same check when when all VF are reset , state marking can be done inside routine
      or after routing .
      if you think it is not good to mark state , Then i will remove it .

      }
      
      return 0;
      

      }

      void
      i40e_dev_clear_queues(struct rte_eth_dev *dev)
      {
      uint16_t i;

      PMD_INIT_FUNC_TRACE();
      
      for (i = 0; i < dev->data->nb_tx_queues; i++) {
              if (!dev->data->tx_queues[i])
                      continue;
              i40e_tx_queue_release_mbufs(dev->data->tx_queues[i]);
              i40e_reset_tx_queue(dev->data->tx_queues[i]);
              dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;    << i add here
      

      }

      for (i = 0; i < dev->data->nb_rx_queues; i++) {
              if (!dev->data->rx_queues[i])
                      continue;
              i40e_rx_queue_release_mbufs(dev->data->rx_queues[i]);
              i40e_reset_rx_queue(dev->data->rx_queues[i]);
              dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;    << i add here 
      }
      

      }

      i40e_driver_clear_queues is clearing all the queues and i40e_dev_tx_queue_stop is used to clear one specific queue .

  3. 
      
prajesh
  1. 
      
  2. code can be cleared up insted of cluterring it up with #ifdef 0.

  3. Where is this function defined. I dont see it in 10.4.x codebased. May be I am missing something.

  4. 
      
prajesh
  1. 
      
  2. Can you add details on what unit testing has been done?

  3. 
      
prajesh
Loading...