HabitatSampler, issue #59 (Closed)
Issue created Mar 16, 2022 by Romulo Pereira Goncalves (@romulo, Owner)

set.seed() is always called right before sample(), reducing the sampling randomness

The user can change the parameter seed to get different sampling values. HaSa uses the parameter seed to set a new seed in the following 2 code blocks:

model_opt.r#L172

    if (abs(di[1] - di[2]) > min(di) * 0.3) {
      if (which.min(di) == 2) {
        set.seed(seed)
        d3 <- sample(which(pbtn1@data$nam == 1), di[1] - di[2], replace = F)
        pbtn1 <- pbtn1[-d3, ]
        test1 <- test1[-d3, ]
      } else {
        set.seed(seed)
        d4 <- sample(which(pbtn1@data$nam == 2), di[2] - di[1], replace = F)
        pbtn1 <- pbtn1[-d4, ]
        test1 <- test1[-d4, ]
      }
    }

model_opt.r#L187

    if (sum(pbtn1@data$nam == 1) > max_samples_per_class) {
      set.seed(seed)
      dr <-
        sample(which(pbtn1@data$nam == 1),
               sum(pbtn1@data$nam == 1) - max_samples_per_class,
               replace = F)
      pbtn1 <- pbtn1[-dr, ]
      test1 <- test1[-dr, ]
    }
    if (sum(pbtn1@data$nam == 2) > max_samples_per_class) {
      set.seed(seed)
      dr <-
        sample(
          which(pbtn1@data$nam == 2),
          sum(pbtn1@data$nam == 2) - max_samples_per_class,
          replace = F
        )
      pbtn1 <- pbtn1[-dr, ]
      test1 <- test1[-dr, ]
    }

Because set.seed(seed) is called immediately before each sample(), the random number generator is reset every time, so each of these sample() calls returns the same values for the same inputs. Is this desired? Why is the seed set so many times? It might help to reproduce results, but it reduces the randomness of sample(), and thus we might get stuck in a local optimum.
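The effect can be reproduced outside HaSa with a small sketch. The vector `candidates` and the helper `draw_reseeded()` are hypothetical stand-ins for `which(pbtn1@data$nam == 1)` and the blocks above; they are not part of the package.

```r
seed <- 42
# Hypothetical stand-in for which(pbtn1@data$nam == 1)
candidates <- 1:1000

# Pattern from the blocks above: reseed right before every draw.
draw_reseeded <- function() {
  set.seed(seed)
  sample(candidates, 20, replace = FALSE)
}
a <- draw_reseeded()
b <- draw_reseeded()
same_when_reseeded <- identical(a, b)  # the generator restarts, so a and b match

# Alternative: seed once, then draw repeatedly from the same stream.
set.seed(seed)
c1 <- sample(candidates, 20, replace = FALSE)
c2 <- sample(candidates, 20, replace = FALSE)
same_when_seeded_once <- identical(c1, c2)  # draws continue the stream and are expected to differ

print(c(reseeded = same_when_reseeded, seeded_once = same_when_seeded_once))
```

Seeding once still keeps a run reproducible: repeating the whole sequence with the same seed yields the same `c1` and `c2`.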

We think set.seed() should be called only once, in outer_procedure.r. To obtain different random values on each run, the user could set seed to as.numeric(Sys.time()), so that it is always different.
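A minimal sketch of that proposal, assuming a simplified structure: `run_outer()` and `drop_excess()` are hypothetical names for illustration, not the actual functions in outer_procedure.r or model_opt.r.

```r
# Hypothetical inner step: draws indices to drop and never touches the seed.
drop_excess <- function(idx, n_drop) {
  sample(idx, n_drop, replace = FALSE)
}

# Hypothetical outer procedure: the only place where set.seed() is called.
run_outer <- function(seed = NULL, n_iter = 3) {
  # With no seed given, fall back to the clock so every run differs.
  if (is.null(seed)) seed <- as.numeric(Sys.time())
  set.seed(seed)
  lapply(seq_len(n_iter), function(i) drop_excess(1:1000, 20))
}

r1 <- run_outer(seed = 7)
r2 <- run_outer(seed = 7)
reproducible <- identical(r1, r2)               # same seed reproduces the whole run
varies_within_run <- !identical(r1[[1]], r1[[2]])  # successive draws are expected to differ
r3 <- run_outer()                               # clock-based seed
```

With this layout a fixed seed still reproduces a full run, while successive iterations within the run draw different samples.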

@carstenn what do you think?
