Hi, I follow the procedure already posted in the Git documentation when I want to present the package to other users:
library(remotes)
remotes::install_git("https://git.gfz-potsdam.de/habitat-sampler/HabitatSampler.git",
                     ref = "master", subdir = "R-package/hasa", dependencies = NA,
                     upgrade = FALSE, build = TRUE, build_manual = TRUE,
                     build_vignettes = TRUE)
I think we should communicate a working stepwise procedure to install HaSa for any kind of external user. Some problems always occur, particularly for users in public administration: 1) admin rights required for certain packages, 2) operating system conflicts, particularly Linux/Mac dependencies, 3) third-party software requirements, e.g. git or Rtools on Windows. For me personally, I just update the package from source via install.packages("", type = "source") when I change some code in my local version.
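As a hedged sketch of that local source-package update (the tarball path below is just a placeholder, not an actual path from this project):

# install a locally built source tarball; the path is a placeholder
install.packages("path/to/HaSa_x.y.z.tar.gz",
                 repos = NULL, type = "source")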
Perfect...I think a light HaSa package installation would really help to keep test users who want to get to know the tool before coming to MiSa.C ;)
Two new bugs due to the R version change:
The reason is that the R package installation is starting to get overloaded. There are too many dependencies, which make the installation process slow and increase the potential for conflicts. For example, under Linux operating systems all the R dependencies require specific Linux libraries to be compiled. I got reports that users stop installing the HaSa package under Linux because there are too many Linux compile processes required. Even on Windows it takes 5-10 minutes to install HaSa, which is unusual for R packages.
I have the feeling that many of the integrated R dependencies are used only by MiSa.C, or am I wrong? So my question: would it make sense at a certain point to stop adding more and more dependencies to the R package that are not really needed for executing the algorithm itself, and to provide a minimalistic stable version of the R package that is then only updated if the algorithm itself changes?
I have the feeling that the R-package is already unnecessarily big with regard to dependencies!
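To make the dependency load visible, something like the following could be used (a hedged sketch; the package name "HaSa" is an assumption here and should be replaced by the name the package is actually installed under):

# count everything the installed package pulls in via Depends/Imports
db <- installed.packages()
deps <- tools::package_dependencies("HaSa", db = db,
                                    which = c("Depends", "Imports"),
                                    recursive = TRUE)
length(deps[["HaSa"]])  # number of packages needed just to use the package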
Hi...it is the first term of the predictive distance. So according to the formula it is [1 - Hex/Hn]. See Supplementary material S-1C for the pseudo-code or section 2.1.2 in the paper for the explanation. The order for which.max shouldn't play any role, since both habitats then show the same model performance in that step. If you want a more detailed explanation in my own words, we can have a talk on that? Greetings
I did not cover this issue in the optimization branch. Was there any test on that issue? My point here was that we could increase processing speed if we always change the datatype from float to integer, since most satellite images represent radiometry in at most 16 bit. It can make a huge difference in processing time; however, I have not implemented a test on that yet. Greetings
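A hedged sketch of what I mean, using the raster package (toy values; the scaling factor is just an example and, as said, this still needs a proper test):

library(raster)

r_float <- raster(matrix(runif(100), 10, 10))   # toy reflectance raster stored as float
r_int <- round(r_float * 10000)                 # scale and round to integer values

# write with an explicit 16-bit datatype so the file stays integer on disk
writeRaster(r_int, filename = "r_int16.tif", datatype = "INT2U", overwrite = TRUE)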
perfect!
Hmmm...as long as we then save seed2 in each step, we could even repeat each step separately without always starting at step 1, right?
Ok cool, if we can ensure that the series does not get broken, sure, your suggestion is fine! Also, about 1.: I was not sure whether seed is an integer and seed2 is the vector. As long as we can guarantee repeatability of the results, your suggestion of putting one seed at the beginning, also using system.time, is absolutely fine!
Hi,
seed must be set at each random component to get reproducible results. However, seed in this case is not one value...it is a vector with n_elements = n_models, hence it is new for each model. It is defined somewhere at the beginning of the code.
Also, seed in the case of max_samples_per_class and abs(di[1] - di[2]) > min(di) * 0.3 (sample balance), as in your two examples, is applied to different input in each iteration. So there are always new values that are sampled, since each iteration delivers new points.
Regarding these two points I do not see how randomness is reduced. Anyway, my idea was to use the seed at each random component to generate repeatability. Maybe my thinking lacks a certain logic? Do you think one seed at the beginning is enough?
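A minimal sketch of what I mean (variable names are illustrative, not the actual HaSa code): one seed per model stored in a vector, so every random component, and hence every step, can be repeated on its own:

n_models <- 5
seed2 <- sample.int(.Machine$integer.max, n_models)  # one seed per model, saved once

results <- vector("list", n_models)
for (i in seq_len(n_models)) {
  set.seed(seed2[i])          # re-seeding here makes model i repeatable on its own
  results[[i]] <- rnorm(10)   # placeholder for the per-model random component
}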
Hi, sampleRegular always uses the same points for the same sample_size. sampleRandom needs a seed for reproducing the same points ;) By the way, do we need to keep the sample size and n_models for reproducing results? Is this solved? Sorry, I cannot remember.
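A small sketch of the difference on a toy raster (raster package):

library(raster)
r <- raster(matrix(runif(100), 10, 10))   # toy raster

# sampleRegular is deterministic for a given size
identical(sampleRegular(r, size = 20), sampleRegular(r, size = 20))   # TRUE

# sampleRandom only reproduces the same points when a seed is set
set.seed(42)
p1 <- sampleRandom(r, size = 20)
set.seed(42)
p2 <- sampleRandom(r, size = 20)
identical(p1, p2)   # TRUE, only because of set.seed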
Very good information, thank you for pointing that out! I honestly never had this possible effect in mind, since I always worked with ntree > 500. Thanks again for the detailed work!
Yes, one can imagine: the more trees are used, the lower the probability that a majority vote gets tied at the end ;) A good point to keep in mind when reducing the ntree argument for speed optimization!
Ahhh...thank you. It is exactly the random component we were looking for! Using an odd number for ntree gives identical predictions, resulting in sum(abs(pred2[!is.na(pred2)]-pred3[!is.na(pred3)])) = 0
Using an even number for ntree results in sum(abs(pred2[!is.na(pred2)]-pred3[!is.na(pred3)])) ~ 3900, meaning we have a variation in prediction in 1.5-2 % of pixels.
The reason is actually here again: "Any ties are broken at random, so if this is undesirable, avoid it by using odd number ntree in randomForest()"
In general, a maximum of 2 % variation in prediction does not really change the overall spatial structure of the predicted rasters, since the pixels with different predictions are randomly distributed over the whole test site. For performance tests, ntree should be kept odd.
Hope this analysis helps.
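For reference, the effect can be reproduced on toy data (a hedged sketch, not the HaSa models; two-class case like ours):

library(randomForest)

set.seed(1)
train <- data.frame(x1 = runif(200), x2 = runif(200))
train$y <- factor(ifelse(train$x1 + train$x2 + rnorm(200, sd = 0.3) > 1, "A", "B"))
test <- data.frame(x1 = runif(500), x2 = runif(500))

# even ntree: tied majority votes are broken at random, so repeated
# predictions on the same data can differ
rf_even <- randomForest(y ~ ., data = train, ntree = 10)
sum(predict(rf_even, test) != predict(rf_even, test))   # typically > 0

# odd ntree: no ties possible with two classes, predictions are stable
rf_odd <- randomForest(y ~ ., data = train, ntree = 11)
sum(predict(rf_odd, test) != predict(rf_odd, test))     # 0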
I tested a lot and there is definitely still a random effect in the randomForest prediction. Since I could not explain it, I checked the models you sent to me. They all have the argument ntree = 10. This for sure does not work, so please send me a model with ntree = 9 or 11 to enable me to run a valid test. I do not know where you got the numbers 499, 500 and 501 from; the models I got are based on 10 ;)
Yes, that would be nice...We need to test with a different number of trees, e.g. 499 or 501. Can you send me a new model to compare with the previous one?
There is one random component in the predict function using RF: "Any ties are broken at random, so if this is undesirable, avoid it by using odd number ntree in randomForest()"
Can we check the ntree setting to test this random component? Could you provide two datasets, one with an odd and one with an even ntree model fit? We can then simply check consistency using:
sum(abs(pred2[!is.na(pred2)]-pred3[!is.na(pred3)])) which should give 0 in the odd case.
There shouldn't be any differences! I have never heard of such behavior. I would first suggest looking at the "votes" that are used for the trees and then reading whether different predict functions use different thresholds. Usually this is just a majority vote, and if one uses the same trees, the prediction consequently delivers the same output. I do not see how this can be confused in prediction.
By the way, I see that you use different input rasters for your predictions: ma_rast and rast. I guess this is the explanation. I can check if you provide me with models[[1]], ma_rast and rast.
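A hedged sketch of how the votes could be inspected (toy model here; with the real objects one would use models[[1]] and the corresponding raster values instead):

library(randomForest)

set.seed(1)
d <- data.frame(x = runif(100), y = factor(sample(c("A", "B"), 100, replace = TRUE)))
rf <- randomForest(y ~ x, data = d, ntree = 11)

head(rf$votes)                       # OOB vote fractions per training sample
head(predict(rf, d, type = "vote"))  # vote fractions for new data
# samples where the classes get (near-)equal votes are the ones affected
# by random tie-breaking in the final majority vote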
For the first steps (1-2, depending on how many pixels are extracted from the image) I would recommend using regular sampling.
Then switch to random sampling if the number of remaining pixels decreases to, let's say, < 50 %. You can then easily see what number of models and samples is needed for good results by checking on:
a) class accuracy (this is the predictive distance that is computed and printed): if this value is, let's say, < 0.75, increase nr_samples; between 80-150 should be enough for most applications
b) the number of selected models: if this number is, let's say, < 8, the user should increase nr_models until we get a stable number of selected models between 10-30. More than 30 is not really necessary (it just increases the bit depth of the probability map) and only increases time for prediction. So in general, to reach 10-30 selected models the user should vary nr_models between 150-300; sometimes 400-500 are good.
However, all this advice crucially depends on the use case and the data used. For example, we may reach a sufficient number of selected models with nr_models < 100 and high accuracies with nr_samples < 100. My experience is that you need to check on your data in the first steps, and then you will see in which range your parameters usually lie to achieve good results.
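If it helps, the rule of thumb from a) and b) could be written down as a small helper like this (a rough sketch; the function and argument names are illustrative and not part of the HaSa API):

# suggest adjusted sampling parameters from the last step's diagnostics
suggest_params <- function(class_accuracy, n_selected_models,
                           nr_samples, nr_models) {
  if (class_accuracy < 0.75) {
    nr_samples <- max(nr_samples, 80)            # 80-150 is usually enough
  }
  if (n_selected_models < 8) {
    nr_models <- min(max(nr_models, 150), 500)   # aim for 10-30 selected models
  }
  list(nr_samples = nr_samples, nr_models = nr_models)
}

suggest_params(class_accuracy = 0.70, n_selected_models = 5,
               nr_samples = 50, nr_models = 100)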
I hope this helps.
P.S. my advice is based on working with Landsat and Sentinel-2 time series for habitat mapping on former military training grounds. It could be a good idea to also collect experiences from other colleagues using different datasets to present case-specific rules of thumb ;)