I am building predictive genomic models from a large-scale SNP array dataset and would like a skilled collaborator to take ownership of the statistical side of the pipeline. The raw data have already been collected; what I now need is a rigorous analysis workflow that turns those variants into clear, reproducible insights.
Before any modelling begins, the file will need a light round of quality assurance: filtering out low-quality calls, imputing missing genotypes, normalising across batches, and performing feature selection so that only informative loci move forward. Once this cleaned matrix is in place, the core assignment is to implement and interpret two complementary methods—regression analysis for association testing and principal component analysis for dimensionality reduction and structure correction.
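To give a sense of the scope, the preprocessing and PCA steps above could look roughly like the sketch below. It is illustrative only, not a prescribed implementation. It assumes genotypes are coded 0/1/2 in a samples-by-SNPs matrix with NaN for missing calls. The function names, the call-rate and minor-allele-frequency thresholds, and the use of simple mean imputation are all placeholder assumptions. Batch normalisation and the regression step are not covered here.

```python
import numpy as np

def preprocess_genotypes(G, min_call_rate=0.95, min_maf=0.01):
    """Filter, impute and standardise a samples x SNPs genotype matrix.

    Assumes 0/1/2 coding with np.nan for missing calls (hypothetical layout).
    Returns the standardised matrix and a boolean mask of the SNPs kept.
    """
    G = np.asarray(G, dtype=float)
    observed = ~np.isnan(G)

    # Per-SNP call rate: fraction of samples with a non-missing genotype.
    call_rate = observed.mean(axis=0)

    # Per-SNP mean over observed calls only (guard against empty columns).
    n_obs = np.maximum(observed.sum(axis=0), 1)
    col_mean = np.where(observed, G, 0.0).sum(axis=0) / n_obs

    # Minor allele frequency derived from the mean allele dosage.
    freq = col_mean / 2.0
    maf = np.minimum(freq, 1.0 - freq)

    keep = (call_rate >= min_call_rate) & (maf >= min_maf)

    # Mean-impute missing calls in the SNPs that pass the filters.
    G_kept = G[:, keep]
    G_imp = np.where(np.isnan(G_kept), col_mean[keep], G_kept)

    # Centre and scale each SNP; constant columns are left at zero.
    centred = G_imp - G_imp.mean(axis=0)
    sd = G_imp.std(axis=0)
    sd[sd == 0] = 1.0
    return centred / sd, keep

def pca_scores(Z, n_components=2):
    """PCA via SVD on an already centred matrix.

    Returns sample scores and the fraction of variance per component.
    """
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return U[:, :n_components] * S[:n_components], explained[:n_components]
```

The leading principal components from `pca_scores` could then be passed as covariates to the regression models to correct for population structure.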
Deliverables
* Most important: the model must achieve 90-95% accuracy on the binary classification of camels.
* Misclassifications are acceptable only for borderline cases, not for clear-cut (extreme) ones.
* Reproducible scripts (R/Python or both) that perform the stated preprocessing steps
* Well-commented code for regression models and PCA, including visual summaries of key outputs
* A concise report or notebook explaining findings, parameter choices, and recommendations for the next modelling phase
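The accuracy requirement above could be checked along the following lines. This is only a sketch under stated assumptions: it supposes the model outputs a probability for the positive class, and it treats a prediction as "borderline" when that probability lies within a chosen margin of the decision threshold. The function name, the threshold, and the margin value are hypothetical and would need to be agreed before evaluation.

```python
import numpy as np

def evaluate_predictions(y_true, prob, threshold=0.5, margin=0.15):
    """Report accuracy and split the errors into borderline vs extreme.

    'Borderline' here means |prob - threshold| <= margin, which is an
    assumed definition, not one fixed by the project.
    """
    y_true = np.asarray(y_true, dtype=int)
    prob = np.asarray(prob, dtype=float)

    # Hard class labels from the predicted probabilities.
    y_pred = (prob >= threshold).astype(int)
    wrong = y_pred != y_true

    # Predictions close to the threshold count as borderline.
    borderline = np.abs(prob - threshold) <= margin

    return {
        "accuracy": float((~wrong).mean()),
        "borderline_errors": int((wrong & borderline).sum()),
        "extreme_errors": int((wrong & ~borderline).sum()),
    }
```

Under this definition, a model would meet the brief when `accuracy` is at least 0.90 and `extreme_errors` is zero on held-out samples.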
Note: please share screenshots or a video of your past work in R.