Can language models help me fix such issues in CNN based vision models?