Reasons to not use PCA for feature selection

jstx1 · on April 1, 2022

The only reason you need: PCA is not a feature selection algorithm

Is the author misunderstanding something very basic or are they deliberately writing this way for clicks and attention? I can see that they have great credentials so probably the latter? It's a weird article.

it_does_follow · on April 1, 2022

It's very clear if you read the article that what the author is calling "feature selection" might be better termed "feature generation". He explicitly calls out what he means in the post:

> When used for feature selection, data scientists typically regard z^p:=(z_1,…,z_p) as a feature vector than contains fewer and richer representations than the original input x for predicting a target y.

I don't even think this is necessarily incorrect terminology, especially given the author's background of working primarily for Google and the like. It's the difference between considering feature section as "choosing from a list of the provided features" vs "choosing from the set of all possible features". The author's term makes perfect sense given the latter.

PCA is used for this all the time in the field. There have been an astounding number of presentations I've seen where people start with PCA/SVD as the first round of feature transformation. I always ask "why are you doing that?" and the answer is always mumbling with shoulder shrugging.

This is a solid post and I find it odd that you try to dismiss it as either ignorant or click bait, when a quick skim of it dismisses both of these options.

rlayton2 · on April 1, 2022

Yeah I thought I was reading the title wrong!

For anyone that isn't aware, the role of pca is to create new (synthetic) features that represent the original features.

It does not tell you which features of the original set are good for feature selection purposes.

galangalalgol · on April 1, 2022

The eigen values do give information as to energy, so if your selection criteria is simply to pick the linear component with the most energy, you can use PCA to select that feature and extract it with the corresponding eigen vector. The MUSIC algorithm is a classic example.

Edit: having now actually read the article, the case I mention falls into the author's test of (do linear combinations of my features make sense and have as much of a relationship to the target as the features themselves).

pas · on April 1, 2022

so what's a good way/algorithm/strategy/method/technique to select features? (obviously asking for a friend who's not that familiar with this, whereas I've all the black belts in feature engineering!)

z2210558 · on April 1, 2022

L1 regularisation is the usual way (see e.g. https://en.wikipedia.org/wiki/Lasso_(statistics))

microtonal · on April 1, 2022

There are also some iterative methods like grafting (using feature gradients):

https://www.jmlr.org/papers/volume3/perkins03a/perkins03a.pd...

and gain-based selection (using the improvement of the objective), see the appendix of:

https://aclanthology.org/J96-1002.pdf

We used grafting for parser feature selection, for which it worked quite well:

https://danieldk.eu/Research/Publications/ucnlg2011.pdf

ylks · on April 5, 2022

Feature selection ought to be model-specific. Because a feature wasn't selected by Lasso (in a linear model) does not mean it cannot be useful in a non-linear model.

clove · on April 1, 2022

Psychologists use factor analysis.

tomrod · on April 1, 2022

I work with ML regularly, and there is always something new to learn!

Another commenter mentioned L1 regularization, which is useful for linear regression. You wouldn't use it for all classes of problems. L1 regularization has to do with minimizing error of absolute values, instead of squared errors or similar.

I skimmed this article and thinks it's accessible: https://www.kdnuggets.com/2021/06/feature-selection-overview...

PCA is a form of dimensionality reduction, but it doesn't select features for you.

georgefox · on April 1, 2022

I'm probably nitpicking your language, but L1 regularization is precisely that: regularization. (See https://en.wikipedia.org/wiki/Regularization_(mathematics)#R....) In your typical linear regression setting, it does not replace the squared error loss but rather augments it. In regularized linear regression, for example, your loss function becomes a weighted sum of the usual squared error loss (aiming to minimize residuals/maximize model fit) and the norm of the vector of estimated coefficients (aiming to minimize model complexity).

tomrod · on April 1, 2022

Hey, I appreciate your correction! I wrote my comment late and night and definitely mashed the details. Your nice "nitpick" is a much needed correction to my inaccuracy.

r-zip · on April 1, 2022

> L1 regularization has to do with minimizing error of absolute values

Not quite. It has to do with minimizing the sum of absolute values of the coefficients, not the error. The squared error is still the "fidelity" term in the cost function.

disgruntledphd2 · on April 1, 2022

lasso is probably the easiest way to do it relatively quickly.

Lasso is also known as L1 regularisation, and it tends to set the coefficients to a bunch of features to zero, hence performing feature selection.

Note that if two predictors are very correlated, lasso may pick one mostly at random. Obviously one should do CV and bootstrapping to ensure that the results are relatively stable.

In general though, there's no real substitute for domain expertise when it comes to selecting good features.

edit: lasso is L1, not L2

newrotik · on April 1, 2022

check scikit-learn feature selection utilities https://scikit-learn.org/stable/modules/feature_selection.ht...

besides techniques mentioned in other posts (l1 regularization) sequential feature selection (backward and forward) is quite common

disgruntledphd2 · on April 1, 2022

> equential feature selection (backward and forward) is quite common

This is fine if you have test/validation sets, but never, ever report p-values on the result of such a selection process, as they are incredibly biased.

ylks · on April 5, 2022

Check out these posts: https://blog.kxy.ai/tag/feature-selection/.

This one in particular compares a few methods on 38 datasets and has some Python code: https://blog.kxy.ai/adding-feature-selection-to-any-model-in....

Disclaimer: I wrote the original blog post.

naveen99 · on April 1, 2022

This is the basic promise of deep learning: resnet, unet, transformers etc… you replace feature engineering with deep learning models… otherwise feature engineering comes from domain expertise/ human experience. or trial and error from known features that work elsewhere.

platz · on April 1, 2022

Isn't the article in fact saying that the new synthetic features are not good features for training?

j16sdiz · on April 1, 2022

PCA as feature selection is quite popular. Many text book on classification teach this.

rlayton2 · on April 1, 2022

So you have an example? I've never seen this and would like to see what people are doing here. I teach a lot of newbies data mining, so I'm very interested in how people get it wrong

bllguo · on April 1, 2022

here's one: https://charlesreid1.github.io/circe/Digit%20Classification%...

hervature · on April 1, 2022

But this is not so much "feature selection" as it is "compressing the data". It says right in the conclusion that the entire goal was "dimensionality reduction". In a very real way, PCA is selecting all features. That is, your data collection process remains unchanged. In a real feature selection, you would be able to say "ok, we don't need to collect X data anymore".

barrkel · on April 1, 2022

The point of the article, though, is that dimensionality reduction which minimizes information loss (PCA) isn't necessarily dimensionality reduction which minimizes signal loss.

A good example from the article is random features: random implies high information content but no signal value.

platz · on April 1, 2022

exactly right, barrkel. Did you read the article, hervature?

LudwigNagasena · on April 1, 2022

It says nothing about feature selection.

mywittyname · on April 1, 2022

What is the difference between feature selection and dimensionality reduction?

CapmCrackaWaka · on April 1, 2022

Feature selection is a process by which you drop features for different reasons. The main reason features tend to be dropped is because they are closely related to another feature, so you only need one of them. This makes algorithms train faster, reduces noise, and makes it easier to diagnose what the algorithm did.

PCA takes some N features and compresses them into N-n features. This process ALSO eliminates Collinearity completely, as the resulting, compressed features will be completely uncorrelated. However, calling PCA a feature selection algorithm is a bit untrue, because you have essentially selected none of your features, you have completely transformed them into something else.

dekhn · on April 1, 2022

It doesn't seem like such a stretch to conceptualize that if PCA assigns a tiny weight to a variable (assume all variables have been pre de-biased to mean 0, std dev 1) then it is saying it's a feature that doesn't contribute much to the overall prediction and therefore "deselecting" it by merging it with several other variables that it's correlated with, and downweighting it relative to them.

Most successful techniques I see in deep nets take the incoming features and mux them into intermediate features which are the actual ones being learned. Feature selection and PCA are in a sense just built into the network.

kthejoker2 · on April 1, 2022

Short answer: feature selection is one particular method of dimensionality reduction.

And most people when they say feature selection mean deliberate domain-driven selection of features.

That is to say, you can also create an entirely new synthesized feature from, say, 5 raw features and use it to replace those 5 features (this is ... PCA-esque.)

Or you could also use random forest techniques for example which just arbitrarily reduce the dimensions / features of each individual decision tree in the forest.

I agree many people here are making mountains out of molehills of terminology.

bllguo · on April 1, 2022

practically speaking i have never had a reason to do this with PCA, but i find this reaction very weird. i have encountered the concept of using PCA to reduce the dimensionality of your data before training many times, including in university. cursory searches show plenty of discussion, questions on crossvalidated, etc.

jstx1 · on April 1, 2022

My point is that dimensionality reduction != feature selection. PCA is dimensionality reduction, not feature selection.

bllguo · on April 1, 2022

maybe the wording is not precise, but the meaning is clearly training on the top principal components instead of the whole data

anyway in general i dont think the boundary between feature selection and dimensionality reduction is that sharp

etrautmann · on April 1, 2022

It’s not, but the typical use for PCA as a dimensionality reduction step prior to a downstream classifier is to introduce a bottleneck to prevent overfitting. In some sense, that can be thought of as a means of avoiding specific feature selection by using an unsupervised method to discover the population covariance structure

ylks · on April 5, 2022

That's precisely what the blog post deals with.

In that PCA pre-processing step, nothing guarantees that principal components are better representations for your problem than original inputs; in fact PCA has nothing to do with your target, how could it guarantee principal components are better representation to predict it?

Similarly, understanding the covariance structure of your original inputs will not necessarily help you predict your target better.

Here's a simple example illustrating this. Take x a single feature highly informative about y, take z a (large) d-dimensional highly structured vector that is independent from both x and y. Now, consider using [x, z] to predict y.

In this case, x happens to be a principal component as x and z are independent; it is associated to eigenvector [1, 0,...., 0] and eigenvalue Var(x). All other eigenvectors are of the form [0, u_1, ..., u_d] where [u_1, ..., u_d] is an eigenvector of Cov(z).

All it would take for x to be the very last (i.e. 'least important') principal component is for Var(x) to be smaller than all eigenvalues of Cov(z), which is easily conceivable, irrespective of y! In your quest for a lower-dimensional 'bottleneck' using PCA you would end up removing x, the only useful feature for predicting y! This will certainly not reduce overfitting.

PCA and other autoencoders work well as pre-processing step when there are structural reasons to believe low energy loss (or low reconstruction error) coincides with low signal loss. In tabular data, this tends to be the exception, not the norm.

j16sdiz · on April 1, 2022

If you don't do any feature selection explicitly and use PCA dimensional reduction, you are letting PCA select the features

hervature · on April 1, 2022

This is false though as your top eigenvectors are going to have non-zero entries in all components unless your original data has some features with 0 covariance which is highly unlikely. You still need to collect all the original features, you have not selected features that can be removed.

LudwigNagasena · on April 1, 2022

That’s like saying that in a stacked model every layer is feature selection. That doesn’t really make sense. Feature selection is about which input data is used, not about how it is transformed during the whole prediction procedure.

derbOac · on April 1, 2022

I think you could think of PCA as a feature selection candidate when the prediction target is unknown or random?

pas · on April 1, 2022

Could you elaborate on this a bit? Are there feature selection algorithms?

jstx1 · on April 1, 2022

PCA is a form for dimensionality reduction. It can be used to reduce your data to a only few features while preserving some properties like the predictive quality of your model. Simple example: you give PCA 10 input features, it gives back the principal components, you pick the first 2 principal components and use them in your regression model (you still select how many componenets to use yourself, PCA doesn't do that for you).

Feature selection is well... selecting features. You start with your original 10 features and you pick which ones to use. You can do this based on domain knowledge, based on analysing variance, based on correlation between features, based on feature importance in some model, by adding/removing/shuffling features and seeing how your model performs etc.

Frost1x · on April 1, 2022

>Feature selection is well... selecting features. You start with your original 10 features and you pick which ones to use.

That sounds a lot like another form of dimensionality reduction to me. The part people trip up on is that dimensions have a very specific definition in the world of statistics. All of these are basically reductionist approaches to tackle representing more complex systems by shedding information, they just go about it in different ways.

jstx1 · on April 1, 2022

Good point, a look at the contents of the wikipedia page for dimensionality reduction makes it clear - https://en.wikipedia.org/wiki/Dimensionality_reduction

    Dimensionality reduction:

    1. Feature selection.

    2. Feature projection.

      2.1 PCA

So you could say that feature selection is dimensionality reduction but you can't say that PCA is feature selection.

mywittyname · on April 1, 2022

> but you can't say that PCA is feature selection.

One of the steps in PCA is to select the top N eigenvectors sorted by eigenvalues. So while it is not feature selection, it can be argued that it involves a feature selection step.

weberer · on April 1, 2022

>Are there feature selection algorithms?

Yeah. Here's a few

https://scribe.citizen4.eu/feature-selection-how-to-throw-aw...

civilized · on April 1, 2022

Now that we've all said "feature selection is not dimensionality reduction" to our hearts' content, could we return to the point of the article?

Regardless of whether you're doing feature selection or dimensionality reduction, the point remains that, if you're doing supervised learning, PCA is just compressing your X space, without any regard to your y. It could be that the last principal component of X, containing only 0.1% of the variance, contains 100% of the correlation between X and y.

Using PCA for dimensionality reduction in a supervised learning context means throwing out an unknown amount of signal, which could be up to 100% of the signal.

Now for unsupervised, exploratory analysis, PCA is definitely a candidate, but there are plenty of often-better alternatives there too.

jstx1 · on April 1, 2022

For me PCA is strictly in a bucket labeled "this comes up a lot in data science interviews but you never actually need to use it at work".

it_does_follow · on April 1, 2022

You've never had to any kind of factor analysis in your work or done any searching for latent variables that map to customer/stakeholder question? Given the number of people I've worked with that are interested in modeling "engagement", I find this hard to believe.

PCA is an incredibly valuable tool that I've used in most jobs I've had. It's just a terrible idea as a default part of a feature engineering pipeline (which is what the author is talking about in terms of "feature selection"), for reasons outline in this article.

I suggest you don't be quite so quick to dismiss important concepts in this area, and before criticizing this post, at least read through it (I noticed your comment about misunderstand what the author is discussing by "feature selection" is the top comment here).

jstx1 · on April 1, 2022

> You've never had to

Nope, never. I'm not dismissing PCA altogether; I'm sharing my experience and pointing out that some topics come up much more often in interviews than on the job.

nerdponx · on April 1, 2022

My experience is the opposite of yours. It has never come up in a job interview but I have used it for real work several times.

civilized · on April 1, 2022

For me the contents of this bucket are things like SVM, Naive Bayes, and k-means. PCA is actually in a slightly higher tier than those three.

kleene_op · on April 1, 2022

k-means are actually useful for bootstrapping statistical mixtures when you don't have good priors.

Just for this convenient use alone I would place them above PCA.

civilized · on April 1, 2022

What does bootstrapping statistical mixtures mean?

Why not just bootstrap in the regular old non-parametric way? Why inject k-means into it?

kleene_op · on April 1, 2022

When you try to approximate your data with a mixture of distributions, say Gaussians for instance, you have to initialize their parameters with some values before running your EM loop.

Finding good ones can be very problematic. The k-means process is a reasonable method to get good enough starting values without having to think/compute too much.

https://en.wikipedia.org/wiki/Expectation%E2%80%93maximizati...

civilized · on April 1, 2022

OK, so for you, "bootstrapping statistical mixtures" means "warm-starting EM mixture modeling".

Personally I never would have guessed that, in retrospect it makes sense... to each their own!

hatmatrix · on April 1, 2022

If you're looking for the part of the signal in X that covaries with y, the first thing to try is partial least squares.

mturmon · on April 1, 2022

Yes. In the simplified linear setting we're considering...

If you want to do dimension reduction and only have X, use PCA.

If you want to do dimension reduction and also have a "y", consider PLS or CCA (canonical correlation analysis).

Some background: https://www.cs.cmu.edu/~tom/10701_sp11/slides/CCA_tutorial.p...

ylks · on April 5, 2022

You nailed it.

quanto · on April 1, 2022

PCA can be construed as a loss compression of the data matrix. In fact, Eckart-Young theorem shows that this is an optimal compression (optimal w.r.t low rank, i.e. space to hold the values). In the language of OP, this shows the minimal energy loss for a given space constraint.

The key word is "lossy". It may as well be that the loss'ed part had the signal for further classification down the pipeline. Or may be not. It depends on the case.

srean · on April 1, 2022

There was a recent discussion on PCA for classification that I had walked into, however, everyone had left the building when I joined. Since I run into this conceptual misunderstanding of PCA's relevance to classification often, let me repeat what I said there.

The problem with using PCA as a preprocessing step for linear classification is that this dimensionality reduction step is being done without paying any heed to the end goal -- better linear separation of the classes. One can get lucky and get a low-d projection that separates well but that is pure luck. Let me see if I can draw an example

           ++++++
       ++++++++++++++++
             +++
           ---------
       -----------------
            -------

The '+' and '-' denote the data points of the two different classes. In this example the PCA direction will be the along the X axis, which would be the worst axis to project on to separate the classes. The best in this case would have been the Y axis.

A far better approach would be to use a dimensionality reduction technique that is aware of the end goal. One such example is Fisher discriminant analysis and its kernelized variant.

coconuthacker42 · on April 1, 2022

The answer, as always, is IT DEPENDS.

platz · on April 1, 2022

> 1) Does linearly combining my features make any sense?

> 2) Can I think of an explanation for why the linearly combined features could have as simple a relationship to the target as the original features?

the article provided a negative example where pca does not fit, but doesnt provide an example where it does or what pca is actually used for. I come away from this article thinking pca is useless.

what would be an example where 2) is true?

I cannot answer 2) without already having experience of what explanations there could possibly be. ( 2) is almost begging the question, at least pedagogically- pca is good when the features are good for pca)

When does linearly combining my features "make sense"? again, an example is not provided

ThereIsNoWorry · on April 1, 2022

I mean, that's the point. What kind of real world complexities are actually linearly explainable? Almost none. It goes all the way back to why classical statistics failed to provide real world value after a certain point and the trend has been going towards using non-linear black boxes for the last decade.

platz · on April 1, 2022

So you basically are saying PCA is a useless procedure.

Also, "after a certain point" is doing a lot of heavy lifting in that sentance.

hervature · on April 1, 2022

The article is silly because PCA cannot select features. It is all about dimensionality reduction. You should think of PCA as the equivalent to VAEs from the neural network world. The idea would be something like this: you have big images (let's say 4k) and this is too expensive to train with/store forever. So, you collect a training set, train a PCA on these images, and then you can convert your 4k images to 720p or even 10 numbers, which you then use to predict/train whatever you want. Of course, we have algorithms that scale images but maybe all your images are of cats and there is a specific linear transformation that contains more information from the 4k image than simply scaling. The implicit thing here is that you still are collecting 4k images but just immediately compress them down using your trained PCA transformation.

So, although you have less numbers than before, you still need to collect the original data. A real feature selection process would be able to do something like: "the proximity of the closest Applebees is not important to predict house prices, you should probably stop wasting your time calculating this number". As others have mentioned, L1 regression or some statistical procedure to identify useless features is typically how this is done. I would also add that domain knowledge is probably your #1 feature selection because we have to restrict the variables we input in the first place and which data we prioritize is inherently selecting the features.

dkersten · on April 1, 2022

So would this be a more-or-less correct tl;dr:

Dimensionality reduction is a compressing data in a way that retains the most important information for the task

Feature selection is removing unimportant information (keeping/collecting, or selecting, only the important parts)

Both cut down on the amount of data you end up with, but one does it by finding a representation that is smaller, the other does it by discarding unnecessary data (or, rather, telling you which data is necessary, so you can stop collecting the unnecessary data).

chombier · on April 3, 2022

> so you can stop collecting the unnecessary data

I think that's the key, thanks.

But still, if some inputs are redundant shouldn't this be somehow apparent in the eigen-vectors/values of the covariance matrix (making PCA an indirect feature selection algorithm)?

naijaboiler · on April 1, 2022

So they are functionally doing the same thing, reducing the amount of data used. I find the debate about what we should call it useless and pointless.

Understand what it is, what it doing, what it s limitations are, and use it appropriately based on your needs. Done.

bllguo · on April 1, 2022

indeed, predictable and disappointing how the discussion devolved into pedantry. should have been obvious what the author meant (plus it is clarified at the very beginning of the article). I'm not sure if this is a ML practitioner vs. statisticians thing or what

platz · on April 1, 2022

if you read the actual article, dimensionality reduction by PCA does not retain the most important information about the data

hervature · on April 1, 2022

Yes, I would say this is entirely correct.

mrow84 · on April 1, 2022

One method is to drop variables with small coefficients in the top n principal components. This is feature selection, by your definition.

raverbashing · on April 1, 2022

Here's a better idea, don't use generalized statements but rather test your data first.

If your models fits well (and doesn't overfit) after PCA, then go for it. If not, revisit.

PCA has its place, and as the other commenter said, sure, it's not a feature selection algorithm. Or you can just feature select manually.

coconuthacker42 · on April 1, 2022

You might not even need feature selection, just a dimensionality reduction in some cases

anonymoushn · on April 1, 2022

Much of the text is outside the viewport and cannot be scrolled to on mobile.

first_post · on April 1, 2022

What would you recommend for feature selection in say, single-cell RNA seq studies (Typical dataset is ~10,000 x ~30000 (cells x genes) with >90% of your table filled with 0s (which could be due to biological or technical noise)

PCA and UMAP are yes, dimensionality reduction methods, but are often seen as tools for feature selection.

See slide 61 Here: https://physiology.med.cornell.edu/faculty/skrabanek/lab/ang...

plandis · on April 1, 2022

I followed the math but I don’t know ML… Do practitioners really use “energy” and “conservation of energy”? That just seems overly confusing.

ylks · on April 5, 2022

Energy is a shorthand for L2 norm (in a probabilistic sense). This similar to the definition of the energy of a signal in signal processing. 'Information content' is an alternative for 'energy' here, but it can be mistaken for entropy.

dekhn · on April 2, 2022

I've never heard the term used before in ML or statistical modelling.

rybosworld · on April 1, 2022

PCA has a niche use-case. It's more often harmful than not in my experience.

coconuthacker42 · on April 1, 2022

Can you elaborate? What other techniques do you find useful or better for either feature extraction or dimensionality reduction?

time_to_smile · on April 1, 2022

The entire field of deep learning is essentially performing non-linear dimensionality reduction, ultimately projecting the data on to a manifold where it is linearly separable.

The last layer of most neural networks is essentially just logistic regression, so if you take that final layer before the soft max you have a lower dimensional representation of the original data.

You can extend this things to more sophisticate example where you're not just learning a target but some other more general objective. For example you might want to learn a representation that minimizes the cosine distance between objects in a similar category.

PCA is only useful for compression (which is rarely a problem in contemporary applications) as far as pure feature engineering goes. However when you realize that PCA is the same as a linear auto-encoder then it does serve as a good intellectual starting point for more sophisticated dimensionality reduction techniques.

naijaboiler · on April 1, 2022

> The entire field of deep learning is essentially performing non-linear dimensionality reduction, ultimately projecting the data on to a manifold where it is linearly separable.

I will generalize this to every data problem. They all essentially boil down to reducing dimensions and projecting to a space where they are linearly separable.

I was trying to show a junior colleague that this is essentially what any and every approach is doing

coconuthacker42 · on April 2, 2022

NYUs online deep learning course does a good job explaining the first two paragraphs.

ylks · on April 5, 2022

Hello,

I'm the author of the post. I'm slightly late to the party, but I'll try to clarify a few misunderstandings.

First and foremost, the post deals with the following scenario too many data scientists find themselves in: "I have (generated) a lot of features; let me do PCA and train my model using the top few principal components". This is a terrible idea and the post explains why.

Second, there seems to be a debate about 'feature selection' vs. 'feature construction' (or 'feature generation'), and whether PCA is of the former or latter type. Here are the definitions I use in the whole blog.

Feature Construction is the process consisting of generating candidate features (i.e. transformations of the original inputs) that might have a simpler relationship with the target, one that models in our toolbox can reliably learn.

E.g. a linear model cannot learn a quadratic function. However, because a quadratic function of x is linear in [x, x^2], the feature transformation x -> [x, x^2] is needed to make our quadratic function learnable by a linear model.

Check out this post for more details on what's at stake during feature construction: https://blog.kxy.ai/feature-construction/.

Feature Selection is the process consisting of removing useless features in the set of candidates generated by feature construction. A feature is deemed useless when it is uninformative about the target or redundant.

Check out this post for a formulation of three key properties of a feature using Shapley values: feature importance, feature usefulness, and feature potential. https://blog.kxy.ai/feature-engineering-with-game-theory-bey...

In the scenario the blog post deals with (i.e. "I have (generated) a lot of features; let me do PCA and train my model using the top few principal components"), data scientists do both feature construction (full PCA, i.e. projecting the original input onto eigenvectors to obtain as many principal components as the dimension of the original input) AND feature selection (only selecting the first few principal components with the highest eigenvalues).

When the goal is to predict y from x, using PCA for either feature construction OR feature selection is a bad idea!

For feature construction, there is nothing in PCA that will intrinsically guarantee that a linear combination of coordinates of x will have a simpler relationship to y than x itself. PCA does not even use y in this case! E.g. Imagine all coordinates of x but the first are pure noise (as far as predicting y is concerned). Any linear combination of x will just make your inputs noisy!

For feature selection, even assuming principal components make sense as features, principal components with the highest variances (i.e. corresponding to the highest eigenvalues) need not be the most useful for predicting y! High variance does not imply high signal.