I really get confused by these fields. Everyone mixes up big data, data scientist, and data analyst. Aren't they all different? Where does Hadoop come in? Isn't it a distributed file storage system that spreads NoSQL data across nodes for faster access? Why does big data always need to be unstructured? Which field do I start with first? Who is the person who solves problems by looking at data? And how does a person read big data with their bare eyes? If it is unstructured, it's a mess, just digits. Where do you even start solving problems?
Maybe I'm a mess, but I have so many questions here. Five months back, when I wanted to pursue my FYP in big data, no one was there; so many ignorant people in universities, and suddenly everyone pops up. @Crow @sunny945
Any help/direction would be appreciated. Where do I go to get answers to my questions? Honestly, Google confuses me a lot on this.
I am just a computer science student, but still I want to answer your questions as far as I know.
Basically, in very simple words, if I try to explain things to you: remember that there are some fields in CS that are interconnected with each other.
Now, regarding this area, first of all the main field is Data Mining.
So the question is:
Where does data mining come in, and how can it solve problems?
Before I tell you how it can solve a problem, you first need to know how we set it up:
1: We have a huge amount of data to take care of.
2: The data is usually in textual format, as roughly 90% of the data around the world right now is text.
3: We need a way to save the data, and usually it's not stored on one system but on multiple computers connected together. These computers form a distributed system that uses a distributed file system, i.e., a file system that lets us split our data and save it across multiple connected machines, and that can retrieve the data efficiently when we need it.
4: For this purpose, the most widely used framework is Hadoop, whose file system is called HDFS (Hadoop Distributed File System). We have multiple machines connected together that use Hadoop to store data across multiple nodes/computers.
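To make points 3 and 4 concrete, here is a toy Python sketch of the core idea: split data into fixed-size blocks and replicate each block on several nodes. This is not how HDFS is actually implemented; the block size, node names, and round-robin placement are made-up simplifications just to show the concept.

```python
# Toy sketch (NOT real HDFS): split data into fixed-size blocks and
# assign each block to several nodes, the way a DFS replicates data.
BLOCK_SIZE = 8          # bytes per block (real HDFS defaults to 128 MB)
REPLICATION = 2         # how many copies of each block we keep
NODES = ["node0", "node1", "node2"]

def place_blocks(data: bytes):
    """Return {block_index: {"block": bytes, "nodes": [node, ...]}}."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for idx, block in enumerate(blocks):
        # simple round-robin placement with replication
        targets = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement[idx] = {"block": block, "nodes": targets}
    return placement

layout = place_blocks(b"people who buy bread also buy jam")
for idx, info in layout.items():
    print(idx, info["nodes"])
```

The point is only that no single machine holds the whole file, and losing one node doesn't lose any block, because every block lives on more than one node.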
Now, after our data is saved, suppose we want to solve a problem, say a business problem. One of the most common examples is market basket analysis, in which we determine trends in what people purchase; for example, we might find that 90% of people who purchase bread also purchase jam. This kind of analysis is only possible if our data is saved in a data center.
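Here is a minimal sketch of that bread-and-jam idea in plain Python: count how often items appear together in baskets and compute the confidence of the rule "bread -> jam". The transactions and item names are made up for illustration; real market basket analysis runs algorithms like Apriori over millions of baskets.

```python
# Toy market basket analysis: compute confidence(bread -> jam)
# from a handful of made-up transactions.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "jam", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"jam", "eggs"},
    {"bread", "jam", "eggs"},
]

item_counts = Counter()   # how many baskets contain each item
pair_counts = Counter()   # how many baskets contain each item pair
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

# confidence(bread -> jam) = baskets with both / baskets with bread
confidence = pair_counts[("bread", "jam")] / item_counts["bread"]
print(f"confidence(bread -> jam) = {confidence:.2f}")  # 3 of 4 bread baskets
```

So in this toy data, 75% of the people who bought bread also bought jam; that percentage is exactly the kind of "trend" the analysis is after.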
Another question here:
Why save data in a data center?
Ans: The data is so huge that it can't be accessed as it is.
So our data is saved in a data center, ready to process, and to achieve our goals we run data mining algorithms on the backend.
Now the question is:
What is big data?
Ans: Big data is just a term we use for data that is so large and complex that we can't store it in traditional systems.
To implement big data we need:
1: A basic framework, usually a DFS plus a processing layer, like Hadoop (HDFS + MapReduce), GFS, etc.
2: For storing and retrieving data, we usually use a NoSQL database like MongoDB or Cassandra.
3: For data processing, we apply different algorithms on top of Hadoop according to our requirements, like k-means, Apriori, PageRank, KNN, C4.5, etc.
Now, this whole setup explained above is big data.
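To show what one of those processing algorithms actually does, here is a tiny 1-D k-means sketch in plain Python with no libraries. In a real big-data setup this would run distributed (e.g. as a MapReduce job over Hadoop), and the points and starting centers below are made up, but the assignment/update loop is the same idea.

```python
# Tiny 1-D k-means: repeatedly assign points to their nearest center,
# then move each center to the mean of its assigned points.
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# two obvious groups: values near 1.5 and values near 10.5
print(kmeans_1d([1.0, 2.0, 1.5, 10.0, 11.0, 10.5], [0.0, 5.0]))  # [1.5, 10.5]
```

On big data, the "assign" step is what gets parallelized across nodes: each machine assigns its own chunk of points, and only the per-cluster sums and counts are sent back to compute the new centers.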
The people who set up this stack, run the algorithms, and analyse the data are data scientists and data analysts.
Now the question arises: where can I start?
Ans: Data mining, because when you study data mining you will read about the concepts of:
Data warehouse
Data Mining
Text Mining
Classification
Statistics
and many other basics that are always required if you want to go into this field. After you learn the basics, you will get an idea of these bigger topics on your own. But if you start from big data directly, you will end up knowing just the buzzwords that are used for marketing and nothing else.
I tried to explain everything as far as I know about it, but still, as I said, I am just a CS student.