Organisation of Data
Classification of Data and Variables
Once data are collected, they arrive as a confusing heap of raw figures. The next stage of a statistical study is to bring order to this heap — this is the organisation of data. The first step is classification: arranging the data into groups or classes according to their common characteristics, so that comparison and analysis become easy.
Data can be classified on four bases:
- Geographical — by place or region (e.g. population of different states).
- Chronological — by time (e.g. India's GDP year by year).
- Qualitative — by a quality or attribute that cannot be measured in numbers (e.g. people grouped by gender, literacy or religion).
- Quantitative — by a characteristic that can be measured in numbers (e.g. height, weight, income, marks).
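As a rough illustration, the four bases can be tried out on a tiny set of records. The records below are made up for this sketch, and the helper name `classify` and the income cut-off of 15000 are assumptions, not part of the chapter.

```python
from collections import defaultdict

# Made-up records for illustration only:
# (state, year, literacy status, monthly income in rupees)
records = [
    ("State A", 2021, "literate", 18000),
    ("State B", 2021, "illiterate", 9000),
    ("State A", 2022, "literate", 22000),
    ("State B", 2022, "literate", 12000),
]

def classify(rows, key):
    """Group rows by the label that `key` returns for each row."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

by_place = classify(records, key=lambda r: r[0])       # geographical
by_time = classify(records, key=lambda r: r[1])        # chronological
by_attribute = classify(records, key=lambda r: r[2])   # qualitative
# Quantitative: group by a measured value (income), using an assumed cut-off
by_income = classify(
    records,
    key=lambda r: "below 15000" if r[3] < 15000 else "15000 and above",
)

print(sorted(by_place))      # ['State A', 'State B']
print(sorted(by_time))       # [2021, 2022]
```

The same records fall into different groups depending on which characteristic is chosen as the basis.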
A characteristic that can be measured and takes different numerical values is called a variable. Variables are of two kinds:
- Discrete variable — takes only whole, separate values, with jumps in between (e.g. the number of children in a family: 0, 1, 2, 3 — never 2.5).
- Continuous variable — can take any value within a range, including fractions (e.g. height, weight or temperature, which can be 160.5 cm, 55.25 kg, etc.).
Knowing whether a variable is discrete or continuous decides how its frequency table is built.
Data are grouped by a common characteristic.
- Geographical (place), chronological (time).
- Qualitative (attribute) and quantitative (measurable).
Can the value be a fraction?
- (a) Number of cars — only whole values — discrete.
- (b) Weight — can be any value like 48.6 kg — continuous.
It is arranged by time.
- Year-by-year data are arranged chronologically.
Key Points
- Classification = arranging raw data into groups by common characteristics.
- Bases: geographical (place), chronological (time), qualitative (attribute), quantitative (measurable).
- Variable = a measurable characteristic with different values.
- Discrete (whole values only, e.g. number of children) vs continuous (any value in a range, e.g. height).
Frequency Distribution and Class Intervals
When the same value occurs again and again in data, we record how many times it occurs. The number of times a value (or group of values) appears is its frequency, and a table showing values with their frequencies is a frequency distribution.
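A minimal sketch of counting frequencies for a discrete variable is shown here. The list of family sizes is hypothetical data chosen only for illustration.

```python
from collections import Counter

# Hypothetical data: number of children in each of 10 families
children = [2, 1, 0, 2, 3, 1, 2, 0, 1, 2]

# Counter records how many times each value occurs (its frequency)
frequency = Counter(children)

# Print the frequency distribution in order of value
for value in sorted(frequency):
    print(value, frequency[value])
# 0 2
# 1 3
# 2 4
# 3 1
```

The frequencies add up to 10, the number of families observed.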
For a discrete variable we can list each value and its frequency directly. But for a continuous variable, or when the data spread over a wide range, we group the values into class intervals (such as 0–10, 10–20, 20–30). Some key terms:
- The two ends of a class are its class limits — the smaller is the lower limit, the larger the upper limit.
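- The difference between the upper and lower limit is the class size (width); for the exclusive class 10–20 it is 20 − 10 = 10. For an inclusive class such as 10–19, measure between the adjusted limits 9.5 and 19.5, which again gives 10.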
- The middle value of a class is its mid-point (class mark) = (lower limit + upper limit) ÷ 2; for 10–20 it is 15.
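The two formulas above can be checked with a short sketch. The function names `class_size` and `mid_point` are chosen here for illustration, and the classes are assumed to be in exclusive form.

```python
def class_size(lower, upper):
    """Width of a class: upper limit minus lower limit."""
    return upper - lower

def mid_point(lower, upper):
    """Class mark: the average of the two class limits."""
    return (lower + upper) / 2

print(class_size(10, 20))  # 10
print(mid_point(10, 20))   # 15.0
```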
There are two ways to form class intervals:
- Inclusive method — both limits are included in the class (e.g. 0–9, 10–19, 20–29). There is a gap between classes, so it suits discrete data.
- Exclusive method — the upper limit of one class is the lower limit of the next, and the upper limit is excluded (e.g. 0–10, 10–20: a value of exactly 10 goes into 10–20). This avoids gaps and suits continuous data. An inclusive table can be converted to exclusive form to remove the gaps before drawing graphs.
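The sketch below shows how a value is placed under the exclusive method, and one common way to convert inclusive classes to exclusive form. The conversion rule used here (move each limit by half the gap between classes) is the usual textbook convention and is an assumption, since the text above does not spell it out. The function names are also chosen only for illustration.

```python
def find_exclusive_class(value, classes):
    """Return the (lower, upper) class with lower <= value < upper, or None."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    return None

def inclusive_to_exclusive(classes):
    """Close the gaps by moving every limit half the gap (assumed convention)."""
    gap = classes[1][0] - classes[0][1]   # e.g. 10 - 9 = 1
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in classes]

exclusive = [(0, 10), (10, 20), (20, 30)]
print(find_exclusive_class(10, exclusive))   # (10, 20)

inclusive = [(0, 9), (10, 19), (20, 29)]
print(inclusive_to_exclusive(inclusive))
# [(-0.5, 9.5), (9.5, 19.5), (19.5, 29.5)]
```

After conversion, the upper limit of each class equals the lower limit of the next, so no gap remains.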
Use the formulas for the class 20–30.
- Class size = upper − lower = 30 − 20 = 10.
- Mid-point = (20 + 30) ÷ 2 = 25.
The upper limit is excluded, the lower included.
- In the exclusive method, 10 belongs to the class 10–20 (not to 0–10).
Continuous data have no gaps.
- The exclusive method suits continuous data because it leaves no gap between classes.
- This matches the continuous nature of the data.
Key Points
- Frequency = how many times a value occurs; frequency distribution = values with their frequencies.
- Class interval (e.g. 10–20): class size = upper − lower; mid-point = (lower + upper) ÷ 2.
- Inclusive (both limits included; gaps; suits discrete) vs exclusive (upper limit excluded; no gaps; suits continuous).
Constructing a Frequency Table
Let us put it all together and build a frequency table from raw data. Suppose 20 students scored the following marks (out of 50):
12, 23, 35, 41, 9, 18, 27, 33, 45, 7, 22, 38, 16, 29, 31, 44, 11, 26, 39, 48
To organise these into a frequency distribution using the exclusive method with a class size of 10, we (1) find the range (highest − lowest = 48 − 7 = 41), (2) decide the classes (0–10, 10–20, …, 40–50), and (3) put a tally mark for each value in its class, then count the tallies to get the frequency:
| Marks (class) | Tally | Frequency |
|---|---|---|
| 0–10 | \|\| | 2 |
| 10–20 | \|\|\|\| | 4 |
| 20–30 | \|\|\|\|\| | 5 |
| 30–40 | \|\|\|\|\| | 5 |
| 40–50 | \|\|\|\| | 4 |
| Total | | 20 |
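The three steps can also be sketched in code, using the same 20 marks. This is only one possible way to do the tallying. The variable names are chosen for illustration.

```python
marks = [12, 23, 35, 41, 9, 18, 27, 33, 45, 7,
         22, 38, 16, 29, 31, 44, 11, 26, 39, 48]

class_width = 10

# Step 1: find the range (48 - 7 = 41)
value_range = max(marks) - min(marks)

# Step 2: decide the exclusive classes 0-10, 10-20, ..., 40-50
classes = [(low, low + class_width) for low in range(0, 50, class_width)]

# Step 3: tally each value into the class with lower <= value < upper
table = {c: 0 for c in classes}
for m in marks:
    for low, high in classes:
        if low <= m < high:
            table[(low, high)] += 1
            break

for (low, high), freq in table.items():
    print(f"{low}-{high}: {'|' * freq} {freq}")

# Check: the frequencies must add up to the number of observations
assert sum(table.values()) == len(marks)
```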
The total of all frequencies (2 + 4 + 5 + 5 + 4 = 20) must equal the number of observations — a quick check that nothing was missed. We have turned a messy list into a neat, readable table. The running total of frequencies down (or up) the table is called the cumulative frequency, which we use later for the median and ogive.
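A short sketch of the running total, using the frequencies from the table above. The labels "less than" and "more than" for the two directions are the usual textbook names and are an assumption here, since the text only says "down (or up)".

```python
from itertools import accumulate

classes = ["0-10", "10-20", "20-30", "30-40", "40-50"]
frequencies = [2, 4, 5, 5, 4]

# Running total down the table ("less than" cumulative frequency)
cumulative_down = list(accumulate(frequencies))
print(cumulative_down)   # [2, 6, 11, 16, 20]

# Running total up the table ("more than" cumulative frequency)
cumulative_up = list(accumulate(reversed(frequencies)))[::-1]
print(cumulative_up)     # [20, 18, 14, 9, 4]
```

The last entry of the downward total (and the first entry of the upward total) equals 20, the number of observations.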
Every observation is counted once.
- Each value is tallied into exactly one class.
- So adding all frequencies must give back the total number of observations (here 20).
Add the first two classes.
- 0–10 has 2 and 10–20 has 4.
- 2 + 4 = 6.
It is a running total.
- Cumulative frequency is the running total of frequencies down (or up) the table.
Key Points
- Build a frequency table: find the range, choose classes, put tally marks, count to get the frequency.
- Sum of frequencies must equal the number of observations (a check).
- Cumulative frequency = running total of frequencies (used for median and ogive).