Organisation of Data

  • Classification of Data and Variables
  • Frequency Distribution and Class Intervals
  • Constructing a Frequency Table

Classification of Data and Variables

Once data are collected, they arrive as a confusing heap of raw figures. The next stage of a statistical study is to bring order to this heap — this is the organisation of data. The first step is classification: arranging the data into groups or classes according to their common characteristics, so that comparison and analysis become easy.

Data can be classified on four bases:

  • Geographical — by place or region (e.g. population of different states).
  • Chronological — by time (e.g. India's GDP year by year).
  • Qualitative — by a quality or attribute that cannot be measured in numbers (e.g. people grouped by gender, literacy or religion).
  • Quantitative — by a characteristic that can be measured in numbers (e.g. height, weight, income, marks).

A characteristic that can be measured and takes different numerical values is called a variable. Variables are of two kinds:

  • Discrete variable: takes only certain separate values, usually whole numbers, with jumps in between (e.g. the number of children in a family: 0, 1, 2, 3, never 2.5).
  • Continuous variable — can take any value within a range, including fractions (e.g. height, weight or temperature, which can be 160.5 cm, 55.25 kg, etc.).

Whether a variable is discrete or continuous determines how its frequency table is built.

Worked Example
Example 1: On what four bases can data be classified?
Solution

Data are grouped by a common characteristic.

  • Geographical (place), chronological (time).
  • Qualitative (attribute) and quantitative (measurable).
Worked Example
Example 2: Classify these variables as discrete or continuous: (a) number of cars in a street, (b) weight of students.
Solution

Can the value be a fraction?

  • (a) Number of cars — only whole values — discrete.
  • (b) Weight — can be any value like 48.6 kg — continuous.
Worked Example
Example 3: India's wheat output recorded for each year from 2010 to 2020 is classified on which basis?
Solution

It is arranged by time.

  • Year-by-year data are arranged chronologically.

Key Points

    • Classification = arranging raw data into groups by common characteristics.
    • Bases: geographical (place), chronological (time), qualitative (attribute), quantitative (measurable).
    • Variable = a measurable characteristic with different values.
    • Discrete (whole values only, e.g. number of children) vs continuous (any value in a range, e.g. height).
Quick Check (2 questions)
Q1. Arranging India's yearly GDP figures is an example of ____ classification.
Explanation: Year-by-year (time-based) data are classified chronologically.
Q2. The number of children in a family is a:
Explanation: It takes only whole values, so it is a discrete variable.

Frequency Distribution and Class Intervals

When the same value occurs again and again in data, we record how many times it occurs. The number of times a value (or group of values) appears is its frequency, and a table showing values with their frequencies is a frequency distribution.
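As a minimal sketch of counting frequencies for a discrete variable, the Python snippet below uses `collections.Counter` from the standard library. The list `children_per_family` is hypothetical data made up for illustration; it does not come from the lesson.

```python
from collections import Counter

# Hypothetical data: number of children in each of 10 families.
children_per_family = [2, 1, 0, 2, 3, 1, 2, 0, 1, 2]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(children_per_family)

# Print the frequency distribution in ascending order of value.
for value in sorted(frequency):
    print(value, frequency[value])
```

Each printed row pairs a value with its frequency, which is exactly what a frequency distribution records.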

For a discrete variable we can list each value and its frequency directly. But for a continuous variable, or when the data spread over a wide range, we group the values into class intervals (such as 0–10, 10–20, 20–30). Some key terms:

  • The two ends of a class are its class limits — the smaller is the lower limit, the larger the upper limit.
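  • The difference between the upper and lower limit is the class size (width); for 10–20 it is 10. (This holds when classes share limits, as in 0–10, 10–20; for classes written with gaps, such as 10–19, the width is still 10, measured from 9.5 to 19.5.)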
  • The middle value of a class is its mid-point (class mark) = (lower limit + upper limit) ÷ 2; for 10–20 it is 15.
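The two formulas above can be sketched in Python as follows. The function names `class_size` and `mid_point` are illustrative choices, and the sketch assumes classes whose limits are shared with their neighbours (such as 10–20).

```python
def class_size(lower, upper):
    """Width of a class: upper limit minus lower limit."""
    return upper - lower


def mid_point(lower, upper):
    """Class mark: the average of the two class limits."""
    return (lower + upper) / 2


print(class_size(10, 20))  # 10
print(mid_point(10, 20))   # 15.0
```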

There are two ways to form class intervals:

  • Inclusive method — both limits are included in the class (e.g. 0–9, 10–19, 20–29). There is a gap between classes, so it suits discrete data.
  • Exclusive method — the upper limit of one class is the lower limit of the next, and the upper limit is excluded (e.g. 0–10, 10–20: a value of exactly 10 goes into 10–20). This avoids gaps and suits continuous data. An inclusive table can be converted to exclusive form to remove the gaps before drawing graphs.
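The exclusive rule and the inclusive-to-exclusive conversion can be sketched in Python as below. This is only an illustration under stated assumptions: the helper names are hypothetical, classes are assumed to start at 0 with equal width, and the conversion uses the usual half-gap adjustment (subtract half the gap from each lower limit and add it to each upper limit), which the text above does not spell out.

```python
def exclusive_class(value, width=10):
    """Return the (lower, upper) exclusive class that contains value.

    The lower limit is included and the upper limit is excluded,
    so a value of exactly 10 falls in 10-20, not in 0-10.
    Assumes non-negative values and classes starting at 0.
    """
    lower = (value // width) * width
    return (lower, lower + width)


def to_exclusive(lower, upper, gap=1):
    """Convert an inclusive class to exclusive form by closing the gap.

    Half the gap is subtracted from the lower limit and added to the
    upper limit, e.g. 10-19 becomes 9.5-19.5 when the gap is 1.
    """
    half = gap / 2
    return (lower - half, upper + half)


print(exclusive_class(10))   # (10, 20)
print(to_exclusive(10, 19))  # (9.5, 19.5)
```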
Worked Example
Example 1: For the class interval 20–30, find the class size and the mid-point.
Solution

Use the formulas.

  • Class size = upper − lower = 30 − 20 = 10.
  • Mid-point = (20 + 30) ÷ 2 = 25.
Worked Example
Example 2: In the exclusive class 10–20, where does a value of exactly 10 go?
Solution

The upper limit is excluded, the lower included.

  • In the exclusive method, 10 belongs to the class 10–20 (not to 0–10).
Worked Example
Example 3: Which method (inclusive or exclusive) is better suited to continuous data, and why?
Solution

Continuous data have no gaps.

  • The exclusive method, because it leaves no gap between classes.
  • This matches the continuous nature of the data.

Key Points

    • Frequency = how many times a value occurs; frequency distribution = values with their frequencies.
    • Class interval (e.g. 10–20): class size = upper − lower; mid-point = (lower + upper) ÷ 2.
    • Inclusive (both limits included; gaps; suits discrete) vs exclusive (upper limit excluded; no gaps; suits continuous).
Quick Check (2 questions)
Q1. The mid-point (class mark) of the class 40–50 is:
Explanation: Mid-point = (40 + 50) ÷ 2 = 45.
Q2. In which method is the upper limit of a class excluded from it?
Explanation: The exclusive method excludes the upper limit, leaving no gaps.

Constructing a Frequency Table

Let us put it all together and build a frequency table from raw data. Suppose 20 students scored the following marks (out of 50):

12, 23, 35, 41, 9, 18, 27, 33, 45, 7, 22, 38, 16, 29, 31, 44, 11, 26, 39, 48

To organise these into a frequency distribution using the exclusive method with a class size of 10, we (1) find the range (highest − lowest = 48 − 7 = 41), (2) decide the classes (0–10, 10–20, …, 40–50), and (3) put a tally mark for each value in its class, then count the tallies to get the frequency:

Marks (class)    Tally    Frequency
0–10             ||       2
10–20            ||||     4
20–30            |||||    5
30–40            |||||    5
40–50            ||||     4
Total                     20

The total of all frequencies (2 + 4 + 5 + 5 + 4 = 20) must equal the number of observations — a quick check that nothing was missed. We have turned a messy list into a neat, readable table. The running total of frequencies down (or up) the table is called the cumulative frequency, which we use later for the median and ogive.
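The three steps can also be reproduced in a short Python sketch using the same 20 marks. The variable names are illustrative, and plain `|` characters stand in for hand-drawn tally marks.

```python
marks = [12, 23, 35, 41, 9, 18, 27, 33, 45, 7,
         22, 38, 16, 29, 31, 44, 11, 26, 39, 48]
width = 10

# Step 1: range = highest value - lowest value.
data_range = max(marks) - min(marks)

# Step 2: exclusive classes 0-10, 10-20, ..., 40-50.
classes = [(lower, lower + width) for lower in range(0, 50, width)]

# Step 3: count the values in each class
# (lower limit included, upper limit excluded).
frequencies = [sum(1 for m in marks if lower <= m < upper)
               for lower, upper in classes]

for (lower, upper), f in zip(classes, frequencies):
    print(f"{lower}-{upper}", "|" * f, f)

# Check: the frequencies must add up to the number of observations.
assert sum(frequencies) == len(marks)
```

The final assertion is the same check described above: if any mark were missed or counted twice, the totals would not agree.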

Worked Example
Example 1: Why must the sum of all frequencies equal the number of observations?
Solution

Every observation is counted once.

  • Each value is tallied into exactly one class.
  • So adding all frequencies must give back the total number of observations (here 20).
Worked Example
Example 2: In the table above, how many students scored less than 20 marks?
Solution

Add the first two classes.

  • 0–10 has 2 and 10–20 has 4.
  • 2 + 4 = 6.
Worked Example
Example 3: What is cumulative frequency?
Solution

It is a running total.

  • Cumulative frequency is the running total of frequencies down (or up) the table.
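As a small illustration, the running total for the marks table can be computed with `itertools.accumulate` from Python's standard library. The frequencies are those of the table above; the "less than" labelling is one common way to present a cumulative frequency and is used here as an assumption.

```python
from itertools import accumulate

upper_limits = [10, 20, 30, 40, 50]
frequencies = [2, 4, 5, 5, 4]

# "Less than" cumulative frequency: running total down the table.
cumulative = list(accumulate(frequencies))

for upper, cf in zip(upper_limits, cumulative):
    print(f"less than {upper}: {cf}")
```

The second entry of `cumulative` is 6, matching the count of students who scored less than 20 marks, and the last entry equals the total number of observations.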

Key Points

    • Build a frequency table: find the range, choose classes, put tally marks, count to get the frequency.
    • Sum of frequencies must equal the number of observations (a check).
    • Cumulative frequency = running total of frequencies (used for median and ogive).
Quick Check (2 questions)
Q1. In a frequency table, the sum of all frequencies must equal the:
Explanation: Every observation is tallied once, so frequencies sum to the number of observations.
Q2. The running total of frequencies down a table is the:
Explanation: The running total of frequencies is the cumulative frequency.