Data Science Asked by Qwerto on December 24, 2020
I have a big dataset with 2 classes. 70% of the records belong to the first class.
If I create a training and test set with the "Split Validation" operator using stratified sampling, I get a test set in which 70% of the records belong to the first class.
What can I do to create a test set in which 50% of the records belong to the first class?
You have two options. The first is to count the records in the smaller class and then sample exactly that many examples from the larger class, which gives an exact 50/50 ratio. In RapidMiner you can achieve this using the "Filter Examples" operator combined with "Extract Macro" and "Sample".
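Outside RapidMiner, the logic of this first option can be sketched in plain Python. This is only an illustration under assumptions: the helper name `downsample_to_minority`, the `(features, label)` tuple layout and the fixed seed are made up for the example and are not part of the RapidMiner process.

```python
import random
from collections import defaultdict


def downsample_to_minority(records, seed=42):
    """Return a class-balanced subset of (features, label) records.

    Hypothetical helper that mirrors the idea of
    Filter Examples + Extract Macro + Sample.
    """
    rng = random.Random(seed)

    # Split the records by class label.
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[1]].append(rec)

    # Size of the smallest class (the role of the extracted macro).
    smallest = min(len(group) for group in by_class.values())

    # Draw exactly that many records from every class, without replacement.
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))

    rng.shuffle(balanced)
    return balanced


# Toy data with the same 2/3 to 1/3 split as the Iris example process.
records = [(i, True) for i in range(100)] + [(i, False) for i in range(50)]
balanced = downsample_to_minority(records)
```

Every class is cut down to the size of the smallest one, so the result is balanced regardless of which class happens to be larger.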
The other option is to use only the "Sample" operator with the "balance data" option checked. With absolute sampling you can define the exact number of examples taken from each class. With relative (ratio) sampling you can also reduce the number of records from one class, but hitting an exact 50/50 ratio might be more difficult.
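A small arithmetic sketch of why a ratio can miss an exact 50/50 split. The class sizes (700 and 300) are invented for illustration, and how a given tool rounds the resulting count is an assumption here, not documented RapidMiner behaviour.

```python
# Hypothetical class sizes for a 70/30 dataset of 1000 records.
n_major, n_minor = 700, 300

# Ratio of the larger class to keep so that both classes end up equal.
ratio = n_minor / n_major  # 0.428571...

# Using the full-precision ratio recovers the minority count.
kept_exact = round(n_major * ratio)

# Typing the ratio with only two decimals (0.43) overshoots by one record.
kept_rounded = round(n_major * round(ratio, 2))
```

Here `kept_exact` is 300 while `kept_rounded` is 301, which is why specifying absolute per-class counts is the simpler way to get an exact balance.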
See the RapidMiner process XML below (just copy and paste it into the process view) for an example process.
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000"
expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="39" y="136">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="136">
<list key="function_descriptions">
<parameter key="label" value="if(label !=&quot;Iris-setosa&quot;, TRUE, FALSE)"/>
</list>
<description align="center" color="yellow" colored="true" width="126">Generate a binary class with a 1/3 to 2/3 ratio</description>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="103" name="Multiply" width="90" x="313" y="136"/>
<operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="136">
<list key="filters_list">
<parameter key="filters_entry_key" value="label.equals.false"/>
</list>
<description align="center" color="yellow" colored="true" width="126">Split between the two classes</description>
</operator>
<operator activated="true" class="extract_macro" compatibility="8.2.000" expanded="true" height="68" name="Extract Macro" width="90" x="648" y="136">
<parameter key="macro" value="size_smaller_class"/>
<list key="additional_macros"/>
<description align="center" color="yellow" colored="true" width="126">1) Extract the number of examples from the smaller class</description>
</operator>
<operator activated="true" class="sample" compatibility="8.2.000" expanded="true" height="82" name="Sample" width="90" x="648" y="340">
<parameter key="sample_size" value="%{size_smaller_class}"/>
<list key="sample_size_per_class">
<parameter key="true" value="50"/>
<parameter key="false" value="50"/>
</list>
<list key="sample_ratio_per_class">
<parameter key="true" value="0.5"/>
</list>
<list key="sample_probability_per_class"/>
<description align="center" color="yellow" colored="true" width="126">2) Sample as many examples from the bigger class as there are in the other class.</description>
</operator>
<operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="103" name="Append" width="90" x="849" y="136">
<description align="center" color="yellow" colored="true" width="126">Merge the two classes back together.&lt;br/&gt;&lt;br/&gt;Now both have the same number of examples</description>
</operator>
<operator activated="true" class="sample" compatibility="8.2.000" expanded="true" height="82" name="Sample (2)" width="90" x="447" y="544">
<parameter key="balance_data" value="true"/>
<list key="sample_size_per_class">
<parameter key="true" value="50"/>
<parameter key="false" value="50"/>
</list>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
<description align="center" color="yellow" colored="true" width="126">By using &quot;absolute sampling&quot; with the &quot;balance data&quot; option, you can get an equally balanced result set.&lt;br&gt;</description>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Sample (2)" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Filter Examples" from_port="unmatched example set" to_op="Sample" to_port="example set input"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Append" to_port="example set 1"/>
<connect from_op="Sample" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<connect from_op="Sample (2)" from_port="example set output" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="105"/>
<portSpacing port="sink_result 2" spacing="378"/>
<portSpacing port="sink_result 3" spacing="189"/>
<description align="center" color="green" colored="true" height="118" resized="true" width="683" x="39" y="10">&lt;br&gt; First solution is extracting the number of examples from the smaller class and sampling that many from the larger class. This gives an exact 50/50 ratio.</description>
<description align="center" color="green" colored="true" height="122" resized="true" width="335" x="39" y="488">Second solution is using the&lt;br&gt;&quot;balance data&quot;&lt;br&gt;parameter of the sample operator.&lt;br&gt;With that you can define the ratio, or exact number of examples to be sampled from each class.&lt;br&gt;</description>
</process>
</operator>
</process>
Also feel free to ask further questions, or re-post this one, in the RapidMiner community forum.
Answered by David on December 24, 2020