Basic PHP Form Fill Tutorial
For this tutorial I'm going to assume (yeah, I know, trouble) that you have read the basic scraping tutorial and understand the basics of PHP and Curl.
To start, I'm going to go over some useful tools and practices that make posting to forms much easier, so you don't have to go through the same pitfalls and headaches I did when I first started. This requires you to download a couple of tools that are a must if you don't already have them.
- Webmaster Toolbar - I use this 200 times a day it's a must
- Live HTTP Headers - This is the best tool I've found to track headers
Using the Webmaster toolbar
Once you have this toolbar installed, all we're going to concern ourselves with for this tutorial is the first set of tools, "Disable". These let us switch off anything that might confuse us as to what is "actually" happening when we make the form post in PHP using Curl.
Anytime you want to follow what is going on it is very important to disable:
- Meta Redirects
All of these can make things happen in the background that you don't know about and that you might have to recreate manually in Curl. A few web sites require you to pass a certain referrer in the headers; you'll know whether it's required by turning referrers off and seeing if the form still works.
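If a site does turn out to check the referrer, Curl can send one for you. Here is a minimal sketch, assuming the standard CURLOPT_REFERER and CURLOPT_USERAGENT options; the helper name apply_browser_headers() and the example values are made up for illustration.

```php
<?php
// Hypothetical helper (not from the original tutorial): make a Curl request
// carry the two headers a site is most likely to check.
function apply_browser_headers($ch, string $referer, string $userAgent): void
{
    // Sent as the "Referer" request header (HTTP spells it with one "r").
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    // Sent as the "User-Agent" request header.
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
}

// Usage sketch (nothing is sent until curl_exec() is called):
// $ch = curl_init("http://wordpress.com/signup/?abcde=1");
// apply_browser_headers($ch, "http://wordpress.com/signup/", "Mozilla/5.0 (example)");
```

Only add these if the post fails without them; the fewer headers you have to fake, the less there is to go wrong.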
Using Live HTTP Headers
At first, Live HTTP Headers might seem a tad confusing. What is this? What is all this stuff? If you aren't used to looking at headers, that's the information the web browser sends to the server to get the page. It also shows the headers returned from the server right below that. Let's take a look at some simple headers. I'm only going to cover the basic ones we need for this example, and for this example we're going to use Wordpress.com. To make your life easier I suggest turning off anything that accesses the internet in the background, like a Google PR check or the AdSense toolbar, as you'll see a bunch of strange headers if you don't.
So let's go to http://wordpress.com/signup/
Now open Live HTTP Headers; there should be a blank box that says "HTTP Headers" at the top. Go back to the Wordpress.com signup page and fill in your details for a new account. Then hit Next>>. Back in Live HTTP Headers you should see this.
For some real in depth info in plain english on understanding headers take a visit to:
POST /signup/?abcde=1 HTTP/1.1
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
Headers Received from the server
HTTP/1.x 200 OK
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Set-Cookie: wp_test=WP+Cookie+check; path=/; domain=.wordpress.com
Content-Type: text/html; charset=utf-8
Date: Tue, 14 Aug 2007 07:21:49 GMT
For this exercise we only care about a couple of the headers sent to the server and one response from the server.
The URL, of course. This is much easier than looking at the form and getting it from there, as it's all in one place and you know there are no tricks.
And of course the POST string. This is the data (the form fields) that was sent to the server.
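To see exactly which fields are in a captured POST string, you can let PHP split and decode it. This is a small sketch using the built-in parse_str() on the same string used in the script later in this tutorial; the variable names are just for illustration.

```php
<?php
// POST string as captured in Live HTTP Headers for this tutorial.
$captured = "stage=validate-user-signup&u=&user_name=spamsdr12&pass1=123qwe&pass2=123qwe"
          . "&user_email=sad345lkj%40yahoo.com&tos=1&signup_for=user&Submit=Next+%C2%BB";

// parse_str() splits on "&" and "=", and URL-decodes the names and values.
parse_str($captured, $fields);

// Print one "name = value" pair per line.
foreach ($fields as $name => $value) {
    echo $name, " = ", $value, "\n";
}
```

Reading the decoded list is an easy way to spot hidden fields (like stage or signup_for here) that you never typed into the form yourself.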
- HTTP/1.x 200 OK
This is the response code we received from the server we sent the data to. 200 is good; it means everything went OK. Some other codes you might be familiar with are:
- 302 - Redirect (temporary)
- 301 - Redirect (permanent)
- 404 - Page Not Found
- 5XX - A code in the 500s means there's a server error
So if we see a 302 or 301, that tells us the server is forwarding us somewhere else after we make the post. There are a couple of ways to handle this. The most elegant is curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); which follows the redirects as they come. FOLLOWLOCATION is only an on/off switch, so to cap the number of redirects add curl_setopt($ch, CURLOPT_MAXREDIRS, 10); where the 10 means follow 10 redirects max. The other way is to go to each redirect manually using Curl, but we prefer not to do this as it's a lot of work and FOLLOWLOCATION works most of the time.
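Here is a rough sketch of redirect handling in one place. It assumes the standard Curl options CURLOPT_FOLLOWLOCATION (the on/off switch) and CURLOPT_MAXREDIRS (the limit), and reads the final status with curl_getinfo(). The function names describe_status() and post_following_redirects() are made up for this example.

```php
<?php
// Hypothetical helper: group a status code the same way the list above does.
function describe_status(int $code): string
{
    if ($code >= 200 && $code < 300) { return "ok"; }
    if ($code >= 300 && $code < 400) { return "redirect"; }
    if ($code === 404)               { return "not found"; }
    if ($code >= 500 && $code < 600) { return "server error"; }
    return "other";
}

// Hypothetical helper: POST $post to $url and follow at most $maxRedirects
// redirects. Returns [status code, body], where body is false on failure.
function post_following_redirects(string $url, string $post, int $maxRedirects = 10): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // on/off switch
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects);  // the actual limit
    $body = curl_exec($ch);
    // Status of the last response, i.e. after any redirects were followed.
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$code, $body];
}

// Usage sketch:
// [$code, $html] = post_following_redirects($url, $post);
// echo describe_status($code);
```

One thing to keep in mind: by default libcurl switches from POST to GET when it follows a 301, 302 or 303, which is also what browsers generally do after a form post.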
Finally the actual scripting
Whole script -
The script for our super simple account creation on Wordpress.com is going to look like this.
<?php
$url = "http://wordpress.com/signup/?abcde=1";
$post = "stage=validate-user-signup&u=&user_name=spamsdr12&pass1=123qwe&pass2=123qwe&user_email=sad345lkj%40yahoo.com&tos=1&signup_for=user&Submit=Next+%C2%BB";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
echo curl_exec($ch);
curl_close($ch);
?>
Script Explanation -
Here we go. Lines 1 and 9 (<?php and ?>) are of course the delimiters that tell the PHP engine, "Hey, process this as PHP."
$url = "http://wordpress.com/signup/?abcde=1";
Here we're setting the variable $url to the location to post to, which we found using Live HTTP Headers above.
$post = "stage=validate-user-signup&u=&user_name=spamsdr12&pass1=123qwe&pass2=123qwe&user_email=sad345lkj%40yahoo.com&tos=1&signup_for=user&Submit=Next+%C2%BB";
This is one of the two most interesting lines in this script. It is the actual post string that Curl is going to take and send to the server with your data. You'll see the fields you filled out in the form, such as user_name, pass1 and pass2. So if you wanted to make more than one account, you might swap those out with variables like $user_name, $password1 and $password2. Then as you go through your loop, just get new usernames, passwords or whatever you chose. You'll notice that the email symbol @ looks funny (%40). This is called urlencoding. When you pass a string to CURLOPT_POSTFIELDS, Curl does not do this for you; it sends the string exactly as you wrote it. So any value you drop into the string should be encoded first, for example $email = urlencode($email);
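Rather than gluing the string together by hand, you can let PHP build and encode it. A minimal sketch, assuming the same field names and values as the captured POST string above and using the built-in http_build_query():

```php
<?php
// Field names and values taken from the POST string in this tutorial.
$fields = [
    "stage"      => "validate-user-signup",
    "u"          => "",
    "user_name"  => "spamsdr12",
    "pass1"      => "123qwe",
    "pass2"      => "123qwe",
    "user_email" => "sad345lkj@yahoo.com",   // plain "@"; it gets encoded below
    "tos"        => 1,
    "signup_for" => "user",
    "Submit"     => "Next \u{00BB}",         // "Next »"; \u{...} needs PHP 7+
];

// http_build_query() URL-encodes every value and joins the pairs with "&".
$post = http_build_query($fields, "", "&");

echo $post, "\n";
```

Pass the resulting string to CURLOPT_POSTFIELDS, not the array itself. Handing Curl an array makes it send the form as multipart/form-data instead, which is not what the browser sent in this capture.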
$ch = curl_init($url);
This initializes the Curl object with the URL we set and assigns it to $ch. For more on this, please see the previous PHP and Curl scraping tutorials.
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
This is the other most important new part of this tutorial. This is what does all the actual posting work. Pretty simple huh, just that one little line and the list of field=var&field=var's. You are going to want to use this anytime you need to fill a form.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
Rather than having the page you're getting go straight to the standard output buffer, this makes it so that you can assign the output of the curl_exec() function to a variable.
echo curl_exec($ch);
As a shortcut here I decided to just print the page that comes back to the screen rather than assign it to a variable. I know there's a string (the web page) coming back from the server, so rather than assigning it to a variable and then echoing that variable I did it in one shot.
curl_close($ch);
The last line of code just closes the Curl object so it releases its memory back to the system to be used again. This isn't required, as it will be destroyed and returned when the script finishes, but it's good practice. (On PHP 8.0 and later curl_close() has no effect, because the handle is freed automatically.)
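If you would rather keep the page in a variable than echo it, here is one way it might look. This is only a sketch: the function name post_form() is made up, and it relies on curl_exec() returning false on failure and curl_error() describing what went wrong.

```php
<?php
// Hypothetical helper: POST a form and return the response body as a string.
function post_form(string $url, string $post): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // With RETURNTRANSFER on, curl_exec() returns the page, or false on failure.
    $html = curl_exec($ch);
    if ($html === false) {
        // curl_error() gives a readable reason (DNS failure, timeout, ...).
        throw new RuntimeException("Curl request failed: " . curl_error($ch));
    }
    return $html;
}

// Usage sketch:
// $html = post_form($url, $post);
// echo strlen($html), " bytes received\n";
```

Having the page in $html means you can search it for a success or error message instead of eyeballing the output.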
Trying things out -
Normally I make a link to see the result but as I don't want people using my server to make accounts on wordpress.com I'm not putting it up this time.
Other things to try -
Now, let's try a couple of things to make sure you have it down.
- Make a CSV file with one account's worth of data per line, then run through it and make the accounts from that file. Hint: you could use the same password for every account, so it wouldn't need to be in the file.
- Do something else cool.
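As a starting point for the CSV exercise, here is a rough sketch that only builds the POST strings; sending them is left to you. The CSV layout (user_name,user_email per line), the function name post_strings_from_csv() and the shared password are all assumptions made for this example.

```php
<?php
// Hypothetical helper: read "user_name,user_email" rows from a CSV file and
// return one URL-encoded POST string per account, all using one password.
function post_strings_from_csv(string $csvPath, string $password): array
{
    $fh = fopen($csvPath, "r");
    if ($fh === false) {
        throw new RuntimeException("Cannot open $csvPath");
    }

    $posts = [];
    // fgetcsv() returns one row as an array, or false at end of file.
    while (($row = fgetcsv($fh, 0, ",", '"', "\\")) !== false) {
        if (count($row) < 2) {
            continue; // skip blank or incomplete lines
        }
        [$userName, $email] = $row;
        $posts[] = http_build_query([
            "stage"      => "validate-user-signup",
            "u"          => "",
            "user_name"  => $userName,
            "pass1"      => $password,
            "pass2"      => $password,
            "user_email" => $email,
            "tos"        => 1,
            "signup_for" => "user",
            "Submit"     => "Next \u{00BB}",
        ], "", "&");
    }
    fclose($fh);
    return $posts;
}

// Usage sketch: loop over the strings and post each one with Curl.
// foreach (post_strings_from_csv("accounts.csv", "123qwe") as $post) { ... }
```

Keeping the string building separate from the posting makes it easy to print the strings first and compare them against what Live HTTP Headers showed.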
And there you have it: we've made the most basic form-posting script there is. Now you have the idea of how we send data to a form and get the resulting page back to work on in our basic PHP program.
In the next tutorial I'll show you how to take that data and do some basic processing on it.
Next: Use PHP and Curl with Cookies