[JAVA] Site Search Help
SKG_Scintill
05-1-2012, 03:59 AM
I was checking if I could make a program that searches for a certain word on a site. If that word was on it, it would do a System.out.println(); if it wasn't, it would simulate pressing F5 and loop the class.
Here's my code:
N.B: There are two classes, so you can't just copy and paste.
//Tabswitch class is so it doesn't loop alt+tab later on
import java.awt.*;
import java.awt.event.*;
public class Tabswitch {
public static void main(String[] args) throws Exception{
Robot r = new Robot();
r.keyPress(KeyEvent.VK_ALT);
r.keyPress(KeyEvent.VK_TAB);
r.delay(100);
r.keyRelease(KeyEvent.VK_ALT);
r.keyRelease(KeyEvent.VK_TAB);
URLReader a = new URLReader();
}
}
//URLReader class is where my question lies
import java.io.*;
import java.net.*;
import java.awt.*;
import java.awt.event.*;
public class URLReader{
public URLReader() throws Exception{
URL site = new URL("http://kaction.com/badfanfiction/");
BufferedReader in = new BufferedReader(
new InputStreamReader(site.openStream()));
String inputLine;
while((inputLine = in.readLine()) != null)
if(inputLine.contains("the"))
System.out.println(inputLine);
else{
Robot s = new Robot();
s.delay(2000);
s.keyPress(KeyEvent.VK_F5);
s.keyRelease(KeyEvent.VK_F5);
s.delay(2000);
URLReader a = new URLReader();
}
}
}
My issue lies in the last "if"-statement. It doesn't print inputLine, it always goes to the "else"-statement.
I was trying to make a .containsnot() of some sorts so it could be a "while"-loop, but to no avail...
Any ideas how to fix this?
P.S: I used badfanfiction so my printline wouldn't be massive.
UserNameGoesHere
05-1-2012, 04:55 AM
The problem is you have an infinite loop of sorts. If the very first line has input but doesn't contain "the", it immediately loads the page again. The recursively spawned URLReader then does the same, ad infinitum.
Do you just want it to read the entire page, catch all lines which have "the", print out only those lines, and only AFTER that reload the page if there weren't any? Then rewrite your while like this
boolean found = false;
while((inputLine = in.readLine()) != null){
if(inputLine.contains("the")){
found = true;
System.out.println(inputLine);
}
}
if(!found){
Robot s = new Robot();
s.delay(2000);
s.keyPress(KeyEvent.VK_F5);
s.keyRelease(KeyEvent.VK_F5);
s.delay(2000);
URLReader a = new URLReader();
}
Is there a reason you're using a recursive solution, by the way? A nonrecursive one would be less resource-hungry. Would you prefer a nonrecursive one?
Also, other than serving up delays when relevant (and those are relevant), what is your Robot doing? I don't think it's doing what you may think it's doing. In particular your Robot is in no way connected with your URL stream and if it is (as I am guessing/assuming) meant for scripting Firefox or another browser in the background, its scripting is independent of everything else this program is trying to do and is unrelated.
To me, the Robot looks useless and I'd remove it unless you can integrate it better with what you're actually trying to do. If it's not needed for anything else, replace the Robot delay calls with Thread.sleep calls instead.
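For what it's worth, a nonrecursive version of the retry logic could look roughly like the sketch below. The class name PollingSearch, the pageLoader parameter and the maxAttempts cap are made up for illustration and are not part of the code above; a real run would pass a loader that reads the URL with site.openStream(). Thread.sleep stands in for the Robot delay calls.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

class PollingSearch {
    /**
     * Loads the page up to maxAttempts times, sleeping delayMillis between loads.
     * Returns the first line that contains word, or null if no attempt had one.
     * pageLoader is a hypothetical hook that returns the lines of one fresh page load.
     */
    static String findFirstLine(Supplier<List<String>> pageLoader, String word,
                                int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            for (String line : pageLoader.get()) {
                if (line.contains(word)) {
                    return line; // stop at the first matching line
                }
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis); // courtesy pause before reloading
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the randomly generated site: a different "page" on each load.
        Iterator<List<String>> pages = Arrays.asList(
                Arrays.asList("<html>", "Harry Potter and Star Trek"),
                Arrays.asList("<html>", "Final Fantasy and Sailor Moon")).iterator();
        System.out.println(findFirstLine(pages::next, "Final", 2, 10));
    }
}
```

A plain loop like this uses the same amount of memory no matter how many reloads it takes, whereas each recursive new URLReader() call adds another stack frame that stays around until a match is found.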
SKG_Scintill
05-1-2012, 05:15 AM
I think it's got to do with the fact that I've only had 3 months of programming class so far ;)
SKG_Scintill
05-1-2012, 05:23 AM
It works when the word is on the first opened page, but when I tried to look for "Final Fantasy" it didn't println() and continued looping.
The robot is to simulate button presses, I couldn't find a refresh action in the java.io.* or java.net.* so this was my way round.
The delays are mostly so I can follow its progress, when it works I make it quicker
UserNameGoesHere
05-1-2012, 05:30 AM
Leave the delays in there. The delays are useful so you don't hammer the website. (At least that's why I assumed you had them).
Well as-is the Robot is scripting some button presses but those have nothing whatsoever to do with the loading, reloading, or reading of that particular website. It really has nothing else to do with what you're trying to do (that I can see) and your program doesn't need it to automate either. I'd remove it. (replace the delay calls with Thread.sleep calls)
SKG_Scintill
05-1-2012, 05:33 AM
I don't know if you have looked at the site itself, but it randomly generates a couple of words. It's to keep refreshing and looking if it randomly generated a given word, such as "Final Fantasy".
I know it's really long-winded to do it this way, but it's what I can understand with my current knowledge.
UserNameGoesHere
05-1-2012, 05:40 AM
Well yes I assume you wanted to stop once one page had your chosen words on it. That's why it stops after one page has the words. Did you want it to continue refreshing/printing indefinitely even after it found a page that had the words on it???
SKG_Scintill
05-1-2012, 05:41 AM
No, but it's doing that right now, after your adjustments :P
If it's there on the first try, it prints. If it comes up on later tries, it continues looping. I don't want that xD
(Looking for things I may have overlooked in your code)
UserNameGoesHere
05-1-2012, 05:46 AM
Okay so state specifically, and clearly, in as precise of detail as you can, exactly what your program is supposed to do? That's the first step.
I can read the code and see what it actually does. I can try to infer or guess what you wanted it to do. But if my guess as to what you wanted is wrong, well my solution will implement my guess lol.
The clearer you explain it the better I can help you. :)
SKG_Scintill
05-1-2012, 05:54 AM
The program is supposed to do this:
1. Switch from my programming software Eclipse to the site itself by simulating alt+tab.
2. Read the text on the given site.
3. Determine if a given word occurs in the text. (or given words occur)
4. If it occurs in the text, it's supposed to stop searching so I can read it. (The println() isn't necessary). The program ends here if this happens.
5. If it doesn't occur in the text, it's supposed to refresh the page. (Which I do by simulating the f5-button)
6. Go back to point 2.
UserNameGoesHere
05-1-2012, 05:57 AM
So you want it to stop the very FIRST line it finds? If it finds ANY lines with desired text, print only the first of such then terminate program?
SKG_Scintill
05-1-2012, 05:58 AM
yes
UserNameGoesHere
05-1-2012, 05:59 AM
put a break after the System.out.println and you get that functionality then.
SKG_Scintill
05-1-2012, 06:01 AM
Still continues searching after personally seeing "Final" come by O.o
public class URLReader{
public URLReader() throws Exception{
URL site = new URL("http://kaction.com/badfanfiction/");
BufferedReader in = new BufferedReader(
new InputStreamReader(site.openStream()));
String inputLine;
boolean found = false;
while((inputLine = in.readLine()) != null){
if(inputLine.contains("Final")){
found = true;
System.out.println(inputLine);
break;
}
}
if(!found){
Robot s = new Robot();
s.delay(2000);
s.keyPress(KeyEvent.VK_F5);
s.keyRelease(KeyEvent.VK_F5);
s.delay(2000);
URLReader a = new URLReader();
}
}
}
UserNameGoesHere
05-1-2012, 06:04 AM
That's because what you're seeing is what your Robot is doing (refreshing page in your browser) but what your program is searching is independent of that. That's what I was trying to get at earlier. ;-)
So it's really hitting up the website twice on each round. Once, to search it for what you want. Then a second time when your browser refreshes the page. The page you see and the page it searches are not the same or in any way connected.
SKG_Scintill
05-1-2012, 06:14 AM
So I get the robot being a motorized stubborn entity, what should I replace it with?
I replaced the delay calls with Thread.sleep(), but what about the keypresses?
UserNameGoesHere
05-1-2012, 06:19 AM
It depends what you want your program to do.
Do you just want it to find this information and inform you of it? Then you do nothing with the keypresses. It already is working in the background -- you just don't get visual confirmation. If you want you could save the .html file for later perusal or save data to a text file of some sort.
Do you want it to integrate with Firefox, automating Firefox itself? Unless Firefox has a command-line way to tell it "Open this within a new tab but within the currently-open window and not a new window" it's going to be hard.
Edit -- figured out how to tell Firefox exactly that from command line. firefox -new-tab http://blahblah
so this should be doable. Hold your horses for a bit then. ;)
SKG_Scintill
05-1-2012, 06:24 AM
I want the visual confirmation tbh :P
Just want it to refresh until it finds the word, stay on that page and refresh no longer.
UserNameGoesHere
05-1-2012, 07:07 AM
import java.io.*;
import java.net.*;
import java.awt.*;
import java.awt.event.*;
public class URLReader{
public static void main(String[] args) throws Exception{
URLReader a = new URLReader();
}
public URLReader() throws Exception{
URL site = new URL("http://kaction.com/badfanfiction/");
BufferedReader in = new BufferedReader(
new InputStreamReader(site.openStream()));
BufferedWriter out = new BufferedWriter(
new OutputStreamWriter(
new FileOutputStream("savedsite.html")));
String inputLine;
boolean found = false;
while((inputLine = in.readLine()) != null){
out.write(inputLine, 0, inputLine.length());
out.newLine(); // readLine() strips the line break, so put it back in the saved file
if(inputLine.contains("the")){
if(!found){
System.out.println(inputLine);
found = true;
}
}
}
out.close();
if(found){
Runtime.getRuntime().exec("firefox -new-tab savedsite.html");
}
else{
Thread.sleep(2000);
Thread.sleep(2000);
in.close();
URLReader b = new URLReader();
}
}
}
Does this do what you want?
SKG_Scintill
05-1-2012, 07:44 AM
Well... it doesn't do anything now
UserNameGoesHere
05-1-2012, 07:36 PM
It should work. It's just really really slow. In particular the URL.openStream() method seems RIDICULOUSLY slow. Let it run a while and see if it eventually finds anything.
Oh and the site it pulls up may have broken gifs, jpegs, etc because it's not pulling down additional resources, but you should be seeing something after a while at least.
If after letting it run for like 5 or 10 minutes it's still not finding anything, see if firefox -new-tab http://www.google.com pulls up Google in a new tab within your Firefox. Because if not, then maybe it's your version of Firefox causing the problem. If that doesn't work, try removing the -new-tab and just let it pop up its own window in Firefox.
As I've coded it, it works exactly as you wanted, on my machine, but RIDICULOUSLY SLOW!
Here's a somewhat nicened-up version
import java.io.*;
import java.net.*;
public class URLReader{
public static void main(String[] args) throws Exception{
new URLReader();
}
public URLReader() throws Exception{
URL site = new URL("http://kaction.com/badfanfiction/");
BufferedReader in = new BufferedReader(
new InputStreamReader(site.openStream()));
BufferedWriter out = new BufferedWriter(
new OutputStreamWriter(
new FileOutputStream("savedsite.html")));
String inputLine;
boolean found = false;
while((inputLine = in.readLine()) != null){
out.write(inputLine, 0, inputLine.length());
out.newLine(); // readLine() strips the line break, so put it back in the saved file
if(inputLine.contains("the")){
if(!found){
System.out.println(inputLine);
found = true;
}
}
}
out.close();
if(found){
Runtime.getRuntime().exec("firefox -new-tab savedsite.html");
}
else{
Thread.sleep(2000);
Thread.sleep(2000);
in.close();
new URLReader();
}
}
}
In particular that should clear up a few compiler warnings.
=================================================================
As modified, here is exactly what the program does.
1: Open a stream to http://kaction.com/badfanfiction/
1b: This takes a really long time. This is where the holdup is.
2: Open an output file named savedsite.html
3: Writes the contents of the HTML stream to savedsite.html
4: If it finds the desired word, it prints the line to standard out
4b: Then it loads savedsite.html in a new tab in Firefox and program ends.
5: If it didn't find anything
5b: Waits around a little both to avoid hammering the website and give it a bit more time to "randomize".
5c: This is NOT where the holdup is. This is merely a courtesy.
6: Then it tries again back from the start.
Notes: It is not a bug that the input stream isn't explicitly closed on a successful find since the program stops at that point and the JVM will automatically close any still-open handles.
The Thread.sleep calls are not ... I repeat are not the cause of the massive holdup. In fact you could remove them if you want and still have the massive holdup. They are there for courtesy to the site itself.
The contents of savedsite.html are overwritten on each round, so once it does find something you'll only see the HTML for the instance where it found it.
Since only the HTML itself is raw-pulled, but not any other files linked within the HTML, any gifs, jpegs, etc will be broken, some JavaScript may be broken (depends how it's called/stored), and so forth.
If you wanted to save the entire page, INCLUDING all externally-linked resources, that would be a lot more code.
Additionally if you wanted to account for JavaScript-generated text, that would be a lot more code.
You will need to manually delete savedsite.html when you are finished (unless you wanted to keep it).
URL.openStream() is just SLOW -- so if you want to speed up the program, you'll need to replace that with some other means of opening/reading the website. This is nothing fixable unless you work for Oracle or something.
Hopefully this gives you some idea of where you want to start for improvements.
And yes it does work as-is. You just need to be patient because it's slow.
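If you want to check where the time actually goes, one option is to open the connection explicitly and time it. URL.openStream() is shorthand for openConnection().getInputStream(), and URLConnection lets you set connect and read timeouts so a stalled request fails instead of hanging. The sketch below is only a starting point: the class name TimedFetch, the 5000 ms timeout and the UTF-8 charset are assumptions, and nothing here is guaranteed to make the fetch itself faster. The demo in main reads a local temporary file through a file: URL so it runs without touching the site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TimedFetch {
    /** Reads the whole resource behind url into a String, one '\n' per line. */
    static String readAll(URL url, int timeoutMillis) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(timeoutMillis); // give up if connecting takes too long
        conn.setReadTimeout(timeoutMillis);    // give up if the server stops sending
        StringBuilder page = new StringBuilder();
        // Assumption: the page is UTF-8; a real site may declare another charset.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Offline demo: a local temp file stands in for the site.
        Path tmp = Files.createTempFile("savedsite", ".html");
        Files.write(tmp, "<html>\nFinal Fantasy\n</html>\n".getBytes(StandardCharsets.UTF_8));
        long start = System.nanoTime();
        String page = readAll(tmp.toUri().toURL(), 5000);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(page.length() + " chars in " + elapsedMillis + " ms");
        Files.delete(tmp);
    }
}
```

Swapping the temp file URL for new URL("http://kaction.com/badfanfiction/") would time the real request. The try-with-resources block also closes the input stream on every path, so nothing is left for the JVM to clean up at exit.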
SKG_Scintill
05-3-2012, 06:00 AM
Upon trying, it did find the text, but also gave an error:
Exception in thread "main" java.io.IOException: Cannot run program "firefox": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(Unknown Source)
at java.lang.Runtime.exec(Unknown Source)
at java.lang.Runtime.exec(Unknown Source)
at java.lang.Runtime.exec(Unknown Source)
at URLReader.<init>(URLReader.java:30)
at URLReader.main(URLReader.java:6)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(Unknown Source)
at java.lang.ProcessImpl.start(Unknown Source)
... 6 more
UserNameGoesHere
05-3-2012, 05:34 PM
Open up a command prompt. In the command prompt type
firefox -new-tab http://www.google.com
Does that work for you? The code assumes that works for you (It works for me).
Because it looks like your system can't find Firefox (which I assumed you were using as your browser.)
If you are using Firefox, you'll need to do one of two things then, if it doesn't work. You could find the full path to your firefox installation and replace the call using the full path. You could change your PATH environment variable to include Firefox's path prior to running the program.
Because it works perfectly fine as-is (although slow) on Ubuntu using GCJ and a default Firefox install. It should work perfectly fine as-is on Windows using Oracle's Java and a recentish Firefox install as well. It should also work perfectly fine as-is on a Mac OSX which has some version of Java and a Firefox install.
Assuming Windows, pretend you installed Firefox to C:\Browsers\Firefox (probably not, but for example). Then you'd do something like (in a command prompt) PATH=%PATH%;C:\Browsers\Firefox or wherever the firefox.exe binary is actually installed to. Then run the program from that same command prompt.
So the full sequence would look something like this
PATH=%PATH%;C:\wherevertheheckyouinstalledfirefox
javac URLReader.java
java URLReader
Wait a minute or two (since it's slow)
Results. :p
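The other option mentioned above, calling Firefox by its full path, might look something like this sketch. The class name BrowserLauncher and the helper buildCommand are made up for illustration, and passing the absolute path of the saved file instead of the bare file name is an assumption on my part. You supply the real firefox.exe location as the first program argument; with no argument the program only prints the command it would run.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class BrowserLauncher {
    /** Builds the command line: <firefox> -new-tab <absolute path of page>. */
    static List<String> buildCommand(String firefoxPath, File page) {
        return Arrays.asList(firefoxPath, "-new-tab", page.getAbsolutePath());
    }

    public static void main(String[] args) throws IOException {
        File page = new File("savedsite.html");
        if (args.length == 0) {
            // No path given: show what would be run, but do not launch anything.
            System.out.println(buildCommand("firefox", page));
            return;
        }
        // args[0] is the full path to the Firefox binary on this machine.
        new ProcessBuilder(buildCommand(args[0], page)).start();
    }
}
```

Passing the arguments as a list to ProcessBuilder also avoids trouble with spaces in the path, which a single command string handed to Runtime.exec would split on.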
SKG_Scintill
05-3-2012, 07:31 PM
It's working now, somehow...
I have to say, I honestly appreciate all the effort you've put into solving this.
The thing is, I mostly wanted to see if I overlooked a small mistake. It has come to a point where I can no longer understand the code xD
This code has now sidetracked to a point where I'm not really learning from my mistakes, but rather following a crash course in programming this one thing.
I think I'll leave it for what it is and continue my programming class regularly until I understand it myself :)
UserNameGoesHere
05-3-2012, 08:26 PM
Ah, well at least you gave it a shot.
Once you have learned more and feel you can understand the code, there are some simple improvements that can be made to it (and some difficult ones).
Simple improvements left as an exercise for when you're more ready for it:
1: Have it get the desired word from the user instead of hardcode.
2: Use a regular expression on the word to allow it to accept variants (for example, both lower and upper case, first letter capital, etc...)
3: Have it get the desired URL from the user instead of hardcode.
4: Error handling -- as-is of course there is no error handling but it's really better to include it.
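As a rough idea of what improvements 1 and 2 could look like together, here is a small sketch. The class name WordMatcher and its two helpers are made up for illustration. It takes the word from the first command-line argument (falling back to "final" if none is given) and matches it case-insensitively; Pattern.quote keeps any regex special characters in the user's word from being interpreted.

```java
import java.util.regex.Pattern;

class WordMatcher {
    /** Compiles a case-insensitive pattern that matches word literally. */
    static Pattern forWord(String word) {
        return Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE);
    }

    /** True if the pattern occurs anywhere in the line. */
    static boolean lineMatches(Pattern pattern, String line) {
        return pattern.matcher(line).find();
    }

    public static void main(String[] args) {
        // The word comes from the user instead of being hardcoded.
        String word = args.length > 0 ? args[0] : "final";
        Pattern pattern = forWord(word);
        System.out.println(lineMatches(pattern, "<b>FINAL Fantasy</b> and"));
    }
}
```

In the reader loop, lineMatches(pattern, inputLine) would take the place of inputLine.contains("the"). The URL could be read from a second argument in the same way.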
Difficult improvements:
1: Have it pull down the entire website AND all linked resources. (gifs, jpegs, CSS files, etc...)
2: Have it properly parse JavaScript to find and pull additional resources that the JavaScript links, including recursively doing so for JavaScript linking JavaScript.
3: Have it properly parse JavaScript to see if the desired word is constructed on-the-fly via JavaScript.
4: Speed up the execution time.
And keep in mind that Java is just slow too. It's quite known for this. Java is not built for speed.
SKG_Scintill
05-7-2012, 07:47 AM
"You gave it a shot" wouldn't do, so I gave it another couple of shots.
I started with the code in the OP and expanded from that.
First thing I noticed is that the InputStreamReader read the entire code of the site. Every time the in.readLine() put something inside my inputLine it was one line further down the entire code.
So unless the word was in the first line of the site code (which is <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"), it would refresh the site, instead of looking through the rest of the code.
There are two classes again.
This is the code that works for me and does what I want it to do:
import java.awt.*;
import java.awt.event.*;
public class Tabswitch {
public static void main(String[] args) throws Exception{
Robot r = new Robot();
r.keyPress(KeyEvent.VK_ALT);
r.keyPress(KeyEvent.VK_TAB);
r.delay(100);
r.keyRelease(KeyEvent.VK_ALT);
r.keyRelease(KeyEvent.VK_TAB);
URLReader a = new URLReader();
}
}
--------------------------------------------------------------------------------------
import java.io.*;
import java.net.*;
import java.awt.*;
import java.awt.event.*;
public class URLReader{
public URLReader() throws Exception{
URL site = new URL("http://kaction.com/badfanfiction/");
BufferedReader in = new BufferedReader(
new InputStreamReader(site.openStream()));
String inputLine;
while((inputLine = in.readLine()) != null){
if(inputLine.contains("</b> and")){
if(inputLine.contains("Final")){
System.exit(0);
}
}
if(inputLine.contains("</b>.")){
if(inputLine.contains("Final")){
System.exit(0);
}
else{
Robot s = new Robot();
Thread.sleep(2000);
s.keyPress(KeyEvent.VK_F5);
s.keyRelease(KeyEvent.VK_F5);
Thread.sleep(2000);
URLReader a = new URLReader();
}
}
}
}
}
It's probably a very unorthodox way of programming, but it's a way I could understand it myself.
UserNameGoesHere
05-7-2012, 08:32 AM
Here is what your code does. Note, your code assumes you already have Firefox open AND that you already have the desired URL page loaded once. It also assumes Alt Tabbing will get focus to Firefox.
Switch focus over to Firefox.
Open and read the site, looking for specific things, exiting if they are found.
If they're not found, refresh the page in Firefox and try again.
Problem is, since the page is dynamically-generated, the page your program parses and the page Firefox loads may not be the same. There is no connection between your Robot code (which scripts the refreshing of Firefox) and your parsing code.
What you have is known as a race condition. Basically, if it works, it's only lucky that it worked. You are banking on the two accesses (once by your program, once by Firefox) to be close enough together such that the randomizer on the site will "randomize" identical results. If the site based randomization on access rather than time, you would almost never get correct results.
The code I presented has no such race conditions and accesses the site only once per round, feeding the results directly into a new tab in firefox.
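To make the difference concrete, here is a stripped-down sketch of the "one access per round" idea: the lines that get searched are the very same lines that get written to the file the browser will show, so the two can never disagree. The class name SnapshotSearch and the helper saveAndSearch are made up for illustration, and the demo uses a hardcoded list of lines and a temporary file in place of the real site and savedsite.html.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class SnapshotSearch {
    /**
     * Writes lines to target (one per line) and reports whether any line
     * contains word. The saved file and the searched text are the same snapshot.
     */
    static boolean saveAndSearch(List<String> lines, String word, Path target)
            throws IOException {
        Files.write(target, lines, StandardCharsets.UTF_8);
        for (String line : lines) {
            if (line.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for one download of the page.
        List<String> lines = Arrays.asList(
                "<html>", "<b>Final Fantasy</b> and <b>Star Trek</b>.", "</html>");
        Path saved = Files.createTempFile("savedsite", ".html");
        boolean found = saveAndSearch(lines, "Final", saved);
        System.out.println(found + " -> " + saved);
    }
}
```

With the Robot approach there are two separate downloads per round, so a match in the program's copy says nothing certain about the copy on screen.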
dAnceguy117
05-7-2012, 10:17 AM
I just gotta say, very nicely done, UserName. that's some comprehensive stuff on solving this problem. the code shows off some nice methods I've never taken a look at, too. *standing ovation*