00:03 <kennethre> tef: ping
00:17 <SketchCow> Regarding the downloading of the new site you want to archive, kennethre: Remember archive.org wants WARC and the world wants WGET.
00:18 <kennethre> SketchCow: I'm going to see what I can do. It may have to just be json dumps. The site's barely running and they have a heavy ajax interface.
00:18 <kennethre> but we'll see. WARC would be ideal.
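
For context, the WARC output SketchCow is asking for can be produced directly by a WARC-capable wget (the wget-warc fork, later merged into wget 1.14). A minimal sketch, with example.com standing in for the actual site:

    # minimal sketch, assuming a wget build with WARC support
    # --warc-file writes sitename.warc.gz alongside the normal mirror
    wget --mirror --page-requisites --warc-file=sitename "http://example.com/"
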
00:20 <SketchCow> That's fine.
00:20 <SketchCow> Do your best.
01:20 <SketchCow> Over 300 albums added to my jamendo-albums collection today.
01:20 <SketchCow> This is just really good stuff.
01:46 <balrog_ph> Did MultiUpload just croak?
01:48 <nitro2k01> Yep
01:48 <arrith> just wondering about that
01:49 <arrith> multiupload has been unresponsive for a while, same with scroogle.org
01:49 <arrith> been getting nxdomain for scroogle
01:51 <arrith> not finding any officialish news about multiupload on the google, just people reporting issues not getting it to load
02:47 <tef> kennethre: pong
03:22 <SketchCow> http://www.flickr.com/photos/textfiles/sets/72157629411540695/with/6913872479/
03:51 <nitro2k01> "Oh, the keyboards were bulky back then."
10:55 <DoubleJ> So, question about wget: If it's unable to connect to a host (say, due to the Fios router being a piece of junk) does it keep trying that one URL over and over, or does it skip every file until the connection comes back?
15:47 <chronomex> you usually get holes
15:58 <Schbirid> hm, i think i broke something in my item with s3
15:58 <Schbirid> new uploads were not added to the filelist
16:06 <Schbirid> maybe i should have used that bucket size hint
16:26 <DoubleJ> chronomex: So you're saying that any user I have with an unable-to-connect error needs to have that part of their account re-done. Crap.
16:27 <DoubleJ> Though I guess that explains why sometimes losing the internet causes dld-client to bomb and sometimes not. If it does it during --mirror things can recover before wget finishes.
16:28 <DoubleJ> Though if it results in incomplete downloads it should really be a fatal error.
16:30 <yipdw> DoubleJ: wget does do retries, but only up to a point
16:31 <yipdw> default number is 20 retries
16:31 <yipdw> as far as I can tell, the mobileme-grab scripts don't change that behavior
16:32 <DoubleJ> Hmm... I think the 5.5 hours I lost between midnight and waking up may account for more than 20 tries :)
16:32 <yipdw> well
16:32 <yipdw> hm
16:33 <yipdw> yeah, probably
16:33 <yipdw> I don't think wget uses exponential backoff
16:33 <yipdw> yeah it doesn't
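
For reference, the retry behaviour yipdw describes maps onto a few wget options; the values below are illustrative only (not what the mobileme-grab scripts set), and $url stands for whatever is being fetched:

    # defaults: --tries=20 and no exponential backoff
    # --waitretry waits 1s, 2s, ... up to N seconds between retries of a single URL
    wget --tries=40 --waitretry=10 --retry-connrefused --timeout=30 "$url"
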
16:33 <DoubleJ> Lemme try to find how many lines have the connect error. If it's 19 I'll be happy.
16:34 <yipdw> that said, if there was such an error, that should cause dld-single to exit erroneously
16:35 <yipdw> so you shouldn't have multiple users with holes, unless you're running dld-single in an unpredicated loop or something
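
A sketch of the distinction yipdw is drawing, i.e. a loop predicated on the child script's exit status rather than an unconditional one; the dld-single invocation here is a placeholder, see the mobileme-grab README for the real arguments:

    # stop as soon as dld-single exits non-zero, instead of looping forever
    while ./dld-single.sh "$@"; do
        echo "one user finished cleanly, fetching the next"
    done
    echo "dld-single reported an error; stopping so holes don't pile up" >&2
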
16:36 <DoubleJ> grep -i "unable " wget.log | wc -l
16:36 <DoubleJ> 307
16:36 <DoubleJ> D'oh.
16:37 <DoubleJ> I'm running dld-client.
16:37 <DoubleJ> Doesn't that just call dld-single, though?
16:37 <yipdw> yes
16:37 <yipdw> and it should exit on error
16:37 <yipdw> well, once a user's download finishes
16:38 <DoubleJ> Hm. So, either it didn't, or it's happy to try until the end of time.
16:38 <yipdw> point is that the number of users with holes should normally have the maximum number of simultaneous dld processes as its upper bound
16:38 <DoubleJ> Also: The actual error is "unable to resolve host", not "unable to connect" like I said before.
16:38 <yipdw> (also, try seesaw)
16:39 <DoubleJ> In case that changes anything.
16:39 <DoubleJ> Well, yes. Except that this has happened multiple times since I started downloading MobileMe. I think my router isn't happy with being asked to do actual work.
16:39 <yipdw> that should still generate an error exit
16:39 <yipdw> yeah, mine isn't either
16:39 <yipdw> I just use off-hours connection time at work
16:40 <yipdw> they have slightly more robust networking equipment
16:40 <DoubleJ> This would be less of a headache if the current user was able to write HTML and I hadn't been grabbing an endless series of 404s since Friday.
16:40 <DoubleJ> I can't just shrug it off, in case somewhere in that pile was a link he actually managed to code correctly.
16:41 <yipdw> ?
16:41 <DoubleJ> The user who set this question off.
16:41 <DoubleJ> wget --mirror has been working since Friday morning.
16:41 <yipdw> which one is that?
16:42 <DoubleJ> web.me.com/snoozeman
16:42 <DoubleJ> Not downloading the internet as far as I can tell.
16:42 <DoubleJ> Just a very long parade of bad links within his own account.
16:42 <DoubleJ> But like I said, somewhere in that 5.5 hours may have been a good link.
16:42 <DoubleJ> And spending another 4 days on the guy is going to be annoying.
16:49 <yipdw> ha, it's literally a family history
16:54 <DoubleJ> Oh christ.
16:54 <DoubleJ> If he goes back to the Vikings I'm screwed.
16:54 <yipdw> more specifically, it's a large set of pages generated via Reunion
16:55 <yipdw> I guess that user just didn't upload / deleted a bunch of them
16:55 <DoubleJ> Now, the wget log just says "unable to resolve host web.me.com". Is there any way of finding out what URL it was trying?
16:55 <Schbirid> i use ngrep for this kind of stuff
16:55 <DoubleJ> I think he renamed a directory. My URL list has "Ray's Genealogy" but the 404s are just "Genealogy".
16:56 <DoubleJ> Schbirid: 'splain?
16:56 <Schbirid> if you mean realtime checking of http traffic etc
16:56 <DoubleJ> Oh, not realtime. This happened while I was asleep.
16:56 <Schbirid> ah no idea then :(
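
For the curious, the sort of realtime check Schbirid has in mind with ngrep looks roughly like this (the interface name is an assumption):

    # watch outgoing HTTP request lines on the wire as they happen
    sudo ngrep -q -W byline -d eth0 'Host:' 'tcp and port 80'
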
16:56 <SketchCow> GREETINGS FROM THE PRESERVING VIRTUAL WORLDS 2 ADVISORY BOARD MEETING
16:56 <SketchCow> <---in Stanford
16:57 <SketchCow> Also awesome
16:57 <yipdw> I went to Stanford once
16:57 <yipdw> nice campus
16:57 <yipdw> DoubleJ: you can try to correlate it with the URL list, but other than that I'm not sure
16:57 <yipdw> probably better to just retry the download, or move on
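
One way to recover roughly which URL wget was on: in the log, each request is introduced a couple of lines earlier by a "--<timestamp>--  <url>" line, so grepping with leading context gets close. The log filename is the one used earlier in the channel, and the exact log layout varies by wget version:

    # print a few lines of context before each resolver error, then pull out the URLs
    grep -B 3 -i "unable to resolve host" wget.log | grep -o 'http://[^ ]*' | sort -u
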
16:57 <yipdw> I'm downloading snoozeman now
16:58 <DoubleJ> I think I'll just do that.
16:59 <DoubleJ> And thanks. Hopefully he takes less time for you. And no router problems.
16:59 <DoubleJ> I'll have to go through my other logs and see who else has that error.
16:59 <DoubleJ> What's a good way to find all files that have a particular string in them?
16:59 <yipdw> grep -R
17:00 <yipdw> ack also works if you have that installed
17:00 <DoubleJ> So in mobileme-grab/data I say grep -R "unable " and it spits out a file list?
17:00 <yipdw> yes, but you probably will want to scope it to just logs
17:01 <DoubleJ> All right, I'll give that a try after lunch. Thanks.
17:02 <yipdw> actually
17:02 <yipdw> if you just want a list, grep -lR
17:02 <yipdw> there may be better tools, I'm not sure
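
Putting yipdw's suggestion together and scoping it to just the logs might look like this; the data path and log filename are assumptions based on the discussion above:

    # -l lists only the matching filenames; --include restricts the search to wget logs
    grep -Rl --include='wget.log' "unable to resolve host" mobileme-grab/data/
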
17:02 <yipdw> so
17:02 <yipdw> i think the maximum number of 404s I can get on snoozeman is 17,982
17:03 <yipdw> awesome
17:03 <yipdw> oh hey, it's this guy
17:04 <yipdw> http://snoozemanscruiseblog.blogspot.com/
17:04 <yipdw> holy shit, he's done 69 cruises
17:05 <yipdw> wtf
17:05 <yipdw> considering that cruises generally run at least a couple grand per, that's kind of
17:07 <yipdw> conspicuous
17:11 <Soojin> maybe he hides in the cargo area
17:11 <SketchCow> Randall Schwartz has done many cruises.
17:11 <SketchCow> Mostly by being the guy co-running tour groups and the photography.
17:13 <SketchCow> He is on the JoCo Cruise Crazy 2 as we speak.
17:14 <godane> can you guys take like playstation demo discs?
17:14 <SketchCow> I'll gladly take them.
17:14 <SketchCow> I can suggest a better place.
17:14 <godane> i know of sites that have them too
17:14 * SketchCow is sitting here next to two guys who run game archives.
17:15 <SketchCow> JP Dyson of the Strong Museum of Play and the CHEG
17:15 <SketchCow> And Henry Lowood of the Stanford Software Archives
17:15 <SketchCow> Yes, we're all in the same room so this room is awesome now
17:15 <godane> http://www.emuparadise.me/Sony_Playstation_-_Demos_ISOs/25
17:16 <godane> the one rom site has 100s of demos
17:17 <godane> i also found a torrent of 15 years worth of scientific american
17:17 <SketchCow> Yeah, I snagged that
17:17 <SketchCow> The public domain ones are now up on archive.org.
17:18 <SketchCow> http://www.archive.org/details/scientific-american-1845-1909
17:18 <SketchCow> Oh, to explain cruiseman as another thing.
17:18 <godane> i think you are pushing me to buy blu-ray
17:19 <godane> only for archive reasons
17:19 <SketchCow> If you can organize cruise groups (and plane groups too) and you hit the mark of, like, 10 or 15 people I forget, you get along for free.
17:19 <yipdw> oh, hm
17:20 <SketchCow> So assume snoozeman organizes cruises
17:23 <godane> SketchCow: can you move this to software: http://www.archive.org/details/cdrom-linux-format-73
17:24 <SketchCow> You did it!
17:25 <godane> thank you
17:26 <DFJustin> somebody else pointed out that there is other stuff sitting in community texts http://www.archive.org/search.php?query=subject%3A%22cdbbsarchive%22
17:26 <SketchCow> I wish you had made the two CDs separate, but you didn't know and I need to think about it regardless.
17:32 <SketchCow> http://www.archive.org/details/cdrom-linux-format-73&reCache=1
17:35 <godane> sorry about that
17:36 <godane> i found 3 years of pc advisor
17:36 <godane> there was a cd that came in a magazine in 2005
17:36 <balrog> SketchCow: this might be worth archiving: http://adfly.simplaza.net/
17:36 <balrog> (as part of the link shortener project)
17:40 <balrog> the cache contains most of adf.ly and similar links
17:40 <SketchCow> Thanks for the tip, DFJustin - I just yanked all those into the CD archive.
17:41 <SketchCow> They're all Italian CDs!
17:44 <SketchCow> godane: Any chance of scanning the fronts of the DVDs/CDs?
17:58 <godane> maybe
17:59 <ersi> Hm, Waybackmachine is a separate front end that doesn't hang together with Heritrix - right?
18:00 <ersi> Or, I mean - Heritrix seems to "only" be the crawler
18:01 <SketchCow> Before you stamped off, ersi - the umich project turned out to be something alard could handle individually.
18:03 <ersi> Yeah, I figured :-)
18:03 <ersi> there we go, found the open-source wayback project
18:04 <soultcer> As far as I understood it the open source project is different from the wayback machine they use at archive.org?
18:04 <soultcer> Or did that change with the new wayback interface?
18:04 <SketchCow> Supposed to be.
18:04 <SketchCow> They've started to use our wget-warc now. :)
18:04 <SketchCow> In some aspects.
18:05 <ersi> According to http://archive-access.sourceforge.net/projects/wayback/ the current implementation is Perl, this one is Java
18:06 <soultcer> I want to run a crawl of websites from my home country using Heritrix, just to see what I will find
18:06 <ersi> And how would you crawl a country?
18:06 <soultcer> Limit it to the country specific top level domain and domains that resolve to IPs that some geoip db thinks are from that country
18:08 <ersi> alright
18:08 <yipdw> soultcer: one tricky bit: the mapping from IP to domain is one-to-many
18:08 <ersi> I'd say going for the ccTLD is probably the most viable if that's what you're choosing from
18:09 <yipdw> actually it's many-to-many but the other direction isn't so important
18:09 <soultcer> yipdw: Don't worry, I've got that part already covered.
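
A rough sketch of the scoping rule soultcer describes, written as a shell predicate. The ccTLD, the country name, and the use of dig and geoiplookup are illustrative assumptions; a real Heritrix crawl would express this as scope rules instead:

    CC_TLD=".example"       # placeholder ccTLD, e.g. ".at"
    COUNTRY="Examplestan"   # country name as printed by geoiplookup
    # in_scope HOST -> exit 0 if HOST ends in the ccTLD or geolocates to the country
    in_scope() {
        host="$1"
        case "$host" in
            *"$CC_TLD") return 0 ;;
        esac
        ip=$(dig +short "$host" | grep -m1 '^[0-9]')   # first IPv4 answer, if any
        [ -n "$ip" ] && geoiplookup "$ip" | grep -q "$COUNTRY"
    }
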
18:14 <yipdw> ok
19:02 <Schbirid> is "zgrep -h something manyfiles*" not working for anyone else too? it should not list the filenames but for me it does anyways.
19:03 <ersi> Hmmm, I'm considering building a spider/crawler
19:04 <tef> ersi: what in
19:04 <ersi> asking in what language?
19:05 <tef> yes
19:05 <ersi> I was thinking Python
19:05 <tef> ah
19:05 <tef> https://github.com/tef/crawler this might help
19:05 <ersi> What I really want is to map a domain/subdomain
19:05 <tef> although the warcs are terrible
19:06 <ersi> checking it
19:08 <ersi> ooh, there's a HTMLParser? I have only fiddled a little with SGMLParser
19:17 <ersi> actually, yeah I'll check this out ^_^
19:17 <ersi> tef: Hm, I pip installed hanzo-warc-tools - but the crawler can still not find it - suggestions? 'requests' went fine to pip install :o
19:17 <soultcer> ersi: Map a domain/subdomain? You mean crawl the whole page and get a warc as output?
19:18 <tef> ersi: python -m hanzo.warctools should work
19:18 <ersi> No, I'd just like to map all the URLs
19:18 <soultcer> ?
19:19 <ersi> "Find ALL THE URLS!"
19:19 <soultcer> You must have a lot of bandwidth and storage space ;-)
19:19 <ersi> I would only save URLs corresponding to my target domain
19:20 <ersi> note, Resource Locations. Not the *resources*
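
A crude way to get the kind of URL map ersi wants, without writing a crawler at all, is to let wget walk the site in spider mode and scrape the URLs back out of its log. The domain, depth, and awk field position are assumptions; wget's log format shifts slightly between versions:

    # --spider fetches pages only to discover links; the requested URLs show up
    # in log lines of the form "--<timestamp>--  http://..."
    wget --spider -r -l 5 -np "http://example.com/" 2>&1 \
        | grep '^--' | awk '{print $3}' | sort -u > url-map.txt
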
19:20 <soultcer> When we started on urlteam stuff we thought "me, maybe a couple hundred megabytes of shorturls"
19:20 <ersi> tef: Hmmm, I'm not sure what I'm doing wrong here
19:20 <soultcer> bit.ly alone has probably over a TB of urls...
19:21 <tef> ersi: I don't know pip that well :/
19:21 <ersi> tef: How'd you install hanzo-warc-tools? If you were to do it
19:21 <tef> oh I use the hg repo and set PYTHONPATH, but I am like that
19:21 <ersi> oh, heh
19:22 <tef> you could just hg clone from code.hanzoarchives.com
19:25 <tef> but I tend to push to the repo so
19:27 <ersi> hmmmm, I added my hanzo clone to PYTHONPATH but it's not taking it up for some reason
19:29 <ersi> augh, python packaging :(
19:31 <ersi> I'll go poke around in crawler.py instead :]
19:33 <soultcer> ersi: I still don't get the urls idea? what use would it be to only know the urls?
19:34 <ersi> well, you could for example break down the downloading of the target in smaller chunks and distribute the work
19:34 <soultcer> Hm, I guess
19:35 <tef> ersi: as in export PYTHONPATH='..'
19:35 <tef> ersi: maybe you're in a virtualenv
19:36 <ersi> Yeah, but I did it with an absolute location
19:36 <ersi> export PYTHONPATH=/home/ersi/libpython/ where hanzo/warc*/ is located
19:36 <tef> ah you need the repo dir in
19:36 <tef> not the dir containing the repo
19:37 <tef> inside the repo is the package
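
Concretely, what tef means here (the checkout directory name is hypothetical): PYTHONPATH has to point at the clone itself, because the hanzo/ package sits at its top level, not one level up:

    # wrong: the directory that merely contains the clone
    # export PYTHONPATH=/home/ersi/libpython/
    # right: the clone itself, which holds the hanzo/ package
    export PYTHONPATH=/home/ersi/libpython/warc-tools
    python -m hanzo.warctools   # the import check tef suggested earlier
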
19:37
🔗
|
ersi |
Oh, heh |
20:02
🔗
|
ersi |
tef: Have you ran the crawler against any sites? |
20:03
🔗
|
tef |
a few but nothing heavy - the warc output is ugly and needs to be cleaned up |
20:03
🔗
|
tef |
the htmlparser should pick out a few more types of links at least |
20:05
🔗
|
ersi |
oh, found the issue I was getting. you're using the variable 'url' in the HTMLException on line 100 instead of response.url :) |
20:05
🔗
|
ersi |
tef: ^ |
20:08
🔗
|
tef |
whoops |
20:08
🔗
|
tef |
I may not have pushed everything |
20:08
🔗
|
tef |
welp |
20:09
🔗
|
ersi |
Hm, wonder what I'm throwing at it that's making it not able to extract the links :o |
20:13
🔗
|
tef |
hmmm |
20:13
🔗
|
ersi |
haha, my friends stupid webserver closes the connection prematurely |
20:14
🔗
|
ersi |
might be why |
20:14
🔗
|
soultcer |
Welcome to webcrawler 101 |
20:14
🔗
|
soultcer |
Lection 1: Web crawling is like running the most complete fuzz test ever against your program |
20:14
🔗
|
ersi |
How suitable |
20:15
🔗
|
ersi |
I'm a software tester at day |
20:15
🔗
|
tef |
soultcer: yes yes yes |
20:16
🔗
|
tef |
constantly changing fuzzer |
20:16
🔗
|
ersi |
quirky quicky quirky~ |
20:19
🔗
|
tef |
also if any of you are REST types i'd appreciate any comments https://github.com/tef/hate-rpc |
20:19
🔗
|
tef |
but I am going to bed because i have been up for 28 hours. |
20:20
🔗
|
ersi |
websites for robots? woot |
20:28
🔗
|
kennethre |
tef: https://github.com/kennethreitz/convore.json |
20:33
🔗
|
tef |
I saw. Nice :D |
20:38
🔗
|
kennethre |
tef: i'm working on making it browsable html |
21:41
🔗
|
ersi |
argh, I feel so silly using Requests |
21:42
🔗
|
ersi |
or I'm getting used rather :D |
21:43
🔗
|
ersi |
Oh, I'm so silly. :| Don't do requestR.text() instead of requestR.test >_> |
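
In other words, in Requests the decoded body lives on the response as an attribute, not a method. A quick sanity check from the shell, with a placeholder URL:

    # r.text is a property holding the decoded body; calling r.text() raises TypeError
    python -c "import requests; r = requests.get('http://example.com/'); print(r.text[:80])"
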
21:51 <kennethre> ersi: :(
21:52 <kennethre> ersi: ah, gotcha. What do you think so far?
21:52 <ersi> It's too easy!
21:52 <kennethre> <3
21:53 <ersi> It feels awesome, feels pythonic
21:53 <kennethre> perfect