#urlteam 2015-11-09,Mon

Time Nickname Message
00:02 🔗 JesseW Now added to table
00:18 🔗 JesseW There are 4 more claims available from burl.se since we last scraped them. I've added them to the queue, I'll enable the project in a few minutes, unless someone objects ( arkiver, chfoo ...)
00:20 🔗 phuzion has quit IRC (Read error: Operation timed out)
00:27 🔗 JesseW hearing no objections, I've sent out the 4 burl.se claims...
00:28 🔗 JesseW and they're done
00:30 🔗 JesseW producing 193 artisanally scraped short URLs... :-)
01:23 🔗 * JesseW is now going through the "new table" making it more useful (at least to me)
02:11 🔗 phuzion has joined #urlteam
02:16 🔗 chfoo sounds good
02:18 🔗 chfoo i might have missed a few questions but i should be able to answer them now
02:21 🔗 aaaaaaaaa sets mode: +o chfoo
02:31 🔗 JesseW chfoo: ok. well, I've started a new project: migre-me which seems to be going well (over 900,000 links checked)
02:31 🔗 JesseW and I ran a tiny additional batch for burl-se
02:32 🔗 JesseW chfoo: are there any particular things to keep an eye on when running new projects?
02:33 🔗 chfoo JesseW: just make sure that it doesn't get banned or the site goes down
02:34 🔗 JesseW chfoo: what are the signs of it getting banned?
02:34 🔗 chfoo JesseW: the error reports should fill up with unexpected HTTP status codes or something similar
02:35 🔗 JesseW ok, cool; I've been watching http://tracker.archiveteam.org:1337/admin/error_reports
02:35 🔗 JesseW what is an "orphaned error report"?
02:36 🔗 JesseW migre-me does seem to have a few shorturls which return 301, but don't redirect anywhere
02:41 🔗 aaaaaaaaa sets mode: +o JesseW
02:44 🔗 JesseW chfoo: another question -- suggestions about identifying whether a project is worth re-scanning? For incremental ones that allow public creation, it's easy -- just create one, and see how far it is from the last one we scraped; but what about for other ones?
02:47 🔗 chfoo JesseW: orphaned error reports are error reports that no longer have a related job, because the job was completed or deleted.
02:47 🔗 chfoo i haven't thought much about rescanning
02:50 🔗 JesseW ok, makes sense
02:52 🔗 JesseW next question: how do you estimate the number of shorturls for a non-incremental one? alphabet ** length / something? or...
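
One way to read the `alphabet ** length` idea in the question above, as a rough sketch only: summing over every code length up to a maximum gives an upper bound on how many codes a shortener could have issued. The "/ something" divisor is left open in the discussion and is not modeled here; the function name and the sample sizes are illustrative assumptions.

```python
def keyspace_upper_bound(alphabet_size: int, max_length: int) -> int:
    """Count every possible code of length 1..max_length.

    This is only an upper bound: a real shortener has usually
    issued far fewer codes than its keyspace allows.
    """
    return sum(alphabet_size ** length for length in range(1, max_length + 1))


# Hypothetical example: a 36-character alphabet, codes up to 4 characters.
print(keyspace_upper_bound(36, 4))
```

Comparing this bound with the number of hits from a small random sample of codes would give a crude density estimate, though nothing in the log confirms that this is how the project estimates size.
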
03:07 🔗 JesseW yoolink.to appears to be incremental (and currently at 64666)
03:09 🔗 JesseW we scraped it (in 4 dumps) but don't seem to have done a very complete job. I'm tempted to just do a full incremental re-scrape, unless someone remembers what happened...
03:16 🔗 JesseW actually, we did a quite complete job *up to 3 characters*
03:24 🔗 JesseW I'm going to enable it (with autoqueue) starting (just before) 4 characters, now.
03:26 🔗 JesseW and enabled
03:29 🔗 bwn__ has quit IRC (Ping timeout: 252 seconds)
03:32 🔗 JesseW so, there appear to be some permanent errors in some migre.me items (specifically, 17557 & 693427), such that the claims containing them keep failing and getting re-assigned. What do I do about this?
03:32 🔗 JesseW chfoo: ping
03:38 🔗 JesseW and yoolink-to is done again
03:38 🔗 JesseW (slightly overdone, actually :-/ )
04:02 🔗 chazchaz has quit IRC (ny.us.hub irc.umich.edu)
04:02 🔗 Domin- has quit IRC (ny.us.hub irc.umich.edu)
04:06 🔗 svchfoo1 has quit IRC (Read error: Operation timed out)
04:09 🔗 atlogbot has quit IRC (Read error: Operation timed out)
04:15 🔗 chazchaz has joined #urlteam
04:15 🔗 Domin- has joined #urlteam
04:16 🔗 atlogbot has joined #urlteam
04:18 🔗 svchfoo1 has joined #urlteam
04:19 🔗 svchfoo3 sets mode: +o svchfoo1
04:27 🔗 chazchaz has quit IRC (ny.us.hub irc.umich.edu)
04:27 🔗 Domin- has quit IRC (ny.us.hub irc.umich.edu)
05:04 🔗 aaaaaaaaa has quit IRC (Leaving)
05:18 🔗 chazchaz has joined #urlteam
05:23 🔗 Domin_ has joined #urlteam
05:25 🔗 chfoo JesseW: you can stop the queue and add items that skip the broken ones and then delete the item that has the broken url
05:26 🔗 Smiley has joined #urlteam
05:27 🔗 SmileyG has quit IRC (Read error: Operation timed out)
05:28 🔗 Barry has quit IRC (Read error: Operation timed out)
05:28 🔗 Barry has joined #urlteam
05:44 🔗 JesseW chfoo: ok, cool, will do that, thanks
05:49 🔗 WinterFox has joined #urlteam
06:35 🔗 atlogbot has quit IRC (Ping timeout: 369 seconds)
06:38 🔗 Domin_ has quit IRC (Read error: Operation timed out)
06:39 🔗 Domin_ has joined #urlteam
06:39 🔗 svchfoo1 has quit IRC (Ping timeout: 369 seconds)
06:40 🔗 svchfoo1 has joined #urlteam
06:40 🔗 atlogbot has joined #urlteam
06:41 🔗 svchfoo3 sets mode: +o svchfoo1
06:41 🔗 Atluxity has quit IRC (hub.se irc.underworld.no)
06:41 🔗 bwn has joined #urlteam
06:51 🔗 Smiley has quit IRC (Read error: Operation timed out)
06:58 🔗 JesseW doing another run of vbly.us -- from 2jp6 up to 2mwv
07:13 🔗 JesseW and done
07:13 🔗 JesseW about 4,000 new shorturls saved.
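
As a sanity check on the size of that run, the two endpoints can be treated as numbers. This sketch assumes vbly.us codes are plain base-36 (digits then lowercase letters), which the log does not state; the helper name is hypothetical.

```python
def codes_in_range(first: str, last: str, base: int = 36) -> int:
    """Number of sequential codes from first to last, inclusive.

    Assumes the codes are ordinary base-N numerals, so Python's
    built-in int(text, base) can decode them directly.
    """
    return int(last, base) - int(first, base) + 1


print(codes_in_range("2jp6", "2mwv"))
```

Under that assumption the range spans a little over 4,000 codes, which is in line with the "about 4,000 new shorturls" reported above.
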
07:22 🔗 WinterFox I'm getting a result for every url on migre.me :P
07:23 🔗 JesseW WinterFox: yep, it's incremental, so you should be. :-)
07:23 🔗 WinterFox Oh that's pretty neat
07:23 🔗 JesseW WinterFox: check out http://archiveteam.org/index.php?title=URLTeam#Warrior_projects --hopefully what I wrote there makes sense
07:24 🔗 WinterFox Makes sense
07:24 🔗 JesseW migre.me should give us nearly all results up to somewhere around #420,000,000, which is around rZIfF
07:25 🔗 JesseW we're currently only at about 2.4 million
07:25 🔗 JesseW not 420 million. :-)
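
The mapping between a sequence number and a short code such as rZIfF can be sketched as a base-62 conversion. The alphabet order below (digits, then lowercase, then uppercase) is an assumption, not something the log confirms for migre.me, and the function names are illustrative.

```python
import string

# Assumed alphabet order; migre.me's real ordering is not given in the log.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def encode(number: int) -> str:
    """Convert a non-negative sequence number to a short code."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode(code: str) -> int:
    """Convert a short code back to its sequence number."""
    number = 0
    for char in code:
        number = number * BASE + ALPHABET.index(char)
    return number


print(decode("rZIfF"))
```

With this assumed ordering, rZIfF decodes to a value a little under 414 million, roughly in the neighborhood of the 420 million figure quoted above; a different alphabet order would give a different number.
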
07:25 🔗 WinterFox Do you guys plan on releasing a new torrent with all the urls?
07:25 🔗 JesseW we release new urls daily, as items on the Internet Archive. They are downloadable as torrents from there.
07:26 🔗 WinterFox Is there any way to mass download them all from the internet archive?
07:26 🔗 JesseW e.g. https://archive.org/details/urlteam_2015-11-07-20-00-08 is the one from yesterday
07:27 🔗 JesseW the names follow a pretty clear pattern, so you can generate a list (via IA Advanced Search), then pipe that into a bittorrent client, and get them that way
07:27 🔗 JesseW (that's how I did it)
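
A rough sketch of the "generate a list, then feed it to a torrent client" approach. It only builds URLs and makes no network requests. The search endpoint, its query parameters, the `identifier:urlteam_*` query, and the `_archive.torrent` file naming are all assumptions about the Internet Archive's public interface that should be checked against its documentation; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed Advanced Search endpoint; verify against the IA documentation.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(prefix: str = "urlteam_", rows: int = 1000) -> str:
    """Build a search URL listing identifiers that start with prefix."""
    params = {
        "q": f"identifier:{prefix}*",  # assumed query syntax
        "fl[]": "identifier",          # only return the identifier field
        "rows": rows,
        "output": "json",
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def torrent_url(identifier: str) -> str:
    """Build the assumed torrent URL for one item identifier."""
    return f"https://archive.org/download/{identifier}/{identifier}_archive.torrent"


print(build_search_url())
print(torrent_url("urlteam_2015-11-07-20-00-08"))
```

The identifiers returned by the search would be mapped through `torrent_url` and the resulting list handed to a BitTorrent client.
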
07:28 🔗 WinterFox Sounds easy enough
07:28 🔗 JesseW Some of the torrents are broken, because IA's internal tool for moving items around doesn't update the webseed URLs in the torrents when it moves stuff. I've reported the problem, but it hasn't gotten to the top of the priority list yet.
07:29 🔗 WinterFox I just pulled the Warrior Docker container onto my server a few days ago. At 900,000 urls scanned so far
07:29 🔗 JesseW You can just download those few manually (although most of the time, you can also get them from my seed (although that is down ATM))
07:29 🔗 JesseW WinterFox: excellent! thank you for doing that.
07:29 🔗 Deewiant has joined #urlteam
07:30 🔗 WinterFox No problem. My server is running at 1% cpu anyway and it feels like a waste to not use it for something :P
07:30 🔗 WinterFox Going to run an IPFS node on it later
07:31 🔗 JesseW we probably should consider making a giant item combining all the individual ones from 2014 (although we only started making daily dumps in Nov 2014) and from 2015, once the year is over.
07:32 🔗 JesseW WinterFox: how much storage does it have attached. You could consider participating in http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK
07:32 🔗 JesseW er, s/attached./attached?/
07:32 🔗 WinterFox Right now it's got 1TB free space but I'm getting a few 5TB drives soon
07:33 🔗 JesseW nice; if you feel like donating one of them to the cause, IA.BAK could certainly use it. :-)
07:33 🔗 WinterFox Could you give a bit of a tl;dr on what that is?
07:34 🔗 JesseW sure, although we should probably move over to #internetarchive.bak for a longer discussion
07:34 🔗 WinterFox Joined the channel
08:13 🔗 Fletcher has joined #urlteam
08:13 🔗 JesseW has quit IRC (Leaving.)
08:15 🔗 Smiley has joined #urlteam
08:41 🔗 bwn_ has joined #urlteam
08:45 🔗 bwn has quit IRC (Read error: Operation timed out)
09:25 🔗 bwn_ is now known as bwn
11:02 🔗 VADemon has joined #urlteam
12:19 🔗 WinterFox has quit IRC (Remote host closed the connection)
12:36 🔗 bwn has quit IRC (Read error: Operation timed out)
12:49 🔗 chazchaz has quit IRC (Ping timeout: 186 seconds)
12:50 🔗 chazchaz has joined #urlteam
13:16 🔗 bwn has joined #urlteam
13:56 🔗 dashcloud has quit IRC (Read error: Operation timed out)
13:59 🔗 dashcloud has joined #urlteam
14:00 🔗 svchfoo1 sets mode: +o dashcloud
14:42 🔗 VADemon has quit IRC (left4dead)
17:00 🔗 JesseW has joined #urlteam
17:01 🔗 svchfoo1 sets mode: +o JesseW
17:09 🔗 JesseW A bunch of bad items in migre.me; I'll fix them once I get a chance. Until then, sorry about the borked items...
17:22 🔗 JesseW has quit IRC (Leaving.)
17:43 🔗 Domin_ has quit IRC (hub.efnet.us irc.umich.edu)
18:17 🔗 Start has quit IRC (Read error: Connection reset by peer)
18:18 🔗 Start has joined #urlteam
19:56 🔗 aaaaaaaaa has joined #urlteam
19:56 🔗 swebb sets mode: +o aaaaaaaaa
20:15 🔗 bwn has quit IRC (Read error: Operation timed out)
20:26 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
20:45 🔗 bwn has joined #urlteam
20:45 🔗 SmileyG has joined #urlteam
20:52 🔗 Smiley has quit IRC (hub.se efnet.port80.se)
20:52 🔗 Fletcher has quit IRC (hub.se efnet.port80.se)
21:01 🔗 Fletcher_ has joined #urlteam
21:17 🔗 slang has quit IRC (Ping timeout: 240 seconds)
21:32 🔗 Fletcher_ is now known as fletcher
21:32 🔗 fletcher is now known as Fletcher
22:04 🔗 slang has joined #urlteam
22:36 🔗 JW_work has quit IRC (Read error: Connection reset by peer)
22:45 🔗 svchfoo1 has quit IRC (Read error: Operation timed out)
22:47 🔗 chazchaz has quit IRC (Read error: Operation timed out)
22:48 🔗 atlogbot has quit IRC (Read error: Operation timed out)
22:49 🔗 atlogbot has joined #urlteam
22:51 🔗 svchfoo1 has joined #urlteam
22:51 🔗 svchfoo3 sets mode: +o svchfoo1
22:54 🔗 chazchaz has joined #urlteam
23:11 🔗 aaaaaaaaa has joined #urlteam
23:11 🔗 swebb sets mode: +o aaaaaaaaa
23:18 🔗 bwn has quit IRC (Read error: Operation timed out)
23:28 🔗 aaaaaaaa_ has joined #urlteam
23:28 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
23:28 🔗 swebb sets mode: +o aaaaaaaa_
23:29 🔗 aaaaaaaa_ is now known as aaaaaaaaa