Time |
Nickname |
Message |
00:02
🔗
|
JesseW |
Now added to table |
00:18
🔗
|
JesseW |
There are 4 more claims available from burl.se since we last scraped them. I've added them to the queue, I'll enable the project in a few minutes, unless someone objects ( arkiver, chfoo ...) |
00:20
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
00:27
🔗
|
JesseW |
hearing no objections, I've sent out the 4 burl.se claims... |
00:28
🔗
|
JesseW |
and they're done |
00:30
🔗
|
JesseW |
producing 193 artisanally scraped short URLs... :-) |
01:23
🔗
|
* |
JesseW is now going through the "new table" making it more useful (at least to me) |
02:11
🔗
|
|
phuzion has joined #urlteam |
02:16
🔗
|
chfoo |
sounds good |
02:18
🔗
|
chfoo |
i might have missed a few questions but i should be able to answer them now |
02:21
🔗
|
|
aaaaaaaaa sets mode: +o chfoo |
02:31
🔗
|
JesseW |
chfoo: ok. well, I've started a new project: migre-me which seems to be going well (over 900,000 links checked) |
02:31
🔗
|
JesseW |
and I ran a tiny additional batch for burl-se |
02:32
🔗
|
JesseW |
chfoo: are there any particular things to keep an eye on when running new projects? |
02:33
🔗
|
chfoo |
JesseW: just make sure that it doesn't get banned and that the site doesn't go down |
02:34
🔗
|
JesseW |
chfoo: what are the signs of it getting banned? |
02:34
🔗
|
chfoo |
JesseW: the error reports should fill up with the http status code different or something similar |
02:35
🔗
|
JesseW |
ok, cool; I've been watching http://tracker.archiveteam.org:1337/admin/error_reports |
02:35
🔗
|
JesseW |
what is an "orphaned error report"? |
02:36
🔗
|
JesseW |
migre-me does seem to have a few shorturls which return 301, but don't redirect anywhere |
02:41
🔗
|
|
aaaaaaaaa sets mode: +o JesseW |
02:44
🔗
|
JesseW |
chfoo: another question -- suggestions about identifying whether a project is worth re-scanning? For incremental ones that allow public creation, it's easy -- just create one, and see how far it is from the last one we scraped; but what about for other ones? |
02:47
🔗
|
chfoo |
JesseW: orphaned error reports are error reports whose related job no longer exists, because the job was completed or deleted. |
02:47
🔗
|
chfoo |
i haven't thought much about rescanning |
02:50
🔗
|
JesseW |
ok, makes sense |
02:52
🔗
|
JesseW |
next question: how do you estimate the number of shorturls for a non-incremental one? alphabet ** length / something? or... |
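A minimal sketch of the "alphabet ** length / something" estimate asked about above. It assumes a 62-character alphabet and treats the "something" as a hit rate measured on a uniform random sample of codes; both are assumptions for illustration, and `keyspace_size` and `estimate_used` are hypothetical helper names, not part of the tracker.

```python
import string

# Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def keyspace_size(alphabet: str, max_length: int) -> int:
    """Count every possible code of length 1..max_length over the alphabet."""
    base = len(set(alphabet))
    return sum(base ** n for n in range(1, max_length + 1))


def estimate_used(keyspace: int, sampled: int, hits: int) -> int:
    """Scale the hit rate of a uniform random sample up to the whole keyspace."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return round(keyspace * hits / sampled)
```

For example, if 50 of 1,000 randomly chosen codes of up to 4 characters resolve, `estimate_used(keyspace_size(ALPHABET, 4), 1000, 50)` scales that 5% hit rate to the full keyspace. The estimate is only as good as the sample, so it is a rough upper-level guide rather than a count.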
03:07
🔗
|
JesseW |
yoolink.to appears to be incremental (and currently at 64666) |
03:09
🔗
|
JesseW |
we scraped it (in 4 dumps) but don't seem to have done a very complete job. I'm tempted to just do a full incremental re-scrape, unless someone remembers what happened... |
03:16
🔗
|
JesseW |
actually, we did a quite complete job *up to 3 characters* |
03:24
🔗
|
JesseW |
I'm going to enable it (with autoqueue) starting (just before) 4 characters, now. |
03:26
🔗
|
JesseW |
and enabled |
03:29
🔗
|
|
bwn__ has quit IRC (Ping timeout: 252 seconds) |
03:32
🔗
|
JesseW |
so, there appear to be some permanent errors in some migre.me items (specifically, 17557 & 693427), such that the claims containing them keep failing and getting re-assigned. What do I do about this? |
03:32
🔗
|
JesseW |
chfoo: ping |
03:38
🔗
|
JesseW |
and yoolink-to is done again |
03:38
🔗
|
JesseW |
(slightly overdone, actually :-/ ) |
04:02
🔗
|
|
chazchaz has quit IRC (ny.us.hub irc.umich.edu) |
04:02
🔗
|
|
Domin- has quit IRC (ny.us.hub irc.umich.edu) |
04:06
🔗
|
|
svchfoo1 has quit IRC (Read error: Operation timed out) |
04:09
🔗
|
|
atlogbot has quit IRC (Read error: Operation timed out) |
04:15
🔗
|
|
chazchaz has joined #urlteam |
04:15
🔗
|
|
Domin- has joined #urlteam |
04:16
🔗
|
|
atlogbot has joined #urlteam |
04:18
🔗
|
|
svchfoo1 has joined #urlteam |
04:19
🔗
|
|
svchfoo3 sets mode: +o svchfoo1 |
04:27
🔗
|
|
chazchaz has quit IRC (ny.us.hub irc.umich.edu) |
04:27
🔗
|
|
Domin- has quit IRC (ny.us.hub irc.umich.edu) |
05:04
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
05:18
🔗
|
|
chazchaz has joined #urlteam |
05:23
🔗
|
|
Domin_ has joined #urlteam |
05:25
🔗
|
chfoo |
JesseW: you can stop the queue and add items that skip the broken ones and then delete the item that has the broken url |
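A rough sketch of the "add items that skip the broken ones" step. It assumes an item can be described as an inclusive numeric range and that the broken entries are known by number; the tracker's real item format is not shown in this log, and `split_around` is a hypothetical helper.

```python
def split_around(lower: int, upper: int, broken: list[int]) -> list[tuple[int, int]]:
    """Return inclusive (start, end) ranges covering lower..upper minus the broken ids."""
    ranges = []
    start = lower
    for bad in sorted(set(broken)):
        if bad < lower or bad > upper:
            continue  # this broken id is outside the range being split
        if bad > start:
            ranges.append((start, bad - 1))
        start = bad + 1  # resume just past the broken id
    if start <= upper:
        ranges.append((start, upper))
    return ranges
```

Each returned range would then be queued as its own item, and the original item containing the broken entry deleted, as described above.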
05:26
🔗
|
|
Smiley has joined #urlteam |
05:27
🔗
|
|
SmileyG has quit IRC (Read error: Operation timed out) |
05:28
🔗
|
|
Barry has quit IRC (Read error: Operation timed out) |
05:28
🔗
|
|
Barry has joined #urlteam |
05:44
🔗
|
JesseW |
chfoo: ok, cool, will do that, thanks |
05:49
🔗
|
|
WinterFox has joined #urlteam |
06:35
🔗
|
|
atlogbot has quit IRC (Ping timeout: 369 seconds) |
06:38
🔗
|
|
Domin_ has quit IRC (Read error: Operation timed out) |
06:39
🔗
|
|
Domin_ has joined #urlteam |
06:39
🔗
|
|
svchfoo1 has quit IRC (Ping timeout: 369 seconds) |
06:40
🔗
|
|
svchfoo1 has joined #urlteam |
06:40
🔗
|
|
atlogbot has joined #urlteam |
06:41
🔗
|
|
svchfoo3 sets mode: +o svchfoo1 |
06:41
🔗
|
|
Atluxity has quit IRC (hub.se irc.underworld.no) |
06:41
🔗
|
|
bwn has joined #urlteam |
06:51
🔗
|
|
Smiley has quit IRC (Read error: Operation timed out) |
06:58
🔗
|
JesseW |
doing another run of vbly.us -- from 2jp6 up to 2mwv |
07:13
🔗
|
JesseW |
and done |
07:13
🔗
|
JesseW |
about 4,000 new shorturls saved. |
07:22
🔗
|
WinterFox |
I'm getting a result for every url on migre.me :P |
07:23
🔗
|
JesseW |
WinterFox: yep, it's incremental, so you should be. :-) |
07:23
🔗
|
WinterFox |
Oh that's pretty neat |
07:23
🔗
|
JesseW |
WinterFox: check out http://archiveteam.org/index.php?title=URLTeam#Warrior_projects --hopefully what I wrote there makes sense |
07:24
🔗
|
WinterFox |
Makes sense |
07:24
🔗
|
JesseW |
migre.me should give us nearly all results up to somewhere around #420,000,000, which is around rZIfF |
07:25
🔗
|
JesseW |
we're currently only at about 2.4 million |
07:25
🔗
|
JesseW |
not 420 million. :-) |
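A minimal sketch of how a sequence number maps to a short code on an incremental shortener, and back. The alphabet order (digits, lowercase, uppercase) is an assumption; migre.me's actual ordering is not stated in this log, and `encode` and `decode` are illustrative names.

```python
import string

# Assumed symbol order; a different order gives different codes.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode(n: int, alphabet: str = ALPHABET) -> str:
    """Convert a non-negative sequence number to its short code."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def decode(code: str, alphabet: str = ALPHABET) -> int:
    """Convert a short code back to its sequence number."""
    base = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = 0
    for ch in code:
        n = n * base + index[ch]  # raises KeyError on a symbol outside the alphabet
    return n
```

Under this assumed alphabet, `decode("rZIfF")` gives 413,669,187, which is in the neighborhood of the 420 million figure mentioned above. At about 2.4 million checked, the project would be a little over half a percent of the way there.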
07:25
🔗
|
WinterFox |
Do you guys plan on releasing a new torrent with all the urls? |
07:25
🔗
|
JesseW |
we release new urls daily, as items on the Internet Archive. They are downloadable as torrents from there. |
07:26
🔗
|
WinterFox |
Is there any way to mass download them all from the internet archive? |
07:26
🔗
|
JesseW |
e.g. https://archive.org/details/urlteam_2015-11-07-20-00-08 is the one from yesterday |
07:27
🔗
|
JesseW |
the names follow a pretty clear pattern, so you can generate a list (via IA Advanced Search), then pipe that into a bittorrent client, and get them that way |
07:27
🔗
|
JesseW |
(that's how I did it) |
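A hedged sketch of the list-generation step described above. It assumes the daily items all share the `urlteam_` identifier prefix seen in the example item, that the Advanced Search endpoint is queried with JSON output, and that each item's torrent lives at `<identifier>_archive.torrent` under its download path. The function names are hypothetical, and nothing here contacts the network.

```python
from urllib.parse import urlencode


def search_url(prefix: str = "urlteam_", rows: int = 1000, page: int = 1) -> str:
    """Build an Advanced Search URL listing identifiers that start with prefix."""
    params = [
        ("q", f"identifier:{prefix}*"),
        ("fl[]", "identifier"),  # only return the identifier field
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


def identifiers_from_response(payload: dict) -> list[str]:
    """Pull identifiers out of a decoded JSON search response."""
    return [doc["identifier"] for doc in payload["response"]["docs"]]


def torrent_url(identifier: str) -> str:
    """Assumed location of an item's torrent file."""
    return f"https://archive.org/download/{identifier}/{identifier}_archive.torrent"
```

Fetching `search_url()`, decoding the JSON, and mapping `torrent_url` over the identifiers yields one torrent link per line, which most bittorrent clients can take as a batch.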
07:28
🔗
|
WinterFox |
Sounds easy enough |
07:28
🔗
|
JesseW |
Some of the torrents are broken, because IA's internal tool for moving items around doesn't update the webseed URLs in the torrents when it moves stuff. I've reported the problem, but it hasn't gotten to the top of the priority list yet. |
07:29
🔗
|
WinterFox |
I just pulled the warrior docker container onto my server a few days ago. At 900,000 urls scanned so far |
07:29
🔗
|
JesseW |
You can just download those few manually (although most of the time, you can also get them from my seed (although that is down ATM)) |
07:29
🔗
|
JesseW |
WinterFox: excellent! thank you for doing that. |
07:29
🔗
|
|
Deewiant has joined #urlteam |
07:30
🔗
|
WinterFox |
No problem. My server is running at 1% cpu anyway and it feels like a waste to not use it for something :P |
07:30
🔗
|
WinterFox |
Going to run an IPFS node on it later |
07:31
🔗
|
JesseW |
we probably should consider making a giant item combining all the individual ones from 2014 (although we only started making daily dumps in Nov 2014) and from 2015, once the year is over. |
07:32
🔗
|
JesseW |
WinterFox: how much storage does it have attached. You could consider participating in http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK |
07:32
🔗
|
JesseW |
er, s/attached./attached?/ |
07:32
🔗
|
WinterFox |
Right now it's got 1TB free space but I'm getting a few 5TB drives soon |
07:33
🔗
|
JesseW |
nice; if you feel like donating one of them to the cause, IA.BAK could certainly use it. :-) |
07:33
🔗
|
WinterFox |
Could you give a bit of a tl;dr on what that is? |
07:34
🔗
|
JesseW |
sure, although we should probably move over to #internetarchive.bak for a longer discussion |
07:34
🔗
|
WinterFox |
Joined the channel |
08:13
🔗
|
|
Fletcher has joined #urlteam |
08:13
🔗
|
|
JesseW has quit IRC (Leaving.) |
08:15
🔗
|
|
Smiley has joined #urlteam |
08:41
🔗
|
|
bwn_ has joined #urlteam |
08:45
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
09:25
🔗
|
|
bwn_ is now known as bwn |
11:02
🔗
|
|
VADemon has joined #urlteam |
12:19
🔗
|
|
WinterFox has quit IRC (Remote host closed the connection) |
12:36
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
12:49
🔗
|
|
chazchaz has quit IRC (Ping timeout: 186 seconds) |
12:50
🔗
|
|
chazchaz has joined #urlteam |
13:16
🔗
|
|
bwn has joined #urlteam |
13:56
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
13:59
🔗
|
|
dashcloud has joined #urlteam |
14:00
🔗
|
|
svchfoo1 sets mode: +o dashcloud |
14:42
🔗
|
|
VADemon has quit IRC (left4dead) |
17:00
🔗
|
|
JesseW has joined #urlteam |
17:01
🔗
|
|
svchfoo1 sets mode: +o JesseW |
17:09
🔗
|
JesseW |
A bunch of bad items in migre.me; I'll fix them once I get a chance. Until then, sorry about the borked items... |
17:22
🔗
|
|
JesseW has quit IRC (Leaving.) |
17:43
🔗
|
|
Domin_ has quit IRC (hub.efnet.us irc.umich.edu) |
18:17
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
18:18
🔗
|
|
Start has joined #urlteam |
19:56
🔗
|
|
aaaaaaaaa has joined #urlteam |
19:56
🔗
|
|
swebb sets mode: +o aaaaaaaaa |
20:15
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
20:26
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
20:45
🔗
|
|
bwn has joined #urlteam |
20:45
🔗
|
|
SmileyG has joined #urlteam |
20:52
🔗
|
|
Smiley has quit IRC (hub.se efnet.port80.se) |
20:52
🔗
|
|
Fletcher has quit IRC (hub.se efnet.port80.se) |
21:01
🔗
|
|
Fletcher_ has joined #urlteam |
21:17
🔗
|
|
slang has quit IRC (Ping timeout: 240 seconds) |
21:32
🔗
|
|
Fletcher_ is now known as fletcher |
21:32
🔗
|
|
fletcher is now known as Fletcher |
22:04
🔗
|
|
slang has joined #urlteam |
22:36
🔗
|
|
JW_work has quit IRC (Read error: Connection reset by peer) |
22:45
🔗
|
|
svchfoo1 has quit IRC (Read error: Operation timed out) |
22:47
🔗
|
|
chazchaz has quit IRC (Read error: Operation timed out) |
22:48
🔗
|
|
atlogbot has quit IRC (Read error: Operation timed out) |
22:49
🔗
|
|
atlogbot has joined #urlteam |
22:51
🔗
|
|
svchfoo1 has joined #urlteam |
22:51
🔗
|
|
svchfoo3 sets mode: +o svchfoo1 |
22:54
🔗
|
|
chazchaz has joined #urlteam |
23:11
🔗
|
|
aaaaaaaaa has joined #urlteam |
23:11
🔗
|
|
swebb sets mode: +o aaaaaaaaa |
23:18
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
23:28
🔗
|
|
aaaaaaaa_ has joined #urlteam |
23:28
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
23:28
🔗
|
|
swebb sets mode: +o aaaaaaaa_ |
23:29
🔗
|
|
aaaaaaaa_ is now known as aaaaaaaaa |