[00:03] *** Sk1d has joined #archiveteam-bs [00:06] *** Stilett0 has joined #archiveteam-bs [00:08] *** Stiletto has quit IRC (Ping timeout: 252 seconds) [00:36] *** Sk1d has quit IRC (Read error: Operation timed out) [00:39] *** Sk1d has joined #archiveteam-bs [00:52] *** Sk1d has quit IRC (Read error: Operation timed out) [00:55] WHYYYYY [00:55] https://web.archive.org/web/20040610134131/http://www.kibria.de/frhed.html <- the frhed 1.1 beta link there [00:56] the author stuck it behind a stat counter link, and archive.org was unable to penetrate that link to archive the file [00:56] several of the betas and older versions suffered the same fate [00:56] FORTUNATELY, back in the day, I made a local copy [00:56] but there's no way to 'inject' that back into the mirror [00:57] https://www.dropbox.com/s/29ci9ma8ku5zzrt/frhed-v1.1.zip?dl=0 if anyone cares. [00:57] *** Sk1d has joined #archiveteam-bs [00:58] Lord_Nigh: Dropbox? Eww. Why not upload it to IA? [00:59] because i can't create items in IA until they fix their account system so it doesn't embed my email address in the metadata [00:59] I ran into the same issue with the CREXT program a few months ago [01:00] I don't mind my username being embedded in there, but i do mind my email address [01:00] Hrm, how about creating an email address username@provider.com? [01:02] I'd have to recreate my IA account, i think... [01:02] Yeah, true. 
[01:03] the CREXT program is interesting, i actually got far enough to upload the item and was filling in the user provided metadata, then realized the email thing [01:03] and i don't think IA can even EDIT the metadata, or possibly even delete it [01:04] so they just blacked the item out [01:04] which prevents the metadata from being grabbed [01:04] seems a bit dysfunctional system to my eye, tbh [01:05] so the 'crext' name is now permanently an incomplete blacked out item, i don't know if its even possible for them to remove it, or even un-black it out [01:05] its really nuts the way its set up now [01:05] I'm sure they can edit the email address. I've heard that suggestion several times to ask IA to update the metadata after changing an account's email address so the items stay associated with the account. [01:07] *** kiska1 has quit IRC (Read error: Operation timed out) [01:07] *** wmvhater has quit IRC (Read error: Operation timed out) [01:07] *** Mateon1 has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** TC01 has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** mr_archiv has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** ReimuHaku has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** Valentine has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** purplebot has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** sknebel has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** decay has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** Coderjo has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** PurpleSym has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** Jusque has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** pikhq has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** joepie91_ has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** Yurume has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** Ing3b0rg has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** mistym has quit IRC (ircd.choopa.net irc.mzima.net) 
[01:07] *** Selavi has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** Fusl_ has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** wacky has quit IRC (ircd.choopa.net irc.mzima.net) [01:07] *** wmvhater has joined #archiveteam-bs [01:07] *** kiska1 has joined #archiveteam-bs [01:08] huh. as it turns out, the second version of crext (now called c2ext), DID survive [01:08] http://sfprod.shikadi.net/games/clyde2.htm [01:09] that site isn't archived AT ALL [01:09] one sec [01:12] *** Sk1d has quit IRC (Read error: Operation timed out) [01:13] *** Mateon1 has joined #archiveteam-bs [01:13] *** TC01 has joined #archiveteam-bs [01:13] *** mr_archiv has joined #archiveteam-bs [01:13] *** ReimuHaku has joined #archiveteam-bs [01:13] *** Valentine has joined #archiveteam-bs [01:13] *** purplebot has joined #archiveteam-bs [01:13] *** decay has joined #archiveteam-bs [01:13] *** Coderjo has joined #archiveteam-bs [01:13] *** PurpleSym has joined #archiveteam-bs [01:13] *** Jusque has joined #archiveteam-bs [01:13] *** pikhq has joined #archiveteam-bs [01:13] *** joepie91_ has joined #archiveteam-bs [01:13] *** Yurume has joined #archiveteam-bs [01:13] *** Ing3b0rg has joined #archiveteam-bs [01:13] *** mistym has joined #archiveteam-bs [01:13] *** Selavi has joined #archiveteam-bs [01:13] *** Fusl_ has joined #archiveteam-bs [01:13] *** wacky has joined #archiveteam-bs [01:13] *** irc.mzima.net sets mode: +o PurpleSym [01:14] *** sknebel_ has joined #archiveteam-bs [01:17] *** Sk1d has joined #archiveteam-bs [01:23] as for the crext mess, i just emailed the original author and asked if he could repost the original version alongside the new one [01:23] Lord_Nigh: By the way, here's the reference for IA being able to edit that field: https://archive.org/services/docs/api/metadata-schema/index.html#uploader [01:23] "edit access: IA admin" [01:24] ah. 
I'm still stumped as to why that field is an email address and not a username/id [01:52] *** Stiletto has joined #archiveteam-bs [01:55] *** Stiletto has quit IRC (Read error: Operation timed out) [01:58] *** Stiletto has joined #archiveteam-bs [01:59] *** Stilett0 has quit IRC (Read error: Operation timed out) [02:20] *** Stilett0 has joined #archiveteam-bs [02:21] *** Stiletto has quit IRC (Ping timeout: 264 seconds) [02:29] *** Sanqui has quit IRC (Ping timeout: 260 seconds) [02:37] *** Stiletto has joined #archiveteam-bs [02:40] *** Stilett0 has quit IRC (Read error: Operation timed out) [02:42] *** Sanqui has joined #archiveteam-bs [02:50] *** Sk1d has quit IRC (Read error: Operation timed out) [02:53] *** Sk1d has joined #archiveteam-bs [02:54] *** DopefishJ is now known as DFJustin [02:55] *** i0npulse has quit IRC (Ping timeout: 252 seconds) [02:56] *** i0npulse has joined #archiveteam-bs [02:56] *** hook54321 has quit IRC (Ping timeout: 252 seconds) [02:58] *** Sanqui has quit IRC (Read error: Operation timed out) [02:59] *** hook54321 has joined #archiveteam-bs [03:01] *** Sanqui has joined #archiveteam-bs [03:05] *** Sk1d has quit IRC (Read error: Operation timed out) [03:09] *** Sk1d has joined #archiveteam-bs [03:13] *** Mayonaise has quit IRC (Read error: Operation timed out) [03:13] *** Lord_Nigh has quit IRC (Read error: Operation timed out) [03:13] *** SmileyG has joined #archiveteam-bs [03:13] *** REiN^ has quit IRC (Read error: Operation timed out) [03:14] *** RedType has quit IRC (Write error: Broken pipe) [03:14] *** squires has quit IRC (Write error: Broken pipe) [03:14] *** sep332 has quit IRC (Write error: Broken pipe) [03:14] *** kiska1 has quit IRC (Read error: Operation timed out) [03:14] *** wmvhater has quit IRC (Read error: Operation timed out) [03:14] *** dxrt_ has quit IRC (Write error: Broken pipe) [03:14] *** Smiley has quit IRC (Read error: Operation timed out) [03:14] *** aschmitz has quit IRC (Read error: Operation timed out) 
[03:14] *** Mayonaise has joined #archiveteam-bs [03:15] *** Gfy has quit IRC (Read error: Operation timed out) [03:15] *** TigerbotH has quit IRC (Read error: Operation timed out) [03:15] *** Lord_Nigh has joined #archiveteam-bs [03:17] *** Odd0002 has quit IRC (Read error: Operation timed out) [03:17] *** PhrackD has quit IRC (Read error: Operation timed out) [03:17] *** Odd0002 has joined #archiveteam-bs [03:17] *** PotcFdk has quit IRC (Read error: Operation timed out) [03:18] *** Dimtree has quit IRC (Read error: Operation timed out) [03:21] *** Sk1d has quit IRC (Read error: Operation timed out) [03:25] *** Gfy has joined #archiveteam-bs [03:25] *** Sk1d has joined #archiveteam-bs [03:32] *** RedType has joined #archiveteam-bs [03:33] *** aschmitz has joined #archiveteam-bs [03:46] *** antomatic has quit IRC (west.us.hub irc.Prison.NET) [03:46] *** achip has quit IRC (west.us.hub irc.Prison.NET) [03:52] *** achip has joined #archiveteam-bs [04:01] *** achip has quit IRC (Ping timeout: 255 seconds) [04:03] *** antomatic has joined #archiveteam-bs [04:03] *** swebb sets mode: +o antomatic [04:12] *** kiska1 has joined #archiveteam-bs [04:12] *** wmvhater has joined #archiveteam-bs [04:12] *** TigerbotH has joined #archiveteam-bs [04:12] *** squires has joined #archiveteam-bs [04:12] *** REiN^ has joined #archiveteam-bs [04:12] *** dxrt_ has joined #archiveteam-bs [04:12] *** PhrackD has joined #archiveteam-bs [04:13] *** sep332 has joined #archiveteam-bs [04:14] *** PotcFdk has joined #archiveteam-bs [04:18] *** Sk1d has quit IRC (Read error: Operation timed out) [04:19] *** Dimtree has joined #archiveteam-bs [04:20] *** Sk1d has joined #archiveteam-bs [04:21] JAA: oh wow, I've looked for a page like .../metadata-scheme/index.html but never found one, this is great. 
[04:21] The nearest I knew about was https://internetarchive.readthedocs.io/en/latest/metadata.html [04:22] Still seems incomplete, though: missing at least "emulator*" for items and "autoplay" for files. [04:29] *** achip has joined #archiveteam-bs [04:30] *** user is now known as ggus [04:35] *** Sk1d has quit IRC (Read error: Operation timed out) [04:37] *** Sk1d has joined #archiveteam-bs [04:49] *** Sk1d has quit IRC (Read error: Operation timed out) [04:54] *** Sk1d has joined #archiveteam-bs [05:29] *** Stilett0 has joined #archiveteam-bs [05:32] *** Stiletto has quit IRC (Read error: Operation timed out) [06:40] *** Stiletto has joined #archiveteam-bs [06:41] *** Stilett0 has quit IRC (Ping timeout: 264 seconds) [08:22] *** Valentine has quit IRC (Read error: Operation timed out) [09:03] *** Valentine has joined #archiveteam-bs [09:04] *** Sk1d has quit IRC (Read error: Operation timed out) [09:08] *** Sk1d has joined #archiveteam-bs [09:18] *** Valentine has quit IRC (Ping timeout: 506 seconds) [09:24] *** Sk1d has quit IRC (Read error: Operation timed out) [09:26] *** Sk1d has joined #archiveteam-bs [10:19] *** Albardin has joined #archiveteam-bs [10:19] *** kiskabak has joined #archiveteam-bs [10:19] *** w0rmybak has joined #archiveteam-bs [10:19] *** Flashback has joined #archiveteam-bs [10:47] *** Valentine has joined #archiveteam-bs [12:46] *** Valentine has quit IRC (Read error: Operation timed out) [13:03] *** Valentine has joined #archiveteam-bs [14:11] *** wp494 has quit IRC (Ping timeout: 252 seconds) [14:11] *** wp494 has joined #archiveteam-bs [14:31] *** Sk1d has quit IRC (Read error: Operation timed out) [14:33] *** t2t2 has quit IRC (Read error: Operation timed out) [14:34] *** t2t2 has joined #archiveteam-bs [14:36] *** anarcat has joined #archiveteam-bs [14:36] *** Sk1d has joined #archiveteam-bs [14:36] so now i'm at [14:36] sqlite3 cnv.memoriasreveladas.gov.br/cnv.memoriasreveladas.gov.br.db 'SELECT status, COUNT(id) FROM queued_urls 
GROUP BY status' | column -s '|' -t [14:36] done 22988 [14:36] error 84 [14:36] in_progress 2 [14:36] skipped 282551 [14:36] todo 62888 [14:36] (Continuing from #archivebot about wpull following links to everywhere) [14:36] hey anarcat! [14:37] after a brief stint on iab.com, now it's crawling twitter [14:37] hello ggus [14:37] anarcat: Try running wpull with a low concurrency and --debug. That should print the URL filter results, among various other things. [14:38] there are two databases to scrap: sian.an.gov.br and www.usp.br/proin/home/index.php [14:39] DEBUG Robots filter verdict True reason filters [14:39] ? [14:40] "True" isn't much of an explanation... [14:40] i think the problem is filters do not get applied to items that are already queued [14:41] i mean now it's crawling frigging twitter [14:41] https://about.twitter.com/etc/designs/about-twitter/public/js/universal.js [14:41] DEBUG Check in ('about.twitter.com', 443, True) [14:44] *** Stilett0 has joined #archiveteam-bs [14:45] anarcat: No, it should definitely check the filters again when it starts processing a URL. [14:46] (wpull.processor.web.WebProcessorSession._process_loop) [14:47] *** Stiletto has quit IRC (Read error: Operation timed out) [14:48] But yeah, I forgot that it only says "filters", nothing more specific. [14:48] Debugging why a certain URL is grabbed is definitely quite an annoying part of wpull. 
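The per-status tally from the `sqlite3 ... | column -s '|' -t` one-liner above can also be produced from Python, which is handy when polling the queue repeatedly. This is only a minimal sketch: the query text is the one quoted in the log, but the demo table built by `make_demo_db` is a hypothetical, heavily simplified stand-in (just `id` and `status`), not wpull's real schema, and the helper names are made up for illustration.

```python
import sqlite3


def count_by_status(conn):
    """Return {status: count} using the same query quoted in the log."""
    rows = conn.execute(
        "SELECT status, COUNT(id) FROM queued_urls GROUP BY status"
    ).fetchall()
    return dict(rows)


def format_counts(counts):
    """Render the counts as aligned columns, roughly like `column -t`."""
    width = max((len(status) for status in counts), default=0)
    return "\n".join(
        f"{status:<{width}}  {n}" for status, n in sorted(counts.items())
    )


def make_demo_db():
    """Build an in-memory database with a simplified, hypothetical table.

    A real wpull database has more columns; only the two used by the
    query are modelled here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE queued_urls (id INTEGER PRIMARY KEY, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO queued_urls (status) VALUES (?)",
        [("done",), ("done",), ("error",), ("todo",), ("todo",), ("todo",)],
    )
    return conn


if __name__ == "__main__":
    # For a real crawl, replace make_demo_db() with
    # sqlite3.connect("path/to/crawl.db") (path is an assumption).
    print(format_counts(count_by_status(make_demo_db())))
```

Opening the crawl database read-only while wpull is still writing to it is probably the safer choice; the sketch leaves that detail out.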
[14:52] this still has stuff like https://www.facebook.com/bootload/ in the queue [14:52] or brasil247.com [14:54] *** Mateon1 has quit IRC (Read error: Operation timed out) [14:55] *** Mateon1 has joined #archiveteam-bs [14:56] https://github.com/ArchiveTeam/wpull/issues/399 :-) [14:58] * anarcat tempted to block twimg.com as well [14:58] although that's just 500 images [15:00] *** BlueMax has quit IRC (Read error: Connection reset by peer) [15:01] *** Valentine has quit IRC (Read error: Operation timed out) [15:05] *** Stiletto has joined #archiveteam-bs [15:05] *** Stilett0 has quit IRC (Read error: Operation timed out) [15:06] *** Valentine has joined #archiveteam-bs [15:35] going through flickr now, but the count is going down [15:35] todo 7078 [16:13] *** Sk1d has quit IRC (Read error: Operation timed out) [16:14] *** Pixi has quit IRC (Quit: Pixi) [16:15] *** Pixi has joined #archiveteam-bs [16:16] *** Sk1d has joined #archiveteam-bs [16:24] Have we archived any campaign websites related to the upcoming elections in the US yet? 
[16:30] *** Sk1d has quit IRC (Read error: Operation timed out) [16:31] *** fredgido_ has quit IRC (Ping timeout: 632 seconds) [16:34] *** Sk1d has joined #archiveteam-bs [16:36] *** REiN^ has quit IRC (Remote host closed the connection) [17:01] *** Valentine has quit IRC (Ping timeout: 506 seconds) [17:03] *** fredgido_ has joined #archiveteam-bs [17:38] *** alex___ has joined #archiveteam-bs [18:46] *** REiN^ has joined #archiveteam-bs [18:51] *** Sk1d has quit IRC (Read error: Operation timed out) [18:53] *** Sk1d has joined #archiveteam-bs [19:00] I'm worried we've not [19:30] I'm working on some US stuff [19:31] going to try to scrape the campaign sites, twitter and facebook accounts [19:33] but because I'm focusing on ALL the elections - not just the house / senate there are a lot of candidates [19:33] ~26,000, according to vote411.org, which is what I'm scraping to get lists of other stuff to scrape [19:38] *** Sk1d has quit IRC (Read error: Operation timed out) [19:41] *** Sk1d has joined #archiveteam-bs [19:52] *** Sk1d has quit IRC (Read error: Operation timed out) [19:55] *** Sk1d has joined #archiveteam-bs [19:57] SketchCow: What is arkiver up to these days? Warrior seems to be pretty dead without him. [20:03] I do wonder if we could do some kind of archivebot job, via the warrior [20:03] i.e. people can submit stuff, it'll run on warriors.. 
[20:16] Arkiver is either working on his school study, or doing full-time efforts for Internet Archive [20:18] nice [20:24] He's overseen hundreds of terabytes of uploads [20:24] He's happy [20:24] But it does mean it's harder for him to keep track of work here [20:24] So here I fucking am [20:24] Daddy's home [20:25] Giving people a chance to react or talk to me privately, tomorrow the fun starts [20:29] I'm happy to do all I can here, my target/megawarc hellhole is quiet [20:29] i want to archive all of brazil [20:29] :) [20:29] but now i gotta run [20:30] SketchCow: Can you give the user @daiphots write access to the collection archiveteam_chromebot? I’ll bulk-move chromebot’s uploads and make sure they end up there by default. [20:31] Done [20:32] I'll move the current splorp over to it right now [20:33] Done, your splorp will be there in literally a few seconds. [20:34] Thanks, updated the upload script. [20:36] I am throwing some random sorayama images in to make them look better [20:37] avoid sorayama nudes, of course. [20:41] *** Sk1d has quit IRC (Read error: Operation timed out) [20:45] *** Sk1d has joined #archiveteam-bs [20:49] I've gone to different levels of chrome [20:52] SketchCow: I have a query about uploading footage from the Scottish Parliament to IA [20:52] Well, I could upload a screenshot/screenshots of the archived pages and we could use those. [20:52] Do it [20:52] What else [20:52] PurpleSym: I'm an advocate of that - I started doing that with the archivebot stuff, but boy, was it a nightmare [20:53] I had to do a bunch of stuff, like choose only to do like 10-20 an item, and then hand-fix the ones that exploded, and of course poooorrrrrnnnnnnnnnnnnn [20:53] Feel free to yank the chromebot.jpg if you improve it [20:54] Sure, they’re already present in most WARC files. Just need to extract and stitch them into a single image.
[20:54] OK, I'm waiting for a response from a FOI request to determine exactly what is uploaded to where, but once I know what's what I'll start getting it up [20:54] (it's all licensed in a way that allows it to be shared, so copyright's not an issue) [20:54] PurpleSym: I finished adding chromebot.jpgs to everything, replace that shiznat at will [20:58] *** Sk1d has quit IRC (Read error: Operation timed out) [21:02] *** Sk1d has joined #archiveteam-bs [21:08] *** alex___ has quit IRC (Read error: Connection reset by peer) [21:27] *** qw3rty116 has joined #archiveteam-bs [21:37] *** Valentine has joined #archiveteam-bs [21:47] *** Sk1d has quit IRC (Read error: Operation timed out) [21:50] *** Sk1d has joined #archiveteam-bs [22:11] *** Valentine has quit IRC (Ping timeout: 506 seconds) [22:11] *** m007a83 has quit IRC (Read error: Connection reset by peer) [22:30] *** Sk1d has quit IRC (Read error: Operation timed out) [22:33] *** Sk1d has joined #archiveteam-bs [22:39] *** Valentine has joined #archiveteam-bs [22:55] *** ranma has joined #archiveteam-bs [23:03] *** m007a83 has joined #archiveteam-bs [23:07] *** Valentine has quit IRC (Ping timeout: 506 seconds) [23:10] *** Valentine has joined #archiveteam-bs [23:23] *** espes__ has quit IRC (Ping timeout: 265 seconds) [23:26] *** Sk1d has quit IRC (Read error: Operation timed out) [23:29] *** Sk1d has joined #archiveteam-bs [23:36] *** espes__ has joined #archiveteam-bs [23:36] *** Darkstar has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) [23:53] *** Darkstar has joined #archiveteam-bs