[00:05] *** nightpool has quit IRC (Ping timeout: 258 seconds) [00:07] *** Ghost_of_ has joined #archiveteam [00:07] arkiver: Dude, it's Thanksgiving [00:07] I realize you people are too busy fighting polar bears and getting healthcare to know that [00:08] Here, we stack all our vacation around it [00:10] it's basically the most important non-religious holiday here [00:10] probably because it's always a 4 day weekend [00:27] Right [00:27] SketchCow: you like files right [00:30] Somewhat [00:30] Not entirely the way you do them, but yes. [00:30] SketchCow: have this file https://leclan.ch/public/rsi-image-sources.txt [00:30] *** remsen has quit IRC (Read error: Operation timed out) [00:31] Why. Why do I want this. [00:31] SketchCow: someone told me you would, ask them! [00:31] For everyone that's not insane: I just wrote an archivebot screenshot cleaner. [00:32] Given an archivebot object that has screenshots, it kills any that are under 40k (they tend to be errors and broken) and then finds the biggest screenshot and makes that the "canonical" screenshot. [00:32] The collection is going to look pretty crazy shortly. [00:32] nice [00:33] Are you also planning on doing this with other non-archivebot archiveteam WARCs? [00:34] Maybe. [00:34] Just catching up to archivebot will take this poor fucking machine weeks. [00:34] ok [00:35] But I was getting sad that the thing would spend 20-120 minutes on an item and the main thumbnail was whitespace or an error. [00:35] (Like when the warcplayer tries to "thumbnail" a binary item [00:35] Yeah [00:35] I saw some of those [00:35] https://archive.org/details/archiveteam_archivebot_go_20150913160001 [00:35] SO much better [00:36] nice fix for that [00:36] So the FTP project may need some more changes, but it's probably almost ready to go [00:36] I sent them to Wayback team. [00:36] I think we should find some way of prioritizing FTPs [00:36] I'm sure they'll get back to us, but again, not until next week. 
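The screenshot cleaner described at [00:32] (drop any screenshot under 40k, then make the biggest remaining one the "canonical" screenshot) can be sketched roughly as below. This is an illustration only, not SketchCow's actual script: the `Screenshot` type, the `clean_screenshots` name, and the reading of "40k" as 40 * 1024 bytes are all assumptions.

```python
from typing import List, NamedTuple, Optional, Tuple

# Assumption: "40k" is read here as 40 KiB; the real threshold may be 40,000 bytes.
MIN_BYTES = 40 * 1024


class Screenshot(NamedTuple):
    """Hypothetical record for one screenshot file in an item."""
    name: str
    size: int  # size in bytes


def clean_screenshots(
    shots: List[Screenshot],
) -> Tuple[List[Screenshot], List[Screenshot], Optional[Screenshot]]:
    """Split screenshots into kept/removed and pick the canonical one.

    Screenshots strictly under MIN_BYTES are treated as errors and removed.
    The largest kept screenshot becomes canonical (None if nothing is kept).
    """
    kept = [s for s in shots if s.size >= MIN_BYTES]
    removed = [s for s in shots if s.size < MIN_BYTES]
    canonical = max(kept, key=lambda s: s.size) if kept else None
    return kept, removed, canonical


if __name__ == "__main__":
    example = [
        Screenshot("blank.png", 1_200),
        Screenshot("page_small.png", 55_000),
        Screenshot("page_full.png", 310_000),
    ]
    kept, removed, canonical = clean_screenshots(example)
    print("removed:", [s.name for s in removed])
    print("canonical:", canonical.name if canonical else None)
```

An item where every screenshot is under the threshold ends up with no canonical screenshot in this sketch; what the real cleaner does in that case is not stated in the channel.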
[00:37] thanks [00:37] Chris DiBona might also not get back to us until next week [00:38] So we should go into every proposed FTP and see if it should be archived or not [00:41] for example ftp-cdc.dwd.de contains weather data [00:41] some 130 GB if I remember correct [00:45] 187 GB actually [00:45] arkiver: which reminds me that I've got a few dumps of upstreamless FTP sites that I rescued from mirrorservice.org before it was rebuilt in 2012 -- ftp.cdrom.com, ftp.wiretapped.net, and a few small ones [00:45] what's the best way of getting those to the FTP team? [00:46] The FTPs aren't online anymore? [00:46] question, what happens if an upload to archive.org, from the web uploader is interrupted half-way? [00:46] it then isn't uploaded [00:46] they're probably ones that other people had mirrored as well, but the original sites are long gone, and mirrorservice.org doesn't carry the mirrors of them any more [00:46] ok [00:46] jleclanch: "upload" and "move into item" are two separate steps [00:46] the latter only happens if the former has succeeded [00:47] OK, thanks [00:47] Please upload them to archive.org ats [00:47] jleclanch: however, this is on a per-file basis [00:47] jleclanch: so some files may have uploaded after all [00:47] before it failed [00:47] alright [00:47] arkiver: righto [00:47] if your internet connection is slow you can use a torrent. Create a torrent of the file(s), upload to archive.org, archive.org will download the torrent [00:47] yeah [00:47] I assume one item per site is preferable? [00:48] yes [00:48] should the item contain the files directly, or a .tar of them? [00:48] a tar file would be best [00:49] people can browse tar files directly from archive.org [00:49] joepie91: if i want to upload a set of large items together on one page, but can't upload them all at once, how would you recommend I do it? [00:50] jleclanch: you can add stuff to items later. 
in the HTML5 uploader, add ?identifier=xxx to the URL [00:50] where xxx is the identifier of your item [00:50] you can then select more files, and it will add them to the existing item [00:50] joepie91: thank you [00:50] jleclanch: keep in mind that you shouldn't put too many files in one item [00:50] few hundred tops, generally [00:51] this is about 60 items total [00:51] but they're very large videos [00:51] jleclanch: items or files? [00:51] files, sorry [00:51] jleclanch: how large is large? [00:51] 43g for the full archive [00:51] http://sprunge.us/OSJV [00:51] For whatever reason, the screenshot cleaner's work is taking some time to redraw on the collection [00:52] But I promise I'm dropping the cleaning constantly. [00:52] joepie91: anything else i should know? [00:53] jleclanch: hmm. I'd probably upload each video(+metadata) separately [00:53] as a separate item [00:53] with its own metadata tags [00:53] How come? [00:53] jleclanch: fits better into the "1 item == 1 thing" idea, plus you get to parallelize your uploads [00:54] My connection isn't good enough to parallelize anything heh [00:54] (upload speeds to IA are a known problem, parallelizing helps) [00:54] jleclanch: how bad is it? :P [00:54] 2meg up [00:54] From europe [00:54] you'll probably barely be able to fill that with one thread, on a bad day [00:55] S3 speeds have gotten a little less bad, but I definitely see sub-2mbps upload speeds per upload every now and then [00:56] Who's paying for all this bandwidth? I always feel guilty uploading stuff there [00:56] jleclanch: mystery man :) don't feel guilty! [00:56] that's what it's for [00:56] https://archive.org/donate/ [00:56] also that [00:56] jleclanch: we've been challenging this notion lately (as archiveteam), but the general answer is "there are enough resources, don't worry" [00:57] the upload speeds are also not a network issue [00:57] but this is going into #archiveteam-bs territory [00:57] heh [00:57] Is there a finances breakdown? 
Out of curiosity. [00:57] there *used* to be a very plain one on the donation page [00:57] not sure where it's gone [00:57] IA is a registered library, not sure what accounting requirements come with that [01:01] The Internet Archive is mostly a front for a meth lab [01:01] Brewster == Walter White [01:01] (I'm Jesse) [01:01] They're a damn good library for being a front [01:01] https://www.youtube.com/watch?v=BaO8beu-WpA [01:02] Lose the chili powder and pseudoephedrine [01:02] kid we gotta pull a heist for this methylamine [01:02] competition getting heated like the formic acid [01:02] killin'em softly with the phosphene gases [01:03] I missed one, but https://archive.org/details/archivebot [01:03] (I just restarted it) [01:03] SketchCow: when i walk into fast food restaurants, I tend not to think "who's paying for all this chicken" [01:03] SO MUCH BETTER [01:08] Nice, those are looking good! [01:12] *** xk_id_ has joined #archiveteam [01:12] *** xk_id has quit IRC (Read error: Connection reset by peer) [01:18] *** philpem has quit IRC (Ping timeout: 252 seconds) [01:31] SketchCow: do you punch people too [01:33] Are we going to debate friendship taps again [01:42] *** dashcloud has quit IRC (Read error: Operation timed out) [01:46] *** dashcloud has joined #archiveteam [01:47] arkiver: here you go: https://archive.org/details/@adam_sampson?and[]=subject%3A%22ftp%22 [01:47] arkiver: I've done wiretapped as a torrent because it's big; I'm seeding the file but it's not started downloading it yet [01:49] bear in mind that IA needs to be able to find your computer — needs an open port, and needs to be able to find your IP through DHT or a tracker [01:49] just mentioning since I've run into those issue [01:49] *** WinterFox has quit IRC (Read error: Operation timed out) [01:49] I've used the list of trackers on the AT wiki, of which exactly one appears to be alive ;-) [01:50] cool :P [01:50] anyway, I'll leave it overnight and see how it's doing in the morning -- 
hopefully it'll have picked it up by then :) [01:56] *** WinterFox has joined #archiveteam [01:57] ats: it usually took me up to 10-15 min until the torrent bot have found me. Upload spikes and interruptions (75KB/s out of ~240KB/s) were also a permanent issue. So lean back to enjoy the numbers and possibly leave your smartphone at home the next day to do the job :) [01:57] *** VADemon has quit IRC (left4dead) [01:57] aha, now it's kicked in :) [02:22] *** Ghost_of_ has quit IRC (Quit: Leaving) [02:31] *** xk_id_ has quit IRC (Remote host closed the connection) [02:33] *** SN4T14 has quit IRC (Read error: Operation timed out) [02:37] *** vitzli has joined #archiveteam [02:38] *** WinterFox has quit IRC (Read error: Operation timed out) [02:42] *** WinterFox has joined #archiveteam [02:56] *** Microguru has joined #archiveteam [03:04] *** bwn_ has quit IRC (Read error: Operation timed out) [03:05] OK, I've now dropped a thing that will clean up all the current screenshots. [03:05] I'm also going to add the screencrapper repair to the script [03:08] *** primus104 has quit IRC (Leaving.) [03:27] *** Ungstein1 has quit IRC (Quit: Leaving.) [03:35] *** Ungstein has joined #archiveteam [03:42] *** Ungstein has quit IRC (Quit: Leaving.) [03:44] *** Ungstein has joined #archiveteam [04:04] *** maseck has joined #archiveteam [04:04] *** W1nterFox has joined #archiveteam [04:05] *** maseck_ has quit IRC (Ping timeout: 310 seconds) [04:07] *** ggg_ has joined #archiveteam [04:08] *** WinterFox has quit IRC (Read error: Operation timed out) [04:10] *** bwn_ has joined #archiveteam [04:11] *** maseck_ has joined #archiveteam [04:11] Is there a FAQ for this channel? I'm wondering if/how it's possible to get a copy of gitorious.org/projects/gossamer/ (human powered aircraft model for flightgear) [04:11] *** bwn_ is now known as bwn [04:12] *** maseck has quit IRC (Read error: Operation timed out) [04:36] Like... a copy? [04:38] SketchCow: yes. 
I've helped get the DaSH HPA model published and included into flightgear...I'd like to track down a copy of the "Gossamer Condor" model that was hosted on gitorious (in 2008) and see if that could be revived as well. [04:39] DaSH model is here (alive and well) http://sourceforge.net/p/flightgear/fgaddon/HEAD/tree/trunk/Aircraft/DaSH/ [04:41] it's unfortunate that I had noted the gossamer project a year ago and mistakenly thought it wouldn't be a problem to find again "later" (well, more than a year later). [04:46] * ggg_ notes that "user:chronomex" was working on this. (per http://archiveteam.org/index.php?title=Gitorious) [04:48] SketchCow: FWIW, my "plan B" is to revive the threads on flightgear.org forums to see if any of the original developers of project/gossamer still have local copies. [04:50] joepie91: alright, first video is up: https://archive.org/details/4f87da0d7dc8c8cd67109c44d297acb1Video44001280Seg1Frag - are there checksums anywhere? [05:09] *** ndiddy has joined #archiveteam [05:29] *** godane has joined #archiveteam [05:32] *** jonimus is now known as Jonimus [05:33] * ggg_ wanders off again :) [05:33] *** ggg_ has quit IRC (Quit: Page closed) [05:42] *** remsen has joined #archiveteam [05:46] *** nertzy has joined #archiveteam [05:50] *** bwn has quit IRC (Read error: Connection reset by peer) [05:58] *** bwn has joined #archiveteam [06:00] *** wp494 has quit IRC (Read error: Connection reset by peer) [06:01] *** Sk1d has quit IRC (Read error: Operation timed out) [06:03] *** nertzy has quit IRC (Quit: This computer has gone to sleep) [06:04] *** xk_id has joined #archiveteam [07:04] *** Sk1d has joined #archiveteam [07:05] *** superkuh has joined #archiveteam [07:31] *** xk_id has quit IRC (Remote host closed the connection) [07:32] *** xk_id has joined #archiveteam [07:47] *** Start_ has joined #archiveteam [07:47] *** Start has quit IRC (Read error: Connection reset by peer) [07:53] *** xk_id_ has joined #archiveteam [07:55] *** xk_id has 
quit IRC (Ping timeout: 183 seconds) [08:00] *** xk_id_ has quit IRC (Remote host closed the connection) [08:25] *** PrincessK has quit IRC () [08:50] *** philpem has joined #archiveteam [09:22] *** bwn has quit IRC (Read error: Operation timed out) [09:25] *** vitzli has quit IRC (Quit: Leaving) [09:42] *** atomotic has joined #archiveteam [09:44] *** schbirid has joined #archiveteam [10:14] *** wp494 has joined #archiveteam [10:17] *** BlueMaxim has quit IRC (Read error: Connection reset by peer) [10:39] *** primus104 has joined #archiveteam [10:41] *** philpem has quit IRC (Ping timeout: 252 seconds) [10:43] *** Ghost_of_ has joined #archiveteam [10:46] *** bwn has joined #archiveteam [11:01] chfoo: SketchCow: can you please create a rsync target on FOS for 'ftp'? [11:02] *** primus104 has quit IRC (Leaving.) [11:12] I'd like to start a test run of the FTP project when we have the rsync target [11:28] *** ironman_ has joined #archiveteam [11:29] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [11:41] *** VADemon has joined #archiveteam [11:43] *** Ghost_of_ has quit IRC (Quit: Leaving) [12:12] *** W1nterFox has quit IRC (Read error: Connection reset by peer) [12:34] *** SN4T14 has joined #archiveteam [13:03] *** primus104 has joined #archiveteam [13:38] jleclanch: yep, in the metadata file [13:59] Morning. [13:59] Girl wants a TV, TV we get [14:15] *** vitzli has joined #archiveteam [14:22] *** GLaDOS has quit IRC (Read error: Operation timed out) [14:24] *** GLaDOS has joined #archiveteam [14:35] arkiver: There is now a ftp [14:42] arkiver: these have all finished uploading now: https://archive.org/details/@adam_sampson?and[]=subject%3A%22ftp%22 [14:44] ftp o.O [14:45] long term archive project im guessing? 
[14:47] *** nertzy has joined #archiveteam [14:53] *** GLaDOS has quit IRC (Ping timeout: 252 seconds) [14:53] *** GLaDOS has joined #archiveteam [15:02] *** nertzy has quit IRC (Quit: This computer has gone to sleep) [15:12] SketchCow: thanks [15:13] I'll use ftp-cdc.dwd.de as ftp for the test run [15:13] 187 GB [15:13] a lot of weather data [15:29] arkiver: was there something you wanted me to do earlier? i might have missed it [15:31] chfoo: I asked if you or SketchCow can create a rsync target for ftp [15:31] SketchCow created it a bit of time ago [15:31] ok [15:33] *** Ghost_of_ has joined #archiveteam [15:46] *** vitzli has quit IRC (Quit: Leaving) [17:16] *** vitzli has joined #archiveteam [17:43] We just started the FTP grab with a first batch of 750 items! [17:43] Instructions on how to run are on github [17:44] SketchCow: can you please change the permissions on the ftp rsync target? [17:44] rsync: mkdir "/warrior/ftp/Arkiver" (in chfoo) failed: Permission denied (13) [17:45] *** Ungstein1 has joined #archiveteam [17:46] *** Ungstein has quit IRC (Ping timeout: 252 seconds) [18:08] arkiver, what disk space requirements? [18:10] *** nightpool has joined #archiveteam [18:12] *** remsen has quit IRC (Read error: Operation timed out) [18:16] *** VADemon has quit IRC (left4dead) [18:18] *** slyphic is now known as slyphic|a [18:29] arkiver, best concurrency or as high as I can go? [18:29] *** atomotic has joined #archiveteam [18:38] *** nightpool has quit IRC (Ping timeout: 252 seconds) [19:00] *** vitzli has quit IRC (Quit: Leaving) [19:00] arkiver: cool [19:03] arkiver: my workhorses are 15GB disks, will it work? [19:12] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [19:34] *** primus104 has quit IRC (Leaving.) [19:34] *** toad2 has joined #archiveteam [19:34] *** Start_ is now known as Start [19:35] *** toad1 has quit IRC (Read error: Operation timed out) [19:36] *** WinterFox has joined #archiveteam [19:46] damnit...
I might have created a lot of abandoned items... [19:51] *** yakfish has quit IRC (Read error: Operation timed out) [19:51] *** riz has quit IRC (Read error: Connection reset by peer) [19:52] *** rizzzz has joined #archiveteam [19:52] *** Atom__ has joined #archiveteam [19:54] *** yakfish has joined #archiveteam [19:56] *** Atom-- has quit IRC (Read error: Operation timed out) [20:08] I'll add these instructions later to the manual too [20:08] and I'll try to make it easier to set up [20:08] items are at least 200 MB [20:08] however, it is possible some items are 50 GB or higher [20:09] if the DC OS installer didnt hang *ahem Online.net* then I would be all over it. Option to skip big items? [20:09] *** Ghost_of_ has quit IRC (Quit: Leaving) [20:09] In pipeline.py you can manually edit MAX_SIZE for the max number of bytes you want to have per item [20:09] https://github.com/ArchiveTeam/ftp-grab/blob/master/pipeline.py#L192 [20:09] it's currently at 10 GB [20:10] ive got a 1TB disk in this box so I will ramp that up [20:10] If an item is too big it'll abort the item and I'll requeue it later [20:10] HCross: sure just make sure you have enough space [20:10] now for these items they likely won't be bigger then 2 GB or so [20:11] I have 2x 2tb disks in this box thats doing nothing. I can always put them in Raid 0 to max the space [20:11] but we might get items from other FTP websites in the future that are 50 GB [20:11] HCross: ok! [20:12] Atluxity: aborted items due to a lack of diskspace? [20:14] 375809638400 should be enough [20:14] I hope so... :P [20:14] 350GB that is [20:14] yep [20:14] If my server didnt have a hardware failure atm [20:15] I expect the FTP project to be a long running project (maybe years) so take your time with those problems! 
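The MAX_SIZE arithmetic from [20:09] to [20:14] can be checked with a few lines of Python. The value 375809638400 is exactly 350 * 1024**3 bytes, which matches the "350GB" quoted in the channel. The helper names below (`gib_to_bytes`, `exceeds_cap`) are hypothetical and only illustrate the byte math; they are not taken from ftp-grab's pipeline.py, and whether the default "10 GB" there is counted in powers of 1024 is an assumption.

```python
GIB = 1024 ** 3  # bytes in one GiB


def gib_to_bytes(gib: int) -> int:
    """Convert a size in GiB to bytes, the unit MAX_SIZE is said to use."""
    return gib * GIB


def exceeds_cap(item_bytes: int, max_size: int) -> bool:
    """Hypothetical check: would an item of this size go over the cap?"""
    return item_bytes > max_size


if __name__ == "__main__":
    # The value quoted at [20:14].
    print(gib_to_bytes(350))

    # Assumed default cap of 10 GiB against the item sizes mentioned in the channel.
    default_cap = gib_to_bytes(10)
    print(exceeds_cap(gib_to_bytes(2), default_cap))   # typical item in this batch
    print(exceeds_cap(gib_to_bytes(50), default_cap))  # possible future large item
```

With a cap picked this way, a worker on a small disk can keep the default or lower it, while a machine with a large disk can raise it, as discussed above.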
[20:16] Its in the hands of the French DC people, so here goes [20:22] *** Jonimus has quit IRC (Read error: Operation timed out) [20:47] *** WinterFox has quit IRC (Read error: Operation timed out) [20:50] *** WinterFox has joined #archiveteam [20:52] *** Start_ has joined #archiveteam [20:53] *** Start has quit IRC (Read error: Connection reset by peer) [20:53] *** Start_ is now known as Start [21:00] Just testing on the ftp-grab, getting a permission error: [21:00] rsync: mkdir "/warrior/ftp/matthusby" (in chfoo) failed: Permission denied (13) [21:05] arkiver: no, I started pipelines, but did a mistake and had to kill them [21:06] *** xk_id has joined #archiveteam [21:12] *** bwn has quit IRC (Read error: Operation timed out) [21:13] *** philpem has joined #archiveteam [21:16] *** WinterFox has quit IRC (Read error: Operation timed out) [21:19] does archive.org do any deduplication? eg if multiple people upload the same item [21:20] nope [21:27] *** WinterFox has joined #archiveteam [21:31] *** remsen has joined #archiveteam [21:51] *** bwn has joined #archiveteam [21:51] *** cvb has joined #archiveteam [21:56] *** WinterFox has quit IRC (Read error: Operation timed out) [22:04] *** WinterFox has joined #archiveteam [22:12] *** primus104 has joined #archiveteam [22:42] *** WinterFox has quit IRC (Read error: Operation timed out) [22:48] *** WinterFox has joined #archiveteam [22:49] *** WinterFox has quit IRC (Read error: Connection reset by peer) [22:52] *** WinterFox has joined #archiveteam [22:52] *** St has joined #archiveteam [22:53] *** St_ has joined #archiveteam [22:57] *** St_ has quit IRC (Ping timeout: 241 seconds) [23:06] *** St has quit IRC (Ping timeout: 240 seconds) [23:23] *** nightpool has joined #archiveteam [23:44] *** cvb has quit IRC (Read error: Operation timed out) [23:45] Channel for the FTP project is #effteepee [23:47] If you have any requests for FTP servers, please let me know in #effteepee [23:54] *** nightpool has quit IRC (Read 
error: Operation timed out)