[00:01] *** benuski has quit IRC (Quit: Leaving)
[00:18] *** maelstrom has quit IRC (Quit: Leaving)
[00:22] *** maelstrom has joined #archiveteam
[00:26] *** hive-mind has quit IRC (Ping timeout: 260 seconds)
[00:26] *** hive-mind has joined #archiveteam
[00:31] *** jrwr has quit IRC (Remote host closed the connection)
[00:33] *** jrwr has joined #archiveteam
[00:49] *** powerKitt has joined #archiveteam
[00:59] *** powerKitt has quit IRC (Quit: Page closed)
[01:06] *** JesseW has joined #archiveteam
[01:45] *** kristian_ has quit IRC (Quit: Leaving)
[02:01] *** ravetcofx has quit IRC (Ping timeout: 506 seconds)
[02:10] *** ravetcofx has joined #archiveteam
[02:24] *** rudolphos has joined #archiveteam
[02:25] *** jrwr has quit IRC (Remote host closed the connection)
[02:29] *** rudolphos has quit IRC (Leaving)
[02:52] *** ndiddy has quit IRC (Quit: Leaving)
[02:53] *** Froggypwn has quit IRC (Read error: Operation timed out)
[02:53] *** Froggypwn has joined #archiveteam
[02:53] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[02:54] *** BlueMaxim has joined #archiveteam
[03:54] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[04:18] *** maelstrom has quit IRC (Remote host closed the connection)
[05:12] *** balrog has quit IRC (Ping timeout: 260 seconds)
[05:21] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[05:27] *** Sk1d has joined #archiveteam
[05:52] *** Start has quit IRC (Quit: Disconnected.)
[05:55] *** Start has joined #archiveteam
[06:15] *** balrog has joined #archiveteam
[06:15] *** swebb sets mode: +o balrog
[06:30] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[08:00] *** Observer has quit IRC (Ping timeout: 268 seconds)
[08:15] *** WinterFox has joined #archiveteam
[08:36] *** khaoohs_ has quit IRC (Read error: Connection reset by peer)
[08:37] *** khaoohs_ has joined #archiveteam
[08:42] *** W1nterFox has joined #archiveteam
[08:48] *** WinterFox has quit IRC (Read error: Operation timed out)
[09:31] who was doing the home.arcor.de discovery? Some more google scraping, again, raw output, no dedup etc. https://www.medowar.de/lab/at/arcor/liste2.txt
[10:04] That would be me, Medowar0.
[10:05] *** ravetcofx has quit IRC (Ping timeout: 506 seconds)
[10:19] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:43] *** antomati_ has joined #archiveteam
[10:43] *** swebb sets mode: +o antomati_
[10:49] *** antomatic has quit IRC (Read error: Operation timed out)
[11:39] *** Budgiebra has joined #archiveteam
[11:53] rip. DNShistory is now officially offline. I was crawling it very slowly, but it is now officially dead.
[12:25] *** Budgiebra has left
[12:32] *** khaoohs__ has joined #archiveteam
[12:34] *** khaoohs_ has quit IRC (Read error: Operation timed out)
[12:57] *** tatata has joined #archiveteam
[12:58] *** tatata has quit IRC (Client Quit)
[13:23] *** bRick5772 has joined #archiveteam
[13:39] *** W1nterFox has quit IRC (Read error: Operation timed out)
[13:51] *** sep332 has joined #archiveteam
[14:25] *** ndiddy has joined #archiveteam
[14:26] *** ndizzle has joined #archiveteam
[14:26] *** ndizzle has quit IRC (Read error: Connection reset by peer)
[15:25] *** arkiver sets mode: +o HCross
[15:48] *** JesseW has joined #archiveteam
[16:08] *** RichardG has joined #archiveteam
[16:09] *** atomotic has joined #archiveteam
[16:33] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[16:48] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[17:16] Define "Megawarc seems stuck"
[17:19] Greetings, I'm home
[17:19] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[17:19] I expect to drink 45 5-hour energy drinks and go through our (and other) backlogs
[17:19] *** BartoCH has joined #archiveteam
[17:19] POSTIMAGE
[17:19] HELLO POSTIMAGE
[17:20] *** kristian_ has joined #archiveteam
[17:21] Hello ArchiveTeam,
[17:21] Our project hosts over 140 million images used in ~450k websites all over the web, including a number of vibrant communities and bulletin boards.
[17:21] We have recently found ourselves in financial dire straits, and would like to investigate the opportunities for archiving our collection in case we do not survive after all [although there's still a good chance that we do]. Our total image database is nearly 100Tb large, but almost 40% of that is adult imagery that we believe can be safely sacrificed.
[17:21] What do you think about this?
[17:21] let's take it
[17:21] it's uh
[17:22] I'm going to write them about it
[17:22] And request a phone call, etc.
[17:22] 20x as big as gitorious, and i was the only person willing to host gitorious
[17:22] I just want hard drives
[17:22] aye
[17:27] *** JW_work has joined #archiveteam
[17:33] Anyway, that's on the mantle
[17:39] *** powerKitt has joined #archiveteam
[17:39] SketchCow: I think they announced on their page they're not in trouble anymore
[17:40] Which, Postimage?
[17:47] won't somebody think of the adult imagery
[17:49] http://postimage.org/
[17:49] yeah
[17:50] but it looks like it's changed/removed now
[17:50] do we still want to grab it?
[17:58] Looks like they're still saying they're in danger of closing.
[18:00] *** Aranje has joined #archiveteam
[18:04] *** PepsiMax has joined #archiveteam
[18:09] *** bwn has quit IRC (Read error: Operation timed out)
[18:13] They've mailed me and I'm working to get a con call
[18:13] rad
[18:16] SketchCow: i'm going to discuss internally to see if we can step in to help postimage with our cdn
[18:20] You got it.
[18:20] But ideally they just send us 50 hard drives.
[18:20] We have tons of hard drives.
[18:21] jesus, reading their blog post
[18:21] true, but then project goes into read only mode over at archive.org. not a bad thing if it's kept alive instead
[18:22] Kaz: there's plenty of other traffic heavy sites that are hiding behind cloudflare, just a matter of time before they get snuffed out
[18:22] I guess
[18:22] most people just assume cloudflare would offer them free bw forever
[18:22] lol
[18:22] just the fact that they really had no plans at all, "We didn’t pay enough attention to making money off Postimage"
[18:22] they're probably developers > business people
[18:30] *** bwn has joined #archiveteam
[18:40] *** bwn has quit IRC (Ping timeout: 244 seconds)
[18:47] *** bwn has joined #archiveteam
[18:51] Kenshin: do you mean heroku can provide a cheaper option when cloudflare stops paying all the bills?
[18:51] Nemo_bis: don't get you
[18:53] There are a lot of cheaper providers.
[18:53] The only question is if they can handle the load.
[18:57] *** cadbury_ has quit IRC (Read error: Operation timed out)
[18:57] i have a lot of nice things to say about Heroku but "cheap" is not one of them
[18:58] well, maybe at company budgets it is
[18:58] on individual scale though
[19:05] *** ravetcofx has joined #archiveteam
[19:06] *** bsmith093 has quit IRC (Read error: Operation timed out)
[19:08] *** cadbury_ has joined #archiveteam
[19:25] *** BlackoutI has joined #archiveteam
[19:27] *** BlackoutI has left
[19:27] *** Blackout has joined #archiveteam
[19:28] So is there any point in setting my warrior to vine rn?
[19:31] Blackout: Nope. Use Archiveteam's choice
[19:32] That's always the best, unless you are around your instance all the time and obsessed.
[19:37] @Yoshimura I have a gigabit line and I figured I'd probably want to run multiple projects. I'm assuming a ton of warriors is inefficient?
[19:39] Glad you ask, running a lot of VMs, is inefficient, yes. Warrior itself is inefficient, yes. So you got it squared.
[19:39] Simplest thing is to run multiple warriors with modified code in a docker.
[19:39] also. you shouldn't run more than one warrior per IP
[19:39] xmc: Why not?
[19:39] Even for different projects?
[19:39] if you stack multiple warriors on the same ip address, you're twice as likely to get ip-banned
[19:39] etc
[19:40] for the same project, that is
[19:40] Yeah, ip bans are obvious thing.
[19:40] So basically only have one on auto
[19:40] because we try to make it so that each warrior sneaks in under IP bans by whoever we're archiving
[19:40] Blackout: Is that Linux?
[19:40] Well my main box is Win 10 With hyperV but I have an ESXi host as well on my lan
[19:42] I would say I'm new here but I've been around once before for a tracker I can't quite recall the name of
[19:42] Linux host and Warrior dockers (just forward UI to different port), one per project. And use mounting parameters: relatime. And ideally also writeback (need filesystem tweak first). Or mount /data in the containers to a ramdisk.
[19:42] Each wget thread competes for IO, plus syslog, so it is pretty inefficient without tweaks (which the VM has)
[19:43] How much data do they download before shipping it back?
[19:48] *** bsmith093 has joined #archiveteam
[19:49] o/
[19:49] you could also just run the scripts for each project if you're comfortable with that. means you can run a lot higher in terms of concurrency etc
[19:50] You can just modify single number in code and it will for Warrior also, but you need to watch the IP bans.
[19:51] Like Panoramio will not care, but with more threads and process context switching you do not get much performance above about 10-20 threads.
[19:52] What kind of disk space should I allocate though?
[19:57] Depends on project. Panoramio items are small, so running that off tmpfs is fine. tmpfs can swap if needed. I run on gigabytes, but keep close eye, and I run on 100Mbit line.
[19:58] Panoramio needs few dozen MB per thread.
[19:58] The Internet Archive S3 infrastructure just got a boost
[19:58] They use S3?
[20:00] Great to hear.
[20:00] Blackout: S3 is API. S3 compatible products are not a rare thing.
[20:01] Oh ok right that's a widely adopted api. Gotcha
[20:05] We tend to use "S3-like" but most people in here get it. It's the moving of the term from S3 as a Amazon brand and "S3" as a format.
[20:05] There was an FTP company once, after all
[20:05] Fuck those guys
[20:07] TIL
[20:09] what's the 'boost'?
[20:13] Additional 16cpu machine with 10gig connection
[20:13] I mean, you and Kenshin are going to assault it to within an inch of its life anyway
[20:13] but there it is
[20:14] whee
[20:22] Is that an ingest server you're talking about @SketchCow ?
[20:22] i thought was just me 'assault it to within an inch of its life' :P
[20:23] I think you all can be blamed
[20:23] You're all monsters
[20:34] the kleenexing of amazon
[20:46] http://prawfsblawg.blogs.com/.a/6a00d8341c6a7953ef0134851907f7970c-500wi
[20:48] https://www.adobe.com/legal/permissions/trademarks.html
[20:48] Aoede: what?
[20:49] Adobe has same problem with trademarks
[20:49] oh
[20:49] Specifically, with the usage of "photoshop" to mean "edit an image"
[20:49] "Correct: The image was enhanced using Adobe® Photoshop® software."
[20:50] "Incorrect: The image was photoshopped."
[20:50] "Incorrect: My hobby is photoshopping."
[20:50] I love that
[20:50] Adobe: OUR NEW BARN DOOR IS GOING TO COMPLETELY CONTAIN THE ESCAPED HORSE
[20:50] Good luck Adobe
[20:50] "Incorrect: The photoshop pokes fun at the Senator."
[20:50] is it better or worse if i call it a shoop
[20:51] Better
[20:51] gr8
[20:52] shoop the woop
[20:54] *** maelstrom has joined #archiveteam
[21:01] Postimage guy gave me his skype.
[21:01] We'll talk
[21:10] How do you set max rsync jobs with run-pipeline?
[21:12] *** Start has quit IRC (Read error: Connection reset by peer)
[21:13] *** Start has joined #archiveteam
[21:19] Blackout: that's just a legal safeguard for the trademark
[21:20] Just like kleenex tries not to lose the trademark due to the word becoming a common noun
[21:21] Wikimedia Foundation makes the same stupid request for that reason.
[21:24] woop woop woop off-topic siren
[21:27] SketchCow: are things still bumpy? 40Mbps up atm, 20tb to shift
[21:56] *** BlueMaxim has joined #archiveteam
[22:03] Yes
[22:16] *** db48x has joined #archiveteam
[22:20] *** maelstrom has quit IRC (Remote host closed the connection)
[22:29] *** RichardG_ has joined #archiveteam
[22:29] *** RichardG has quit IRC (Ping timeout: 370 seconds)
[22:33] *** maelstrom has joined #archiveteam
[22:35] *** maelstrom has quit IRC (Client Quit)
[22:51] https://pbs.twimg.com/media/CwIL6ezWAAAC0id.jpg:large
[22:51] Everyone got that?
[22:52] Nice
[22:54] *** powerKitt has quit IRC (Ping timeout: 268 seconds)
[23:03] hrmph
[23:07] *** atomotic has joined #archiveteam
[23:13] That's a lot of broken links, though...
[23:16] more detail: https://www.whitehouse.gov/participate/opening-our-data-public
[23:16] https://www.whitehouse.gov/blog/2016/10/31/digital-transition-how-presidential-transition-works-social-media-age
[23:18] so i'm going to assume they are getting special dispensation from twitter to enable them to migrate tweets from one account to another
[23:18] it would be the only way to keep ids and timestamps reasonably the same, which is necessary for any archival at all amio
[23:18] *imo
[23:19] but we should def throw a scraper or two at them
[23:20] https://www.reddit.com/r/trackers/comments/5aew97/sciencehd_says_farewell_on_november_31/
[23:21] apparently big private(-ish?) torrent tracker closing with sciencey(?) stuff
[23:21] enabled site-wide freeleech until shutdown
[23:21] unsure if within scope, I'd imagine there's a lot of rare materials
[23:21] sounds sciencey, no idea what it really is
[23:21] seems it's not free signup though
[23:24] Would need someone with account
[23:24] The applications are closed.
[23:24] https://sciencehd.me/applications.php
[23:27] *** RichardG_ has quit IRC (Read error: Connection reset by peer)
[23:27] *** RichardG has joined #archiveteam
[23:29] *** bRick5772 has quit IRC (Quit: Leaving.)
[23:34] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[23:38] do people not know that there is no november 31st
[23:38] thats the second time a closing site has november 31st in the closing post
[23:39] yeah! there was one last week, too