[00:01] *** dserodio has joined #archiveteam [00:16] *** Zialus has quit IRC (Read error: Operation timed out) [00:20] *** Zialus has joined #archiveteam [00:25] Frogging: joepie91: there's also a dump of all reddit posts from 2007 to mid-2015 on IA, I believe a majority of all imgur URIs would be contained in reddit posts. [00:25] *** dserodio has quit IRC (Read error: Operation timed out) [00:26] *** Zialus has quit IRC (Read error: Operation timed out) [00:29] they uploaded the scrape in a slightly unfortunate format, it takes a terrible amount of time to unpack 1.7B JSON objects. [00:29] *** dserodio has joined #archiveteam [00:30] *** Zialus has joined #archiveteam [00:35] *** joepie91 has quit IRC (Read error: Operation timed out) [00:44] *** joepie91 has joined #archiveteam [00:44] *** swebb sets mode: +o joepie91 [00:46] *** dxrt- sets mode: +o dxrt [00:46] *** kevin has quit IRC (Connection closed) [00:46] *** VonGuard_ has quit IRC (Connection closed) [00:46] *** _desu___ has quit IRC (Connection closed) [00:47] *** kevin has joined #archiveteam [00:47] *** VonGuard_ has joined #archiveteam [00:50] *** AlexLehm has quit IRC (Ping timeout: 260 seconds) [01:02] *** RichardG_ is now known as RichardG [01:09] *** antomati_ has joined #archiveteam [01:09] *** swebb sets mode: +o antomati_ [01:09] *** sep332_ has quit IRC (Read error: Operation timed out) [01:09] *** midas1 has quit IRC (Read error: Operation timed out) [01:10] *** antomatic has quit IRC (Read error: Operation timed out) [01:10] *** fie__ has joined #archiveteam [01:10] *** midas1 has joined #archiveteam [01:10] *** swebb sets mode: +o midas1 [01:10] *** VonGuard_ has quit IRC (Ping timeout: 260 seconds) [01:10] *** kevin has quit IRC (Ping timeout: 260 seconds) [01:10] *** fie_ has quit IRC (Read error: Operation timed out) [01:11] *** SketchCo2 has joined #archiveteam [01:11] *** swebb sets mode: +o SketchCo2 [01:11] *** SketchCow has quit IRC (Read error: Operation timed out) [01:11] *** 
yuitimoth has quit IRC (Read error: Operation timed out) [01:14] *** sep332_ has joined #archiveteam [01:14] *** signius has quit IRC (Read error: Operation timed out) [01:16] *** yuitimoth has joined #archiveteam [01:20] *** Jogie has quit IRC (Read error: Operation timed out) [01:22] *** kevin has joined #archiveteam [01:22] *** VonGuard_ has joined #archiveteam [01:23] *** signius has joined #archiveteam [01:27] *** GLaDOS has quit IRC (Quit: Oh crap, I died.) [01:48] *** SketchCo2 is now known as SketchCow [01:49] *** kristian_ has quit IRC (Leaving) [01:57] *** BlueMaxim has joined #archiveteam [02:06] *** ndiddy has quit IRC (Read error: Connection reset by peer) [02:23] *** dashcloud has joined #archiveteam [02:53] *** thefinn93 has joined #archiveteam [02:56] *** tomwsmf has quit IRC (Read error: Operation timed out) [03:43] *** dashcloud has quit IRC (Read error: Operation timed out) [04:11] *** Sk1d has quit IRC (Ping timeout: 194 seconds) [04:15] *** ravetcofx has joined #archiveteam [04:16] *** JesseW has quit IRC (Ping timeout: 370 seconds) [04:17] *** Sk1d has joined #archiveteam [04:32] *** murfjr has joined #archiveteam [04:32] *** murfjr is now known as polysynde [04:32] *** polysynde is now known as maelstrom [04:33] *** maelstrom has quit IRC (Client Quit) [04:40] *** maelstrom has joined #archiveteam [04:49] *** robink has quit IRC (Ping timeout: 1208 seconds) [05:00] *** robink has joined #archiveteam [05:02] *** c_rippa has joined #archiveteam [05:04] *** c_rippa has left Textual IRC Client: www.textualapp.com [05:18] *** Famicoman has quit IRC (Ping timeout: 260 seconds) [05:23] *** maelstrom has quit IRC (Ping timeout: 260 seconds) [05:25] *** Famicoman has joined #archiveteam [05:54] *** yipdw_ has joined #archiveteam [05:54] *** Frogging sets mode: +o yipdw_ [05:56] *** yipdw has quit IRC (Read error: Operation timed out) [06:00] *** Jogie has joined #archiveteam [06:19] *** BlueMaxim has quit IRC (Read error: Operation timed out) [06:20] 
*** BlueMaxim has joined #archiveteam [06:21] *** brayden has quit IRC (Quit: Leaving) [07:00] zout: how so? assuming it's NDJSON, it's super easy to stream it [07:03] joepie91: trivial to work with, just 1.7B of anything takes a long time. [07:04] zout: sure, but that seems unrelated to the format it's in, more related to the fact that it's 8 years of content :P [07:04] zout: you may find `jq` a useful tool though [07:05] and yes, newline delimited json. [07:05] I'll look into it next time I'm using that data set, previously I've just used a python one liner. [07:09] Compress the files with lz4 (less disk I/O), limit the number of items per file (jobs can run in parallel). Worked fine with 2 billion Flickr items. [07:33] *** Honno has joined #archiveteam [07:48] *** ravetcofx has quit IRC (Ping timeout: 370 seconds) [07:59] jq seems to work well. [08:12] *** RichardG_ has joined #archiveteam [08:12] *** RichardG has quit IRC (Read error: Connection reset by peer) [08:21] *** RichardG_ has quit IRC (Ping timeout: 370 seconds) [08:35] *** tuankiet has quit IRC (Remote host closed the connection) [08:36] *** brayden has joined #archiveteam [08:36] *** swebb sets mode: +o brayden [08:40] *** WinterFox has joined #archiveteam [08:49] *** AlexLehm has joined #archiveteam [09:40] *** BartoCH_ has joined #archiveteam [09:41] *** BartoCH has quit IRC (Read error: Connection reset by peer) [09:44] *** Spartan has joined #archiveteam [09:46] *** Spartan has quit IRC (Client Quit) [10:07] *** BartoCH_ is now known as BartoCH [10:11] *** RichardG has joined #archiveteam [10:20] *** RichardG_ has joined #archiveteam [10:21] *** RichardG has quit IRC (Read error: Connection reset by peer) [10:32] *** tuankiet has joined #archiveteam [10:52] *** morbus_ has quit IRC (Read error: Operation timed out) [11:13] *** tuankiet has quit IRC (Ping timeout: 244 seconds) [11:26] *** tuankiet has joined #archiveteam [12:19] *** dashcloud has joined #archiveteam [12:59] *** Morbus has joined 
#archiveteam [13:05] The imgur sitemaps referred to 44 million unique URLs (pictures and galleries). [13:05] See https://6xq.net/paste/imgur-sitemap.txt.xz [13:10] PurpleSym: wow. that sounds like everything they have. [13:10] 44m? I’m not sure about that. [13:10] surprised they went to all the effort of actually having sitemaps. [13:11] imgur deletes images which aren't viewed for 180 days. 44M sounds plausible to me if all the cruft is being dropped. [13:12] Oh, I did not know that. [13:19] imgur has a maximum upload size of 20MB for static images and 200MB for GIFs, placing the upper bound on archiving all their content at 8.8PB. [13:21] oh, that's not true. their sitemap might contain albums, not just images. [13:22] imgur does not delete old images anymore [13:24] at least, I thought they didn't. they changed their help site around and I can't find a source for that anymore [13:25] yeah, I'm not able to find a source for 180 days either. [13:25] but at some point they dropped the Pro accounts, and they stopped deleting old images [13:25] *** WinterFox has quit IRC (Read error: Operation timed out) [13:25] I looked into that when people made tools for turning the flickr 1TB of free image storage into a file system. [13:28] 30 million without galleries, < 5.6 PiB. [13:37] *** tuankiet has quit IRC (Ping timeout: 244 seconds) [13:44] *** superkuh has quit IRC (Remote host closed the connection) [13:45] *** superkuh has joined #archiveteam [13:50] *** kristian_ has joined #archiveteam [13:52] *** tuankiet has joined #archiveteam [13:58] *** tomwsmf has joined #archiveteam [14:16] https://archive.org/web/petabox.php To archive all of imgur we will need a couple of those. :P [14:16] kickstarter go [14:16] :p [14:17] yeah totally :P [14:24] are the Backblaze boxes higher density? [14:24] yes, much much higher. [14:24] https://www.backblaze.com/blog/open-source-data-storage-server/ [14:26] 4RU, 240-480TB. down to 4.3c/GB. 
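The size estimates quoted at 13:19 and 13:28 are simple worst-case products: number of items times the largest allowed upload. A minimal sketch of that arithmetic follows. The function name `upper_bound_bytes` is made up for illustration, and the choice of binary units for the second figure is an assumption: the chat does not say which unit was used, but 200 MiB per item is what reproduces "< 5.6 PiB".

```python
# Worst-case storage estimate: every item is assumed to be a GIF at the
# maximum upload size mentioned in the chat (200 MB). Real usage would be
# far lower; this is only an upper bound.

MB = 10**6    # decimal megabyte
PB = 10**15   # decimal petabyte
MiB = 2**20   # binary mebibyte (assumed for the 13:28 figure)
PiB = 2**50   # binary pebibyte


def upper_bound_bytes(item_count, max_item_bytes):
    """Return the largest possible total size for item_count items."""
    return item_count * max_item_bytes


# 44 million sitemap URLs (pictures and galleries), decimal units.
all_urls = upper_bound_bytes(44_000_000, 200 * MB)
print(all_urls / PB)        # 8.8

# 30 million URLs once galleries are excluded, binary units.
images_only = upper_bound_bytes(30_000_000, 200 * MiB)
print(images_only / PiB)    # roughly 5.59
```

Using the 20MB static-image limit instead of the GIF limit would shrink either bound by a factor of ten, so the mix of GIFs and static images dominates the real answer.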
[14:30] *** metalcamp has joined #archiveteam [14:31] *** _desu___ has joined #archiveteam [14:49] *** vOYtEC has quit IRC (Ping timeout: 255 seconds) [14:53] *** BlueMaxim has quit IRC (Quit: Leaving) [14:53] well, i bet a lot of the image data on there isn't unique [14:54] people reup all the time, so i bet you could cut 20% off if nothing else [14:56] *** Kksmkrn has joined #archiveteam [14:58] *** kristian_ has quit IRC (Read error: Operation timed out) [14:58] *** tomaspar1 has quit IRC (Ping timeout: 370 seconds) [15:00] atrocity: https://xkcd.com/1683/ :) [15:08] *** superkuh has quit IRC (Remote host closed the connection) [15:08] *** tuankiet has quit IRC (Ping timeout: 244 seconds) [15:12] *** superkuh has joined #archiveteam [15:21] *** tuankiet has joined #archiveteam [15:31] *** tuankiet has quit IRC (Ping timeout: 244 seconds) [15:36] *** Kksmkrn has quit IRC (Remote host closed the connection) [15:38] *** vOYtEC has joined #archiveteam [15:42] *** vOYtEC has quit IRC (Read error: Connection reset by peer) [15:42] *** vOYtEC has joined #archiveteam [15:44] *** tuankiet has joined #archiveteam [16:02] *** vOYtEC has quit IRC (Ping timeout: 633 seconds) [16:23] *** JW_work has quit IRC (Read error: Connection reset by peer) [16:24] *** JW_work has joined #archiveteam [16:25] *** vOYtEC has joined #archiveteam [16:25] *** Kksmkrn has joined #archiveteam [16:26] *** dcmorton has quit IRC (Quit: ZNC - http://znc.in) [16:32] *** dcmorton has joined #archiveteam [16:32] *** swebb sets mode: +o dcmorton [16:35] *** AlexLehm has quit IRC (Ping timeout: 260 seconds) [17:14] *** schbirid has joined #archiveteam [17:20] *** vOYtEC has quit IRC (Ping timeout: 633 seconds) [17:59] *** VADemon has joined #archiveteam [18:14] *** antomati_ is now known as antomatic [18:16] *** vOYtEC has joined #archiveteam [18:16] *** vOYtEC has quit IRC (Client Quit) [18:18] *** vOYtEC has joined #archiveteam [18:20] *** vOYtEC has quit IRC (Read error: Connection reset by peer) 
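The remark at 14:53-14:54 about re-uploaded images suggests de-duplicating by content before storing anything. A minimal sketch, assuming the images are already on local disk and that "duplicate" means byte-for-byte identical: the helper names `file_digest` and `group_duplicates` are hypothetical, and the 20% figure is the speaker's guess, not something this code measures. Re-encoded or resized re-uploads would not be caught by an exact hash.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(root):
    """Map digest -> list of paths for every digest seen more than once."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Calling `group_duplicates("some/download/dir")` (a placeholder path) returns only the groups with more than one member, so the space saved is the size of each group's files minus one copy.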
[18:26] http://blog.dshr.org/2016/08/evanescent-web-archives.html [18:26] *** vOYtEC has joined #archiveteam [18:30] *** schbirid2 has joined #archiveteam [18:32] *** schbirid has quit IRC (Read error: Operation timed out) [18:33] If people can help me with a slight thing [18:33] I'd like to suck ALL data out of a flickr account and put it into another flickr account [18:33] There MUST be some utility out there [18:38] I found Bulkr, but that's for uploads [18:38] I mean one-time download backups [18:39] SketchCow: just getting the photo? or you want to back up the meta [18:39] All [18:39] http://sunkencity.org/flickredit ? [18:39] I want an album in a page to be in another account [18:47] *** bRick5772 has joined #archiveteam [18:51] *** Jogie has quit IRC (Quit: ZNC - http://znc.in) [19:11] *** robink has quit IRC (Ping timeout: 501 seconds) [19:20] *** username1 has joined #archiveteam [19:22] *** schbirid2 has quit IRC (Read error: Operation timed out) [19:29] *** vOYtEC has quit IRC (Ping timeout: 633 seconds) [19:29] *** RichardG_ is now known as RichardG [19:30] *** username1 has quit IRC (Read error: Operation timed out) [19:38] *** username1 has joined #archiveteam [19:40] *** schbirid2 has joined #archiveteam [19:40] *** robink has joined #archiveteam [19:43] *** username1 has quit IRC (Read error: Operation timed out) [19:45] *** schbirid2 has quit IRC (Read error: Operation timed out) [19:50] *** schbirid has joined #archiveteam [19:50] *** WubTheCap has joined #archiveteam [19:51] I'm archiving the only known Finnish tulpa forum, which has historically mass-deleted posts to get a clean start. It's gone mostly dormant today. [19:52] Should I do anything special like tag it with ArchiveTeam or get the media type right on archive.org? [19:56] *** schbirid2 has joined #archiveteam [19:59] https://github.com/blog/2239-updates-to-our-privacy-statement Beep word "archivists" used [20:01] WubTheCap: How are you archiving it? What's the size? 
[20:01] (very roughly, like, number of posts?) [20:02] luckcolor: what's the tl;dr? [20:03] Sanqui: 15M tulpa.palstani.com_phpBB3_20160829.warc.gz [20:03] *** username1 has joined #archiveteam [20:03] Downloaded: 1376 files, 52M in 2m 6s (422 KB/s) [20:03] Is that all? [20:03] Sanqui: "For example, the new Privacy Statement describes how people such as researchers or archivists can use your public information on GitHub.com," [20:04] Sanqui: It's not a large forum. [20:04] WubTheCap: That's small enough to just throw into ArchiveBot, which is what we generally use for sites like these [20:04] check #archivebot. I'll put it in [20:04] Generally yes, but I took the effort to do it myself as friends on a different channel were away [20:05] *** schbirid2 has quit IRC (Read error: Operation timed out) [20:05] There are no images on the forum to begin with, so it's small [20:05] *** schbirid2 has joined #archiveteam [20:05] #archiveteam-bs [20:05] *** schbirid has quit IRC (Read error: Operation timed out) [20:06] WubTheCap: anyway, I do not know what the process is for getting your own dumps into the Wayback Machine, so I'm not the right person here. SketchCow? 
[20:07] Jason has moved one, but it has to be tagged ArchiveTeam [20:07] info@archive.org said they don't generally ingest WARCs to Wayback Machine [20:07] But that was months ago [20:11] *** schbirid2 has quit IRC (Read error: Operation timed out) [20:12] *** username1 has quit IRC (Read error: Operation timed out) [20:12] *** schbirid has joined #archiveteam [20:16] *** schbirid2 has joined #archiveteam [20:19] chain-of-custody is important [20:19] don't want people submitting falsified warcs [20:22] *** schbirid has quit IRC (Read error: Operation timed out) [20:25] *** schbirid2 has quit IRC (Read error: Operation timed out) [20:28] Jason / SketchCow I'm not sure how you think two wrongs make some sort of right, but considering you literally just cursed me out, called me names, and banned me from a channel, I think I'll just devote my time and energy somewhere else altogether. I'm not interested in participating and contributing time and resources to a group who's led by someone who conducts themselves this way. I [20:28] did something borderline at best, and you want to cuss me out, call me names, and ban me from channels. Grow up. [20:28] *** ErkDog has left [20:29] Is there a Gawker dump running? [20:29] filippo__, i think godane1 has that in hand [20:29] Gawker is covered [20:30] https://media.giphy.com/media/112W3UgYjCxzLq/giphy.gif [20:30] Yes, the Gawker project is progressing nicely, and has expanded to many other related projects that could be affected by the buyout. [20:31] It is not clear when/if Gawker will be yanked down, as it's currently read-only, but it'll likely be random, from a month to years from now. [20:31] godane's been on it hardcore, as have others. [20:47] REGARDING FOREIGN WARCS [20:48] The archivebot infrastructure is the main alternative pipeline outside of Internet Archive's own crawlers that brings in WARCs that are in use by the Wayback Machine. 
[20:48] It does so because it's using the full faith and credit of an employee (me) to bring them in and verify that they're in a chain of custody that is clean. [20:48] Does that extend to third-party run ArchiveBots such as bot.yui.cat? [20:48] No. [20:49] People are uploading outside WARCs all the time, actually. Not just you. [20:49] Now, I'm happy to aim you at mark@archive.org, who is Mark Graham, director of the Wayback Machine. [20:49] You can ask him about what you want. He'll go to me, "is this one of yours", and I will go "no" [20:49] And then he'll make a decision or he won't. [20:50] While you're orating here: What is the correct way to upload WARCs into the Archiveteam collection (I remember you weren't pleased the way we used to just upload WARCs on our own, as such I still have maxfile.ro outstanding) [20:50] So, that's what I'd say you'd have to do, if you're concerned something you saved isn't in the wayback. [20:51] The correct way is archivebot [20:51] Otherwise, I'm having people I know upload items to me, and I have to go through and validate them [20:56] *** jdude104 has joined #archiveteam [21:09] *** AlexLehm has joined #archiveteam [21:20] Channel #chaingang created [21:20] go join it, make it something formalized [21:21] *** ndiddy has joined #archiveteam [21:21] * luckcolor will start a wiki page [21:21] Title? [21:22] cryptocurrencies blockchain backup initiative? [21:22] lel [21:22] ArchiveTeam Chain Gang [21:23] k [21:23] You will all be delighted to know, I'm sure, that I've successfully ripped 1,800 Apple II floppies and I THINK I'm down to the last 500 or so unless there's a hidden box in this room, which is not a 100% unlikely scenario [21:24] In related news, fuck Apple II floppies, I'm sick to death of them [21:26] Sketchcow: can you make an empty repo on github for eventual sourcecode? [21:26] https://archive.org/details/softwarelibrary_apple_workbench?&sort=-downloads&page=2 [21:26] luckcolor: No, others do that better than I. 
[21:26] Ask arkiver or HCross or yuitimoth [21:26] yipdw I mean [21:26] Ok [21:26] (archive.org link was if you want to boot my most recently ripped floppies) [21:27] Let's make sure we get it right before we start flooding [21:27] But I definitely think a once a month grab is best [21:27] We're just bringing up the rear, as blockchains slow down in favor [21:38] *** metalcamp has quit IRC (Ping timeout: 244 seconds) [21:39] *** tomaspark has joined #archiveteam [21:42] Lots of the floppies are displaying "This is a data disk, not a startup disk". [21:42] And nothing happens after that on MAME [21:45] SketchCow: Missed something? Any feedback? http://archiveteam.org/index.php?title=ArchiveTeam_Chain_Gang [21:54] *** VADemon has quit IRC (left4dead) [21:56] *** WubTheCap has quit IRC (Quit: Leaving) [21:57] *** bRick5772 has quit IRC (Quit: Leaving.) [22:07] WubTheCap: http://tulpa.palstani.com/ was ArchiveBot'd. [22:26] *** Stiletto has joined #archiveteam [22:42] *** Honno has quit IRC (Read error: Operation timed out) [23:05] *** JonimusP is now known as Jonimus [23:13] Looking [23:19] Made a couple additions [23:20] *** tomaspark has quit IRC (Read error: Connection reset by peer) [23:35] *** Jordan_ has joined #archiveteam [23:47] *** dx has quit IRC (Ping timeout: 633 seconds)