[00:09] *** schbirid has quit IRC (Quit: Leaving)
[00:37] *** kyan has quit IRC (Quit: This computer has gone to sleep)
[00:38] *** kyan has joined #archiveteam
[00:57] *** xk_id has quit IRC (Remote host closed the connection)
[01:18] *** xk_id has joined #archiveteam
[01:18] *** vitzli has joined #archiveteam
[01:20] *** Guest100 has joined #archiveteam
[01:20] *** Guest100 has quit IRC (Client Quit)
[01:33] *** Guest100 has joined #archiveteam
[01:45] *** aaaaaaaaa sets mode: +o chfoo
[01:49] I'm about to go into the FTP dump on FOS.
[01:49] 632gb of god know what the fuck what
[01:56] * xmc hands SketchCow scuba gear
[01:57] *** Guest100 has quit IRC (My Mac has gone to sleep. ZZZzzz…)
[02:00] *** xk_id_ has joined #archiveteam
[02:00] *** xk_id has quit IRC (Read error: Connection reset by peer)
[02:00] *** BlueMaxim has joined #archiveteam
[02:06] I already found a directory with 132gb of what-the-fuck
[02:06] mysql2014-08-31.tar.gz
[02:06] somewhere2014-09-01.tar.gz
[02:06] streaming_content2010-12-03.tar.gz
[02:06] tomcat2014-08-31.tar.gz
[02:06] turbulence2014-09-01.tar.gz
[02:08] very descriptive XD
[02:16] Well, after I study it slightly more, it goes up as is.
[02:17] *** vitzli has quit IRC (Quit: Leaving)
[02:24] *** zgrep has left When-if-ever I become an archiver, I shall join. For now... meh.
[02:29] *** vitzli has joined #archiveteam
[02:44] *** primus104 has quit IRC (Leaving.)
[02:52] *** VADemon has quit IRC (Read error: Connection reset by peer)
[03:04] *** xk_id_ has quit IRC (Remote host closed the connection)
[03:09] *** vitzli has quit IRC (Quit: Leaving)
[03:42] *** phuzion has quit IRC (Read error: Operation timed out)
[03:49] *** phuzion has joined #archiveteam
[04:20] *** aaaaaaaaa has quit IRC (Leaving)
[04:40] *** vitzli has joined #archiveteam
[04:45] Found it. It's the "New American Radio" site, grabbed down to the tomcat and mysql instances.
[04:56] *** Cameron_D has quit IRC (Ping timeout: 483 seconds)
[05:05] *** Ravenloft has quit IRC (Ping timeout: 252 seconds)
[05:10] *** Cameron_D has joined #archiveteam
[05:28] that's a hell of a grab
[06:09] *** Guest100 has joined #archiveteam
[06:10] SketchCow: What is that?
[06:12] *** db48x` has quit IRC (Remote host closed the connection)
[06:29] *** jspiros has quit IRC (Ping timeout: 186 seconds)
[06:44] *** PurpleSym has joined #archiveteam
[06:57] *** jspiros has joined #archiveteam
[07:40] *** vitzli has quit IRC (Quit: Leaving)
[07:52] *** vitzli has joined #archiveteam
[07:52] *** Guest100 has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
[08:34] *** Stiletto has quit IRC ()
[08:35] *** Stiletto has joined #archiveteam
[08:40] *** schbirid has joined #archiveteam
[08:49] So does anyone have a good name for a DocStoc channel?
[08:53] docstocandbarrel
[09:15] *** Ungstein has joined #archiveteam
[09:18] *** arkiver2 has joined #archiveteam
[09:25] *** arkiver2 has quit IRC (Ping timeout: 252 seconds)
[09:33] *** primus104 has joined #archiveteam
[09:48] docoutofstoc
[10:56] ^
[11:01] *** arkiver2 has joined #archiveteam
[11:24] *** arkiver2 has quit IRC (Ping timeout: 252 seconds)
[11:26] *** arkiver2 has joined #archiveteam
[11:26] *** primus104 has left
[11:33] *** arkiver2 has quit IRC (Ping timeout: 252 seconds)
[11:57] *** nmnn has joined #archiveteam
[12:04] *** RichardG has quit IRC (Read error: Connection reset by peer)
[12:11] *** RichardG has joined #archiveteam
[12:54] *** robink has quit IRC (Ping timeout: 492 seconds)
[13:48] Is there an easy way to archive a discussion between two Twitter accounts?
[14:00] *** arkiver2 has joined #archiveteam
[14:04] *** BlueMaxim has quit IRC (Quit: Leaving)
[14:11] *** arkiver2 has quit IRC (Ping timeout: 252 seconds)
[14:28] SilSte: there isn't even an easy way to /follow/ a discussion between two Twitter accounts. bah.
[14:28] *** khaoohs_ has joined #archiveteam
[14:28] you see my problem ;)
[14:28] I want something like "All the tweets between x and y from time a to z"
[14:31] *** khaoohs has quit IRC (Read error: Operation timed out)
[14:46] *** primus104 has joined #archiveteam
[15:06] *** vitzli has quit IRC (Quit: Leaving)
[15:07] storify might be the best way, but if you need something that will work independently of twitter (in the case of people deleting tweets or accounts), you'll need to archive the conversation yourself; archivebot can do that
[15:37] *** nmnn has quit IRC (Remote host closed the connection)
[15:39] *** zenguy_pc has quit IRC (Excess Flood)
[15:43] *** xk_id has joined #archiveteam
[15:45] *** SmileyG has quit IRC (Remote host closed the connection)
[15:45] *** Smiley has joined #archiveteam
[15:46] *** zenguy_pc has joined #archiveteam
[15:48] *** Laverne has quit IRC (Ping timeout: 369 seconds)
[15:49] *** dxrt has quit IRC (Ping timeout: 369 seconds)
[15:49] *** dxrt has joined #archiveteam
[15:49] *** Laverne has joined #archiveteam
[15:52] *** zenguy_pc has quit IRC (Excess Flood)
[15:52] *** zenguy_pc has joined #archiveteam
[16:38] *** atomotic has joined #archiveteam
[17:10] *** robink has joined #archiveteam
[17:32] *** arkiver2 has joined #archiveteam
[17:39] Posted this in #archivebot by mistake:
[17:39] SketchCow: So Google Code is taking a little longer to start
[17:39] It needs a bit more tweaking in what will be downloaded and what not
[17:39] For example, for every commit made we will download the page which shows the file changes made in the commit
[17:39] However, we will not download the files which have not been changed by the commit
[17:39] The git, hg and svn repos will be downloaded through a special project, just like SourceForge. Those files also contain all commits.
[17:43] *** arkiver2 has quit IRC (Ping timeout: 252 seconds)
[17:47] *** arkiver2 has joined #archiveteam
[17:50] *** arkiver2 has quit IRC (Client Quit)
[17:52] *** aaaaaaaaa has joined #archiveteam
[17:52] *** swebb sets mode: +o aaaaaaaaa
[17:53] *** scyther has joined #archiveteam
[18:10] *** primus104 has quit IRC (Leaving.)
[18:11] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[18:12] *** aaaaaaaaa has joined #archiveteam
[18:12] *** swebb sets mode: +o aaaaaaaaa
[18:15] arkiver: This is nice and all, but… don't you kinda feel like google should be doing this?
[18:16] anomie: that is the central thesis of archiveteam
[18:24] *** SimpBrain has joined #archiveteam
[18:24] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[18:53] *** primus104 has joined #archiveteam
[18:53] *** scyther has quit IRC (Read error: Connection reset by peer)
[19:01] *** khaoohs has joined #archiveteam
[19:06] arkiver: #docstop
[19:07] *** khaoohs_ has quit IRC (Ping timeout: 483 seconds)
[19:07] *** khaoohs_ has joined #archiveteam
[19:08] *** khaoohs has quit IRC (Read error: Operation timed out)
[19:27] *** scyther has joined #archiveteam
[19:53] *** habi has joined #archiveteam
[19:54] *** habi has left
[20:09] *** schbirid has quit IRC (Quit: Leaving)
[20:17] Shall we do #docstop?
[20:29] Any eta on GoogleCode or not?
[20:44] HCross: ok if I PM you when we start?
[20:44] Yeah, I am out tomorrow until around 2pm British - I don't have any big boxes this time.
[20:45] Intending to test what https://www.scaleway.com is like for the ArchiveTeam stuff
[20:45] *** godane has left
[20:45] *** godane has joined #archiveteam
[20:49] *** PurpleSym has quit IRC (Remote host closed the connection)
[20:50] *** bsmith096 has joined #archiveteam
[20:50] i need a command to remove all leading . characters from folder and file names, recursively,
[20:51] *** Guest100 has joined #archiveteam
[20:51] the way i did the fanfic grab, some folders have a dot char as the first character, so they are hidden, and some even have 2 or three dots.
[20:57] *** c_b has joined #archiveteam
[20:57] *** bsmith096 has quit IRC (Ping timeout: 240 seconds)
[21:05] HCross: that's.. really cheap
[21:05] They aren't bad either
[21:06] *** bsmith096 has joined #archiveteam
[21:06] I can see them being quite good little ArchiveTeam servers
[21:06] ahhh right, it's online.net's ssd vps thing
[21:06] yeah
[21:07] Only thing I've noticed is support is a tad slower than Online's main support
[21:07] and they're french? :P
[21:08] I remember the time that Online.net's main support English left a lot to be desired
[21:09] I've also worked out how to make URLTeam go on them, you need to remove the address thing from the command and off it goes
[21:11] cool
[21:12] works fine even though it's arm?
[21:12] Yeah, watch HCross on the tracker and see
[21:14] Does seem to be 404'ing a lot - will dial it down and see
[21:15] i'm the fanfic grab guy on reddit, here's the magnet link to the gzip file: magnet:?xt=urn:btih:3E2HBHI4P4N7E3MCM4MIATPF66STOV64&dn=Fanfiction.tar.gz&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80
[21:16] https://www.reddit.com/r/DataHoarder/comments/3jl3qm/nearly_complete_archive_of_fanfictionnet/
[21:17] bsmith096: Is it on archive.org? Should it be on archive.org?
[21:22] nvm https://www.reddit.com/r/DataHoarder/comments/3jl3qm/nearly_complete_archive_of_fanfictionnet/cuqg5jw
[21:25] yay. i found a peer
[21:25] How big is it?
[21:26] 107.64 GB
[21:27] bsmith096: your magnet link is broken.. :P
[21:30] *** scyther has quit IRC (Read error: Connection reset by peer)
[21:32] So, about 108 GB of *compressed* fanfictions, can humanity handle this?
[21:32] Going to URLTeam under 2 usernames to compare. HCross is a VM on my x86 OVH server and HCrossScaleway is on my ARM server
[21:42] bsmith096: What is that?
[21:43] *** xk_id has quit IRC (Remote host closed the connection)
[22:00] *** zenguy_pc has quit IRC (Read error: Connection reset by peer)
[22:01] *** zenguy_pc has joined #archiveteam
[22:03] *** xk_id has joined #archiveteam
[22:09] was thinking about getting that scaleway server
[22:15] *** bsmith097 has joined #archiveteam
[22:16] *** Coderjoe_ has joined #archiveteam
[22:18] *** qwebirc56 has joined #archiveteam
[22:19] Rotab: how do i fix it?
[22:19] qwebirc56: magnet:?xt=urn:btih:3E2HBHI4P4N7E3MCM4MIATPF66STOV64&dn=Fanfiction.tar.gz&tr=udp://tracker.openbittorrent.com:80
[22:19] *** bsmith097 has quit IRC (Ping timeout: 240 seconds)
[22:19] Rotab: SketchCow DFJustin swebb anyone have a thing to strip the dot characters from the front of folder and filenames?
[22:26] *** Coderjoe has quit IRC (Ping timeout: 624 seconds)
[22:29] bash
[22:29] Wow, anomie said something adorable.
[22:31] *** Ravenloft has joined #archiveteam
[22:32] SketchCow: ok, but the syntax for all the commands i've found is very strange, i just need to unhide some hidden folders and files by removing the dots from the front of their names
[22:32] SketchCow: i'm up to 2015-05-10 of medium.com urls
[22:33] godane: Are we archiving that whole site?
[22:33] yes
[22:33] based on sitemap
[22:33] Nice.
[22:33] they delete a lot of articles so it needs to be done
[22:34] Probably a good idea. There's a lot of good stuff in there.
[22:34] that way i can just download the full daily sitemap every few days or so
[22:35] once it's up today
[22:35] SketchCow: all i've found is things to rename files, not folders, and just changing the type to -d doesn't seem to help
[22:36] anomie: 188 410 errors just in 2015-05-10 dump
[22:36] godane: Are we (or you alone?) going to create a cron job and update it continuously?
[22:36] i will update it continuously
[22:37] Nice.
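A minimal sketch of the rename asked for at 20:50 and again at 22:35, written in Python rather than bash. It assumes the goal is to strip every leading dot from both file and folder names under one top-level directory. The function name `strip_leading_dots` is made up for illustration, and skipping on a name clash is an assumption about the desired behaviour, not something stated in the channel.

```python
import os


def strip_leading_dots(root):
    """Remove leading dots from every file and folder name below root.

    Returns a list of (old_path, new_path) pairs for the renames performed.
    The root directory itself is never renamed.
    """
    renamed = []
    # topdown=False walks the deepest directories first, so a folder is only
    # renamed after everything inside it has already been handled.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            stripped = name.lstrip(".")
            if stripped == name or not stripped:
                continue  # no leading dot, or the name is nothing but dots
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, stripped)
            if os.path.lexists(new):
                continue  # assumption: skip instead of overwriting a clash
            os.rename(old, new)
            renamed.append((old, new))
    return renamed
```

Calling `strip_leading_dots("Fanfiction")` (the directory name here is hypothetical) would turn `.foo/..bar/...baz.txt` into `foo/bar/baz.txt`. Walking bottom-up is what makes folders work as well as files: the paths of entries still waiting to be renamed stay valid because their parent folders are renamed last.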
[22:38] so 2015-05-08 has 409 urls with 410 errors
[22:38] and 2015-05-09 has 262 with 410 errors
[22:39] anomie: you're new. godane is one of our best vacuums.
[22:42] Nice.
[22:42] *** xk_id has quit IRC (Remote host closed the connection)
[22:43] *** Start has quit IRC (Ping timeout: 306 seconds)
[22:43] *** xk_id has joined #archiveteam
[22:46] *** Start has joined #archiveteam
[22:51] *** xk_id has quit IRC (Remote host closed the connection)
[23:01] Does anyone know of any forum scrapers? I'm planning to extract texts out of a private vBulletin powered forum and don't need any markup or resources.
[23:02] *** Guest100 has quit IRC (My Mac has gone to sleep. ZZZzzz…)
[23:04] *** c_b2 has joined #archiveteam
[23:04] *** c_b2 has quit IRC (Client Quit)
[23:05] *** c_b has quit IRC (Ping timeout: 252 seconds)
[23:05] *** wyatt8740 has quit IRC (Remote host closed the connection)
[23:06] PotcFdk: best to grab in WARC first, and then extract from that
[23:06] no idea if any such tools exist though
[23:07] joepie91: I did some searching, but haven't been able to find anything. I guess I might have to fiddle something together by RegExing or parsing the HTML
[23:10] PotcFdk: don't use regex :)
[23:10] regex for html is bad
[23:10] PotcFdk: what languages do you speak?
[23:11] Some scripting langs (Lua, bash), C(++), Java, a bit of Go
[23:11] hmmm.
[23:11] PotcFdk: the only one of those that I'd expect to have a reasonable HTML parser, would be Go
[23:11] like, one where you aren't busy writing boilerplate for the next 2 months
[23:11] to extract a username
[23:11] lol
[23:12] haha
[23:12] PotcFdk: my first recommendation would generally be Cheerio (JS), and second recommendation lxml/BeautifulSoup (Python), but neither of those were in your list :P
[23:13] I can move myself forward in JS, I just might need more time, but I guess that's okay
[23:13] PotcFdk: then Cheerio might be a good choice. it's basically jQuery without a browser
[23:13] you'd generally run it in Node.js, but technically you could run it in pretty much any JS runtime
[23:13] Sounds interesting, that might help me
[23:14] PotcFdk: https://github.com/cheeriojs/cheerio
[23:14] PotcFdk: combine with http://cryto.net/~joepie91/blog/2015/05/04/functional-programming-in-javascript-map-filter-reduce/
[23:14] if you want nice code
[23:14] and https://docs.npmjs.com/ + https://nodejs.org/api/modules.html if you haven't used Node before
[23:15] et voila
[23:15] and I just realized that this is #archiveteam
[23:15] so we should probably move this to #archiveteam-bs
[23:15] :P
[23:15] Truth
[23:23] *** xk_id has joined #archiveteam
[23:29] is the current best method for imaging old Mac GCR floppies still just using an old Mac to read/image them?
[23:37] *** aaaaaaaa_ has joined #archiveteam
[23:37] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[23:37] *** swebb sets mode: +o aaaaaaaa_
[23:37] *** aaaaaaaa_ is now known as aaaaaaaaa
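A rough sketch of the "parse the HTML, don't regex it" advice from the 23:01 to 23:15 forum-scraper thread. The channel recommends Cheerio or lxml/BeautifulSoup; this swaps in Python's standard-library `html.parser` only so the example runs without installing anything. The names `TextExtractor` and `html_to_text` are hypothetical, and nothing here assumes vBulletin's actual markup: it just collects visible text from whatever HTML it is given, for example pages pulled out of a WARC.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Picking out particular fields, such as a post's author or body, would still need selectors matched to the forum's real templates, which is where Cheerio or BeautifulSoup are more convenient than this bare parser.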