[00:07] *** JesseW has joined #archiveteam [00:46] *** primus105 has quit IRC (Leaving.) [00:53] *** arkiver2 has joined #archiveteam [01:01] *** arkiver2 has quit IRC (Quit: Nettalk6 - www.ntalk.de) [01:17] *** schbirid2 has joined #archiveteam [01:19] *** schbirid has quit IRC (Read error: Operation timed out) [01:20] *** pokeball9 has joined #archiveteam [01:21] Hello [01:24] There seems to be no archive of the site/service of Flipnote Hatena,it closed a few years back,is it just I haven't found it yet,or did It really shut down with out a public backup? [01:31] pokeball9: URL? [01:32] For the site? [01:32] yes [01:32] I presume you are referring to the site mentioned here? https://en.wikipedia.org/wiki/Flipnote_Studio#Overview [01:33] http://flipnote.hatena.com/thankyou [01:33] And yes [01:33] This may also be a relevant link: http://ugomemo.hatena.ne.jp/ [01:34] That's the japan ver of what I sent,but yea [01:35] good to look up both [01:36] I can't seem to find any hint of it being archived anywhere,it would be a shame if that was truly the case,probably 20tb of user gen content there [01:37] The Wayback Machine appears to be down right now, so I can't check there... [01:39] I don't know of other places to look, offhand -- but other people in the channel might have further suggestions. [02:19] *** Emcy_ has quit IRC (Read error: Connection reset by peer) [02:31] *** VADemon has quit IRC (left4dead) [02:44] *** JesseW has quit IRC (Read error: Operation timed out) [02:45] *** pokeball9 has quit IRC (Read error: Operation timed out) [02:59] *** vitzli has joined #archiveteam [03:39] *** aaaaaaaaa has quit IRC (Leaving) [04:04] *** Emcy has joined #archiveteam [04:07] *** BlueMaxim has joined #archiveteam [04:21] *** JesseW has joined #archiveteam [04:24] pokeball9 -- well, there are over 250,000 URLs at that domain saved in the Wayback Machine... 
[05:07] *** Stilett0 has joined #archiveteam [05:09] *** Stiletto has quit IRC (Read error: Operation timed out) [05:11] *** vitzli has quit IRC (Quit: Leaving) [05:17] *** dashcloud has quit IRC (Read error: Operation timed out) [05:20] *** dashcloud has joined #archiveteam [05:36] so This Is My Jam's shutdown has very much impressed me [05:37] they had a deadline, they executed on it, there's a nice style on the archive, and there's no data loss (that I can see, anyway) [05:37] the spotify playlist thing is a nice touch [05:37] if you have a Spotify account [06:06] *** primus104 has joined #archiveteam [06:09] *** primus105 has joined #archiveteam [06:09] ADrive.com is closing its free service on November 16: http://www.adrive.com/basicDiscontinued [06:12] Most of it is private files, but Basic users could share files so there might be some public stuff. [06:13] *** primus104 has quit IRC (Read error: Operation timed out) [06:23] *** logan2 has joined #archiveteam [06:25] *** Ymgve__ is now known as Ymgve [06:26] *** logan has quit IRC (Read error: Operation timed out) [07:31] *** JesseW has quit IRC (Read error: Operation timed out) [07:32] *** Infreq_ is now known as Infreq [07:52] meh, researchgate partially 429d me overnight [08:03] is there a way to make wpull not fail totally if a line in inputfile is malformed? [08:08] *** PurpleSym has joined #archiveteam [08:15] *** schbirid2 has quit IRC (Ping timeout: 306 seconds) [09:39] *** schbirid has joined #archiveteam [09:43] *** primus105 has quit IRC (Leaving.) [09:51] zhongfu: can you have a look at your connection? [09:51] It looks like you're returning bad thingiverse items [09:52] chfoo: can you please send me the logs of thingiverse? We probably have around 3000 bad items [09:53] *** signius has quit IRC (Read error: Operation timed out) [09:53] arkiver: lemme check [09:54] arkiver: so sorry! I got banned by Thingiverse, will stop now [09:54] what status code did you get from thingiverse? 
[09:56] https://znx.cc/s1443347758.png [09:56] thanks! [09:56] any idea how you got banned? was it because of this grab? [09:57] I think so, maybe my concurrent was too high [09:57] I think it was at 10 or so [09:57] hmm [09:57] we should warn people then to leave their concurrent at 2 or something around that [10:00] I can't tell if it was a manual or automatic ban [10:05] *** signius has joined #archiveteam [10:06] *** mksplg has quit IRC (Remote host closed the connection) [10:12] zhongfu: ok [10:29] *** SimpBrain has quit IRC (Quit: Leaving) [10:33] *** BlueMaxim has quit IRC (Quit: Leaving) [10:51] *** sylt has joined #archiveteam [10:53] has anybody got any experience with the common crawl datasets? I'd like to grapple with it but I'm having a hard time working out what the scale and costs involved will be, it's simply colossal. [10:53] *** schbirid has quit IRC (Read error: Operation timed out) [10:54] *** schbirid has joined #archiveteam [10:55] it sounds like a simple task "find matches for this regex", but just the sheer amount of CPU time involved is mind boggling. I can't even make reasonable guesses of how large the full WARC files are when decompacted to put an upper limit on matching for a particular EC2 instance size. [11:14] *** zenguy_pc has quit IRC (Read error: Operation timed out) [11:18] *** zenguy_pc has joined #archiveteam [11:28] phew, researchgate is about 20 million publications [11:30] *** primus104 has joined #archiveteam [11:30] seems like elasticmapreduce is the way to go for cost reasons at least. [12:16] *** primus104 has quit IRC (Leaving.) [12:21] *** GLaDOS has quit IRC (Ping timeout: 252 seconds) [12:25] *** mksplg has joined #archiveteam [12:36] *** GLaDOS has joined #archiveteam [13:03] *** godane has quit IRC (Leaving.) 
[13:11] *** khaoohs_ has quit IRC (Read error: Connection reset by peer) [13:20] *** khaoohs has joined #archiveteam [13:29] *** mksplg has quit IRC (Quit: WeeChat 1.0.1) [13:36] *** Ungstein has quit IRC (Quit: Leaving.) [13:38] *** Ungstein has joined #archiveteam [13:57] We can do a discovery project where we go through the 56,8 billion possible public adrive urls [13:57] *** xk_id has quit IRC (Remote host closed the connection) [13:58] We likely won't make it, but if we go through only 10% of those urls we will also discover ~10% of the puslic files hosted on adrive [13:58] public* [14:12] SketchCow ^ [14:22] arkiver: wonder how many of those files would have links in common crawl. [14:45] *** primus104 has joined #archiveteam [14:55] http://harrycross.me/430.png [14:55] getting that [14:56] hmmm [14:56] my worker also not pushing [14:56] eh [14:56] I think FOS is having issues [14:56] wuts that [14:57] the place that all the files are being uploaded too [15:00] *** SN4T14 has quit IRC (Read error: Connection reset by peer) [15:06] *** SN4T14 has joined #archiveteam [15:18] HCross: err. that looks proper broken/disabled [15:19] paging arkiver [15:19] Permission denied is not a good error... [15:19] arkiver: need my box as backup target? [15:20] yipdw: the rsync from the above image is the one you set up right? [15:20] joepie91: yeah, let's do that [15:23] *** mksplg has joined #archiveteam [15:42] *** SimpBrain has joined #archiveteam [15:47] chfoo: yipdw: can you please have a look if the thingiverse FOS rsync permits creating new dirs? currently it looks like it's not [15:56] chfoo: yipdw: extra urgency; my box is not currently available, as it likely has a failing drive [15:58] next ip banned from researchgate, after about 10k /publication/ downloads [16:00] joepie91, was it a 4TB Dacentec box? [16:00] HCross: yeah [16:00] is there a trivial dumb-schbirid way to tell wpull "stop if you encounter a 429, save the state" and resume that later? 
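One way to get the "stop if you encounter a 429, save the state, resume later" behaviour asked about above is a small wrapper around the URL list. The sketch below is a generic Python illustration and does not use or describe any wpull option; the names `crawl`, `fetch_status`, and the state file path are hypothetical.

```python
import json
import os
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code for url (hypothetical helper, stdlib only).

    Non-HTTP failures such as DNS errors are not handled here and will raise.
    """
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def crawl(urls, state_path, fetch=fetch_status):
    """Fetch urls in order; on the first 429, save the unfinished URLs and stop.

    Returns True if every URL was processed, False if the run was paused.
    """
    # Resume from a saved queue if a previous run was paused.
    if os.path.exists(state_path):
        with open(state_path) as handle:
            pending = json.load(handle)
    else:
        pending = list(urls)

    while pending:
        if fetch(pending[0]) == 429:
            # Rate limited: persist the remaining queue, including this URL.
            with open(state_path, "w") as handle:
                json.dump(pending, handle)
            return False
        pending.pop(0)

    # Finished: remove any stale state file.
    if os.path.exists(state_path):
        os.remove(state_path)
    return True
```

Calling `crawl(urls, "crawl_state.json")` again after a pause picks up at the URL that returned 429, so the wait between runs can be as long as the ban requires.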
[16:00] the 2x2tb one joepie91 [16:00] HCross: no, 1x4 [16:00] bad batch? [16:00] ah, ive got a 2x2tb server thats currently grabbing. Happy to hand it over [16:01] HCross: put SMART data in support ticket, going to wait and let them sort it out [16:01] :P [16:01] HCross: like, if you have space available, you can offer also [16:01] but check SMART carefully [16:01] I don't know what their purchasing strategy is, and if it's from a bad batch, your server may be affected also [16:02] Will check [16:02] *** zenguy_pc has quit IRC (Read error: Operation timed out) [16:02] atm its in softraid [16:02] *** dashcloud has quit IRC (Read error: Operation timed out) [16:02] HCross: how much load have you put on the disks? [16:02] nothing. All its done is AT Grabs [16:02] mmm [16:03] HCross: http://sprunge.us/XLWh [16:03] look at Runtime_Bad_Block [16:03] HCross: maybe try to put a bit of (read?) load on your box, see if that remains 0 [16:03] if it does, it's probably unaffected, and we can use it for this maybe [16:03] though burn-in is always nice [16:03] (most HDDs fail either early or late - bathtub curve) [16:06] joepie91, problem is that its in softraid [16:07] HCross: ok? [16:07] Not sure how to check the 2 disks [16:07] HCross: are they not still exposed as /dev/sd* in softraid? [16:08] *** zenguy_pc has joined #archiveteam [16:10] joepie91, http://harrycross.me/56e.png [16:11] HCross: right. `smartctl --attributes /dev/sda` and idem for /dev/sdb, you may need to install smartmontools first [16:13] Do http://harrycross.me/7d8.png [16:13] look healthy [16:13] no reallocated sectors which is a good start. [16:14] been spinning for 2.6 years [16:14] no idea what multi zone error rate is. [16:14] thing which SMART status output is that there's zero standards whatsoever about what they mean. [16:15] "Write Error Rate / Multi-Zone Error Rate (Western Digital) S.M.A.R.T. parameter indicates the total number of errors appearing while recording data to a hard disk. 
This may be caused by problems with disk surface or the read/write heads." [16:16] but no indication anywhere of what the value means. [16:17] *** dashcloud has joined #archiveteam [16:26] Shall I do some read testing and see [16:31] http://harrycross.me/0ae.png [16:31] /dev/sdb seems weaker [16:32] HCross: ah, your drives aren't new [16:32] HCross: yeah, that looks fine to me, at a glance [16:33] Yep. Will 1.5TB go anywhere? [16:33] HCross: if it's not showing any pre-fail after 2.6 years then you're unlikely to run into issues [16:33] not any time soon anyway [16:33] Ok [16:33] HCross: think so. would have to ask arkiver [16:35] *** SimpBrain has quit IRC (Quit: Leaving) [16:36] HCross: one more thing to check [16:37] Ok. Fire ahead [16:37] HCross: install hdparam, then run `hdparam -I /dev/sdb` [16:37] should give you a model number [16:37] load cycle count is a bit weird for sdb [16:37] E: Unable to locate package hdparam [16:37] trying to interpret SMART values is about as bad as trying to read javascript [16:37] hdparm [16:38] er [16:38] yeah [16:38] hdparm [16:38] lol [16:38] SMART would have been a lot better if every value was just a boolean "SHIT IS BROKEN" [16:38] Model Number: WDC WD2002FYPS-02W3B0 [16:39] HCross: okay, that's an IntelliPower drive [16:39] so that should be normal [16:39] HCross: can you also check sda? [16:40] WDC WD2002FAEX-007BA0 [16:40] HCross: we should probably move to -bs :) [16:45] *** JesseW has joined #archiveteam [16:46] *** sylt has quit IRC (Ping timeout: 240 seconds) [16:47] *** robink has quit IRC (Read error: Operation timed out) [16:58] *** Ravenloft has joined #archiveteam [17:04] *** midas has quit IRC (Ping timeout: 362 seconds) [17:09] *** dashcloud has quit IRC (Read error: Operation timed out) [17:16] *** dashcloud has joined #archiveteam [17:25] *** Ungstein has quit IRC (Quit: Leaving.)
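The SMART checks discussed above (no reallocated sectors, watching Runtime_Bad_Block) could be scripted roughly as follows. This is only a sketch: it assumes the usual tabular output of `smartctl --attributes`, with the attribute name in the second column and the raw value in the tenth. The `WATCHED` set, the function name, and the sample values are made up for illustration and are not readings from the drives in this log.

```python
# Attributes whose raw value should normally stay at zero (illustrative choice).
WATCHED = {"Reallocated_Sector_Ct", "Runtime_Bad_Block", "Current_Pending_Sector"}


def nonzero_watched(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    problems = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            problems[name] = int(raw)
    return problems


# Invented sample in the shape of `smartctl --attributes /dev/sda` output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       22800
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       12
"""

print(nonzero_watched(SAMPLE))
```

As noted in the channel, vendors differ in how they report raw values, so a non-zero result is a prompt to look closer and not a verdict on the drive.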
[17:28] *** Ungstein has joined #archiveteam [17:40] *** aaaaaaaaa has joined #archiveteam [17:58] arkiver: if it comes to it, you can force a limit on how much concurrency is allowed for a particular job. [18:02] *** Ravenloft has quit IRC (Read error: Connection reset by peer) [18:02] *** zenguy_pc has quit IRC (Read error: Connection reset by peer) [18:05] *** JesseW has quit IRC (Read error: Operation timed out) [18:10] It looks like all my thingiverse grabs are failing with 429 errors :-/ I have concurrency set to 1 per IP - anything I can do there? [18:10] matthusby, it's a known issue [18:11] ok [18:12] Guess I will toss more at blingee then :) [18:19] *** zenguy_pc has joined #archiveteam [18:19] ive just taken my main server down for maintenance [18:27] *** SimpBrain has joined #archiveteam [18:33] *** SignT has joined #archiveteam [18:33] *** Ravenloft has joined #archiveteam [18:33] Hello [18:33] *** Dark_Star has joined #archiveteam [18:34] hi. just fyi, the digitize and fileformat wikis are broken again. lots and lots of PHP errors [18:35] SketchCow: ^ [18:36] I've been trying to get ArchiveTeam Warrior working by following the quick start, however I've been getting a breakpoint exception while running the virtual machine. [18:36] Is this a known issue? [18:37] when? on boot or selecting a project? [18:37] On boot. [18:38] try nuking the machine and reimporting it [18:38] I did, as well as redownloading the appliance. [18:39] Quick nitpick, the http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior seems to claim it's a 174MB appliance [18:39] which does not seem to be the case when I downloaded it. [18:39] Perhaps that's the problem? [18:39] it should be around 667MB, I believe [18:43] Do you have a link where I can get it? [18:44] *** JesseW has joined #archiveteam [18:48] oops, wrong ova [18:48] it is 166MB [18:49] Yeah, I got that one. Still a breakpoint exception.
[18:52] I don't know, diagnosing things like this is a bit out of my wheelhouse. But breakpoint exception makes me think you are running some sort of debugger. Maybe someone else will have an idea. [18:58] Ah, I have solved the problem by raising the available memory in the virtual machine. [18:58] Looks like the defaults were not enough. [18:58] Thanks aaaaaaaaa [19:00] Out of curiosity, how much was assigned and how much did you give it? [19:00] also you are welcome, sorry I couldn't have been more help [19:00] 512 MB Base and 1MB Video Memory [19:01] I raised both of them to 1024 and 8 respectively (and together so I don't know which one did the trick) [19:04] *** aaaaaaaa_ has joined #archiveteam [19:04] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer) [19:11] *** aaaaaaaa_ is now known as aaaaaaaaa [19:30] *** ersi has quit IRC (Read error: Operation timed out) [19:45] *** robink has joined #archiveteam [19:49] *** ersi has joined #archiveteam [19:51] *** JesseW has quit IRC (Read error: Operation timed out) [19:54] *** dashcloud has quit IRC (Read error: Operation timed out) [20:01] *** dashcloud has joined #archiveteam [20:05] *** JesseW has joined #archiveteam [20:10] *** aaaaaaaa_ has joined #archiveteam [20:10] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer) [20:18] *** aaaaaaaa_ is now known as aaaaaaaaa [20:20] *** JesseW has quit IRC (Read error: Operation timed out) [20:21] *** godane has joined #archiveteam [20:23] *** RichardG has quit IRC (Read error: Operation timed out) [20:25] *** schbirid has quit IRC (Quit: Leaving) [20:26] *** JesseW has joined #archiveteam [20:30] *** RichardG has joined #archiveteam [20:32] *** JesseW has quit IRC (Read error: Operation timed out) [20:38] *** aaaaaaaa_ has joined #archiveteam [20:38] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer) [20:43] *** dashcloud has quit IRC (Read error: Operation timed out) [20:50] *** dashcloud has joined #archiveteam [20:52] ***
phuzion has quit IRC (Read error: Operation timed out) [21:03] *** lhobas has quit IRC (Ping timeout: 252 seconds) [21:04] *** aaaaaaaa_ is now known as aaaaaaaaa [21:06] *** lhobas has joined #archiveteam [21:08] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer) [21:14] *** SimpBrain has quit IRC (Read error: Connection reset by peer) [21:17] *** dx has quit IRC (Quit: RIP) [21:17] *** dx has joined #archiveteam [21:29] *** Spring has joined #archiveteam [21:30] I worry a bit about archive.is [21:30] it has many pages saved on demand that no where else does [21:41] *** fie has joined #archiveteam [21:50] *** PurpleSym has quit IRC (Remote host closed the connection) [21:51] *** philpem has joined #archiveteam [21:56] *** robink has quit IRC (Ping timeout: 492 seconds) [21:57] *** robink has joined #archiveteam [21:58] Spring: yeah, same [21:59] *** JesseW has joined #archiveteam [22:03] *** zhongfu_ has joined #archiveteam [22:05] *** zhongfu has quit IRC (Ping timeout: 483 seconds) [22:07] SketchCow: can you please take a look at this permission problem with thingiverse? 
http://harrycross.me/430.png [22:07] And if you can, please remove all files from thingiverse grabbed by zhongfu [22:11] *** garyrh has quit IRC (hub.dk irc.homelien.no) [22:11] *** Ymgve has quit IRC (hub.dk irc.homelien.no) [22:11] *** limebyte has quit IRC (hub.dk irc.homelien.no) [22:11] *** mafrasi2 has quit IRC (hub.dk irc.homelien.no) [22:11] *** i0npulse has quit IRC (hub.dk irc.homelien.no) [22:11] *** yipdw has quit IRC (hub.dk irc.homelien.no) [22:11] *** altlabel has quit IRC (hub.dk irc.homelien.no) [22:11] *** Meeh has quit IRC (hub.dk irc.homelien.no) [22:11] *** Pythia has quit IRC (hub.dk irc.homelien.no) [22:13] *** Specular has joined #archiveteam [22:15] *** Spring has quit IRC (Ping timeout: 369 seconds) [22:16] *** yipdw_ has joined #archiveteam [22:18] *** HCross has quit IRC (Ping timeout: 252 seconds) [22:20] *** diacope has quit IRC (Ping timeout: 252 seconds) [22:20] *** Rickster has quit IRC (Ping timeout: 252 seconds) [22:20] *** dx has quit IRC (Ping timeout: 252 seconds) [22:21] *** Famicoman has quit IRC (Ping timeout: 252 seconds) [22:22] *** terburg has joined #archiveteam [22:24] *** kalyx has joined #archiveteam [22:24] hi [22:24] Is there a archive team member here? 
[22:25] *** diacope has joined #archiveteam [22:25] *** HCross has joined #archiveteam [22:27] *** SignT has quit IRC (Ping timeout: 492 seconds) [22:29] arkiver [22:31] *** diacope has quit IRC (Remote host closed the connection) [22:31] *** terburg has quit IRC (Read error: Connection reset by peer) [22:33] *** Rickster has joined #archiveteam [22:35] *** diacope has joined #archiveteam [22:38] *** Famicoman has joined #archiveteam [22:39] kalyx: many - but most are probably not paying attention [22:39] kalyx: just ask your question / say your thing :) [22:39] (and ideally stick around for a bit, to wait for a response) [22:39] I'd like to take about the email i received stating that you are archiving lainchan.org and ask them to stop [22:40] talk* whoops [22:40] kalyx: what email? [22:41] kalyx: keep in mind that archiveteam is comprised of volunteers, and is organized ad-hoc, so there's no real central point of authority, nor is everybody necessarily aware of everything that's going on [22:41] ah [22:41] should I then dm the person directly who contacted me? [22:41] kalyx: if possible, that is probably preferable. I'm curious what the situation is, though :) [22:42] I received a donation stating the person was sorry for using wget to archive about 20GB of my sites data since they plan on making an archive of it. In the name of archive team. [22:43] And as an imageboard, we actively block archives and would prefer that people respect that. [22:43] kalyx: right. it's unlikely that anybody is going to honour that, though :) [22:43] kalyx: it's part of culture, and thus history [22:43] sure but since this is a respectable organization I can ask politely [22:43] kalyx: you'll have to stick around and see if anyone fesses up [22:44] I don't know who was doing that [22:44] alternatively, block them [22:44] (if you haven't already) [22:44] ok [22:44] do you know the user-agent? 
[22:44] * ersi gets interested in archiving lainchain.org [22:45] I haven't looked into it yet [22:45] it might be an archivebot job, in which case the UA is easily identifiable and we can stop it [22:45] * joepie91 personally considers the interests of society at large to trump the preferences of site owners, especially in cases where 1) it's user content, not the site owner's content and 2) there are no technical issues being caused by it [22:45] although I don't think so since it's not in the active job list [22:45] My users generally agree that archives are not what we want. [22:45] people before principles [22:46] I don't want to argue about that, I'd just like to ask someone in charge to stop. [22:46] I'll wait [22:46] if it's not running on one of our more well-known systems (and AFAIK it isn't) it's probably someone here who may or may not speak up [22:47] Like an unofficial type of thing? [22:48] considering there aren't an actual organisation.. yeah [22:48] kalyx: archiveteam is completely ad-hoc. [22:48] Archive team is what you're looking at. This isn't an organization, just a bunch of people on IRC who think archiving is a good thing. [22:48] kalyx: yes [22:48] echo echo echo [22:49] the most we have in the way of central coordination is archiveteam.org's wiki, and there's no results for a lainchan page on there [22:49] kalyx: also, just for the sake of clarity - we're not archive.org / the Internet Archive [22:49] :) [22:49] Interesting. 
[23:00] *** Muad-Dib has quit IRC (Quit: ZNC - http://znc.in) [23:20] *** godane has quit IRC (Read error: Operation timed out) [23:21] *** aaaaaaaaa has joined #archiveteam [23:30] *** dashcloud has quit IRC (Read error: Operation timed out) [23:31] *** nertzy has joined #archiveteam [23:37] *** dashcloud has joined #archiveteam [23:38] *** garyrh has joined #archiveteam [23:38] *** Ymgve has joined #archiveteam [23:38] *** limebyte has joined #archiveteam [23:38] *** mafrasi2 has joined #archiveteam [23:38] *** i0npulse has joined #archiveteam [23:38] *** altlabel has joined #archiveteam [23:38] *** Meeh has joined #archiveteam [23:38] *** Pythia has joined #archiveteam [23:41] *** ersi has quit IRC (Ping timeout: 240 seconds) [23:43] *** ersi has joined #archiveteam [23:57] *** BlueMaxim has joined #archiveteam [23:57] *** philpem has quit IRC (Ping timeout: 252 seconds)