[00:13] *** Stil3tt0 is now known as Stiletto
[00:14] *** GE has quit IRC (Remote host closed the connection)
[01:00] *** __sagitai has joined #archiveteam-bs
[01:05] *** _sagitair has quit IRC (Ping timeout: 370 seconds)
[01:15] *** db420 is now known as dboard
[01:16] *** icedice2 has joined #archiveteam-bs
[01:17] *** icedice has quit IRC (Read error: Connection reset by peer)
[01:20] *** rocode_ has joined #archiveteam-bs
[01:27] *** rocode has quit IRC (Ping timeout: 246 seconds)
[01:27] *** rocode_ is now known as rocode
[01:30] *** kristian_ has quit IRC (Quit: Leaving)
[02:29] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[02:41] *** _sagitair has joined #archiveteam-bs
[02:47] *** __sagitai has quit IRC (Ping timeout: 370 seconds)
[02:55] *** SchroSct has joined #archiveteam-bs
[02:55] I made it!
[02:58] *** schbirid2 has joined #archiveteam-bs
[03:03] *** schbirid has quit IRC (Read error: Operation timed out)
[03:29] *** odemg has joined #archiveteam-bs
[03:37] *** pizzaiolo has left
[04:43] *** NONSS has joined #archiveteam-bs
[04:48] *** Nons has quit IRC (Read error: Operation timed out)
[05:08] *** VADemon has quit IRC (Quit: left4dead)
[05:18] *** icedice2 has quit IRC (Quit: Leaving)
[05:28] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[05:34] *** Sk1d has joined #archiveteam-bs
[05:42] *** User405 has joined #archiveteam-bs
[05:43] *** User404 has quit IRC (Read error: Connection reset by peer)
[06:22] *** GE has joined #archiveteam-bs
[06:33] *** unkn0wn_ has quit IRC ()
[07:05] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[07:21] *** GE has quit IRC (Remote host closed the connection)
[07:32] *** odemg has quit IRC (Remote host closed the connection)
[07:33] *** odemg has joined #archiveteam-bs
[07:48] the gory details of why gitlab failed: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
[07:48] (very good write-up)
[08:35] *** paparus has joined #archiveteam-bs
[08:36] okay so, i'd say we're interested
[08:36] (though interest is pretty much defined by the number of people willing to help)
[08:37] search is of course an achilles heel
[08:37] I think the problem here is that specialized enumeration is needed
[08:37] for each site
[08:37] but: is all the important data available on GET endpoints, like how you linked http://courtindex.sdcourt.ca.gov/CISPublic/casedetail?casenum=SCA153865&casesite=SD&applcode=C
[08:37] ?
[08:37] no
[08:38] it depends on the specific site
[08:38] ah
[08:38] *** namespace has joined #archiveteam-bs
[08:38] that's just an example
[08:38] what would the result be on archive.org?
[08:39] we have archivebot, which allows for websites to be archived and absorbed into the wayback machine
[08:39] but there is no link structure leading to this specific page
[08:39] so my idea was that the searches could be scraped locally in order to gather the URLs, then those would be put into archivebot
[08:40] so wayback wouldn't have the search but would have the detail pages
[08:40] is the data in the wayback machine full text searchable even if there is no link structure?
[08:40] i believe there are plans for that
[08:40] and either way, the entire collection could be downloaded
[08:41] also some results will not even have a unique url, it would be a result of some cgi script
[08:41] on another site
[08:41] yeah, that's a bigger issue
[08:42] realistically, the best thing you could do right now is to start a wiki article with a list of different sites, url structures, and requirements
[08:42] ok, let me think this over
[08:43] would the archive.org have problems with this type of information?
[08:43] I mean it has some personal names and stuff, but it's all public
[08:43] i (and the majority of archive team) don't speak for archive.org
[08:43] ok, but in your opinion?
[08:44] like I've done some research and there were cases where people got in trouble for similar stuff
[08:44] for instance this is a similar case: https://www.reddit.com/r/Denmark/comments/42w67s/i_am_the_person_who_made_tingbogenstatistikorg/
[08:45] it's government websites, i think currently those are not just accepted but welcomed. personally, i don't have enough of a conscience to know what sort of data is present and what dangers it could pose to people
[08:45] a guy crawled the danish property registry, and published a site online
[08:45] with the data
[08:45] but the danish apparently have a thing called address protection where you register to have your address not showing in the registry for some time
[08:46] but when he crawled it it was still showing
[08:46] so it cause a piss storm in denmark and he had to bring the site down
[08:46] i see
[08:47] Well.
[08:47] yeah, well, i remember news stories saying you can be sued just for accessing a website that wasn't "supposed" to be public, so
[08:47] In the US, protected information basically isn't a thing AFAIK.
[08:47] The access thing can be an issue depending on interpretation of the CFAA.
[08:47] But meh.
[08:47] ArchiveTeam deals with that all the time.
[08:48] As does IA. Whether IA wants to host the info just on decency grounds is a different story though.
[08:49] i think you could page somebody from IA with an exact description of what's to be uploaded.
[08:50] do you have a contact?
[08:50] SketchCow
[08:51] ok, I'll try
[08:56] paparus: Sanqui: there are plans for full-text search, but archive as if there aren't
[08:56] it's quite likely to take quite some time before it appears
[08:56] I'd imagine that stuff like the Canada backup is higher-priority right now
[08:56] and full-text search on a dataset of this magnitude is *really expensive*
[08:56] (ie. it's likely a question of resources, not of tech)
[08:57] fair
[08:58] search or not, 1. it'd be in wayback, 2. warcs would be up for download; somebody could make their own site with fulltext search if desired
[09:07] I am reading the comments on the danish guy's website and apparently was faster and better than the gov site it crawled
[09:08] the gov site only had search by address but he added a full text search including by name
[09:08] that's government for you
[09:14] *** paparus has left
[09:15] *** paparus has joined #archiveteam-bs
[09:17] was archive.org ever sued for violation website TOS?
[10:22] *** GE has joined #archiveteam-bs
[10:50] *** __sagitai has joined #archiveteam-bs
[11:02] *** _sagitair has quit IRC (Read error: Operation timed out)
[11:26] *** GE has quit IRC (Remote host closed the connection)
[11:50] *** odemg has quit IRC (Remote host closed the connection)
[12:04] *** icedice has joined #archiveteam-bs
[12:06] *** odemg has joined #archiveteam-bs
[12:32] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[12:38] *** pizzaiolo has joined #archiveteam-bs
[12:41] archive.org isn't a user, how could they?
[12:47] it should be noted that I was on an intercept path with Nyany until we ran out of work.
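Editor's note on the 08:39 idea (scrape the searches locally, gather the detail URLs, then hand them to archivebot): a minimal sketch of the URL-gathering step, assuming the case numbers have already been collected somehow. The parameter names come from the example URL quoted in the log; the function names, the default `casesite`/`applcode` values, and the sample input are assumptions for illustration only, and other sites would need their own URL structure.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL quoted in the log.
DETAIL_BASE = "http://courtindex.sdcourt.ca.gov/CISPublic/casedetail"


def build_detail_url(casenum, casesite="SD", applcode="C"):
    """Build one detail-page URL; the defaults mirror the quoted example (assumption)."""
    query = urlencode({"casenum": casenum, "casesite": casesite, "applcode": applcode})
    return f"{DETAIL_BASE}?{query}"


def build_url_list(casenums):
    """Return detail URLs in input order, skipping blank and duplicate case numbers."""
    seen = set()
    urls = []
    for casenum in casenums:
        casenum = casenum.strip()
        if not casenum or casenum in seen:
            continue
        seen.add(casenum)
        urls.append(build_detail_url(casenum))
    return urls


if __name__ == "__main__":
    # Hypothetical input: case numbers gathered from a local scrape of the search pages.
    scraped = ["SCA153865", "SCA153865", " "]
    print("\n".join(build_url_list(scraped)))
```

The resulting list is plain text, one URL per line, which is the shape a wiki page or a crawl job could take as input. This does not cover the results that have no unique URL, which the log itself flags as the bigger issue.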
[13:05] *** GE has joined #archiveteam-bs
[13:09] so i have about 215 more episodes to go with Tech News Today
[13:09] i feel alot better now with that collection
[13:13] *** yan has quit IRC (Quit: leaving)
[13:39] *** BiggieJon has quit IRC (Quit: Page closed)
[13:44] i'm uploading the nhk world radio japan english news
[13:44] for 2017-01
[15:06] *** VADemon has joined #archiveteam-bs
[15:10] *** icedice has quit IRC (Quit: Leaving)
[15:17] *** SmileyG has quit IRC (Ping timeout: 250 seconds)
[15:19] *** VADemon has quit IRC (Quit: left4dead)
[15:50] *** Aranje has joined #archiveteam-bs
[15:50] *** odemg has quit IRC (Remote host closed the connection)
[16:09] *** VADemon has joined #archiveteam-bs
[16:09] *** odemg has joined #archiveteam-bs
[16:15] *** odemg has quit IRC (Remote host closed the connection)
[16:22] *** odemg has joined #archiveteam-bs
[16:35] *** odemg has quit IRC (Remote host closed the connection)
[16:36] *** odemg has joined #archiveteam-bs
[16:44] *** icedice has joined #archiveteam-bs
[16:47] *** pizzaiolo has quit IRC (Read error: Connection reset by peer)
[16:48] *** pizzaiolo has joined #archiveteam-bs
[16:48] *** pizzaiol1 has joined #archiveteam-bs
[16:49] *** pizzaiolo has quit IRC (Remote host closed the connection)
[16:49] *** pizzaiol1 has quit IRC (Remote host closed the connection)
[16:49] *** pizzaiolo has joined #archiveteam-bs
[17:00] *** odemg has quit IRC (Remote host closed the connection)
[17:05] *** odemg has joined #archiveteam-bs
[17:35] *** icedice2 has joined #archiveteam-bs
[17:38] *** icedice has quit IRC (Ping timeout: 260 seconds)
[17:39] *** ItsYoda has quit IRC (Ping timeout: 260 seconds)
[17:44] *** ItsYoda has joined #archiveteam-bs
[17:58] *** odemg has quit IRC (Remote host closed the connection)
[18:14] *** Smiley has joined #archiveteam-bs
[18:31] *** ItsYoda has quit IRC (Ping timeout: 260 seconds)
[18:32] *** GE has quit IRC (Remote host closed the connection)
[18:38] *** ItsYoda has joined #archiveteam-bs
[18:41] *** GE has joined #archiveteam-bs
[18:43] *** odemg has joined #archiveteam-bs
[19:01] 178.62.61.231/ytglitch.mp4
[19:05] https://www.youtube.com/watch?v=9E6dWfVwFCI
[19:10] *** Muad-Dib has quit IRC (Ping timeout: 260 seconds)
[19:22] *** ItsYoda has quit IRC (Ping timeout: 260 seconds)
[19:25] *** ItsYoda has joined #archiveteam-bs
[19:33] *** Muad-Dib has joined #archiveteam-bs
[20:08] *** Stiletto has quit IRC (Ping timeout: 250 seconds)
[20:09] *** odemg has quit IRC (Remote host closed the connection)
[20:42] *** odemg has joined #archiveteam-bs
[20:49] *** bsmith093 has quit IRC (Remote host closed the connection)
[20:50] is there a team to get pewdiepie to negative?
[20:50] *** odemg has quit IRC (Remote host closed the connection)
[20:52] *** bsmith093 has joined #archiveteam-bs
[21:03] *** kristian_ has joined #archiveteam-bs
[21:04] *** ndiddy has joined #archiveteam-bs
[21:13] hi all
[21:14] can I do something so that a website is archived in full regularly?
[21:14] not with our existing tools, but you're welcome to make new tools
[21:15] how big of a site, what is it, how often?
[21:15] xmc, I can barely code ;)
[21:15] http://starwarsmesse.dk/
[21:15] I'm thinking ... once every 60 days or so
[21:16] kristian_, I do something similar with several websites, where I will archive them every 30 days. You can use grab-site and a cron job.
[21:16] ah, rocode ... I see
[21:18] kristian_: the website is tiny. you can could by #archivebot every 60 days yourself and ask for it to be archived :p
[21:18] err
[21:18] you could stop by #archivebot*
[21:18] thanks, Sanqui ... will look into that
[21:19] it's quite small, yes ... and the genius webmaster (me) tried to make it future proof ;)
[21:30] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:35] *** odemg has joined #archiveteam-bs
[21:36] *** Stil3tt0 has joined #archiveteam-bs
[21:46] *** pizzaiolo has quit IRC (Read error: Connection reset by peer)
[21:48] *** pizzaiolo has joined #archiveteam-bs
[21:52] *** dashcloud has joined #archiveteam-bs
[22:01] *** icedice2 has quit IRC (Quit: Leaving)
[22:13] kristian_: make sure if you are using a robots.txt file it doesn't block the Internet Archive crawler (ia_archiver I believe)
[22:16] hurm ... the archiving does not show up here: http://web.archive.org/web/*/starwarsmesse.dk
[22:17] I can't see a robots.txt: http://starwarsmesse.dk/robots.txt
[22:20] what doesn't show up?
[22:21] I see snapshots there
[22:21] such ast this one http://web.archive.org/web/20170204114255/http://www.starwarsmesse.dk/
[22:21] Wayback's robots.txt parser is insanely broken or outdated - whatever you call it.
[22:21] there's no robots issue here however
[22:21] just in case. e.g. whitelisting it won't actually "allow" the access
[22:23] Frogging, I requested an archiving about an hour ago
[22:26] stuff from archivebot won't instantly show up in wayback
[22:26] it takes time. days at least, I think
[22:26] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:26] thanks, Frogging ... I'll check in in a few days
[22:27] archivebot isn't the IA, it just uploads there ultimately
[22:27] :)
[22:36] neat, how deep does it crawl?
[22:39] infinitely (on the specified domain) unless you tell it not to
[22:47] *** pizzaiolo has quit IRC (Ping timeout: 506 seconds)
[22:53] *** BlueMaxim has joined #archiveteam-bs
[23:01] much swifter than the waybackmachine interface, though
[23:21] *** BlueMaxim has quit IRC (Quit: Leaving)
[23:24] *** Stil3tt0 has quit IRC (Read error: Operation timed out)
[23:30] *** GE has quit IRC (Remote host closed the connection)
[23:34] *** kristian_ has quit IRC (Quit: Leaving)
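Editor's note on the 22:13 robots.txt advice: a rough way to check whether a given robots.txt would block the `ia_archiver` user agent mentioned in the log, sketched with Python's standard-library parser. The helper name and the sample robots.txt strings are made up for illustration. As the 22:21 message warns, Wayback's own parser may behave differently, so treat the result as a sanity check and not as a prediction of what Wayback does.

```python
from urllib.robotparser import RobotFileParser


def allows_agent(robots_txt, url, agent="ia_archiver"):
    """Return True if the given robots.txt text lets `agent` fetch `url`.

    Uses Python's own parser, which is an assumption about how a crawler
    would interpret the file, not Wayback's actual logic.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical robots.txt contents; the site in the log had no robots.txt at all.
    blocking = "User-agent: ia_archiver\nDisallow: /\n"
    permissive = "User-agent: *\nDisallow:\n"
    site = "http://starwarsmesse.dk/"
    print(allows_agent(blocking, site))
    print(allows_agent(permissive, site))
```

To check a live site, the robots.txt text would first have to be downloaded and passed in as a string; that step is left out here to keep the sketch offline.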