#archiveteam-bs 2017-02-11,Sat

Time Nickname Message
00:13 🔗 Stil3tt0 is now known as Stiletto
00:14 🔗 GE has quit IRC (Remote host closed the connection)
01:00 🔗 __sagitai has joined #archiveteam-bs
01:05 🔗 _sagitair has quit IRC (Ping timeout: 370 seconds)
01:15 🔗 db420 is now known as dboard
01:16 🔗 icedice2 has joined #archiveteam-bs
01:17 🔗 icedice has quit IRC (Read error: Connection reset by peer)
01:20 🔗 rocode_ has joined #archiveteam-bs
01:27 🔗 rocode has quit IRC (Ping timeout: 246 seconds)
01:27 🔗 rocode_ is now known as rocode
01:30 🔗 kristian_ has quit IRC (Quit: Leaving)
02:29 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
02:41 🔗 _sagitair has joined #archiveteam-bs
02:47 🔗 __sagitai has quit IRC (Ping timeout: 370 seconds)
02:55 🔗 SchroSct has joined #archiveteam-bs
02:55 🔗 SchroSct I made it!
02:58 🔗 schbirid2 has joined #archiveteam-bs
03:03 🔗 schbirid has quit IRC (Read error: Operation timed out)
03:29 🔗 odemg has joined #archiveteam-bs
03:37 🔗 pizzaiolo has left
04:43 🔗 NONSS has joined #archiveteam-bs
04:48 🔗 Nons has quit IRC (Read error: Operation timed out)
05:08 🔗 VADemon has quit IRC (Quit: left4dead)
05:18 🔗 icedice2 has quit IRC (Quit: Leaving)
05:28 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
05:34 🔗 Sk1d has joined #archiveteam-bs
05:42 🔗 User405 has joined #archiveteam-bs
05:43 🔗 User404 has quit IRC (Read error: Connection reset by peer)
06:22 🔗 GE has joined #archiveteam-bs
06:33 🔗 unkn0wn_ has quit IRC ()
07:05 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
07:21 🔗 GE has quit IRC (Remote host closed the connection)
07:32 🔗 odemg has quit IRC (Remote host closed the connection)
07:33 🔗 odemg has joined #archiveteam-bs
07:48 🔗 joepie91 the gory details of why gitlab failed: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
07:48 🔗 joepie91 (very good write-up)
08:35 🔗 paparus has joined #archiveteam-bs
08:36 🔗 Sanqui okay so, i'd say we're interested
08:36 🔗 Sanqui (though interest is pretty much defined by the number of people willing to help)
08:37 🔗 Sanqui search is of course an achilles heel
08:37 🔗 paparus I think the problem here is that specialized enumeration is needed
08:37 🔗 paparus for each site
08:37 🔗 Sanqui but: is all the important data available on GET endpoints, like how you linked http://courtindex.sdcourt.ca.gov/CISPublic/casedetail?casenum=SCA153865&casesite=SD&applcode=C
08:37 🔗 Sanqui ?
08:37 🔗 paparus no
08:38 🔗 paparus it depends on the specific site
08:38 🔗 Sanqui ah
08:38 🔗 namespace has joined #archiveteam-bs
08:38 🔗 paparus that's just an example
08:38 🔗 paparus what would the result be on archive.org?
08:39 🔗 Sanqui we have archivebot, which allows for websites to be archived and absorbed into the wayback machine
08:39 🔗 paparus but there is no link structure leading to this specific page
08:39 🔗 Sanqui so my idea was that the searches could be scraped locally in order to gather the URLs, then those would be put into archivebot
08:40 🔗 Sanqui so wayback wouldn't have the search but would have the detail pages
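[Editor's sketch] The approach Sanqui describes (collect identifiers from local searches, then hand the detail-page URLs to archivebot) might look roughly like the following. It is only an illustration: the endpoint and parameter names are copied from the example URL linked above, the default values for casesite and applcode are assumed from that one example, and the function names are hypothetical. As paparus notes, each site would need its own enumeration logic.

```python
from urllib.parse import urlencode

# Base of the detail endpoint, taken from the example URL linked in the log.
DETAIL_BASE = "http://courtindex.sdcourt.ca.gov/CISPublic/casedetail"


def detail_url(casenum, casesite="SD", applcode="C"):
    """Build one detail-page URL.

    Parameter names mirror the example URL; the defaults are assumptions
    based on that single example and may not hold for other cases.
    """
    query = urlencode({"casenum": casenum, "casesite": casesite, "applcode": applcode})
    return f"{DETAIL_BASE}?{query}"


def build_url_list(case_numbers):
    """Return de-duplicated detail URLs, preserving first-seen order."""
    seen = set()
    urls = []
    for casenum in case_numbers:
        url = detail_url(casenum)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The resulting list could be written one URL per line to a text file and submitted for archiving; the case numbers themselves would have to come from whatever the local search scrape produces.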
08:40 🔗 paparus is the data in the wayback machine full text searchable even if there is no link structure?
08:40 🔗 Sanqui i believe there are plans for that
08:40 🔗 Sanqui and either way, the entire collection could be downloaded
08:41 🔗 paparus also some results will not even have a unique url, it would be a result of some cgi script
08:41 🔗 paparus on another site
08:41 🔗 Sanqui yeah, that's a bigger issue
08:42 🔗 Sanqui realistically, the best thing you could do right now is to start a wiki article with a list of different sites, url structures, and requirements
08:42 🔗 paparus ok, let me think this over
08:43 🔗 paparus would the archive.org have problems with this type of information?
08:43 🔗 paparus I mean it has some personal names and stuff, but it's all public
08:43 🔗 Sanqui i (and the majority of archive team) don't speak for archive.org
08:43 🔗 paparus ok, but in your opinion?
08:44 🔗 paparus like I've done some research and there were cases where people got in trouble for similar stuff
08:44 🔗 paparus for instance this is a similar case: https://www.reddit.com/r/Denmark/comments/42w67s/i_am_the_person_who_made_tingbogenstatistikorg/
08:45 🔗 Sanqui it's government websites, i think currently those are not just accepted but welcomed. personally, i don't have enough of a conscience to know what sort of data is present and what dangers it could pose to people
08:45 🔗 paparus a guy crawled the danish property registry, and published a site online
08:45 🔗 paparus with the data
08:45 🔗 paparus but the danish apparently have a thing called address protection where you register to have your address not showing in the registry for some time
08:46 🔗 paparus but when he crawled it it was still showing
08:46 🔗 paparus so it caused a piss storm in denmark and he had to bring the site down
08:46 🔗 Sanqui i see
08:47 🔗 namespace Well.
08:47 🔗 Sanqui yeah, well, i remember news stories saying you can be sued just for accessing a website that wasn't "supposed" to be public, so
08:47 🔗 namespace In the US, protected information basically isn't a thing AFAIK.
08:47 🔗 namespace The access thing can be an issue depending on interpretation of the CFAA.
08:47 🔗 namespace But meh.
08:47 🔗 namespace ArchiveTeam deals with that all the time.
08:48 🔗 namespace As does IA. Whether IA wants to host the info just on decency grounds is a different story though.
08:49 🔗 Sanqui i think you could page somebody from IA with an exact description of what's to be uploaded.
08:50 🔗 paparus do you have a contact?
08:50 🔗 Sanqui SketchCow
08:51 🔗 paparus ok, I'll try
08:56 🔗 joepie91 paparus: Sanqui: there are plans for full-text search, but archive as if there aren't
08:56 🔗 joepie91 it's quite likely to take quite some time before it appears
08:56 🔗 joepie91 I'd imagine that stuff like the Canada backup is higher-priority right now
08:56 🔗 joepie91 and full-text search on a dataset of this magnitude is *really expensive*
08:56 🔗 joepie91 (ie. it's likely a question of resources, not of tech)
08:57 🔗 Sanqui fair
08:58 🔗 Sanqui search or not, 1. it'd be in wayback, 2. warcs would be up for download; somebody could make their own site with fulltext search if desired
09:07 🔗 paparus I am reading the comments on the danish guy's website and apparently it was faster and better than the gov site it crawled
09:08 🔗 paparus the gov site only had search by address but he added a full text search including by name
09:08 🔗 paparus that's government for you
09:14 🔗 paparus has left
09:15 🔗 paparus has joined #archiveteam-bs
09:17 🔗 paparus was archive.org ever sued for violating a website's TOS?
10:22 🔗 GE has joined #archiveteam-bs
10:50 🔗 __sagitai has joined #archiveteam-bs
11:02 🔗 _sagitair has quit IRC (Read error: Operation timed out)
11:26 🔗 GE has quit IRC (Remote host closed the connection)
11:50 🔗 odemg has quit IRC (Remote host closed the connection)
12:04 🔗 icedice has joined #archiveteam-bs
12:06 🔗 odemg has joined #archiveteam-bs
12:32 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
12:38 🔗 pizzaiolo has joined #archiveteam-bs
12:41 🔗 SchroSct archive.org isn't a user, how could they?
12:47 🔗 SchroSct it should be noted that I was on an intercept path with Nyany until we ran out of work.
13:05 🔗 GE has joined #archiveteam-bs
13:09 🔗 godane so i have about 215 more episodes to go with Tech News Today
13:09 🔗 godane i feel a lot better now with that collection
13:13 🔗 yan has quit IRC (Quit: leaving)
13:39 🔗 BiggieJon has quit IRC (Quit: Page closed)
13:44 🔗 godane i'm uploading the nhk world radio japan english news
13:44 🔗 godane for 2017-01
15:06 🔗 VADemon has joined #archiveteam-bs
15:10 🔗 icedice has quit IRC (Quit: Leaving)
15:17 🔗 SmileyG has quit IRC (Ping timeout: 250 seconds)
15:19 🔗 VADemon has quit IRC (Quit: left4dead)
15:50 🔗 Aranje has joined #archiveteam-bs
15:50 🔗 odemg has quit IRC (Remote host closed the connection)
16:09 🔗 VADemon has joined #archiveteam-bs
16:09 🔗 odemg has joined #archiveteam-bs
16:15 🔗 odemg has quit IRC (Remote host closed the connection)
16:22 🔗 odemg has joined #archiveteam-bs
16:35 🔗 odemg has quit IRC (Remote host closed the connection)
16:36 🔗 odemg has joined #archiveteam-bs
16:44 🔗 icedice has joined #archiveteam-bs
16:47 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
16:48 🔗 pizzaiolo has joined #archiveteam-bs
16:48 🔗 pizzaiol1 has joined #archiveteam-bs
16:49 🔗 pizzaiolo has quit IRC (Remote host closed the connection)
16:49 🔗 pizzaiol1 has quit IRC (Remote host closed the connection)
16:49 🔗 pizzaiolo has joined #archiveteam-bs
17:00 🔗 odemg has quit IRC (Remote host closed the connection)
17:05 🔗 odemg has joined #archiveteam-bs
17:35 🔗 icedice2 has joined #archiveteam-bs
17:38 🔗 icedice has quit IRC (Ping timeout: 260 seconds)
17:39 🔗 ItsYoda has quit IRC (Ping timeout: 260 seconds)
17:44 🔗 ItsYoda has joined #archiveteam-bs
17:58 🔗 odemg has quit IRC (Remote host closed the connection)
18:14 🔗 Smiley has joined #archiveteam-bs
18:31 🔗 ItsYoda has quit IRC (Ping timeout: 260 seconds)
18:32 🔗 GE has quit IRC (Remote host closed the connection)
18:38 🔗 ItsYoda has joined #archiveteam-bs
18:41 🔗 GE has joined #archiveteam-bs
18:43 🔗 odemg has joined #archiveteam-bs
19:01 🔗 odemg 178.62.61.231/ytglitch.mp4
19:05 🔗 Aoede https://www.youtube.com/watch?v=9E6dWfVwFCI
19:10 🔗 Muad-Dib has quit IRC (Ping timeout: 260 seconds)
19:22 🔗 ItsYoda has quit IRC (Ping timeout: 260 seconds)
19:25 🔗 ItsYoda has joined #archiveteam-bs
19:33 🔗 Muad-Dib has joined #archiveteam-bs
20:08 🔗 Stiletto has quit IRC (Ping timeout: 250 seconds)
20:09 🔗 odemg has quit IRC (Remote host closed the connection)
20:42 🔗 odemg has joined #archiveteam-bs
20:49 🔗 bsmith093 has quit IRC (Remote host closed the connection)
20:50 🔗 SchroSct is there a team to get pewdiepie to negative?
20:50 🔗 odemg has quit IRC (Remote host closed the connection)
20:52 🔗 bsmith093 has joined #archiveteam-bs
21:03 🔗 kristian_ has joined #archiveteam-bs
21:04 🔗 ndiddy has joined #archiveteam-bs
21:13 🔗 kristian_ hi all
21:14 🔗 kristian_ can I do something so that a website is archived in full regularly?
21:14 🔗 xmc not with our existing tools, but you're welcome to make new tools
21:15 🔗 xmc how big of a site, what is it, how often?
21:15 🔗 kristian_ xmc, I can barely code ;)
21:15 🔗 kristian_ http://starwarsmesse.dk/
21:15 🔗 kristian_ I'm thinking ... once every 60 days or so
21:16 🔗 rocode kristian_, I do something similar with several websites, where I will archive them every 30 days. You can use grab-site and a cron job.
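[Editor's sketch] One way to set up what rocode suggests is a small script that cron runs daily and that only starts a crawl once the chosen interval has passed. This is a minimal sketch, not rocode's actual setup: it assumes grab-site is installed and on PATH and is invoked with just the start URL, and the function names and the way the last run time is tracked are hypothetical.

```python
import subprocess
from datetime import datetime, timedelta


def is_due(last_run, now, interval_days=60):
    """True if no crawl has run yet or the interval has elapsed."""
    return last_run is None or now - last_run >= timedelta(days=interval_days)


def build_command(url):
    """Command line for one crawl; grab-site takes the start URL as an argument."""
    return ["grab-site", url]


def run_if_due(url, last_run, now=None, interval_days=60):
    """Start a crawl only when one is due. Returns True if a crawl was started."""
    now = now or datetime.now()
    if not is_due(last_run, now, interval_days):
        return False
    # Assumption: grab-site is installed; check=True raises if it exits non-zero.
    subprocess.run(build_command(url), check=True)
    return True
```

A daily cron entry calling a wrapper around run_if_due would then give a crawl roughly every 60 days (or 30, as in rocode's case); where the last run timestamp is stored is left open here.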
21:16 🔗 kristian_ ah, rocode ... I see
21:18 🔗 Sanqui kristian_: the website is tiny. you can could by #archivebot every 60 days yourself and ask for it to be archived :p
21:18 🔗 Sanqui err
21:18 🔗 Sanqui you could stop by #archivebot*
21:18 🔗 kristian_ thanks, Sanqui ... will look into that
21:19 🔗 kristian_ it's quite small, yes ... and the genius webmaster (me) tried to make it future proof ;)
21:30 🔗 dashcloud has quit IRC (Read error: Operation timed out)
21:35 🔗 odemg has joined #archiveteam-bs
21:36 🔗 Stil3tt0 has joined #archiveteam-bs
21:46 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
21:48 🔗 pizzaiolo has joined #archiveteam-bs
21:52 🔗 dashcloud has joined #archiveteam-bs
22:01 🔗 icedice2 has quit IRC (Quit: Leaving)
22:13 🔗 dashcloud kristian_: make sure if you are using a robots.txt file it doesn't block the Internet Archive crawler (ia_archiver I believe)
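[Editor's sketch] The check dashcloud describes can be approximated locally with Python's standard robots.txt parser. This only shows how Python's parser reads the rules for the ia_archiver user agent; real crawlers may interpret the same file differently, so treat the result as a hint rather than a guarantee. The function name is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allows_agent(robots_txt, url, agent="ia_archiver"):
    """Return True if the given robots.txt text lets `agent` fetch `url`.

    Uses Python's own parser, which need not match any crawler's behavior.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

For a site with no robots.txt at all, passing an empty string models the "nothing is disallowed" case.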
22:16 🔗 kristian_ hurm ... the archiving does not show up here: http://web.archive.org/web/*/starwarsmesse.dk
22:17 🔗 kristian_ I can't see a robots.txt: http://starwarsmesse.dk/robots.txt
22:20 🔗 Frogging what doesn't show up?
22:21 🔗 Frogging I see snapshots there
22:21 🔗 Frogging such as this one http://web.archive.org/web/20170204114255/http://www.starwarsmesse.dk/
22:21 🔗 VADemon Wayback's robots.txt parser is insanely broken or outdated - whatever you call it.
22:21 🔗 Frogging there's no robots issue here however
22:21 🔗 VADemon just in case. e.g. whitelisting it won't actually "allow" the access
22:23 🔗 kristian_ Frogging, I requested an archiving about an hour ago
22:26 🔗 Frogging stuff from archivebot won't instantly show up in wayback
22:26 🔗 Frogging it takes time. days at least, I think
22:26 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:26 🔗 kristian_ thanks, Frogging ... I'll check in in a few days
22:27 🔗 Frogging archivebot isn't the IA, it just uploads there ultimately
22:27 🔗 Frogging :)
22:36 🔗 SchroSct neat, how deep does it crawl?
22:39 🔗 Frogging infinitely (on the specified domain) unless you tell it not to
22:47 🔗 pizzaiolo has quit IRC (Ping timeout: 506 seconds)
22:53 🔗 BlueMaxim has joined #archiveteam-bs
23:01 🔗 kristian_ much swifter than the waybackmachine interface, though
23:21 🔗 BlueMaxim has quit IRC (Quit: Leaving)
23:24 🔗 Stil3tt0 has quit IRC (Read error: Operation timed out)
23:30 🔗 GE has quit IRC (Remote host closed the connection)
23:34 🔗 kristian_ has quit IRC (Quit: Leaving)
