Time | Nickname | Message
00:13 | | Stil3tt0 is now known as Stiletto
00:14 | | GE has quit IRC (Remote host closed the connection)
01:00 | | __sagitai has joined #archiveteam-bs
01:05 | | _sagitair has quit IRC (Ping timeout: 370 seconds)
01:15 | | db420 is now known as dboard
01:16 | | icedice2 has joined #archiveteam-bs
01:17 | | icedice has quit IRC (Read error: Connection reset by peer)
01:20 | | rocode_ has joined #archiveteam-bs
01:27 | | rocode has quit IRC (Ping timeout: 246 seconds)
01:27 | | rocode_ is now known as rocode
01:30 | | kristian_ has quit IRC (Quit: Leaving)
02:29 | | ndiddy has quit IRC (Read error: Connection reset by peer)
02:41 | | _sagitair has joined #archiveteam-bs
02:47 | | __sagitai has quit IRC (Ping timeout: 370 seconds)
02:55 | | SchroSct has joined #archiveteam-bs
02:55 | SchroSct | I made it!
02:58 | | schbirid2 has joined #archiveteam-bs
03:03 | | schbirid has quit IRC (Read error: Operation timed out)
03:29 | | odemg has joined #archiveteam-bs
03:37 | | pizzaiolo has left
04:43 | | NONSS has joined #archiveteam-bs
04:48 | | Nons has quit IRC (Read error: Operation timed out)
05:08 | | VADemon has quit IRC (Quit: left4dead)
05:18 | | icedice2 has quit IRC (Quit: Leaving)
05:28 | | Sk1d has quit IRC (Ping timeout: 194 seconds)
05:34 | | Sk1d has joined #archiveteam-bs
05:42 | | User405 has joined #archiveteam-bs
05:43 | | User404 has quit IRC (Read error: Connection reset by peer)
06:22 | | GE has joined #archiveteam-bs
06:33 | | unkn0wn_ has quit IRC ()
07:05 | | Aranje has quit IRC (Quit: Three sheets to the wind)
07:21 | | GE has quit IRC (Remote host closed the connection)
07:32 | | odemg has quit IRC (Remote host closed the connection)
07:33 | | odemg has joined #archiveteam-bs
07:48 | joepie91 | the gory details of why gitlab failed: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
07:48 | joepie91 | (very good write-up)
08:35 | | paparus has joined #archiveteam-bs
08:36 | Sanqui | okay so, i'd say we're interested
08:36 | Sanqui | (though interest is pretty much defined by the number of people willing to help)
08:37 | Sanqui | search is of course an achilles heel
08:37 | paparus | I think the problem here is that specialized enumeration is needed
08:37 | paparus | for each site
08:37 | Sanqui | but: is all the important data available on GET endpoints, like how you linked http://courtindex.sdcourt.ca.gov/CISPublic/casedetail?casenum=SCA153865&casesite=SD&applcode=C
08:37 | Sanqui | ?
08:37 | paparus | no
08:38 | paparus | it depends on the specific site
08:38 | Sanqui | ah
08:38 | | namespace has joined #archiveteam-bs
08:38 | paparus | that's just an example
08:38 | paparus | what would the result be on archive.org?
08:39 | Sanqui | we have archivebot, which allows for websites to be archived and absorbed into the wayback machine
08:39 | paparus | but there is no link structure leading to this specific page
08:39 | Sanqui | so my idea was that the searches could be scraped locally in order to gather the URLs, then those would be put into archivebot
08:40 | Sanqui | so wayback wouldn't have the search but would have the detail pages
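The enumerate-then-ArchiveBot approach Sanqui describes could be sketched as a small script. Only the URL pattern comes from the log; the case-number range and the marker used to detect a real case page are invented for illustration, and any real run would need per-site tuning and politeness checks:

    # Minimal sketch, not a production scraper: walk a hypothetical range of
    # case numbers against the GET endpoint quoted above and keep the URLs
    # that look like real case-detail pages, so the list can later be fed to
    # ArchiveBot. The numeric range and the page marker are assumptions.
    import time
    import urllib.request

    BASE = ("http://courtindex.sdcourt.ca.gov/CISPublic/casedetail"
            "?casenum=SCA{num}&casesite=SD&applcode=C")

    def looks_like_case(html):
        # Hypothetical marker; inspect a known-good page to pick a string
        # that only appears on valid case-detail results.
        return "Case Number" in html

    with open("urls-for-archivebot.txt", "w") as out:
        for num in range(153860, 153870):   # illustrative range only
            url = BASE.format(num=num)
            try:
                page = urllib.request.urlopen(url, timeout=30)
                html = page.read().decode("utf-8", "replace")
            except OSError:
                continue
            if looks_like_case(html):
                out.write(url + "\n")
            time.sleep(1)                   # be gentle with the server

The resulting file is just a list of detail-page URLs; those could then be requested in #archivebot so the pages end up in the Wayback Machine even though no search or link structure points at them.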
08:40 | paparus | is the data in the wayback machine full text searchable even if there is no link structure?
08:40 | Sanqui | i believe there are plans for that
08:40 | Sanqui | and either way, the entire collection could be downloaded
08:41 | paparus | also some results will not even have a unique url, it would be a result of some cgi script
08:41 | paparus | on another site
08:41 | Sanqui | yeah, that's a bigger issue
08:42 | Sanqui | realistically, the best thing you could do right now is to start a wiki article with a list of different sites, url structures, and requirements
08:42 | paparus | ok, let me think this over
08:43 | paparus | would the archive.org have problems with this type of information?
08:43 | paparus | I mean it has some personal names and stuff, but it's all public
08:43 | Sanqui | i (and the majority of archive team) don't speak for archive.org
08:43 | paparus | ok, but in your opinion?
08:44 | paparus | like I've done some research and there were cases where people got in trouble for similar stuff
08:44 | paparus | for instance this is a similar case: https://www.reddit.com/r/Denmark/comments/42w67s/i_am_the_person_who_made_tingbogenstatistikorg/
08:45 | Sanqui | it's government websites, i think currently those are not just accepted but welcomed. personally, i don't have enough of a conscience to know what sort of data is present and what dangers it could pose to people
08:45 | paparus | a guy crawled the danish property registry, and published a site online
08:45 | paparus | with the data
08:45 | paparus | but the danish apparently have a thing called address protection where you register to have your address not showing in the registry for some time
08:46 | paparus | but when he crawled it it was still showing
08:46 | paparus | so it cause a piss storm in denmark and he had to bring the site down
08:46 | Sanqui | i see
08:47 | namespace | Well.
08:47 | Sanqui | yeah, well, i remember news stories saying you can be sued just for accessing a website that wasn't "supposed" to be public, so
08:47 | namespace | In the US, protected information basically isn't a thing AFAIK.
08:47 | namespace | The access thing can be an issue depending on interpretation of the CFAA.
08:47 | namespace | But meh.
08:47 | namespace | ArchiveTeam deals with that all the time.
08:48 | namespace | As does IA. Whether IA wants to host the info just on decency grounds is a different story though.
08:49 | Sanqui | i think you could page somebody from IA with an exact description of what's to be uploaded.
08:50 | paparus | do you have a contact?
08:50 | Sanqui | SketchCow
08:51 | paparus | ok, I'll try
08:56 | joepie91 | paparus: Sanqui: there are plans for full-text search, but archive as if there aren't
08:56 | joepie91 | it's quite likely to take quite some time before it appears
08:56 | joepie91 | I'd imagine that stuff like the Canada backup is higher-priority right now
08:56 | joepie91 | and full-text search on a dataset of this magnitude is *really expensive*
08:56 | joepie91 | (ie. it's likely a question of resources, not of tech)
08:57 | Sanqui | fair
08:58 | Sanqui | search or not, 1. it'd be in wayback, 2. warcs would be up for download; somebody could make their own site with fulltext search if desired
09:07 | paparus | I am reading the comments on the danish guy's website and apparently was faster and better than the gov site it crawled
09:08 | paparus | the gov site only had search by address but he added a full text search including by name
09:08 | paparus | that's government for you
09:14 | | paparus has left
09:15 | | paparus has joined #archiveteam-bs
09:17 | paparus | was archive.org ever sued for violation website TOS?
10:22 | | GE has joined #archiveteam-bs
10:50 | | __sagitai has joined #archiveteam-bs
11:02 | | _sagitair has quit IRC (Read error: Operation timed out)
11:26 | | GE has quit IRC (Remote host closed the connection)
11:50 | | odemg has quit IRC (Remote host closed the connection)
12:04 | | icedice has joined #archiveteam-bs
12:06 | | odemg has joined #archiveteam-bs
12:32 | | BlueMaxim has quit IRC (Read error: Operation timed out)
12:38 | | pizzaiolo has joined #archiveteam-bs
12:41 | SchroSct | archive.org isn't a user, how could they?
12:47 | SchroSct | it should be noted that I was on an intercept path with Nyany until we ran out of work.
13:05 | | GE has joined #archiveteam-bs
13:09 | godane | so i have about 215 more episodes to go with Tech News Today
13:09 | godane | i feel alot better now with that collection
13:13 | | yan has quit IRC (Quit: leaving)
13:39 | | BiggieJon has quit IRC (Quit: Page closed)
13:44 | godane | i'm uploading the nhk world radio japan english news
13:44 | godane | for 2017-01
15:06 | | VADemon has joined #archiveteam-bs
15:10 | | icedice has quit IRC (Quit: Leaving)
15:17 | | SmileyG has quit IRC (Ping timeout: 250 seconds)
15:19 | | VADemon has quit IRC (Quit: left4dead)
15:50 | | Aranje has joined #archiveteam-bs
15:50 | | odemg has quit IRC (Remote host closed the connection)
16:09 | | VADemon has joined #archiveteam-bs
16:09 | | odemg has joined #archiveteam-bs
16:15 | | odemg has quit IRC (Remote host closed the connection)
16:22 | | odemg has joined #archiveteam-bs
16:35 | | odemg has quit IRC (Remote host closed the connection)
16:36 | | odemg has joined #archiveteam-bs
16:44 | | icedice has joined #archiveteam-bs
16:47 | | pizzaiolo has quit IRC (Read error: Connection reset by peer)
16:48 | | pizzaiolo has joined #archiveteam-bs
16:48 | | pizzaiol1 has joined #archiveteam-bs
16:49 | | pizzaiolo has quit IRC (Remote host closed the connection)
16:49 | | pizzaiol1 has quit IRC (Remote host closed the connection)
16:49 | | pizzaiolo has joined #archiveteam-bs
17:00 | | odemg has quit IRC (Remote host closed the connection)
17:05 | | odemg has joined #archiveteam-bs
17:35 | | icedice2 has joined #archiveteam-bs
17:38 | | icedice has quit IRC (Ping timeout: 260 seconds)
17:39 | | ItsYoda has quit IRC (Ping timeout: 260 seconds)
17:44 | | ItsYoda has joined #archiveteam-bs
17:58 | | odemg has quit IRC (Remote host closed the connection)
18:14 | | Smiley has joined #archiveteam-bs
18:31 | | ItsYoda has quit IRC (Ping timeout: 260 seconds)
18:32 | | GE has quit IRC (Remote host closed the connection)
18:38 | | ItsYoda has joined #archiveteam-bs
18:41 | | GE has joined #archiveteam-bs
18:43 | | odemg has joined #archiveteam-bs
19:01 | odemg | 178.62.61.231/ytglitch.mp4
19:05 | Aoede | https://www.youtube.com/watch?v=9E6dWfVwFCI
19:10 | | Muad-Dib has quit IRC (Ping timeout: 260 seconds)
19:22 | | ItsYoda has quit IRC (Ping timeout: 260 seconds)
19:25 | | ItsYoda has joined #archiveteam-bs
19:33 | | Muad-Dib has joined #archiveteam-bs
20:08 | | Stiletto has quit IRC (Ping timeout: 250 seconds)
20:09 | | odemg has quit IRC (Remote host closed the connection)
20:42 | | odemg has joined #archiveteam-bs
20:49 | | bsmith093 has quit IRC (Remote host closed the connection)
20:50 | SchroSct | is there a team to get pewdiepie to negative?
20:50 | | odemg has quit IRC (Remote host closed the connection)
20:52 | | bsmith093 has joined #archiveteam-bs
21:03 | | kristian_ has joined #archiveteam-bs
21:04 | | ndiddy has joined #archiveteam-bs
21:13 | kristian_ | hi all
21:14 | kristian_ | can I do something so that a website is archived in full regularly?
21:14 | xmc | not with our existing tools, but you're welcome to make new tools
21:15 | xmc | how big of a site, what is it, how often?
21:15 | kristian_ | xmc, I can barely code ;)
21:15 | kristian_ | http://starwarsmesse.dk/
21:15 | kristian_ | I'm thinking ... once every 60 days or so
21:16 | rocode | kristian_, I do something similar with several websites, where I will archive them every 30 days. You can use grab-site and a cron job.
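As a rough sketch of what rocode describes (assuming grab-site is installed and on the PATH that cron uses, and using a made-up directory /home/kristian/crawls), a single crontab line is enough to re-crawl a small site on a fixed schedule:

    # crontab entry: run at 03:00 on the 1st of every second month (~every 60 days)
    # grab-site writes its crawl output under the current working directory
    0 3 1 */2 * cd /home/kristian/crawls && grab-site 'http://starwarsmesse.dk/'

grab-site only does the crawling; the resulting WARCs would still need to be uploaded somewhere (archive.org, for example) by hand or by another script.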
21:16 | kristian_ | ah, rocode ... I see
21:18 | Sanqui | kristian_: the website is tiny. you can could by #archivebot every 60 days yourself and ask for it to be archived :p
21:18 | Sanqui | err
21:18 | Sanqui | you could stop by #archivebot*
21:18 | kristian_ | thanks, Sanqui ... will look into that
21:19 | kristian_ | it's quite small, yes ... and the genius webmaster (me) tried to make it future proof ;)
21:30 | | dashcloud has quit IRC (Read error: Operation timed out)
21:35 | | odemg has joined #archiveteam-bs
21:36 | | Stil3tt0 has joined #archiveteam-bs
21:46 | | pizzaiolo has quit IRC (Read error: Connection reset by peer)
21:48 | | pizzaiolo has joined #archiveteam-bs
21:52 | | dashcloud has joined #archiveteam-bs
22:01 | | icedice2 has quit IRC (Quit: Leaving)
22:13 | dashcloud | kristian_: make sure if you are using a robots.txt file it doesn't block the Internet Archive crawler (ia_archiver I believe)
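For reference, explicitly allowing that crawler is a two-line stanza in robots.txt (standard robots.txt syntax; an empty Disallow means nothing is blocked for that user-agent):

    User-agent: ia_archiver
    Disallow:

Having no robots.txt at all, as turns out to be the case for starwarsmesse.dk below, also blocks nothing.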
22:16 | kristian_ | hurm ... the archiving does not show up here: http://web.archive.org/web/*/starwarsmesse.dk
22:17 | kristian_ | I can't see a robots.txt: http://starwarsmesse.dk/robots.txt
22:20 | Frogging | what doesn't show up?
22:21 | Frogging | I see snapshots there
22:21 | Frogging | such ast this one http://web.archive.org/web/20170204114255/http://www.starwarsmesse.dk/
22:21 | VADemon | Wayback's robots.txt parser is insanely broken or outdated - whatever you call it.
22:21 | Frogging | there's no robots issue here however
22:21 | VADemon | just in case. e.g. whitelisting it won't actually "allow" the access
22:23 | kristian_ | Frogging, I requested an archiving about an hour ago
22:26 | Frogging | stuff from archivebot won't instantly show up in wayback
22:26 | Frogging | it takes time. days at least, I think
22:26 | | dashcloud has quit IRC (Read error: Operation timed out)
22:26 | kristian_ | thanks, Frogging ... I'll check in in a few days
22:27 | Frogging | archivebot isn't the IA, it just uploads there ultimately
22:27 | Frogging | :)
22:36 | SchroSct | neat, how deep does it crawl?
22:39 | Frogging | infinitely (on the specified domain) unless you tell it not to
22:47 | | pizzaiolo has quit IRC (Ping timeout: 506 seconds)
22:53 | | BlueMaxim has joined #archiveteam-bs
23:01 | kristian_ | much swifter than the waybackmachine interface, though
23:21 | | BlueMaxim has quit IRC (Quit: Leaving)
23:24 | | Stil3tt0 has quit IRC (Read error: Operation timed out)
23:30 | | GE has quit IRC (Remote host closed the connection)
23:34 | | kristian_ has quit IRC (Quit: Leaving)